CachedTexts

CachedTexts is a powerful and compact cross-platform library for parsing and generating text data with the maximum possible performance. The library is characterized by the following:

The code is stored in the module "CachedTexts.pas" and depends on two other libraries: CachedBuffers and UniConv.
The library declares the classes TUniConvReReader and TUniConvReWriter, which convert text sequentially from one encoding to another "on the fly".
There are 3 basic encodings for parsing and generating text: byte encodings, UTF-16 and UTF-32. Each has its advantages and disadvantages. UTF-16 is the most common encoding; in Delphi it corresponds to the types string/UnicodeString and WideString. However, it is not the fastest, and it requires additional logic to handle surrogate characters. UTF-32 is the most universal but also the slowest encoding. A byte encoding is UTF-8 or any of the supported SBCS (ANSI) encodings. This kind of interface is the fastest and is universal for ASCII characters, but identifying the encoding actually in use can be difficult.
Parsing uses 3 string types of the library's own: ByteString, UTF16String and UTF32String. Their distinguishing feature is that they do not allocate heap memory to hold data but refer to the text data (TCachedBuffer) directly, which significantly increases performance. All CachedString types share the same interface of properties, functions and overloaded operators, which covers a wide range of tasks and works alongside the system string types. Furthermore, they are faster than their system counterparts. The TTemporaryString type converts one CachedString type to another.
For parsing and generation of texts there are standard classes: CachedTextReaders (TByteTextReader, TUTF16TextReader, TUTF32TextReader) and CachedTextWriters (TByteTextWriter, TUTF16TextWriter, TUTF32TextWriter).
(Not yet implemented) There are standard classes for popular markup languages: XML, HTML and JSON. Low-level Simple API interfaces (like "MSXMLSAX2") use byte encodings. The Document Object Model (DOM) uses UnicodeString.
Although CachedString types compare with string constants quite quickly, identifying strings (for example, during serialization) is quite resource-intensive. Many people use solutions based on binary trees or hash tables; the CachedTexts library instead contains the CachedSerializer utility, which achieves maximum performance by means of code generation.

Demo.zip

CachedTextReaders: TByteTextReader/TUTF16TextReader/TUTF32TextReader

There are several classes for sequential reading of text data: TByteTextReader, TUTF16TextReader and TUTF32TextReader. Choose the class whose encoding is most convenient for your parsing. If the encoding of the source text differs, the conversion is performed automatically, but it may significantly slow down the application. Most text files are in a byte encoding, so TByteTextReader is recommended for performance-critical code, because the text then needs no automatic conversion and ByteString is the fastest string type.

Every TCachedTextReader class has two main constructors: Create and CreateFromFile. In both cases the source encoding is determined by the BOM. If the BOM is absent, the DefaultByteEncoding parameter is used instead. This parameter may be CODEPAGE_UTF8 or one of the SBCS encodings. The CreateDirect constructor can be used when the source encoding and the conversion context are specified directly and the BOM is not taken into account.

The TCachedTextReader class has much in common with the TCachedReader class. Both provide low-level access through the properties Current, Overflow and Margin and the function Flush. There are also high-level functions: ReadData, Skip, ReadChar, Export and two kinds of Readln. Using the Readln function is strongly recommended for text data consisting of many lines.

type
  TByteTextReader/TUTF16TextReader/TUTF32TextReader = class(TCachedTextReader)
  public
    constructor Create({AEncoding for TByteTextReader,} const ASource: TCachedReader; const ADefaultByteEncoding: Word = 0; const AOwner: Boolean = False);
    constructor CreateFromFile({AEncoding for TByteTextReader,} const AFileName: string; const ADefaultByteEncoding: Word = 0);
    constructor CreateDirect(const AContext: PUniConvContext; const ASource: TCachedReader; const AOwner: Boolean = False);
    {for TByteTextReader:} constructor CreateDefault(const ASource: TCachedReader; const ADefaultByteEncoding: Word = 0; const AOwner: Boolean = False);
    {for TByteTextReader:} constructor CreateDefaultFromFile(const AFileName: string; const ADefaultByteEncoding: Word = 0);
  
    procedure ReadData(var ABuffer; const ACount: NativeUInt); 
    procedure Skip(const ACount: NativeUInt);
    function Flush: NativeUInt;
    function Readln(var S: ByteString/UTF16String/UTF32String): Boolean;
    function Readln(var S: UnicodeString): Boolean;
    function ReadChar: UCS4Char;
    procedure Export(const AWriter: TCachedWriter);
    procedure Export(const AFileName: string);

    property Current: PByte read/write
    property Overflow: PByte read
    property Margin: NativeInt read
    property Finishing: Boolean read
    property EOF: Boolean read/write
    property Converter: TUniConvReReader read
    property Source: TCachedReader read
    property Owner: Boolean read/write
    property FileName: string read
  end;
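
As a minimal sketch of the constructors and Readln described above, the following program counts the lines of a file. It uses only members listed in the declaration above; the file name 'input.txt' is a placeholder, and the uses clause assumes that the units are named after the three libraries and that CODEPAGE_UTF8 is declared in one of them.

```pascal
program CountLines;
{$APPTYPE CONSOLE}
uses
  CachedBuffers, UniConv, CachedTexts; // assumed unit names

var
  Reader: TUTF16TextReader;
  Line: CachedTexts.UTF16String;
  Count: Integer;
begin
  // 'input.txt' is a placeholder file name.
  // The encoding is taken from the BOM; CODEPAGE_UTF8 is the
  // fallback that applies only when the file has no BOM.
  Reader := TUTF16TextReader.CreateFromFile('input.txt', CODEPAGE_UTF8);
  try
    Count := 0;
    // Readln returns False once no more lines are available.
    while Reader.Readln(Line) do
      Inc(Count);
    Writeln('Lines: ', Count);
  finally
    Reader.Free;
  end;
end.
```

Line is a CachedString here, so each iteration only updates a pointer and a length; no system string is allocated per line.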

Text parsing example:

const
  URL_ID = '<a href=';
  URL_ID_LENGTH = Length(URL_ID);

// standard (slow) way to parse text file
procedure ParseStd(const FileName: string);
var
  Text: TextFile;
  S, URL: string;
  P: Integer;
begin
  AssignFile(Text, FileName);
  Reset(Text);
  while (not EOF(Text)) do
  begin
    Readln(Text, S);

    // find '<a href='
    P := Pos(URL_ID, S);
    if (P = 0) then Continue;

    // parse URL
    Delete(S, 1, P + URL_ID_LENGTH - 1);
    P := Pos('"', S);
    if (P = 0) then Continue;
    Delete(S, 1, P);
    P := Pos('"', S);
    if (P <> 0) then
    begin
      URL := Copy(S, 1, P - 1);
      Writeln(URL); // display URL to the console
    end;
  end;
  CloseFile(Text);
end;

// cached (extremely fast) way to parse text file
procedure Parse(const FileName: string);
var
  Text: TUTF16TextReader;
  S, URL: CachedTexts.UTF16String;
  P: Integer;
begin
  Text := TUTF16TextReader.CreateFromFile(FileName);
  try
    while (Text.Readln(S)) do
    begin
      // find '<a href='
      P := S.Pos(URL_ID);
      if (P < 0) then Continue;

      // parse URL
      S.Skip(P);
      P := S.CharPos('"');
      if (P < 0) then Continue;
      S.Skip(P + 1);
      P := S.CharPos('"');
      if (P >= 0) then
      begin
        URL := S.SubString(0, P);
        Writeln({Implicit operator}string(URL)); // display URL to the console
      end;
    end;
  finally
    Text.Free;
  end;
end;

CachedStrings: ByteString/UTF16String/UTF32String

A CachedString is a simple structure that contains a pointer to the characters, the string length and a set of flags. These strings do not allocate heap memory to hold data but refer to the text data (CachedBuffer) directly, which significantly increases performance.

type
  ByteString/UTF16String/UTF32String = record
  public
    property Chars: PChar read/write
    property Length: NativeUInt read/write
    property Ascii: Boolean read/write
    property References: Boolean read/write (*useful for &amp;-like character references*) 
    property Tag: Byte read/write
    property Empty: Boolean read/write
    
    procedure Assign(AChars: PChar; ALength: NativeUInt);
    procedure Assign(const S: string);
    procedure Delete(const From, Count: NativeUInt);
    
    function DetermineAscii: Boolean;
    function TrimLeft: Boolean;
    function TrimRight: Boolean;
    function Trim: Boolean;
    function SubString(const From, Count: NativeUInt): CachedString;
    function SubString(const Count: NativeUInt): CachedString;
    function Skip(const Count: NativeUInt): Boolean;
    function Hash: Cardinal;
    function HashIgnoreCase: Cardinal;
    
    function CharPos(const C: Char; const From: NativeUInt = 0): NativeInt;
    function CharPosIgnoreCase(const C: Char; const From: NativeUInt = 0): NativeInt;
    function Pos(const S: CachedString; const From: NativeUInt = 0): NativeInt;
    function Pos(const AChars: PChar; const ALength: NativeUInt; const From: NativeUInt = 0): NativeInt;
    function Pos(const S: string; const From: NativeUInt = 0): NativeInt;
    function PosIgnoreCase(const S: CachedString; const From: NativeUInt = 0): NativeInt;
    function PosIgnoreCase(const AChars: PChar; const ALength: NativeUInt; const From: NativeUInt = 0): NativeInt;
    function PosIgnoreCase(const S: string; const From: NativeUInt = 0): NativeInt;
  public
    function ToBoolean: Boolean;
    function ToBooleanDef(const Default: Boolean): Boolean;
    function TryToBoolean(out Value: Boolean): Boolean;
    function ToHex: Integer;
    function ToHexDef(const Default: Integer): Integer;
    function TryToHex(out Value: Integer): Boolean;
    function ToInteger: Integer;
    function ToIntegerDef(const Default: Integer): Integer;
    function TryToInteger(out Value: Integer): Boolean;
    function ToCardinal: Cardinal;
    function ToCardinalDef(const Default: Cardinal): Cardinal;
    function TryToCardinal(out Value: Cardinal): Boolean;
    function ToHex64: Int64;
    function ToHex64Def(const Default: Int64): Int64;
    function TryToHex64(out Value: Int64): Boolean;
    function ToInt64: Int64;
    function ToInt64Def(const Default: Int64): Int64;
    function TryToInt64(out Value: Int64): Boolean;
    function ToUInt64: UInt64;
    function ToUInt64Def(const Default: UInt64): UInt64;
    function TryToUInt64(out Value: UInt64): Boolean;
    function ToFloat: Extended;
    function ToFloatDef(const Default: Extended): Extended;
    function TryToFloat(out Value: Single): Boolean;
    function TryToFloat(out Value: Double): Boolean;
    function TryToFloat(out Value: TExtended80Rec): Boolean;
    function ToDate: TDateTime;
    function ToDateDef(const Default: TDateTime): TDateTime;
    function TryToDate(out Value: TDateTime): Boolean;
    function ToTime: TDateTime;
    function ToTimeDef(const Default: TDateTime): TDateTime;
    function TryToTime(out Value: TDateTime): Boolean;
    function ToDateTime: TDateTime;
    function ToDateTimeDef(const Default: TDateTime): TDateTime;
    function TryToDateTime(out Value: TDateTime): Boolean; 
  public
    procedure ToAnsiString/ToLowerAnsiString/ToUpperAnsiString(var S: AnsiString; const CodePage: Word = 0);
    procedure ToAnsiShortString/ToLowerAnsiShortString/ToUpperAnsiShortString(var S: ShortString; const CodePage: Word = 0);
    procedure ToUTF8String/ToLowerUTF8String/ToUpperUTF8String(var S: UTF8String);
    procedure ToUTF8ShortString/ToLowerUTF8ShortString/ToUpperUTF8ShortString(var S: ShortString);
    procedure ToWideString/ToLowerWideString/ToUpperWideString(var S: WideString);
    procedure ToUnicodeString/ToLowerUnicodeString/ToUpperUnicodeString(var S: UnicodeString);
    procedure ToString/ToLowerString/ToUpperString(var S: string); 

    function ToAnsiString/ToLowerAnsiString/ToUpperAnsiString: AnsiString;
    function ToUTF8String/ToLowerUTF8String/ToUpperUTF8String: UTF8String;
    function ToWideString/ToLowerWideString/ToUpperWideString: WideString;
    function ToUnicodeString/ToLowerUnicodeString/ToUpperUnicodeString: UnicodeString;
    function ToString/ToLowerString/ToUpperString: string;
    class operator Implicit(const a: CachedString): string;  
  public
    function Equal/EqualIgnoreCase(const S: CachedString/AnsiString/UTF8String/WideString/UnicodeString): Boolean; 
    function Compare/CompareIgnoreCase(const S: CachedString/AnsiString/UTF8String/WideString/UnicodeString): NativeInt; 
    function Equal/EqualIgnoreCase(const AChars: PAnsiChar/PUTF8Char; const ALength: NativeUInt; const CodePage: Word): Boolean;
    function Compare/CompareIgnoreCase(const AChars: PAnsiChar/PUTF8Char; const ALength: NativeUInt; const CodePage: Word): NativeInt;
    function Equal/EqualIgnoreCase(const AChars: PUnicodeChar; const ALength: NativeUInt; const CodePage: Word): Boolean;
    function Compare/CompareIgnoreCase(const AChars: PUnicodeChar; const ALength: NativeUInt; const CodePage: Word): NativeInt;

    class operator Equal/NotEqual/GreaterThan/GreaterThanOrEqual/LessThan/LessThanOrEqual(const S: CachedString/AnsiString/UTF8String/WideString/UnicodeString): Boolean;  
  end;

Supported date formats: YYYYMMDD, YYYY-MM-DD, -YYYY-MM-DD, DD.MM.YYYY, DD-MM-YYYY, DD/MM/YYYY, DD.MM.YY, DD-MM-YY, DD/MM/YY, YYYY (YYYY-01-01), YYYY-MM (YYYY-MM-01), --MM-DD (2000-MM-DD), --MM-- (2000-MM-01), ---DD (2000-01-DD).

Supported time formats: hh:mm:ss.zzzzzz, hh-mm-ss.zzzzzz, hh:mm:ss.zzz, hh-mm-ss.zzz, hh:mm:ss, hh-mm-ss, hhmmss, hh:mm, hh-mm, hhmm.
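
The search and conversion functions declared above can be combined as in the following sketch, which splits a "key=value" pair and converts the value to an integer. The sample text 'width=640' and the variable names are illustrative only, and the uses clause assumes that the units are named after the three libraries.

```pascal
program ParsePair;
{$APPTYPE CONSOLE}
uses
  CachedBuffers, UniConv, CachedTexts; // assumed unit names

var
  Line: string;
  S, Key, Value: CachedTexts.UTF16String;
  P: NativeInt;
  N: Integer;
begin
  Line := 'width=640'; // illustrative sample data

  // S only refers to the characters of Line, so Line must stay
  // alive and unchanged for as long as S, Key and Value are in use.
  S.Assign(Line);

  // CharPos is zero-based and negative when the character is absent.
  P := S.CharPos('=');
  if (P >= 0) then
  begin
    Key := S.SubString(0, P);
    Value := S;
    Value.Skip(P + 1); // drop the key and the '=' character

    // TryToInteger reports failure instead of raising an exception.
    if Value.TryToInteger(N) then
      Writeln(Key.ToUnicodeString, ' -> ', N)
    else
      Writeln('Value is not an integer');
  end;
end.
```

The only heap allocation in this sketch is the call to ToUnicodeString for console output; the split and the conversion work on the referenced characters.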

CachedTextWriters: TByteTextWriter/TUTF16TextWriter/TUTF32TextWriter

There are several classes for sequential writing of text data: TByteTextWriter, TUTF16TextWriter and TUTF32TextWriter. The TCachedTextWriter class has much in common with the TCachedWriter class. Both provide low-level access through the properties Current, Overflow and Margin and the function Flush. The function WriteData writes text data directly. Regardless of the encoding of the TCachedTextWriter class, the text data can be automatically converted to any other encoding: to do this, specify the BOM parameter in the Create/CreateFromFile constructors, or the Context parameter in the CreateDirect constructor.

TCachedTextWriter classes can write not only string types but also other common types: Booleans, ordinals, floats, date-times and Variants. The functions WriteFormat, WriteFormatUTF8 and WriteFormatUnicode are used for text formatting. Note that all these functions work faster than their SysUtils analogues. Float and DateTime output parameters are set in the fields FloatSettings and DateTimeSettings.

type
  TByteTextWriter/TUTF16TextWriter/TUTF32TextWriter = class(TCachedTextWriter)
  public
    constructor Create({AEncoding for TByteTextWriter,} const ATarget: TCachedWriter; const ABOM: TBOM = bomNone; const ADefaultByteEncoding: Word = 0; const AOwner: Boolean = False);
    constructor CreateFromFile({AEncoding for TByteTextWriter,} const AFileName: string; const ABOM: TBOM = bomNone; const ADefaultByteEncoding: Word = 0);
    constructor CreateDirect(const AContext: PUniConvContext; const ATarget: TCachedWriter; const AOwner: Boolean = False);
    procedure WriteData(const ABuffer; const ACount: NativeUInt);
    function Flush: NativeUInt;

    property Current: PByte read/write
    property Overflow: PByte read
    property Margin: NativeInt read
    property EOF: Boolean read/write
    property Converter: TUniConvReWriter read
    property Target: TCachedWriter read
    property Owner: Boolean read/write
    property FileName: string read
  public
    FloatSettings: TFloatSettings;
    DateTimeSettings: TDateTimeSettings;

    procedure WriteCRLF;
    procedure WriteAscii(const AChars: PAnsiChar; const ALength: NativeUInt);
    procedure WriteUnicodeAscii(const AChars: PUnicodeChar; const ALength: NativeUInt);
    procedure WriteUCS4Ascii(const AChars: PUCS4Char; const ALength: NativeUInt);
    procedure WriteAnsiChars(const AChars: PAnsiChar; const ALength: NativeUInt; const ACodePage: Word);
    procedure WriteUTF8Chars(const AChars: PUTF8Char; const ALength: NativeUInt);
    procedure WriteUnicodeChars(const AChars: PUnicodeChar; const ALength: NativeUInt);
    procedure WriteUCS4Chars(const AChars: PUCS4Char; const ALength: NativeUInt);
    
    procedure WriteByteString(const S: ByteString);
    procedure WriteUTF16String(const S: UTF16String);
    procedure WriteUTF32String(const S: UTF32String);
    
    procedure WriteAnsiString(const S: AnsiString);
    procedure WriteShortString(const S: ShortString; const ACodePage: Word = 0);
    procedure WriteUTF8String(const S: UTF8String);
    procedure WriteWideString(const S: WideString);
    procedure WriteUnicodeString(const S: UnicodeString);
    procedure WriteUCS4String(const S: UCS4String; const ANullTerminated: Boolean = True);
    
    procedure WriteFormat(const FmtStr: AnsiString; const Args: array of const);
    procedure WriteFormatUTF8(const FmtStr: UTF8String; const Args: array of const);
    procedure WriteFormatUnicode(const FmtStr: UnicodeString; const Args: array of const);
  public
    procedure WriteBoolean(const AValue: Boolean);
    procedure WriteBooleanOrdinal(const AValue: Boolean);
    procedure WriteInteger(const AValue: Integer; const ADigits: NativeUInt = 0);
    procedure WriteHex(const AValue: Integer; const ADigits: NativeUInt = 0);
    procedure WriteCardinal(const AValue: Cardinal; const ADigits: NativeUInt = 0);
    procedure WriteInt64(const AValue: Int64; const ADigits: NativeUInt = 0);
    procedure WriteHex64(const AValue: Int64; const ADigits: NativeUInt = 0);
    procedure WriteUInt64(const AValue: UInt64; const ADigits: NativeUInt = 0);
    procedure WriteFloat(const AValue: Extended; const ASettings: TFloatSettings);
    procedure WriteFloat(const AValue: Extended);
    procedure WriteDate(const AValue: TDateTime; const ASettings: TDateTimeSettings);
    procedure WriteDate(const AValue: TDateTime);
    procedure WriteTime(const AValue: TDateTime; const ASettings: TDateTimeSettings);
    procedure WriteTime(const AValue: TDateTime);
    procedure WriteDateTime(const AValue: TDateTime; const ASettings: TDateTimeSettings);
    procedure WriteDateTime(const AValue: TDateTime);
    procedure WriteVariant(const AValue: Variant; const AFloatSettings: TFloatSettings; const ADateTimeSettings: TDateTimeSettings);
    procedure WriteVariant(const AValue: Variant);
  end;
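
A minimal sketch of the writer interface, using only methods from the declaration above. The file name 'report.txt' is a placeholder, the uses clause assumes that the units are named after the three libraries, and the encoding of the output file with the default BOM and DefaultByteEncoding arguments is not specified here.

```pascal
program WriteReport;
{$APPTYPE CONSOLE}
uses
  CachedBuffers, UniConv, CachedTexts; // assumed unit names

var
  Writer: TUTF16TextWriter;
  I: Integer;
begin
  // 'report.txt' is a placeholder; BOM and DefaultByteEncoding
  // keep their default values from the declaration.
  Writer := TUTF16TextWriter.CreateFromFile('report.txt');
  try
    Writer.WriteUnicodeString('Squares');
    Writer.WriteCRLF;

    for I := 1 to 3 do
    begin
      // Ordinals are written directly, without an intermediate string.
      Writer.WriteInteger(I);
      Writer.WriteUnicodeString(' * ');
      Writer.WriteInteger(I);
      Writer.WriteUnicodeString(' = ');
      Writer.WriteInteger(I * I);
      Writer.WriteCRLF;
    end;

    // Format-style output with an integer argument.
    Writer.WriteFormatUnicode('Rows written: %d', [3]);
    Writer.WriteCRLF;
  finally
    Writer.Free;
  end;
end.
```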

TTemporaryString

Memory manager operations and reference counting can take almost all of the time during parsing. All the system string types are served by internal functions of the System.pas module, which perform several expensive operations to reallocate memory; this hurts performance in such common operations as initialization, concatenation and finalization. For this reason the CachedTexts library puts its main emphasis on static-memory strings: ByteString, UTF16String and UTF32String. Some tasks, though, are difficult to solve without dynamic memory allocation, e.g. unpacking an XML string that contains character references, or converting a ByteString to a UTF16String. For such tasks there is the TTemporaryString type, based on a dynamic array of bytes and the memory-reserve principle, which means that memory is never or only rarely reallocated. A TTemporaryString can hold only one of three data types at a time: ByteString, UTF16String or UTF32String. The InitByteString, InitUTF16String and InitUTF32String methods can be called at any time. Data filling, conversion and concatenation are performed by the Append methods: the data is converted to the required encoding if necessary and then added to the end of the string.

One of the most important features of TTemporaryString is the ability to "emulate" system string types. In this case a special system header is added to the character data, which allows the Delphi compiler to use the TTemporaryString as a "constant system string". This may be useful if your algorithms or functions use system strings, e.g. Writeln, ExtractFileName or StrToDate(FormatSettings). Be careful, however: the lifetime of an emulated string is limited by the lifetime of the TTemporaryString data. Using an emulated string only as a temporary string constant is highly recommended. To obtain real system strings, use the CachedString.ToString methods, or call UniqueString after assigning to a string variable.

program Produce;
{$APPTYPE CONSOLE}
uses
  CachedBuffers, UniConv, CachedTexts;
var
  Temp: TTemporaryString;
  P: PUnicodeString;
  S: UnicodeString;
begin
  // initialization
  Temp.InitUTF16String;

  // concatenation and automatic conversion
  Temp.Append(UnicodeString('Delphi'));
  Temp.Append(AnsiString(' is the way to build applications for'));
  Temp.Append(UTF8String(' Windows 10, Mac,'), ccUpper);
  Temp.Append(WideString(' Mobile and more.'), ccLower);

  // system string
  Writeln(Temp.CastUTF16String.ToUnicodeString);

  // constant system string emulation
  P := Temp.EmulateUnicodeString;
  Writeln(P^);

  // copying the constant system string
  S := P^;
  UniqueString(S);
  Writeln(S);

  // reinitialization
  Temp.InitByteString(CODEPAGE_DEFAULT);
  Temp.Append(UTF8String('Delphi is the best.'));
  Writeln(Temp.EmulateAnsiString^);

  Readln;
end.

Output:

Delphi is the way to build applications for WINDOWS 10, MAC, mobile and more.
Delphi is the way to build applications for WINDOWS 10, MAC, mobile and more.
Delphi is the way to build applications for WINDOWS 10, MAC, mobile and more.
Delphi is the best.

CachedSerializer

The CachedSerializer utility serves one and only one purpose: identifying string data with maximum performance. You can build the project from the source in the folder "utilities/CachedSerializer" or download the binary with examples. As its first command-line argument the utility takes the path to a text file that contains options and identifiers. The following options are needed for serialization:

-<encoding>. The encoding option can be "-utf16", "-utf8", "-utf32" or a code page number, e.g. "-1250" (the whole list of SBCS encodings is given in the UniConv library description). "-raw" means CODEPAGE_RAWDATA, "-user" means CODEPAGE_USERDEFINED, "-ansi" means CODEPAGE_DEFAULT. "-ansi" is the default encoding. If your ByteString identifiers contain only ASCII characters, the encoding is unnecessary: you can specify "-ansi" or omit the option altogether.
-p"<variable_name>" or -p"<pointer_name>:<length_name>" or -p"<pointer_name>:<length_name>:<code_indent>". Serialization works on 2 parameters: the character pointer and the character length. If your identifier is stored in a CachedString, use <variable_name>; serialization then works on the parameters <Name>.Chars and <Name>.Length. The default value is "S". The default code indent is 0. To serialize null-terminated strings, the <variable_name> parameter should be the number of bytes to which the memory is aligned; 0 means the memory is aligned to the size of the character.
-i. This option makes the serialization case-insensitive.
-f"<Name(-SType)>:<Prefix>" or -f"<Name(-SType)>:<Prefix>:<TypeName>". Option -f generates a function <Name> with a string parameter S. If -SType is not defined, CachedString is used. An ordinal constant (<PREFIX>IDENTIFIERN = N) or enumeration values (<TypeName> = <prefix>Identifier1, <prefix>Identifier2, …) are generated for each identifier.
-fn"<Name(-SType)>:<Prefix>" or -fn"<Name(-SType)>:<Prefix>:<TypeName>". Option -fn means the same as -f, but only the serialization code is generated.
-s"FileName". This option saves the generated code to a file.

Each of these options can also be passed as a command-line argument. In addition, the following options are permitted:

-nolog. Don't display the generated code in the console.
-nocopy. Don't copy the generated code to the clipboard.
-nowait. Don't wait for Enter to be pressed after code generation.

Each line of the text file can be written in one of several formats:

<identifier>
<identifier>::<implementation>
<identifier>:<marker>:<implementation>
<identifier>:<marker>

Besides <identifier>, an important part is <implementation>: the code that is executed in the <identifier> case. If option -f or -fn is specified, <implementation> is generated automatically. If the same <implementation> must be used for several <identifier> entries, define a <marker>, which is some string constant. The following escape sequences are permitted when writing <identifier>, <marker> or <implementation>: "\:", "\\", "\n", "\r", "\t" (tab), "\s" (space).

Example serialization file ("examples/simple1.txt"):

-ansi -f"ValueToID-AnsiString:ID_" -p"S:Length(S)"
sheet
row
cell
data
value
style