Home

Awesome

BaseNcoding

There are well-known algorithms for binary data to string encoding exist, such as algorithms with alphabet length with radix of power 2 (base32, base64) and not of power 2 (base85 and base91).

This library contains an implementation of algorithm for general case, that is a custom-length alphabet can be used. Here is online BaseNcoding DEMO.

The idea of the developed algorithm is based on base85 encoding, except block size is not constant, but it's calculation depends on an alphabet length.

Steps of algorithm

  1. Calculation of block size in bits and chars.
  2. Conversation of an input string to the byte array (using UTF8 Encoding).
  3. Splitting byte array on n-bit groups.
  4. Conversation of every group to radix n.
  5. Tail bits processing.

Mathematical justification

For optimal block size calculation the following considerations has been used:

System for an optimal block size calculation

In this system:

See implementation in GetOptimalBitsCount method.

Tail bits processing

The most difficult problem in this project which nevertheless has been solved with only integer calculations.

Calculation of total and tail number of bits and chars during encoding

int mainBitsLength = (data.Length * 8 / BlockBitsCount) * BlockBitsCount;
int tailBitsLength = data.Length * 8 - mainBitsLength;
int totalBitsLength = mainBitsLength + tailBitsLength;
int mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
int tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
int totalCharsCount = mainCharsCount + tailCharsCount;
int iterationCount = mainCharsCount / BlockCharsCount;

Calculation of total and tail number of bits and chars during decoding

int totalBitsLength = ((data.Length - 1) * BlockBitsCount / BlockCharsCount + 8) / 8 * 8;
int mainBitsLength = totalBitsLength / BlockBitsCount * BlockBitsCount;
int tailBitsLength = totalBitsLength - mainBitsLength;
int mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
int tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
BigInteger tailBits = CharsToBits(data, mainCharsCount, tailCharsCount);
if (tailBits >> tailBitsLength != 0)
{
	totalBitsLength += 8;
	mainBitsLength = totalBitsLength / BlockBitsCount * BlockBitsCount;
	tailBitsLength = totalBitsLength - mainBitsLength;
	mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
	tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
}
int iterationCount = mainCharsCount / BlockCharsCount;

Experiment

Diagram of optimal block size and alphabet length dependence has been calculated with help of system above for mbc = 64:

<details> <summary>Diagram values</summary>
ankr_bitsr_values
21111
319121.58331.0136
42121
558252.321.034
631122.58331.0136
71452.81.0258
83131
91963.16671.0136
1063193.31581.0842
1138113.45451.038
1243123.58331.0136
1337103.71.0031
141953.81.0258
1539103.91.0489
164141
1749124.08331.0349
182564.16671.0136
1955134.23081.1672
2056134.30771.1369
2157134.38461.0719
2249114.45451.038
23924.51.0332
2455124.58331.0136
2551114.63641.0588
2647104.71.0031
271944.751.0136
282454.81.0258
293474.85711.0041
3049104.91.0489
3164134.92311.3237
325151
335151.0313
3461125.08331.0349
354185.1251.024
363165.16671.0136
372655.21.0333
384795.22221.1739
3958115.27271.1015
4053105.31.1642
411635.33331.0517
424385.3751.1008
432755.41.0953
4460115.45451.038
4560115.45451.329
461125.51.0332
4761115.54551.0721
483975.57141.0679
492855.61.0523
5062115.63641.0588
511735.66671.012
5257105.71.0031
5363115.72731.005
542345.751.0136
555295.77781.0226
562955.81.0258
5764115.81821.1187
584175.85711.0041
594785.8751.0433
6059105.91.0489
6159105.91.2375
6259105.91.456
6359105.91.7086
646161
656161.0156
666161.0313
676161.0469
686161.0625
6961106.11.0609
704986.1251.024
714376.14291.034
723766.16671.0136
733766.16671.1011
743156.21.0333
755696.22221.042
765696.22221.1739
772546.251.0476
782546.251.1031
7963106.31.0266
8063106.31.1642
811936.33331.0136
821936.33331.0517
835186.3751.0002
845186.3751.1008
853256.41.0331
863256.41.0953
874576.42861.0722
885896.44441.098
895896.44441.2155
905896.44441.3441
911326.51.0109
921326.51.0332
931326.51.0558
941326.51.0786
</details>

One can see that known base64, base85, base91 encodings have been developed in good points (minimal block size with good compression ratio).

Big numbers (C# BitInteger and JavaScript BigInt) are used for some calculations.

The algorithm also has a parallel version.

License

BaseN algorithm is licensed under the Apache 2.0 License.

More detail explanation available on Russian.