Awesome

BaseNcoding

There are well-known algorithms for binary data to string encoding exist, such as algorithms with alphabet length with radix of power 2 (base32, base64) and not of power 2 (base85 and base91).

This library contains an implementation of algorithm for general case, that is a custom-length alphabet can be used. Here is online BaseNcoding DEMO.

The idea of the developed algorithm is based on base85 encoding, except block size is not constant, but it's calculation depends on an alphabet length.

Steps of algorithm

Calculation of block size in bits and chars.
Conversation of an input string to the byte array (using UTF8 Encoding).
Splitting byte array on n-bit groups.
Conversation of every group to radix n.
Tail bits processing.

Mathematical justification

For optimal block size calculation the following considerations has been used:

System for an optimal block size calculation

In this system:

a — Length of alphabet A.
k — Count of encoding chars.
b — Digit radix (2 in most cases).
n — Number of bits in radix b for representing k chars of alphabet A.
r_values — Compression ratio (less is better).
r_bits — Bits per char count (greater is better).
mbc — Block max bits count.
⌊x⌋ — The largest integer not greater than x (floor).
⌈x⌉ — The smallest integer not less than x (ceiling).

See implementation in GetOptimalBitsCount method.

Tail bits processing

The most difficult problem in this project which nevertheless has been solved with only integer calculations.

Calculation of total and tail number of bits and chars during encoding

int mainBitsLength = (data.Length * 8 / BlockBitsCount) * BlockBitsCount;
int tailBitsLength = data.Length * 8 - mainBitsLength;
int totalBitsLength = mainBitsLength + tailBitsLength;
int mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
int tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
int totalCharsCount = mainCharsCount + tailCharsCount;
int iterationCount = mainCharsCount / BlockCharsCount;

Calculation of total and tail number of bits and chars during decoding

int totalBitsLength = ((data.Length - 1) * BlockBitsCount / BlockCharsCount + 8) / 8 * 8;
int mainBitsLength = totalBitsLength / BlockBitsCount * BlockBitsCount;
int tailBitsLength = totalBitsLength - mainBitsLength;
int mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
int tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
BigInteger tailBits = CharsToBits(data, mainCharsCount, tailCharsCount);
if (tailBits >> tailBitsLength != 0)
{
	totalBitsLength += 8;
	mainBitsLength = totalBitsLength / BlockBitsCount * BlockBitsCount;
	tailBitsLength = totalBitsLength - mainBitsLength;
	mainCharsCount = mainBitsLength * BlockCharsCount / BlockBitsCount;
	tailCharsCount = (tailBitsLength * BlockCharsCount + BlockBitsCount - 1) / BlockBitsCount;
}
int iterationCount = mainCharsCount / BlockCharsCount;

Experiment

Diagram of optimal block size and alphabet length dependence has been calculated with help of system above for mbc = 64:

<details> <summary>Diagram values</summary>

a	n	k	r_bits	r_values
2	1	1	1	1
3	19	12	1.5833	1.0136
4	2	1	2	1
5	58	25	2.32	1.034
6	31	12	2.5833	1.0136
7	14	5	2.8	1.0258
8	3	1	3	1
9	19	6	3.1667	1.0136
10	63	19	3.3158	1.0842
11	38	11	3.4545	1.038
12	43	12	3.5833	1.0136
13	37	10	3.7	1.0031
14	19	5	3.8	1.0258
15	39	10	3.9	1.0489
16	4	1	4	1
17	49	12	4.0833	1.0349
18	25	6	4.1667	1.0136
19	55	13	4.2308	1.1672
20	56	13	4.3077	1.1369
21	57	13	4.3846	1.0719
22	49	11	4.4545	1.038
23	9	2	4.5	1.0332
24	55	12	4.5833	1.0136
25	51	11	4.6364	1.0588
26	47	10	4.7	1.0031
27	19	4	4.75	1.0136
28	24	5	4.8	1.0258
29	34	7	4.8571	1.0041
30	49	10	4.9	1.0489
31	64	13	4.9231	1.3237
32	5	1	5	1
33	5	1	5	1.0313
34	61	12	5.0833	1.0349
35	41	8	5.125	1.024
36	31	6	5.1667	1.0136
37	26	5	5.2	1.0333
38	47	9	5.2222	1.1739
39	58	11	5.2727	1.1015
40	53	10	5.3	1.1642
41	16	3	5.3333	1.0517
42	43	8	5.375	1.1008
43	27	5	5.4	1.0953
44	60	11	5.4545	1.038
45	60	11	5.4545	1.329
46	11	2	5.5	1.0332
47	61	11	5.5455	1.0721
48	39	7	5.5714	1.0679
49	28	5	5.6	1.0523
50	62	11	5.6364	1.0588
51	17	3	5.6667	1.012
52	57	10	5.7	1.0031
53	63	11	5.7273	1.005
54	23	4	5.75	1.0136
55	52	9	5.7778	1.0226
56	29	5	5.8	1.0258
57	64	11	5.8182	1.1187
58	41	7	5.8571	1.0041
59	47	8	5.875	1.0433
60	59	10	5.9	1.0489
61	59	10	5.9	1.2375
62	59	10	5.9	1.456
63	59	10	5.9	1.7086
64	6	1	6	1
65	6	1	6	1.0156
66	6	1	6	1.0313
67	6	1	6	1.0469
68	6	1	6	1.0625
69	61	10	6.1	1.0609
70	49	8	6.125	1.024
71	43	7	6.1429	1.034
72	37	6	6.1667	1.0136
73	37	6	6.1667	1.1011
74	31	5	6.2	1.0333
75	56	9	6.2222	1.042
76	56	9	6.2222	1.1739
77	25	4	6.25	1.0476
78	25	4	6.25	1.1031
79	63	10	6.3	1.0266
80	63	10	6.3	1.1642
81	19	3	6.3333	1.0136
82	19	3	6.3333	1.0517
83	51	8	6.375	1.0002
84	51	8	6.375	1.1008
85	32	5	6.4	1.0331
86	32	5	6.4	1.0953
87	45	7	6.4286	1.0722
88	58	9	6.4444	1.098
89	58	9	6.4444	1.2155
90	58	9	6.4444	1.3441
91	13	2	6.5	1.0109
92	13	2	6.5	1.0332
93	13	2	6.5	1.0558
94	13	2	6.5	1.0786

</details>

One can see that known base64, base85, base91 encodings have been developed in good points (minimal block size with good compression ratio).

Big numbers (C# BitInteger and JavaScript BigInt) are used for some calculations.

The algorithm also has a parallel version.

License

BaseN algorithm is licensed under the Apache 2.0 License.

More detail explanation available on Russian.