Home

Awesome

Bssom Specification

Bssom(Binary search algorithm structure model object binary marshalling) is a protocol for structured marshalling of objects using a binary search algorithm model. The marshalled data has special metadata information,according to these metadata information, you can efficiently only read and change an element in the object,In this way, when serializing and deserializing large objects, you do not have to cause complete serialization overhead because only one field is read or written.

Bssom is different from other binary protocols in that it has two characteristics:

  1. <a name="characteristic1"></a>can read an element in an object without deserializing the object in its entirety, this means that when I design Bssom, I need to allow the designed type to have a special metadata format, so that the metadata can be used to locate the specified element, and read it.
  2. <a name="characteristic2"></a>can change an element in an object without serializing the entire object, this requires that some types of bssom should be designed as fixed length structures, and their formats will not change with the change of values.

Based on these two characteristics, Bssom defines the two concepts of type system and format.

Serialization is conversion from application objects into Bssom formats via Bssom type system, Deserialization is conversion from Bssom formats into application objects via Bssom type system.

This document introduces the bssom type system and the bssom format.

Table of contents

Type system

The Bssom type system defines 11 categories

Native type:

Bssom allows applications to define application-specific types using native types, the defined native type format consists of an integer and data, the integer represents the data length of the native type. When performing cross-program interaction, the application can freely choose whether to process the unknown native type. If you choose to ignore it, you can skip its data by reading the integer.

Extension type:

In order to ensure the type extension of the Bssom protocol, an extension type is designed. The extension type is composed of an extension type code and data. The application should support it as much as possible, but there is currently no existing extension type.

Blank type:

When changing the Bssom format object or one of its elements, if the width of the changed value is smaller than the original object, a gap will be generated. This gap needs to be supplemented with fillers. The Blank type is used to fill blank bytes here

Supplement:

Format

Overview

format namefirst byte (in binary)first byte (in hex)
VarBlank0xxxxxxx0x00 - 0x7f
UInt16Blank100000000x80
UInt32Blank100000010x81
Null100000100x82
Int8100000110x83
Int16100001000x84
Int32100001010x85
Int64100001100x86
UInt8100001110x87
UInt16100010000x88
UInt32100010010x89
UInt64100010100x8a
Float32100010110x8b
Float64100011000x8c
Boolean100011010x8d
Timestamp100011100x8e
String100011110x8f
(never used)0x90 - 0xc0
Map1110000010xc1
Map2110000100xc2
(never used)0xc3 - 0xd0
Array1110100010xd1
Array2110100100xd2
Array3110100110xd3
(never used)0xd4 - 0xf0
ExtendCode111100010xf1
NativeCode111100100xf2
(never used)0xf4 - 0xff

Specific:

There is a variable-length unsigned integer format in Bssom: VarUInt , this format is not in the defined type system, and is only used in the internal Bssom format, such as recording the length and number of elements of Array, String, Map

VarUInt formatfirst byterange
VariableUInt80x00~0xfa0~250
VariableUInt90xfb251~505
FixUInt80xfc0~255
FixUInt160xfd0~65535
FixUInt320xfe0~4294967295
FixUInt640xff0~18446744073709551615

The format content for each format is shown in detail next

Symbolic representation

one byte:
+--------+
|        |
+--------+

a set of bytes:
+========+
|        |
+========+

objects defined in the format:
+~~~~~~~~~~~~~~~~~+
|                 |
+~~~~~~~~~~~~~~~~~+

Internal VarUInt Type format

VarUInt has six formats, which are used to represent the length of the object inside the Bssom format

1.VariableUInt8 : Use a byte to store 8-bit positive integers, the minimum value is 0x00, and the maximum value is 0xfa
+--------+   +--------+
|  0x00  | ~ |  0xfa  |
+--------+   +--------+

2.VariableUInt9 : Use two bytes, the second byte stores 8-bit positive integers, the minimum value is 0xfb+0x00, and the maximum value is 0xfa+0xff
+--------+--------+   +--------+--------+
|  0xfb  |  0x00  | ~ |  0xfb  |  0xff  |
+--------+--------+   +--------+--------+

3.FixUInt8 : Use two bytes, the second byte stores an 8-bit positive integer
+--------+--------+
|  0xfc  |XXXXXXXX|
+--------+--------+

4.FixUInt16 : Three bytes are used, and small-endian 16-bit positive integers are stored starting from the second byte
+--------+--------+--------+
|  0xfd  |XXXXXXXX|XXXXXXXX|
+--------+--------+--------+

5.FixUInt32 : Five bytes are used, and little-endian 32-bit positive integers are stored starting from the second byte
+--------+--------+--------+--------+--------+
|  0xfe  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

6.FixUInt64 : Nine bytes are used, and little-endian 64-bit positive integers are stored starting from the second byte
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0xff  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

Note: in the following format, varuint is represented by the following symbols

VarUInt:
+*------*+
|        |
+*------*+

Blank Type format

There are three formats for the blank type.

1.variable blank :  A byte to store a 7-bit positive integer, use to represent the next blank byte with a maximum length of 0x7f
+--------+=========+
|0XXXXXXX| N*blank |
+--------+=========+

2.UInt16 Blank : Two bytes are used to store 16 bit positive integers to represent the next blank byte with a maximum length of 2 ^ 16-1
+--------+--------+--------+=========+
|  0x80	 |XXXXXXXX|XXXXXXXX| N*blank |
+--------+--------+--------+=========+

3.UInt32 Blank : Four bytes are used to store 32-bit positive integers to represent the next blank byte with a maximum length of 2 ^ 32-1
+--------+--------+--------+--------+--------+=========+
|  0x81	 |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX| N*blank |
+--------+--------+--------+--------+--------+=========+

Null Type format

Null types have only one format.

null:
+--------+
|  0x82  |
+--------+

Boolean Type format

The Boolean type has two formats, representing false and true respectively.

1.false:
+--------+--------+
|  0x8d  |  0x00  |
+--------+--------+

2.true:
+--------+--------+
|  0x8d  |  0x01  |
+--------+--------+

Number Type format

The Number type contains 8 formats Int8,Int16,Int32,Int64,UInt8,UInt16,UInt32,UInt64

Int8 stores an 8-bit signed integer
+--------+--------+
|  0x83  |XXXXXXXX|
+--------+--------+

Int16 stores a 16-bit little-endian signed integer
+--------+--------+--------+
|  0x84  |XXXXXXXX|XXXXXXXX|
+--------+--------+--------+

Int32 stores a 32-bit little-endian signed integer
+--------+--------+--------+--------+--------+
|  0x85  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

Int64 stores a 64-bit little-endian signed integer
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x86  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

UInt8 stores an 8-bit unsigned integer
+--------+--------+
|  0x87  |XXXXXXXX|
+--------+--------+

UInt16 stores a 16-bit little-endian unsigned integer
+--------+--------+--------+
|  0x88  |XXXXXXXX|XXXXXXXX|
+--------+--------+--------+

UInt32 stores a 32-bit little-endian unsigned integer
+--------+--------+--------+--------+--------+
|  0x89  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

UInt64 stores a 64-bit little-endian unsigned integer
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x8a  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

Float Type format

The Float type contains two formats, Float32,Float64

Float32 stores a floating-point number in IEEE 754 single-precision floating-point number 32-bit format
+--------+--------+--------+--------+--------+
|  0x8b  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

Float64 stores a floating-point number in IEEE 754 single-precision floating-point number 64-bit format:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x8c  |YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

* When storing, floating-point numbers are stored in little-endian order

String Type format

The string type has only one format, which consists of byte size and Utf8 encoded byte array

+--------+*-------*+========+
|  0x8f  | DataLen |  Data  |
+--------+*-------*+========+

* Data is the byte array after encoding the string with Utf8 encoding
* DataLen is the length of the encoded byte array

Timestamp Type format

The Timestamp type has only one format, the format is the number of seconds since the Unix Epoch represented by 8 bytes and the increment within a given number of seconds of 4 bytes.

+--------+------------------------------+------------------------------------+
|  0x8e  | seconds in 64-bit signed int | nanoseconds in 32-bit unsigned int |
+--------+------------------------------+------------------------------------+

Array Type format

The array type has three formats : Array1,Array2,Array3.

<a name="array1">Array1</a>

The Array1 format is composed of header information and a sequence of elements. In the header information, element type is included, which means that the elements in the array do not need to store type header information, therefore, there are two requirements for array elements in Array1 : 1.Elements in the array cannot contain type headers, 2.The width of the elements in the array are the same.

Because the elements of Array1 are all fixed-length, the access method of obtaining and changing elements can be obtained by: ElementsPos + index * eleSize

+--------+~~~~~~~~~~~~~~~+*------*+*-----*+~~~~~~~~~~~~~~~~~+
|  0xd1	 |  ElementType  | Length | Count |    N*Element    |
+--------+~~~~~~~~~~~~~~~+*------*+*-----*+~~~~~~~~~~~~~~~~~+

* ElementsPos refers to the starting position of the element sequence (the end offset of the Count field)
* ElementType : see type system
* The value of Length is the end of the entire data minus the starting offset of Count
* The value of Count is the number of elements
* N represents the number of elements

<a name="array2">Array2</a>

Array2 format is composed of length, number of elements and element sequence, unlike Array1, Array2 has no requirements for elements.

Because there are no restrictions on the width of elements in Array2, the access method of getting elements and changing elements can be obtained by: ElementsPos + index * SkipElement

+--------+*------*+*-----*+~~~~~~~~~~~~~~~~~+
|  0xd2	 | Length | Count |    N*Element    |
+--------+*------*+*-----*+~~~~~~~~~~~~~~~~~+

* ElementsPos refers to the starting position of the element sequence (the end offset of the Count field)
* The value of Length is the end of the entire data minus the starting offset of Count
* The value of Count is the number of elements

<a name="array3">Array3</a>

Array3 format is composed of length, number of elements, element offset sequence, element sequence.

Because Array3 has a sequence of element offsets, the access method of getting elements and changing elements can be obtained by: BasePos + GetElementOffset(index)

+--------+*------*+*-----*+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0xd3	 | Length | Count |  ElementOffsets |    N*Element    |
+--------+*------*+*-----*+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

ElementOffsets is a sequence of positive integers composed of the VarUInt format. Each VarUInt represents the offset of an element at the index based on the starting position (BasePos) of the entire data segment:
+*--------------*+*--------------*+*--------------*+~~~~~~~~~~+~~~~~~~~~~+~~~~~~~~~~+
| Element1Offset | Element2Offset | Element3Offset | Element1 | Element2 | Element3 |
+*--------------*+*--------------*+*--------------*+~~~~~~~~~~+~~~~~~~~~~+~~~~~~~~~~+

* BasePos is the starting position of the 0xd3 type code
* The value of Length is the end of the entire data minus the starting offset of Count
* The value of Count is the number of elements and also represents the number of formats in ElementOffsets

Map Type format

There are two formats of Map type , Map1 and Map2.

<a name="map1">Map1</a>

The Map1 format is composed of header information and a sequence of key-value pairs. The header information contains the data length and the number of key-value pairs.

+--------+*-------*+*-----*+~~~~~~~~~~~~~~~~~~+
|  0xc1  | DataLen | Count |  N * (Key,Value) |
+--------+*-------*+*-----*+~~~~~~~~~~~~~~~~~~+

* The value of DataLen is the end offset of the entire data minus the starting offset of Count
* The value of Count is the number of key-value pairs
* N represents the number of key-value pairs

<a name="map2">Map2</a>

The difference between map2 format and Map1 is that the value and key are stored separately, and it has a special routing format to quickly locate the value offset.

Map2 consists of three parts: header information, routing information and value.

The header information includes data length, metadata length, number of key-value pairs, and maximum depth. The routing information mainly stores the relative offset of the value corresponding to each Key in the entire data, and the value part stores the sequence of values ​​in turn.

In the following, the header information and routing information are collectively referred to as the metadata part.

+--------+*-------*+*-------*+*-------*+*--------*+~~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~~+
|  0xc2  | DataLen |  Count  |  Depth  | RouteLen |   RouteSegment   |   ValueSegment   |
+--------+*-------*+*-------*+*-------*+*--------*+~~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~~+

* The value of DataLen is the position where the entire data ends minus the starting offset position of RouteLen
* The value of Count is the number of key-value pairs
* The value of Depth is the maximum depth in RouteSegment
* The value of RouteLen is the end position of the entire data minus the start offset position of RouteSegment
* ValueSegment is a sequence of values ​​in key-value pairs

The following will mainly introduce the routing segment format:

To understand the route segment format of Map2, you need to know what a binary search algorithm is, the metadata of Map2 is derived from the idea of ​​a binary search algorithm.

There is an array here, you need to design an algorithm to quickly find the subscript of the element in the array.

[120,3,1,125,5]

The most primitive method is to linearly traverse the array directly, so that the complexity of the algorithm is o(n). If you can sort the array before searching, and then fold it in half each time to search by interval, the complexity will be Is reduced to logn, such an algorithm is called a binary search algorithm.

Therefore, we can write BinarySearch as follows:

function BinarySearch(int[] array,int index,int length){
  int lo = index;
  int hi = index + length - 1;
  while (lo <= hi)
  {
      int i = lo + ((hi - lo) >> 1);
      int order = array[i] - value;
      if (order == 0) return i;
      if (order < 0)
          lo = i + 1;
      else
          hi = i - 1;
  }
  return ~lo;
}

In modern compiler optimizations, switch lookups for constants are often optimized to binary search branch code in some cases.If the elements of the array are always the same, we can always find them using the following specific lookup code:

function compiled_BinarySearch(int input)
{
	if(input<=3)
	{
		if(input == 1)
			return;
		else if(input == 3)
			return;
	}
	else
	{
		if(input == 5)
			return;
		else if(input==120)
			return;
		else if(input==125)
			return;
	}
}

This code is an expanded form of the above BinarySearch, and the search speed is faster.
What the routing section of Map2 describes is what the above code represents. The corresponding Map2 routing format of the above code is as follows:

[0]LessThen1 NextOff(3) KeyBytes(3) 
	[1]EqualNext1 NextOff(2) KeyBytes(1) KeyType ValOffset NoChildren 
	[2]EqualLast1 KeyBytes(3) KeyType ValOffset NoChildren 
[3]LessElse 
	[4]EqualNext1 NextOff(5) KeyBytes(5) KeyType ValOffset NoChildren 
	[5]EqualNext1 NextOff(6) KeyBytes(120) KeyType ValOffset NoChildren 
	[6]EqualLast1 KeyBytes(125) KeyType ValOffset NoChildren

This is a Token expression format used to assist in observing the routing segment, which basically corresponds to the above code.

In bssom, when the width of the key exceeds 8 bytes, it is necessary to split the key by 8 bytes during marshalling (this is because modern computers can usually compare 64 bits at a time), When unmarshalling, you need to restore the values ​​obtained in the process from top to bottom.

Below, I will show a complex format. Because the width of the Keys demonstrated exceeds 8, it has levels and depth, and it is more like a stack structure.

RawValue:
	Map { 
		  {"a1234567b1":1},
		  {"a1234567":2},
		  {"c1234567d1":3},
		  {"p1":4},
		  {"e1234567r1234567":5}  
	 } 

intermediate format:
	+ p1 : 4
	+ a1234567 : 2
		- b1 : 1
	+ c1234567 
		- d1 : 3
	+ e1234567
		- r1234567 : 5

final:
	[1]LessThen8 NextOff(5) KeyU64(3978425819141910881) 
		[2]EqualNext2 NextOff(3) KeyBytes(112,49) KeyType(StringCode) ValOffset NoChildren 
		[3]EqualLast8 KeyU64(3978425819141910881) KeyType(StringCode) ValOffset HasChildren 
			[4]EqualLast2 KeyBytes(98,49) KeyType(StringCode) ValOffset NoChildren 
	[5]LessElse 
		[6]EqualNextN NextOff(8) KeyU64(3978425819141910883) 
			[7]EqualLast2 KeyBytes(100,49) KeyType(StringCode) ValOffset NoChildren 
		[8]EqualLastN KeyU64(3978425819141910885) EqualLast8 KeyU64(3978425819141910898) KeyType(StringCode) ValOffset NoChildren

Map2 defines the following tokens in the routing section:

TokendescriptionTypeValue
LessThen{1-7}An If branch for judging less than or equal to, the number of bytes stored in the judged body (KeyBytes) is 1~7byte21-27
LessThen8An If branch for judging less than or equal to, the judged body (KeyBytes) stores little-endian UInt64byte28
LessElseAn else branch, always corresponding to LessThenbyte30
EqualNext{1-7}An If branch used to judge whether they are equal, the number of bytes stored in the main body (KeyBytes) used to judge is 1~7, and the current branch has a Keybyte1-7
EqualNext8An If branch for judging whether it is equal, the main body (KeyBytes) used for judging is stored as little-endian UInt64, and the current branch has a Keybyte8
EqualNextNAn If branch for judging whether it is equal, the main body (KeyBytes) used for judging is stored as little-endian UInt64, and the current branch does not exist Keybyte9
EqualLast{1-7}An If branch used to judge whether they are equal, representing the last branch in the current body, the number of bytes stored in the body (KeyBytes) used for judgment is 1~7 , and the current branch has a Keybyte11-17
EqualLast8An If branch for judging whether it is equal, representing the last branch in the current body, the body used for judgment (KeyBytes) stores little-endian UInt64, and the current branch has a Keybyte18
EqualLastNAn If branch that judges whether it is equal, represents the last branch in the current main body, the KeyBytes main body used for judgment is stored as little-endian UInt64, and the current branch does not exist Keybyte19
NoChildrenRepresents the current branch has no sub-branchbyte32
NextOffRepresents the relative offset in the data of the next branch in the current logicVarUInt
KeyBytesRepresents the type data of the stored KeyBianry/UInt64
ValOffsetRepresents the relative offset of the Value corresponding to the Key in the dataVarUInt
KeyTypeRepresents the type of Key

Annotation:

The following paragraphs describe the semantics of routing segments:

Branch::=
	->	equalNext(1-8) + nextBranchOff + keyBytes + keyType + valoffset -> nonChildren
	                                                                       -> hasChildern + [Branch]
	->	equalNextN + nextBranchOff + keyBytes + [Branch]
		
	->	equalLast(1-8) + keyBytes + keyType + valoffset -> nonChildren
	                                                       -> hasChildern + [Branch]
	->	equalLastN + keyBytes + [Branch]
	
	->	LessThen + [Branch] + LessElse + [Branch]

Native Type format

The native type stores a VarUInt representing the size of the data

+--------+*------*+========+
|  0xf2  | Length |  Data  |
+--------+*------*+========+

* Length represents the size of Data
* Data represents the native type of data

Extend Type format

The extension type stores a type code used to represent the extension type and the data of the extension type (currently there is no defined extension type)

+--------+----------+========+
|  0xf1  | TypeCode |  Data  |
+--------+----------+========+

* TypeCode represents the type code of the extended type
* Data represents the data of the extended type

Implementation

.NET platform:

  1. Bssom.Net