Awesome
This work has been integrated into dotnet/orleans and this repository is in maintenance mode. See dotnet/orleans#7070
Hagar
There are many existing serialization libraries and formats which are efficient, fast, and support schema evolution, so why create this?
Existing serialization libraries which support version tolerance tend to restrict how data is modelled, usually by providing a very restricted type system which supports few of the features found in common type systems, features such as:
- Polymorphism
- Generics (parametric types)
- References, including cyclic references
Hagar is a new serialization library which does supports these features, is fast & compact, supports schema evolution, and requires minimal input from the developer.
Encoding
- Fields are tagged with compact field ids. Those field ids are provided by developers.
- Fields are encoded into primitives which fall into 4 categories:
- Fixed length - most numerics, unless specifically annotated.
- Variable length - for variable-length integer encoding, useful for length, count, index type properties (relatively small and 0-based in nature).
- Length-prefixed - strings, arrays of fixed-width primitives.
- Tag-delimited - objects, collections of non-primitive types.
- Type information is embedded, but not required for parsing.
- Separation of wire type & runtime type.
- Library of application-defined runtime types are available during encoding & decoding. These types can be given short ids to reduce the size of the serialized payload.
- Types can be parameterized (support for generics).
- Types which are not specified in the type library can be explicitly named.
- These named types are runtime specific (i.e, .NET specific).
- Note: may want to restrict this for security reasons.
- Wire format based around 1-byte tags, which are in one of two forms:
[W W W] [S S] [F F F]
where:W
is a wire type bit.S
is a schema type bit.F
is a field identifier bit.
[1 1 1] [E E] [X X X]
where:E
is an extended wire type bit.X
is reserved for use in the context of the extended wire type.
- The wire type, schema type, and extended wire type are detailed more below.
- When a schema type requires extra data, it is encoded after initial tag.
- When a field id cannot be encoded in 3 bits, it is encoded after schema data.
- Overall encoding takes the form:
Tag Schema FieldId FieldData
. - Every message can be parsed without prior knowledge of any schema type because all wire types have a fixed, well-known format for determining the length of the encoded data.
- When serializing and deserializing data, there is no single, predetermined mapping between a .NET type and a wire encoding. For example, ProtoBufs dictates that an
int64
is encoded as aVarint
and thatfloat32
is encoded as a fixed 32-bit field. Instead, the serializer can determine that along
is encoded asVarInt
,Fixed32
, orFixed64
at runtime depending on which takes up the least space.
/// <summary>
/// Represents a 3-bit wire type, shifted into position
/// </summary>
public enum WireType : byte
{
VarInt = 0b000 << 5, // Followed by a VarInt
TagDelimited = 0b001 << 5, // Followed by field specifiers, then an Extended tag with EndTagDelimited as the extended wire type.
LengthPrefixed = 0b010 << 5, // Followed by VarInt length representing the number of bytes which follow.
Fixed32 = 0b011 << 5, // Followed by 4 bytes
Fixed64 = 0b100 << 5, // Followed by 8 bytes
Reference = 0b110 << 5, // Followed by a VarInt reference to a previously defined object. Note that the SchemaType and type specification must still be included.
Extended = 0b111 << 5, // This is a control tag. The schema type and embedded field id are invalid. The remaining 5 bits are used for control information.
}
public enum SchemaType : byte
{
Expected = 0b00 << 3, // This value has the type expected by the current schema.
WellKnown = 0b01 << 3, // This value is an instance of a well-known type. Followed by a VarInt type id.
Encoded = 0b10 << 3, // This value is of a named type. Followed by an encoded type name.
Referenced = 0b11 << 3, // This value is of a type which was previously specified. Followed by a VarInt indicating which previous type is being reused.
}
public enum ExtendedWireType : byte
{
EndTagDelimited = 0b00 << 3, // This tag marks the end of a tag-delimited object. Field id is invalid.
EndBaseFields = 0b01 << 3, // This tag marks the end of a base object in a tag-delimited object.
}
-
If a type has base types, the fields of the base types are serialized before the subtype fields. Between the base type fields and its sub type is an
EndBaseFields
tag. This allows base types and sub types to have overlapping field ids without ambiguity. Therefore object encoding follows this pattern:[StartTagDelimited] [Base Fields]* [EndBaseFields] [Sub Type fields]* [EndTagDelimited]
. -
Third-party serializers such as ProtoBuf, Bond, .NET's BinaryFormatter, JSON.NET, etc, are supported by serializing using a serializer-specific type id and including the payload via the length-prefixed wire type. This has the advantage of supporting any number of well-known serializers and does not require double-encoding the concrete type, since the external serializer is responsible for that.
Security
Allowing arbitrary types to be specified in a serialized payload is a vector for security vulnerabilities. Because of this, all types should be checked against a whitelist.
Rules
Version Tolerance is supported provided the developer follows a set of rules when modifying types. If the developer is familiar with systems such as ProtoBuf and Bond, then these rules will come as no surprise.
Composites (class
& struct
)
- Inheritance is supported, but modifying the inheritance hierarchy of an object is not supported. The base class of a class cannot be added, changed to another class, or removed.
- With the exception of some numeric types, described in the Numerics section, field types cannot be changed.
- Fields can be added or removed at any point in an inheritance hierarchy.
- Field ids cannot be changed.
- Field ids must be unique for each level in a type hierarchy, but can be reused between base-classes and sub-classes. For example,
Base
class can declare a field with id0
and a different field can be declared bySub : Base
with the same id,0
.
Numerics
- The signedness of a numeric field cannot be changed.
- Conversions between
int
&uint
are invalid.
- Conversions between
- The width of a numeric field can be changed.
- Eg: conversions from
int
tolong
orulong
toushort
are supported. - Conversions which narrow the width will throw if the runtime value of a field would cause an overflow.
- Conversion from
ulong
toushort
are only supported if the value at runtime is less thanushort.MaxValue
. - Conversions from
double
tofloat
are only supported if the runtime value is betweenfloat.MinValue
andfloat.MaxValue
. - Similarly for
decimal
, which has a narrower range than bothdouble
andfloat
.
- Conversion from
- Eg: conversions from
Types
- Types can be added to the system and used as long as either:
- They are only used for newly added fields.
- They are never used on older versions which do not have access to that type.
- Type names cannot be changed unless the type was always registered as a
WellKnown
type or aTypeCodec
is used to translate between the old and new name.
Packages
Packages are published to nuget.org: https://www.nuget.org/packages?q=hagar
Running build.ps1
will build and locally publish packages for testing purposes