yaccety_sax

Fast, StAX-like XML Parser for BEAM Languages

The big idea

Unlike SAX or DOM parsing, which force the user to handle everything at once, this parser lets the user consume events from a stream whenever it suits them: simply call next_event on the stream. As a result, a single process can parse multiple streams at the same time. The stream behaves like an iterator over a list or set, except that it yields XML events.
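As a minimal sketch of this pull-style loop, the module below counts the events in a document held in a binary. It uses only `yaccety_sax:stream/1`, `yaccety_sax:next_event/1`, and the `endDocument` event type, all of which appear in the example further down this page. The module and function names (`event_count`, `count_events/1`) are hypothetical and exist only for illustration.

```erlang
-module(event_count).
-export([count_events/1]).

%% Counts the events in an XML document held in a binary.
%% Assumes yaccety_sax is on the code path.
count_events(Xml) when is_binary(Xml) ->
    count_loop(yaccety_sax:stream(Xml), 0).

%% Pull one event at a time; stop once the endDocument event has been seen.
count_loop(State, Count) ->
    {Event, State1} = yaccety_sax:next_event(State),
    case Event of
        #{type := endDocument} -> Count + 1;
        #{type := _} -> count_loop(State1, Count + 1)
    end.
```

Because the caller decides when to ask for the next event, the same process could hold several such states and advance them in any order.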

Conformance and Features

yaccety_sax is a namespace-aware, non-validating XML 1.0 parser.

Roadmap of Coming Features

Parsing SOAP from an API

Chances are that when parsing XML from some REST API, you won't need many of the features yaccety has. This is what yaccety_sax_simple is for.

It works mostly the same way as the full version, except for:

Examples

Checking XML files with different encodings and levels of human-readability for equality

kinda_equal(Filename1, Filename2) ->
    % UTF-16 file with external DTD and full of whitespace nodes
    {Cont, Init} = ys_utils:trancoding_file_continuation(Filename1),
    LhState = yaccety_sax:stream(Init, [
        {whitespace, false},
        {comments, false},
        {proc_inst, false},
        {continuation, {Cont, <<>>}},
        {base, filename:dirname(Filename1)},
        {external, fun ys_utils:external_file_reader/2}
    ]),
    % Start Document event
    {_, LhState1} = yaccety_sax:next_event(LhState),
    % DTD event
    {_, LhState2} = yaccety_sax:next_event(LhState1),

    % UTF-8 file with no DTD or whitespace nodes
    % Could have streamed this file as well...
    {ok, Bin2} = file:read_file(Filename2),
    RhState = yaccety_sax:stream(Bin2),
    % Start Document event
    {_, RhState1} = yaccety_sax:next_event(RhState),

    % Now both streams are in a comparable state, so diff them
    diff_loop(LhState2, RhState1).

diff_loop(LhState, RhState) ->
    {LhEvent, LhState1} = yaccety_sax:next_event(LhState),
    {RhEvent, RhState1} = yaccety_sax:next_event(RhState),
    #{type := EventType} = LhEvent,
    % Some function that checks equality, maybe ignoring
    % namespaces or prefixes or something.
    case equal_enough(LhEvent, RhEvent) of
        true when EventType =:= endDocument -> true;
        true -> diff_loop(LhState1, RhState1);
        false -> false
    end.
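The example leaves `equal_enough/2` undefined. One possible shape for it, as a sketch under stated assumptions: require both events to have the same `type`, then compare the remaining fields after dropping keys that should not affect equality. Only the `type` key is confirmed by the code above; the ignored key names (`prefix`, `line`) and the module name `event_compare` are hypothetical placeholders, so substitute whatever keys the real event maps carry.

```erlang
-module(event_compare).
-export([equal_enough/2]).

%% Keys to ignore when comparing events. These names are assumptions
%% for illustration only; just `type` is known from the example above.
-define(IGNORED_KEYS, [prefix, line]).

%% Events match when they share a type and agree on every other key
%% that is not in the ignore list.
equal_enough(#{type := Type} = Lh, #{type := Type} = Rh) ->
    maps:without(?IGNORED_KEYS, Lh) =:= maps:without(?IGNORED_KEYS, Rh);
equal_enough(_, _) ->
    false.
```

Since a type mismatch returns false, `diff_loop/2` also stops with false when one document ends before the other.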

Some (early) numbers

Just for fun, I parsed a 5.2 GB Wiki abstract dump with a callback that throws away all events:

Another big difference is that the xmerl process held onto about 42 MB by the end of parsing. yaccety never went above 109 KB.

I didn't attempt to use xmerl_scan on the 5.2 GB file. I'm not sure it's a good idea to try.

I'm sure there are other parsers out there that stream-parse large data. It would be cool to see how all of them react.

The repo name

Anyone who has seen The Benny Hill Show knows the song that inspired the repo's name: Yakety Sax.