Awesome
JSON Decoding Algorithm
Everyone and their dog already has a json parsing and encoding library. So this module is more of a fun curiosity rather than a useful tool.
Now even your pet rock is able to parse JSON.
Parsing tables for JSON
The code in the build_tables.py constructs a parsing table that matches on the railroad diagrams at http://json.org/
There's a recognizer that can push and pop states into a stack. Every state transition is associated with action code that is separate from the recognizer and allows you to parse the contents of the input correctly.
Tutorial for implementing a JSON decoder or parser
Write the following program in your programming language of choice:
def parse(input):
stack = []
state = 0x00
ds = [] # data stack
ss = [] # string stack
es = [] # escape stack
for ch in input:
cat = catcode[min(ord(ch), 0x7E)]
state = parse_ch(cat, ch, stack, state, ds, ss, es)
state = parse_ch(catcode[32], u'', stack, state, ds, ss, es)
if state != 0x00:
raise Exception("JSON decode error: truncated")
if len(ds) != 1:
raise Exception("JSON decode error: too many objects")
return ds.pop()
def parse_ch(cat, ch, stack, state, ds, ss, es):
while True:
code = states[state][cat]
action = code >> 8 & 0xFF
code = code & 0xFF
if action == 0xFF and code == 0xFF:
raise Exception("JSON decode error: syntax")
elif action >= 0x80: # shift
stack.append(gotos[state])
action -= 0x80
if action > 0:
do_action(action, ch, ds, ss, es)
if code == 0xFF:
state = stack.pop()
else:
state = code
return state
# This action table is unique for every language.
# It also depends on which structures you want to
# generate.
def do_action(action, ch, ds, ss, es):
if action == 0x1: # push list
ds.append([])
# Push object to ds
elif action == 0x2: # push object
ds.append({})
elif action == 0x3: # pop & append
val = ds.pop()
ds[len(ds)-1].append(val)
elif action == 0x4: # pop pop & setitem
val = ds.pop()
key = ds.pop()
ds[len(ds)-1][key] = val
elif action == 0x5: # push null
ds.append(None)
elif action == 0x6: # push true
ds.append(True)
elif action == 0x7: # push false
ds.append(False)
elif action == 0x8: # push string
val = u"".join(ss)
ds.append(val)
ss[:] = [] # clear ss and es stacks.
es[:] = []
elif action == 0x9:
val = int(u"".join(ss)) # push int
ds.append(val)
ss[:] = [] # clear ss stack.
elif action == 0xA:
val = float(u"".join(ss)) # push float
ds.append(val)
ss[:] = []
elif action == 0xB: # push ch to ss
ss.append(ch)
elif action == 0xC: # push ch to es
es.append(ch)
elif action == 0xD: # push escape
ss.append(unichr(escape_characters[ch]))
elif action == 0xE: # push unicode point
ss.append(unichr(int(u"".join(es), 16)))
es[:] = []
else: # This is very unlikely to happen. But make
# a crashpoint here if possible.
# Also if you write it in parts, let this line
# be the first one you write into this routine.
assert False, "JSON decoder bug"
# Non-trivial escape characters. At worst you can
# 'switch' or 'if/else' them into do_action -function.
escape_characters = {'b': 8, 't': 9, 'n': 10, 'f': 12, 'r': 13}
Add the lists 'states', 'gotos', 'catcode' from json.txt in this directory/repository. Add them into same file under your application. Also add the comment in that file so that your code stays maintainable.
If the file is not in the correct format, write a reformatter. DO NOT try to reformat it by hand to avoid errors.
Recreational use
This can be probably used to generate random JSON strings as well. I haven't tried to do that. :D Could be fun and pointless.
How is this special?
This project is unique in the sense that it is probably the easiest to port JSON decoder you can write.
If you wanted to port this, you would only have to rewrite the driver and reformat the parsing tables.
Also the algorithm is incremental. You can suspend it after any character input. It also builds the JSON as it appears.
With small modification it'd be able to parse multiple JSON objects and pass them as they appear in the stream.
Potential uses
The driver is divided into a recognizer and action table. If the recognizer finds an input not in JSON syntax, it raises a SYN error.
The input is interpreted according to how you program the do_action -procedure.
If you want to frown people with traditional JSON parsers, you could adjust the driver to parse multiple objects and read JSON objects as they appear in the TCP stream.
After every JSON object:
if len(ds) > 0 and state == 0:
return ds.pop(0)
If you do this, remember to emit newline or whitespace character after each JSON message. This lets the recipients JSON parser reach the state where it receives the JSON object.
But maybe it's better to use the length#json_message -protocol. :) Very few other JSON parsers are able of doing this.
This is also useful if you don't have a JSON parser you could trust. For example if your parser mishandles backslash characters and doubles them on decode/encode. Or if your parser misrecognizes floats because it tries to parse ',' rather than '.'.
The code that you write to use this driver gets easily tested. And the tables containing the recognizer come from this project, so you can trust those tables have the same behavior as here.
Also can be useful if you want to read the JSON floats into something else than floating point floats. Just write your own do_action that does it differently.
What if I want to encode JSON as well?
To encode JSON you need some routines to tokenize strings and integers. You may also require a pretty printer.
CS-TR-79-770 describes a pretty printing method that should be sufficient for tokenizing JSON.
There's a short example of the algorithm described by that
paper in the printer.py
. Call it with a json file and it
will parse your file with verifier.py
and then pretty
prints it out.
The printer.py
holds everything for stringifying
JSON except the float formatter.
Unicode Gotcha
This code and algorithms should be neutral about handling unicode characters.
With some tuning the pretty printer should even be able to handle output with non-monospace fonts. Though for convenience please treat plaintext json output as if it was monospace.
Whether this code can handle unicode characters depends on the code implemented and the programming language you use to implement it.
Bugfixes
There is some verification that the state table is correct. I have made a program that goes through every state transition.
If you find a bug in the tables, do not modify them. Modify
the program build_tables.py
or file an issue in github.com.
Verifying that it works was tricky. The coverage test I did catched quite few bugs. I'm quite certain it matches with the railroad diagram on the json.org now.
License
MIT License
Copyright (c) 2016 Henri Tuhola
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.