SLAXML

SLAXML is a pure-Lua SAX-like streaming XML parser. It is more robust than many simpler pattern-based parsers (such as mine), properly supporting markup like <expr test="5 > 7" />, CDATA nodes, comments, namespaces, and processing instructions.

It is not yet a fully conforming XML parser, however: it accepts certain syntactically invalid (not well-formed) XML without reporting an error.

Features

Usage

local SLAXML = require 'slaxml'

local f = assert(io.open('my.xml')) -- fail with a clear message if the file is missing
local myxml = f:read('*a')
f:close()

-- Specify as many/few of these as you like
local parser = SLAXML:parser{
  startElement = function(name,nsURI,nsPrefix)       end, -- When "<foo" or "<x:foo" is seen
  attribute    = function(name,value,nsURI,nsPrefix) end, -- attribute found on current element
  closeElement = function(name,nsURI)                end, -- When "</foo>" or "</x:foo>" or "/>" is seen
  text         = function(text,cdata)                end, -- text and CDATA nodes (cdata is true for cdata nodes)
  comment      = function(content)                   end, -- comments
  pi           = function(target,content)            end, -- processing instructions e.g. "<?yes mon?>"
}

-- Ignore whitespace-only text nodes and strip leading/trailing whitespace from text
-- (does not strip leading/trailing whitespace from CDATA)
parser:parse(myxml,{stripWhitespace=true})
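
Because the callbacks are ordinary Lua functions, they can close over state of your own. As a rough sketch (the makeStats helper and its field names are made up for illustration; they are not part of SLAXML), here is a callback table that tallies element names and tracks nesting depth, relying only on the callback signatures shown above:

```lua
-- Hypothetical helper, not part of SLAXML: builds a callback table that
-- tallies element names, nesting depth, and the total size of text seen.
local function makeStats()
  local stats = { counts = {}, depth = 0, maxDepth = 0, textBytes = 0 }
  local callbacks = {
    startElement = function(name, nsURI, nsPrefix)
      stats.counts[name] = (stats.counts[name] or 0) + 1
      stats.depth = stats.depth + 1
      if stats.depth > stats.maxDepth then stats.maxDepth = stats.depth end
    end,
    -- Also fires for self-closing elements ("/>"), so depth stays balanced
    closeElement = function(name, nsURI)
      stats.depth = stats.depth - 1
    end,
    text = function(text, cdata)
      stats.textBytes = stats.textBytes + #text
    end,
  }
  return callbacks, stats
end
```

You would then pass the table to the parser, along the lines of SLAXML:parser(callbacks):parse(myxml), and read the stats table afterwards.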

If you just want to see if it will parse your document correctly, you can simply do:

local SLAXML = require 'slaxml'
SLAXML:parse(myxml)

…which will cause SLAXML to use its built-in callbacks that print the results as they are seen.

DOM Builder

If you simply want to build tables from your XML, you can use the DOM builder instead:

local SLAXML = require 'slaxdom' -- also requires slaxml.lua; be sure to copy both files
local doc = SLAXML:dom(myxml)

The returned table is a 'document' composed of tables for elements, attributes, text nodes, comments, and processing instructions. See the following documentation for what each supports.
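
Since everything is plain tables, walking the result takes only a few lines. The sketch below prints an outline of element names. It assumes the document table exposes its children in a kids array just as elements do, and it uses a hand-built stand-in table rather than real parser output so that it runs without slaxdom; the outline helper is hypothetical.

```lua
-- Hypothetical helper: collects an indented outline of element names.
local function outline(node, depth, lines)
  depth, lines = depth or 0, lines or {}
  if node.type == 'element' then
    lines[#lines+1] = string.rep('  ', depth) .. node.name
    depth = depth + 1
  end
  -- Text nodes, comments, etc. have no kids, so fall back to an empty list
  for _, kid in ipairs(node.kids or {}) do
    outline(kid, depth, lines)
  end
  return lines
end

-- Hand-built stand-in for what SLAXML:dom() might return for
-- <p>Hello <em>you</em></p>; only the fields used above are filled in.
local doc = {
  type = 'document',
  kids = {
    { type = 'element', name = 'p', kids = {
      { type = 'text', value = 'Hello ' },
      { type = 'element', name = 'em', kids = {
        { type = 'text', value = 'you' },
      } },
    } },
  },
}

print(table.concat(outline(doc), '\n'))
--> p
-->   em
```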

DOM Table Features

Finding Text for a DOM Element

The following function can be used to calculate the "inner text" for an element:

function elementText(el)
  local pieces = {}
  for _,n in ipairs(el.kids) do
    if n.type=='element' then pieces[#pieces+1] = elementText(n)
    elseif n.type=='text' then pieces[#pieces+1] = n.value
    end
  end
  return table.concat(pieces)
end

local xml  = [[<p>Hello <em>you crazy <b>World</b></em>!</p>]]
local para = SLAXML:dom(xml).root
print(elementText(para)) --> "Hello you crazy World!"

A Simpler DOM

If you want the DOM tables to be easier to inspect, you can supply the simple option:

local dom = SLAXML:dom(myxml,{ simple=true })

In this case the document will have no root property, no table will have a parent property, elements will not have the el collection, and the attr collection will be a simple array (without values accessible directly via attribute name). In short, the output will be a strict hierarchy with no internal references to other tables, and all data represented in exactly one spot.
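
If your code needs to cope with both flavors, a small lookup helper can hide the difference. This is only a sketch: getAttr is a hypothetical name, and it assumes that in the default DOM el.attr[name] holds the attribute's string value, as the paragraph above implies. Note that it matches on the local name only, so attributes that share a name across namespaces are not distinguished.

```lua
-- Hypothetical helper: looks up an attribute value by name in either
-- the default DOM or a simple=true DOM.
local function getAttr(el, name)
  local attrs = el.attr or {}
  -- Default DOM (assumed): values are also keyed directly by attribute name
  if type(attrs[name]) == 'string' then return attrs[name] end
  -- Simple DOM: attr is just an array of attribute tables, so scan it
  for _, a in ipairs(attrs) do
    if a.name == name then return a.value end
  end
  return nil
end
```

For an element parsed from <e alpha="y"/>, getAttr(el, 'alpha') should give "y" whichever way the DOM was built.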

Serializing the DOM

You can serialize any DOM table to an XML string by passing it to the SLAXML:xml() method:

local SLAXML = require 'slaxdom'
local doc = SLAXML:dom(myxml)
-- ...modify the document...
local xml = SLAXML:xml(doc)

The xml() method takes an optional table of options as its second argument:

local xml = SLAXML:xml(doc,{
  indent = 2,    -- each pi/comment/element/text node on its own line, indented by this many spaces
  indent = '\t', -- ...or, supply a custom string to use for indentation
  sort   = true, -- sort attributes by name, with no-namespace attributes coming first
  omit   = {...} -- an array of namespace URIs; removes elements and attributes in these namespaces
})

When using the indent option, you likely want to ensure that you parsed your DOM using the stripWhitespace option. This will prevent you from having whitespace text nodes between elements that are then placed on their own indented line.
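
If you already have a DOM that was parsed without stripWhitespace, re-parsing is the simplest fix. Failing that, something like the following sketch could prune whitespace-only text nodes after the fact. The pruneWhitespace helper is hypothetical, it assumes CDATA text nodes carry a truthy cdata field, and unlike stripWhitespace it does not trim the text nodes it keeps.

```lua
-- Hypothetical helper: removes whitespace-only text nodes from a DOM subtree.
-- Mutates the tree in place and returns the node it was given.
local function pruneWhitespace(node)
  local kids = node.kids
  if not kids then return node end
  local kept = {}
  for _, kid in ipairs(kids) do
    -- Leave CDATA alone (assumed to be flagged by a cdata field)
    local blank = kid.type == 'text' and not kid.cdata
                  and not kid.value:find('%S')
    if not blank then
      kept[#kept+1] = kid
      pruneWhitespace(kid)
    end
  end
  node.kids = kept
  return node
end
```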

Some examples showing the serialization options:

local xml = [[
<!-- a simple document showing sorting and namespace culling -->
<r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2" xmlns:a="uri3">
  <e a:foo="f" x:alpha="a" a:bar="b" alpha="y" beta="beta" />
  <a:wrap><f/></a:wrap>
</r>
]]

local dom = SLAXML:dom(xml, {stripWhitespace=true})

print(SLAXML:xml(dom))
--> <!-- a simple document showing sorting and namespace culling --><r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2" xmlns:a="uri3"><e a:foo="f" x:alpha="a" a:bar="b" alpha="y" beta="beta"/><a:wrap><f/></a:wrap></r>

print(SLAXML:xml(dom, {indent=2}))
--> <!-- a simple document showing sorting and namespace culling -->
--> <r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2" xmlns:a="uri3">
-->   <e a:foo="f" x:alpha="a" a:bar="b" alpha="y" beta="beta"/>
-->   <a:wrap>
-->     <f/>
-->   </a:wrap>
--> </r>

print(SLAXML:xml(dom.root.kids[2]))
--> <a:wrap><f/></a:wrap>
-- NOTE: you can serialize any DOM table node, not just documents

print(SLAXML:xml(dom.root.kids[1], {indent=2, sort=true}))
--> <e alpha="y" beta="beta" a:bar="b" a:foo="f" x:alpha="a"/>
-- NOTE: attributes with no namespace come first

print(SLAXML:xml(dom, {indent=2, omit={'uri3'}}))
--> <!-- a simple document showing sorting and namespace culling -->
--> <r c="1" z="3" b="2" xmlns="uri1" xmlns:x="uri2">
-->   <e x:alpha="a" alpha="y" beta="beta"/>
--> </r>
-- NOTE: Omitting a namespace omits:
--       * namespace declaration(s) for that space
--       * attributes prefixed for that namespace
--       * elements in that namespace, INCLUDING DESCENDANTS

print(SLAXML:xml(dom, {indent=2, omit={'uri3', 'uri2'}}))
--> <!-- a simple document showing sorting and namespace culling -->
--> <r c="1" z="3" b="2" xmlns="uri1">
-->   <e alpha="y" beta="beta"/>
--> </r>

print(SLAXML:xml(dom, {indent=2, omit={'uri1'}}))
--> <!-- a simple document showing sorting and namespace culling -->
-- NOTE: Omitting namespace for the root element removes everything

Serialization of elements and attributes ignores the nsURI property in favor of the nsPrefix property. As such, you can construct DOMs that serialize to invalid XML:

local el = {
  type="element",
  nsPrefix="oops", name="root",
  attr={
    {type="attribute", name="xmlns:nope", value="myuri"},
    {type="attribute", nsPrefix="x", name="wow", value="myuri"}
  }
}
print( SLAXML:xml(el) )
--> <oops:root xmlns:nope="myuri" x:wow="myuri"/>

So, if you want to use a foo prefix on an element or attribute, be sure to add an appropriate xmlns:foo attribute defining that namespace on an ancestor element.
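
To catch this before serializing a hand-built tree, you could run a check along these lines. It is a sketch under a few assumptions: undeclaredPrefixes is a hypothetical helper, namespace declarations are recognized purely by an attribute name of the form xmlns:foo (as in the example above), and the reserved xml prefix is treated as always declared.

```lua
-- Hypothetical helper: returns a set of prefixes that are used on elements
-- or attributes without an xmlns:prefix declaration on the element itself
-- or on an ancestor.
local function undeclaredPrefixes(el, inScope, found)
  found = found or {}
  -- Copy the inherited scope so sibling subtrees do not share declarations
  local scope = {}
  for prefix in pairs(inScope or {}) do scope[prefix] = true end
  for _, a in ipairs(el.attr or {}) do
    local declared = a.name:match('^xmlns:(.+)$')
    if declared and not a.nsPrefix then scope[declared] = true end
  end
  local function check(prefix)
    if prefix and prefix ~= 'xml' and not scope[prefix] then
      found[prefix] = true
    end
  end
  check(el.nsPrefix)
  for _, a in ipairs(el.attr or {}) do check(a.nsPrefix) end
  for _, kid in ipairs(el.kids or {}) do
    if kid.type == 'element' then undeclaredPrefixes(kid, scope, found) end
  end
  return found
end

-- The problematic element from the example above
local el = {
  type="element",
  nsPrefix="oops", name="root",
  attr={
    {type="attribute", name="xmlns:nope", value="myuri"},
    {type="attribute", nsPrefix="x", name="wow", value="myuri"}
  }
}
local missing = undeclaredPrefixes(el)
-- missing.oops and missing.x are both true; "nope" is declared but never used
```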

Known Limitations / TODO

History

v0.8.1 2022-Dec-31

v0.8 2018-Oct-23

v0.7 2014-Sep-26

v0.6.1 2014-Sep-25

v0.6 2014-Apr-18

v0.5.3 2014-Feb-12

v0.5.2 2013-Nov-7

v0.5.1 2013-Feb-18

v0.5 2013-Feb-18

v0.4.3 2013-Feb-17

v0.4 2013-Feb-16

v0.3 2013-Feb-15

v0.2 2013-Feb-15

v0.1 2013-Feb-7

License

Copyright © 2013 Gavin Kistner

Licensed under the MIT License. See LICENSE.txt for more details.