xml-in
your friendly XML navigator
What
XML is this new hot markup language everyone is raving about. Attributes, namespaces, schemas, security, XSL... what's not to love?
xml-in is not about parsing XML, but rather working with already parsed XML.
It takes heavily nested {:tag .. :attrs .. :content [...]} structures that Clojure XML parsers produce and helps to navigate these structures in a Clojure "get-in style" using internal and custom transducers.
- clojure/data.xml is an example of a good and lazy Clojure/ClojureScript XML parser
- funcool/tubax is another example of a ClojureScript XML parser
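To make the shape concrete, here is a small hand-written value in the {:tag :attrs :content} form and a manual walk over it using only clojure.core. The element names are made up for illustration, and this is not how xml-in is implemented; it is only a sketch of the kind of nested digging such a navigator is meant to replace:

```clojure
;; A hand-written value in the shape that XML parsers such as
;; clojure.data.xml produce. The element names are illustrative only.
(def parsed
  {:tag :cosmos
   :attrs {}
   :content [{:tag :star
              :attrs {:class "O"}
              :content [{:tag :radius :attrs {} :content ["16.5"]}]}]})

;; :content is a sequence of child nodes, not a map keyed by tag,
;; so a plain get-in over tag names finds nothing:
(get-in parsed [:cosmos :star :radius])
;; => nil

;; Navigating by hand means filtering children by :tag at every level:
(def by-hand
  (->> [parsed]
       (filter #(= :cosmos (:tag %)))
       (mapcat :content)
       (filter #(= :star (:tag %)))
       (mapcat :content)
       (filter #(= :radius (:tag %)))
       (mapcat :content)))
;; by-hand => ("16.5")
```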
Why
XML navigation in Clojure is usually done with the help of zippers. clojure/data.zip is the common choice, and a typical navigation looks like this:
(data.zip/xml1-> (clojure.zip/xml-zip parsed-xml)
                 :universe
                 :system
                 :delta-orionis
                 :δ-ori-aa1
                 :radius
                 data.zip/text)
There is a great article "XML for fun and profit" that shows how zippers are used to navigate XML DOM trees.
But we can do better: faster, cleaner, composable and "no zippers".
How much faster? Let's see:
zippers:
=> (time (dotimes [_ 250000]
           (data.zip/xml1-> (clojure.zip/xml-zip parsed-xml)
                            :universe
                            :system
                            :delta-orionis
                            :δ-ori-aa1
                            :radius
                            data.zip/text)))
"Elapsed time: 13385.563442 msecs"
xml-in:
=> (time (dotimes [_ 250000]
           (xml/find-first parsed-xml [:universe
                                       :system
                                       :delta-orionis
                                       :δ-ori-aa1
                                       :radius])))
"Elapsed time: 765.884111 msecs"
Property based navigation
Here is the XML document that all the examples in this documentation are based on:
<?xml version="1.0" encoding="UTF-8"?>
<universe>
<system>
<solar>
<planet age="4.543" inhabitable="true">Earth</planet>
<planet age="4.503">Mars</planet>
</solar>
<delta-orionis>
<constellation>Orion</constellation>
<δ-ori-aa1>
<mass>24</mass>
<radius>16.5</radius>
<luminosity>190000</luminosity>
<surface-gravity>3.37</surface-gravity>
<temperature>29500</temperature>
<rotational-velocity>130</rotational-velocity>
</δ-ori-aa1>
</delta-orionis>
</system>
</universe>
It lives in dev-resources/universe.xml.
Since xml-in works with a parsed XML (e.g. a DOM tree), let's parse it once and call it the "universe":
=> (require '[clojure.data.xml :as dx])
=> (def universe (dx/parse-str (slurp "dev-resources/universe.xml")))
#'boot.user/universe
It gets parsed into a common nested {:tag :attrs :content} structure that looks like this:
=> (pprint universe)
{:tag :universe,
:attrs {},
:content
("\n "
{:tag :system,
:attrs {},
:content
("\n "
{:tag :solar,
:attrs {},
:content
("\n "
{:tag :planet,
;; ...
;; ...
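The "\n " strings in the output above are whitespace text nodes that sit between elements. When walking :content by hand it can help to keep only the element nodes. A minimal sketch in plain Clojure, where element-children is a hypothetical helper and not part of xml-in or clojure.data.xml:

```clojure
;; A trimmed-down node in the same shape as the pprint output above.
(def solar
  {:tag :solar
   :attrs {}
   :content ["\n      "
             {:tag :planet
              :attrs {:age "4.543" :inhabitable "true"}
              :content ["Earth"]}
             "\n      "
             {:tag :planet
              :attrs {:age "4.503"}
              :content ["Mars"]}
             "\n    "]})

;; Hypothetical helper: keep element nodes (maps), drop text nodes (strings).
(defn element-children [node]
  (filter map? (:content node)))

(map :tag (element-children solar))
;; => (:planet :planet)
```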
One way to access child nodes in this XML document is to use "a vector of nested properties".
For example, let's check out "those two" planets in a solar system.
Bringing xml-in in:
=> (require '[xml-in.core :as xml])
and
=> (xml/find-all universe [:universe :system :solar :planet])
("Earth" "Mars")
All the planets are returned. In case we need "a" planet we can match the first one and stop searching:
=> (xml/find-first universe [:universe :system :solar :planet])
("Earth")
Notice find-all vs. find-first.
All matching vs. The first matching
Even if only one element matches the search criteria, it is best not to look for it using find-all, since there is a cost to looking at all the child nodes that are on the same level as the matched element.
Let's look at an example. From the XML above, let's find the radius of the δ-ori-aa1 component of the delta-orionis star system:
=> (xml/find-all universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
("16.5")
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
("16.5")
Both find-all and find-first return the same exact value, but we know for a fact that the δ-ori-aa1 component has only one radius, which means it is best found with find-first rather than find-all.
Let's see the performance difference:
=> (time (dotimes [_ 250000]
           (xml/find-all universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])))
"Elapsed time: 1216.927309 msecs"
=> (time (dotimes [_ 250000]
           (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])))
"Elapsed time: 792.958283 msecs"
Quite a difference. The secret is simple: find-first stops searching once it finds a matching element, while find-all keeps checking the remaining siblings.
The saving per lookup is small, but it adds up, especially across a large number of XML documents.
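The early exit can be illustrated with plain transducers. The sketch below shows the general technique under stated assumptions and is not xml-in's actual code: it counts how many sibling nodes get examined when every match is collected versus when the reduction stops at the first match via (take 1). The names siblings, examined and matching are made up for the example:

```clojure
;; Children of a node, in document order (text nodes omitted for brevity).
(def siblings
  [{:tag :mass :content ["24"]}
   {:tag :radius :content ["16.5"]}
   {:tag :luminosity :content ["190000"]}
   {:tag :surface-gravity :content ["3.37"]}])

;; Counts how many nodes the predicate has looked at.
(def examined (atom 0))

;; Transducer that keeps nodes with the given tag and records each check.
(defn matching [tag]
  (filter (fn [node]
            (swap! examined inc)
            (= tag (:tag node)))))

;; "All matching": every sibling is checked.
(reset! examined 0)
(def all-hits
  (into [] (comp (matching :radius) (mapcat :content)) siblings))
(def examined-by-all @examined)    ; 4

;; "First matching": (take 1) ends the reduction right after the first hit.
(reset! examined 0)
(def first-hit
  (into [] (comp (matching :radius) (take 1) (mapcat :content)) siblings))
(def examined-by-first @examined)  ; 2
```

Both runs produce ["16.5"], but the second one never looks at the luminosity and surface-gravity nodes.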
NOTE: find-first returns a "seq", and not just a "single" value, so it can be composed as described in "Creating sub documents".
Functional navigation
Navigation using functions, or rather transducers, adds custom "predicate batteries" to the process.
A few internal batteries are included in xml-in:
=> (require '[xml-in.core :as xml :refer [tag= some-tag= attr=]])
- tag= finds child nodes under all matched tags
- some-tag= finds child nodes under the first matching tag
- attr= finds child nodes under all tags that have the given attribute key and value
Let's find all inhabitable planets of the solar system to the best of our knowledge (i.e. based on the XML above):
=> (xml/find-in universe [(tag= :universe)
                          (tag= :system)
                          (some-tag= :solar)
                          (attr= :inhabitable "true")])
("Earth")
The find-in function takes a parsed XML and a sequence of transducers and computes a sequence from the application of all the transducers composed.
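One way to picture how a sequence of transducers turns into a result is sketched below in plain Clojure. This is a guess at the general shape rather than xml-in's implementation: under-tag, under-attr, find-with and find-path are hypothetical names, and doc is a cut-down version of the document above.

```clojure
(def doc
  {:tag :universe :attrs {}
   :content [{:tag :system :attrs {}
              :content [{:tag :solar :attrs {}
                         :content ["\n"
                                   {:tag :planet
                                    :attrs {:age "4.543" :inhabitable "true"}
                                    :content ["Earth"]}
                                   "\n"
                                   {:tag :planet
                                    :attrs {:age "4.503"}
                                    :content ["Mars"]}]}]}]})

;; Hypothetical transducer: children of every element whose tag matches.
(defn under-tag [tag]
  (comp (filter map?)                      ; skip text nodes
        (filter #(= tag (:tag %)))
        (mapcat :content)))

;; Hypothetical transducer: children of every element with attribute k = v.
(defn under-attr [k v]
  (comp (filter map?)
        (filter #(= v (get-in % [:attrs k])))
        (mapcat :content)))

;; Compose the steps left to right and run them over the root node.
(defn find-with [root xfs]
  (sequence (apply comp xfs) [root]))

;; A keyword path has to be turned into transducers before it can run.
(defn find-path [root path]
  (find-with root (map under-tag path)))

(find-with doc [(under-tag :universe)
                (under-tag :system)
                (under-tag :solar)
                (under-attr :inhabitable "true")])
;; => ("Earth")

(find-path doc [:universe :system :solar :planet])
;; => ("Earth" "Mars")
```

Because comp applies transducers in the order they are listed, the vector reads top-down like a path through the document.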
Since find-in does not need to create transducers the way find-all and find-first do, it is a bit more performant:
=> (time (dotimes [_ 250000] (xml/find-first universe [:universe :system :solar :planet])))
"Elapsed time: 507.325005 msecs"
vs.
=> (time (dotimes [_ 250000] (xml/find-in universe [(some-tag= :universe)
                                                    (some-tag= :system)
                                                    (some-tag= :solar)
                                                    (some-tag= :planet)])))
"Elapsed time: 467.535705 msecs"
Creating sub documents
Let's say we need to get several properties out of the δ-ori-aa1 component. We can do it as:
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :mass])
("24")
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
("16.5")
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :surface-gravity])
("3.37")
We can of course group [:mass :radius :surface-gravity] together and map over them to call xml/find-first universe with a prefix, but it would not change the fact that we would need to "get-into" :universe :system :delta-orionis :δ-ori-aa1 on every property lookup.
We can do better: navigate to :universe :system :delta-orionis :δ-ori-aa1 once and treat it as a document instead:
=> (def aa1 (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1]))
#'boot.user/aa1
=> (xml/find-first aa1 [:mass])
("24")
=> (xml/find-first aa1 [:radius])
("16.5")
=> (xml/find-first aa1 [:surface-gravity])
("3.37")
To create a sub document no special syntax is needed: just search up to the new root element.
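Sub documents work because a search step takes a sequence of nodes and returns a sequence of nodes, so the result of one search can be fed to the next. A plain Clojure sketch of that idea follows; children-of is a hypothetical helper standing in for a single path step, and star is a cut-down copy of the δ-ori-aa1 data. It is not xml-in's code:

```clojure
(def star
  {:tag :δ-ori-aa1 :attrs {}
   :content [{:tag :mass :attrs {} :content ["24"]}
             {:tag :radius :attrs {} :content ["16.5"]}
             {:tag :surface-gravity :attrs {} :content ["3.37"]}]})

;; Hypothetical single step: nodes in, children of matching nodes out.
(defn children-of [nodes tag]
  (->> nodes
       (filter map?)
       (filter #(= tag (:tag %)))
       (mapcat :content)))

;; Walk down to the new root once ...
(def aa1 (children-of [star] :δ-ori-aa1))

;; ... then look up several properties against it.
(def props
  (into {}
        (map (fn [k] [k (first (children-of aa1 k))]))
        [:mass :radius :surface-gravity]))
;; props => {:mass "24", :radius "16.5", :surface-gravity "3.37"}
```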
In cases where it is applicable, using a sub document is also a bit faster:
=> (time (dotimes [_ 100000]
           [(xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :mass])
            (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
            (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :surface-gravity])]))
"Elapsed time: 973.376399 msecs"
vs.
=> (time (dotimes [_ 100000]
           (let [aa1 (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1])]
             [(xml/find-first aa1 [:mass])
              (xml/find-first aa1 [:radius])
              (xml/find-first aa1 [:surface-gravity])])))
"Elapsed time: 760.332762 msecs"
License
Copyright © 2019 tolitius
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.