Extensible Abstract Syntax Tree format.
xast is a specification for representing XML as an abstract syntax tree. It implements the unist spec.
This document may not be released.
See releases for released documents.
The latest released version is 1.0.0.
Contents
- Introduction
- Types
- Nodes (abstract)
- Nodes
- Other types
- Glossary
- List of utilities
- References
- Related
- Contribute
- Acknowledgments
- License
Introduction
This document defines a format for representing XML as an abstract syntax tree. This specification is written in a Web IDL-like grammar. Development started in January 2020.
Where this specification fits
xast extends unist, a format for syntax trees, to benefit from its ecosystem of utilities.
xast relates to JavaScript in that it has an ecosystem of utilities for working with compliant syntax trees in JavaScript. However, xast is not limited to JavaScript and can be used in other programming languages.
xast relates to the unified project in that xast syntax trees are used throughout its ecosystem.
Scope
xast represents XML syntax, not semantics: there are no namespaces or local names; only qualified names.
xast supports a sensible subset of XML by omitting the ostensibly bad DTD. XML processors are not guaranteed to process DTDs, making them unsafe.
xast represents expanded entities and therefore does not deal with entities or character references.
It is suggested that utilities around xast, that parse or serialize, do not support parameter-entity references or entity references other than the predefined entities (&lt; for < U+003C LESS THAN; &gt; for > U+003E GREATER THAN; &amp; for & U+0026 AMPERSAND; &apos; for ' U+0027 APOSTROPHE; &quot; for " U+0022 QUOTATION MARK).
This prevents billion laughs attacks.
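A minimal sketch of that suggestion, assuming a hypothetical helper named decodePredefinedEntities (it is not part of any xast utility): it expands only the five predefined entities and rejects every other named entity reference, so user-defined entities are never expanded. Numeric character references are left untouched here for brevity.

```typescript
// The five predefined XML entities and the characters they stand for.
const predefinedEntities: {[name: string]: string} = {
  lt: '<',
  gt: '>',
  amp: '&',
  apos: "'",
  quot: '"'
}

// Hypothetical helper: expands predefined entity references in one pass and
// throws on any other named reference instead of trying to resolve it.
function decodePredefinedEntities(value: string): string {
  return value.replace(
    /&([A-Za-z_:][A-Za-z0-9_.:-]*);/g,
    (_match: string, name: string): string => {
      // Own-property check so names such as `constructor` are not accepted.
      const replacement = Object.prototype.hasOwnProperty.call(predefinedEntities, name)
        ? predefinedEntities[name]
        : undefined

      if (typeof replacement !== 'string') {
        throw new Error('Unsupported entity reference: &' + name + ';')
      }

      return replacement
    }
  )
}

const decoded = decodePredefinedEntities('a &lt; b &amp;&amp; c &gt; d')
```

Because the replacement text is not scanned again, an input such as &amp;lt; decodes to the literal text &lt; rather than to <.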
Declarations
Declarations ([XML]) other than doctype have no representation in xast:
<!ELEMENT %name.para; %content.para;>
<!ATTLIST poem xml:space (default|preserve) 'preserve'>
<!ENTITY % ISOLat2 SYSTEM "http://www.xml.com/iso/isolat2-xml.entities">
<!ENTITY Pub-Status "This is a pre-release of the specification.">
<![%draft;[<!ELEMENT book (comments*, title, body, supplements?)>]]>
<![%final;[<!ELEMENT book (title, body, supplements?)>]]>
Internal subset
Internal document type declarations have no representation in xast:
<!DOCTYPE greeting [
<!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>
Types
If you are using TypeScript, you can use the xast types by installing them with npm:
npm install @types/xast
Nodes (abstract)
Literal
interface Literal <: UnistLiteral {
value: string
}
Literal (UnistLiteral) represents a node in xast containing a value.
Parent
interface Parent <: UnistParent {
children: [Cdata | Comment | Doctype | Element | Instruction | Text]
}
Parent (UnistParent) represents a node in xast containing other nodes (said to be children).
Its content is limited to only other xast content.
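As an illustration, the two abstract interfaces could be mirrored in TypeScript roughly as follows. The Xast-prefixed names and the isLiteral and isParent guards are local, hypothetical names chosen for this sketch; they are not the definitions shipped in @types/xast.

```typescript
// Local stand-ins for the abstract interfaces. The Xast prefix avoids
// clashing with DOM globals such as Node.
interface XastNode {
  type: string
}

// A node that carries a string value.
interface XastLiteral extends XastNode {
  value: string
}

// A node that contains other nodes.
interface XastParent extends XastNode {
  children: XastNode[]
}

// Hypothetical guard: true when the node has a string `value` field.
function isLiteral(node: XastNode): node is XastLiteral {
  return typeof (node as Partial<XastLiteral>).value === 'string'
}

// Hypothetical guard: true when the node has a `children` array.
function isParent(node: XastNode): node is XastParent {
  return Array.isArray((node as Partial<XastParent>).children)
}

const comment: XastLiteral = {type: 'comment', value: 'Charlie'}
const root: XastParent = {type: 'root', children: [comment]}
```

The guards check the shape of a node rather than its type field, so they keep working for node types defined by other unist-based formats.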
Nodes
Cdata
interface Cdata <: Literal {
type: 'cdata'
}
Cdata (Literal) represents a CDATA section ([XML]).
For example, the following XML:
<![CDATA[<greeting>Hello, world!</greeting>]]>
Yields:
{
type: 'cdata',
value: '<greeting>Hello, world!</greeting>'
}
Comment
interface Comment <: Literal {
type: 'comment'
}
Comment (Literal) represents a comment ([XML]).
For example, the following XML:
<!--Charlie-->
Yields:
{type: 'comment', value: 'Charlie'}
Doctype
interface Doctype <: Node {
type: 'doctype'
name: string
public: string?
system: string?
}
Doctype (Node) represents a doctype ([XML]).
A name field must be present.
A public field should be present. If present, it must be set to a string, and represents the document’s public identifier.
A system field should be present. If present, it must be set to a string, and represents the document’s system identifier.
For example, the following XML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
Yields:
{
type: 'doctype',
name: 'HTML',
public: '-//W3C//DTD HTML 4.0 Transitional//EN',
system: 'http://www.w3.org/TR/REC-html40/loose.dtd'
}
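Going the other way, a doctype node could be turned back into XML along these lines. This is a sketch under assumptions: XastDoctype, quote, and doctypeToXml are hypothetical local names, and the code does not validate that the identifiers are legal XML literals.

```typescript
// Local mirror of the Doctype interface; `public` and `system` may be absent or null.
interface XastDoctype {
  type: 'doctype'
  name: string
  public?: string | null
  system?: string | null
}

// Wraps a literal in double quotes, or in single quotes if it contains a double quote.
function quote(value: string): string {
  return value.indexOf('"') === -1 ? '"' + value + '"' : "'" + value + "'"
}

// Hypothetical serializer for a doctype node.
function doctypeToXml(node: XastDoctype): string {
  let result = '<!DOCTYPE ' + node.name

  if (typeof node.public === 'string') {
    result += ' PUBLIC ' + quote(node.public)
  } else if (typeof node.system === 'string') {
    // A system identifier without a public identifier needs the SYSTEM keyword.
    result += ' SYSTEM'
  }

  if (typeof node.system === 'string') {
    result += ' ' + quote(node.system)
  }

  return result + '>'
}

const htmlDoctype: XastDoctype = {
  type: 'doctype',
  name: 'HTML',
  public: '-//W3C//DTD HTML 4.0 Transitional//EN',
  system: 'http://www.w3.org/TR/REC-html40/loose.dtd'
}

const serializedDoctype = doctypeToXml(htmlDoctype)
```

For the node from the example above, the result is the original declaration.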
Element
interface Element <: Parent {
type: 'element'
name: string
attributes: Attributes?
children: [Cdata | Comment | Element | Instruction | Text]
}
Element (Parent) represents an element ([XML]).
The name field must be present. It represents the element’s name ([XML]), specifically its qualified name ([XML-NAMES]).
The children field should be present.
The attributes field should be present. It represents information associated with the element. The value of the attributes field implements the Attributes interface.
For example, the following XML:
<package unique-identifier="id" xmlns="http://www.idpf.org/2007/opf" />
Yields:
{
type: 'element',
name: 'package',
attributes: {
'unique-identifier': 'id',
xmlns: 'http://www.idpf.org/2007/opf'
},
children: []
}
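Because the name field holds the qualified name as written and xast has no separate namespace or local name fields, a consumer that needs the prefix has to split the name itself. A possible sketch, where splitQualifiedName is a hypothetical helper and resolving the prefix to a namespace URI is deliberately left out:

```typescript
// Result of splitting a qualified name at its first colon.
interface SplitName {
  prefix: string | null
  local: string
}

// Hypothetical helper: separates an optional prefix from the local part.
function splitQualifiedName(name: string): SplitName {
  const index = name.indexOf(':')

  if (index === -1) {
    return {prefix: null, local: name}
  }

  return {prefix: name.slice(0, index), local: name.slice(index + 1)}
}

const prefixed = splitQualifiedName('dc:language')
const unprefixed = splitQualifiedName('package')
```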
Instruction
interface Instruction <: Literal {
type: 'instruction'
name: string
}
Instruction (Literal) represents a processing instruction ([XML]).
A name field must be present.
For example, the following XML:
<?xml version="1.0" encoding="UTF-8"?>
Yields:
{
type: 'instruction',
name: 'xml',
value: 'version="1.0" encoding="UTF-8"'
}
Root
interface Root <: Parent {
type: 'root'
}
Root (Parent) represents a document fragment or a whole document.
Root should be used as the root of a tree and must not be used as a child.
XML specifies that documents should have exactly one element child, therefore a root should have exactly one element child when representing a whole document.
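The single-element rule can be checked mechanically. The following sketch assumes a hypothetical hasSingleElementChild helper and loosely typed local interfaces; it only counts element children and does not check anything else about the document.

```typescript
// Loose local node shape: any fields besides `type` are accepted.
interface XastChildLike {
  type: string
  [field: string]: unknown
}

interface XastRootLike {
  type: 'root'
  children: XastChildLike[]
}

// Hypothetical check: true when the root has exactly one element child,
// as expected of a root that represents a whole document.
function hasSingleElementChild(root: XastRootLike): boolean {
  const elements = root.children.filter((child) => child.type === 'element')
  return elements.length === 1
}

const wholeDocument: XastRootLike = {
  type: 'root',
  children: [
    {type: 'instruction', name: 'xml', value: 'version="1.0"'},
    {type: 'element', name: 'greeting', attributes: {}, children: []}
  ]
}

const fragment: XastRootLike = {
  type: 'root',
  children: [
    {type: 'element', name: 'a', attributes: {}, children: []},
    {type: 'element', name: 'b', attributes: {}, children: []}
  ]
}
```

A root for which the check fails may still be a valid xast tree, since a root can also represent a fragment.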
Text
interface Text <: Literal {
type: 'text'
}
Text (Literal) represents character data ([XML]).
For example, the following XML:
<dc:language>en</dc:language>
Yields:
{
type: 'element',
name: 'dc:language',
attributes: {},
children: [{type: 'text', value: 'en'}]
}
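To read the character data back out of such a tree, one option is to concatenate the values of all text descendants. The sketch below assumes a hypothetical textContent helper that also includes cdata values and skips comments and instructions; that choice is an assumption of this example, not something the specification requires.

```typescript
// Loose local node shape covering both literals and parents.
interface XastNodeLike {
  type: string
  value?: string
  children?: XastNodeLike[]
  [field: string]: unknown
}

// Hypothetical helper: concatenates text and cdata values in document order.
function textContent(node: XastNodeLike): string {
  if (node.type === 'text' || node.type === 'cdata') {
    return typeof node.value === 'string' ? node.value : ''
  }

  if (Array.isArray(node.children)) {
    return node.children.map((child) => textContent(child)).join('')
  }

  // Comments, instructions, and doctypes contribute no text.
  return ''
}

const language: XastNodeLike = {
  type: 'element',
  name: 'dc:language',
  attributes: {},
  children: [{type: 'text', value: 'en'}]
}

const languageText = textContent(language)
```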
Other types
Attributes
interface Attributes {}
Attributes represents information associated with an element.
Every field must be an AttributeName and every value an AttributeValue.
AttributeName
typedef string AttributeName
Attribute names are keys on Attributes objects and must reflect XML attribute names exactly.
AttributeValue
typedef string AttributeValue
Attribute values are values on Attributes objects and must reflect XML attribute values exactly as a string.
In JSON, the value null must be treated as if the attribute was not included. In JavaScript, both null and undefined must be similarly ignored.
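A serializer can apply this rule by skipping such values before writing anything. This is a sketch under assumptions: XastAttributesLike, escapeAttributeValue, and serializeAttributes are hypothetical local names, and values are always written in double quotes.

```typescript
// Local mirror of Attributes that also admits the ignorable values.
type XastAttributesLike = {[name: string]: string | null | undefined}

// Escapes the characters that cannot appear raw in a double-quoted attribute value.
function escapeAttributeValue(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;')
}

// Hypothetical helper: serializes attributes, ignoring null and undefined values.
function serializeAttributes(attributes: XastAttributesLike): string {
  const parts: string[] = []

  for (const name of Object.keys(attributes)) {
    const value = attributes[name]

    // Treated as if the attribute was not included.
    if (value === null || value === undefined) continue

    parts.push(name + '="' + escapeAttributeValue(value) + '"')
  }

  return parts.join(' ')
}

const serializedAttributes = serializeAttributes({
  'unique-identifier': 'id',
  xmlns: 'http://www.idpf.org/2007/opf',
  hidden: null,
  missing: undefined
})
```

An empty string is a real value and is kept, so only null and undefined cause an attribute to be dropped.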
Glossary
See the unist glossary.
List of utilities
See the unist list of utilities for more utilities.
- xastscript: create trees
- xast-util-feed: build feeds (RSS, Atom)
- xast-util-from-xml: parse from XML
- xast-util-sitemap: build sitemap.xml
- xast-util-to-string: get the text value
- xast-util-to-xml: serialize to XML
- hast-util-to-xast: transform to xast
References
- JSON: The JavaScript Object Notation (JSON) Data Interchange Format. T. Bray. IETF.
- JavaScript: ECMAScript Language Specification. Ecma International.
- unist: Universal Syntax Tree. T. Wormer; et al.
- XML: Extensible Markup Language (XML) 1.0 (Fifth Edition). T. Bray; et al. W3C.
- XML-NAMES: Namespaces in XML 1.0 (Third Edition). T. Bray; et al. W3C.
- Web IDL: Web IDL. C. McCormack. W3C.
Related
- hast — Hypertext Abstract Syntax Tree format
- mdast — Markdown Abstract Syntax Tree format
- nlcst — Natural Language Concrete Syntax Tree format
Contribute
See contributing.md in syntax-tree/.github for ways to get started.
See support.md for ways to get help.
Ideas for new utilities and tools can be posted in syntax-tree/ideas.
A curated list of awesome syntax-tree, unist, hast, mdast, nlcst, and xast resources can be found in awesome syntax-tree.
This project has a code of conduct. By interacting with this repository, organization, or community you agree to abide by its terms.
Acknowledgments
The initial release of this project was authored by @wooorm.