Awesome
Xsoup
XPath selector based on Jsoup.
Get started:
@Test
public void testSelect() {
String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
"<table><tr><td>a</td><td>b</td></tr></table></html>";
Document document = Jsoup.parse(html);
String result = Xsoup.compile("//a/@href").evaluate(document).get();
Assert.assertEquals("https://github.com", result);
List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
Assert.assertEquals("a", list.get(0));
Assert.assertEquals("b", list.get(1));
}
Performance:
Xsoup use Jsoup as HTML parser.
Compare with another most used XPath selector for HTML - HtmlCleaner
, Xsoup is much faster:
Normal HTML, size 44KB
XPath: "//a"
Run for 2000 times
Environment:Mac Air MD231CH/A
CPU: 1.8Ghz Intel Core i5
<table>
<tr>
<td width="100">Operation</td>
<td width="100">Xsoup</td>
<td>HtmlCleaner</td>
</tr>
<tr>
<td>parse</td>
<td>3,207(ms)</td>
<td>7,999(ms)</td>
</tr>
<tr>
<td>select</td>
<td>95(ms)</td>
<td>380(ms)</td>
</tr>
</table>
Syntax supported:
XPath1.0:
<table> <tr> <td width="100">Name</td> <td width="100">Expression</td> <td>Support</td> </tr> <tr> <td>nodename</td> <td>nodename</td> <td>yes</td> </tr> <tr> <td>immediate parent</td> <td>/</td> <td>yes</td> </tr> <tr> <td>parent</td> <td>//</td> <td>yes</td> </tr> <tr> <td>attribute</td> <td>[@key=value]</td> <td>yes</td> </tr> <tr> <td>nth child</td> <td>tag[n]</td> <td>yes</td> </tr> <tr> <td>attribute</td> <td>/@key</td> <td>yes</td> </tr> <tr> <td>wildcard in tagname</td> <td>/*</td> <td>yes</td> </tr> <tr> <td>wildcard in attribute</td> <td>/[@*]</td> <td>yes</td> </tr> <tr> <td>function</td> <td>function()</td> <td>part</td> </tr> <tr> <td>or</td> <td>a | b</td> <td>yes since 0.2.0</td> </tr> <tr> <td>parent in path</td> <td>. or ..</td> <td>no</td> </tr> <tr> <td>predicates</td> <td>price>35</td> <td>no</td> </tr> <tr> <td>predicates logic</td> <td>@class=a or @class=b</td> <td>yes since 0.2.0</td> </tr> </table>Function supported:
In Xsoup, we use some function (maybe not in Standard XPath 1.0):
<table> <tr> <td width="100">Expression</td> <td width="100">Description</td> <td>Standard XPath</td> </tr> <tr> <td width="100">text(n)</td> <td width="100">nth text content of element(0 for all)</td> <td>text() only</td> </tr> <tr> <td width="100">allText()</td> <td width="100">text including children</td> <td>not support</td> </tr> </tr> <tr> <td width="100">tidyText()</td> <td width="100">text including children, well formatted</td> <td>not support</td> </tr> <tr> <td width="100">html()</td> <td width="100">innerhtml of element</td> <td>not support</td> </tr> <tr> <td width="100">outerHtml()</td> <td width="100">outerHtml of element</td> <td>not support</td> </tr> <tr> <td width="100">regex(@attr,expr,group)</td> <td width="100">use regex to extract content</td> <td>not support</td> </tr> </table>Extended syntax supported:
These XPath syntax are extended only in Xsoup (for convenience in extracting HTML, refer to Jsoup CSS Selector):
<table> <tr> <td width="100">Name</td> <td width="100">Expression</td> <td>Support</td> </tr> <tr> <td>attribute value not equals</td> <td>[@key!=value]</td> <td>yes</td> </tr> <tr> <td>attribute value start with</td> <td>[@key~=value]</td> <td>yes</td> </tr> <tr> <td>attribute value end with</td> <td>[@key$=value]</td> <td>yes</td> </tr> <tr> <td>attribute value contains</td> <td>[@key*=value]</td> <td>yes</td> </tr> <tr> <td>attribute value match regex</td> <td>[@key~=value]</td> <td>yes</td> </tr> </table>License
MIT License, see file LICENSE