hq

A small utility to parse and grep HTML files. It uses CSS selectors or XPath selectors to extract HTML elements.

Usage

hq - command line HTML elements finder; version 1.0.0

Usage: hq [-hptV] [-a=<attribute>] [-f=<FILE>] [-o=<FILE>] [-s=<POLICY>] [-x=<XPATH>] <selector>
          [COMMAND]
      <selector>            The CSS selector
  -a, --attribute=<attribute>
                            Return only this attribute from the selected HTML elements
  -f, --file=<FILE>         The HTML input file. If not supplied it will default to stdin
  -h, --help                Show this help message and exit.
  -o, --output=<FILE>       The output file. If not supplied it will default to stdout
  -p, --pretty              Force pretty printing the output
  -r, --remove=<SELECTOR>   Remove nodes matching given selector
  -s, --sanitize=<POLICY>   Sanitizes the html input according to the given policy
  -t, --text                Display only the inner text of the selected HTML top element
  -V, --version             Print version information and exit.
  -x, --xpath=<XPATH>       Supply an XPath selector instead of CSS
Commands:
  generate-completion  Generate bash/zsh completion script for hq.
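
As a rough sketch of how these options combine, the script below reads from a local file instead of stdin. The file name sample.html, its contents and the output file links.txt are made up for illustration. Only flags listed in the help text above are used, the commands run only when hq is on the PATH, and the output is not shown because it may vary between versions.

```shell
#!/bin/sh
# Hypothetical sample page, used only for this illustration.
cat > sample.html <<'EOF'
<html><body>
  <h1 id="title">Hello</h1>
  <a class="nav" href="/about/">About</a>
  <p class="ad">Sponsored</p>
</body></html>
EOF

if command -v hq >/dev/null 2>&1; then
  # Inner text of the h1, read from a file instead of stdin
  hq -f sample.html -t 'h1'
  # Only the href attribute of the matching links, written to links.txt
  hq -f sample.html -a href -o links.txt 'a.nav'
  # The body element, with nodes matching .ad removed
  hq -f sample.html -r '.ad' 'body'
else
  echo "hq not found on PATH; skipping the hq calls" >&2
fi
```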

Installation

Homebrew

> brew tap ludovicianul/tap
> brew install ludovicianul/tap/hq

Manual

hq is compiled to native code using GraalVM. Check the release page for the native binaries (Linux, macOS) and the uberjar.

After download, you can make hq globally available:

sudo cp hq-macos /usr/local/bin/hq

The uberjar can be run using java -jar hq. Requires Java 11+.

Autocomplete

Run the following commands to get autocomplete:

hq generate-completion > hq_autocomplete

source hq_autocomplete
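
Sourcing the script this way only affects the current shell session. To keep completion across sessions, one option is to write the script to a fixed location and source it from the shell startup file. The sketch below assumes a POSIX-compatible shell. The function name install_hq_completion and the default path ~/.hq_autocomplete are hypothetical choices, not something hq provides.

```shell
# Hypothetical helper: writes the completion script and registers it in a startup file.
# $1 = startup file (for example ~/.bashrc or ~/.zshrc)
# $2 = where to store the completion script (optional)
install_hq_completion() {
  rc_file="$1"
  completion_file="${2:-$HOME/.hq_autocomplete}"

  # Regenerate the completion script only if hq is available
  if command -v hq >/dev/null 2>&1; then
    hq generate-completion > "$completion_file"
  fi

  # Line to add to the startup file; it is a no-op if the script is missing
  line=$(printf '[ -f "%s" ] && . "%s"' "$completion_file" "$completion_file")

  # Append the line only if it is not already present, so reruns are harmless
  grep -qxF -- "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

For example, run install_hq_completion ~/.bashrc for bash or install_hq_completion ~/.zshrc for zsh. Running it again does not add a second line.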

HTML Sanitizing

hq can sanitize HTML according to a policy passed with -s. Supported policies are NONE, BASIC, SIMPLE_TEXT, BASIC_WITH_IMAGES and RELAXED.

This is how sanitization works:

| Policy | Details |
|---|---|
| `NONE` | Allows only text nodes: all HTML will be stripped. |
| `BASIC` | Allows a fuller range of text nodes: `a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul`, and appropriate attributes. Does not allow images. |
| `SIMPLE_TEXT` | Allows only simple text formatting: `b, em, i, strong, u`. All other HTML (tags and attributes) will be removed. |
| `BASIC_WITH_IMAGES` | Allows the same text tags as `BASIC`, and also allows `img` tags, with appropriate attributes, with `src` pointing to `http` or `https`. |
| `RELAXED` | Allows a full range of text and structural body HTML: `a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul`. |
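
To see how the policies differ in practice, one approach is to run the same input through each of them and compare the results. The sketch below assumes hq is on the PATH. The file unsafe.html and its contents are invented for illustration, and no output is shown because it depends on the installed version.

```shell
#!/bin/sh
# Hypothetical input mixing formatting, an image, structure and a script tag.
cat > unsafe.html <<'EOF'
<html><body>
<div><p>Hello <b>world</b> <img src="https://example.com/a.png"> <script>alert(1)</script></p></div>
</body></html>
EOF

if command -v hq >/dev/null 2>&1; then
  # Run the same selector under every documented policy
  for policy in NONE SIMPLE_TEXT BASIC BASIC_WITH_IMAGES RELAXED; do
    echo "== $policy =="
    hq -f unsafe.html -s="$policy" 'body'
  done
else
  echo "hq not found on PATH; skipping the hq calls" >&2
fi
```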

Examples

Get the text of the paragraph matched by #main > p:nth-child(6):

➜ curl -s https://www.w3schools.com/cssref/css_selectors.php | hq "#main > p:nth-child(6)" -t

In CSS, selectors are patterns used to select the element(s) you want to style.

Get the text inside an article:

➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq '.post' -t

Make sure you know which Unicode version is supported by your programming language version 16 Jul 2021 While enhancing CATS I recently added a feature to send requests that include 
single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the \uD83E\uDD76 string. The test case is simple: inject emojis within 
strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, this is why the behaviour is 
configurable in CATS, but not the focus of this article). I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters. 
A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means: p{C} - match Unicode invisible Control 
Chars (\u000D - carriage return for example) ...
...

Sanitize the HTML according to the BASIC policy and pretty-print the result:

➜ curl -s https://ludovicianul.github.io/2021/07/16/unicode_language_version/ | hq html -s=BASIC -p

<html>
    <head></head>
    <body>
        <a href="https://ludovicianul.github.io/" rel="nofollow"> m's blog </a>
        <p>practical thoughts about software engineering</p>
        <a href="https://ludovicianul.github.io/" rel="nofollow">Home</a>
        <a rel="nofollow">About</a>
        <a href="https://github.com/ludovicianul" rel="nofollow">GitHub</a>
        <p>© 2021. All rights reserved.</p>
        Make sure you know which Unicode version is supported by your programming language version
        <span>16 Jul 2021</span>
        <p>
...
    </body>
</html>

Get all href attributes from a given page:

➜ curl -s https://ludovicianul.github.io | hq "*" -a "href"
http://gmpg.org/xfn/11
https://ludovicianul.github.io/public/css/poole.css
https://ludovicianul.github.io/public/css/syntax.css
https://ludovicianul.github.io/public/css/hyde.css
https://fonts.googleapis.com/css?family=PT+Sans:400,400italic,700|Abril+Fatface
https://ludovicianul.github.io/public/apple-touch-icon-144-precomposed.png
https://ludovicianul.github.io/public/favicon.ico
/atom.xml
https://ludovicianul.github.io/
https://ludovicianul.github.io/
/about/
...
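
The list above mixes relative paths with absolute URLs and contains repeats. A minimal sketch of a post-processing filter, assuming only absolute http and https links are of interest. The function name unique_absolute_links is hypothetical and is not part of hq; it relies only on standard grep and sort.

```shell
# Keep only absolute http(s) URLs read from stdin and drop duplicates.
# LC_ALL=C makes the sort order independent of the current locale.
unique_absolute_links() {
  grep -E '^https?://' | LC_ALL=C sort -u
}
```

It can be appended to the pipeline above: curl -s https://ludovicianul.github.io | hq "*" -a "href" | unique_absolute_links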