Markup.ml

Markup.ml is a pair of parsers implementing the HTML5 and XML specifications, including error recovery. Usage is simple, because each parser is a function from byte streams to parsing signal streams:

Usage example
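As a rough illustration of that shape, here is a minimal sketch of a parse-and-reserialize pipeline. It assumes the markup package is installed; the helper name tidy is made up for this example and is not part of the library.

```ocaml
(* Sketch only: assumes the markup package is available. *)
open Markup

(* Hypothetical helper: parse an HTML string with error recovery,
   then serialize the resulting signal stream back to a string. *)
let tidy (html : string) : string =
  string html        (* byte stream over the string *)
  |> parse_html      (* start the HTML5 parser *)
  |> signals         (* stream of parsing signals *)
  |> pretty_print    (* adjust text signals for indentation *)
  |> write_html      (* signal stream back to a byte stream *)
  |> to_string

let () = print_string (tidy "<p><em>Markup.ml<p>rocks!")
```

Each stage is an ordinary function, so stages can be removed or replaced without touching the others.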

In addition to being error-correcting, the parsers are:

The parsers detect character encodings automatically, and emit everything in UTF-8. The HTML parser understands SVG and MathML, in addition to HTML5.
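One way to observe the SVG support is to look at the namespace attached to each element name. The sketch below collects the (namespace, local name) pair of every start tag; element_names is a hypothetical helper, and the comparison against Markup.Ns.svg reflects my understanding of how the parser labels foreign elements.

```ocaml
(* Sketch only: assumes the markup package is available. *)
open Markup

(* Hypothetical helper: list the (namespace URI, local name) pair of
   every start tag the HTML parser emits for the given input. *)
let element_names (html : string) : (string * string) list =
  string html
  |> parse_html
  |> signals
  |> filter_map (function
       | `Start_element (name, _attrs) -> Some name
       | _ -> None)
  |> to_list

let () =
  element_names "<p><svg><circle r='1'/></svg></p>"
  |> List.iter (fun (ns, local) -> Printf.printf "%s %s\n" ns local)
```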

Here is a breakdown showing the signal stream and errors emitted during the parsing and pretty-printing of bad_html:

string bad_html         "<body><p><em>Markup.ml<p>rocks!"

|> parse_html           `Start_element "body"
|> signals              `Start_element "p"
                        `Start_element "em"
                        `Text ["Markup.ml"]
                        ~report (1, 10) (`Unmatched_start_tag "em")
                        `End_element                   (* </em>: recovery *)
                        `End_element                   (* </p>: not an error *)
                        `Start_element "p"
                        `Start_element "em"            (* recovery *)
                        `Text ["rocks!"]
                        `End_element                   (* </em> *)
                        `End_element                   (* </p> *)
                        `End_element                   (* </body> *)

|> pretty_print         (* adjusts the `Text signals *)

|> write_html
|> to_channel stdout;;  "...shown above..."            (* valid HTML *)
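To see such a breakdown for yourself, a sketch along these lines may help. It collects the signals and the reported errors as strings; trace is a hypothetical helper, and the exact signals printed may include more context than the simplified listing above.

```ocaml
(* Sketch only: assumes the markup package is available. *)
open Markup

let bad_html = "<body><p><em>Markup.ml<p>rocks!"

(* Hypothetical helper: return the signals and the error messages
   produced while parsing, both rendered as strings. *)
let trace (html : string) : string list * string list =
  let errors = ref [] in
  let report location error =
    errors := Error.to_string ~location error :: !errors
  in
  let signal_strings =
    string html
    |> parse_html ~report
    |> signals
    |> map signal_to_string
    |> to_list
  in
  (signal_strings, List.rev !errors)

let () =
  let signal_strings, errors = trace bad_html in
  List.iter print_endline signal_strings;   (* signals on stdout *)
  List.iter prerr_endline errors            (* errors on stderr *)
```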

The parsers are tested thoroughly.

For a higher-level parser, see Lambda Soup, which is based on Markup.ml, but can search documents using CSS selectors, and perform various manipulations.


Overview and basic usage

The interface is centered around four functions between byte streams and signal streams: parse_html, write_html, parse_xml, and write_xml. These have several optional arguments for fine-tuning their behavior. The rest of the functions either input or output byte streams, or transform signal streams in some interesting way.
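As an illustration of transforming a signal stream between a parser and a writer, the following sketch upper-cases all text in an XML document. The function name shout is invented for this example; the stream functions used are map, parse_xml, write_xml, and to_string.

```ocaml
(* Sketch only: assumes the markup package is available. *)
open Markup

(* Hypothetical helper: parse XML, upper-case every text signal, and
   serialize the result. Non-text signals pass through unchanged. *)
let shout (xml : string) : string =
  string xml
  |> parse_xml
  |> signals
  |> map (function
       | `Text ss -> `Text (List.map String.uppercase_ascii ss)
       | other -> other)
  |> write_xml
  |> to_string

let () = print_endline (shout "<a>hi</a>")
```

Because the transformation sits in the middle of the pipeline, it runs lazily as the writer pulls signals from the parser.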

Here is an example with an optional argument:

(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
open Markup

let report =
  let count = ref 0 in
  fun location error ->
    error |> Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

let () =
  (* Exit, raised by report, propagates out of drain. Catch it so the
     program ends normally after the 10th error. *)
  try file "some.xml" |> fst |> parse_xml ~report |> signals |> drain
  with Exit -> ()

Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt

This program requests a Google search, then does a streaming scrape of result titles. It exits when it finds a GitHub link, without reading more input. Only one h3 element is converted into an in-memory tree at a time.

let () =
  Lwt_main.run begin
    (* Send request. Assume success. *)
    let url = "https://www.google.com/search?q=markup.ml" in
    let%lwt _, body = Cohttp_lwt_unix.Client.get (Uri.of_string url) in

    (* Adapt response to a Markup.ml stream. *)
    let body = body |> Cohttp_lwt.Body.to_stream |> Markup_lwt.lwt_stream in

    (* Set up a lazy stream of h3 elements. *)
    let h3s = Markup.(body
      |> strings_to_bytes |> parse_html |> signals
      |> elements (fun (_ns, name) _attrs -> name = "h3"))
    in

    (* Find the GitHub link. .iter and .load cause actual reading of data. *)
    h3s |> Markup_lwt.iter (fun h3 ->
      let%lwt h3 = Markup_lwt.load h3 in
      match Soup.(from_signals h3 $? "a[href*=github]") with
      | None -> Lwt.return_unit
      | Some anchor ->
        print_endline (String.concat "" (Soup.texts anchor));
        exit 0)
  end

This prints GitHub - aantron/markup.ml: Error-recovering streaming HTML5 and .... To run it, do:

ocamlfind opt -linkpkg -package lwt.ppx,cohttp-lwt-unix,markup.lwt,lambdasoup \
    scrape.ml && ./a.out

You can install all the necessary packages with:

opam install lwt_ssl
opam install cohttp-lwt-unix lambdasoup markup
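The same streaming pattern can be tried without a network connection or Lwt. The sketch below uses the synchronous interface and elements to extract the text of each h3, consuming one element at a time; h3_titles is a hypothetical helper, not part of the library.

```ocaml
(* Sketch only: assumes the markup package is available. *)
open Markup

(* Hypothetical helper: return the text content of each h3 element.
   Each substream produced by elements is consumed in full before
   the next one is requested. *)
let h3_titles (html : string) : string list =
  string html
  |> parse_html
  |> signals
  |> elements (fun (_ns, name) _attrs -> name = "h3")
  |> map (fun h3 -> h3 |> text |> to_string)
  |> to_list

let () =
  h3_titles "<h3>a</h3><p>x</p><h3>b</h3>"
  |> List.iter print_endline
```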

Installing

opam install markup

Documentation

The interface of Markup.ml is three modules: Markup, Markup_lwt, and Markup_lwt_unix. The last two are available only if you have Lwt installed (OPAM package lwt).

The documentation includes a summary of the conformance status of Markup.ml.


Depending

Markup.ml uses semantic versioning, but is currently in 0.x.x. The minor version number will be incremented on breaking changes.


Contributing

Contributions are very much welcome. Please see CONTRIBUTING for instructions, suggestions, and an overview of the code. There is also a list of easy issues.


License

Markup.ml is distributed under the MIT license. The Markup.ml source distribution includes a copy of the HTML5 entity list, which is distributed under the W3C document license.