MMark

MMark (read “em-mark”) is a strict markdown processor for writers. “Strict” means that not every input is considered a valid markdown document; parse errors are possible and even desirable, because they let us spot markup issues without hunting for them in the rendered document. If a markdown document passes the MMark parser, it is likely to produce HTML output without quirks. This makes MMark a good choice for writers and bloggers.

MMark in its current state features:

There is also a blog post announcing the project:

https://markkarpov.com/post/announcing-mmark.html

Quick start: MMark vs GitHub-flavored markdown

It's easy to start using MMark if you're used to GitHub-flavored markdown. There are four main differences:

  1. URIs are not automatically recognized; you must enclose them in < and >.

  2. Block quotes require only one > and they continue as long as the inner content is indented.

    This is OK:

    > Here goes my block quote.
      And this is the second line of the quote.
    

    This produces two block quotes:

    > Here goes my block quote.
    > And this is another block quote!
    
  3. HTML blocks and inline HTML are not supported.

  4. See differences in inline parsing.

MMark and Common Mark

MMark mostly tries to follow the Common Mark specification as given here:

https://spec.commonmark.org/0.28/

However, because we reject inputs that do not make sense and also try to guard against common mistakes (like writing ##My header and having it rendered as a paragraph starting with hashes), MMark cannot follow the specification precisely. In particular, parsing of inlines differs considerably from Common Mark (see below).

Another difference between Common Mark and MMark is that the latter supports more (pun alert) common markdown extensions out-of-the-box. In particular, MMark supports:

One does not need to enable or tweak anything for these to work; they are built-in features.

Differences in inline parsing

Emphasis and strong emphasis are an especially hairy topic in the Common Mark specification. There are 17 ad-hoc rules defining the interaction between *- and _-based emphasis, and more than half of all Common Mark examples (about 300) test just this.

Not only is this hard to implement, it is hard for humans to understand too. For example, this input:

*(*foo*)*

results in the following HTML:

<p><em>(<em>foo</em>)</em></p>

(Note the nested emphasis.)

Could it produce something like this instead?

<p><em>(</em>foo<em>)</em></p>

Well, why not? Without remembering those 17 ad-hoc rules, there are going to be a lot of tricky cases where the user won't be able to tell how markdown will be parsed.

I decided to make parsing of emphasis, strong emphasis, and similar constructs like strikethrough, subscript, and superscript more symmetric and less ad-hoc. In 99% of practical cases it is identical to Common Mark, and normal markdown intuitions will work OK for the users.

Let's start by dividing all characters into four groups:

Next, let's assign levels to all groups but markup characters:

When markup characters or punctuation characters are escaped with backslash they become other characters.

We'll call a sequence of markup characters placed between a character of level L and a character of level R a left-flanking delimiter run if and only if:

level(L) < level(R)

These markup characters sort of hang on the left hand side of a word.

Similarly, we'll call a sequence of markup characters placed between a character of level L and a character of level R a right-flanking delimiter run if and only if:

level(L) > level(R)

These markup characters hang on the right hand side of a word.

Emphasis markup (and other similar things like strikethrough, which we won't mention explicitly anymore for brevity) can start only as a left-flanking delimiter run and end only as a right-flanking delimiter run.
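The rule can be sketched in a few lines of Haskell. This is only an illustration, not MMark's actual implementation: the names Group, level, and classify are made up for this example, and the levels 0 for space characters and 1 for punctuation characters are assumptions (this document only states that other characters have level 2).

```haskell
-- | Character groups that carry a level. Markup characters get no level,
-- so they are not represented here.
data Group = Space | Punctuation | Other
  deriving (Eq, Show)

-- | Assumed levels: only "other characters have level 2" is stated in
-- the text; 0 and 1 are assumptions consistent with its examples.
level :: Group -> Int
level Space       = 0
level Punctuation = 1
level Other       = 2

data Flanking = LeftFlanking | RightFlanking | Neither
  deriving (Eq, Show)

-- | Classify a delimiter run from the groups of the characters to its
-- left (L) and to its right (R).
classify :: Group -> Group -> Flanking
classify l r = case compare (level l) (level r) of
  LT -> LeftFlanking   -- level(L) < level(R): may open emphasis
  GT -> RightFlanking  -- level(L) > level(R): may close emphasis
  EQ -> Neither        -- equal levels: can neither open nor close
```

Under these assumptions, a run between a space and a letter (classify Space Other) is left-flanking, while a run between two spaces or between two letters is neither, so it cannot open or close emphasis.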

This produces a parse error:

*Something * is not right.
Something __is __ not right.

And this too:

__foo__bar

This means that intra-word emphasis is not supported.

The next example is OK because s is an other character and . is a punctuation character, so level('s') > level('.').

Here it *goes*.

In some rare cases backslash escaping can help get the right result:

Here goes *(something\)*.

We escaped the closing parenthesis ) so that it becomes an other character with level 2, which is greater than the level of the plain punctuation character . that follows the closing asterisk.
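The examples above can be checked mechanically with a small scanner. The following is a sketch under stated assumptions rather than MMark's real parser: all names (Tok, tokenize, flankings, and so on) are hypothetical, only *, _, ~ and ^ are treated as markup characters, space and punctuation are assumed to have levels 0 and 1, symbols are lumped in with punctuation, and the start and end of input are treated as level 0.

```haskell
import Data.Char (isPunctuation, isSpace, isSymbol)

data Flanking = LeftFlanking | RightFlanking | Neither
  deriving (Eq, Show)

-- | A run of markup characters, or an ordinary character reduced to its level.
data Tok = Run String | Plain Int
  deriving (Eq, Show)

-- | Assumed subset of markup characters.
isMarkup :: Char -> Bool
isMarkup c = c `elem` "*_~^"

-- | Assumed levels: space 0, punctuation (and symbols) 1, other 2.
levelOf :: Char -> Int
levelOf c
  | isSpace c                     = 0
  | isPunctuation c || isSymbol c = 1
  | otherwise                     = 2

tokenize :: String -> [Tok]
tokenize [] = []
tokenize ('\\' : c : rest)
  -- An escaped markup or punctuation character becomes an "other" character.
  | isMarkup c || levelOf c == 1 = Plain 2 : tokenize rest
tokenize s@(c : rest)
  | isMarkup c = let (run, rest') = span isMarkup s in Run run : tokenize rest'
  | otherwise  = Plain (levelOf c) : tokenize rest

-- | Flanking of every delimiter run, in order of appearance.
flankings :: String -> [(String, Flanking)]
flankings = go 0 . tokenize
  where
    go _ []             = []
    go _ (Plain n : ts) = go n ts
    go l (Run r : ts)   = (r, classify l (next ts)) : go l ts
    next (Plain n : _) = n
    next _             = 0  -- end of input counts as level 0 (assumption)
    classify l r = case compare l r of
      LT -> LeftFlanking
      GT -> RightFlanking
      EQ -> Neither
```

In GHCi, flankings "Here it *goes*." evaluates to [("*",LeftFlanking),("*",RightFlanking)], which is a valid opener followed by a valid closer. For "__foo__bar" the second run comes out as Neither, which matches the parse error described above. Dropping the backslash from "Here goes *(something\\)*." turns the closing run from RightFlanking into Neither.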

Other differences

Block-level parsing:

Inline-level parsing:

About MMark-specific extensions

Performance

I have compared speed and memory consumption of various Haskell markdown libraries by running them on an identical, big-enough markdown document and by rendering it as HTML:

Library            Parsing library       Execution time   Allocated     Max residency
cmark-0.5.6        Custom C code         323.4 μs         228,440       9,608
mmark-0.0.5.1      Megaparsec            7.027 ms         26,180,272    37,792
cheapskate-0.1.1   Custom Haskell code   10.76 ms         44,686,272    799,200
markdown-0.1.16 †  Attoparsec            14.13 ms         69,261,816    699,656
pandoc-2.0.5       Parsec                37.90 ms         141,868,840   1,471,080

Results are ordered from fastest to slowest.

† The markdown library is sloppy and parses markdown incorrectly. For example, it parses *My * text as an inline containing emphasis, while in reality both asterisks must form flanking delimiter runs to create emphasis, like so: *My* text. This lets markdown get away with a far simpler approach to parsing, at the price of not really being a valid markdown implementation.

Related packages

Contribution

Issues, bugs, and questions may be reported in the GitHub issue tracker for this project.

Pull requests are also welcome.

License

Copyright © 2017–present Mark Karpov

Distributed under the BSD 3-clause license.