Markovify

Markovify is a simple, extensible Markov chain generator. Right now, its primary use is building Markov models of large corpora of text and generating random sentences from them. In theory, however, it could be used for other applications.

Why Markovify?

Some reasons:

Installation

pip install markovify

Basic Usage

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 280 characters
for i in range(3):
    print(text_model.make_short_sentence(280))

Notes:

Advanced Usage

Specifying the model's state size

The state size is the number of preceding words on which the probability of the next word depends.

By default, markovify.Text uses a state size of 2, but you can instantiate a model with a different state size. For example:

text_model = markovify.Text(text, state_size=3)
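To make the idea of state size concrete, here is a minimal standard-library sketch of a transition table keyed by the preceding words. It is an illustration only, not Markovify's implementation; the `build_chain` helper and the sample sentence are hypothetical.

```python
from collections import defaultdict


def build_chain(words, state_size):
    """Count which word follows each run of `state_size` consecutive words."""
    chain = defaultdict(lambda: defaultdict(int))
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])  # the preceding words
        next_word = words[i + state_size]
        chain[state][next_word] += 1
    return chain


# Hypothetical toy corpus, already split into words.
words = "the cat sat on the mat and the cat ran".split()

chain_1 = build_chain(words, state_size=1)
chain_2 = build_chain(words, state_size=2)

print(dict(chain_1[("the",)]))        # followers of "the"
print(dict(chain_2[("the", "cat")]))  # followers of "the cat"
```

With a larger state size, each state is more specific, so generated text tends to follow the corpus more closely but has fewer possible continuations at each step.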

Combining models

With markovify.combine(...), you can combine two or more Markov chains. The function accepts two arguments:

For instance:

model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)

model_combo = markovify.combine([ model_a, model_b ], [ 1.5, 1 ])

This snippet combines model_a and model_b while placing 50% more weight on the connections from model_a.
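One way to picture weighted combination is as scaling each model's transition counts by its weight and summing them. The sketch below does this on plain dictionaries; `combine_counts` and the toy chains are hypothetical, and Markovify's own combine logic may differ in detail.

```python
from collections import defaultdict


def combine_counts(chains, weights):
    """Merge transition-count dicts, scaling each chain's counts by its weight."""
    if len(chains) != len(weights):
        raise ValueError("need exactly one weight per chain")
    combined = defaultdict(lambda: defaultdict(float))
    for chain, weight in zip(chains, weights):
        for state, followers in chain.items():
            for word, count in followers.items():
                combined[state][word] += count * weight
    return {state: dict(followers) for state, followers in combined.items()}


# Hypothetical toy chains: state (tuple of words) -> {next word: count}.
chain_a = {("the", "cat"): {"sat": 2}}
chain_b = {("the", "cat"): {"sat": 1, "ran": 2}, ("a", "dog"): {"barked": 1}}

combo = combine_counts([chain_a, chain_b], [1.5, 1])
print(combo[("the", "cat")])  # "sat": 2 * 1.5 + 1 * 1 = 4.0, "ran": 2 * 1 = 2.0
```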

Compiling a model

Once a model has been generated, it may also be compiled for improved text generation speed and reduced size.

text_model = markovify.Text(text)
text_model = text_model.compile()

Models may also be compiled in-place:

text_model = markovify.Text(text)
text_model.compile(inplace=True)

Currently, compiled models may not be combined with other models using markovify.combine(...). If you wish to combine models, do that first and then compile the result.
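As a rough intuition for why a compiled model can generate text faster, the sketch below precomputes cumulative weights once so that each random draw becomes a binary search. This is a common technique for weighted sampling; that Markovify's compile step works this way is an assumption, and `compile_followers` and `pick` are hypothetical names.

```python
import bisect
import random
from itertools import accumulate


def compile_followers(followers):
    """Turn {word: count} into parallel lists of words and running totals."""
    words = list(followers)
    cumulative = list(accumulate(followers.values()))
    return words, cumulative


def pick(compiled, rng=random):
    """Draw a word with probability proportional to its original count."""
    words, cumulative = compiled
    r = rng.random() * cumulative[-1]  # r is in [0, total)
    return words[bisect.bisect_right(cumulative, r)]


compiled = compile_followers({"sat": 3, "ran": 1})
print(compiled)        # (['sat', 'ran'], [3, 4])
print(pick(compiled))  # "sat" about three times as often as "ran"
```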

Working with messy texts

Starting with v0.7.2, markovify.Text accepts two additional parameters: well_formed and reject_reg.

Extending markovify.Text

The markovify.Text class is highly extensible; most methods can be overridden. For example, the following POSifiedText class uses NLTK's part-of-speech tagger to generate a Markov model that obeys sentence structure better than a naive model. (It works; however, be warned: pos_tag is very slow.)

import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

Alternatively, you can use spaCy, which is much faster:

import markovify
import spacy

nlp = spacy.load("en_core_web_sm")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

The most useful markovify.Text methods you can override are:

For details on what they do, see the (annotated) source code.

Exporting

It can take a while to generate a Markov model from a large corpus. Sometimes you'll want to generate once and reuse it later. To export a generated markovify.Text model, use my_text_model.to_json(). For example:

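import markovify

# Read the corpus with a context manager so the file is closed promptly.
with open("sherlock.txt") as f:
    corpus = f.read()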

text_model = markovify.Text(corpus, state_size=3)
model_json = text_model.to_json()
# In theory, here you'd save the JSON to disk, and then read it back later.

reconstituted_model = markovify.Text.from_json(model_json)
reconstituted_model.make_short_sentence(280)

>>> 'It cost me something in foolscap, and I had no idea that he was a man of evil reputation among women.'
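The save-and-reload step mentioned in the comment above could look like the following sketch. It treats the output of to_json() as an opaque string; the helper names, the file name, and the placeholder JSON are hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def save_model_json(model_json, path):
    """Write the JSON string returned by to_json() to disk."""
    Path(path).write_text(model_json, encoding="utf-8")


def load_model_json(path):
    """Read the JSON string back, ready to pass to markovify.Text.from_json()."""
    return Path(path).read_text(encoding="utf-8")


# Placeholder standing in for text_model.to_json(); not the real export format.
model_json = '{"placeholder": true}'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sherlock_model.json"
    save_model_json(model_json, path)
    restored_json = load_model_json(path)

print(restored_json == model_json)  # True
```

In real use, you would pass the restored string to markovify.Text.from_json(), as in the example above.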

You can also export the underlying Markov chain on its own — i.e., excluding the original corpus and the state_size metadata — via my_text_model.chain.to_json().

Generating markovify.Text models from very large corpora

By default, the markovify.Text class loads, and retains, your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences). However, with very large corpora, loading the entire text at once (and retaining it) can be memory-intensive. To overcome this, you can (a) tell Markovify not to retain the original:

with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())

And (b) read in the corpus line-by-line or file-by-file and combine them into one model at each step:

import os

import markovify

combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())
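To see why the novelty check described at the start of this section needs the original text in memory, consider this simplified sketch. It rejects a candidate sentence if any run of consecutive words appears verbatim in the corpus. The `is_novel` function and its `max_overlap` parameter are hypothetical; Markovify's actual overlap test may use different rules and thresholds.

```python
def is_novel(candidate_words, corpus_text, max_overlap=4):
    """Return False if any run of `max_overlap` candidate words occurs in the corpus."""
    if len(candidate_words) < max_overlap:
        return " ".join(candidate_words) not in corpus_text
    for i in range(len(candidate_words) - max_overlap + 1):
        run = " ".join(candidate_words[i:i + max_overlap])
        # Simplification: plain substring matching, with no word-boundary handling.
        if run in corpus_text:
            return False
    return True


# Hypothetical toy corpus.
corpus_text = "the cat sat on the mat . the dog ran to the park ."

print(is_novel("the cat sat on the park".split(), corpus_text))        # False
print(is_novel("the dog sat on".split(), corpus_text, max_overlap=3))  # True
```

Without the retained corpus there is nothing to compare against, which is the trade-off of passing retain_original=False.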

Markovify In The Wild

Have other examples? Pull requests welcome.

Thanks

Many thanks to the following GitHub users for contributing code and/or ideas:

Initially developed at BuzzFeed.