Writing a poetry generator

Published in Sense and Sentences, Jul 17, 2017

At this year’s CODEX Hackathon, I created an auto-generated book of poetry, rhyming old dictionary definitions along familiar rhyming schemes.

For example, Sprung Death-Bed Harped Know-all, Entailment Derailment and Reprobate Rorqual Law Aweless Goatee:

Or Sulk Brewage Desquamation, Pinned Encroach and Roar Inexhaustible Birthnight Sleuth:

These are absolutely intended to be read aloud.

The play here is in meaning-making through form: masquerading randomness behind a structure we know to be meaningful yet sometimes obtuse, in the hope that readers will make meaning from noise. This poetry is often silly or nonsensical, but sometimes it stumbles upon accidental poignancy.

The week of CODEX, my Text Mining class had been discussing generative art as an exercise in deconstruction. The hackathon provided the opportunity to try this sort of close dissection, a diversion from the usual aggregate view of texts that I have at the HathiTrust Research Center.

The idea that readers can make meaning from randomness, encouraged by context, is common in generative text. Liza Daly’s The Days Left Forebodings and Water uses the form of blackout poetry to brilliant effect, while Darius Kazemi’s Hottest Startups bot hilariously strips Marxist aphorisms of their original context and offers them as startup ideas. I was thinking of Marshall McLuhan, whose book The Medium is the Massage (with an ‘a’) is said to have originated as a typo, one that McLuhan loved because of the way media massages into our senses.

DISARRAY CONSOLE
Disorder in dress,
cheer in distress.

I’ve discussed my affinity for old dictionaries in the past. In addition to being a time capsule of the English language at a given time, the colorful style of early Webster’s dictionaries is rewarding even for words that you already know. While it ran, my bot of old slang, vulgarities, and colloquialisms was a favorite son.

This book’s code was patched together with a variety of heuristics surmised through trial and error, and it leans on a variety of libraries to do the heavy lifting for image processing, optical character recognition, and rhyming. The rough order of steps was:

Downloading a PDF of Laird & Lee’s Webster’s New Standard American Dictionary (1912) from the HathiTrust Digital Library. Its pages are converted to images with ImageMagick.
Identifying the text in the scans using Tesseract, through the PyOCR wrapper. This is somewhat slow, though it multi-threaded just fine with Dask. In addition to the words, Tesseract provides coordinates for each left-to-right line of text. Here’s the information that I get for each page:

Identifying junk, such as headers, footers, ornamentation, and watermarks. For example, Tesseract returns coordinates for non-text, so you can identify the horizontal lines at the top and bottom (e.g. width > 1000px) and ignore everything before and after. Other heuristics identified watermarks or column headers.
Identifying columns. This was done by seeing how much a line’s left margin deviates from the previous line. Columns that were mostly junk were removed.
Identifying the definitions, which entailed finding the starting words of a definition and patching together all the consecutive, non-junk lines until the next one. This was made easier by the capitalization of definition words; again, simple rules were used so that words like SYN or ANT weren’t incorrectly identified. These definitions were tokenized into sentences.
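
The junk and column heuristics above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the `Line` tuple, the function names, and the 200px `MARGIN_JUMP` threshold are all assumptions made for the example, and only the 1000px rule width comes from the description above. It assumes OCR has already produced one bounding box per line, in reading order.

```python
from typing import NamedTuple

class Line(NamedTuple):
    """One OCR line box (hypothetical structure for this sketch)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int

RULE_WIDTH = 1000   # px; the "width > 1000px" example from the text
MARGIN_JUMP = 200   # px; assumed threshold for "a new column started"

def is_rule(line):
    """Treat a very wide box with no recognized text as a horizontal rule."""
    return not line.text.strip() and (line.right - line.left) > RULE_WIDTH

def trim_to_body(lines):
    """Drop everything above the top rule and below the bottom rule."""
    rules = sorted((ln for ln in lines if is_rule(ln)), key=lambda ln: ln.top)
    if len(rules) < 2:
        # Without two rules there is nothing to trim against; just drop rules.
        return [ln for ln in lines if not is_rule(ln)]
    upper, lower = rules[0].bottom, rules[-1].top
    return [ln for ln in lines
            if not is_rule(ln) and ln.top >= upper and ln.bottom <= lower]

def split_columns(lines):
    """Start a new column whenever the left margin jumps from the previous line."""
    columns = []
    for ln in lines:
        if not columns or abs(ln.left - columns[-1][-1].left) > MARGIN_JUMP:
            columns.append([ln])
        else:
            columns[-1].append(ln)
    return columns

# Invented sample page: header, top rule, two columns, bottom rule, page number.
page = [
    Line("WEBSTER'S DICTIONARY", 400, 20, 900, 50),
    Line("", 50, 60, 1250, 64),
    Line("SULK, v. To be sullen.", 60, 80, 600, 110),
    Line("SULKY, a. Sullen.", 62, 115, 600, 145),
    Line("SUM, n. The total.", 680, 80, 1240, 110),
    Line("SUMMER, n. Warm season.", 684, 115, 1240, 145),
    Line("", 50, 1900, 1250, 1904),
    Line("412", 640, 1920, 680, 1950),
]
body = trim_to_body(page)
columns = split_columns(body)
```

Comparing each line only to the previous one keeps the rule simple, but a single badly indented line can split a column in two, which is one reason a follow-up pass for mostly-junk columns is useful.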

SUCKER: Nickname for one living in Illinois

Extracting possible pictures and their captions. These were saved for later.
Analyzing definitions for word stresses and syllable count. The hope was that rhyming verses would have similar syllabic lengths. A simple measure was also created for ‘rhythm’, striving for an iambic meter. Essentially, it counted the number of stressed/unstressed alternations divided by the maximum number possible. Here are some definitions that the metric said had a good iambic rhythm:

ORTHOEPICALLY: “With correct pronunciation”
JERSEY: “Largest of the English Channel Islands.”
CERUMEN: “Wax secreted by the ear.”
SLAKED: “Become disintegrated or extinct.”
PROTESTANT: “Dissenter from the doctrines of the Roman Catholic Church.”
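
The rhythm measure described above can be sketched in a few lines. This is an assumption-laden illustration rather than the code used for the book: it takes a string of CMU-style stress digits for a whole line (0 for unstressed, 1 and 2 for primary and secondary stress), such as the kind of string the Pronouncing library's `stresses` function returns for a word, and the name `alternation_score` is made up for this example.

```python
def alternation_score(stresses: str) -> float:
    """Share of adjacent syllable pairs that flip between stressed and unstressed.

    Secondary stress (2) is counted as stressed, which is an assumption
    of this sketch.
    """
    beats = [digit != "0" for digit in stresses]
    if len(beats) < 2:
        return 0.0  # a single syllable has no alternations to count
    alternations = sum(a != b for a, b in zip(beats, beats[1:]))
    return alternations / (len(beats) - 1)

perfect = alternation_score("01010101")  # every pair alternates
clumsy = alternation_score("1100")       # only one of three pairs alternates
```

Note that this version rewards any strict alternation, so a trochaic line (stressed first) scores as well as an iambic one. Preferring iambs would need an extra check on the first syllable.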

Finding rhymes. This was done using the CMU Pronouncing Dictionary and the Pronouncing library. A line was randomly selected, and I looked for definition sentences that ended with a rhyming word. The pronunciation dictionary was extended for words that it did not have, with pronunciations inferred by lextool. This has a peculiar effect in some places, where a line ending with an OCR error has a rhyme based on how that typo might be pronounced.
Structuring the rhymes by familiar patterns, from simple couplets (AA BB CC) and cinquains (ABABBA) to sonnets (ABAB CDCD EFEF GG) and ballades (ABABBCBC BCBC). I even generated an ill-advised sestina (ABCDEF FAEBDC CFDABE ECBFAD DEACFB BDFECA and ending with internal rhymes (AD)(BE)(CF)).
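
To make the last two steps concrete, here is a small sketch of grouping lines by rhyme and pouring them into a scheme. It is not the project's code. To stay self-contained it uses a tiny hand-written `PRONUNCIATIONS` table in CMU-style notation instead of the real dictionary, and the helper names (`rhyming_part`, `line_key`, `group_by_rhyme`, `fill_scheme`) are chosen for the example. Two lines count as rhyming when their final words share everything from the last stressed vowel onward. The last sample line is invented for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical mini pronunciation table in CMU-style ARPAbet notation.
# A real run would load the full CMU Pronouncing Dictionary instead.
PRONUNCIATIONS = {
    "dress": "D R EH1 S",
    "distress": "D IH0 S T R EH1 S",
    "extinct": "IH0 K S T IH1 NG K T",
    "distinct": "D IH0 S T IH1 NG K T",
}

def rhyming_part(phones):
    """Return the phones from the last stressed vowel to the end of the word."""
    parts = phones.split()
    for i in range(len(parts) - 1, -1, -1):
        if parts[i][-1] in "12":  # stress digits 1 and 2 mark stressed vowels
            return " ".join(parts[i:])
    return phones

def line_key(line):
    """Rhyme key for the final word of a line, or None if the word is unknown."""
    words = line.rstrip(".,;:!?\"' ").split()
    if not words:
        return None
    phones = PRONUNCIATIONS.get(words[-1].lower())
    return rhyming_part(phones) if phones else None

def group_by_rhyme(lines):
    """Map each rhyme key to the lines that end with it."""
    groups = defaultdict(list)
    for line in lines:
        key = line_key(line)
        if key is not None:
            groups[key].append(line)
    return groups

def fill_scheme(scheme, groups, rng):
    """Assign one rhyme group per letter; spaces in the scheme become blank lines."""
    needed = Counter(scheme.replace(" ", ""))
    available = list(groups.values())
    rng.shuffle(available)
    assigned = {}
    # Fill the most demanding letters first so large groups are not wasted.
    for letter, count in needed.most_common():
        match = next((g for g in available if len(g) >= count), None)
        if match is None:
            raise ValueError(f"no rhyme group with {count} lines for {letter!r}")
        available.remove(match)
        assigned[letter] = rng.sample(match, count)
    return ["" if ch == " " else assigned[ch].pop() for ch in scheme]

lines = [
    "Disorder in dress,",
    "cheer in distress.",
    "Become disintegrated or extinct.",
    "Clearly separate and distinct.",  # invented line for the example
]
poem = fill_scheme("AA BB", group_by_rhyme(lines), random.Random(0))
```

Longer schemes such as "ABAB CDCD EFEF GG" work the same way, provided enough rhyme groups of the right size exist. This sketch does not stop two lines that end in the same word from being paired, which a fuller version would probably want to avoid.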

Constructing pages. Pillow was used to lay out one poem per page, adding a random extracted image at the top. This was mostly unexciting work, playing with positions, resizing, masking, etc. A title page and frontispiece were made from interesting illustrations, and ImageMagick encoded the images back to PDF.

The trouble with weekend projects is that it’s hard to find time to wrap them up. A bit late, but the code has been posted on GitHub, and a 100-page PDF is below. Let me know what you think.

Written by Peter Organisciak