Looking for Poetry Using Rule-Based Methods

Molly Taylor '25

May 14, 2025

The Princeton Prosody Archive holds more than 300 books explicitly “for use in schools,” written to teach students the fundamentals of English language, literature, and rhetoric. Many include poems to demonstrate concepts like meter and verse, and, as part of a research effort to trace poets' influence over time, the project team wants to know which poems are most frequently quoted throughout the collection. To answer this question, we first need to find poetry in the collection.

Ahead of the project team's attempt to detect poetry using text reuse algorithms and machine learning, I looked to see how far we could get with a rule-based approach. Given OCR text, can we identify whether a page contains poetry and pinpoint which lines belong to a poem? I tried using two different tools from the HathiTrust Research Center: the HTRC Feature Reader library and the Data Capsule. The former offers page-level information about each volume in HathiTrust, while the latter allows direct access to the text.

Attempt 1: Using the HTRC Feature Reader Library

First, I used the Feature Reader library to access HTRC’s Extracted Features dataset, which details page-level information about its volumes. For example, the Extracted Features dataset includes word frequencies, line counts, and lists of characters that start or end lines on each page. However, the data is abstracted from the text itself—what is commonly referred to as a “bag-of-words.” Extracted Features does not tell you where words appear on the page, or in what order start-line characters fall.

Can page-level data help us to find poetry? Using the Feature Reader library, I implemented an approach devised by HTRC that leverages the characteristics of pages that tend to contain poems. With enjambment, lines of poetry are often shorter than lines of prose, so we flag pages with fewer-than-average words (a standard deviation below the book’s mean, plus or minus 100). And most—if not all—lines of poetry in the PPA begin with uppercase letters, so we flag pages with more uppercase start-line characters than lowercase ones. The intersection of these sets offers a preliminary guess at pages with poetry.
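The two page-level filters can be sketched roughly as follows. This is a minimal illustration, not the HTRC code or my exact script: the `pages` dictionary and its `"tokens"` and `"begin_chars"` keys are hypothetical stand-ins for the word counts and start-line character counts that Extracted Features provides, and the "plus or minus 100" margin is left out because how it is applied is not spelled out here.

```python
from statistics import mean, pstdev


def flag_poetry_pages(pages):
    """Return the ids of pages that pass both page-level filters.

    `pages` (hypothetical structure) maps a page id to a dict with:
      "tokens":      total word count on the page
      "begin_chars": {character: count} of line-initial characters
    """
    counts = [p["tokens"] for p in pages.values()]
    # Cutoff: one standard deviation below the book's mean page length.
    cutoff = mean(counts) - pstdev(counts)

    flagged = []
    for page_id, p in pages.items():
        starts = p["begin_chars"]
        upper = sum(n for ch, n in starts.items() if ch.isupper())
        lower = sum(n for ch, n in starts.items() if ch.islower())
        # Keep the page only if it is short AND mostly uppercase line starts.
        if p["tokens"] < cutoff and upper > lower:
            flagged.append(page_id)
    return flagged
```

Because both conditions must hold, a short page whose lines mostly start in lowercase is not flagged, and neither is a long page of capitalized lines.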

This approach correctly identified several cases of poetry in the For Use in Schools dataset:

Side by side comparison of 3 pages of poetry — ***Correct IDs*** (Left) John Mulligan, *Exposition of the grammatical structure of the English language: being an attempt to furnish an improved method of teaching grammar. For the use of schools and colleges* (1874); (Middle) Mary F. Hyde, *Practical lessons in the use of English for primary and grammar schools* (1894); (Right) Alexander Jamieson, *A grammar of rhetoric and polite literature: comprehending the principles of language and style, and the elements of taste and criticism. For the use of schools, or private instruction* (1875)

These pages are primarily poetry, so it is easy to see how they satisfy the filter criteria. However, this approach also caught several false positives—pages with relatively fewer words and more uppercase start-line characters but no poetry, such as lists of vocabulary words, sentences demonstrating grammar concepts, or tables of contents.

Side by side comparison of two texts — ***False Positives*** (Left) Mulligan, *Exposition*; (Right) Hyde, *Practical Lessons*

Worse, this approach tended to miss prose-heavy pages with small poetry excerpts.

Given the goal of identifying the most frequently quoted poems in the collection, we can’t afford to miss pages like these—we need an approach that captures pages with just a few lines of poetry while minimizing false positives. I experimented with the filter criteria, removing the maximum word count (as pages with short poems may still have a high word count), lowering the portion of uppercase start-line characters (as shorter poems contribute fewer lines), and adding a requirement that at least some lines end in commas. These adjustments seemed to boost the accuracy, but they still missed several pages with poetry and failed to exclude false positives.
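A rough sketch of the relaxed filter described above, under stated assumptions: the `"end_chars"` key is a hypothetical stand-in for the end-of-line character counts, and the default thresholds (`min_upper_share`, `min_comma_ends`) are placeholders for illustration, not the values I actually used.

```python
def flag_pages_relaxed(pages, min_upper_share=0.3, min_comma_ends=1):
    """Flag pages without any word-count limit.

    `pages` (hypothetical structure) maps a page id to a dict with
    "begin_chars" and "end_chars", each a {character: count} mapping.
    """
    flagged = []
    for page_id, p in pages.items():
        starts = p["begin_chars"]
        total = sum(starts.values())
        if total == 0:
            continue  # blank page: nothing to measure
        upper = sum(n for ch, n in starts.items() if ch.isupper())
        comma_ends = p["end_chars"].get(",", 0)
        # A lower share of uppercase starts is accepted, but some lines
        # must end in a comma.
        if upper / total >= min_upper_share and comma_ends >= min_comma_ends:
            flagged.append(page_id)
    return flagged
```

Exposing the thresholds as parameters makes it easier to see how each adjustment changes the set of flagged pages.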

Page-level information helped me to identify pages that were primarily poetry. For harder-to-find poems, I turned to a text-level approach.

Attempt 2: Into the Data Capsule

The HTRC Data Capsule allows users to run custom scripts on the text of works in the HathiTrust library. I could apply character-level criteria and identify lines, instead of page numbers, that corresponded to poetry. So, at the character level, what does a poem look like?

My first thought was to look for consecutive lines beginning with uppercase letters. These sequences would surely include prose, but I would catch most, if not all, poetry. After trying out this approach, though, I discovered that many poems in the collection start with a quotation mark or a number. Accordingly, I broadened my search to allow for alternative start characters, but then—to exclude numbered lists and dialogue—I added a requirement that each sequence contain at most one line beginning with a non-uppercase-letter character.

Page image — A correctly identified poem whose third line begins with a quotation mark.
Hyde, *Practical lessons*
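One way to express the start-character rule in code is sketched below. The details are my assumptions for illustration: the function names are hypothetical, a run needs at least two lines, only quotation marks and digits count as alternative start characters, and a second alternative-start line closes the current run and begins a new one.

```python
QUOTE_CHARS = set("\"'“‘")


def start_kind(line):
    """Classify a line by its first non-space character."""
    s = line.lstrip()
    if not s:
        return "break"
    ch = s[0]
    if ch.isupper():
        return "upper"
    if ch in QUOTE_CHARS or ch.isdigit():
        return "alt"
    return "break"


def candidate_runs(lines, min_len=2):
    """Return (start, end) index pairs, end exclusive, of candidate runs.

    A run is a stretch of consecutive lines that begin with an uppercase
    letter, with at most one line beginning with a quote mark or digit.
    """
    runs = []
    start = None  # index where the current run began
    alts = 0      # alternative-start lines seen in the current run
    # The trailing empty string is a sentinel that flushes the last run.
    for i, line in enumerate(list(lines) + [""]):
        kind = start_kind(line)
        if kind == "upper" and start is not None:
            continue
        if kind == "alt" and start is not None and alts == 0:
            alts = 1
            continue
        # Otherwise the current run, if any, ends just before this line.
        if start is not None and i - start >= min_len:
            runs.append((start, i))
        if kind == "break":
            start, alts = None, 0
        else:
            start, alts = i, int(kind == "alt")
    return runs
```

With these choices, a numbered list never forms a run on its own, because each numbered line uses up the single allowed exception.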

Still, looking at only start-line characters returns many, many lines that are not poetry, so I needed to find other ways to filter out prose. In examining false positives, the most obvious indication of prose was a sentence beginning in the middle of a line and ending partway through the next. In the collection’s poems, by contrast, sentences tend to start at the beginning of a line and end at the end of another line. Based on this observation, I tried excluding lines with a period, exclamation point, or question mark in the middle.

However, some poems in the collection do contain lines where one sentence ends and another begins. In these cases, the line still tends to end with a punctuation mark (usually a comma), so I excluded a line only if a sentence ended in the middle of it and the line itself ended with a non-punctuation character, such as a letter.

Poems on pages — When a sentence ends in the middle of a line, the line also tends to end in a punctuation mark.
Hyde, *Practical lessons*
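The refined exclusion rule might look something like this. It is a sketch rather than my exact script: the regular expression and the function name are illustrative, and the pattern will also match abbreviations such as "Mr. Smith", which a fuller version would need to handle.

```python
import re
import string

# A period, exclamation point, or question mark (optionally followed by a
# closing quote or parenthesis), then whitespace and more text on the line.
MID_SENTENCE_END = re.compile(r'[.!?]["\'”’)]*\s+\S')


def looks_like_prose_line(line):
    """True if a sentence ends mid-line and the line ends without punctuation."""
    s = line.strip()
    if not s:
        return False
    ends_with_punct = s[-1] in string.punctuation or s[-1] in "”’"
    return bool(MID_SENTENCE_END.search(s)) and not ends_with_punct
```

For example, `looks_like_prose_line("away. The next morning he went to the")` is true, while a line such as "Hark! the bells are ringing," is kept because it ends in a comma.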

To further narrow down my results, I implemented a few additional rules. No line in a potential poem can contain more than 12 words, and at least one in every four lines has to end in a comma or semicolon.
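These two rules are simple to state in code. In the sketch below, the function name is hypothetical, and "one in every four lines" is interpreted as an overall share of at least 25% across the candidate, which is an assumption; checking each window of four consecutive lines would be a stricter reading.

```python
def passes_length_and_comma_rules(lines, max_words=12, min_share=0.25):
    """Check the word-count cap and the comma/semicolon share for a candidate."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return False
    # No line may be longer than `max_words` whitespace-separated words.
    if any(len(ln.split()) > max_words for ln in lines):
        return False
    # Share of lines that end in a comma or semicolon.
    ends = sum(ln.endswith((",", ";")) for ln in lines)
    return ends / len(lines) >= min_share
```

Keeping `max_words` and `min_share` as parameters makes it straightforward to test how sensitive the results are to each cutoff.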

Ultimately, I found poetry in the Data Capsule with far greater success than with the Feature Reader library. In particular, I was able to identify short excerpts that were difficult to detect with only page-level information, allowing for a much more comprehensive search of poems across the collection.

With the granularity granted by this text-level tool, I found myself constantly weighing the trade-offs between criteria that were too broad and those that were too narrow. For one, excluding lines with more than 12 words dramatically reduced the number of false positives, but surely, some poems contain a 13-word line. It was also hard to settle on criteria without labeled poetry to evaluate against. I manually checked my results on a small set of volumes, which made it difficult to ensure that I was not overfitting to a certain format, or to understand how adding or removing a requirement would affect performance across all volumes.

Throughout the trial-and-error process, I was surprised to discover the number of ways in which poetry appears throughout the collection, despite the period’s relatively standard conventions (in contrast to, say, poetry today). Just as there is no character-based definition of a poem, there is no perfect set of rules for poetry detection.

As the project team explores machine learning methods for poetry identification, I am optimistic that they will recognize more complex character-level patterns and leverage additional features. Page images could be especially valuable as they retain visual elements like whitespace—one of the most distinctive features of a poem—that are not preserved in OCR text. And while it is easy to get mired in the technical specifications, this work will inform our understanding of the most influential poets and poems over time.