Announcing Our Partnership with ECCO

Mary Naydan; Yan Che; Rebecca Sutton Koeser

August 30, 2024

In the spring of 2021, the PPA project team reached an agreement with Gale/Cengage to incorporate thousands of new works from Eighteenth Century Collections Online (ECCO) into the Archive.

At the time of our proposal for the partnership with ECCO, the PPA contained only 228 HathiTrust works published between 1701 and 1800. To put this number in perspective, eighteenth-century works comprised just ~5% of the PPA’s total holdings, compared to the nearly 64% of works published in the nineteenth century (2,855 works out of 4,478 total).

With the addition of 1,498 works from ECCO to the PPA, eighteenth-century works now comprise 26% of our holdings (1,737 works out of 6,752 total); the green slices of the pie charts below visualize this growth. Scholars of poetry and prosody can now easily search across the eighteenth and nineteenth centuries in one interface to track how prosody’s discourses have evolved over two centuries.

Pie chart showing small relative slice of eighteenth-century works

Pie chart showing much larger slice of eighteenth-century works

One of the major interventions of the PPA is that it pulls out a relatively small, curated set of content that would otherwise be lost (or at least very difficult to find) when searching across the full scale of these enormous aggregators (180,000 works for ECCO; for HathiTrust, a whopping 17+ million).

Indeed, the process of identifying relevant works to include from ECCO’s collection was a big challenge. Graduate Research Assistant Yan Che and Project Manager Mary Naydan spearheaded this phase of the project, which required intensive collaboration, much trial and error, and several layers of data curation. Obviously, we could not review all 180,000 texts in ECCO for relevance manually, so we had to decide how to best leverage the resources of distant reading to efficiently identify relevant texts. In this essay, we’ll outline the approaches we tried (so that you can learn from our failed and successful attempts); narrate how Gale/Cengage added support for the PPA to their internal API as an "experiment"; and gesture toward the larger implications of what it means to combine digital surrogates from two large aggregators (HathiTrust and Gale/Cengage) into one interface.

LOC Subject Headings: A Failed Approach

The first approach we tried involved giving Gale’s VP of Content and Metadata a list of keywords to search against the Library of Congress subject headings in ECCO and limiting the search to English-language works. Although this approach mirrored the approach we previously used to identify relevant works in HathiTrust, it was tricky because there were so many levels of subheadings, from very broad to very narrow (you can see the range of topics this approach yielded below). We needed a couple rounds of tailoring the keywords, and some had to be eliminated because they were too broad. This approach ultimately yielded a list of 18,892 works, or about 10% of ECCO.

Screenshot of a spreadsheet showing Library of Congress subject headings

The next big challenge was excluding works from that list that got a hit, but were not actually relevant to the PPA. For example, our keyword “meter” pulled works listed under subjects such as “barometer” and “thermometer” because “meter” is a part of those words. With the help of our undergraduate researcher Vinicius Wagner ’19, these obviously irrelevant works were eliminated relatively easily by skimming the list of titles for such irrelevant words and running an algorithm in R to exclude items with those words. If the item’s LOC subject heading tags and title metadata contained ONLY irrelevant keywords and NO relevant keyword, it was automatically excluded. If the item metadata contained ONLY relevant keywords and NO irrelevant keywords, it was automatically included. If the item metadata contained BOTH relevant and irrelevant keywords, it was tagged for further inquiry. Through this approach, we were left with 10,828 works of undecided relevance.
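The three-way rule above can be sketched in a few lines. The original filtering was done in R; the Python below is only an illustration, and the function name, the sample keyword sets, and the fallback of sending items that match neither list to review are our assumptions. It matches whole words, so that “barometer” does not count as a hit for “meter.”

```python
import re

# Hypothetical sample keyword sets; the project's real lists were longer.
RELEVANT = {"meter", "metre", "prosody", "poetry"}
IRRELEVANT = {"barometer", "thermometer"}


def triage(metadata: str, relevant: set[str], irrelevant: set[str]) -> str:
    """Classify one item's subject-heading and title metadata.

    Returns "exclude", "include", or "review".
    """
    # Split into whole lowercase words so "barometer" is not a hit for "meter".
    words = set(re.findall(r"[a-z]+", metadata.lower()))
    has_relevant = bool(words & relevant)
    has_irrelevant = bool(words & irrelevant)

    if has_irrelevant and not has_relevant:
        return "exclude"   # ONLY irrelevant keywords
    if has_relevant and not has_irrelevant:
        return "include"   # ONLY relevant keywords
    # Both kinds of keyword (or, by assumption, neither): a person decides.
    return "review"
```

For example, `triage("A treatise on the barometer and thermometer", RELEVANT, IRRELEVANT)` falls into the automatic-exclusion branch, while a title that mentions both poetry and a barometer is tagged for review.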

However, we really needed to see inside the items, especially those related to poetry, to determine if there were prefaces to poems or poetry collections that discussed prosody, or a single prosody chapter in a grammar book. Gale generously gave us some additional metadata (such as author) to help improve our “spreadsheet reading,” but we still needed to be able to see the tables of contents, or to find some way of systematically filtering texts that contained prefatory material. Recall that the PPA comprises works about the study of poetry; if a book of poetry did not have prefatory material and was just a book of poems (and if those poems didn’t happen to be about the process of writing poems – another challenge to this new material), then we could exclude it. Gale’s solution was to provide us with a direct link to each “product” (Gale’s term for the digital surrogate in ECCO) so that we could manually skim the item’s “eTOC” (Gale’s term for “electronic table of contents,” a useful linked section-level metadata structure Gale provides for ECCO works) in ECCO’s interface during data curation.

Screenshot of an eTable of Contents in ECCO — An example eTOC for Thomas Newton’s edition of John Milton’s *Paradise Lost.* Note that the eTOC easily confirms at a glance the presence of an essay on “THE VERSE.”

It was at this point that we made a discovery that affected the entire Princeton University Library: we were unable to access some of those linked materials. We learned that, although Princeton had purchased a full subscription to ECCO, the actual database was separated on the server side into two collections, ECCO (prefixed CW on Gale IDs) and ECCO 2 (prefixed CB on Gale IDs), and that access to ECCO 2 had been mistakenly disabled for Princeton users. Gale quickly rectified this error so that we could get access to all the items, though we had to generate logic in the spreadsheet ourselves to provide an alternate “CB” link for items that weren’t found. Imagine being a Princeton researcher of the eighteenth century before that point and thinking you were accessing the most complete version of ECCO, when in fact 52,689 documents were missing.¹
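The fallback logic we built in the spreadsheet might look roughly like the following if written in Python. This is a sketch under assumptions: it supposes that the alternate ID differs only in its two-letter prefix, and the `base_url` value is a placeholder, not a real Gale address.

```python
def alternate_gale_id(gale_id: str) -> str:
    """Swap the collection prefix: CW (ECCO) <-> CB (ECCO 2).

    Assumes the rest of the ID is unchanged between the two forms.
    """
    if gale_id.startswith("CW"):
        return "CB" + gale_id[2:]
    if gale_id.startswith("CB"):
        return "CW" + gale_id[2:]
    raise ValueError(f"Unrecognized Gale ID prefix: {gale_id!r}")


def candidate_links(gale_id: str, base_url: str = "https://example.org/ecco/") -> list[str]:
    """Return the primary link followed by the alternate-prefix link.

    base_url is a hypothetical placeholder for the real link pattern.
    """
    return [base_url + gale_id, base_url + alternate_gale_id(gale_id)]
```

A curator would try the first link and fall back to the second only when the first is not found.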

Even with access to the eTables of Contents, manually reviewing them for each item was unsustainable for the amount of material we had to verify. It had become clear we needed a different approach, one that would allow us to search within the full-text of these works, not just the metadata.

Success! Keyword Searches via API

Rather than searching within only the initial list of 18,892 items we had gotten from the LOC subject headings, we decided to start the search process from scratch using the Gale Content API to search the “full text” (TX) field of all available materials on ECCO for matches. The full text of each item had been produced by Gale through OCR, and the Content API allowed us to view up to 10,000 results for each keyword, sorted by relevance. In consultation with Meredith Martin, we generated the following list of keywords, including common spelling variants (indicated in parentheses):

Diction, Elocution, Etymology, Grammar, Iamb, Linguistics, Meter (Metre), Orthography, Phonetics, Phonology, Poetics, Poetry, Pronunciation, Prosodic, Prosody, Rhetoric, Rhyme (Rime), Rhythm (Rhythme), Syntax, Versification, Voice Culture

Each keyword search produced one list of results, populated by a list of texts identified by their unique IDs within ECCO.² Rather than use all results, which would have been unworkably large, we chose only items that appeared in the results for at least five of our keywords. That is, we used the appearance of duplicates within our search output to our advantage, to help identify items with a higher likelihood of relevance. Although five duplicates is an arbitrary threshold, we estimated that it would take several keywords all appearing in a text for the text to likely be topically relevant. This is especially true because some keywords, such as “poetry,” “meter,” and “grammar,” can be used in non-prosodic contexts. In prosodic contexts, however, these terms would likely appear alongside other, more topic-specific terms. The collocation of these words with other prosodic terms would therefore be captured by the duplicate counting method.

Overall, duplicate counting seemed like the best available method, since we were unable to run any kind of systematic machine reading on Gale’s texts. Texts were assigned likelihoods for relevance based on the number of duplicates – the more duplicate occurrences, the more likely it seemed that the text discussed relevant aspects of prosody. Given that our initial likelihood estimates would have to be confirmed by manual review regardless of the threshold we chose, this method proved efficient for rapidly trimming down search results into a manageable workload.
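To make the duplicate counting method concrete, here is a minimal sketch in Python. It is not the script we ran; the function names and the shape of the input (a dictionary mapping each keyword to its list of result IDs) are assumptions for illustration.

```python
from collections import Counter


def count_keyword_hits(results_by_keyword: dict[str, list[str]]) -> Counter:
    """Count, for each item ID, how many keyword searches returned it."""
    counts: Counter = Counter()
    for ids in results_by_keyword.values():
        # set() ensures an item counts once per keyword, even if listed twice.
        counts.update(set(ids))
    return counts


def select_candidates(results_by_keyword: dict[str, list[str]], threshold: int = 5) -> list[str]:
    """Return IDs found under at least `threshold` keywords.

    Items are ordered from most to fewest keyword hits (ties broken by ID),
    so manual review can start with the likeliest items.
    """
    counts = count_keyword_hits(results_by_keyword)
    kept = [item for item, n in counts.items() if n >= threshold]
    return sorted(kept, key=lambda item: (-counts[item], item))
```

Sorting by hit count mirrors the likelihood ranking described above: the threshold trims the list, and the ordering decides what reviewers look at first.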

We then divided the first set of results by author. The team performed manual checks of these items, eliminating works not in English, bilingual dictionaries, grammars of non-English texts, and, most importantly, poetry texts that did not pertain to prosody. For example, in a multi-volume collection of poetry, we would only retain the first volume if we found a relevant essay in the preface discussing the poems’ prosodic elements.

Works by Alexander Pope and Samuel Johnson were especially predominant, usually because of the very large number of editions and reprints of each, especially as extracts found in other sources. We judged that there was little value in including all the possible editions of what was functionally the same text, and so we manually eliminated reprint editions where no new prefatory or explanatory material had been introduced. Additionally, we encountered a key limitation of the duplicate counting method: namely, dictionaries. Dictionaries necessarily contain lists of words without context and were being flagged as especially high-likelihood items simply for containing many keywords – and the eighteenth century was really the age of dictionaries; Samuel Johnson’s Dictionary, given its numerous editions, made up a large part of these false positives. Although excluding these dictionaries was a simple fix, this example shows how meaning cannot be assigned to texts based on word-counting alone.

We then merged the results from this first pass with eighteenth-century texts we knew were relevant for inclusion in the PPA from Brogan’s bibliography (the Original Bibliography in our Collections), as well as certain items suggested to us by specialist advisors.

Upon closer review of this first set of results, however, we discovered that grammars constituted the majority of these items, likely due to the nature of the keywords we selected. As always, a search is only as good as its parameters. A second pass was therefore necessary, using the same keyword search method. To sweep up texts less focused on grammar, for this second pass, we focused on terms more directly related to poetic form and meter. This second, expanded, list of keywords was:

Alexandrine, Amphibrach, Anacreontic, Anapaest (Anapest), Ballad, Blank Verse, Burlesque, Caesura, Cento, Couplet, Dactyl, Dimeter, Distich, Eclogue, Elegy, Enjambment, Epic, Epilogue, Epithalamion, Eulogy, Georgic, Heptameter, Hexameter, Iamb, Lyric, Miltonic, Monody, Monometer, Octameter, Ossianic, Panegyric, Parody, Pentameter, Pindaric, Pyrrhic, Quatrain, Sonnet, Spenserian, Spondee, Stanza, Tetrameter, Trimeter, Trochee

Because this list of terms was longer, we chose to increase the threshold to six or more duplicates, and/or three or more duplicates among words related to metrical feet. We chose the lower threshold of three duplicates for this latter category because, in manual review, many texts referred to iambs in passing, but only a subset discussed iambs in the context of a broader discussion of meter. This can be seen in the relative frequency of “iamb” as a term compared to other names of metrical feet in our search results:

Bar chart showing the relative frequency of seven metrical terms within ECCO search results

These two sets of filters naturally produced some overlapping items, which we marked as items with an even higher likelihood of relevance. We then repeated the process developed for the first list of manually reviewing items following a ranking of likelihood based on the duplicate counting method.
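The second-pass thresholds and the overlap between the two passes can be sketched as follows. This is an illustration in Python, not our actual script; in particular, the set of terms treated as “metrical feet” is our guess at the category, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: which second-pass keywords count as names of metrical feet.
FOOT_TERMS = frozenset(
    {"amphibrach", "anapaest", "anapest", "dactyl", "iamb", "pyrrhic", "spondee", "trochee"}
)


def second_pass(
    results_by_keyword: dict[str, list[str]],
    overall_threshold: int = 6,
    foot_threshold: int = 3,
) -> set[str]:
    """Keep items with 6+ keyword hits overall, or 3+ hits among foot terms."""
    overall: Counter = Counter()
    feet: Counter = Counter()
    for keyword, ids in results_by_keyword.items():
        unique_ids = set(ids)  # one count per keyword per item
        overall.update(unique_ids)
        if keyword.lower() in FOOT_TERMS:
            feet.update(unique_ids)
    return {
        item
        for item in overall
        if overall[item] >= overall_threshold or feet[item] >= foot_threshold
    }


def high_confidence(first_pass_ids: set[str], second_pass_ids: set[str]) -> set[str]:
    """Items selected by both passes, treated as likeliest to be relevant."""
    return first_pass_ids & second_pass_ids
```

An item that matches “iamb,” “dactyl,” and “trochee” but nothing else is kept by the lower foot threshold, whereas one that matches only two foot terms plus three genre terms is not.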

At the same time as we were reviewing works for relevancy, we slotted relevant works into our (at the time, seven) Collections. Notably, these eighteenth-century works pushed the boundaries of the Collections we had generated based on our mostly nineteenth-century archive. Eighteenth-century grammars looked very different from nineteenth-century grammars, and eighteenth-century dictionaries looked very different from nineteenth-century dictionaries. We undertook a couple more rounds of data curation to further winnow down the grammars and dictionaries as we iteratively thought about what these Collections should look like and include with the addition of ECCO.

After compiling these two rounds of keyword searches and further winnowing down the grammars and dictionaries, we ultimately flagged 1,208 full works, 105 articles, and 193 book excerpts for inclusion in the PPA.

Overall, we believe this method achieved a good balance between exhaustiveness and minimizing the amount of manual review required. It should be readily replicable for any similar cataloging project that draws on Gale/Cengage databases, since it relies only on keyword searches through their API.

API Development

Gale’s API was key not only to identifying relevant works during this data curation phase, but also to including the OCR, metadata, and image thumbnails from ECCO on the PPA interface during the development phase led by PPA Technical Lead Rebecca Sutton Koeser.

With our data work well underway, we began negotiations with Gale. Gale was eager to partner with us, seeing the specialized PPA as another gateway to their product, rather than a competitor, and as an opportunity to “experiment” with adapting and building out their API (primarily used internally for Gale databases) for external use.

Because the API was configured for internal use, however, we needed Gale to make some changes to it so that we could get the information that we needed. For instance, the API was pulling OCR for body text, but not for front or back matter: precisely the kind of material we were most interested in! This was a relatively easy fix for Gale’s development team, and they pushed the update quickly.

The larger issue was that the API did not have fields for unabridged main title, subtitle, sort title, current volume, publisher, place of publication, authorized author name, or original page number (necessary for citing excerpts). This information (aside from current volume and original page number) was contained in the associated MARC record for each item, to which we had access thanks to Princeton University Library staff. However, there was no easy way to pull the right MARC record from the metadata contained in the API: the API included the Gale ID of the individual volume, and to pull the right MARC record, we needed the ESTC number, which links multivolume works under a single title record. Gale wasn’t able to make this change to their API within our development timeline, so they offered us a one-time data feed with the ESTC numbers and current volume metadata as a quick fix. The screenshots below illustrate this problem.

Screenshot of the JSON from the API with the missing metadata fields.

Screenshot of MARC record with relevant fields split out (highlighted in yellow and labeled in blue).

Screenshot of item detail page on the PPA site showing granular publication information — The MARC record fields allow us to provide more granular publication information to our users than the API allows.

Screenshot of a spreadsheet showing a field for ESTC number and currentVolume information — One-time data feed containing ESTC number (bibliographicID) and currentVolume provided for our list of items in a spreadsheet by Gale.
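The linking problem illustrated above amounts to a two-step lookup: from Gale ID to ESTC number via the data feed, then from ESTC number to MARC record. The sketch below shows that join in Python. It is not the PPA import code; the `galeId` key and the dictionary-based MARC lookup are assumptions for illustration, while `bibliographicID` and `currentVolume` are the field names from the data feed.

```python
def link_items_to_marc(
    gale_ids: list[str],
    feed_rows: list[dict[str, str]],
    marc_by_estc: dict[str, dict],
) -> tuple[dict[str, dict], list[str]]:
    """Attach an ESTC number, volume, and MARC record to each Gale ID.

    Returns (linked, unmatched): linked maps Gale ID to its metadata, and
    unmatched lists IDs with no feed row or no MARC record.
    """
    # "galeId" is a hypothetical column name for the volume-level identifier.
    feed_by_id = {row["galeId"]: row for row in feed_rows}
    linked: dict[str, dict] = {}
    unmatched: list[str] = []

    for gale_id in gale_ids:
        row = feed_by_id.get(gale_id)
        record = marc_by_estc.get(row["bibliographicID"]) if row else None
        if record is None:
            unmatched.append(gale_id)
            continue
        linked[gale_id] = {
            "estc": row["bibliographicID"],   # shared by all volumes of a title
            "volume": row["currentVolume"],   # distinguishes the volumes
            "marc": record,
        }
    return linked, unmatched
```

Because the ESTC number identifies the title rather than the volume, several Gale IDs can resolve to the same MARC record; the volume value is what keeps them distinct.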

The quick-fix data feed allowed us to add the 1,208 full works to the PPA, but we needed the API adjustments in order to support adding new works to the PPA on demand as well as excerpts; we knew Meredith would find other works to include, and as soon as we announced the ECCO materials were included, we expected our eighteenth-century users to find works that we had missed. Rebecca cycled off active development for the PPA while we waited for Gale to make the adjustment. This was at the end of the summer of 2021. We checked back in with Gale in the winter of 2022, but our emails bounced back: it appeared our two main contacts, the VP of Software Engineering and the VP of Content & Metadata, had both left the company. We spent some time reestablishing contact with Gale and rearticulating our needs. In April 2022, Gale began development work on their API, and by the beginning of June they had added fields for estc, volNumber, and folioNumber (their term for original/physical page). These changes allowed us to implement features for on-demand admin import of new works from ECCO and for supporting book excerpts/journal articles in ECCO. Both of these major features were included in the v3.9 release in February 2023, almost two years after we initiated the project with ECCO.

As you search and browse the PPA, you can always determine the provenance of a work by looking at the item detail page to see whether the hyperlinked source ID is labeled “View on HathiTrust” or “View on Gale Primary Sources.”

Screenshot of item detail page on PPA site showing link to View on Gale Primary Sources

Beyond the Walled Gardens

The PPA database structure is intentionally very simple: minimal records for each volume or excerpted work with enough metadata to allow PPA curators to manage them, group into collections, or override metadata with local corrections when necessary. More detailed metadata (drawn from MARC catalog records for full volumes) – and text for each page in the volume or excerpted range – are indexed into Solr for searching across metadata and text together.

With the addition of works from ECCO, the PPA codebase now includes logic for displaying thumbnail images from HathiTrust or Gale/Cengage, and the base DigitizedWork object is designed to be extensible for content from other sources with a similar structure. The software for importing, managing, indexing, and searching digitized works is completely agnostic as to the prosody content — although search could be fine-tuned to meet the needs of the collection, as we’ve done in a few minor ways for PPA.

To clarify the potential impact for scholar-generated research products: because our code is all open source and freely available on GitHub, you could use our code to build your own full-text searchable database that brings together HathiTrust and Gale-owned works on any subject — even one unrelated to poetry or prosody — provided you secure an MOU from HathiTrust and Gale/Cengage like we did (and we’re happy to advise on this process). Just because we have an MOU with ECCO doesn’t mean that a scholar couldn’t demonstrate the need for another agreement with Gale, which owns a wide range of historical source material across various databases, from American Fiction, 1774-1920; to Public Health in Modern America, 1890-1970; to Indigenous Peoples of North America; to dozens more. While you could build a database, for instance, about historical cookbooks, the history of medicine, or ornithology using just ECCO, imagine an archive of the history of tobacco using American Newspapers and HathiTrust materials, or tracking information across books and media regarding climate change, voting issues, or race-horses. Please contact Rebecca Koeser (rebecca.s.koeser@princeton.edu) if you want to try the codebase on your own collection.

The PPA’s incorporation of works from multiple digital libraries into one full-text searchable database is an innovation of digital research infrastructure, which is dominated by large aggregators that silo digitized works into individual proprietary databases. Martin, Koeser, and Naydan are currently working on a book proposal on just this topic, tentatively titled Beyond the Walled Gardens.

As we work on writing up these larger implications about the PPA web application, we are excited to see what discoveries you make being able to search across two centuries in the PPA, and whether there are any additional eighteenth-century works (especially poems about poems) that we missed that you’d like to see added. Please contact Meredith Martin (mm4@princeton.edu) to let us know!

¹ Number taken from Tolonen, Mikko, Eetu Mäkelä, and Leo Lahti. “The Anatomy of Eighteenth Century Collections Online (ECCO).” Eighteenth-Century Studies 56, no. 1 (2022): 95-123. https://doi.org/10.1353/ecs.2022.0060.
² Here, we had some issues with converting the API output directly to workable CSVs, since we were only interested in the results as distinct entries in ECCO identified by their unique IDs. With the help of our undergraduate researchers, we first converted the API results to a .txt file and used a simple Python script to extract only the list of IDs and work with them as unique identifier strings representing the texts.
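A script of the kind described in the second note could be as short as the sketch below. It is a reconstruction for illustration, not the script we used, and it assumes that each ID is a CW or CB prefix followed by digits.

```python
import re

# Assumed ID shape: the collection prefix (CW or CB) followed by digits.
GALE_ID_PATTERN = re.compile(r"\b(?:CW|CB)\d+\b")


def extract_gale_ids(text: str) -> list[str]:
    """Return the distinct Gale IDs found in text, in order of first appearance."""
    matches = GALE_ID_PATTERN.findall(text)
    # dict.fromkeys removes repeats while preserving order.
    return list(dict.fromkeys(matches))


def extract_gale_ids_from_file(path: str) -> list[str]:
    """Read a saved .txt dump of API results and pull out the IDs."""
    with open(path, encoding="utf-8") as handle:
        return extract_gale_ids(handle.read())
```

Running this once per keyword yields the per-keyword ID lists that the duplicate counting method takes as input.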