Book Excerpts, Journal Articles, and Better Metadata
The ability to excerpt material from HathiTrust has greatly improved the PPA’s search interface.
Prior to the implementation of this feature, about 500 book chapters or articles were already in the PPA — but as full works. You can see why this was such a problem by comparing the two screenshots below.
In the first screenshot, you see only the whole journal title in the metadata (e.g., “Modern Language Notes”), rather than the title of the specific relevant article inside that journal (e.g., “Longfellow and the Hexameter”). The author and publication metadata for that article were either missing or inaccurate due to our reliance on HathiTrust’s metadata, which does not meaningfully index book chapters or periodicals in most cases, and which uses the date of the journal’s founding for the publication date field for individual journal volumes.
This full issue of Modern Language Notes was in the PPA because, using Brogan’s bibliography, the project team had previously identified it as containing an article about prosody. However, without looking back at Brogan’s bibliography or scouring the table of contents page on HathiTrust's viewer, it would be impossible to identify the relevant article quickly and accurately.
What’s more, because our Archive had to include the entire 400-page journal volume, rather than just the relevant article, users would invariably get keyword hits on pages not related to prosody. To ensure users were receiving relevant search results and to ease browsing, we needed to convert these existing works in the PPA to excerpts.
Because HathiTrust lacked the metadata we needed, we had to provide it ourselves. Project Manager Mary Naydan and former Bain-Swiggett Research Assistant Caitlin Crandell used Brogan’s bibliography to hand-curate the book chapter and journal article metadata for HathiTrust works already in the PPA in full-work form. The project team, including undergraduate research assistant Armani Aguiar '22, also hand-corrected the inaccurate publication date metadata we had ingested from HathiTrust. All this granular correction work allows for a level of ease and specificity when searching or browsing the Archive not achievable in HathiTrust.
Now, PPA users can be assured that they are seeing only relevant material when searching or browsing the Archive. Of course, the trade-off in getting rid of this excess material to make the contents of the PPA more focused and specific is losing some of the serendipity that might come from searching the whole volume. However, since the results are linked to HathiTrust or ECCO, a user can open a new browser window and explore the additional materials that surround any excerpted book chapter or article.
Researchers can also now see the works that are book chapters or journal articles at a glance when searching or browsing the Archive, as indicated by these icons designed by Gissoo Doroudian that you’ll see on the left-hand side of individual results.
Beyond the issue of missing or inaccurate metadata and the question of how to indicate excerpts in the design, the integration of HathiTrust excerpts was also technically interesting from a software engineering perspective. It required intensive and iterative collaboration among the project team as well as ingenuity from our Technical Lead Rebecca Koeser.
In the code, we revised the DigitizedWork model (see screenshot below) to track whether an item was a full work, an excerpt, or an article (circled in yellow), and we added new fields to track the pages that should be included (circled in blue). Since the physical page numbers rarely match the digitized pages, we needed to include both: citable pages for scholars to reference, along with the digital ranges from HathiTrust’s eReader to identify the correct text and thumbnail images to be included in the search (including support for discontinuous page ranges). We based the unique source ids on source id and digital starting page to support including multiple excerpts from the same volume (circled in violet).
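The fields described above can be sketched in plain Python. This is a minimal illustration and not the PPA's actual model code: the class name `WorkRecord`, the field names, the range syntax ("141-146, 150"), and the `-p` identifier format are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


def parse_page_ranges(spec: str) -> List[int]:
    """Expand a range string such as "10-12, 15" into [10, 11, 12, 15].

    Comma-separated parts allow discontinuous ranges.
    """
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            if end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


@dataclass
class WorkRecord:
    """Hypothetical stand-in for a digitized work or an excerpt of one."""

    source_id: str            # identifier of the source volume
    item_type: str = "full"   # "full", "excerpt", or "article"
    pages_orig: str = ""      # citable page numbers as printed
    pages_digital: str = ""   # page sequence numbers in the digitized volume

    def first_digital_page(self) -> Optional[int]:
        pages = parse_page_ranges(self.pages_digital)
        return pages[0] if pages else None

    def index_id(self) -> str:
        """Combine source id and first digital page so that several
        excerpts from one volume get distinct identifiers."""
        first = self.first_digital_page()
        if self.item_type == "full" or first is None:
            return self.source_id
        return f"{self.source_id}-p{first}"
```

For example, two articles drawn from the same hypothetical volume `example.0001`, starting at digital pages 141 and 210, would get the identifiers `example.0001-p141` and `example.0001-p210`, while a full work keeps the bare source id.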
Because our data model is intentionally simple, it was feasible for project team members to supply the needed metadata for excerpts, although some functionality that is currently implemented based on MARC records, such as harvesting citations with tools like Zotero, is not yet supported for excerpts. We’re thinking about how to solve this issue now, as well as how to improve citations for all PPA content. For citability and resource discovery, we implemented a redirect from works that had previously been included in the PPA as full volumes to the newly excerpted version (as long as there was only one excerpt).
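The redirect rule can be expressed as a small lookup. This is only a sketch of the logic described here, not the PPA's routing code; the function name `redirect_target` and the dictionary mapping source ids to excerpt ids are hypothetical.

```python
from typing import Dict, List, Optional


def redirect_target(
    source_id: str, excerpts_by_source: Dict[str, List[str]]
) -> Optional[str]:
    """Return the excerpt id to redirect to, or None.

    A redirect is only unambiguous when exactly one excerpt was made
    from the volume that used to be in the archive as a full work.
    """
    excerpts = excerpts_by_source.get(source_id, [])
    if len(excerpts) == 1:
        return excerpts[0]
    return None
```

A volume with two or more excerpts returns `None`, since there is no single page that an old full-volume link should land on.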
In addition to letting us convert existing works to excerpts, these changes to the DigitizedWork model also enabled project team members to add new excerpts from works not already in the PPA, as we have a rolling list of books, book excerpts, and journal articles from Brogan that we add to the Archive each year once they enter the public domain.
All this work was designed to be extensible to support ECCO excerpts, which we imported the following year, and which you can read about in more detail here.
However, we later made a frustrating discovery: HathiTrust works are occasionally rescanned and re-ingested into HathiTrust, sometimes (but not always) resulting in changes to the digital pagination. These changes meant that the PPA was unknowingly pulling in the wrong page ranges for some HathiTrust excerpts! They also meant that our unique source identifier, which used the source id and digital starting page, was not as stable as we thought it would be. Since there was no quick and easy way to identify which HathiTrust works had pagination changes, our research assistant Olivia Roslansky ’26 had to hand-check the HathiTrust excerpts and correct the ranges. As of October 2023, 111 excerpts out of 517 needed to be fixed. During the spring of 2024, we wrote and ran a script to fix these ranges, as well as to base the unique identifiers on the physical page range rather than the digital page range.
We are planning to bring up this issue with our HathiTrust contacts, as the use case for having a stable digital page sequence no doubt extends beyond ours, but this is likely a structural issue based on the way HathiTrust ingests works from various institutions and libraries with little standardization or oversight. In the absence of a solution from HathiTrust, we may turn to other software engineering solutions to identify when a range has shifted and automatically calculate the new range. This issue again points to the difficulty of trying to innovate within and across a research environment built to be a walled garden, as well as the ephemerality and fragility of our digital research ecosystem.
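One possible shape for such a solution is sketched below. It is purely illustrative and not something the project has implemented. It assumes that the text of an excerpt's first page was saved at curation time, that the current page texts of the volume are available as a list in digital page order (numbered from 1), and that fuzzy matching is needed because a rescan can also change the OCR text. The function names and the 0.8 threshold are arbitrary choices for the example.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so small OCR differences matter less."""
    return " ".join(text.lower().split())


def find_shift(
    stored_first_page: str,
    current_pages: List[str],
    expected_start: int,
    threshold: float = 0.8,
) -> Optional[int]:
    """Return how many pages the excerpt start has moved, or None.

    Compares the saved first-page text against every page of the
    current volume and picks the most similar one. None means no page
    was similar enough, so a person should check the excerpt by hand.
    """
    target = normalize(stored_first_page)
    best_ratio, best_index = 0.0, None
    for index, page_text in enumerate(current_pages, start=1):
        ratio = SequenceMatcher(None, target, normalize(page_text)).ratio()
        if ratio > best_ratio:
            best_ratio, best_index = ratio, index
    if best_index is None or best_ratio < threshold:
        return None
    return best_index - expected_start


def shift_range(start: int, end: int, offset: int) -> Tuple[int, int]:
    """Apply the detected offset to a continuous digital page range."""
    return start + offset, end + offset
```

An offset of 0 means the recorded range still points at the right page. A nonzero offset could be applied with `shift_range` and then flagged for review, since a rescan might also insert or drop pages in the middle of an excerpt.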