Streamlining Search Results with Clusters

Selena Hostetler ('23), Mary Naydan, Rebecca Sutton Koeser

Edited by Meredith Martin

Originally published April 25, 2023. Updated August 30, 2024.

We’re thrilled to announce the implementation of a long-desired feature with the v3.9 release: the clustering, and default collapsing, of reprints and multiple editions. This feature enhances the user’s search experience and the search functionality of the site by substantially decreasing the number of repetitive search results.

The PPA has always included reprints and multiple editions of some works, to indicate their popularity and to support the analysis of changes in specific prosodic texts over time; however, this redundancy in the text often generated highly repetitive search results (the 100+ volumes of Hugh Blair’s Lectures on Rhetoric were a frequent culprit). We didn’t want to get rid of these semi-duplicate texts because the reprints are valuable and were intentionally curated to be included in the PPA collection; however, we needed to find a way to improve the usability of the search interface and avoid forcing users to scroll through pages of identical search results to find distinct matches.

Our solution, then, was to group the reprints/multiple editions and collapse them by default in the main search, but provide access to search within that set of texts. This change to the interface allows users to find more varied search results more quickly, but preserves the option for book historians to search within a particular set of reprints or editions (a “cluster”) for minor variations.

“Cluster” is our term for a group of works that are more or less identical because they are a reprint of a work or a new (but not substantially revised or different) edition. Of course, the “clusters” in the PPA are necessarily incomplete. Unlike WorldCat, we don’t have every known copy of every book in the PPA; nor would we want every copy. But with the new book stack icon designed by Gissoo Doroudian that you’ll see as you search or browse the Archive’s contents, you can easily distinguish between works with reprints/multiple editions and single works at a glance. This icon, and the link to search within the cluster, are available on both the main search results page and on the individual volume view; so, if you go directly to a work’s detail page and see this icon, you’ll know it’s a part of a cluster and can search within that cluster.

book stack icon indicating search and browse within cluster

New book stack icon indicating clustered works.

From a technical standpoint, adding support for storing and importing curated clusters was not too difficult, since functionally they are not that different from collections. However, implementing the collapsed keyword search results was a little tricky because it required multiple layers of collapsing on the backend. Since the PPA code indexes each work’s metadata and page text separately, the search was already collapsing these in order to display volumes with their associated pages as a single result. And now we had to figure out how to add another layer of collapsing (and expanding) the search results themselves.

We came up with a solution to collapse all results from a single cluster on the main search page. We then adjusted the logic we had used to retrieve individual pages so that we can display the most relevant work from a collapsed cluster and display matching pages from that work in the collapsed results, rather than the most relevant pages from anywhere in the cluster. (Individual works not assigned to a cluster have only their metadata and pages collapsed as usual.) Once you switch to searching within a single cluster of works, the search functions almost exactly as the main PPA search did before we implemented this new feature: only page and metadata results are grouped, so you can see all results within a cluster at a glance.
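The second layer of collapsing can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the PPA's actual search code: the `WorkResult` fields, the function name `collapse_by_cluster`, and the sample IDs and scores are all hypothetical, and the input is assumed to be works whose metadata and page matches have already been grouped (the first layer).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkResult:
    """One work with its matching pages already grouped (first layer)."""
    work_id: str
    score: float                       # relevance of this work to the query
    cluster_id: Optional[str] = None   # None for works outside any cluster
    pages: list = field(default_factory=list)


def collapse_by_cluster(results):
    """Second layer: keep only the most relevant work from each cluster."""
    best = {}
    for work in results:
        # Unclustered works get a key of their own, so they never collapse.
        if work.cluster_id is not None:
            key = ("cluster", work.cluster_id)
        else:
            key = ("work", work.work_id)
        if key not in best or work.score > best[key].score:
            best[key] = work
    # Present the surviving results in descending order of relevance.
    return sorted(best.values(), key=lambda w: w.score, reverse=True)


# Hypothetical sample data: two works in one cluster, one unclustered work.
results = [
    WorkResult("work-a", 2.5, "cluster-1", ["page 12", "page 40"]),
    WorkResult("work-b", 3.1, "cluster-1", ["page 14"]),
    WorkResult("work-c", 1.7, None, ["page 3"]),
]
collapsed = collapse_by_cluster(results)
```

In this sketch, the collapsed result for `cluster-1` is `work-b`, and the pages shown with it are that work's own matches rather than the best pages from anywhere in the cluster.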

These clusters have been carefully and manually curated by Selena Hostetler ’23. While a limited amount of automation was possible, manual curation was necessary because identical multiple editions/reprints are not always published under the exact same title. Moreover, two works might share identical title and author metadata but have substantially different contents (such as the addition of an entirely new chapter, or a new introduction). We decided to keep substantially revised editions separate because we didn’t want to inadvertently hide relevant search results. However, the question of what exactly warranted an edition being “substantially revised” was a tricky one. Read on for Selena’s first-person account of the process she undertook, as well as the key decisions she and the project team made to determine what counted as a clustered work, and what counted as a unique text.


The data curation process began with one massive spreadsheet listing all the titles currently held in the PPA, sorted by author name so that all of an author’s works, and any potential clusters, were grouped together. The first step was to identify these clusters by finding two or more instances of a particular title by a particular author. The PPA has hundreds of titles that appear only once, so these could be skipped over. Most clusters are small (2-5 titles), but some are as large as 20 or 40 titles. For small clusters, if the title, publication date, and publisher all matched exactly in the spreadsheet data, it was safe to assume that they were all identical. However, if there were differences in these columns, a closer investigation was conducted. Typically, this meant opening all the documents in the cluster in HathiTrust and/or Gale to compare the title pages.
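The first pass over the spreadsheet could be approximated with a short script like the one below. It is only a sketch: the column names (`source_id`, `author`, `title`, `pub_date`, `publisher`), the function name, and the sample rows are assumptions for illustration, and the actual curation was done by hand. Because identical reprints are not always published under exactly the same title, exact matching like this could only ever suggest candidates for manual review.

```python
from collections import defaultdict


def find_candidate_clusters(rows):
    """Group rows by (author, title) and flag groups needing manual review."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["author"], row["title"])].append(row)

    candidates = {}
    for key, members in groups.items():
        if len(members) < 2:
            continue  # a title held only once cannot form a cluster
        # If date and publisher match exactly, the group is assumed identical;
        # any difference means the title pages should be compared by hand.
        imprints = {(m["pub_date"], m["publisher"]) for m in members}
        candidates[key] = {
            "members": [m["source_id"] for m in members],
            "needs_review": len(imprints) > 1,
        }
    return candidates


# Hypothetical rows standing in for the spreadsheet.
rows = [
    {"source_id": "doc-1", "author": "Author A", "title": "Title X",
     "pub_date": "1820", "publisher": "Publisher P"},
    {"source_id": "doc-2", "author": "Author A", "title": "Title X",
     "pub_date": "1820", "publisher": "Publisher P"},
    {"source_id": "doc-3", "author": "Author A", "title": "Title Y",
     "pub_date": "1807", "publisher": "Publisher Q"},
    {"source_id": "doc-4", "author": "Author B", "title": "Title Z",
     "pub_date": "1830", "publisher": "Publisher P"},
    {"source_id": "doc-5", "author": "Author B", "title": "Title Z",
     "pub_date": "1835", "publisher": "Publisher R"},
]
candidates = find_candidate_clusters(rows)
```

Here `Title X` would be clustered directly, `Title Z` would be flagged for a closer look at its title pages, and `Title Y` would be skipped because it appears only once.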

Title pages proved remarkably useful documents for the sorting process. At a glance, I could compare the titles, publishers, and publication dates, and look for any information about a text being a second edition, revised edition, etc. Multiple editions or alternate publications of the same title were still counted as part of the cluster. The only time a document was not included in the same cluster was if the title page was markedly different from the rest of the cluster, or if the document included a notable addition to or subtraction from the original text: for example, a lengthy new preface, an appended essay, or the removal of a chapter. In these cases, the benefit of streamlining searches had to be balanced with a user’s ability to access all the significant versions of a title and not lose extra matter (which often contained relevant prosodic material) to a collapsed search result.

At this point, the document comparison was typically complete, and I could confidently create a unique alphanumeric ID (uniqueID) for the cluster and label any multiple editions. However, if I still had doubts about the similarity of two documents, I would move to the table of contents. If the TOCs matched, I could be sure that two documents were the same and should be included in a cluster. If the TOCs were noticeably different, I would keep the documents separate.

Finally, the uniqueID process also accounted for texts that were part of a multi-volume set. Because we wanted each volume to appear in search results, all the volumes of a set needed to be assigned to individual clusters with their own IDs. Typically, this was as simple as verifying which volume number each document was, and then adding that number to the cluster ID. However, some especially popular texts in the archive were published in different editions with different numbers of volumes. Sometimes a text was published as a 2-volume set, then as a 3-volume set, and perhaps later as a “complete in one volume” edition. Each volume in each set had to be categorized separately; for example, volume 2 in the 2-volume set and volume 2 in the 3-volume set could not be collapsed into the same cluster, because they would contain different sections of the complete collection.
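One way to picture the multi-volume rule is as an ID that encodes both the size of the set and the volume number. The format below is purely hypothetical (the actual cluster IDs in the PPA may look quite different), as are the function name `cluster_id` and the base ID `blair-lectures`; the sketch only shows why volume 2 of a 2-volume set and volume 2 of a 3-volume set end up in different clusters.

```python
from typing import Optional


def cluster_id(base: str, volume: Optional[int] = None,
               set_size: Optional[int] = None) -> str:
    """Build a cluster ID that keeps volumes from different sets apart."""
    if volume is None:
        return base  # work not treated as part of a multi-volume set
    if set_size is None:
        raise ValueError("set_size is required when a volume is given")
    if not 1 <= volume <= set_size:
        raise ValueError("volume must be between 1 and set_size")
    # Hypothetical format: <base>-<set size>vol-v<volume number>
    return f"{base}-{set_size}vol-v{volume}"


# The same volume number in sets of different sizes yields different clusters.
two_vol = cluster_id("blair-lectures", volume=2, set_size=2)
three_vol = cluster_id("blair-lectures", volume=2, set_size=3)
one_vol = cluster_id("blair-lectures", volume=1, set_size=1)
```

Under this scheme a “complete in one volume” edition is simply a set of size one, so it also receives an ID distinct from any volume of the larger sets.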

I will illustrate this process with one of the more complicated examples from the sorting process: Blair’s Lectures on Rhetoric and Belles Lettres. There were over 100 different documents with this title in the PPA, in many different versions: a 2-volume set, a 3-volume set, a single-volume edition, abridged editions, and variations on each of these.

Below is the title page of a volume from the 3-volume edition, one of the most common versions of this text. Moving from the top to the bottom of this title page, I confirmed it was the correct text, by the correct author. I also noted that it is part of a 3-volume set, that it’s a “new edition,” and that it is volume #2.

Title page of Hugh Blair's Lectures on Rhetoric and Belles Lettres, volume 2 of 3, new edition.

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 3, A New Edition. Printed for W. Sharpe, 1820. https://prosody.princeton.edu/archive/hvd.hxjgcj/

I then compared this title page to the title page of another document in the potential cluster from the spreadsheet. The image below shows that it is also volume 2, but part of a 2-volume set; the document is also a multiple edition, but a different one than the previous document.

Title page of Hugh Blair's Lectures on Rhetoric and Belles Lettres, volume 2 of 2, fourth American edition.

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 2, Fourth American Edition. Printed by Thomas Kirk, 1807. https://prosody.princeton.edu/archive/nyp.33433082515531/

To be sure that I should not collapse these two volume 2s, I checked both tables of contents. In the images below, it is obvious that the contents of each volume are different, although some of the chapters do eventually overlap. (The TOC from the 3-volume set is on the left; the TOC from the 2-volume set is on the right.)

Contents page for Hugh Blair, Lectures on Rhetoric and Belles Lettres, volume 2 of 3, new edition.

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 3, A New Edition. Printed for W. Sharpe, 1820. https://prosody.princeton.edu/archive/hvd.hxjgcj/

Contents page of Hugh Blair, Lectures on Rhetoric and Belles Lettres. Volume 2 of 2, Fourth American Edition

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 2, Fourth American Edition. Printed by Thomas Kirk, 1807. https://prosody.princeton.edu/archive/nyp.33433082515531/

Checking other title pages, I also found some that said “complete in one volume,” which meant they needed to be their own cluster. Others did not say “one volume” on the title page, but their tables of contents showed that they contained the same chapters as the complete-in-one-volume texts.

Title page of Hugh Blair, Lectures on Rhetoric and Belles Lettres, Complete in One Volume

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Complete in One Volume. London: Charles Daly, 1839. https://prosody.princeton.edu/archive/nyp.33433082515671/

Some title pages also had additional information. For example, the one on the left below contains the standard Lectures and some essays and questions written by a second author. Since it had a significant amount of new content, I gave it its own ID. Finally, there were also several abridged versions, which looked like the example below on the right.

Title page of Hugh Blair and Abraham Mills, Lectures on Rhetoric and Belles Lettres

Blair, Hugh, and Abraham Mills. Lectures on Rhetoric and Belles Lettres. Hayes & Zell, 1856. https://prosody.princeton.edu/archive/hvd.32044102846094/

Title page of Hugh Blair, Abridgment of Lectures on Rhetoric

Blair, Hugh. Abridgment of Lectures on Rhetoric. Revised and corrected. New Brunswick: Lewis Deare, 1813. https://prosody.princeton.edu/archive/njp.32101072898206/

These images do not represent all of the unique scenarios I found when sorting the Blair texts. But Blair was an outlier; a typical cluster contains about 4 titles and only a couple of multiple editions, if any. However, this group of texts effectively illustrates the criteria for separating titles into clusters and demonstrates the practicality of the search-collapse feature—what began as a list of over one hundred individual documents can now be collapsed into just a handful of clusters, reducing repetitive search results while also making the unique versions of this popular text available at a glance.