Streamlining Search Results with Clusters

Selena Hostetler ('23), Mary Naydan

We’re thrilled to announce the implementation of a long-desired feature with the 3.9-0 release: the clustering, and default collapsing, of reprints. This feature improves the search experience by substantially reducing the number of repetitive search results.

The PPA integrates multiple editions and reprints of works as a non-comprehensive indication of a work’s popularity and circulation; for this reason, it is important to preserve the reprints in the Archive. However, keyword searches would often yield five or ten results from what is more or less the same text, so the user had to scroll past many near-duplicates to reach a new result. Our solution was to collapse multiple editions/reprints into clusters, allowing the user to see more variety in their search results. And if users are interested in the minor changes across editions/reprints, they can search within a cluster by clicking on the new book stack icon on clustered results.

book stack icon indicating search and browse within cluster

New book stack icon indicating clustered works.

Because editions/reprints are not always identical (and they are not always published under the exact same title), this feature required substantial research and data work by our PPA intern, Selena Hostetler ’23. Read on for Selena’s account of the process she undertook, as well as key decisions she and the project team made to determine what counted as a reprint, and what counted as a unique text.

Due to messy metadata from HathiTrust and Gale as well as complex publishing histories, all of the texts in the PPA had to be hand-sorted into clusters that could be collapsed together in a search. The process began with one massive spreadsheet listing all the titles currently held in the PPA, sorted by author name so that all of an author’s works, and any potential clusters, were grouped together. The first step was to identify these clusters by finding two or more copies of a particular title by a particular author. Hundreds of titles in the PPA appear only once, so these could be skipped over. Most clusters are small (2-5 titles), but some are as large as 20 or 40 titles. For small clusters, if the title, publication date, and publisher all matched exactly in the spreadsheet data, it was safe to assume that the documents were identical. However, if there were differences in these columns, I investigated more closely. Typically, this meant opening all the documents in the cluster in HathiTrust and/or Gale to compare the title pages.
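The first pass over the spreadsheet can be sketched in a few lines of Python. This is only an illustration of the triage described above, not the tooling actually used (the sorting was done by hand): the column names, the `normalize` helper, and the sample rows are all invented for the example, and real titles often differ too much for exact matching to find every cluster.

```python
from collections import defaultdict

# Invented sample rows; the real spreadsheet's columns may be named differently.
rows = [
    {"id": "d1", "author": "Example, Ann", "title": "A Treatise on Metre",
     "publisher": "Smith & Co.", "pub_date": "1850"},
    {"id": "d2", "author": "Example, Ann", "title": "A Treatise on Metre",
     "publisher": "Smith & Co.", "pub_date": "1850"},
    {"id": "d3", "author": "Example, Ann", "title": "Essays on Verse",
     "publisher": "Smith & Co.", "pub_date": "1852"},
    {"id": "d4", "author": "Sample, Ben", "title": "The Art of Rhyme",
     "publisher": "Jones", "pub_date": "1801"},
    {"id": "d5", "author": "Sample, Ben", "title": "The Art of Rhyme",
     "publisher": "Brown", "pub_date": "1810"},
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not split a group."""
    return " ".join(text.lower().split())


def candidate_clusters(rows):
    """Group rows by (author, title) and keep only groups with two or more documents."""
    groups = defaultdict(list)
    for row in rows:
        groups[(normalize(row["author"]), normalize(row["title"]))].append(row)
    return {key: docs for key, docs in groups.items() if len(docs) >= 2}


def needs_review(docs):
    """True when publisher or date differ, i.e. the title pages should be compared by hand."""
    return len({(d["publisher"], d["pub_date"]) for d in docs}) > 1


for key, docs in candidate_clusters(rows).items():
    status = "compare title pages" if needs_review(docs) else "assume identical"
    print(key, [d["id"] for d in docs], status)
```

Titles that appear only once (like the invented "Essays on Verse") drop out of the candidate list, which mirrors skipping over unique titles in the spreadsheet.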

Title pages proved remarkably useful documents for the sorting process. At a glance, I could compare the titles, publishers, and publication dates, and look for any indication that a text was a second edition, revised edition, etc. Multiple editions or alternate publications of the same title were still counted as part of the cluster. The only time a document was not included in the same cluster was if the title page was markedly different from the rest of the cluster, or if the document included a notable addition to or subtraction from the original text, such as a lengthy new preface, an appended essay, or the removal of a chapter. In these cases, the benefit of streamlining searches had to be balanced with a user’s ability to access all the significant versions of a title and not lose extra matter (which often contained relevant prosodic material) to a collapsed search result.

At this point, the document comparison was typically complete, and I could confidently create a unique alphanumeric ID (uniqueID) for the cluster and label any multiple editions. However, if I still had doubts about the similarity of two documents, I would move to the table of contents. If the TOCs matched, I could be sure that two documents were the same and should be included in a cluster. If the TOCs were noticeably different, I would keep the documents separate.
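The order of these checks can be written down as a small decision function. It is a sketch only: every judgment it takes as input (whether a title page is "markedly different," whether an addition is "notable") was made by a person looking at the documents, and the `Comparison` record and its field names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Hand-recorded observations about one document versus the rest of a cluster."""
    title_page_markedly_different: bool
    notable_addition_or_removal: bool
    still_in_doubt: bool = False
    tocs_match: Optional[bool] = None  # only filled in when the TOCs were checked


def belongs_in_cluster(c: Comparison) -> bool:
    # A very different title page or significant added/removed matter keeps it separate.
    if c.title_page_markedly_different or c.notable_addition_or_removal:
        return False
    # Remaining doubt is settled by the tables of contents.
    if c.still_in_doubt:
        if c.tocs_match is None:
            raise ValueError("compare the tables of contents before deciding")
        return c.tocs_match
    return True


print(belongs_in_cluster(Comparison(False, False)))
print(belongs_in_cluster(Comparison(False, False, still_in_doubt=True, tocs_match=False)))
```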

Finally, the uniqueID process also accounted for texts that were part of a multi-volume set. Because we wanted each volume to appear in search results, all the volumes of a set needed to be assigned to individual clusters with their own IDs. Typically, this was as simple as verifying which volume number each document was, and then adding that number to the cluster ID. However, some especially popular texts in the archive were published in different editions with different numbers of volumes. Sometimes a text was published as a 2-volume set, then as a 3-volume set, and perhaps later as a “complete in one volume” edition. Each volume in each set had to be categorized separately; for example, volume 2 in the 2-volume set and volume 2 in the 3-volume set could not be collapsed into the same cluster, because they would contain different sections of the complete collection.
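To make the volume rule concrete, here is one way an ID could encode both the volume number and the size of the set it belongs to. The format below (and the base ID `blairlectures`) is made up for this example; it is not the PPA's actual ID scheme, only a demonstration of why set size has to be part of the ID.

```python
from typing import Optional


def cluster_id(base_id: str, volume: Optional[int] = None,
               set_size: Optional[int] = None) -> str:
    """Build a hypothetical cluster ID from a base ID, a volume number, and a set size."""
    if volume is None:
        return base_id  # single-volume work: the base ID is enough
    if set_size is None:
        return f"{base_id}-v{volume}"
    if not 1 <= volume <= set_size:
        raise ValueError("volume number must fall within the set")
    return f"{base_id}-{set_size}vol-v{volume}"


# Volume 2 of a 3-volume set and volume 2 of a 2-volume set get different IDs,
# so they can never be collapsed into the same cluster.
print(cluster_id("blairlectures", volume=2, set_size=3))  # blairlectures-3vol-v2
print(cluster_id("blairlectures", volume=2, set_size=2))  # blairlectures-2vol-v2
```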

I will illustrate this process with one of the more complicated examples from the sorting process: Blair’s Lectures on Rhetoric and Belles Lettres. There were over 100 different documents with this title in the PPA, in many different versions: a 2-volume set, a 3-volume set, a single-volume edition, abridged editions, and variations on each of these.

Below is the title page of a volume from the 3-volume edition, one of the most common versions of this text. Moving from the top to the bottom of this title page, I confirmed it was the correct text, by the correct author. I also noted that it is part of a 3-volume set, that it’s a “new edition,” and that it is volume #2.

Title page of Hugh Blair's Lectures on Rhetoric and Belles Lettres, volume 2 of 3, new edition.

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 3, A New Edition. Printed for W. Sharpe, 1820.

I then compared this title page to the title page of another document in the potential cluster from the spreadsheet. The image below shows that it is also volume 2, but part of a 2-volume set; the document is also a multiple edition, but a different one than the previous document.

Title page of Hugh Blair's Lectures on Rhetoric and Belles Lettres, volume 2 of 2, fourth American edition.

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 2, Fourth American Edition. Printed by Thomas Kirk, 1807.

To be sure that I should not collapse these two volume 2s, I checked both tables of contents. In the images below, it is obvious that the contents of each volume are different, although some of the chapters do eventually overlap. (The TOC from the 3-volume set is on the left; the TOC from the 2-volume set is on the right.)

Contents page for Hugh Blair, Lectures on Rhetoric and Belles Lettres, volume 2 of 3, new edition.

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 3, A New Edition. Printed for W. Sharpe, 1820.

Contents page of Hugh Blair, Lectures on Rhetoric and Belles Lettres. Volume 2 of 2, Fourth American Edition

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Volume 2 of 2, Fourth American Edition. Printed by Thomas Kirk, 1807.

Checking other title pages, I also found some that said “complete in one volume,” which meant they needed to be their own cluster. Others did not say “one volume” on the cover, but their tables of contents showed that they contained the same chapters as the complete single-volume texts.

Title page of Hugh Blair, Lectures on Rhetoric and Belles Lettres, Complete in One Volume

Blair, Hugh. Lectures on Rhetoric and Belles Lettres. Complete in One Volume. London: Charles Daly, 1839.

Some title pages also had additional information. For example, the one on the left below contains the standard Lectures and some essays and questions written by a second author. Since it had a significant amount of new content, I gave it its own ID. Finally, there were also several abridged versions, which looked like the example below on the right.

Title page of Hugh Blair and Abraham Mills, Lectures on Rhetoric and Belles Lettres

Blair, Hugh, and Abraham Mills. Lectures on Rhetoric and Belles Lettres. Hayes & Zell, 1856.

Title page of Hugh Blair, Abridgment of Lectures on Rhetoric

Blair, Hugh. Abridgment of Lectures on Rhetoric. Revised and corrected. New Brunswick: Lewis Deare, 1813.

These images do not represent all of the unique scenarios I found when sorting the Blair texts. But Blair was an outlier; a typical cluster contains about 4 titles and only a couple of multiple editions, if any. However, this group of texts effectively illustrates the criteria for separating titles into clusters and demonstrates the practicality of the search-collapse feature—what began as a list of over one hundred individual documents can now be collapsed into just a handful of clusters, reducing repetitive search results while also making the unique versions of this popular text available at a glance.
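For readers curious about what collapsing means in practice, the idea can be sketched as keeping one representative hit per cluster while preserving the ranking. This is a generic illustration, not a description of how the PPA site implements the feature; the `id` and `cluster_id` field names and the sample hits are assumptions made for the example.

```python
def collapse_by_cluster(results):
    """Keep the first-ranked hit for each cluster and count how many hits it stands for."""
    collapsed = {}  # dicts preserve insertion order, so the ranking survives
    for hit in results:
        cluster = hit.get("cluster_id")
        # Unclustered documents stand alone, keyed by their own ID.
        key = ("cluster", cluster) if cluster else ("doc", hit["id"])
        if key in collapsed:
            collapsed[key]["cluster_size"] += 1
        else:
            collapsed[key] = {"representative": hit, "cluster_size": 1}
    return list(collapsed.values())


# Invented search hits, already in ranked order.
hits = [
    {"id": "d1", "cluster_id": "c1"},
    {"id": "d2", "cluster_id": "c1"},
    {"id": "d3", "cluster_id": None},
    {"id": "d4", "cluster_id": "c2"},
    {"id": "d5", "cluster_id": "c1"},
]

for entry in collapse_by_cluster(hits):
    print(entry["representative"]["id"], entry["cluster_size"])
```

Five hits become three results, and the cluster size is what would let an interface offer a "search within cluster" link on the collapsed entry.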