Visualizing the Collections

Rebecca Sutton Koeser

Disclaimer: All visualizations and numerical data were generated from Princeton Prosody Archive (PPA) v.3.2.4 and were accurate at the time of writing, which spanned several months. As the PPA project team continues to add new items and make new discoveries, these numbers will inevitably change. Collection curation and refinement is an ongoing project. The PPA team hopes to write more about how the visualizations discussed in this essay helped with this curation process.

I started with a simple idea: visualize the Princeton Prosody Archive (PPA) collections to show their relative size and overlap. When the PPA project team initially approached the CDH Development and Design Team about contributing editorial content to the site, I thought this would be a valuable contribution that would help people get a sense of the materials. As we were still developing the project, I did some early experimentation to create a draft diagram based on preliminary numbers, when there were fewer collections. When I returned to the task after the project team had finalized that phase of the collection curation work — now including seven collections instead of six, due to the addition of Music — I discovered that generating accurate Venn diagrams of more than three items is difficult (if not impossible) to do. Very few tools that can generate Venn diagrams will handle more than four or five sets, and accurately reflecting the relationships among that many sets may well be geometrically impossible. Thus began my quest to learn more about visualizing overlapping sets in order to meaningfully display the collections within the PPA.

Relative sizes

One way to represent the collections is a bubble chart [1], which is a proportional area diagram. The area of each circle is proportional to some metric, in this case the number of items in a collection, which allows you to discern relative sizes. Since this just represents numeric counts, I could also use a simple bar chart. I chose a bubble chart in part because, due to my initial interest in using Venn diagrams, I was already envisioning the collections as circles; the bubble chart looks similar to a Venn diagram, but without any overlap. However, I think the bubble chart is a better choice here because a bar chart requires imposing a sequence (for example, sorting the collections by name or count), and for this data that order would be arbitrary or possibly privilege something that isn’t significant, such as size. Although, of course, people might wrongly interpret the proximity of collections in a bubble chart as indicating some kind of relationship, when it does not [2].

Bubble chart of PPA collections showing relative sizes. Generated with d3.js.

This bubble chart shows that the Literary and Linguistic collections are the largest and very similar in size; Original Bibliography is the next largest but much smaller; the rest decrease in size, with Dictionaries as the smallest.

One problem with this chart is that it gives an inflated sense of the PPA as a whole. Since numerous items are in multiple collections, they are represented more than once. For comparison, look at a bubble chart that includes the whole of the PPA among the collections. You can see that the Literary and Linguistic collections each represent a sizable proportion of the PPA, and that all the bubbles combined are much larger than the entirety of the PPA.

Bubble chart of the PPA and the collections, showing relative sizes. Generated with d3.js.

This bubble chart shows that the PPA as a whole is smaller than the combination of the other collections, and even smaller than just the Literary and Linguistic collections together.
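The double counting behind this inflation can be illustrated with a small sketch. The item identifiers and memberships below are invented for illustration and are not PPA data. The point is only that summing collection sizes counts a shared item once per collection, while the size of the archive is the size of the union.

```python
# Toy data: hypothetical item ids, not real PPA records.
collections = {
    "Literary": {"a", "b", "c", "d"},
    "Linguistic": {"c", "d", "e", "f"},
    "Original Bibliography": {"a", "b", "c"},
}


def summed_size(collections):
    """Combined size of all bubbles: each item counted once per collection."""
    return sum(len(items) for items in collections.values())


def archive_size(collections):
    """Size of the whole archive: each item counted exactly once."""
    return len(set().union(*collections.values()))


print(summed_size(collections), archive_size(collections))
```

With this toy data the summed size is nearly double the archive size, which is the same effect the second bubble chart makes visible.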

Another solution for displaying relative sizes is a treemap diagram. Like the bubble chart, it uses area to represent some metric, such as the number or sizes of items. Unlike the bubble chart, it presumes a single hierarchy; this is because it was invented to visualize disk space utilization, and files on a hard drive can exist only in one location. I knew from the outset that this wouldn’t be a good fit for the highly overlapping PPA collections, but I thought that it might not work in interesting ways. I was surprised to discover how difficult it was to come up with numbers for the collections and their relative overlap in a way that would let me generate a treemap. The sizes of regions within a treemap are usually calculated by aggregating all the groups included in that region; in the case of the PPA, the overlap between collections means that the total for each section and the overall total are vastly inflated (7,218 total items in the PPA instead of the 4,838 items that were included in the PPA at the time I generated this diagram). I also had to decide how to represent the number of items in a collection that are not also in other collections; my initial approach was to take the total number of items in a collection and subtract the totals of items in that collection that were also in other collections; however, this actually led to negative numbers in some cases because of the amount of overlap between the collections! I finally decided to use the number of items that are only in a single collection and no others [3]. For this diagram, I went only one level deep in the nested collections because I couldn’t figure out a good way to generate the numbers, and also because I knew that the problem of duplicated counts would only be compounded [4].
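The negative numbers are easy to reproduce with toy data. The sketch below is one reading of the subtraction approach described above, not the code used to build the treemap, and the item ids are hypothetical. It compares that subtraction with the count that was finally used: items that belong to one collection and no other.

```python
# Toy data: hypothetical item ids, not real PPA records.
collections = {
    "Literary": {1, 2, 3, 4, 5},
    "Linguistic": {4, 5, 6, 7},
    "Music": {4, 5},
}


def naive_remainder(name, collections):
    """Collection total minus its overlap with each other collection.

    An item shared with several collections is subtracted once per
    overlap, so the result can drop below zero.
    """
    own = collections[name]
    overlaps = sum(
        len(own & other)
        for other_name, other in collections.items()
        if other_name != name
    )
    return len(own) - overlaps


def exclusive_count(name, collections):
    """Number of items that are in this collection and in no other."""
    others = set().union(
        *(items for other_name, items in collections.items() if other_name != name)
    )
    return len(collections[name] - others)


for name in collections:
    print(name, naive_remainder(name, collections), exclusive_count(name, collections))
```

Here both Music items are also Literary and Linguistic, so the subtraction removes them twice and goes negative, while the exclusive count correctly reports that no item is only in Music.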

This treemap is confusing because of the way that the collections are duplicated in every section of the diagram. However, it does give some idea of the relative sizes of the collections. The most effective thing about this treemap is the “zooming” implementation I’ve used, which allows you to see the breakdown of items within a collection. The total numbers are inaccurate, since there are overlaps and additional nesting not represented here, but it does give some sense of how items are distributed within each collection.

Zoomable treemap

Zoomable treemap diagram of Princeton Prosody Archive collections. Only includes one level of nested collections. Select a section to zoom in; tap or click the top label to zoom back out. Adapted from d3.js Zoomable Treemap.

Venn diagrams and other alternatives

My first instinct for visualizing the PPA collections was to use a Venn diagram [5]. Like the bubble chart, a Venn diagram uses circles that are sized in proportion to some metric in the data such as a count. Unlike a bubble chart, circles in a Venn diagram overlap in order to convey the relationships between the sets. As I discovered after investigation, Venn diagrams are typically only used to display two or three sets. There are ways to display more sets, but they quickly become unwieldy and difficult to read; and very few software tools for generating Venn diagrams will even attempt to generate one for more than four sets.

The only tool I could find that would attempt to handle seven sets was Highcharts.js, and this is the closest I got to a Venn diagram for all seven PPA collections. It’s a pretty good result: we can see the relative sizes of the collections and get a sense of the broad strokes of how they overlap. However, it’s misleading and inaccurate because it is geometrically impossible to represent all the relationships among these collections. This diagram entirely misses the thirteen Typographically Unique items included in the Original Bibliography, the overlaps between Music and the Original Bibliography, and the overlaps between Dictionaries and the Literary collection.

Venn diagram of PPA collections

Venn diagram of PPA collections generated with Highcharts.js. Proportions are accurate but collection overlaps are not.

This diagram shows the Literary and Linguistic collections in similar sizes and with some overlap; the Original Bibliography overlaps almost completely with the Literary collection and has only a small overlap with the Linguistic collection. Music and Typographically Unique are shown together with some overlap on the Linguistic side; Dictionaries and Word Lists are shown together with some overlap, and also more on the Linguistic side.

This diagram does show the Literary and Linguistic collections as two major discourses within the PPA that are similar in size and have some overlap, and it also shows that the Original Bibliography is largely, but not completely, Literary.

Once I learned that Venn diagrams were limited to showing relationships between a smaller number of sets, I thought I would try using Venn diagrams for a subset of the collections in the PPA. Project Manager Mary Naydan suggested that Literary, Linguistic, Music, and Typographically Unique might be an interesting group of collections to investigate. The resulting diagrams show that the Music and Typographically Unique collections currently have more overlap with the Linguistic materials than the Literary, but even the four-set Venn diagram fails to represent the overlap between the Music and Typographically Unique collections. The only way to properly show the relationships is to create multiple diagrams representing the collections in sets of three.

Venn diagram of Literary, Linguistic, Music, and Typographically Unique collections. Fails to represent the overlap between Music and Typographically Unique.

Side-by-side Venn diagrams of collection subsets

Venn diagram of Linguistic, Music, and Typographically Unique collections (left) and Literary, Music, and Typographically Unique collections (right).

Not only are Venn diagrams limited in the number of sets; they are also limited to sets that are relatively similar in size. During 2018-2019, the PPA project team undertook data curation work to complete Brogan’s Original Bibliography, tracking down items that are not available in HathiTrust and identifying available versions for reference in a bibliographic dataset and possible future inclusion in the Archive. I thought I could use numbers from their work to reflect the status of the Original Bibliography with regard to the PPA: how much of it is currently included, how many more items are in HathiTrust, and how many items are elsewhere. However, the scale of HathiTrust (over 15.7 million volumes) is so enormous when compared with the fewer than 5,000 items in the PPA that there’s no good way to show both of them at the same time. Either the representation of HathiTrust is so large that you can barely see any curvature (as below), or PPA is so small you can’t make out the relationships.
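A rough calculation shows why the scale is unworkable. In a proportional area diagram the area of a circle tracks the count, so the radius grows with the square root of the count. The sketch below uses the two figures quoted in this essay (15.7 million HathiTrust volumes and the 4,838 PPA items cited for the treemap); treating volumes and items as the same unit is a simplifying assumption, and the function name is only illustrative.

```python
import math


def radius_ratio(larger_count, smaller_count):
    """Ratio of circle radii when circle *area* is proportional to count."""
    return math.sqrt(larger_count / smaller_count)


hathitrust_volumes = 15_700_000  # "over 15.7 million volumes"
ppa_items = 4_838  # PPA size cited earlier in this essay

ratio = radius_ratio(hathitrust_volumes, ppa_items)
print(f"HathiTrust radius is about {ratio:.0f} times the PPA radius")
```

If the PPA circle were drawn with a radius of 10 pixels, the HathiTrust circle would need a radius of roughly 570 pixels, and the Original Bibliography would be smaller still.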

Venn diagram of HathiTrust, PPA, and Original Bibliography

Venn diagram (left) and detail (right) of the relationship between PPA, the Original Bibliography and HathiTrust. HathiTrust is so much larger than PPA that if you zoomed in to see PPA and the Original Bibliography clearly, you would barely see any curvature in the enclosing circle. (PPA and Original Bibliography diagram generated with Highcharts.js and edited in SVG to create a circle for HathiTrust scaled relative to PPA. No Venn diagram generator could handle the differences in scale, and even in some SVG editors it was difficult to work with.)

When HathiTrust is viewed full scale, PPA is a small circle and the Original Bibliography is barely perceptible.

Bubble chart showing relative sizes of HathiTrust, PPA, and the Original Bibliography

Even the simpler bubble chart is ridiculous with the differences in scale among HathiTrust, PPA, and the Original Bibliography.

These diagrams are laughable, and they don’t do much to convey the overlap I had hoped to communicate. They do give a sense of scale, but more importantly they give us an idea of the scholarly labor required to identify, select, and de-duplicate materials to create curated bibliographies (more than mere collections) from the enormous corpus of HathiTrust materials, and also a sense of the corresponding value for scholars working with that smaller set of data.

As I continued my quest to visualize the PPA collections, I discovered an interesting tool called UpSet, which provides “interactive set visualization for more than three sets.” It’s a relatively new tool, based in part on dissertation work completed in 2011 and 2012. UpSet is available in interactive form as an online tool, and there are also implementations for R and Python that will generate static UpSet plots.

UpSet diagram showing PPA collection relative set size and overlaps. Generated with UpSetPlot; collection colors added manually.

An UpSet plot may look unfamiliar and intimidating at first, but if you know how to read a bar chart, you can read an UpSet plot after a brief orientation. The full size of each set (the PPA collections, in this case) is plotted by size in a small horizontal bar chart at the left. The black vertical bars indicate the size of each combination of sets, where the sets included are denoted by circles and bars in the lower portion of the chart. The first group of vertical bars, each marked with only a single dot, represents items that are only in that set and no other.

Once we’ve learned how to read it, this UpSet plot shows us that the Literary and Linguistic collections include a large number of items that are only in those collections, where all the other collections only have a few items that are only in one collection. We also see that by far the largest overlap is between the Original Bibliography and the Literary collection, followed by the overlap between the Original Bibliography and the Linguistic collection – a similar insight to the one we gained from the inaccurate Venn diagram. The UpSet plot is very clear in terms of accuracy and numbers, but lacks the visual potency of a Venn diagram to represent overlaps intuitively through overlapping shapes [6].
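The two kinds of bars in an UpSet plot correspond to two simple tallies, sketched here with invented data. The item ids and memberships are hypothetical, and this is only the counting step, not the plotting code used for the figure: the horizontal bars are the full size of each set, and the vertical bars count the items that belong to exactly one combination of sets.

```python
from collections import Counter

# Toy data: hypothetical item ids mapped to the collections they belong to.
memberships = {
    "item-1": {"Literary"},
    "item-2": {"Literary", "Original Bibliography"},
    "item-3": {"Literary", "Original Bibliography"},
    "item-4": {"Linguistic"},
    "item-5": {"Linguistic", "Music", "Typographically Unique"},
}


def set_sizes(memberships):
    """Full size of each set (the horizontal bars)."""
    return Counter(name for names in memberships.values() for name in names)


def combination_sizes(memberships):
    """Items per exact combination of sets (the vertical bars)."""
    return Counter(frozenset(names) for names in memberships.values())


for combo, count in combination_sizes(memberships).most_common():
    print(sorted(combo), count)
```

Because every item falls into exactly one combination, the vertical bars always add up to the true number of items, which is what makes the plot accurate where the Venn diagram is not.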

An experimental alternative

Because I was dissatisfied with both the Venn/Euler diagram results and the UpSet plot, I created my own experimental visualization of the collections. My approach was inspired in part by the “Playing with Data” workshop series on creative coding and data visualization with p5.js, which was put on as part of the Center for Digital Humanities 2018-2019 Year of Data in partnership with the Council on Science and Technology. In the workshops, we were working toward helping participants replicate the “warming stripes” climate change dataviz. Based on that work, I developed my own visualization of the PPA collections that I hoped would make it easier to identify how many items belong to multiple collections and understand how the collections overlap.

This visualization shows every item in the PPA, ordered randomly. The display is split into rows, where each row represents a single collection. Each item is represented by a vertical slice of the diagram: if an item is a member of a collection, a stripe is drawn in the collection row and color. As a static visualization, it’s visually interesting and maybe somewhat useful, although limited. However, when you interact with it and select a single collection, it will cluster all items in that collection and highlight items in that collection that also belong to other collections. Because items are clustered when you select a single collection, this will also give you a sense of the relative size of that collection.
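The layout logic can be sketched in a few lines. This is a text-only approximation with invented data and hypothetical function names, not the code behind the interactive visualization: each item is a column, each collection is a row, and selecting a collection moves its members to the left while leaving everything else in its existing order.

```python
# Toy data: hypothetical item ids and memberships, not real PPA records.
ROWS = ["Literary", "Linguistic", "Music"]
items = {
    "a": {"Literary"},
    "b": {"Linguistic", "Music"},
    "c": {"Literary", "Linguistic"},
    "d": {"Music"},
    "e": {"Literary"},
}


def column_order(items, selected=None):
    """Order of item columns; members of a selected collection move left."""
    ids = list(items)  # stands in for the random order
    if selected is None:
        return ids
    # sorted() is stable, so items keep their relative order within each group.
    return sorted(ids, key=lambda item_id: selected not in items[item_id])


def render(items, selected=None):
    """One text row per collection: '|' marks membership, '.' marks absence."""
    order = column_order(items, selected)
    return [
        "".join("|" if row in items[item_id] else "." for item_id in order)
        for row in ROWS
    ]


for name, row in zip(ROWS, render(items, selected="Music")):
    print(f"{name:<12}{row}")
```

With Music selected, its two members form a solid block at the left of the Music row, and the stripes directly above that block show which of them are also Literary or Linguistic.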

Princeton Prosody Archive collections. Tab through or click to highlight a single collection.

Like the other visualizations, at first glance this shows the predominance of the Literary and Linguistic collections, followed by the Original Bibliography; the other four collections are much sparser. But where this visualization really shines is allowing you to select a single collection: highlight the Original Bibliography, and you’ll see how strongly it overlaps with the Literary collection, and how it overlaps sparsely with all the others. If you highlight Word Lists or Dictionaries, you’ll see how much smaller they are, and how they overlap with most of the other collections. You can see that Word Lists connects most strongly with the Linguistic collection, but neither Word Lists nor Dictionaries has any overlap with Music. If you select the Linguistic collection, you’ll notice how little it overlaps with Original Bibliography and that, while it overlaps some with the Literary collection, the bulk of the Literary materials are not Linguistic. The interactivity of this visualization makes it possible to isolate a single collection and see how it does or does not overlap with the others, providing opportunities for new insights about the relationships among them.

Conclusion

It turns out to be surprisingly difficult to find visualizations that convey intersecting, overlapping set membership. It’s much more common to find visualizations that display hierarchies. My suspicion is that it’s because many of the more common modern visualizations are used for displaying computer-generated data and code, which is usually conceived of as organized in a single hierarchy. This includes the treemap diagram, which was invented for visualizing file system contents, and many d3.js visualizations of software dependencies [7]. I was interested to discover that, in addition to the UpSet plot, many of the online tools for generating Venn and Euler diagrams seemed to be coming out of molecular biology. This makes me wonder about the state of data visualization in Digital Humanities; the sequential and hierarchical aspects of humanities data are often not the most important or interesting aspects. Creative and evocative visualizations abound both within DH and in the larger worlds of Design and Data Visualization, but they are rarely made available as generalizable new approaches or tools. Where are the new data visualizations for Digital Humanities work?

In the process of exploring these collections, I now have a deeper understanding of the complexity of categorizing the content in the PPA. This Archive exists and is significant in large part due to the shifting meaning of the term prosody over time, encompassing, on the one hand, grammar/phonology, and versification on the other, as Meredith Martin explains in What is Prosody?. The visualizations I generated, in their varying degrees of success, all reflect aspects of these two overlapping discourses in the form of the Linguistic and Literary collections within the PPA.

Most collected items

One thing I discovered as I wrangled the data to generate these different visualizations is the fact that there are a number of items in more than two collections. I included counts for three sets of overlaps [8] to generate the Venn diagrams, and about 3% of the total volumes are in three collections, so the collection overlap represented by those items in three collections should be represented if the layout was able to accommodate it [9]. What I found more interesting was that there are fifteen items that are in at least four collections, and one unusual volume that is in five collections. That unusual volume is the 1832 edition of William Gardiner’s Music of Nature (which I first tripped over by way of the test search term “grasshopper”). All six editions of this work currently in the PPA are included in the Literary, Linguistic, Music, and Typographically Unique collections; the 1832 edition has the additional distinction of being included in the Original Bibliography.

Finding only those items that belong to multiple specified collections isn’t feasible from the Archive search interface [10]; that’s because it’s intentionally designed to find items in any of the collections you select. Otherwise, it could be too easy for people to generate searches with no results. For instance, if both the Music and Typographically Unique collections are selected, the search won’t return only those items that belong to both the Music and Typographically Unique collections. Instead, you’ll get any item that has either of those collection tags, and it might have others, too.
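The difference between matching any selected collection and matching all of them, along with the count-based filter needed to find the most collected items, can be sketched against a CSV export. The column names, the ";" separator, and the rows below are assumptions made for illustration; they are not the format of the actual PPA export.

```python
import csv
import io

# Hypothetical CSV export; column names and the ";" separator are assumptions.
EXPORT = """id,collections
vol-1,Literary;Linguistic;Music;Typographically Unique
vol-2,Music
vol-3,Typographically Unique;Linguistic
vol-4,Music;Typographically Unique
"""


def load_memberships(text):
    """Map each item id to the set of collections listed for it."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: set(row["collections"].split(";")) for row in reader}


def in_any(memberships, selected):
    """Items in at least one selected collection (how the search behaves)."""
    return [i for i, names in memberships.items() if names & selected]


def in_all(memberships, selected):
    """Items that belong to every selected collection."""
    return [i for i, names in memberships.items() if selected <= names]


def in_at_least(memberships, n):
    """Items that belong to n or more collections."""
    return [i for i, names in memberships.items() if len(names) >= n]


memberships = load_memberships(EXPORT)
selected = {"Music", "Typographically Unique"}
print(in_any(memberships, selected))
print(in_all(memberships, selected))
print(in_at_least(memberships, 4))
```

The "any" filter returns every row in this toy export, while the "all" filter narrows it to the two volumes tagged with both collections.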

Since they are not otherwise easy to find, I wanted to highlight these “most collected items” here.

Least collected items

When I was generating numbers for the treemap diagram, I likewise discovered there are some collections with a small number of items that are found only in that collection. As with the most collected items, it isn’t possible to find these from the public Archive search.

Word Lists

Dictionaries

Music

Acknowledgements

Mary Naydan and Meredith Martin read an early draft of this work and gave me guidance and encouragement to pursue it; Mary Naydan provided substantial guidance to shape it, and tremendously helpful editing. Gissoo Doroudian came up with the beautiful, accessible color scheme that enabled me to use consistent colors across all the different visualizations. She also helped with the visuals, provided feedback, and encouraged me to give the UpSet plot its proper due. Ben Hicks generated an earlier version of the UpSet plot with UpSetR and helped me understand how to read it. CDH staff gave me feedback on the collection stripes visualization, and Nick Budak helped me think through making it more accessible.

Diagrams were generated with d3.js and Highcharts.js and exported as SVG, and then edited in Figma for consistent colors and styles.

  1. The term bubble chart may also refer to a bubble plot, which displays proportional bubbles plotted on an axis based on other variables, but that’s not how I’m using it here.
  2. This particular bubble chart was generated with the d3.js Pack Layout, which by default optimizes the layout for the best use of space by starting with the largest bubbles first.
  3. This choice was inspired in part by the UpSet plot. See the next section for more.
  4. This was particularly difficult since I used Solr Pivot Faceting to generate the counts. This is also called “decision tree” faceting and is intended to let you calculate values and compare the results for different combinations of filters; it’s not intended for looking at overlaps within a single hierarchy of filters. The only way I could get a count for items in a single collection was to search for items in that collection and explicitly filter out all the other collections by name.
  5. Technically, a Venn diagram shows all possible logical relations between a collection of sets and is not necessarily proportional. The popular conception of Venn diagrams is perhaps closer to the less familiar Euler diagram. Like a Venn diagram, an Euler diagram is made up of overlapping shapes, usually proportional to the number of elements in the set, but unlike a Venn diagram an Euler diagram shows only the relevant relationships between sets.
  6. An interactive version of UpSet is available online, which allows you to highlight overlaps or select a single set to see how it overlaps with the others, but unfortunately it does not support linking to a custom dataset. To try the interactive version with PPA data: navigate to the UpSet tool, click “Load Data”, use this URL for a previously generated PPA data gist, and hit submit.
  7. For instance, see the Flare visualization toolkit package hierarchy and imports visualized as a radial dendrogram or in a tree layout.
  8. This is due to my use of Solr pivot facets to generate counts; I pivoted only three times.
  9. I found claims that including more than two sets of overlaps doesn’t matter anyway. My understanding is that this is due to a limitation of representing relationships as overlaps in two dimensions; it simply isn’t that precise.
  10. In fact, they’re not all that easy to find through other means. Solr doesn’t support searching for something based on the number of values in a multivalued field, unless you configure special indexing to generate a count as you add items to the index. The only ways I could find these items were by querying the database or filtering a CSV export.