Visualizing the Collections
Disclaimer: All visualizations and numerical data were generated from Princeton Prosody Archive (PPA) v.3.2.4 and were accurate at the time of writing, which spanned several months. As the PPA project team continues to add new items and make new discoveries, these numbers will inevitably change. Collection curation and refinement is an ongoing project. The PPA team hopes to write more about how the visualizations discussed in this essay helped with this curation process.
I started with a simple idea: visualize the Princeton Prosody Archive (PPA) collections to show their relative size and overlap. When the PPA project team initially approached the CDH Development and Design Team about contributing editorial content to the site, I thought this would be a valuable contribution that would help people get a sense of the materials. As we were still developing the project, I did some early experimentation to create a draft diagram based on preliminary numbers, when there were fewer collections. When I returned to the task after the project team had finalized that phase of the collection curation work — now including seven collections instead of six, due to the addition of Music — I discovered that generating accurate Venn diagrams of more than three items is difficult (if not impossible) to do. Very few tools that can generate Venn diagrams will handle more than four or five sets, and accurately reflecting the relationships among that many sets may well be geometrically impossible. Thus began my quest to learn more about visualizing overlapping sets in order to meaningfully display the collections within the PPA.
One problem with this chart is that it gives an inflated sense of the PPA as a whole. Since numerous items are in multiple collections, they are represented more than once. For comparison, look at a bubble chart that includes the whole of the PPA among the collections. You can see that the Literary and Linguistic collections each represent a sizable proportion of the PPA, and that all the bubbles combined are much larger than the entirety of the PPA.
Another solution for displaying relative sizes is a treemap diagram. Like the bubble chart, it uses area to represent some metric, such as the number or sizes of items. Unlike the bubble chart, it presumes a single hierarchy; this is because it was invented to visualize disk space utilization, and files on a hard drive can exist only in one location. I knew from the outset that this wouldn’t be a good fit for the highly overlapping PPA collections, but I thought that it might fail in interesting ways. I was surprised to discover how difficult it was to come up with numbers for the collections and their relative overlap in a way that would let me generate a treemap. The sizes of regions within a treemap are usually calculated by aggregating all the groups included in that region; in the case of the PPA, the overlap between collections means that the total for each section and the overall total are vastly inflated (an apparent total of 7,218 items instead of the 4,838 items that were actually included in the PPA at the time I generated this diagram). I also had to decide how to represent the number of items in a collection that are not also in other collections. My initial approach was to take the total number of items in a collection and subtract the counts of items in that collection that were also in other collections; however, this actually led to negative numbers in some cases because of the amount of overlap between the collections! I finally decided to use the number of items that are in a single collection and no others. For this diagram, I went only one level deep in the nested collections because I couldn’t figure out a good way to generate the numbers, and also because I knew that the problem of duplicated counts would only be compounded.
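To make the counting problem concrete, here is a minimal sketch using a handful of made-up items rather than real PPA data. It assumes the "subtract the overlaps" approach means subtracting each pairwise overlap from a collection's total, which is one plausible reading; the item names and memberships are hypothetical.

```python
# Hypothetical item -> collections mapping; NOT real PPA data.
memberships = {
    "item1": {"Literary", "Original Bibliography"},
    "item2": {"Literary", "Linguistic"},
    "item3": {"Literary", "Linguistic", "Original Bibliography"},
    "item4": {"Linguistic"},
    "item5": {"Literary"},
}

collections = sorted(set().union(*memberships.values()))

# Size of each collection: an item is counted once for every collection it is in.
totals = {c: sum(c in m for m in memberships.values()) for c in collections}

# Summing the collection sizes counts multi-collection items repeatedly.
inflated_total = sum(totals.values())   # 9
distinct_total = len(memberships)       # 5


def overlap(a, b):
    """Number of items that are in both collection a and collection b."""
    return sum(a in m and b in m for m in memberships.values())


# Naive "only in this collection" estimate (assumed reading of the first
# approach): total minus every pairwise overlap. Items in three or more
# collections get subtracted more than once, so this can go negative.
naive_only = {
    c: totals[c] - sum(overlap(c, other) for other in collections if other != c)
    for c in collections
}

# Exclusive count: items whose membership is exactly this one collection.
exclusive = {c: sum(m == {c} for m in memberships.values()) for c in collections}

print("inflated vs distinct:", inflated_total, distinct_total)
print("naive:", naive_only)
print("exclusive:", exclusive)
```

In this toy data the Original Bibliography has two items and pairwise overlaps of two and one, so the naive estimate comes out at -1, while the exclusive count is simply zero.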
This treemap is confusing because of the way that the collections are duplicated in every section of the diagram. However, it does give some idea of the relative sizes of the collections. The most effective thing about this treemap is the “zooming” implementation I’ve used, which allows you to see the breakdown of items within a collection. The total numbers are inaccurate, since there are overlaps and additional nesting not represented here, but it does give some sense of how items are distributed within each collection.
This diagram does show the Literary and Linguistic collections as two major discourses within the PPA that are similar in size and have some overlap, and it also shows that the Original Bibliography is largely, but not completely, Literary.
Once I learned that Venn diagrams were limited to showing relationships among a small number of sets, I thought I would try using Venn diagrams for a subset of the collections in the PPA. Project Manager Mary Naydan suggested that Literary, Linguistic, Music, and Typographically Unique might be an interesting group of collections to investigate. The resulting diagrams show that the Music and Typographically Unique collections currently have more overlap with the Linguistic materials than with the Literary, but even the four-set Venn diagram fails to represent the overlap between the Music and Typographically Unique collections. The only way to properly show the relationships is to create multiple diagrams representing the collections in sets of three.
Not only are Venn diagrams limited in the number of sets; they are also limited to sets that are relatively similar in size. During 2018-2019, the PPA project team undertook data curation work to complete Brogan’s Original Bibliography, tracking down items that are not available in HathiTrust and identifying available versions for reference in a bibliographic dataset and possible future inclusion in the Archive. I thought I could use numbers from their work to reflect the status of the Original Bibliography with regard to the PPA: how much of it is currently included, how many more items are in HathiTrust, and how many items are elsewhere. However, the scale of HathiTrust (over 15.7 million volumes) is so enormous when compared with the fewer than 5,000 items in the PPA that there’s no good way to show both of them at the same time. Either the representation of HathiTrust is so large that you can barely see any curvature (as below), or the PPA is so small you can’t make out the relationships.
These diagrams are laughable, and they don’t do much to convey the overlap I had hoped to communicate. They do give a sense of scale, but more importantly they give us an idea of the scholarly labor required to identify, select, and de-duplicate materials to create curated bibliographies (more than mere collections) from the enormous corpus of HathiTrust materials, and also a sense of the corresponding value for scholars working with that smaller set of data.
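A rough back-of-the-envelope calculation shows why the two cannot share a diagram. This sketch assumes the circles are drawn with area proportional to the number of items (the usual convention for proportional diagrams, though not something stated above), and it uses the rounded figures from the text: 15.7 million as a lower bound for HathiTrust and 5,000 as an upper bound for the PPA.

```python
import math

# Rounded figures from the text; both are bounds, not exact counts.
hathitrust_volumes = 15_700_000   # "over 15.7 million volumes"
ppa_items = 5_000                 # "fewer than 5,000 items"

# If area is proportional to item count, the area ratio is the count ratio...
area_ratio = hathitrust_volumes / ppa_items

# ...and the radius ratio is its square root.
radius_ratio = math.sqrt(area_ratio)

print(f"area ratio: about {area_ratio:.0f} to 1")
print(f"radius ratio: about {radius_ratio:.0f} to 1")
```

Under these assumptions the HathiTrust circle would need a radius roughly 56 times that of the PPA circle, so a PPA circle drawn 10 pixels in radius implies a HathiTrust circle about 560 pixels in radius.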
As I continued my quest to visualize the PPA collections, I discovered an interesting tool called UpSet, which provides “interactive set visualization for more than three sets.” It’s a relatively new tool, based in part on dissertation work completed in 2011 and 2012. UpSet is available in interactive form as an online tool, and there are also implementations for R and Python that will generate static UpSet plots.
An UpSet plot may look unfamiliar and intimidating at first, but if you know how to read a bar chart, you can read an UpSet plot after a brief orientation. The full size of each set (the PPA collections, in this case) is shown in a small horizontal bar chart at the left. The black vertical bars indicate the size of each combination of sets, where the sets included in a combination are denoted by circles and bars in the lower portion of the chart. The first group of vertical bars, each marked with only a single dot, represents items that are in that set and no other.
Once we’ve learned how to read it, this UpSet plot shows us that the Literary and Linguistic collections each include a large number of items that are in that collection and no other, whereas each of the other collections has only a few such items. We also see that by far the largest overlap is between the Original Bibliography and the Literary collection, followed by the overlap between the Original Bibliography and the Linguistic collection, a similar insight to the one we gained from the inaccurate Venn diagram. The UpSet plot is very clear in terms of accuracy and numbers, but it lacks the visual potency of a Venn diagram, which represents overlaps intuitively through overlapping shapes.
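The numbers behind an UpSet plot are simple to compute once each item's collection memberships are known. The following is a minimal sketch in plain Python, not the code used to produce the plot; the memberships are made up for illustration, and the ordering (single-set combinations first, then larger ones) is an assumption modeled on the description above.

```python
from collections import Counter

# Hypothetical memberships, one set of collection names per item; NOT PPA data.
memberships = [
    {"Literary"},
    {"Literary"},
    {"Linguistic"},
    {"Literary", "Original Bibliography"},
    {"Literary", "Original Bibliography"},
    {"Linguistic", "Original Bibliography"},
    {"Literary", "Linguistic"},
]

# Horizontal bars: the full size of each set, overlaps included.
set_sizes = Counter(name for item in memberships for name in item)

# Vertical bars: how many items belong to exactly this combination of sets.
# Each item is counted once, so these counts sum to the number of items.
combo_sizes = Counter(frozenset(item) for item in memberships)


def upset_rows(sizes):
    """Order combinations by how many sets they involve, then by size."""
    return sorted(
        sizes.items(),
        key=lambda pair: (len(pair[0]), -pair[1], sorted(pair[0])),
    )


for combo, count in upset_rows(combo_sizes):
    print(f"{' & '.join(sorted(combo))}: {count}")
```

Because every item falls into exactly one combination, the vertical bars never double-count, which is what makes the plot accurate where the bubble chart and treemap are inflated.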
Like the other visualizations, at first glance this shows the predominance of the Literary and Linguistic collections, followed by the Original Bibliography; the other four collections are much sparser. But where this visualization really shines is allowing you to select a single collection: highlight the Original Bibliography, and you’ll see how strongly it overlaps with the Literary collection, and how it overlaps sparsely with all the others. If you highlight Word Lists or Dictionaries, you’ll see how much smaller they are, and how they overlap with most of the other collections. You can see that Word Lists connects most strongly with the Linguistic collection, but neither Word Lists nor Dictionaries has any overlap with Music. If you select the Linguistic collection, you’ll notice how little it overlaps with Original Bibliography and that, while it overlaps some with the Literary collection, the bulk of the Literary materials are not Linguistic. The interactivity of this visualization makes it possible to isolate a single collection and see how it does or does not overlap with the others, providing opportunities for new insights about the relationships among them.
- The term bubble chart may also refer to a bubble plot, which displays proportional bubbles plotted on an axis based on other variables, but that’s not how I’m using it here.
- This particular bubble chart was generated with the d3.js Pack Layout, which by default optimizes the layout for the best use of space by starting with the largest bubbles first.
- This choice was inspired in part by the UpSet plot. See the next section for more.
- This was particularly difficult since I used Solr Pivot Faceting to generate the counts. This is also called “decision tree” faceting and is intended to let you calculate values and compare the results for different combinations of filters; it’s not intended for looking at overlaps within a single hierarchy of filters. The only way I could get a count for items in a single collection was to search for items in that collection and explicitly filter out all the other collections by name.
- Technically, a Venn diagram shows all possible logical relations between a collection of sets and is not necessarily proportional. The popular conception of Venn diagrams is perhaps closer to the less familiar Euler diagram. Like a Venn diagram, an Euler diagram is made up of overlapping shapes, usually proportional to the number of elements in the set, but unlike a Venn diagram an Euler diagram shows only the relevant relationships between sets.
- An interactive version of UpSet is available online, which allows you to highlight overlaps or select a single set to see how it overlaps with the others, but unfortunately it does not support linking to a custom dataset. To try the interactive version with PPA data: navigate to the UpSet tool, click “Load Data”, use this url for a previously generated PPA data gist, and hit submit.
- For instance, see the Flare visualization toolkit package hierarchy and imports visualized as a radial dendrogram or in a tree layout.
- This is due to my use of Solr pivot facets to generate counts; I pivoted only three times.
- I found claims that including more than two sets of overlaps doesn’t matter anyway. My understanding is that this is due to a limitation of representing relationships as overlaps in two dimensions; it simply isn’t that precise.
- In fact, they’re not all that easy to find through other means. Solr doesn’t support searching for something based on the number of values in a multivalued field, unless you configure special indexing to generate a count as you add items to the index. The only ways I could find these items were by querying the database or filtering a CSV export.
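The two workarounds mentioned in these notes, excluding every other collection by name in a Solr query and counting collections per item in a CSV export, can be sketched roughly as follows. This is an illustration rather than the queries or scripts actually used: the Solr field name `collections`, the CSV column names, the `;` separator, and the sample rows are all assumptions.

```python
import csv
import io

# The seven collection names mentioned in this essay.
COLLECTIONS = [
    "Literary",
    "Linguistic",
    "Original Bibliography",
    "Music",
    "Typographically Unique",
    "Word Lists",
    "Dictionaries",
]


def only_in_query(target, all_collections=COLLECTIONS, field="collections"):
    """Build a Solr query string for items in `target` and no other collection.

    `field` is a hypothetical name for the multivalued collection field.
    """
    clauses = [f'{field}:"{target}"']
    clauses += [
        f'NOT {field}:"{name}"' for name in all_collections if name != target
    ]
    return " AND ".join(clauses)


def rows_with_n_collections(csv_text, n, column="collections", sep=";"):
    """Return CSV rows whose collection column lists exactly n values.

    The column name and separator are assumptions about the export format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if len([value for value in row[column].split(sep) if value.strip()]) == n
    ]


# Made-up export rows for illustration only.
sample = (
    "id,collections\n"
    "a1,Literary\n"
    "a2,Literary;Linguistic\n"
    "a3,Literary;Linguistic;Music\n"
)

print(only_in_query("Music"))
print([row["id"] for row in rows_with_n_collections(sample, 3)])
```

The query builder shows why the Solr route is tedious: every exclusive count needs one query per collection, each listing all the other collections explicitly.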