Introduction: Becoming well-versed
The Princeton Prosody Archive houses thousands of digitized texts regarding the study of language, poetry, and versification published in English between 1559 and 1927. In the Archive’s spirit of allowing scholars to engage with such material through an interactive database, the goal of this project, “A New Meterstick,” is to enable enhanced interaction with the Archive by transforming the history of versification into visualizations. Using information including collection (Dictionary, Linguistic, Literary, Music, Original Bibliography, Typographically Unique, and Word List), author, publisher, publication city, publication date, and reprint status for each text, I visualized the data as timelines and networks to demonstrate the rich overlap of content and literary circles concerned with prosody in the sixteenth through twentieth centuries.
The timelines and networks were constructed with the Altair and NetworkX libraries, which turn the cleaned data into interactive visualizations. Each element in a timeline corresponds to a single text, while each node in a network corresponds to either a publisher or an author. Thus, this set of visualizations describes both the works and the players at the heart of sixteenth- through twentieth-century prosody, their interactions, and their evolution.
Beyond turning substantial, dense data into an easily digestible format, these visualizations provide an entirely different, complementary perspective to the close reading one can do on individual sources housed in the database. The broader trends identifiable through the timelines and networks, as well as the more global questions they prompt, in turn, may inform analysis at a more granular level. If a picture is worth a thousand words, interactive visualizations must be invaluable—right?
Timelines: Reading between the (time)lines
Place of Publication
The first graph shows the corpus distribution as a timeline, with year on the x-axis, number of texts published on the y-axis, and points colored by publication city. The 10 most common cities are indicated in the legend, grouped by cities in the United States (warm colors) and the United Kingdom (cool colors), allowing the timeline to also function as a heatmap that visualizes the shift from primarily British to primarily American materials. The timeline colorfully indicates the two major databases that supply the works in the PPA: many of the blue texts can be attributed to Eighteenth Century Collections Online’s (ECCO) collection of British works, while HathiTrust houses texts scanned from American research libraries, which mostly appear here as red nineteenth-century works. The “Publication City” legend, in addition to being grouped by UK and US, is ordered from most popular (top) to least popular (bottom) within each group. Baltimore, the least popular of the 10 cities, accounts for 84 publications, while each city labeled “other” accounts for fewer than 40 texts.
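The legend logic described above (keep the most common cities, fold the rest into "Other," then order by country and popularity) can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: the sample cities, the `COUNTRY` lookup, and the function name `label_top_cities` are all hypothetical, and `n=4` stands in for the 10 cities used in the figure.

```python
from collections import Counter

# Hypothetical sample: one publication city per text.
cities = (["London"] * 5 + ["New York"] * 4 + ["Boston"] * 3
          + ["Edinburgh"] * 2 + ["Dublin", "Leipzig"])

# Hypothetical lookup used only to group the legend by country.
COUNTRY = {"London": "UK", "Edinburgh": "UK", "New York": "US", "Boston": "US"}

def label_top_cities(cities, n):
    """Keep the n most frequent cities; relabel everything else as 'Other'."""
    top = {city for city, _ in Counter(cities).most_common(n)}
    return [city if city in top else "Other" for city in cities]

labels = label_top_cities(cities, n=4)
counts = Counter(labels)

# Legend order: grouped by country, then most to least popular within each group.
legend = sorted((c for c in counts if c != "Other"),
                key=lambda c: (COUNTRY[c], -counts[c]))
print(legend, counts["Other"])
```

The resulting `labels` list can be used directly as the color field of a chart, and `legend` as the explicit ordering of its legend entries.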
As can be seen in the Colab Notebook, this graph, like every timeline, also allows the user to scroll to zoom and to hover over a specific dot for more information about the data point (e.g., author, publisher).
The timeline in Figure 2 breaks down the corpus by collection. Only 276 texts, or about 4% of the data, had more than two collections listed, and the maximum number of collections for a single text was five. The dropdown selection highlights the texts tagged in that category in blue, while the other texts remain light gray.
Frequency of Reprinting
Using the Unique ID dataset that was used to produce the new clusters feature in the PPA search interface, this timeline displays the reprinting frequency of texts. Interestingly, the density and degree of reprinting are greatest in the 1780–1880 date range. The hover capability, which in this case displays Source ID, Publication Date, Unique ID, Title, and Author, can be used to inspect and explain trends in the data more closely. For instance, the gray peak in 1779 is due to many texts by the same author, Samuel Johnson, none of which are reprints (as the gray color indicates).
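One way to derive a reprint flag from such a dataset is to count how many records share each Unique ID. The sketch below assumes that records with the same Unique ID are printings of the same work; that interpretation, the sample rows, and the field names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rows: (source_id, unique_id, year).
rows = [
    ("s1", "u1", 1779),
    ("s2", "u2", 1779),
    ("s3", "u3", 1783),
    ("s4", "u3", 1790),
    ("s5", "u3", 1801),
]

# Number of records in each Unique ID cluster.
cluster_sizes = Counter(unique_id for _, unique_id, _ in rows)

# A text counts as reprinted when its cluster holds more than one record.
flagged = [
    {
        "source_id": sid,
        "year": year,
        "printings": cluster_sizes[uid],
        "reprinted": cluster_sizes[uid] > 1,
    }
    for sid, uid, year in rows
]
for row in flagged:
    print(row)
```

Here the two 1779 texts stay unflagged (gray in the timeline's terms), while the three records sharing `u3` are marked as a cluster of three printings.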
Collection and Publication Place Combined
This highly interactive timeline conveys a lot of information about any selected chunk of the texts. By clicking and dragging to select an area on the timeline, the user can highlight any portion of the timeline about which they want distributional data regarding collection and publication location. This selection can then be dragged across the plot, and the bar charts will immediately update to display changes in distributional data for the same length of time across the dataset.
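The computation behind the linked bar charts can be illustrated without any plotting library: a fixed-width window of years is selected, and the city and collection counts are recomputed for whatever falls inside it. This is a simplified sketch with invented records and a hypothetical helper, `window_summary`; in the interactive version, Altair performs the equivalent filtering in the browser.

```python
from collections import Counter

# Hypothetical records; field names are assumptions, not the Archive's schema.
texts = [
    {"year": 1785, "city": "London", "collection": "Literary"},
    {"year": 1792, "city": "Dublin", "collection": "Linguistic"},
    {"year": 1840, "city": "Boston", "collection": "Literary"},
    {"year": 1851, "city": "New York", "collection": "Literary"},
    {"year": 1860, "city": "London", "collection": "Dictionary"},
]

def window_summary(texts, start, width):
    """Count cities and collections for texts with start <= year < start + width."""
    selected = [t for t in texts if start <= t["year"] < start + width]
    cities = Counter(t["city"] for t in selected)
    collections = Counter(t["collection"] for t in selected)
    return cities, collections

# Dragging a fixed-width selection amounts to changing `start` only.
early_cities, early_collections = window_summary(texts, 1780, width=25)
late_cities, late_collections = window_summary(texts, 1840, width=25)
print(dict(early_cities), dict(late_collections))
```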
The publication location chart uses the same most popular cities and color scheme as the first timeline. Unlike the second timeline’s more holistic treatment of collection, the “Primary Collection” and “Also tagged as” categories here are linked through a split bar graph. Thus, the overlaps between primary and additional collection labels, and changes to that overlap over time, are readily visible. For instance, while most “Literary” texts are strictly labeled as “Literary,” a notable number also belong to the “Original Bibliography” and, in the seventeenth and eighteenth centuries, “Word List” collections.
Networks: Dotting the i’s and connecting the dots
The next set of visualizations uses networks to demonstrate the interconnectivity of author-publisher relationships in the Archive’s texts from the eighteenth and nineteenth centuries. The blue nodes correspond to authors, while the orange nodes correspond to publishers. The presence of an edge between two nodes indicates that a publisher has published at least one work by that author. The graphs below show such a network alongside a zoomed-in segment of the central cluster. Although the nineteenth-century graph is much more highly connected (despite the increased divide between US and UK publications), both graphs consist of a central cluster of relatively high connectivity surrounded by a ring of mostly author-publisher duos and trios (i.e., much lower connectivity). This suggests that the presence of a small number of very highly connected publishers outweighs the increased presence of a US-UK publishing divide.
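A network of this shape can be built in NetworkX from a list of (author, publisher) pairs. The following is a minimal sketch with placeholder names; tagging each node with a `kind` attribute and using `(kind, name)` tuples as node keys are illustrative choices, made here so that a name appearing as both author and publisher would not collapse into one node.

```python
import networkx as nx

# Hypothetical (author, publisher) pairs, one per text.
pairs = [
    ("Author A", "Publisher X"),
    ("Author A", "Publisher Y"),
    ("Author B", "Publisher X"),
    ("Author C", "Publisher Z"),
    ("Author A", "Publisher X"),  # a second work by the same pairing
]

G = nx.Graph()
for author, publisher in pairs:
    G.add_node(("author", author), kind="author", label=author)
    G.add_node(("publisher", publisher), kind="publisher", label=publisher)
    # nx.Graph keeps at most one edge per pair: "at least one work published".
    G.add_edge(("author", author), ("publisher", publisher))

# Component sizes separate a central cluster from small duos and trios.
component_sizes = sorted((len(c) for c in nx.connected_components(G)), reverse=True)
print(G.number_of_nodes(), G.number_of_edges(), component_sizes)
```

In this toy data, Author C and Publisher Z form the kind of isolated duo that makes up the outer ring, while the remaining four nodes form a single connected cluster.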
The hover tool can again be used to discover the author or publisher name at a specific node. Interestingly, while the eighteenth-century graph was dominated by clusters of multiple publishers around a single author, the nineteenth century saw a trend toward clusters of multiple authors around a single publisher, perhaps corresponding to the rise of more centralized publishing. The eighteenth-century pattern may also reflect the tendency of publishers in that period to reprint different versions of the same text (for instance, a London and a Dublin version). The hover tool will also be useful for spotting records that the data cleaning missed. The publisher data was especially messy, and small discrepancies such as “J. Smith” and “John Smith” for the same publisher result in distinct nodes; checking large publisher clusters for such repeats is therefore a worthwhile direction for future work.
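Candidate duplicates such as “J. Smith” and “John Smith” can be surfaced automatically before a manual check. The heuristic below, which groups names by surname and first initial, is an illustrative assumption and not the cleaning procedure used for the project; the sample names and the helper `name_key` are hypothetical.

```python
import re
from collections import defaultdict

def name_key(name):
    """Reduce a personal name to (surname, first initial), lower-cased."""
    parts = re.findall(r"[A-Za-z]+", name.lower())
    return (parts[-1], parts[0][0]) if parts else ("", "")

# Hypothetical publisher strings as they might appear in source metadata.
publishers = ["J. Smith", "John Smith", "T. Cadell", "James Smith", "R. Dodsley"]

groups = defaultdict(list)
for name in publishers:
    groups[name_key(name)].append(name)

# Keys shared by more than one spelling are candidates for manual review.
candidates = {key: names for key, names in groups.items() if len(names) > 1}
print(candidates)
```

Note that “James Smith” lands in the same group as “John Smith.” The heuristic only proposes candidates; whether two spellings refer to the same publisher still has to be decided by hand.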
The network representation also allows for numerical analysis of centrality (e.g., degree centrality, betweenness centrality). Such analysis, a clear next step in this research, would produce much more meaningful results after isolating the central cluster. At present, while it is quite clear that the nineteenth-century data is much more interconnected in the central cluster, its outer ring contains more low-connectivity clusters, so no clear distinction can yet be drawn between the overall centrality of the eighteenth- and nineteenth-century data.
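Isolating the central cluster and then computing centrality might look like the sketch below. It assumes that the "central cluster" can be approximated by the largest connected component, which is one possible reading and not a definition taken from the project; the graph itself is a toy example with placeholder names.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Author A", "Publisher X"),
    ("Author B", "Publisher X"),
    ("Author C", "Publisher X"),
    ("Author C", "Publisher Y"),
    ("Author D", "Publisher Z"),  # an isolated duo in the outer ring
])

# Assumption: treat the largest connected component as the central cluster.
core_nodes = max(nx.connected_components(G), key=len)
core = G.subgraph(core_nodes).copy()

# Both measures are normalized by NetworkX by default.
degree = nx.degree_centrality(core)
betweenness = nx.betweenness_centrality(core)

most_connected = max(degree, key=degree.get)
print(most_connected, degree[most_connected], betweenness[most_connected])
```

Computing centrality on `core` instead of `G` keeps the many small outer-ring components from diluting the normalized scores, which is the concern raised above.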
Miscellaneous: Rhyme without (quite as much) reason
There are relatively few texts in the Archive published by “famous” authors: that is, instantly recognizable names who are still widely taught in secondary schools in the United States today and therefore familiar to me and likely familiar to most members of an educated general public. To create this exploratory visualization, I manually tagged “famous” authors and grouped them by national identity. The names highlighted are:
- Classical: Aristotle, Horace, Virgil, Homer
- English & Irish: John Dryden, William Shakespeare, John Milton, Chaucer, W. B. Yeats, William Wordsworth, Alfred Lord Tennyson
- French: Voltaire, Jean-Jacques Rousseau
- American: Ralph Waldo Emerson, T. S. Eliot, Edgar Allan Poe, Henry Wadsworth Longfellow
For future visualizations, a more objective criterion for “fame” might be whether or not the author has a Wikipedia page — information available through linked metadata augmentation via OpenRefine. Moreover, a more robust representation of national identity of the authors in the Archive through linked metadata augmentation would provide a more holistic understanding of author demographics, as well as a methodology that could be extended to analyze factors such as gender or age at publication.
This timeline highlights the relationship between publication date, page count, and text frequencies. Longer texts became a bit more common over time, although the corpus is dominated by texts under 500 pages. You can also see the shift toward shorter journal articles in periodicals in the twentieth century, with a large number of records under 100 pages appearing in dark blue around 1910. The heatmap representation suggests a new direction for future visualizations — one that is not necessarily anchored chronologically.
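The counts behind a heatmap of this kind come from binning each text twice, once by date and once by page count. A minimal sketch follows; the decade and 100-page bin widths, the sample data, and the helper `bin_key` are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical (year, page_count) pairs.
texts = [(1779, 320), (1781, 45), (1905, 12), (1908, 80), (1910, 60), (1912, 640)]

def bin_key(year, pages, year_step=10, page_step=100):
    """Map a text to the lower edges of its (decade, page-count) cell."""
    return (year // year_step * year_step, pages // page_step * page_step)

# Each cell's count is what the heatmap encodes as color intensity.
cells = Counter(bin_key(year, pages) for year, pages in texts)
for (decade, page_floor), n in sorted(cells.items()):
    print(decade, page_floor, n)
```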
Conclusion: A forward-looking afterword
Overall, the timeline and network visualizations above provide a fascinating, data-driven lens through which to view the Archive’s vast collection of texts. Like any analysis, however, this methodology comes with its own limitations. For instance, as the timeline makes clear, there is a sharp decline in the total number of texts immediately preceding 1800. Because it is more likely that the database lacks data from this period than that publication halted so dramatically, more data, specifically for this underrepresented time frame, would allow for more accurate results. Furthermore, even after much computer-aided (using OpenRefine) and manual data cleaning, some mess remains in the publisher data, given the lack of uniformity in publication formatting and source metadata, which may lead to duplicate nodes in the network graph. The hover tool in the interactive version, however, helps the user identify such duplication.
These visualizations, of course, are only the first chapter in a broader analysis. The author-publisher networks are ripe for additional centrality and modularity class analysis, which would allow for identification of the major players in the system. Additionally, the methodology could be applied to the texts from the sixteenth, seventeenth, and twentieth centuries, which were omitted here because comparatively few texts fall in these time ranges. Shifting to a person-oriented analysis, pulling in additional information about authors and publishers (such as place of birth, education, or socioeconomic status) could provide important insight into whose work was deemed worthy (or was simply lucky) enough to be published and preserved.
Finally, keeping the human in “Digital Humanities,” one must remember that these authors, publishers, and texts are more than dots on a screen. Each text encapsulates a rich perspective of the human experience that must be condensed and pixelated to allow for this broader, digital perspective. Incorporating computational textual analysis — looking beyond simply the assigned collections to the words themselves — can provide valuable insights into how prosody responds to the historical events and shifting cultural attitudes that characterized the United States and United Kingdom in the sixteenth through twentieth centuries — in times of war and peace, during the ebbs and flows of nationalism, and amid the great forces of globalization and individualism.