The Distant Reader is a high-performance computing system that takes an arbitrary amount of unstructured data (text) as input and outputs sets of structured data for analysis. Put another way, the Reader consumes just about any number of files in just about any format, and it outputs plain text files, delimited files, a relational database, and a set of HTML reports, all for the purpose of systematic reading.
The first half of this workshop covers the use of the Distant Reader. Attendees will learn how to submit content to the Reader and how to interact with the resulting HTML reports. Attendees will thus be able to “read” dozens of websites, hundreds of books, thousands of journal articles, or just about anything on their computer.
The second half of the workshop covers hacking the Reader’s structured data. Given the plain text files, tab-delimited files, and relational database the Reader also outputs, attendees will learn how to visualize the data, subset the data with SQL, index the data with Solr, normalize the data with OpenRefine, apply machine learning to the data, etc., all for the purpose of more in-depth analysis. It is our sincere hope that attendees, given some instruction, will exploit the structured data and create interesting hacks of their own.
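To give a flavor of what subsetting the data with SQL might look like, here is a minimal sketch in Python. It does not use the Reader's actual database: the table name (entities), its columns (doc_id, entity, type), and the sample rows are all hypothetical, invented only so the example runs on its own against an in-memory SQLite database.

```python
import sqlite3

# Hypothetical stand-in for a Reader database: one table of named entities,
# one row per entity per document. The schema and rows are made up.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE entities (doc_id TEXT, entity TEXT, type TEXT)")
connection.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [
        ("walden", "Concord", "GPE"),
        ("walden", "Thoreau", "PERSON"),
        ("emerson-essays", "Concord", "GPE"),
        ("emerson-essays", "Plato", "PERSON"),
    ],
)


def count_entities(connection, entity_type):
    """Return (entity, count) pairs of the given type, most frequent first."""
    query = """
        SELECT entity, COUNT(*) AS frequency
        FROM entities
        WHERE type = ?
        GROUP BY entity
        ORDER BY frequency DESC, entity ASC
    """
    # The type is passed as a bound parameter rather than pasted into the SQL.
    return connection.execute(query, (entity_type,)).fetchall()


# Subset the corpus to place names and tally how often each one appears.
places = count_entities(connection, "GPE")
print(places)
```

The same pattern (filter with WHERE, tally with GROUP BY) would apply to whatever tables a real Reader database exposes; only the table and column names would change.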