Introduction to natural language processing with spaCy - A fast and accessible library that integrates modern machine learning technology

This half-day tutorial will introduce participants to spaCy, a free and open-source library for text analysis. Developed by Matthew Honnibal and Ines Montani in Berlin, spaCy offers a suite of tools for applied natural language processing (NLP) that are fast and practical and that allow for quick experimentation with and evaluation of language models. These tools make it possible for individuals to quickly train models that infer customized categories in named entity recognition tasks, match phrases, and visualize model performance. While comparable to the Natural Language Toolkit (NLTK), spaCy offers neural network models, integrated word vectors, dependency parsing and a variety of newer features. Participants will learn how to use spaCy for common research tasks and gain an understanding of how spaCy compares with other tools for NLP. We will also work with Prodigy, an annotation and active learning tool from the makers of spaCy. Prodigy allows a single researcher to quickly fine-tune a model for greater accuracy on a specific task or to train it to recognize new categories and entities. I have taught this workshop twice before, first as a three-hour session at DH2019 in Utrecht and then as a day-and-a-half version at DH Budapest.




9:00am - 12:00pm
The Westin Pittsburgh - Somerset East