Quantitative Text Analysis

Professor Sagarzazu has an excellent lecture style. I also like how approachable he is, always ready to answer questions and always willing to help. — participant from Singapore

This course provides participants with a set of quantitative methods and computational techniques for acquiring, measuring, and modeling text as data. It develops the most important models for analyzing textual content – from manual through dictionary-based content analysis, automated document classification, clustering, and scaling approaches – within a unified measurement framework that clarifies the different substantive assumptions motivating each model. The course also gives participants hands-on practice in fitting, interpreting, and criticizing each kind of model using real-world data, as well as practical solutions to the problem of efficiently 'harvesting' and storing data from publicly available sources.

The course emphasizes the question of how to integrate the results of textual analyses with other quantitative and qualitative methods and discusses how to make the best use of text analysis in various research designs. Towards the end of the second week, the course devotes time to discussing, planning, and problem solving for the participants' own research projects.


This course was offered in 2014, 2015, and 2016.


Iñaki Sagarzazu, Texas Tech University

Detailed Description

The course begins by placing quantitative text analysis in context and demonstrating its connections to other qualitative and quantitative social science research methods. It also introduces the key mathematical ideas that guide all the subsequent models of text and takes a fresh look at the most common form of social science text analysis, i.e., manual content analysis.

After a critical examination of manual content analysis, we consider some recent, closely related methodological developments, such as semi-automated, dictionary-based content analysis and statistical topic models. We show how to build and deploy these models to quantify the content of political speeches, manifestos, legal materials, etc. We then turn to the problem of assigning large numbers of documents to categories, e.g., to determine media agendas by assigning many years of newspaper stories to different categories of news. We learn how to determine the precision and accuracy of semi-automated categorization methods and consider the extent to which categories can be determined inductively from raw text. The section on models for text finishes with an in-depth look at text scaling. We develop a general class of text scaling models and show how they can be understood both as substantive models of position-taking using language, e.g., in political debates or legislative content, and as compact and effective methods of visualizing all kinds of count data, ranging from the results of content analyses to the content of multiple documents.
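As a rough illustration of the dictionary-based approach and of checking a categorization against hand-coded labels, the following is a minimal sketch in Python. It is not course material: the dictionary, the category names, and the function names are all hypothetical and chosen only for the example, and the course itself works with the statistical software R.

```python
import re
from collections import Counter

# Hypothetical dictionary: category -> keywords (illustrative only,
# not a validated content-analysis dictionary).
DICTIONARY = {
    "economy": {"tax", "taxes", "jobs", "budget", "inflation"},
    "security": {"police", "crime", "border", "defence"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def category_shares(text, dictionary=DICTIONARY):
    """Return, per category, the share of tokens matching its keywords."""
    tokens = tokenize(text)
    counts = Counter()
    for token in tokens:
        for category, keywords in dictionary.items():
            if token in keywords:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {category: counts[category] / total for category in dictionary}


def precision_recall(predicted, gold, label):
    """Compare predicted labels with hand-coded (gold) labels for one category."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == label and g == label for p, g in pairs)
    fp = sum(p == label and g != label for p, g in pairs)
    fn = sum(p != label and g == label for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(category_shares("We will cut taxes and create jobs."))
    print(precision_recall(
        ["economy", "economy", "security"],   # automated labels
        ["economy", "security", "security"],  # hand-coded labels
        "economy",
    ))
```

The shares are simple relative frequencies, so longer and shorter documents can be compared directly. Real dictionaries are much larger and usually need validation against manual coding, which is what the precision and recall check stands in for here.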

The course also offers practical solutions to problems arising from acquiring, pre-processing, and storing large numbers of texts, e.g., from government or NGO websites. These topics, often described as 'spidering' and 'scraping' data, are directly linked to the projects and research interests of the participants. Depending on these projects and interests, the course also addresses issues arising from non-English language materials as well as scripts, word segmentation, etc. The final sessions are devoted to an explicit discussion of various research designs that are best suited to text analyses, with a particular emphasis on how textual measures can be integrated into the participants' mixed method or qualitative research projects.


There are no formal prerequisites. It would be beneficial if participants had some experience with the statistical software R and were familiar with basic statistical concepts. However, participants unfamiliar with these concepts and tools will still be able to participate effectively in the course.


Participants are expected to bring a WiFi-enabled laptop computer. Access to data, temporary licenses for the course software, and installation support will be provided by the Methods School.

Core Readings

Krippendorff, Klaus H. 2013. Content Analysis: An Introduction to Its Methodology. 3rd edition. Thousand Oaks, CA: Sage Publications.

Neuendorf, Kimberly A. 2002. The Content Analysis Guidebook. Thousand Oaks, CA: Sage Publications.