Below is a revised and extended version of my remarks for my undergrad stats course.
There’s a lot of mystique around data.
If you want to sound serious and knowledgeable, one way to stake your claim is to assert that you’re data-driven. The implication is that your competitors, of course, are driven by mere vibes. And the further implication is that once you have data, the data will speak for themselves—clearly, plainly, loudly, and in a way that brooks no argument. To be data-driven is to be in direct contact with truth.
That’s an appealing vision. It’s wrong—indeed, likely impossible—as a way of understanding human society. Still, it’s appealing, at least to many people, because we all want to act on the best possible evidence, and statistical data—which appears to us in nice, neat charts, and which is produced by stats wizards who know mysterious terms like “standard errors” and “logistic regression”—seems to offer, by definition, the best possible evidence.
There’s a contrary vision, too. In this alternative view, data is the enemy of judgment. It’s practically a trope at this point to have a television show or movie in which the main character turns to a computer or a nerd and says something like “You might have data, but you forgot the most important thing: the human heart.” Or something like that.
The less hackneyed version of this is to point out all the limitations of data and measurement. Data, after all, don’t just emerge from the vasty deep—they have to be produced, recorded, and designed to answer particular questions. It’s rare that we have exactly the data we would want to have to answer the questions we want to pose about the world, especially about the human world.
I sometimes joke that, as a political scientist, it’s my job to use an accelerator to smash politicians into each other to see if we can discover new, fundamental forms of politics—but the joke is that it’s very hard to use the methods of physics in politics. Humans, it turns out, are tricky creatures to understand. They’re deceitful, both to observers and to themselves; they can become aware that they are being monitored and change their behavior; and they have a range of options through which to express their preferences. All of that means that we can’t just choose one outcome to monitor and then assume that we’ve gotten it right. An atom won’t change what it’s doing because we’re watching, but a politician might. Worse, humans are individuals, and that means that we can’t treat them as interchangeable, the way a physicist might treat two atoms from the same element. As a result, it can be hard to say with certainty what humans are doing and why they are doing it.
Unless we’re very lucky or very careful, we are usually in the situation of using second-, third-, or fourth-best data to understand what people are doing and why. And the worse our data become relative to our ideal situation—which we can rarely if ever attain—the more cautious we have to be in assuming that our data are illuminating anything near to the truth.
Sophisticated critics of quantitative methods, then, will often assert that naive appeals to data are, well, naive—that such appeals rest on an article of faith that will lead the credulous astray.
Well, if that’s the case, then why should you have to take stats as a required course?
There are, I think, three big reasons.
The first is that statistical reasoning is one of the greatest accomplishments of human thought. Statistics (if we include its long, open border with probability theory) offers a way to understand patterns within the randomness of observation. Statistical thinking can be applied to problems at a vast range of scales, from the behavior of fundamental particles to the evolution of the grandest structures in the universe—suns, galaxies, the distribution of matter and heat itself. And we can even apply it to hard human problems, like knowing whether adding a subway line will meaningfully reduce use of freeways (one of the actual impetuses behind the development of more sophisticated tools for analyzing choice, which won their developer an economics Nobel Memorial Prize). By bounding uncertainty, statistics offers us a way to make randomness intelligible and tractable—an enormously valuable contribution in its own right.
Statistics, done well, is a disciplined approach to understanding the world: it lets us analyze observations systematically, in ways we could not manage without the intellectual discipline and computational techniques it provides. That is to say: it is all well and good to record one’s own impressionistic observations, but it is also very good practice to sum across many observations and to seek patterns beyond those that can be identified by a Mark I human brain.
The second is that, as a practical matter, statistical reasoning is increasingly central to the work of both academic social scientists and the much wider array of social engineers (who do not go by that name, but who should) in the real world. If you’re asked to do data-analytic work as a professional, then you will very quickly move beyond simple tools like bar charts and line graphs and start harnessing the power of tools that draw on statistical reasoning. That applies as much to the Facebook social graph analyst as it does to the graduate student seeking to understand the results of her experiment.
There’s something funny in the way that we present political science and other social sciences to undergraduate students. A lot of the material in a typical undergraduate social science major, especially in the early years, looks like it could be assigned in a history, philosophy, or even literature course. There’s an endless stream of words and verbal theorizing, and much of the evidence is presented via cases and, sometimes, wordy summaries of quantitative findings.
This approach really does a disservice to students. Most working social scientists (especially if we recall that economists constitute the bulk of that category!) spend most of their time wrestling with statistical data—measuring, analyzing, and disputing it. Not all of them—there are many valuable advances in the realm of pure theory and purely qualitative (that is to say, non-quantified) evidence. But look at the publications in journals like the American Journal of Political Science or International Organization, and you will quickly see that being conversant with many, if not most or even nearly all, of the leading empirical debates requires a firm grounding in statistics. (More recently, the most salacious academic scandals have involved careful investigations into what seem to be cases of fraud involving the manipulation of data—you might not think that knowing the details of how Excel records data is exciting stuff, but it can be as telling as a smoking gun in a murder investigation.)
As a consequence, if you want to actually understand what social scientists know and think they know about the world, you genuinely must be able to at least interpret a regression table. Without that knowledge, you will be utterly reliant on authors’ claims with no way to verify them, or you will be thrust into dependence on the tribes of social scientists who do not use the tools that have become mainstream. (And even many of them are deeply enmeshed in debates with their quantitative rivals and allies.)
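If “regression table” sounds abstract, here is a minimal sketch in Python of how such a table gets produced. Everything below is invented for illustration: the variable names, the data, and the “true” relationship are toys, and the sketch assumes the numpy and statsmodels libraries are available. It is not an example from any real study.

```python
# Toy illustration (invented data, invented variable names): fit a simple
# linear regression and print the kind of table you will see in journal articles.
# Assumes numpy and statsmodels are installed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
campaign_spending = rng.normal(50, 10, n)    # hypothetical predictor
incumbent = rng.integers(0, 2, n)            # hypothetical binary control
noise = rng.normal(0, 5, n)
vote_share = 20 + 0.4 * campaign_spending + 6 * incumbent + noise  # invented "truth"

X = sm.add_constant(np.column_stack([campaign_spending, incumbent]))
model = sm.OLS(vote_share, X).fit()
print(model.summary())   # coefficients, standard errors, p-values, R-squared
```

When you read a published table, the coefficient estimates, their standard errors, and the sample size are the first things to look at; the rest of the course is, in part, about what those numbers do and do not license you to conclude.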
The fact that data analysis is also a skill employers value doesn’t hurt, either.
The third major point is that even if you are a skeptic of quantification, you cannot make a credible case against it without understanding it. If your response to the millions of person-hours that have gone into the development of stats and its empirical applications is simply to dismiss it all with a wave of the hand—a gesture that might combine sincere disapprobation with some anxiety about the hand-waver’s own competence—then only those already predisposed to sympathize with you will find your dismissal credible. To dispute something requires understanding it; the Jesuits have long held as much, and I take that view.
So even if you do not intend to become a “quant”, if you want to engage with quants you need to be able to understand what they are talking about. For my own part, I view this kind of informed skepticism as incredibly important, precisely because skeptics inside and outside the tent are the ones who drive the responsible use of statistics. Just because someone can perform an analysis does not mean it tells us anything, and often those who know a case well from other perspectives are the ones who can check whether the emperor is actually wearing clothes or just going through the motions of getting dressed.
So what is quantification?
Quantification is the art of turning observations into structured, numerical data. The most basic form of this is counting—how often does something appear? The more dimensions we track, the more questions we can ask:
How often does something appear over time?
How often does something appear relative to some other observation?
How often does something appear before or after some other observation?
Where does something appear (that is, how often does it appear over some measure of space)?
How often does something appear next to or far away from some other observation?
Does a pattern in the data look like what chance alone would produce, or does it look like a systematic relationship plus some random error?
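Every question on that list ultimately comes down to grouping observations and counting them. Here is a minimal sketch in Python, using a handful of invented records rather than real data, just to show how mundane the mechanics are (it assumes the pandas library is available):

```python
# Toy illustration of "quantification is counting" (invented records, not real data).
# Assumes pandas is installed.
import pandas as pd

events = pd.DataFrame({
    "year":    [2019, 2019, 2020, 2020, 2020, 2021, 2021],
    "country": ["A", "B", "A", "A", "B", "B", "B"],
    "protest": [True, False, True, True, False, True, False],
})

print(events.groupby("year").size())                      # how often over time?
print(pd.crosstab(events["country"], events["protest"]))  # how often relative to something else?
```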
There is nothing magical about this. There is nothing threatening about this. And there is nothing “non-human” or “anti-humanistic” about this. It is all, ultimately, about counting.
And it’s not just about counting things. Some of the most exciting developments that have achieved mainstream status in the past ten or fifteen years (in my scholarly neck of the woods) have involved counting connections—who talks to whom? who went to school with whom? Count those up, throw ‘em in a matrix, and you’ve got the data for a social network analysis. And you can even count words: how often is this word used compared to those words? What happens when we code words as being happy or sad, angry or confused? Well, we can start to assess the emotional state of a text—and from there we can begin to understand what people are feeling, or at least what feelings they’re expressing, when they talk about particular issues. We can even teach computers to see and to recognize what they’re seeing—to do the counting for us at a scale we couldn’t reach even with megagirls, or gigaboys, of effort.
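To make that concrete, here is a toy sketch (invented names, an invented sentence) of counting connections into a matrix and counting words. It is nothing more exotic than tallying:

```python
# Toy sketches of "counting connections" and "counting words" (invented examples).
import numpy as np
from collections import Counter

# Who talks to whom? Count each pair and put the counts in a matrix.
people = ["Ana", "Bo", "Chen"]
conversations = [("Ana", "Bo"), ("Ana", "Bo"), ("Bo", "Chen")]
index = {name: i for i, name in enumerate(people)}
adjacency = np.zeros((len(people), len(people)), dtype=int)
for a, b in conversations:
    adjacency[index[a], index[b]] += 1
    adjacency[index[b], index[a]] += 1   # treat ties as undirected
print(adjacency)   # the raw material for social network analysis

# How often is this word used compared to those words?
text = "the vote was close but the vote held"
print(Counter(text.split()))
```

Real projects differ mainly in scale: thousands of people, millions of words, but the same tallying.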
From the counting we move on to questions like:
What is the best way to theorize the relationships we would expect to see in the data if one explanation or another is true?
What is the best probability distribution for describing the baseline, “random” pattern—what would chance alone look like? (A small sketch of this idea follows below.)
What is the best way to present data?
Hey: what should we count?
These are core questions—about theory development, hypothesis testing, and communication—that blend technique (the maths and coding) with judgment (reasoning and aesthetics). The data do not speak for themselves, but they do whisper—and sometimes, after much effort, we can hear them. And if we’re smart and informed and careful, we can make sense of what they’re telling us.
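One simple way to make “what would chance alone produce?” concrete is a permutation test: shuffle the group labels many times and see how often chance generates a gap as large as the one you observed. What follows is a minimal sketch with invented data, meant only to illustrate the idea of a baseline distribution; it is not a recipe from this course.

```python
# Toy permutation test (invented data): is an observed difference in means
# larger than what shuffling the labels, i.e. "chance alone", typically produces?
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, 100)   # made-up outcomes for group A
group_b = rng.normal(0.3, 1.0, 100)   # made-up outcomes for group B
observed = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
shuffled_diffs = []
for _ in range(5000):
    rng.shuffle(pooled)
    shuffled_diffs.append(pooled[100:].mean() - pooled[:100].mean())

p_value = np.mean(np.abs(shuffled_diffs) >= abs(observed))
print(observed, p_value)   # how surprising is the observed gap under chance?
```

Classical tests do the same job with formulas instead of shuffling, but the logic is identical: compare what you observed to what chance alone would have given you.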
None of this, by the way, requires computers. But we use computers because computers make all of this much, much easier.
(There’s a fun fact—okay, well, it’s horrifying. The term “computer” used to refer to the people who performed “computations”, not the machines we use today. Because this was relatively low-status and rote work, it eventually became the province of women working under the direction of men who monopolized the high-status thought work for themselves. By the end of the 19th century, “computers” worked at places like the Harvard College Observatory and the U.S. federal government (and, later, at NACA and NASA), tackling difficult, distributed math problems. By the Second World War, there was even the unit of the “kilogirl”—a thousand hours of computing labor done by women.)
Computers do, of course, require specialized training to use well. But they allow undergraduates to undertake what were, not too long ago, quite literally Nobel Prize-winning techniques. These are in many cases the same techniques that underlie the just-behind-cutting-edge methods in machine learning and social network analysis (the cutting edge now belongs to transformers and LLMs). The difference often has more to do with the scale and quality of the data available than with the principles of analysis.
Think about it. The great minds of social science—Marx, Du Bois, Durkheim, Weber—all of them were limited by the methods of calculation and data gathering available to them. None of them could do what you can do with training, discipline, and a laptop. No pressure!
I’ll be very bold. There are a lot of folks who think there’s a hard line between “qualitative” and “quantitative” work. Often that line just marks where the right quantitative techniques or conceptualizations haven’t yet been designed or haven’t yet been employed. You can use computers to count, to see, to track, to measure, to find nodes, to read – you can do a lot with quant work if you put your mind to it. For that matter, your ability to theorize will often be enhanced if you learn how to observe and draw on observations that aren’t statistical—you will probably know a lot more about how to theorize viral spread, for instance, if you look hard at how people interact and so learn what to count.
And this is largely where contemporary social science unfolds. One way or another, quantification is changing social science much as it is changing every other human endeavor: it’s everywhere, it’s growing, and it requires training.
All of this, however, requires some hard work to learn – although a lot less than it did 20 years ago. I think the payoff is worth it—even if we shouldn’t be too naive about how being data-driven will save us. I hope you will come to believe this, too, but even if you don’t, I hope you will at least learn how to disagree with me at a much higher level than you could before you took this course.