[Computer Science Foreign-Language Translation] The Future of Coding

The Future of Coding: A Comparison of Hand-Coding and Three Types of Computer-Assisted Text Analysis Methods

Sociological Methods & Research
2021, Vol. 50(1) 202-237
©The Author(s) 2018
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/0049124118769114
journals.sagepub.com/home/smr

Laura K. Nelson1, Derek Burk2, Marcel Knudsen2, and Leslie McCall3

Abstract
Advances in computer science and computational linguistics have yielded new, and faster, computational approaches to structuring and analyzing textual data. These approaches perform well on tasks like information extraction, but their ability to identify complex, socially constructed, and unsettled theoretical concepts—a central goal of sociological content analysis—has not been tested. To fill this gap, we compare the results produced by three common computer-assisted approaches—dictionary, supervised machine learning (SML), and unsupervised machine learning—to those produced through a rigorous hand-coding analysis of inequality in the news (N = 1,253 articles). Although we find that SML methods perform best in replicating hand-coded results, we document and clarify the strengths and weaknesses of each approach, including how they can complement one another. We argue that content analysts in the social sciences would do well to keep all these approaches in their toolkit, deploying them purposefully according to the task at hand.

1. Northeastern University, Boston, MA, USA
2. Northwestern University, Evanston, IL, USA
3. The Graduate Center, City University of New York, New York, NY, USA

Corresponding Author:
Laura K. Nelson, Department of Sociology and Anthropology, Northeastern University, Boston, MA 02115, USA.
Email: l.nelson@northeastern.edu

Keywords
supervised machine learning, hand-coding methods, unsupervised machine learning, dictionary methods, content/text analysis, inequality

Content analysis of text-based data is a well-established method in the social sciences, and advances in techniques for collecting and storing data, and in computational power and methods, are continually pushing it in new directions. These advances are typically aimed at making the process more scientific—more reliable, valid, and reproducible.1 Previous advances include, for instance, intercoder reliability scores (e.g., Krippendorff 1970), designed to validate the coding of text across multiple people; qualitative data analysis software such as Atlas.ti and NVivo, designed to enable both qualitative analysis and quantitative identification of patterns to support qualitative conclusions; and the application of algorithms and mathematical models to extract objective patterns in text (Bearman and Stovel 2000; Carley 1994; Franzosi, Fazio, and Vicari 2012; Martin 2000; Mische and Pattison 2000; Mohr and Duquenne 1997).2
This latter development, the application of algorithms and mathematical models to text-based data, is seeing renewed vigor from content analysts, as emerging methods in natural language processing (NLP) and machine learning are enabling new, and faster, computational approaches to structuring and analyzing textual data, including “big” data (DiMaggio, Nag, and Blei 2013; Grimmer and Stewart 2011; Mohr et al. 2013). Indeed, one of the promises of these techniques is that they will allow researchers to do more with fewer resources, permitting the analysis of more data or data from more diverse sources (e.g., newspapers and television), as well as the extraction of more fine-grained patterns from a data set of any size, including within a sample of previously hand-coded text. Given the resource-intensive nature of hand-coding techniques, achieving breadth and depth in the analysis of text-based data has been virtually impossible.
The specific advances in using computers to identify categories in text that we examine in this article were initiated by computer scientists and computational linguists with the aim of classifying text into prespecified or unknown categories (Andersen et al. 1992; Cowie and Lehnert 1996).
To test the performance of these algorithms, computer scientists and computational linguists rely on a number of standard, labeled collections of text, such as the Reuters-21578 data set (“Reuters-21578 Test Collection” n.d.) and the 20 Newsgroup data set (Lang 1995). Categories in these benchmark data sets are determined by the collection curators and include topics such as “computers,” “recreation,” “science,” and “economics” among others.
The general conclusion from this research is that, given an adequate supply of previously labeled data, researchers can find an algorithm, or an ensemble of algorithms, that will accurately classify unlabeled data into the chosen classification scheme. That is, supervised machine learning (SML) algorithms of this kind can promise greater efficiency, transparency, and replicability, once a relatively small set of hand-coded documents has proven successful in “supervising” the computer to identify the desired content (Hanna 2013; King, Pan, and Roberts 2013). A number of software packages have therefore been developed to bundle algorithms and simplify their application in routine text analysis projects (e.g., RTextTools, scikit-learn, and Stanford NLP, which we discuss below).
However, as accessibility expands, scholars outside of computer science are moving beyond the benchmark collections and applying these tools to their own, discipline- or domain-specific tasks. This raises three methodological questions: (1) Can algorithms benchmarked on the standard collections perform as well in other domains? (2) If so, can these algorithms, and other computational tools, be successfully incorporated into the workflow of domain-specific questions and analyses? (3) More ambitiously, can they replace hand-coded work altogether?
We address these questions from the perspective of the domain of sociology (and allied disciplines). Scholars are turning to machine learning and other computational methods to augment or replace one of the most common tasks in sociological content analysis: identifying and coding themes, frames, concepts, and/or categories within text. But, in contrast to computer scientists and computational linguists, social scientists are typically not as interested in classifying a massive amount of text into their dominant categories, as they are in identifying complex, socially constructed, and unsettled theoretical concepts, often with ill-defined boundaries, such as populism, rationality, ambiguity, and inequality (Bonikowski and Gidron 2016; Evans 2002; Griswold 1987a). Most social scientists continue to rely on traditional human coding methods as the gold standard for the analysis of such phenomena (Benoit, Laver, and Mikhaylov 2009; Grimmer and Stewart 2011).
Our main objective in this article is to empirically test the three most prominent computer-assisted content coding methods—the dictionary method, SML methods, and unsupervised machine learning (UML) methods—against the gold standard of rigorous hand-coding for a complex topic of sociological interest. While there is considerable effort devoted to developing new algorithms for specific domains and problems (see, e.g., Bamman and Smith 2015; Nardulli, Althaus, and Hayes 2015), there is a dearth of empirical research to guide scholars in the selection and application of already established and packaged automated methods, especially with respect to the analysis of complex conceptual content. Can the leading fully-automated approaches to content analysis—dictionaries and UML—circumvent the need for hand-coding altogether? Indeed, are semiautomated methods like SML even up to the task of coding complex content?
Surprisingly, there has been no comprehensive comparison of how the various techniques perform relative to well established hand-coding methods when performing the kind of content coding of complex material that is of greatest interest to social scientists (including qualitative and quantitative researchers alike). Yet most social scientists do not have the resources to fully test these various approaches when embarking on their own content analysis project. We describe what, exactly, different automated techniques can and cannot do (in answer to the first question above) and show in the process that there can be significant complementarity among the various coding approaches (in answer to the second question above). In doing so, we provide a guide to the implementation of these methods in the domain of the social sciences more generally.
Because our aim is not only to inform debates among specialists but also to reach a more general social science audience, we take a different benchmarking tack than is common in the technical literature. Rather than benchmarking specific algorithms using data sets coded to test information retrieval (as computer scientists and computational linguists have done extensively), we benchmark the three computer-assisted approaches (dictionary, SML, and UML) on a data set hand coded to identify a complex and multifaceted theoretical concept (inequality). We compare substantive findings across the methods by provisionally treating the hand-coded results as the yardstick of measurement. The hand-coding method’s wider familiarity and acceptance among social scientists, along with its known strengths and weaknesses, enables us to root debates about content analysis methods firmly in realistic, social science data.
Although our focus is on substantive outcomes across the methods, we also offer practical guidance in the use of available software for computer-assisted text analysis. Supervised and unsupervised machine learning programs are at the leading edge of the field, yet even packaged programs require at least some knowledge of programming languages such as Python, Java, and R. We examined the three most widely-used “off-the-shelf” packages for applying SML methods: RTextTools (Jurka et al. 2014; R Core Team 2014), Python’s scikit-learn (Pedregosa et al. 2011), and Stanford’s NLP Classifier (Manning et al. 2014). Given that these three packages vary in the exact machine-learning algorithms included, the implementation of these algorithms, and the default text-processing settings, we wanted to test whether they produced similar results or whether they varied in their ability to replicate hand-coding. We also sought to evaluate their ease of use, and we provide some practical advice and links to learning resources in an Online Supplemental Appendix.
The data set, hand-coding methods, and general analytical strategy for testing the automated programs, given the features of our hand-coding project, are described in the Data and Analytical Strategy section. We then describe the metrics used to evaluate the accuracy of the automated methods in reproducing the hand-coded results in the Measures of Fit section. In the Results section, we describe in greater detail the three automated approaches to textual analysis, and perform our empirical tests of these approaches, in three subsections on SML methods, the dictionary method, and UML methods. Finally, in the Discussion and Conclusion section, we compare and contrast our results across the methods in order to highlight their strengths and weaknesses from a substantive perspective and to summarize the ways in which research questions of a substantive and conceptual nature can be appropriately matched to the various content analysis strategies.

Data and Analytical Strategy
Data and Hand-Coding Methods
In the hand-coding project, our substantive objective was to determine whether and when the new issue of rising economic inequality was covered by the media (McCall 2013). Following leading studies in political science on related topics such as welfare and race (Gilens 1999; Kellstedt 2000), we used the Readers’ Guide to Periodical Abstracts to search for articles on economic inequality from 1980 to 2012 in the three major American newsweeklies of Newsweek, Time, and US News & World Report. The Readers’ Guide provides a predefined list of subject terms for each article, and we selected a set of terms that most closely described our subject matter (“income inequality,” “wage differentials,” “equality,” and “meritocracy”). A surprisingly small number of articles had been assigned these inequality subject terms, however, so we expanded the search to include all articles that were assigned any of the 63 subject terms contained in this smaller set of articles. Because this population of articles numbered in the many thousands (approximately 8,500), we were forced to take a random sample stratified by year (10–15 percent of the population in each year). This sample (N = 1,253) is the data set of articles that we use in all subsequent analyses.
Crucial to the rationale for this article is the fact that we encountered such a variety of subject terms and complexity of subject matter that we felt we had no choice but to code each article by hand. Unlike comparable studies of media coverage of welfare and race, we assumed neither that all articles (selected using the method described above) were relevant, nor that a preset list of phrases was exhaustive or definitive enough for use in a computer-coding program.3 Rather, coding by hand enabled a more flexible approach to identifying and classifying subject matter that varies in form (i.e., the particular words or phrases used) but not necessarily in content (i.e., the concept of interest). This flexibility is perhaps especially necessary when the subject of analysis is a new multifaceted social issue unfolding in real time, for which settled and durable cultural frames are unavailable. For instance, it was not feasible to deductively catalogue the complete set of metaphors for economic inequality that could be invoked over a three-decade span of news coverage (e.g., the metaphor of “Wall Street versus Main Street” spread wildly during the financial crisis in the late 2000s, whereas stories about “union busting” were more germane in the early 1980s). Nor could we generate an exhaustive list of terms that are used to describe every potentially relevant social class group (i.e., the wealthy, the rich, executives, managers, professionals, the middle class, the unemployed, the poor, minimum wage workers, etc.).
Our coding scheme—iteratively developed in several stages using deductive and inductive reasoning (Chong and Druckman 2009; Ferree et al. 2002; Griswold 1987b)—attempted to encompass this wide range of coverage and, in addition, come to a better understanding of several gray areas of coverage (see Online Supplemental Appendix A for our definition of inequality). In fact, the challenges we faced in reliably coding the concept of inequality—material that conveyed the reality of inequality without necessarily relying on stock phrases of inequality—meant that we had to abandon earlier efforts to also code the ways in which the issue was framed, particularly in terms of its causes and solutions.4 As we discuss in subsequent sections, we anticipate using fully automated tools to perform these further analyses on the subset of articles identified by other methods (i.e., hand-coding and SML) as mentioning inequality. (Thus, automated methods may be of use in conducting more detailed analyses of sampled data; that is, they are not applicable only to “big data.”)
Our hand-coded results are presented in Figures 1 and 2. Over a third of articles were ultimately deemed irrelevant5 in the process of hand-coding, and the rest of the relevant articles were divided into two groups: (1) those that reference the topic of inequality, further broken down into articles with explicit references to inequality (e.g., using the term “inequality”) or implicit references (e.g., describing the diverging fortunes of executives and low-wage workers), respectively labeled explicit and implicit inequality,6 or (2) those that fell into a residual category focusing on related but more general trends in income, employment, and the economy. This group, which we term general economic (or economic for short), is also broken down into two categories. Figure 1 provides a visual representation of the five underlying categories along a continuum from irrelevant to explicit inequality. The two aggregated relevant categories (explicit/implicit inequality and general economic) are also represented in Figure 1. Two-coder reliability tests were high for the irrelevant category (.78) and the combined explicit and implicit inequality category (.92 in the first round of coding and .85 in a second round), and thus we focus on replicating them, and especially the central category of interest, explicit/implicit inequality. The time trends for the two aggregated relevant categories plus the irrelevant category are charted in Figure 2.

Figure 1. Categorization of hand-coded articles.

Figure 2. Trends in preferred three-code scheme of hand-coded articles (explicit/implicit inequality versus general economic versus irrelevant categories).

General Analytical Strategy
In addition to the complexity of the coding scheme noted above, we highlight several other aspects of our data and coding process that have implications for how we perform our tests of the automated methods and for our expectations of the results. First, the coding and development of the coding instructions took place prior to the spread of the new automated approaches to textual analysis; thus, the coding was not performed in order to test the automated programs. Second, and relatedly, we sought to determine only whether the fact of economic inequality, as we defined it, was ever mentioned in an article. Notably, this means that many articles were coded as inequality even if the primary topic was another issue. As a consequence of these two aspects of the hand-coding process, the automated programs will have to tune out a considerable amount of noise in order to correctly classify the articles (i.e., to agree with the classification of the hand coders). At the same time, the distinctions among the categories could be challenging to detect because most of the articles contain economic material to some degree. As we discuss below, this may especially be the case for categories of articles that are by definition subtle, such as implicit inequality.
Thus, we have set a high bar for the computer-assisted methods to meet, even those that are trained by previously hand-coded data. With respect to the automated methods that do not have this built-in advantage, the bar may be unreachably high. Our tests are nevertheless instructive, as they clarify exactly what will result, substantively, from the application of each method
alone to the data, something we believe is fairly common practice. Specifically, we examine whether, starting from scratch, fully automated methods isolate the topic of theoretical interest (i.e., inequality) from the potentially numerous other ways in which our data can be categorized. Analogously, we examine whether sophisticated dictionary lists are exhaustive enough to detect the scope and variation of coverage of inequality over time. In short, we use the hand-coding results as a yardstick against which to empirically identify the relative strengths and weaknesses of each of the three broad approaches to computer-assisted textual analysis.

Measures of Fit
In performing our tests, we utilize three widely used measures of fit: precision, recall, and F1 scores (Van Rijsbergen 1979).
Precision refers to the proportion of positive results that are “true” positives according to the hand-coding. For instance, if half of the articles that an automated program classified as mentioning inequality were similarly classified by the hand coders, then the precision score would be .50. Recall refers to the proportion of true positives that are also coded by the automated methods as positives. Thus, an analysis with high precision and low recall will be correct in most of its positive classifications but will miss a large proportion of articles that should have been classified as positive. F1 scores are the harmonic mean of precision and recall and provide a measure of overall accuracy for each category. While in most situations F1 scores are taken as the best indicator of fit, we found that precision and recall offered a better sense of where a model is succeeding and where it is erring. Accordingly, we pay as much, if not more, attention to these indicators as to the F1 score. Because these scores are calculated for each category (i.e., inequality, irrelevant, etc.), we also use a weighted average of precision, recall, and F1 scores across coding categories as an overall measure of method accuracy.7
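For readers less familiar with these metrics, the short Python sketch below (illustrative only, and not part of our analysis pipeline; the six article labels are hypothetical) shows how per-category and weighted-average scores of this kind could be computed with scikit-learn.

from sklearn.metrics import precision_recall_fscore_support

# Hypothetical hand-coded ("true") and machine-assigned labels for six articles.
hand_coded = ["inequality", "economic", "irrelevant", "inequality", "economic", "irrelevant"]
machine = ["inequality", "inequality", "irrelevant", "inequality", "economic", "economic"]
categories = ["inequality", "economic", "irrelevant"]

# Per-category scores: precision = true positives / predicted positives,
# recall = true positives / hand-coded positives, F1 = their harmonic mean.
p, r, f1, support = precision_recall_fscore_support(
    hand_coded, machine, labels=categories, zero_division=0)

# Overall measure: average weighted by the number of hand-coded articles in each category.
p_w, r_w, f1_w, _ = precision_recall_fscore_support(
    hand_coded, machine, labels=categories, average="weighted", zero_division=0)

print(dict(zip(categories, f1.round(2))), round(float(f1_w), 2))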
We add to these standard measures a comparison of the time trends estimated by each of the computer-assisted methods. Not only is the identification of time trends one of the most common objectives of a textual analysis project, but one concern about automated approaches is their potential insensitivity to changes in language over time (Hopkins and King 2010:242). We therefore test for the ability of computer-assisted approaches to reproduce the time trend from the original data, which is based on the proportion of articles in each year coded as falling into our predetermined categories, such as articles that contain explicit and/or implicit mentions of inequality. After using two-year moving averages to smooth the data, we use the correlation between these proportions for the automated programs and for the hand-coded method as a measure of accuracy. These analyses provide an answer to the question of whether computer-assisted coding will yield substantive conclusions similar to those derived from traditional methods.
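The following minimal sketch, under assumed data, illustrates this time-trend comparison: yearly proportions of articles coded as inequality, smoothed with a two-year moving average and then correlated between the hand-coded and automated series. The proportions shown are hypothetical placeholders.

import pandas as pd

# Hypothetical yearly proportions of articles coded as inequality, 1980-1985.
years = range(1980, 1986)
hand_coded = pd.Series([0.20, 0.25, 0.22, 0.30, 0.28, 0.35], index=years)
automated = pd.Series([0.18, 0.27, 0.20, 0.33, 0.25, 0.38], index=years)

# Smooth each series with a two-year moving average, then correlate the trends.
hand_smooth = hand_coded.rolling(window=2).mean()
auto_smooth = automated.rolling(window=2).mean()
print(round(hand_smooth.corr(auto_smooth), 2))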

Results
We begin with the method that is most similar to hand-coding in that it requires hand-coded input (SML). We then evaluate the more fully automated methods in the following sections. Sections on each method are in turn broken down into three subsections: (1) a brief overview of the method, including references to the technical literature in both the text and corresponding Online Supplemental Appendix for readers interested in greater detail, (2) a description of the analytical strategy, which differs slightly for each method as we calibrate our data and analysis to the particularities of the methods, and (3) the results.

SML Methods
Brief description. SML methods leverage both computers’ ability to detect patterns across large numbers of documents and human coders’ ability to interpret textual meaning. Based on a “training set” of documents hand coded into categories of interest, an SML analysis consists of three steps. First, documents are converted into “vector[s] of quantifiable textual elements,” which are called “features” (e.g., counts). Second, a machine learning algorithm is applied to find a relationship between these numeric feature-vectors and the hand-coded categories assigned to them, producing a model called a “classifier.” Finally, the analyst uses the classifier to code documents not in the training set (Burscher, Vliegenthart, and De Vreese 2015:124).
In SML methods, then, a document is represented as a vector of word counts, or “bag of words.” On its face, treating documents as bags of words seems wrongheaded, given how context can drastically change a word’s meaning. Because of the complexity of our hand-coding scheme, changes over time, and the concept of inequality itself, our analysis poses a difficult test for the bag of words approach. However, in practice, this strategy has been shown to perform well for many classification schemes of interest to researchers (Hopkins and King 2010). Our data allow us to experiment with different combinations of our five underlying content codes (see Figure 1) and thus to test classification schemes of varying types.
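To make the bag-of-words representation concrete, the brief sketch below (illustrative only; the two toy documents are not from our corpus) builds a document-term matrix with scikit-learn's CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the wage gap between rich and poor widened",
        "the economy added jobs as wages stagnated"]

vectorizer = CountVectorizer()      # lowercases text and counts unigrams by default
X = vectorizer.fit_transform(docs)  # sparse document-term matrix: 2 documents x vocabulary

# Each row is one document's vector of word counts; word order and context are discarded.
print(vectorizer.get_feature_names_out())
print(X.toarray())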
Analytical strategy. If we were performing an SML analysis from scratch, we would first hand code a subset of documents from our population of interest. This subset of hand-coded documents is the training set. Next, we would test our SML setup by selecting random subsets of the hand-coded documents to train SML classifiers and try to replicate the classification of the remaining hand-coded documents (called the “test set”). Low levels of agreement would suggest the need to refine the hand-coding scheme or change the specifications for training the SML classifier. Finally, once an acceptable level of agreement was reached (based on precision, recall, and F1 scores), we would train a classifier using all the hand-coded documents as the training set and then use it to classify the larger population of uncoded documents (called the “unseen set”).
Because our focus was on testing the ability of SML to replicate hand-coding, we only applied our classifiers to already hand-coded documents. We constructed 25 artificial training and test sets by randomly selecting roughly 500 articles to be the training set and using the rest (roughly 750) as the test set. We present the range of metrics across the 25 sets for the weighted average precision, recall, and F1 scores across all categories (see columns 7–9 in the first panel of Table 1) but focus our presentation and discussion on the metrics for the individual categories of the median performing set. The metrics for this set are the main entries in all columns of the first panel of Table 1, and the accompanying figures chart the proportion of articles classified by SML into the specified categories over time, again for this median performing set.8
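The sketch below is a simplified stand-in for this procedure, not our actual pipeline: it repeatedly draws a training set of roughly 500 articles, classifies the remaining test set, and records the weighted F1 score. Logistic regression on a bag-of-words matrix is used here merely as a placeholder for whichever algorithms each package selects, and the variable names in the usage note are hypothetical.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def split_metrics(texts, labels, n_splits=25, train_size=500):
    """Weighted F1 on the test set for each of n_splits random training/test splits."""
    scores = []
    for seed in range(n_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, train_size=train_size, random_state=seed, stratify=labels)
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test), average="weighted"))
    return np.array(scores)

# Hypothetical usage: report the minimum, median, and maximum across the 25 splits.
# scores = split_metrics(article_texts, hand_codes)
# print(scores.min(), np.median(scores), scores.max())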
In addition to varying the training and test sets, we also tested three combinations of our five underlying categories. In the first coding scheme, relevant versus irrelevant, we distinguish between all substantively relevant articles and irrelevant articles. In the second coding scheme, inequality versus not inequality, we distinguish between articles mentioning inequality—whether explicitly or implicitly—and all other articles. In the third coding scheme, inequality versus economic versus irrelevant, we distinguish between articles mentioning inequality, those discussing general economic issues but not mentioning inequality, and irrelevant articles. (We also discuss results from an alternative three-code scheme that we tested.) By comparing SML’s performance among these various coding schemes, we evaluate the method’s ability to replicate distinctions of different types, greater or lesser complexity, and different levels of aggregation.
We performed our SML analysis using the three most widely adopted SML software packages: RTextTools, Stanford’s NLP routines, and Python’s scikit-learn.9 We ran each program with comparable settings, within the limits of the options provided by each package, because we wanted to compare the “off-the-shelf” products and minimize the need for users to employ additional scripting (see Online Supplemental Appendix Table D1 for the settings for each program). Figures 3–5 show the time trends for all three programs to demonstrate their commensurability. Because our results were similar across software packages, and because Python’s scikit-learn is the most actively developed program of the three, we present results only from that package in Table 1 but include results from the other programs in Online Supplemental Appendix Table D2. We also include in Online Supplemental Appendix D brief descriptions of each program, along with links to helpful tutorials and learning resources.
Note (Table 1): Coding scheme A: relevant (explicit, implicit, general economic), irrelevant (irrelevant). Coding scheme B: inequality (explicit, implicit), not inequality (general economic, irrelevant). Coding scheme C: inequality (explicit, implicit), economic (general economic), irrelevant (irrelevant). Coding scheme D: inequality (explicit), irrelevant (implicit, general economic, irrelevant).
a) Weighted by the proportion of true positives in each category. See footnote 7.
b) Parentheses contain range across the 25 test/training set pairs.
c) Supervised ML values are for test set only.

Figure 3. Trends in supervised machine learning analysis of hand-coded articles for relevant versus irrelevant binary scheme (combined relevant substantive categories versus irrelevant category; combined relevant substantive categories shown).

Figure 4. Trends in supervised machine learning analysis of hand-coded articles for inequality versus not inequality binary scheme (explicit/implicit inequality versus all other categories; explicit/implicit inequality category shown).

Figure 5. Trends in supervised machine learning analysis of hand-coded articles for preferred three-code scheme (explicit/implicit inequality versus general economic versus irrelevant categories; explicit/implicit inequality category shown).

Results. Our analyses reveal that SML methods perform well in terms of both precision and recall. Looking first at the metrics averaged across the categories for each of the three classification schemes (column 9 in Table 1), we find average F1 scores for the median test set close to or well above the .70 rule of thumb for good fit often followed in the literature (Caruana and Niculescu-Mizil 2006). F1 scores are generally quite high for both the relevant versus irrelevant (.83) and inequality versus not inequality (.78) schemes, indicating that the inequality articles (combining explicit and implicit articles) and the irrelevant articles represent well-defined groupings. F1 scores are lower for the inequality versus economic versus irrelevant scheme (.69), suggesting that lower levels of aggregation lead to fuzzier distinctions among categories that are more challenging for the algorithms to recognize, at least in our data.
Taking a closer look at these results for the three-code scheme, the low F1 score stems from lower metrics for the middle economic category, which are not shown in Table 1. The F1 score for this category was .52, compared to .69 and .80 for the inequality and irrelevant categories, respectively. The recall for this economic category was especially poor. Of the 200 test-set articles hand coded into this category, only 93 (47%) were correctly classified by the SML algorithm. The algorithm struggled most in distinguishing between the economic and inequality categories, classifying 67 (34%) of these 200 economic articles as inequality articles, and, as might be expected, most of these fall into the implicit category.
For example, an article from 1983 titled The Growing Gap in Retraining was hand coded into the economic category and misclassified by the SML algorithm into the inequality category. This article called for the Reagan administration to invest more in worker retraining programs. It has all the buzzwords and phrases associated with inequality: ever widening gap, pressing problem, displaced workers, lost value, and so on. But the article never actually mentions earnings or income inequality; instead, it discusses the
employment skills gap: “As the U.S. economy sloughs off its declining manufacturing industries and increases its dependence on faster-growing service and technology sectors, an ever widening gap has opened between the new jobs that are being created and the skills of available workers.” These types of articles, containing key words associated with income inequality but used in the context of educational or employment inequality, were consistently misclassified by our algorithms. We return to the methodological and substantive significance of this point in a moment after we finish reporting the main results.
As shown in Figures 3–5, the SML methods are also capable of reproducing the trend in media coverage of inequality found in the hand-coded data. Here, we measure coverage of inequality as the proportion (as opposed to number) of articles coded into the inequality or relevant category per year, and we include the whole sample (training and test sets) of articles to get the best estimate of actual coverage of inequality (with two-year moving averages depicted along with the correlation between the hand-coded and SML trends). Just as in the hand-coded analysis, the SML results show peaks in inequality coverage in the early 1990s and around the period of the Great Recession. The temporal correlation between the hand-coded and SML trends ranges from .69 to .75 (shown in column 10 of Table 1). Given the small variation in our metrics across our 25 data sets and our careful sampling procedures, we are confident that the patterns found in our 10–15 percent sample are representative of our larger population of articles.
While these results are certainly encouraging, an important takeaway from these and other analyses that we conducted is that the selection of classification schemes may depend more on the precision and recall metrics for individual categories of theoretical interest than on the average total F1 score across categories, which is a more common practice in the literature. For example, in testing different three-category coding schemes, we obtained a slightly higher overall F1 score with an explicit inequality versus implicit inequality/economic versus irrelevant scheme than with our theoretically preferred inequality versus economic versus irrelevant scheme presented in Table 1. This higher F1 value was due to much better precision and recall for the combined implicit inequality/economic category as compared to the economic category alone. Yet, the trade-off was a markedly worse
performance in identifying explicit inequality articles as compared to identifying a combination of explicit and implicit inequality articles (in our preferred three-category scheme). Given our substantive interest in inequality, then, we opted for a coding scheme that better identified articles mentioning inequality over one with slightly better performance overall.
In sum, SML models were not only successful at replicating the hand-coded results overall and over time, thus, importantly, boosting confidence in the reliability of those results, but they also prompted a deeper analysis and understanding of the subject matter. This pertains especially to the subtle distinctions between articles in the explicit and implicit inequality categories and between articles in the middle general economic category and the categories that bookend it. Keeping these productive tensions in mind, a researcher could proceed to gathering another sample or population of articles from sources aimed at different audiences (e.g., from the New York Times) and code them using these semiautomated methods, assuming coverage features are roughly equivalent across the different kinds of publications. Indeed, an object file containing the relevant information for classifying articles into categories based on our full set of hand-coded articles (as the training set) can be made available to other researchers. This not only eliminates the need for hand-coding within specific content domains (e.g., inequality) but facilitates the comparative analysis of diverse corpora of text.

Dictionary and Unsupervised Learning Methods
Because SML algorithms require a nontrivial number of hand-coded texts, social scientists are exploring more fully automated text analysis methods to circumvent the need for hand-coding text altogether. Yet, it is important to recognize that fully automated methods (Grimmer 2010) and dictionary methods (Loughran and McDonald 2011) cannot be mechanistically applied; their output is typically tested by hand post facto. That is, the methods are implemented on a corpus and then hand coders go back through a sample of the corpus to test the validity of the computer-assisted codes. In this respect, dictionary and fully automated methods rely to a nontrivial degree on the judgment of the analyst to interpret and verify the results, at best using the most rigorous tests of reliability adopted by hand coders. Given that we have a large set of hand-coded results already at our disposal, our analysis is intended to make these judgment points explicit, along with the consequences for drawing substantive conclusions from the application of each method, had it been chosen originally as the only method of analysis of our data.
Dictionary method
Brief description. The dictionary method is the most straightforward and widely used of the automated textual analysis tools available. This is particularly the case when a media content analysis is not the central objective of a scholarly piece of research but instead is employed to quickly chart issue prevalence or salience in the media or to supplement findings from a survey-based analysis with more contextual data. On the subject of inequality, for instance, the Occupy Wall Street movement prompted what appeared to be an increase in media coverage of inequality. The impulse to quantify this shift led researchers to use key word searches of “inequality” in the news to draw conclusions about the extent to which the public was being exposed to new information, as this is considered a key determinant not only of issue salience but of issue-specific political and policy preferences (McCall 2013; Milkman, Luce, and Lewis 2013).
Dictionary methods consist, then, of a search through a corpus of documents for a list of words or phrases predetermined by the researcher, offering a quick and relatively easy way to code large volumes of data. Dictionary methods can be considerably more sophisticated, however, requiring a carefully curated list that describes the category of interest. Standard dictionaries such as the Linguistic Inquiry and Word Count (Tausczik and Pennebaker 2010) have been shown to be reliable, but only in limited domains. Creating specialized dictionaries has the benefit of being domain-specific, but it is still unclear whether dictionaries can be reliably used to code complex text and unsettled concepts, the focus of our analysis.

Analytical strategy. We use the text mining tools available in the statistical package R to search for articles with key words from a combined list of two comprehensive dictionaries on inequality (Enns et al. 2015; Levay 2013). Because these lists are composed of variations on the term inequality and its synonyms (i.e., divide, gap), we compare the results of this method to the results from the hand-coded explicit inequality category only. In a subsequent analysis, we also attempt to translate our own hand-coding instructions in Online Supplemental Appendix A as closely as possible into a list of terms and search instructions to identify explicit and implicit mentions of inequality. Online Supplemental Appendices B and C provide these lists and instructions, respectively. If any term or phrase in the dictionary instructions is present in an article, the article is placed in the inequality category; otherwise, the article is placed in the irrelevant category.10 This is consistent with our hand-coding procedure, in which a single mention of relevant content is sufficient to place an article in the inequality category, and it is a lenient test of the dictionary method.
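Although our analysis used R's text mining tools, the minimal sketch below expresses the same rule in Python for consistency with the earlier sketches. The term list is an abbreviated, hypothetical stand-in; the actual lists come from Enns et al. (2015), Levay (2013), and Online Supplemental Appendices B and C.

import re

# Illustrative stand-ins only, not the published dictionaries.
INEQUALITY_TERMS = ["income inequality", "wage gap", "income gap", "rich and poor"]

def dictionary_code(article_text):
    """Code an article as inequality if any dictionary term appears, else irrelevant."""
    text = article_text.lower()
    if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in INEQUALITY_TERMS):
        return "inequality"
    return "irrelevant"

# Hypothetical usage:
# codes = [dictionary_code(text) for text in article_texts]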

Figure 6. Trends in dictionary analysis of hand-coded articles (compare Levay-Enns to hand-coded explicit inequality trend; compare McCall to hand-coded explicit/ implicit inequality trend).

Results. The results of our analysis of the hand-coded articles using these two dictionaries are presented in the second panel of Table 1 and in Figure 6. We find that the carefully constructed lists of terms provided by Enns et al. (2015) and Levay (2013)—which are combined in our analysis—are remarkably successful at identifying articles hand coded as containing explicit coverage of inequality. With a precision of .91 (see column 1 of the second panel of Table 1), this method was highly unlikely to misidentify noninequality articles as inequality articles; that is, it resulted in few false positives. Yet, as is often the case, precision came at a cost: With a recall score of just .25 (see column 2 of Table 1), many of the articles hand coded as explicit were overlooked, not to mention articles that were coded as implicitly covering inequality (which we excluded from the inequality category for these tests). This substantial degree of underestimation is visually apparent in Figure 6, which compares the time trends revealed by the hand-coding and dictionary methods.11 By contrast, the instructions intended to mirror the complexity of our own hand-coding process, including both implicit and explicit mentions of inequality, erred in the opposite direction: With high recall (.84) and low precision (.48), coverage of inequality was overidentified, as also illustrated in Figure 6.
We draw two conclusions from this exercise. First, dictionary lists can accurately identify the most explicit instances of coverage, and, somewhat to our surprise, even approximate a time trend of coverage (the correlation with the trend of articles hand coded as explicit was .42 when we use a two-year moving average, as shown in column 10 of Table 1), but they are likely to miss more nuanced portrayals of a topic and thus significantly underestimate overall occurrence. If absolute frequency of occurrence matters, then this is a serious shortcoming.12 Second, a more complex set of instructions can effectively net a larger share of relevant articles, and even better approximate the time trend (r = .66), but they will in the process erroneously categorize a large share of irrelevant articles as relevant. Although it may be possible to fine-tune the dictionary instructions to arrive at a happy medium between the two extremes represented by our two versions of the dictionary method,13 we underscore again that researchers beginning from scratch will not know, as we would, when they have arrived at this happy medium.

UML methods
Brief description. Finally, there is hope that fully automated methods—including UML tools—can inductively identify categories and topics in text, thus replacing human input altogether, at least on the front end (Bearman and Stovel 2000; Carley 1994; Franzosi 2004; Grimmer and Stewart 2011; Lee and Martin 2015; Mohr 1998). Rather than classifying text into predetermined categories, as is the case with the dictionary and SML methods, fully automated text analysis techniques simultaneously generate categories and classify text into those categories. In theory, these techniques will inductively categorize text into the objectively “best” categories. In practice, there are multiple ways to classify text, with no clear metrics to determine which classification is better than others (Blei 2012; Grimmer and Stewart 2011). When a fully automated method offers multiple ways to group texts, researchers may qualitatively consider the topics covered as well as statistical fit. The complexity of these algorithms, the “black box” nature of their implementation and interpretation, and the sometimes cryptic output they generate have meant that social science researchers, in particular sociologists who are attuned to the complexity of language and concepts, are hesitant to fully embrace their use (e.g., Lee and Martin 2015).
Because our hand-coding scheme was developed in part inductively as well, as is common in qualitative analysis, and the categories are detailed enough to represent bounded, though complex, topics, we have the opportunity to compare computationally inductive techniques to the hand-coding technique. Our findings thus build on debates about the potential to substitute (allegedly) faster and more replicable UML methods for traditional content analysis (e.g., Bail 2014; DiMaggio et al. 2013; Lee and Martin 2015).

Analytical strategy. We used three fully automated methods in an attempt to identify inequality themes in these data.14 The first two are from the probabilistic topic modeling family (Blei 2012). Using the co-occurrence of words within documents, probabilistic topic models use repeated sampling methods to simultaneously estimate topics and assign topic weights to each document. In other words, topic models assume that each document is made up of multiple topics with varying levels of prevalence, rather than assigning each document to one topic or category. We estimate two topic models using two different algorithms, latent Dirichlet allocation (LDA), the most basic topic model (Blei 2012), and structural topic models (STM; Roberts et al. 2013), a topic modeling algorithm that provides a way to incorporate document-level covariates into the model. Because the language used to discuss inequality changed over time, we include the document year as a covariate in our STM. As with many fully automated methods, the researcher must choose the number of topics to be estimated by the algorithm, and we ranged the number of topics from 5 to 100 at various intervals for both algorithms, looking at the highest weighted words per topic to determine the content of the topic.
As noted, topic models do not assign articles to topics as hand coders do; rather, each document is a weighted distribution over all topics. In order to compare these results to those obtained using hand-coding methods, we classified an article as being about inequality if the associated topic weight was in the 95th percentile or above of the topic score among articles hand coded as irrelevant. This is intended to avoid classifying articles as about inequality if they simply contained routine mentions of the words (common in everyday language) associated with the inequality topic. In addition, because these methods identify our topic only infrequently, we measure their performance against the category of articles hand coded as explicit only (and not implicit), much like in the evaluation of the first dictionary method.
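The sketch below illustrates this thresholding rule. It is illustrative only: scikit-learn's LDA stands in for the LDA and STM implementations we actually used (STM's document-level covariates are not reproduced here), and the function and variable names are hypothetical.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_based_tags(texts, irrelevant_mask, inequality_topic, n_topics=60, seed=0):
    """Tag articles whose inequality-topic weight reaches the cutoff described above."""
    X = CountVectorizer(stop_words="english").fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    weights = lda.fit_transform(X)  # rows: documents; columns: topic weights summing to 1
    # Cutoff: 95th percentile of this topic's weight among articles hand coded as irrelevant.
    cutoff = np.percentile(weights[irrelevant_mask, inequality_topic], 95)
    return weights[:, inequality_topic] >= cutoff

# Hypothetical usage: the inequality-like topic is chosen by inspecting each topic's
# highest-weighted words, as described in the text.
# tags = topic_based_tags(article_texts, hand_coded_irrelevant, inequality_topic=12)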
Figure 7. Trends in unsupervised machine learning analysis of hand-coded articles (explicit inequality category shown).

While increasingly popular in the social sciences, topic modeling has been criticized outside of the social sciences for its poor predictive performance and its lack of reproducibility (e.g., Lancichinetti et al. 2015). Simpler clustering techniques often perform just as well as, and sometimes better than, more complicated hierarchical and topic modeling techniques (Schmidt 2012; Steinbach, Karypis, and Kumar 2000). Our third fully automated technique is thus the relatively simple k-means clustering algorithm, an established and ubiquitous algorithm that uses Euclidean distance measures to cluster articles into mutually exclusive groups (Jain 2010; Lloyd 1982). Like topic modeling, the number of clusters is determined by the researcher, using visual methods, mathematical methods (e.g., Bayesian information criterion [BIC]), or qualitatively by examining the coherence of the clusters (Rousseeuw 1987). We ranged the number of clusters from 2 to 70, looking at the most frequent words per cluster to determine the content of the cluster.
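A minimal sketch of this k-means step follows; it is not our actual code, and several details are assumptions (TF-IDF weighting, centroid-weight ordering as a proxy for "most frequent words," and the hypothetical usage names).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def cluster_and_describe(texts, n_clusters, top_n=10, seed=0):
    """Cluster articles, score the solution, and list each cluster's top terms."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X)
    sil = silhouette_score(X, km.labels_)  # higher values indicate more distinctive clusters
    terms = vec.get_feature_names_out()
    top_words = {}
    for k in range(n_clusters):
        # Highest-weighted terms in the cluster centroid stand in for "most frequent words."
        order = np.argsort(km.cluster_centers_[k])[::-1][:top_n]
        top_words[k] = [terms[i] for i in order]
    return sil, top_words

# Hypothetical usage: compare silhouette scores for cluster counts from 2 to 70, then read
# each candidate solution's top words to look for an inequality-like cluster.
# sil, words = cluster_and_describe(article_texts, n_clusters=30)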

Results. The metrics are provided in the third panel of Table 1, and the time trend for the STM results is shown in Figure 7 (there were too few relevant articles from the k-means analysis to construct a time trend, and the LDA results were similar to the STM results). We begin with the k-means analysis before examining the more complex methods. Using the silhouette method (Rousseeuw 1987) combined with the BIC (Pelleg and Moore 2000), the 18-cluster model produced the most distinctive clusters, but none of these clusters were clearly about inequality. Beginning with the 20-cluster model, there was one cluster that seemed to center on inequality, and there were two such clusters in the 60-cluster model. The word “inequality” never appeared as a frequent word, however, in any of these clustering solutions (despite the fact that the SML methods were capable of distinguishing inequality content from other content in the corpus of articles). The silhouette method isolated the 30-cluster model as having the most distinctive clusters in the second set of models (20–70 clusters, in which the BIC steadily declines after 20 clusters), and it included a cluster that appeared consistent with our theme (see the first column of Table 2 for the most frequent words in this inequality cluster). Yet with only 42 articles in this cluster, these methods appeared to quite dramatically undercount the number of articles about inequality in our data.

Note (Table 2): a) Words were stemmed using the Porter stemming algorithm. b) k-means model with 30 clusters; most frequent words. c) Structural topic models with 60 topics; highest weighted words.
The results from this k-means analysis suggest two important conclusions. First, there is no guarantee that the clusters produced by the k-means algorithm will line up with the topics of interest to the researcher. Furthermore, the mathematically “best” clustering solution may not necessarily be the best solution from a substantive perspective, as was the case with our data (i.e., the mathematical methods were no better than visual inspection at identifying models containing clusters with an inequality theme). Second, these results confirm the intuition that, in our data, discussions of inequality are woven throughout articles whose main focus is a separate topic; that is, inequality as a dominant topic is relatively infrequent. K-means is thus better suited to the analysis of thematically focused articles, such as tweets or press releases, and does not perform well in picking up themes that may be buried within discussions of different topics.
The other fully automated method we use, topic modeling, is designed to address this shortcoming by picking up more minor themes running across many articles. After failing to find a computer-generated topic on the subject of inequality when the number of topics for the STM was set to 5, 10, and 20, one did emerge in the output of a 30-topic model. The 20-topic and the 60-topic models, however, produced the most coherent topics as measured by the distribution of the top weighted topic over all documents, a mathematical solution that indicates distinctive topics. Because the 60-topic model also produced an inequality topic, we analyze the results from this model (see the second column of Table 2 for a list of the top weighted words associated with this topic from the 60-topic model).15
Generally speaking, the results mirror those for the first dictionary method, in which precision is high but recall is low. While the recall is extremely low using the k-means method (.14), there is somewhat more balance in the results from the STM method, in which a larger share of explicit articles is identified as compared to the first dictionary method (compare the recall score of .45 for STM with the recall score of .25 for the first dictionary method, as shown in column 2 of Table 1); consequently, the F1 score is also higher (.53 versus .40, as shown in column 3). The correlation of the two-year moving averages also improves (compare .58 for STM versus .42 for the first dictionary method). Given that our approach to hand-coding was not “topical,” in the sense that we were searching for any coverage of inequality in articles on any subject matter (broadly on economic matters), it is perhaps impressive how well the topic modeling algorithms actually correspond to the hand-coded articles. On the other hand, like the first dictionary method, the fully automated methods are undercounting the number of “true” inequality articles. If we had used only these methods for the original analysis, as we suspect many content analysts are now doing, we would have missed almost all of the implicit discussions of inequality and many of the explicit ones as well (as demonstrated by the low recall of .45).
Given that some of the clustering or topic modeling solutions did not pick up an inequality topic, and given the low recall for STM, we suggest that this method is best used as an inductive, exploratory method and should not be used to identify known categories in text. This could be done in two ways. It could be used as the first, exploratory step in an inductive research project, with the goal of uncovering themes or patterns in your data (e.g., Nelson 2017). Or, it may be best to deploy this method after categories have been defined and articles classified in order to explore emergent themes within the primary category of interest. For instance, once articles mentioning inequality have been selected with some degree of confidence (e.g., using either conventional hand-coding metrics of reliability, the dictionary method, supervised learning methods, or some combination of these), one could use UML methods to identify the range of frames and topics with which inequality often co-occurs—such as the discussion of taxes, immigration, education, and so on. As topic modeling assumes each document is structured from multiple topics, this could be an appropriate method for doing so.
In short, using UML methods as an exploratory first step, or, alternatively, after articles have been thematically classified, may suggest new patterns to the researcher that they had previously not considered and may take the research project in new and potentially fruitful directions. By contrast, utilizing these methods to classify material into predetermined categories may lead researchers astray, and “null” findings may be deceiving, depending on how prevalent the themes of interest are and how they are distributed.

Discussion and Conclusion
Our main conclusion is that these new computer-assisted methods can effectively complement traditional human approaches to coding complex and multifaceted concepts in the specialized domain of sociology (and related disciplines), but the evidence is mixed as to whether they can fully replace traditional approaches. SML methods successfully approximated and thus may partially substitute for hand-coding, whereas the other methods are best implemented in conjunction with hand-coding (or SML), or, in the case of topic modeling and clusters, as an initial exploratory step (Nelson 2017). That is, we find that none of the methods replace the human researcher in the content analysis workflow. Regardless of technique, the researcher is making decisions every step of the way based on their deep substantive knowledge of the domain.
In this section, then, we highlight the strengths and weaknesses of the various approaches in evaluating our hand-coded data, focusing on the substantive conclusions that would have been drawn from the results produced by each method. The larger objective of this discussion is to shed
light on the pros and cons of each method for a broader array of substantively based text analysis projects. Taken together, our results confirm the effectiveness of each of these methods for specific roles in the workflow of a content analysis project.
To begin with the most widely used of the automated methods, the dictionary method successfully identified a subset of the most explicit discussions of inequality in our data, as evidenced by the dictionary-identified articles The Inequality Dodge; Rich America, Poor America; To the Rich, From America; and The Quagmire of Inequality. However, this method missed more nuanced but nonetheless obvious (to a knowledgeable coder) discussions of inequality. Specifically, this method failed to detect a noteworthy share of articles in the early 1990s that were hand coded as about inequality (see Figure 6). Media coverage at this time dealt primarily with the problem of rising wage and earnings inequality in the labor market, as opposed to Wall Street or the top 1 percent, in articles such as Bridging the Costly Skills Gap and Bob Reich’s Job Market. These articles discussed the simultaneous rise in productivity and stagnation of male wages, the gap in wages between college and noncollege-educated workers, and excessive executive pay. Concerns of fairness in the labor market were paramount as transformations in the economy appeared to threaten the financial security of many workers.
The SML algorithms, alternatively, confirmed the rise in coverage of inequality in the early 1990s that was identified by the hand coders (see Figure 4). The features (words) that most distinguish the inequality from not inequality categories include “class,” “middle,” “pay,” and “wage”—words indicative of the inequality discussion in the 1990s. However, they also include words one would not immediately associate with inequality, such as “benefit” or “families,” suggesting that the SML approach represents more than just a glorified dictionary method. One article in particular highlights the difference between the dictionary and SML methods. An article published in 1994 titled Reining in the Rich was correctly identified by the SML programs as mentioning inequality, but it was not so identified by the dictionary method. The story never uses words like income gap or income inequality. Instead, the discussion is about how social security subsidizes the lifestyles of the affluent:
The costliest welfare load isn't for the poor, it's for the well-to-do . . . [A rich retiree] knows he is being subsidized by the 12.4% payroll tax being paid by employers and their younger and lower-paid workers, like his granddaughter Amanda Fargo, 21, who earns $5 an hour as a receptionist in a beauty salon. Savage approves of taxpayer subsidies for the elderly poor, but adds, "It's unconscionable . . . to take money away from these kids and give it to well-off people."

While this article reflects on a well-known aspect of inequality, it does not contain any of the words or phrases in the carefully curated dictionary developed by previous researchers.
Our research thus suggests that dictionary methods will struggle with the identification of broader concepts but can play a role when specific phrases are of interest (e.g., the “1 percent”) or accuracy and prevalence are not at a premium. For example, tracking the use of the term “inequality” could be useful in revealing shifts in the way that the underlying concept of inequality is being represented, especially if it could be shown that the deployment of the inequality term itself has substantively meaningful consequences (e.g., for understanding how public discourse reflects or shapes public perceptions and views about inequality). By contrast, we show that dictionaries are not an appropriate method if the purpose is to identify complex concepts or themes with myriad meanings and definitions, particularly over long periods of time when the terms chosen to represent them are likely to vary.
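As a minimal, hedged illustration of this coding rule, the sketch below (in Python, for illustration only) codes an article as covering inequality if any term from a keyword list occurs at least once, the simple one-occurrence rule we describe in note 10. The term list and example articles are toy stand-ins, not the combined Enns et al. (2015) and Levay (2013) dictionaries used in our analysis.

```python
# Toy "one-occurrence" dictionary coding. The term list is an illustrative
# stand-in, not the Enns et al. (2015)/Levay (2013) dictionaries we used.
INEQUALITY_TERMS = ["income inequality", "income gap", "wage gap", "the 1 percent"]

def mentions_inequality(article_text: str) -> bool:
    """Code an article as covering inequality if any dictionary term occurs once."""
    text = article_text.lower()
    return any(term in text for term in INEQUALITY_TERMS)

articles = [
    "Protesters on Wall Street decried income inequality and the 1 percent.",
    "The costliest welfare load isn't for the poor, it's for the well-to-do.",
]
# The second article discusses inequality only implicitly, so this rule misses it.
print([mentions_inequality(a) for a in articles])  # [True, False]
```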
SML algorithms, on the other hand, are well equipped to recognize these more complex concepts, even as the specific content related to the concepts changes over time; we were therefore able to almost completely replicate our hand-coding scheme using SML algorithms. The success of this method in discerning significant shifts in discussions of inequality gives us confidence that it can be used on most concepts or themes of interest to sociologists, provided they are reasonably bounded (recall the difficulties SML methods encountered distinguishing implicit inequality from general economic articles). This method does, however, require much more investment at the front end of the project to correctly hand code a nontrivial number of articles. With this caveat in mind, SML approaches can replace hand-coding approaches if the objective is to code large quantities of text and capture nuanced discussions of complex concepts.
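To make this front-end workflow concrete, the sketch below trains a bag-of-words classifier with scikit-learn, one of the packages we tested. The toy articles and labels, and the particular vectorizer and classifier, are illustrative assumptions rather than the exact data or settings used in our analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the hand-coded training set (our training sets contained
# roughly 500 hand-coded articles); labels follow the inequality versus
# not inequality scheme.
train_texts = [
    "The gap between executive pay and worker wages keeps widening.",
    "The Federal Reserve left interest rates unchanged on Tuesday.",
    "College graduates earn far more than workers without a degree.",
    "The new downtown stadium opened to large crowds this weekend.",
]
train_labels = ["inequality", "not_inequality", "inequality", "not_inequality"]

# Documents become word-count vectors ("bag of words") that feed the classifier;
# the classifier chosen here is illustrative, not a package default we relied on.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# The trained classifier then codes articles outside the training set.
print(model.predict(["Wage stagnation hit workers without college degrees."]))
```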
Finally, structural topic modeling is also well equipped to identify salient clusters of words, and like the SML algorithms, it correctly identified the above article on inequalities in the Social Security tax and transfer system. Likewise, it correctly picked up the rise in discussion of inequality in the early 1990s. But, as the presentation of results using this method illustrated, UML approaches will not necessarily identify in every model the specific concepts or themes of interest to a researcher. And, even when they do, the qualitative decision points involved, such as choosing the number of topics and the types of words to include, should give deductive researchers pause. Additionally, to
tag an article as having mentioned inequality, we used a cutoff determined by the distribution of topic probabilities across articles formerly hand coded as irrelevant. If we were doing a content analysis project from scratch (i.e., without any prior hand-coding), we would not be able to perform this sort of benchmarking, creating another choice-point for the researcher. More likely, researchers using this method would examine the proportion of words structured from a topic—charting, for example, this proportion over time— rather than tagging entire documents into categories (and charting the proportion of articles falling into these categories over time, as we did).
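A minimal sketch of that benchmarking step is shown below; the variable names are hypothetical, and the topic proportions and hand codes are randomly generated toy values rather than estimates from our structural topic model.

```python
import numpy as np

# Toy stand-ins: per-article share of the "inequality" topic and prior hand codes.
rng = np.random.default_rng(0)
inequality_topic_share = rng.beta(1, 8, size=1253)
hand_code = rng.choice(["irrelevant", "relevant"], size=1253)

# Cutoff: 95th percentile of the topic's share among articles hand coded as irrelevant.
cutoff = np.percentile(inequality_topic_share[hand_code == "irrelevant"], 95)

# Tag an article as mentioning inequality when its topic share exceeds the cutoff.
tagged = inequality_topic_share > cutoff
print(f"cutoff={cutoff:.3f}, articles tagged={tagged.sum()}")
```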
If the goal is not to categorize documents into known categories but to inductively explore textual data and the themes that emerge from them, or to explore how topics co-occur in texts, topic models are a good solution. In particular, once relevant content has already been identified using other, more reliable methods, such as SML, fully automated methods can then be used to examine the content in greater detail and in a more exploratory and inductive fashion (e.g., in our case, we would investigate exactly how inequality is covered or framed in the relevant articles or which topics inequality is most commonly associated with). Our results demonstrate the ability of topic models to recognize patterns of theoretical interest in textual data, indicating that they can be used to complement other forms of analysis. If used in an exploratory way, topic models can suggest new and potentially fruitful patterns that may productively reroute research agendas or may help researchers form testable hypotheses about their thematically focused data.
In closing, we wish to underscore that even though our conclusion regarding the significant complementarities among the methods we discussed is based on the current state of the art, we believe it will continue to apply in the foreseeable future as new computer-assisted text analysis methods and techniques are developed. For example, on the one hand, new work in word embeddings, which incorporate the context in which a word is used more effectively than previous methods do, can further improve the performance of NLP algorithms (Goth 2016; Mikolov et al. 2013). Sociologists would therefore benefit from an ongoing engagement with this literature to elevate their own application of computer-assisted techniques. Yet, on the other hand, we as a discipline should think carefully about exactly how these new methods correspond to the types of research questions and data at the core of our scholarly enterprise, including those that privilege humanistic interpretation. Comparing and contrasting automated methods with nuanced hand-coding methods provides an empirical foundation that has been lacking in debates over the relationship between
our methodological traditions and the new computer-assisted techniques and that we hope advances these debates to better understand the future of textual analysis in social science research.

Authors’ Note
A replication repository, containing both code and data, can be found at https://github.com/lknelson/future-of-coding.

Acknowledgments
We are grateful for funding from the Russell Sage Foundation and for extremely helpful comments from the reviewers and from John Levi Martin, James Evans, and Peter Enns on an earlier draft. We also thank Bart Bonikowski for introducing the hand coder among us to automated methods.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article has been funded by the Russell Sage Foundation.

Supplemental Material
Supplemental material for this article is available online.

Notes
1. How exactly to make content analysis "scientific," and if that is even possible, is of course contested (see, e.g., Biernacki 2012; Reed 2015; Spillman 2015).
2. Past research has used semiautomated methods to quantify the structural narrative of texts (Bearman and Stovel 2000; Franzosi et al. 2012), clustering and block modeling methods to measure latent cultural structures embedded in text (Martin 2000; Mische and Pattison 2000; Mohr and Duquenne 1997), and map and network analyses to measure relationships between concepts within texts (Carley 1994).
3. See Gilens (1999), Dyck and Hussey (2008), and Kellstedt (2000) for approaches that retain all articles from the search as relevant and then either code pictures only or use computerized methods to identify frames.
4. Sociologists "code" text in a variety of ways that vary in complexity, including classifying whole or parts of text into different categories, identifying different
themes or frames in text, and identifying rhetorical techniques such as persuasion, satire, or ambiguity, to name a few. We see our hand-coded data as a form of complex text classification, complex enough to entail a challenge for these automated methods. Further research could investigate different types of coding tasks in a similar way that we do here.
5.Irrelevant articles were on the following topics: racial or gender inequality, gay rights, inequality in other countries, individuals whose names are part of a subject term (e.g., Marc “Rich”), popular culture items that include part of a subject term (e.g., a movie named “Big Business”), clearly personal affairs about a single individual, noneconomic elites (e.g., in art or religion), and social class as a predictor of noneconomic phenomenon (e.g., health, drug use).
6. Online Supplemental Appendix A describes the distinction between explicit and implicit mentions of inequality (see in particular panel 4).
7. Specifically, we take the weighted average across categories: weighted average precision = average of precision scores multiplied by the proportion of total rows that are true positives for each category; weighted average total recall = average of recall scores multiplied by the proportion of total rows that are true positives for each category; weighted average total F1 = (2 × weighted_average_precision × weighted_average_recall)/(weighted_average_precision + weighted_average_recall). A short code sketch of this calculation follows the notes.
8. Although the table reports metrics for the test set, the graphs provide trends for the entire sample of articles, including both test and training sets, as the substantive results for the entire sample (and by inference, the population) are of interest to the researcher.
9. We also performed extensive tests of the ReadMe program, which is available as a package for R or as a stand-alone program (Hopkins et al. 2013). We include information about and results from that analysis in Online Supplemental Appendix D. However, because ReadMe directly estimates the proportion of documents falling in each category rather than classifying documents individually, it was not possible to create precision, recall, and F1 statistics.
10. To account for the fact that key words may occur by chance in articles not related to inequality, we also considered a threshold-based approach to classification, whereby, for example, an article would be placed in the inequality category only if the incidence of key words exceeded the 95th percentile of key word-incidence among articles hand coded as irrelevant. However, because there is no established procedure for setting such a threshold in the literature, we opted to present results for the simpler “one-occurrence” dictionary-coding scheme.
11. An alternative method for constructing a time trend from a key word dictionary is to chart the incidence of key words as a proportion of total words in each year, as
opposed to charting the proportion of articles containing at least one key word. We tested this alternative method but found that the trend in key word incidence was prone to wild swings from year to year and did not closely follow the trend constructed through hand-coding. The correlation between the trend in key word incidence and the proportion of articles hand coded as explicitly covering inequality was 0.46 compared to 0.59 between the proportion of articles containing at least one key word and the hand-coded trend (see column 11 in the second panel of Table 1).
12. On the other hand, if explicit and implicit coverage are correlated, then inferences about overall coverage and trends in coverage may not be overly biased (though a comparison of these trends in Figure 3 reveals that the trend for explicit articles differs from the trend for combined explicit and implicit articles).
13. For example, to improve the recall of the two-word, modifier-noun key word approach, we could expand the list of key words in order to capture more of the ways in which inequality is discussed. On the other hand, to improve the precision of our more complex scheme, we could require that two key words occur in the same sentence, or the same paragraph, rather than anywhere in the article.
14. For the topic models and the k-means model below, we performed common preprocessing steps: We converted all letters to lower case, removed punctuation, and stemmed words using the Porter stemming algorithm.
15. We also ran a 60-topic LDA model, and the results were similar to the structural topic model. With the LDA model, we identified 150 articles as having content on inequality, whereas we identified 190 articles with STM. The F1 score was similar for the two (.52 for LDA and .53 for STM), with recall higher for STM (.45 compared to .41 for the LDA model) and precision lower for STM (.63 compared to .71 for the LDA model). Given the similar F1 scores and the fact that STM flagged more articles, we focus on the results from the STM analysis only.
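The short sketch below illustrates the weighted averaging described in note 7, using toy labels rather than our data. It assumes, as our reading of that note, that "true positives for each category" refers to the articles hand coded into that category, so that the weights are each category's share of all articles (the weighting scikit-learn applies with average="weighted").

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy hand codes and machine codes for three categories.
hand_coded    = ["inequality", "economic", "irrelevant", "inequality", "irrelevant"]
machine_coded = ["inequality", "irrelevant", "irrelevant", "economic", "irrelevant"]

# Weighted average precision and recall, with per-category weights equal to
# each category's share of all articles (scikit-learn's "weighted" average).
w_precision, w_recall, _, _ = precision_recall_fscore_support(
    hand_coded, machine_coded, average="weighted", zero_division=0
)

# Note 7 then takes the harmonic mean of the two weighted averages as the F1.
w_f1 = 2 * w_precision * w_recall / (w_precision + w_recall)
print(round(w_precision, 2), round(w_recall, 2), round(w_f1, 2))
```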

References
Andersen, Peggy M., Philip J. Hayes, Alison K. Huettner, Linda M. Schmandt, Irene B. Nirenburg, and Steven P. Weinstein. 1992. “Automatic Extraction of Facts from Press Releases to Generate News Stories.” Pp. 170-77 in Proceedings of the Third Conference on Applied Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics.
Bail, Christopher A. 2014. “The Cultural Environment: Measuring Culture with Big Data.” Theory and Society 43(3-4): 465-82.
Bamman, David and Noah A. Smith. 2015. “Open Extraction of Fine-Grained Political Statements.” Pp. 76-85 in Proceedings of the Conference on Empirical
Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics.
Bearman, Peter S. and Katherine Stovel. 2000. “Becoming a Nazi: A Model for Narrative Networks.” Poetics 27(2-3): 69-90.
Benoit, Kenneth, Michael Laver, and Slava Mikhaylov. 2009. “Treating Words as Data with Error: Uncertainty in Text Statements of Policy Positions.” American Journal of Political Science 53(2): 495-513.
Biernacki, Richard. 2012. Reinventing Evidence in Social Inquiry: Decoding Facts and Variables. New York: Palgrave Macmillan.
Blei, David M. 2012. “Probabilistic Topic Models.” Communications of the ACM
55(4): 77-84.
Bonikowski, Bart and Noam Gidron. 2016. “The Populist Style in American Politics: Presidential Campaign Rhetoric, 1952-1996.” Social Forces 94(4): 1593-621.
Burscher, Bjorn, Rens Vliegenthart, and Claes H. De Vreese. 2015. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize across Contexts?” The Annals of the American Academy of Political and Social Science
659(1): 122-31.
Carley, Kathleen. 1994. “Extracting Culture through Textual Analysis.” Poetics
22(4): 291-312.
Caruana, Rich and Alexandru Niculescu-Mizil. 2006. “An Empirical Comparison of Supervised Learning Algorithms.” Pp. 161-68 in Proceedings of the 23rd International Conference on Machine Learning. New York: ACM.
Chong, Dennis and James N. Druckman. 2009. “Identifying Frames in Political News.” Pp. 238-87 in Sourcebook for Political Communication Research: Methods, Measures, and Analytical Techniques, edited by E. P. Bucy and R. L. Holbert. New York: Routledge.
Cowie, Jim and Wendy Lehnert. 1996. “Information Extraction.” Communications of the ACM 39(1): 80-91.
DiMaggio, Paul, Manish Nag, and David Blei. 2013. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics 41(6): 570-606.
Dyck, Joshua and Laura Hussey. 2008. “The End of Welfare as We Know It? Durable Attitudes in a Changing Information Environment.” Public Opinion Quarterly
72(4): 589-618.
Enns, Peter, Nathan Kelly, Jana Morgan, and Christopher Witko. 2015. “Money and the Supply of Political Rhetoric: Understanding the Congressional (Non-) Response to Economic Inequality.” Paper presented at the APSA Annual Meetings, San Francisco, CA.
Evans, John H. 2002. Playing God? Human Genetic Engineering and the Rationalization of Public Bioethical Debate. Chicago, IL: University of Chicago Press.
Ferree, Myra Marx, William Anthony Gamson, Jurgen Gerhards, and Dieter Rucht. 2002. Shaping Abortion Discourse: Democracy and the Public Sphere in Germany and the United States. New York: Cambridge University Press.
Franzosi, Roberto. 2004. From Words to Numbers: Narrative, Data, and Social Science. Cambridge, England: Cambridge University Press.
Franzosi, Roberto, Gianluca De Fazio, and Stefania Vicari. 2012. “Ways of Measuring Agency: An Application of Quantitative Narrative Analysis to Lynchings in Georgia (1875–1930).” Sociological Methodology 42(1): 1-42.
Gilens, Martin. 1999. Why Americans Hate Welfare: Race, Media, and the Politics of Antipoverty Policy. Chicago, IL: University of Chicago Press.
Goth, Gregory. 2016. “Deep or Shallow, NLP is Breaking Out.” Communications of the ACM 59(3): 13-16.
Grimmer, Justin. 2010. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18(1): 1-35.
Grimmer, Justin and B. M. Stewart. 2011. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3): 267-97.
Griswold, Wendy. 1987a. “The Fabrication of Meaning: Literary Interpretation in the United States, Great Britain, and the West Indies.” American Journal of Sociology
92(5): 1077-117.
Griswold, Wendy. 1987b. “A Methodological Framework for the Sociology of Culture.” Sociological Methodology 17:1-35.
Hanna, Alex. 2013. “Computer-Aided Content Analysis of Digitally Enabled Movements.” Mobilization: An International Quarterly 18(4): 367-88.
Hopkins, Daniel and Gary King. 2010. “A Method of Automated Nonparametric Content Analysis for Social Science.” American Journal of Political Science 54(1): 229-47.
Hopkins, Daniel, Gary King, Matthew Knowles, and Steven Melendez. 2013. ReadMe: Software for Automated Content Analysis. Version 0.99836. Accessed 4 October 2017: (http://gking.harvard.edu/readme).
Jain, Anil K. 2010. “Data Clustering: 50 Years Beyond K-Means.” Pattern Recognition Letters 31(8): 651-66.
Jurka, Timothy P., Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt. 2014. RTextTools: Automatic Text Classification via Supervised Learning. R package version 1.4.2. Accessed 4 October 2017: (https://cran.r-project.org/web/packages/RTextTools/index.html).
Kellstedt, Paul M. 2000. “Media Framing and the Dynamics of Racial Policy Preferences.” American Journal of Political Science 44(2): 239-55.
King, Gary, Jennifer Pan, and Margaret Roberts. 2013. “How Censorship in China Allows Government Criticism but Silences Collective Expression.” American Political Science Review 107(2): 1-18.
Krippendorff, Klaus. 1970. “Bivariate Agreement Coefficients for Reliability of Data.” Sociological Methodology 2:139-50.
Lancichinetti, Andrea, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, and Luís A. Nunes Amaral. 2015. "High-Reproducibility and High-Accuracy Method for Automated Topic Classification." Physical Review X 5(1): 011007.
Lang, Ken. 1995. “NewsWeeder: Learning to Filter Netnews.” Pp. 331-39 in Proceedings of the 12th International Machine Learning Conference. Morgan Kaufmann Publishers Inc.
Lee, Monica and John Levi Martin. 2015. “Coding, Culture, and Cultural Cartography.” American Journal of Cultural Sociology 3:1-33.
Levay, Kevin. 2013. “A Malignant Kinship: The Media and Americans’ Perceptions of Economic and Racial Inequality.” Unpublished paper, Northwestern University Department of Political Science, Evanston, IL.
Lloyd, Stuart P. 1982. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory 28(2): 129-37. doi:10.1109/TIT.1982.1056489.
Loughran, Tim and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66(1): 35-65.
Manning, Christopher, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. “The Stanford CoreNLP natural language processing toolkit.” Pp. 55-60 in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, MD: Association for Computational Linguistics.
Martin, John Levi. 2000. “What Do Animals Do All Day? The Division of Labor, Class Bodies, and Totemic Thinking in the Popular Imagination.” Poetics 27(2-3): 195-231.
McCall, Leslie. 2013. The Undeserving Rich: American Beliefs about Inequality, Opportunity, and Redistribution. New York: Cambridge University Press.
Mikolov, Thomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." In Proceedings of Workshop at International Conference on Learning Representations. (https://research.google.com/pubs/pub41224.html)
Milkman, Ruth, Stephanie Luce, and Penny Lewis. 2013. Changing the Subject: A Bottom-Up Account of the Occupy Wall Street Movement in New York City. New York: The Murphy Institute, City University of New York.
Mische, Ann and Philippa Pattison. 2000. “Composing a Civic Arena: Publics, Projects, and Social Settings.” Poetics 27(2): 163-94.
Mohr, John W. 1998. “Measuring Meaning Structures.” Annual Review of Sociology 24(1): 345-70.
Mohr, John W., Robin Wagner-Pacifici, Ronald L. Breiger, and Petko Bogdanov. 2013. “Graphing the Grammar of Motives in National Security Strategies: Cultural Interpretation, Automated Text Analysis and the Drama of Global Politics.” Poetics 41(6): 670-700.
Mohr, John W. and Vincent Duquenne. 1997. “The Duality of Culture and Practice: Poverty Relief in New York City, 1888-1917.” Theory and Society 26(2/3): 305-56.
Nardulli, Peter F., Scott L. Althaus, and Mathew Hayes. 2015. “A Progressive Supervised-learning Approach to Generating Rich Civil Strife Data.” Sociological Methodology 45(1): 145-83.
Nelson, Laura K. 2017. "Computational Grounded Theory: A Methodological Framework." Sociological Methods and Research. Retrieved April 02, 2018 (https://doi.org/10.1177/0049124117729703).
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, and É. Duchesnay. 2011. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12:2825-30.
Pelleg, Dan and Andrew W. Moore. 2000. “X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters.” Pp. 727-34 in Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers Inc.
R Core Team. 2014. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Accessed 4 October 2017: (http://www.R-project.org/).
Reed, Isaac Ariail. 2015. "Counting, Interpreting and Their Potential Interrelation in the Human Sciences." American Journal of Cultural Sociology 3(3): 353-64.
"Reuters-21578 Test Collection." n.d. Retrieved March 09, 2017. (http://www.daviddlewis.com/resources/testcollections/reuters21578/).
Roberts, Margaret, Brandon Stewart, Dustin Tingley, and Edoardo M. Airoldi. 2013. "The Structural Topic Model and Applied Social Science." Pp. 1-4 in Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation. https://scholar.princeton.edu/bstewart/publications/structural-topic-model-and-applied-social-science
Rousseeuw, Peter J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Computational and Applied Mathematics 20: 53-65.
Schmidt, Benjamin M. 2012. "Words Alone: Dismantling Topic Models in the Humanities." Journal of Digital Humanities 2(1). Retrieved April 2, 2018 (http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/).
Spillman, Lyn. 2015. “Ghosts of Straw Men: A Reply to Lee and Martin.” American Journal of Cultural Sociology 3(3): 365-79.
Steinbach, Michael, George Karypis, and Vipin Kumar. 2000. “A Comparison of Document Clustering Techniques.” in KDD Workshop on Text Mining. Minneapolis: University of Minnesota. 400(1): 525-26
Tausczik, Yla R. and James W. Pennebaker. 2010. “The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods.” Journal of Language and Social Psychology 29(1): 24-54.
Van Rijsbergen, C. J. 1979. Information Retrieval. London, England: Butterworth-Heinemann.

Author Biographies
Laura K. Nelson is an Assistant Professor of Sociology at Northeastern University, where she is also core faculty at NULab for Texts, Maps, and Networks and is on the Executive Committee for the Women’s, Gender, and Sexuality Studies program. She uses computational methods and open-source tools to research culture, social movements, organizations, and gender.
Derek Burk is a Senior Data Analyst on the IPUMS-International project at the Institute for Social Research and Data Innovation at the University of Minnesota. He specializes in developing data processing and analysis pipelines for survey and census data.
Marcel Knudsen is a Doctoral Candidate in the Department of Sociology at Northwestern University. His research focuses on workplaces and inequality, and his dissertation examines city minimum wage increases and their interaction with organizational cultures and hierarchies.
Leslie McCall is Presidential Professor of Sociology and Political Science and Associate Director of the Stone Center on Socio-Economic Inequality at the Graduate Center, City University of New York. Her research focuses on public opinion about inequality, opportunity, and related economic and policy issues; trends in actual earnings and family income inequality; and patterns of intersectional inequality.

