Strong Women Through the Lens of The New York Times

API Data Collection and Text Analysis in Python

The goal of this project is to investigate women’s representation in The New York Times throughout the past 70 years by means of sentiment analysis, frequent term visualization and topic modeling.

For this investigation I scraped The New York Times data through the Archive API of The New York Times Developer Portal. First, you have to obtain an API key here. It’s free! The NYT just likes the concept of a regulated flood gate. Since this type of API is meant for bulk data collection, it doesn’t allow for effective filtering up front. Please follow the instructions in the Jupyter notebooks posted on GitHub if you wish to re-create the experiment. If you prefer a video version of this post, you can access it here.

Analysis pipeline. Image by author. Icons by Freepik.

All the instructions, code notebooks and results can also be accessed through my project repository on GitHub for smoother replication.

Data Collection via Archive API and Topic Modeling with SpaCy and Gensim

Before I proceed any further with my analysis, I decided to run topic modeling on the bulk of the articles from The New York Times between January 2019 and present day, September 2020, to analyze the headlines, keywords and the lead paragraphs. My goal was to distinguish the most prevalent issues and enduring topics in order to make sure that my research goes along the lines of the NYT mission statement, and I’m not misrepresenting their journalism style.

The data collection blueprint for this part of the analysis was inspired by a very informative tutorial by Brienna Herold.

Let’s import the necessary tools and libraries:

import os
import pandas as pd
import requests
import json
import time
import dateutil
import dateutil.parser  # dateutil.parser.parse is used below and needs an explicit import
import datetime
from dateutil.relativedelta import relativedelta
import glob

Determine the timeframe of the analysis:

end = datetime.date.today()
start = datetime.date(2019, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Breaking the data into monthly groups:

months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]
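
Each element of months_in_range is a [year, month] pair of strings, which is the form the Archive API endpoint expects. A quick sanity check (output shown for this date range; note that the %-m directive for a non-zero-padded month is a Unix-style strftime extension and may not be available on Windows, where %#m is the usual substitute):

print(months_in_range[:3])
# expected form: [['2019', '1'], ['2019', '2'], ['2019', '3']]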

The following set of helper functions (see the tutorial) extracts the NYT data through the API and saves it into monthly csv files:

def send_request(date):
    '''Sends a request to the NYT Archive API for given date.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    url = base_url + date[0] + '/' + date[1] + '.json?api-key=' + 'F9FPP1mJjiX8pAEFAxBYBg08vZECa39n'  # replace with your own API key
    try:
        response = requests.get(url, verify=False).json()
    except Exception:
        return None
    time.sleep(6)
    return response


def is_valid(article, date):
    '''An article is only worth checking if it is in range, and has a headline.'''
    is_in_range = date > start and date < end
    has_headline = type(article['headline']) == dict and 'main' in article['headline'].keys()
    return is_in_range and has_headline


def parse_response(response):
    '''Parses and returns response as pandas data frame.'''
    data = {'headline': [],
            'date': [],
            'doc_type': [],
            'material_type': [],
            'section': [],
            'keywords': [],
            'lead_paragraph': []}

    articles = response['response']['docs']
    for article in articles:  # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main'])
            if 'section_name' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article:
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
            if 'lead_paragraph' in article:
                data['lead_paragraph'].append(article['lead_paragraph'])
            else:
                data['lead_paragraph'].append(None)
    return pd.DataFrame(data)


def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    total = 0
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    if not os.path.exists('headlines'):
        os.mkdir('headlines')
    for date in dates:
        print('Working on ' + str(date) + '...')
        csv_path = 'headlines/' + date[0] + '-' + date[1] + '.csv'
        if not os.path.exists(csv_path):  # If we don't already have this month
            response = send_request(date)
            if response is not None:
                df = parse_response(response)
                total += len(df)
                df.to_csv(csv_path, index=False)
                print('Saving ' + csv_path + '...')
    print('Number of articles collected: ' + str(total))

Let’s take a closer look at the helper functions:

  • send_request(date) sends a request to the archive for a given date, converts the response into the json format and returns it.

  • is_valid(article, date) checks whether an article is within the requested timeframe, confirms the presence of a headline and returns the combined is_in_range and has_headline verdict.

  • parse_response(response) transforms the response into a DataFrame. data is a dictionary that contains columns of our DataFrame, which are empty at first, but will get appended to by this function. The function returns the final DataFrame.

  • get_data(dates), where dates correspond to the range specified by the user, ties the send_request() and parse_response() functions together. It saves headlines and other info to .csv files, one file per month per year within the range, as shown in the call below.

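With the helpers in place, collecting the whole range is a single call (a minimal usage sketch, assuming the months_in_range list built earlier):

# Writes one csv per month into the headlines/ folder and reports the total count
get_data(months_in_range)
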
Once we get our monthly csv files for each year within the range, we can concatenate them for further use. The glob library is an excellent tool for that. Make sure your path to the headlines folder matches the path in your code. I used a relative path as opposed to an absolute path for mine.

# get data file names
path = "headlines/"
filenames = glob.glob(path + "*.csv")

dfs = []
print(filenames)
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

big_frame is a DataFrame that consists of all the files from the headlines folder concatenated into one frame. This is the expected output:

135,954 articles and their data were pulled.

Now, we are ready for topic modeling. The purpose of the analysis below is to run topic modeling on headlines, keywords and lead paragraphs of The New York Times articles for the past year and a half. I want to make sure that headlines are consistent with the introductory paragraphs and keywords.

Importing tools and libraries:

from collections import defaultdict  
import re, string #regular expressions
from gensim import corpora # this is the topic modeling library
from gensim.models import LdaModel

Let’s take a closer look:

  • defaultdict is useful for counting the unique words and their appearances.

  • re and string are useful when we’re looking for a match in the text, either a full or a fuzzy one. Regular expressions are going to appear often if you’re interested in text analysis; here’s a handy tool to practice those.

  • gensim is a library we are going to use for topic modeling. It is user-friendly once you get the necessary dependencies sorted out.

Since we are looking at three different columns of the DataFrame, three different instances of the corpus will be instantiated: a corpus that holds headlines, a corpus for the keywords and a corpus for the lead paragraphs. This is meant to be a sanity check to make sure headlines and keywords and lead paragraphs are consistent with the article’s content.

big_frame_corpus_headline = big_frame['headline']
big_frame_corpus_keywords = big_frame['keywords']
big_frame_corpus_lead = big_frame['lead_paragraph']

In order for the text data to be usable, it needs to be pre-processed. In general, the pipeline looks like this: lowercasing and punctuation removal, stemming, lemmatization and tokenization, then stop-word removal and vectorization. The first four operations are shown as a cluster, because the order of those operations often depends on the data, and in certain cases it might make sense to switch the order of operations.

Text pre-processing steps. Image by author. Icons by Freepik.

Let’s talk about pre-processing.

from nltk.corpus import stopwords

headlines = [re.sub(r'[^\w\s]','',str(item)) for item in big_frame_corpus_headline]
keywords = [re.sub(r'[^\w\s]','',str(item)) for item in big_frame_corpus_keywords]
lead = [re.sub(r'[^\w\s]','',str(item)) for item in big_frame_corpus_lead]

stopwords = set(stopwords.words('english'))
# please note: you can append to this list of pre-defined stopwords if needed

More pre-processing:

headline_texts = [[word for word in document.lower().split() if word not in stopwords] for document in headlines]
keywords_texts = [[word for word in document.lower().split() if word not in stopwords] for document in keywords]
lead_texts = [[word for word in document.lower().split() if word not in stopwords] for document in lead]
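
The cells above cover lowercasing, punctuation removal, tokenization and stop-word removal. If you also want the stemming/lemmatization step from the diagram, NLTK's WordNetLemmatizer is one option; a minimal sketch, assuming the headline_texts produced above (the same pattern applies to keywords_texts and lead_texts):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Reduce each token to its dictionary form, e.g. 'doctors' -> 'doctor'
headline_texts = [[lemmatizer.lemmatize(token) for token in doc] for doc in headline_texts]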

Removing less frequent words:

frequency = defaultdict(int)
for headline_text in headline_texts:
    for token in headline_text:
        frequency[token] += 1
for keywords_text in keywords_texts:
    for token in keywords_text:
        frequency[token] += 1
for lead_text in lead_texts:
    for token in lead_text:
        frequency[token] += 1

headline_texts = [[token for token in headline_text if frequency[token] > 1] for headline_text in headline_texts]
keywords_texts = [[token for token in keywords_text if frequency[token] > 1] for keywords_text in keywords_texts]
lead_texts = [[token for token in lead_text if frequency[token] > 1] for lead_text in lead_texts]

dictionary_headline = corpora.Dictionary(headline_texts)
dictionary_keywords = corpora.Dictionary(keywords_texts)
dictionary_lead = corpora.Dictionary(lead_texts)

headline_corpus = [dictionary_headline.doc2bow(headline_text) for headline_text in headline_texts]
keywords_corpus = [dictionary_keywords.doc2bow(keywords_text) for keywords_text in keywords_texts]
lead_corpus = [dictionary_lead.doc2bow(lead_text) for lead_text in lead_texts]

Let’s decide on the optimal number of topics for our case:

NUM_TOPICS = 5
ldamodel_headlines = LdaModel(headline_corpus, num_topics=NUM_TOPICS, id2word=dictionary_headline, passes=12)
ldamodel_keywords = LdaModel(keywords_corpus, num_topics=NUM_TOPICS, id2word=dictionary_keywords, passes=12)
ldamodel_lead = LdaModel(lead_corpus, num_topics=NUM_TOPICS, id2word=dictionary_lead, passes=12)
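
NUM_TOPICS is set to 5 above. If you want a more data-driven way to pick it, gensim's CoherenceModel can score a few candidate counts; a rough sketch on the headline corpus built above (higher c_v coherence is generally better, and this re-trains a model per candidate, so it takes a while):

from gensim.models import CoherenceModel

for k in range(3, 9):
    lda_k = LdaModel(headline_corpus, num_topics=k, id2word=dictionary_headline, passes=12)
    cm = CoherenceModel(model=lda_k, texts=headline_texts, dictionary=dictionary_headline, coherence='c_v')
    print(k, round(cm.get_coherence(), 3))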

Here’s the result:

topics_headlines = ldamodel_headlines.show_topics()
for topic_headlines in topics_headlines:
    print(topic_headlines)

topics_keywords = ldamodel_keywords.show_topics()
for topic_keywords in topics_keywords:
    print(topic_keywords)

topics_lead = ldamodel_lead.show_topics()
for topic_lead in topics_lead:
    print(topic_lead)

Let’s organize those into dataframes:

word_dict_headlines = {}
for i in range(NUM_TOPICS):
    words_headlines = ldamodel_headlines.show_topic(i, topn=20)
    word_dict_headlines['Topic # ' + '{:02d}'.format(i+1)] = [w[0] for w in words_headlines]
pd.DataFrame(word_dict_headlines)

word_dict_keywords = {}
for i in range(NUM_TOPICS):
    words_keywords = ldamodel_keywords.show_topic(i, topn=20)
    word_dict_keywords['Topic # ' + '{:02d}'.format(i+1)] = [w[0] for w in words_keywords]
pd.DataFrame(word_dict_keywords)

word_dict_lead = {}
for i in range(NUM_TOPICS):
    words_lead = ldamodel_lead.show_topic(i, topn=20)
    word_dict_lead['Topic # ' + '{:02d}'.format(i+1)] = [w[0] for w in words_lead]
pd.DataFrame(word_dict_lead)

Remember: even though the algorithm can sort the words into the corresponding topics, it’s still up to a human to interpret and label them.

Topic modeling results. Image by author. Icons by Freepik.

A variety of topics showed up, all of them serious and important issues in our society. In this particular research, we are going to investigate gender representation.

1950 to Present: Data Collection and Keyword Analysis

We will use the previously mentioned helper functions in order to get the data from January 1st, 1950 to the present day, September 2020. I recommend collecting in smaller increments of time, e.g. a decade at a time, to keep the API from timing out.

The data will be collected into the headlines folder and then concatenated into one dataframe using the methods illustrated above. Once you get the dataframe you worked so hard for, I suggest pickling it for further use:

import pickle

with open('frame_all.pickle', 'wb') as to_write:
    pickle.dump(frame, to_write)

Here’s how you extract the pickled files:

with open('frame_all.pickle', 'rb') as read_file:
    df = pickle.load(read_file)

Total articles found vs. relevant articles for the timeframe of 70 years. Image by author. Template by Slidesgo.

Let’s convert the date column into the datetime format so that the articles can be sorted chronologically. We will also be removing nulls and duplicates.

df['date'] = pd.to_datetime(df['date'])
df = df[df['headline'].notna()].drop_duplicates().sort_values(by='date')
df.dropna(axis=0, subset=['keywords'], inplace=True)

Examining the relevant keywords:

import ast

df.keywords = df.keywords.astype(str).str.lower().transform(ast.literal_eval)
keyword_counts = pd.Series(x for l in df['keywords'] for x in l).value_counts(ascending=False)
len(keyword_counts)

58,298 unique keywords.

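Before deciding which of those are relevant, it can help to peek at the most common entries (the exact counts depend on the date range you pulled):

# Top subject keywords by article count in the collected data
keyword_counts.head(10)
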
I used my personal judgement to determine which keywords are relevant to the topic of strong women and their representation: politics, social activism, entrepreneurship, science, technology, military achievement, athletic breakthroughs and female leadership. This analysis is not in any way meant to exclude any groups or individuals from the notion of strong women. I am open to additions and suggestions, so please don’t hesitate to reach out if you think there’s something that can be done to make this project more comprehensive. A quick reminder: if you find the code in the cells challenging to copy due to formatting issues, please refer to the code and instructions in my project repository.

project_keywords1 = [x for x in keyword_counts.keys() if 'women in politics' in x 
or 'businesswoman' in x
or 'female executive' in x
or 'female leader' in x
or 'female leadership' in x
or 'successful woman' in x
or 'female entrepreneur' in x
or 'woman entrepreneur' in x
or 'women in tech' in x
or 'female technology' in x
or 'female startup' in x
or 'female founder' in x ]

Above is a sample query for relevant keywords. A more detailed explanation on relevant keyword search and article headline extraction can be found in this notebook.

Now, let’s examine the headlines that have to do with women in politics.

First, we normalize them by lowercasing:

df['headline'] = df['headline'].astype(str).str.lower()

Examine the headlines that contain words like woman, politics and power:

# match any of the terms on each side of the & (a plain Python `or` between strings would keep only the first one)
wip_headlines = df[df['headline'].str.contains('women|woman|female') & df['headline'].str.contains('politics|power|election')]

‘wip’ stands for ‘women in politics’.

Our search returned only 185 headlines. Let’s look at the keywords to supplement that.

df['keywords'].dropna()
df['keywords_joined'] = df.keywords.apply(', '.join)
df['keywords_joined'] = df['keywords_joined'].astype(str)
import re
wip_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*politics)',regex=True)]
Women in politics: resulting DataFrame

The DataFrame above contains 2579 articles based on relevant keywords. We will perform an outer join on the keywords and the headlines dataframes in order to obtain a more comprehensive one:

wip_df = pd.concat([wip_headlines, wip_keywords], axis=0, sort = True)
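
Since pd.concat simply stacks the two frames, an article that matched on both its headline and its keywords will appear twice; it may be worth dropping such duplicates afterwards. A small sketch, relying on the fact that both frames were sliced from the same df and therefore share its index:

# Keep one row per original article (repeated index values mean the article matched both queries)
wip_df = wip_df[~wip_df.index.duplicated(keep='first')]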

Using the same techniques, we will be able to obtain more data about women in the military, science, sports, entrepreneurship and other fields of achievement. For example, if we were to look for the articles about feminism:

feminist_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*feminist)',regex=True)]
Articles based on the keyword search: feminism

The #MeToo movement:

metoo_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*metoo)(?=.*movement)',regex=True)]

Regular expressions and fuzzy matching allow for nearly endless possibilities. You can see more queries in this notebook.

The final DataFrame, after all the querying is complete, will further be referred to as project_df in the code notebooks on GitHub and in this article.

Let’s look at the article distribution over the years:

import matplotlib.pyplot as plt

ax = df.groupby(df.date.dt.year)['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Total Published Every Year', fontdict={'fontsize': 24, 'fontweight': 'medium'})
plt.show()
ax = project_df.groupby('year')['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Articles About Strong Women (based on relevant keywords) Published Every Year', \
fontdict={'fontsize': 20, 'fontweight': 'medium'})
plt.show()

If we were to superimpose these two graphs, the blue one nearly disappears:

Relevant publications, based on keywords and headlines, are almost invisible when compared to the bulk of articles published over time.

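For reference, one way to produce an overlay like the one above is to draw both yearly counts on the same axes; a sketch, assuming df and project_df from the previous cells:

import matplotlib.pyplot as plt

all_counts = df.groupby(df.date.dt.year)['headline'].count()
relevant_counts = project_df.groupby('year')['headline'].count().reindex(all_counts.index, fill_value=0)

fig, ax = plt.subplots(figsize=(20, 6))
ax.bar(all_counts.index, all_counts.values, label='All articles')
ax.bar(relevant_counts.index, relevant_counts.values, label='Articles about strong women')
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.legend()
plt.show()
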
The coverage of women’s issues appears to be modest. I believe it may be due to the fact that the keywords weren’t always coded properly: some were either missing or misleading, making it more difficult for a researcher to find the desired material through the Archive API.

Throughout my analysis, I made an interesting discovery. In the early 1950s, according to the analysis of n-grams, there were many mentions of professional opportunities for women. A lot of them graduated from universities to become doctors in order to join the navy. I attribute this spike of publicity to the aftermath of World War II: women were encouraged to join the workforce in order to supplement the military effort. Remember the Rosie the Riveter poster?

TimesMachine, the NYT archive of publications. Image created by the author using those clippings.

Even though it’s heart-warming and uplifting to see those kinds of opportunities available to women during the times when not too many doors were open for them, I really wish it wasn’t due to warfare.

N-grams, WordCloud and Sentiment Analysis

To explore overall term frequencies in headlines:

from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1,3), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(corpus)
frequencies = sum(sparse_matrix).toarray()[0]
ngram_df_project = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])

from wordcloud import WordCloud, STOPWORDS

all_headlines = ' '.join(project_df['headline'].str.lower())
stopwords = STOPWORDS
stopwords.add('will')
# Note: you can append your own stopwords to the existing ones.

wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000, width=480, height=480).generate(all_headlines)

plt.figure(figsize=(20,10))
plt.imshow(wordcloud)
plt.axis("off");
WordCloud created by the code above: most frequent terms are displayed in larger font.

We can also create wordclouds based on features such as various timeframes, or specific keywords. Refer to the notebook for more visuals.

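For instance, a decade-specific cloud only requires filtering project_df before joining the headlines; a sketch along the lines of the cell above, assuming the numeric year column used earlier:

# Headlines from the 1970s only
decade_df = project_df[(project_df['year'] >= 1970) & (project_df['year'] < 1980)]
decade_headlines = ' '.join(decade_df['headline'].str.lower())

decade_cloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000, width=480, height=480).generate(decade_headlines)
plt.figure(figsize=(20,10))
plt.imshow(decade_cloud)
plt.axis("off");
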
Let’s talk about sentiment analysis. We are going to analyze the sentiment associated with the headlines, using NLTK’s VADER library. Can we actually pick up on how the journalists felt about an issue while writing an article?

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
results = []

for line in project_df.headline:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)

print(results[:3])

Output:

[{'neg': 0.0, 'neu': 0.845, 'pos': 0.155, 'compound': 0.296, 'headline': 'women doctors join navy; seventeen end their training and are ordered to duty'}, {'neg': 0.18, 'neu': 0.691, 'pos': 0.129, 'compound': -0.2732, 'headline': 'n.y.u. to graduate 21 women doctors; war gave them, as others, an opportunity to enter a medical school'}, {'neg': 0.159, 'neu': 0.725, 'pos': 0.116, 'compound': -0.1531, 'headline': 'greets women doctors; dean says new york medical college has no curbs'}]

Sentiment as a dataframe:

sentiment_df = pd.DataFrame.from_records(results)
dates = project_df['year']
sentiment_df = pd.merge(sentiment_df, dates, left_index=True, right_index=True)

The code above allows us to have a timeline for our sentiment. To simplify the sentiment analysis, we are going to create some new categories for positive, negative and neutral.

sentiment_df['label'] = 0
sentiment_df.loc[sentiment_df['compound'] > 0.2, 'label'] = 1
sentiment_df.loc[sentiment_df['compound'] < -0.2, 'label'] = -1
sentiment_df.head()

To visualize overall sentiment distribution:

sentiment_df.label.value_counts(normalize=True) * 100
Image by author. Template by Slidesgo.

To visualize sentiment over time:

import seaborn as sns

sns.lineplot(x="year", y="label", data=sentiment_df)
plt.show()
Sentiment fluctuates due to the complexity of the subject matter.

As you can see, the sentiment fluctuates. It’s not at all unexpected, since women’s issues often encompass heavy subject matter, such as violence and abuse. In these cases, we expect the sentiment to be skewed towards the negative end of the spectrum.

I created a Tableau Dashboard where viewers can interact with the visualization. It’s available through my Tableau Public profile. This dashboard illustrates the keyword distribution over the decades.

Image by author.

Conclusions

The New York Times has visibly improved on equal gender representation over the years. If I were to make a suggestion, I would recommend expanding the keyword listings: the further back we go in the Archive API, the more comprehensive and robust the keywords need to be to facilitate the search.

It is important to keep showcasing female leadership, until it becomes just leadership. Imagine a world where the adjective “female” is no longer needed to describe achievement, because it has become redundant. Imagine a world with no “female doctors” or “female engineers”: just doctors and engineers. Founders and politicians. Entrepreneurs, scientists and trailblazers. Our goal as a society is to develop a solid mental model of these titles being held by diverse groups of people. Together, we can achieve that by constantly reminding ourselves and the society around us that no gender or nationality should be barred from those opportunities.

Source: https://towardsdatascience.com/strong-women-through-the-lens-of-the-new-york-times-f7f7468a2645
