Text Data Analysis and Visualization of Reuters Articles

Article link: https://blog.csdn.net/fendouaini/article/details/108957453


Author|Manmohan Singh Compiled by|VK Source|Towards Data Science

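If I asked you to interpret text data, what would you do? What steps would you take to build a text visualization?

This article gives you the information you need to build visualizations and interpret text data.

Insights drawn from text data help us find connections between articles and detect trends and patterns. Analyzing the text filters out noise and surfaces previously unknown information.

This process is also called exploratory text analysis (ETA). We analyze the text data with methods such as K-means, TF-IDF, and word frequency. ETA is also useful during data cleaning.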
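As a minimal illustration of the simplest of these methods, word frequency, the sketch below counts tokens across two made-up sentences using only the standard library. The sample strings and the helper name `word_frequencies` are assumptions for illustration and do not come from the original article.

```python
import re
from collections import Counter


def word_frequencies(documents):
    """Count how often each lowercase word appears across all documents."""
    counts = Counter()
    for doc in documents:
        # Extract runs of ASCII letters as tokens
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(tokens)
    return counts


# Made-up sample sentences, not taken from the Reuters data
sample_docs = [
    "Oil prices rose as oil demand grew.",
    "Grain prices fell.",
]
freq = word_frequencies(sample_docs)
print(freq.most_common(2))
```

On real articles you would usually remove stop words first, otherwise words like "the" dominate the counts.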

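We also use the Matplotlib, seaborn, and Plotly libraries to visualize the results as graphs, word clouds, and plots.

Before analyzing the text data, complete these preprocessing tasks.

Retrieving data from the data source

There is plenty of unstructured text data available for analysis. You can get data from the following sources.

Twitter text datasets from Kaggle.
Reddit and Twitter datasets retrieved through their APIs.
Articles scraped from websites with BeautifulSoup.

I will use Reuters articles in SGML format. To make the analysis easier, I use the BeautifulSoup library to extract the date, title, and article body from the data files.

Use the code below to extract the data from all the data files and store the output in a single CSV file.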

from bs4 import BeautifulSoup
import pandas as pd
import csv

article_dict = {}
i = 0

# Zero-padded file suffixes "000" through "021"
list_of_data_num = [f"{j:03d}" for j in range(22)]

# Loop over all articles to extract the date, title, and article body
for num in list_of_data_num:
    try:
        with open("data/reut2-" + num + ".sgm") as f:
            soup = BeautifulSoup(f, features='lxml')
    except (OSError, UnicodeDecodeError):
        # Skip files that are missing or cannot be decoded
        continue
    print(num)
    # The lxml parser lowercases tag names, so search for 'reuters'
    data_reuters = soup.find_all('reuters')
    for data in data_reuters:
        article_dict[i] = {}
        for date in data.find_all('date'):
            try:
                article_dict[i]["date"] = str(date.contents[0]).strip()
            except IndexError:
                article_dict[i]["date"] = None
        for title in data.find_all('title'):
            article_dict[i]["title"] = str(title.contents[0]).strip()
        for text in data.find_all('text'):
            try:
                # The article body is the fifth child node of the text element
                article_dict[i]["text"] = str(text.contents[4]).strip()
            except IndexError:
                article_dict[i]["text"] = None
        i += 1

# One row per article, with date, title, and text columns
dataframe_article = pd.DataFrame(article_dict).T
dataframe_article.to_csv('articles_data.csv', header=True, index=False, quoting=csv.QUOTE_ALL)
print(dataframe_article)

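You can also use the re and os libraries to combine or loop over all the data files.
Each article begins with a <REUTERS> tag, which is why we use find_all('reuters').
You can also save the data with the pickle module instead of CSV.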
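The two alternatives mentioned in these notes can be sketched as follows: listing the data files with `os` and `re` instead of building the numeric suffixes by hand, and saving the extracted dictionary with `pickle`. The helper names, the `.pkl` file name, and the placeholder record are assumptions for illustration; the demo runs in a temporary directory so it does not touch real data.

```python
import os
import pickle
import re
import tempfile


def list_data_files(data_dir):
    """Return the paths of reut2-NNN.sgm files in data_dir, sorted by name."""
    pattern = re.compile(r"reut2-\d{3}\.sgm")
    return sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if pattern.fullmatch(name)
    )


def save_articles(articles, path):
    """Serialize the article dictionary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(articles, f)


def load_articles(path):
    """Load the article dictionary back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("reut2-001.sgm", "reut2-000.sgm", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_data_files(tmp)]

    # Placeholder record, not real Reuters content
    articles = {0: {"date": "placeholder date", "title": "EXAMPLE", "text": "body"}}
    pickle_path = os.path.join(tmp, "articles_data.pkl")
    save_articles(articles, pickle_path)
    restored = load_articles(pickle_path)

print(found)
```

Unlike CSV, pickle keeps Python types such as None intact, but only load pickle files that you created yourself, since unpickling untrusted data can execute arbitrary code.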

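Cleaning the data

In this section we remove noise such as null values, punctuation, and numbers from the text data. First we drop the rows whose text column is null. Then we deal with the null values in the other columns.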

import re
import string

import pandas as pd

articles_data = pd.read_csv('articles_data.csv')
# Count the null values in each column
print(articles_data.apply(lambda x: sum(x.isnull())))
# Drop rows whose text is null; reset_index() keeps the old index as an 'index' column
articles_nonNull = articles_data.dropna(subset=['text']).reset_index()

def clean_text(text):
    '''Make text lowercase, remove text in angle brackets, remove newlines,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Replace newlines with a space so words on adjacent lines are not joined
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

articles_nonNull['text_clean'] = articles_nonNull['text'].apply(clean_text)

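When we drop the rows with a null text column, the null values in the other columns disappear as well.
We use functions from the re module to remove noise from the text data.

The steps needed during data cleaning may increase or decrease depending on the text data. So study your text data carefully and build the clean_text() method accordingly.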
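One way to decide which steps your data needs is to run the cleaning function on a sample string and inspect the result. The sketch below is a self-contained variant of clean_text() applied to a made-up sentence; the sample string is an assumption, and this variant substitutes a space for each newline (a choice made in this sketch so that words on adjacent lines are not joined).

```python
import re
import string


def clean_text(text):
    """Lowercase, strip <...> tags, punctuation, newlines, and words with digits."""
    text = str(text).lower()
    text = re.sub(r"<.*?>+", "", text)                                # remove tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # remove punctuation
    text = re.sub(r"\n", " ", text)                                   # newline -> space
    text = re.sub(r"\w*\d\w*", "", text)                              # remove words with digits
    return text


# Made-up sample sentence, not taken from the Reuters data
sample = "Prices ROSE 5pct <b>today</b>,\nsaid the U.S. firm."
cleaned = clean_text(sample)
tokens = cleaned.split()
print(tokens)
```

Note that stripping punctuation turns "U.S." into "us" and that removing "5pct" discards the figure entirely; if such tokens matter for your analysis, adjust or drop the corresponding step.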

With the preprocessing tasks ...
