Text Data Visualization
What would you do if I asked you to explain textual data? What steps would you take to build a text-visualization story? I am not going to explain how to create the visualization story itself. Instead, this article will help you gather the information you need to build that story and explain the textual data.
Insights from textual data help us discover connections between articles, detect trends and patterns, and set aside the noise to uncover previously unknown information. This process is also known as Exploratory Text Analysis (ETA). We will analyze the textual data with methods such as K-means clustering, TF-IDF, and word frequency. ETA is also useful during data cleaning.
We will also visualize the results as graphs, word clouds, and plots using the Matplotlib, seaborn, and Plotly libraries. Before analyzing the textual data, complete the pre-processing tasks below.
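The word-frequency plots mentioned above can be sketched as follows (the sample text and output filename are illustrative; the Agg backend lets the script run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# Illustrative text; in practice this would be the cleaned article corpus.
text = "oil prices rise oil output cuts oil exports fall"
freq = Counter(text.split())
top = freq.most_common(5)

# Bar chart of the most frequent words
words, counts = zip(*top)
plt.bar(words, counts)
plt.title("Top word frequencies")
plt.savefig("word_freq.png")  # hypothetical output filename
```

The same `Counter` output can feed a word cloud or a Plotly figure instead of a Matplotlib bar chart.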
Retrieve Data from a Data Source
There is a lot of unstructured text data available for analysis. You can get data from the sources below.
1. Twitter text datasets from Kaggle.
2. Reddit and Twitter datasets retrieved via their APIs.
3. Articles scraped from websites using the BeautifulSoup and Requests Python libraries.
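A minimal sketch of option 3: with Requests you would fetch the page HTML, then parse it with BeautifulSoup. The URL, tag names, and CSS classes below are placeholders, and an inline snippet stands in for the fetched page so the example is self-contained:

```python
from bs4 import BeautifulSoup

# With Requests you would fetch a real page first, e.g.:
#   html = requests.get("https://example.com/article").text
# Here an inline snippet stands in for the fetched page.
html = """
<html><body>
  <h1 class="title">Sample headline</h1>
  <div class="article-body"><p>First paragraph.</p><p>Second.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", class_="title").get_text(strip=True)
body = " ".join(p.get_text(strip=True) for p in soup.select("div.article-body p"))
print(title, "-", body)
```

The tag and class names to target depend entirely on the site you scrape; inspect its HTML first.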
I am going to use the Reuters articles, which are available in SGML format. For the analysis, I will fetch the date, title, and article body from the data files using the BeautifulSoup library.
Use the code below to fetch the data from all the data files and store the output in a single CSV file.
from bs4 import BeautifulSoup
import pandas as pd
import csv

article_dict = {}
i = 0
list_of_data_num = []
# build the file-number suffixes "000" ... "021"
for j in range(0, 22):
    if j < 10:
        list_of_data_num.append("00" + str(j))
    else:
        list_of_data_num.append("0" + str(j))
# loop over all the data files to extract date, title and article body
for num in list_of_data_num:
    try:
        with open("data/reut2-" + num + ".sgm") as f:
            soup = BeautifulSoup(f, features='lxml')
    except FileNotFoundError:
        continue
    print(num)
    data_reuters = soup.find_all('reuters')
    for data in data_reuters:
        article_dict[i] = {}
        for date in data.find_all('date'):
            try:
                article_dict[i]["date"] = str(date.contents[0]).strip()
            except IndexError:
                article_dict[i]["date"] = None
        for title in data.find_all('title'):
            article_dict[i]["title"] = str(title.contents[0]).strip()
        for text in data.find_all('text'):
            try:
                article_dict[i]["text"] = str(text.contents[4]).strip()
            except IndexError:
                article_dict[i]["text"] = None
        i += 1
dataframe_article = pd.DataFrame(article_dict).T
dataframe_article.to_csv('articles_data.csv', header=True, index=False,
                         quoting=csv.QUOTE_ALL)
print(dataframe_article)
1. You can also use the re and os libraries to collect and loop over all the data files.
2. Each article’s body starts with the <REUTERS> tag, so use find_all('reuters').
3. You can also save the data with the pickle module instead of CSV.
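A small sketch of note 3, saving and restoring the DataFrame with pickle instead of CSV (the filename and sample data are illustrative):

```python
import pickle
import pandas as pd

# Illustrative stand-in for the extracted articles DataFrame
df = pd.DataFrame({"date": ["2-JUN-1987"], "title": ["Sample"], "text": ["Body"]})

# Save with the pickle module instead of to_csv
with open("articles_data.pkl", "wb") as f:
    pickle.dump(df, f)

# Load it back
with open("articles_data.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.equals(df))
```

Pickle preserves the exact dtypes and index, which CSV round-trips do not; pandas also offers `DataFrame.to_pickle` / `pd.read_pickle` as shortcuts.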
Data Cleaning Process
In this section, we remove noise such as null values, punctuation, and numbers from the textual data. First, we drop the rows whose text column is null. Then we deal with the null values in the other columns.
import pandas as pd
import re
import string

articles_data = pd.read_csv('articles_data.csv')
print(articles_data.apply(lambda x: sum(x.isnull())))
articles_nonNull = articles_data.dropna(subset=['text'])
articles_nonNull.reset_index(inplace=True)

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove \n,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
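A self-contained sketch of the full cleaning function: the square-bracket, newline, and number-word steps below are filled in from the goals the docstring states, so treat the exact regexes as one reasonable implementation rather than the only one.

```python
import re
import string

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove newlines,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)      # text in square brackets
    text = re.sub(r'<.*?>+', '', text)       # leftover markup tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', ' ', text)          # newlines -> spaces
    text = re.sub(r'\w*\d\w*', '', text)     # words containing numbers
    return text

print(clean_text("Oil [UPDATE] prices\nrose 10pct!"))
```

Note the ordering matters: square brackets must be stripped before the punctuation step, because `string.punctuation` includes `[` and `]` and would otherwise break the bracket pattern.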