DW-NLP-Task02: Data Loading and Data Analysis

叶xinwu

Category: DW-NLP

Article link: https://blog.csdn.net/ycs1010647987/article/details/107521825

Data loading

Reading the whole file raised a MemoryError that I have not yet resolved, so only the first 100 rows are read here.
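One possible way around the MemoryError mentioned above is to let pandas stream the file in pieces through the `chunksize` parameter of `read_csv`, keeping only a small per-article summary in memory. The following is a minimal sketch, assuming the file has tab-separated `label` and `text` columns; the helper name `text_lengths_in_chunks` and the in-memory sample rows are made up for illustration and are not from the original post.

```python
import io

import pandas as pd


def text_lengths_in_chunks(source, chunksize=10000):
    """Return one Series of token counts, reading `source` chunk by chunk.

    `source` is a path or file-like object holding tab-separated
    `label` and `text` columns (hypothetical helper, for illustration).
    """
    lengths = []
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading every row at once.
    reader = pd.read_csv(source, sep='\t', chunksize=chunksize,
                         dtype={'text': str})
    for chunk in reader:
        # Keep only the per-article length, not the raw text.
        lengths.append(chunk['text'].apply(lambda x: len(x.split(' '))))
    return pd.concat(lengths, ignore_index=True)


# Tiny in-memory stand-in for train_set.csv (made-up rows).
sample = io.StringIO("label\ttext\n0\t1 2 3\n1\t4 5\n0\t6 7 8 9\n")
lengths = text_lengths_in_chunks(sample, chunksize=2)
print(lengths.tolist())
```

On the real data the same function could be called with the file path in place of `sample`; whether this is enough to avoid the error depends on the machine, so treat it as something to try rather than a guaranteed fix.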

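# Data loading
import pandas as pd
train_df = pd.read_csv('F:/学习/DW-NLP/train_set.csv', sep='\t', nrows=100)
'''
This read_csv call takes three arguments:
the path of the file to read;
sep, the delimiter between columns, set to a tab (\t) here;
nrows, the number of rows to read from the start of the file, an integer (100 here).
'''
print(train_df.head())
'''
head() is a pandas method that returns the first n rows. n defaults to 5, so this prints the first 5 rows;
use train_df.head(n=10) to get the first 10 rows.
'''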

Result: the first five rows of train_df are printed.

Data analysis

Sentence length analysis

# Data analysis
# Sentence length analysis
%pylab inline
'''
%pylab in essence adds numpy and matplotlib to your session.
It was added to IPython as a transition tool, and the current recommendation is that you should not use it.
The core reason is that the set of commands below imports too much into the global namespace, and it also
doesn't allow you to change the matplotlib mode from UI to QT or something else.
The explicit alternative is %matplotlib inline followed by import numpy as np and import matplotlib.pyplot as plt.
This is what %pylab does:

import numpy
import matplotlib
from matplotlib import pylab, mlab, pyplot
np = numpy
plt = pyplot

from IPython.core.pylabtools import figsize, getfigs

from pylab import *
from numpy import *
'''
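train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
'''
1. lambda defines an anonymous function.
In a lambda expression the parameters come before the colon (there can be several, separated by commas) and the expression after the colon is the return value.
2. apply here is the pandas method, in simplified form Series.apply(func, args=(), **kwargs).
Purpose: it calls func on every element of the Series (here, on each news text) and returns a new Series holding the results.
args is a tuple of extra positional arguments passed to func after the element, in the order func expects them:
for a function A(x, a, b), args=(3, 4) gives a=3 and b=4, not the other way round.
kwargs holds extra keyword arguments, which are passed to func by name, so their order does not matter.
The return value of apply is the Series of func's return values, one per element.
3. split()
Description
str.split() splits a string on the given separator; if maxsplit is given, at most maxsplit splits are made, producing at most maxsplit+1 substrings.
Syntax
str.split(sep=None, maxsplit=-1)
Parameters
sep -- the separator. By default any run of whitespace, including spaces, newlines (\n) and tabs (\t).
maxsplit -- the maximum number of splits. The default, -1, means no limit.
Return value
A list of the resulting substrings.
'''
print(train_df['text_len'].describe())  # pandas describe() shows descriptive statistics of the data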

Result:
[Image: output of train_df['text_len'].describe()]
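The code above calls `split(' ')` with an explicit separator, which behaves differently from the no-argument `split()` when a text contains repeated spaces. A small sketch of the difference, using a made-up string (the sample value is an assumption, not taken from the dataset):

```python
# Made-up token string with a doubled space between '900' and '648'.
text = "3750 900  648"

by_space = text.split(' ')       # explicit separator keeps empty strings
by_whitespace = text.split()     # default collapses runs of whitespace
first_only = text.split(' ', 1)  # maxsplit=1 gives at most 2 pieces

print(by_space)
print(by_whitespace)
print(first_only)
```

If the texts are always separated by exactly one space the two forms give the same length, but with stray double spaces `split(' ')` would count the empty strings as characters.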

Plotting

# Plotting
_ = plt.hist(train_df['text_len'], bins=200)  # hist draws a histogram; bins is the number of bars
plt.xlabel('Text char count')
plt.title("Histogram of char count")

Result:
[Image: histogram of text_len, titled "Histogram of char count"]

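News category distribution

# News category distribution
train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")
'''
value_counts() is a quick way to see which distinct values a column contains and how many times each of them occurs.
value_counts() is a Series method, so when working with a DataFrame you normally select the column (or row) to apply it to first.
'''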

Result:
[Image: bar chart of label counts, titled "News class count"]
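When comparing classes it can help to look at proportions as well as raw counts. A brief sketch using the `normalize` parameter of `value_counts`; the `labels` Series is made-up data standing in for `train_df['label']`:

```python
import pandas as pd

# Made-up labels standing in for train_df['label'].
labels = pd.Series([0, 1, 0, 2, 0, 1])

counts = labels.value_counts()                # absolute counts, most frequent first
shares = labels.value_counts(normalize=True)  # same order, as fractions of the total

print(counts.to_dict())
print(shares.round(3).to_dict())
```

The fractions sum to 1, which makes it easier to see how unbalanced the categories are regardless of how many rows were read.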

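Character distribution statistics

# Character distribution statistics
from collections import Counter
all_lines = ' '.join(list(train_df['text']))  # join all news texts with ' '
word_count = Counter(all_lines.split(" "))  # split breaks the string into a list of characters at each space; Counter counts how often each character occurs
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)
'''
1. sorted syntax:

sorted(iterable, key=None, reverse=False)
Parameters:

iterable -- an iterable object.
key -- a function of one argument; it is called on each element of the iterable and its return value is used for the comparison.
reverse -- the sort order: reverse=True is descending, reverse=False is ascending (the default).
Return value
A new sorted list.
2. items: in Python 3, items() on a dict (Counter is a dict subclass) returns an iterable view of (key, value) tuples.
3. key=lambda d:d[1] sorts by the value of each (key, value) pair, while key=lambda d:d[0] would sort by the key; here the characters are sorted by how many times they occur.
'''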
print(len(word_count))

print(word_count[0])

print(word_count[-1])

Result:
[Image: printed output of len(word_count), word_count[0] and word_count[-1]]
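The sort by count above can also be written with `Counter.most_common()`, which returns the same (character, count) pairs ordered from most to least frequent. A short sketch on made-up tokens (the `tokens` list is an assumption standing in for `all_lines.split(' ')`):

```python
from collections import Counter

# Made-up tokens standing in for all_lines.split(' ').
tokens = "3750 648 900 3750 648 3750".split(' ')
word_count = Counter(tokens)

# Explicit sort by count, as in the code above.
by_sorted = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
# Built-in shortcut that gives the same ordering.
by_most_common = word_count.most_common()

print(by_sorted)
print(by_most_common[0], by_most_common[-1])
```

Keeping the Counter object around, rather than overwriting it with the sorted list, also leaves direct lookups such as `word_count['3750']` available.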

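Character distribution statistics: inferring punctuation

'''
Punctuation marks can also be inferred from how characters are spread across the texts. The code below counts, for each
character, the number of news texts it appears in. Characters 3750, 900 and 648 cover close to 99% of the 200,000 news texts, so they are very likely punctuation marks.
'''
from collections import Counter
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
'''
set() removes duplicates, so each news text in train_df['text'] keeps every character it contains exactly once
'''
all_lines = ' '.join(list(train_df['text_unique']))  # counting this gives the number of news texts each character appears in

word_count = Counter(all_lines.split(" "))

word_count = sorted(word_count.items(), key=lambda d:int(d[1]), reverse = True)

print(word_count[0])

print(word_count[1])

print(word_count[2])
'''
Only 100 texts were read, but 3750, 900 and 648 appear in almost every one of them, so they are probably punctuation marks'''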

Result:
[Image: printed output of word_count[0], word_count[1] and word_count[2]]
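The "coverage" used in this step is the share of news texts that contain a character at least once. A minimal sketch of that calculation with a Counter updated from per-text sets; the `texts` list is made-up data standing in for `train_df['text']`, and the variable names are illustrative:

```python
from collections import Counter

# Made-up articles standing in for train_df['text'].
texts = [
    "12 3750 45 900",
    "3750 67 3750 648",
    "900 89 3750",
    "5 6 7",
]

doc_freq = Counter()
for text in texts:
    # set() keeps each character once per article, so the counter ends up
    # holding the number of articles that contain the character.
    doc_freq.update(set(text.split(' ')))

# Fraction of articles in which each character appears.
coverage = {char: n / len(texts) for char, n in doc_freq.items()}
for char, n in doc_freq.most_common(3):
    print(char, n, coverage[char])
```

This avoids building one long joined string, and the ratio is easier to compare across samples of different size than the raw count.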

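Exercise

Reference: https://blog.csdn.net/qq_36831845/article/details/107514046

Assuming that characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article in the competition data contain on average?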

# Treat 3750, 900 and 648 in turn as the sentence-ending mark
for char in ['3750', '900', '648']:
    train_df['n_sentences_by_{}'.format(char)] = train_df['text'].apply(lambda x:Counter(x.split(' '))[char])
    _ = plt.hist(train_df['n_sentences_by_{}'.format(char)], bins=100)
    plt.xlabel('sentences count by {}'.format(char))
    plt.title("Histogram of sentences count")
    plt.show()
    print(train_df['n_sentences_by_{}'.format(char)].describe())

Result:
[Image: histograms and describe() output of the sentence counts for 3750, 900 and 648]
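To get a single average for the exercise, one option is to count all three assumed punctuation characters together in each article and then take the mean. This is only a sketch under the exercise's assumption that 3750, 900 and 648 each end a sentence; the helper `count_sentences` and the sample `texts` are hypothetical:

```python
from collections import Counter

# Assumption carried over from the exercise: these three characters are punctuation.
PUNCTUATION = ('3750', '900', '648')


def count_sentences(text, marks=PUNCTUATION):
    """Count how many tokens in `text` are one of the assumed punctuation marks."""
    counts = Counter(text.split(' '))
    return sum(counts[m] for m in marks)


# Made-up articles standing in for train_df['text'].
texts = [
    "12 34 3750 56 900",
    "78 648 90 648 11 900",
    "13 14 15",
]
per_article = [count_sentences(t) for t in texts]
average = sum(per_article) / len(per_article)
print(per_article, average)
```

On the DataFrame the same helper could be used as `train_df['text'].apply(count_sentences).mean()`. Note that an article whose last sentence has no closing mark is undercounted by one.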

# Analyze the last character of each news article
train_df['last_word'] = train_df.text.apply(lambda x:x.split(' ')[-1])
last_word_count = Counter(train_df['last_word'])
last_word_count.most_common(10)

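Result:
[Image: output of last_word_count.most_common(10)]
With only the first 100 rows, this analysis cannot tell which character is the period. According to the results posted by other participants, analysis of the full dataset shows that 900 occurs far more often than the other two, so it is the most likely to be the period.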

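# Print the number of news articles that end with each of 3750, 900 and 648
for char in ["3750","900","648"]:
    print(last_word_count[char])

The sample analyzed is too small, so the result is omitted.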

Most frequent characters in each news category

grouped = train_df[['label','text']].groupby('label')
# word_count_in_label = {}
for name, group in grouped:
    all_lines = ' '.join(list(group.text))
    word_count = Counter(all_lines.split(' '))
    # word_count_in_label[name] = word_count
    print("Label {:>2d}: the five most frequent characters are {}".format(name, word_count.most_common(5)))

Result omitted.
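If 3750, 900 and 648 really are punctuation, they will probably top the list in every category and hide the characters that distinguish one class from another. A possible variation, sketched in plain Python on made-up `(label, text)` pairs, filters them out before counting; the `PUNCTUATION` set reflects the assumption from the exercise, and the sample rows are not from the dataset:

```python
from collections import Counter, defaultdict

# Assumed punctuation characters, as in the exercise above.
PUNCTUATION = {'3750', '900', '648'}

# Made-up (label, text) pairs standing in for train_df[['label', 'text']].
rows = [
    (0, "11 3750 11 22 900"),
    (0, "11 33 648"),
    (1, "44 44 3750 55 3750 3750"),
]

counts_by_label = defaultdict(Counter)
for label, text in rows:
    # Skip the assumed punctuation so it does not dominate every class.
    counts_by_label[label].update(
        t for t in text.split(' ') if t not in PUNCTUATION
    )

for label in sorted(counts_by_label):
    print(label, counts_by_label[label].most_common(2))
```

With pandas, the same filter could be applied inside the loop over `grouped` before building the Counter.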