Study link: Data Reading and Data Analysis
Data reading
import pandas as pd
# The dataset is large, so read only the first 100 rows
train_df = pd.read_csv(r'D:\python\python3.6\pysl\Pre_\nlp_data\train_set.csv', sep='\t', nrows=100)
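`nrows` is enough when only a sample is needed. If the whole file has to be scanned without loading it at once, `chunksize` is one option. The following is a minimal sketch, assuming a tab-separated file with `label` and `text` columns; the in-memory sample data and the helper name `count_rows_in_chunks` are made up for illustration and are not part of the original notes.

```python
import io
import pandas as pd

# Hypothetical stand-in for train_set.csv: tab-separated, columns `label` and `text`.
SAMPLE_TSV = (
    "label\ttext\n"
    "2\t57 44 66 56\n"
    "11\t4464 486 6352\n"
    "3\t7346 4068\n"
    "2\t3750 900 648\n"
    "7\t12 34 56 78 90\n"
)

def count_rows_in_chunks(source, chunksize=2):
    """Read a TSV in chunks and return (total rows, number of chunks)."""
    total_rows = 0
    n_chunks = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(source, sep='\t', chunksize=chunksize):
        total_rows += len(chunk)
        n_chunks += 1
    return total_rows, n_chunks

# nrows reads only the first rows of the file.
head_df = pd.read_csv(io.StringIO(SAMPLE_TSV), sep='\t', nrows=3)
total_rows, n_chunks = count_rows_in_chunks(io.StringIO(SAMPLE_TSV))
print(head_df.shape, total_rows, n_chunks)
```

For the real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.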
Data analysis
- Sentence length analysis
The split() method splits a string into substrings at the specified separator and returns them as a list; the separator itself is not included in the result.
s.split(sep=None, maxsplit=-1)
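The code in these notes uses both `split()` and `split(' ')`. The two differ when a text contains consecutive spaces, which matters when counting tokens. A small illustration with a made-up string that mimics the space-separated character ids in the dataset:

```python
# Made-up sample; note the double space before 56.
s = "57 44 66  56"

# Default separator: runs of whitespace count as one separator.
print(s.split())        # ['57', '44', '66', '56']

# Explicit ' ' separator: consecutive spaces produce an empty string.
print(s.split(' '))     # ['57', '44', '66', '', '56']

# maxsplit=1: split at most once, from the left.
print(s.split(' ', 1))  # ['57', '44 66  56']
```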
import matplotlib.pyplot as plt

In:
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split()))
train_df['text_len'].describe()

Out:
count     100.000000
mean      872.320000
std       923.138191
min        64.000000
25%       359.500000
50%       598.000000
75%      1058.000000
max      7125.000000
Name: text_len, dtype: float64
vlen = plt.hist(train_df['text_len'], bins=200)  # visualize the length distribution
plt.xlabel('Text char count')
plt.title('Histogram of char count')
- News category distribution
- Character distribution statistics
import matplotlib.pyplot as plt
# News category distribution
train_df['label'].value_counts().plot(kind='bar')
plt.xlabel('category')
plt.title('News category count')
# Character distribution statistics
from collections import Counter
all_lines = ' '.join(list(train_df['text']))  # join all training texts, separated by spaces
word_count = Counter(all_lines.split(' '))  # split into characters and count occurrences
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
# Count in how many articles each character appears (duplicates within an article are removed first)
train_df['text_unique'] = train_df['text'].apply(lambda x: ' '.join(list(set(x.split(' ')))))
all_lines = ' '.join(list(train_df['text_unique']))
word_count = Counter(all_lines.split(' '))
word_count = sorted(word_count.items(), key=lambda d: int(d[1]), reverse=True)
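The two passes measure different things: the first counts every occurrence of a character, while the second counts how many articles contain the character at least once. A minimal sketch of the difference on made-up toy texts (not competition data):

```python
from collections import Counter

# Toy corpus, made up for illustration; each string is one "article".
texts = ["1 2 2 3", "2 3 3 3", "3 4"]

# Total occurrences of each character across all articles.
total_count = Counter(' '.join(texts).split(' '))

# Number of articles that contain each character at least once.
doc_count = Counter()
for text in texts:
    doc_count.update(set(text.split(' ')))

print(total_count.most_common(2))
# Share of articles containing the character '3'.
print(doc_count['3'] / len(texts))
```

A character that appears in nearly every article is a plausible candidate for punctuation, which is the idea behind the first exercise.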
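Chapter exercises
1. Assuming that characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article in the competition data contain on average?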
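def Mean(x):
    # Count the tokens that are assumed to be sentence punctuation
    mnum = 0
    for i in x:
        if i in ['3750', '900', '648']:
            mnum += 1
    return mnum
train_df['mean_sentence'] = train_df['text'].apply(lambda x: Mean(x.split(' ')))
train_df['mean_sentence'].mean()  # average number of sentences per article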
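The same count can be checked on made-up toy data. This sketch assumes, as the exercise states, that '3750', '900' and '648' mark sentence ends; the helper name `count_sentences` is hypothetical. Counting marks gives one sentence fewer than the true number whenever an article does not end with a mark, so the result is an approximation.

```python
import pandas as pd

# Assumed sentence-ending marks, taken from the exercise statement.
PUNCT = {'3750', '900', '648'}

def count_sentences(text):
    """Count the assumed punctuation tokens in a space-separated text."""
    return sum(token in PUNCT for token in text.split())

# Toy data, made up for illustration.
toy_df = pd.DataFrame({'text': ['1 2 3750 4 900', '5 648', '6 7 8']})
toy_df['n_sentences'] = toy_df['text'].apply(count_sentences)

print(toy_df['n_sentences'].tolist())
print(toy_df['n_sentences'].mean())
```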
2. Find the most frequent character in each news category.
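One possible approach, sketched on made-up toy data: join the texts of each label, count the tokens with `Counter`, and take the top entry. Whether the assumed punctuation characters should be excluded is a judgment call; the `exclude` parameter and the function name `top_char_per_label` are hypothetical and only illustrate the idea.

```python
from collections import Counter
import pandas as pd

def top_char_per_label(df, exclude=()):
    """Return {label: (char, count)} for the most frequent character per label."""
    result = {}
    for label, group in df.groupby('label'):
        # Count every token in all texts that share this label.
        counts = Counter(' '.join(group['text']).split())
        # Optionally drop characters such as the assumed punctuation marks.
        for ch in exclude:
            counts.pop(ch, None)
        result[label] = counts.most_common(1)[0]
    return result

# Toy data, made up for illustration.
toy_df = pd.DataFrame({
    'label': [0, 0, 1],
    'text': ['3750 3750 3750 5 5', '5 6 3750', '7 7 900 8'],
})

print(top_char_per_label(toy_df))
print(top_char_per_label(toy_df, exclude=('3750', '900', '648')))
```

On the real data the call would be `top_char_per_label(train_df)`, assuming `train_df` has the `label` and `text` columns used throughout these notes.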