Article link: https://blog.csdn.net/m0_37671786/article/details/107521260

My NLP Practice Journey 02

The previous post introduced the data and the models; now we formally start on the data analysis.

Data Loading

Enough talk, here is the code:

import pandas as pd
import numpy as np

path = './data/'
train=pd.read_csv(path+'train_set.csv', sep='\t')
test=pd.read_csv(path+'test_a.csv', sep='\t')

train.head()

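This uses the pandas function read_csv(). sep='\t' sets the delimiter to a tab character (not a space); change it as your data requires. One thing to note: when I use read_csv() to read data that was previously saved with to_csv(), I often get an extra first column holding the saved index (it shows up as "Unnamed: 0"). Setting index_col=0 when reading avoids this.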

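The pandas function for writing a CSV file is df.to_csv(), where df is the DataFrame to save (here, the one we read from the CSV). The code is as follows: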

train.to_csv('./data/train.csv')
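To make the index-column issue above concrete, here is a minimal sketch that round-trips a tiny tab-separated table through an in-memory buffer. The sample rows and the variable names (raw, naive, fixed, no_index) are made up for illustration and are not taken from the competition data.

```python
import io
import pandas as pd

# Two made-up rows in the same tab-separated layout as train_set.csv
raw = "label\ttext\n2\t57 44 66\n11\t4464 486 6352\n"
df = pd.read_csv(io.StringIO(raw), sep="\t")

# to_csv() writes the row index as an unnamed first column by default
buf = io.StringIO()
df.to_csv(buf)

buf.seek(0)
naive = pd.read_csv(buf)               # the index comes back as "Unnamed: 0"
buf.seek(0)
fixed = pd.read_csv(buf, index_col=0)  # the first column is used as the index

# Alternative: do not write the index in the first place
clean = io.StringIO()
df.to_csv(clean, index=False)
clean.seek(0)
no_index = pd.read_csv(clean)

print(list(naive.columns))
print(list(fixed.columns))
print(list(no_index.columns))
```

Either index_col=0 on the read side or index=False on the write side gives back the original two columns; which one is more convenient depends on whether the index carries information you want to keep.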

In addition, for large files you can store the data in another format to speed up loading, for example HDF5 (.h5):

# Requires the optional PyTables package ("tables"); key is passed by keyword
train.to_hdf("train.h5", key="train", format="table", mode="w")
train=pd.read_hdf("train.h5", key="train")

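This speeds up loading, which I have experienced first-hand: with files of around 7 to 8 GB, reading the data takes noticeably longer, so a faster format really helps.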

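Reading a large CSV file also consumes a lot of memory. Because of my machine's limited configuration I often ran out of memory. Later I looked at code written by a more experienced practitioner and found that changing the column data types can reduce memory consumption. The code is as follows: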

def reduce_memory(data):
    start_memory = data.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_memory," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    
    for col in data.columns:
        if ('int' in data[col].dtype.name) or ('float' in data[col].dtype.name):  # Exclude strings
            try:
                # Print current column type
                print("******************************")
                print("Column: ",col)
                print("dtype before: ",data[col].dtype)

                # make variables for Int, max and min
                IsInt = False
                value_max = data[col].max()
                value_min = data[col].min()

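                # Integer does not support NA, therefore, NA needs to be filled
                if not np.isfinite(data[col]).all(): 
                    NAlist.append(col)
                    # Assign back instead of a chained inplace fill, and keep
                    # value_min in sync with the sentinel that was just inserted
                    value_min = value_min - 1
                    data[col] = data[col].fillna(value_min)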

                # test if column can be converted to an integer
                asint = data[col].fillna(0).astype(np.int64)
                # Sum absolute differences so positive and negative
                # fractional parts cannot cancel each other out
                result = (data[col] - asint).abs().sum()
                if result < 0.01:
                    IsInt = True


                # Make Integer/unsigned Integer datatypes
                if IsInt:
                    if value_min >= 0:
                        if value_max < 255:
                            data[col] = data[col].astype(np.uint8)
                        elif value_max < 65535:
                            data[col] = data[col].astype(np.uint16)
                        elif value_max < 4294967295:
                            data[col] = data[col].astype(np.uint32)
                        else:
                            data[col] = data[col].astype(np.uint64)
                    else:
                        if value_min > np.iinfo(np.int8).min and value_max < np.iinfo(np.int8).max:
                            data[col] = data[col].astype(np.int8)
                        elif value_min > np.iinfo(np.int16).min and value_max < np.iinfo(np.int16).max:
                            data[col] = data[col].astype(np.int16)
                        elif value_min > np.iinfo(np.int32).min and value_max < np.iinfo(np.int32).max:
                            data[col] = data[col].astype(np.int32)
                        elif value_min > np.iinfo(np.int64).min and value_max < np.iinfo(np.int64).max:
                            data[col] = data[col].astype(np.int64)    

                # Make float datatypes 32 bit
                else:
                    data[col] = data[col].astype(np.float32)

                # Print new column type
                print("dtype after: ",data[col].dtype)
                print("******************************")
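            except Exception as err:
                # Catch only ordinary errors and report which column failed and why
                print("dtype after: Failed for column", col, "-", err)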
        else:
            print("dtype remain: ",data[col].dtype)
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    end_memory = data.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",end_memory," MB")
    print("This is ",100*end_memory/start_memory,"% of the initial size")
    print("Missing Value list", NAlist)
    return data
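As a point of comparison, pandas also ships a built-in way to shrink numeric columns: pd.to_numeric with the downcast argument. The sketch below is not the function above, only a shorter alternative under the assumption that the columns contain no missing values. The helper name downcast_numeric and the demo DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Non-negative integers can use unsigned types, others signed ones
            kind = "unsigned" if out[col].min() >= 0 else "integer"
            out[col] = pd.to_numeric(out[col], downcast=kind)
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up data: a small label column, a signed column, a float column, a text column
demo = pd.DataFrame({
    "label": np.arange(1000) % 14,
    "offset": np.arange(1000) - 500,
    "score": np.linspace(0.0, 1.0, 1000),
    "text": ["1 2 3"] * 1000,
})
small = downcast_numeric(demo)

before_mb = demo.memory_usage(deep=True).sum() / 1024**2
after_mb = small.memory_usage(deep=True).sum() / 1024**2
print(small.dtypes)
print("before:", before_mb, "MB, after:", after_mb, "MB")
```

Note that neither approach shrinks the text column, which is where most of the memory goes in this dataset, so the savings on the competition data itself are likely to be modest.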

Data Analysis

Number of training samples: 200000
Number of test samples: 50000

There are 14 label classes. As the table shows, the higher the class index, the fewer training samples that class has.

train.groupby('label').count()/len(train)

(Figure: output of the code above)
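The class shares computed with train.groupby('label').count()/len(train) can also be obtained with value_counts(normalize=True), which returns a single Series and is easy to turn into one imbalance number. A small sketch on made-up labels (the labels Series is illustrative, not the competition data):

```python
import pandas as pd

# Ten made-up labels with decreasing class frequency
labels = pd.Series([0, 0, 0, 0, 1, 1, 1, 2, 2, 3], name="label")

# Fraction of samples per class, ordered by class index
share = labels.value_counts(normalize=True).sort_index()

# Ratio between the largest and the smallest class
imbalance = share.max() / share.min()

print(share)
print("largest / smallest class:", imbalance)
```

On the real data the same two lines would be applied to train['label']; a large ratio is a hint that class-weighted losses or stratified splits may be worth considering.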
Text length statistics:

train['count']=train['text'].apply(lambda x:len(x.split(' ')))
train['count'].describe()

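(Figure: result of the code above)
With describe() we can see the maximum, minimum, mean and other statistics of the text length. The mean length is about 907 characters (space-separated tokens).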
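For readers who want to see exactly what describe() returns, here is a tiny sketch on three made-up texts. It also computes a high quantile, which is one possible (assumed, not from the original post) way to pick a truncation length for a model later on.

```python
import pandas as pd

# Three made-up documents of 3, 2 and 5 tokens
toy = pd.DataFrame({"text": ["1 2 3", "4 5", "6 7 8 9 10"]})

# Number of space-separated tokens per document
toy["count"] = toy["text"].str.split().str.len()

stats = toy["count"].describe()     # count, mean, std, min, quartiles, max
p95 = toy["count"].quantile(0.95)   # length that covers 95% of the documents

print(stats)
print("95th percentile length:", p95)
```

The mean alone can be misleading when a few very long articles dominate, so looking at the quartiles or a high percentile next to it is usually worthwhile.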

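import matplotlib.pyplot as plt

# train_count is not defined in the original post; it is assumed here to hold
# the per-label max, min and mean of the text lengths computed above
train_count = train.groupby('label')['count'].agg(['max', 'min', 'mean']).reset_index()

figure = plt.figure()
ax1=figure.add_subplot(3,1,1)
ax1.plot(train_count['label'],train_count['max'])
ax2=figure.add_subplot(3,1,2)
ax2.plot(train_count['label'],train_count['min'])
ax3=figure.add_subplot(3,1,3)
ax3.plot(train_count['label'],train_count['mean'])
plt.show()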

(Figure: statistics for each class)
The per-label analysis does not seem to be of much use...

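1. Assuming that characters 3750, 900 and 648 are sentence punctuation marks, how many sentences does each news article in the competition data contain on average?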

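import re
# \b word boundaries keep the IDs from matching inside longer tokens such as 13750 or 6480
train['count']=train['text'].apply(lambda x:len(re.split(r'\b(?:3750|900|648)\b',x)))
print(train['count'].mean())
# Output reported in the original post, obtained with the unanchored pattern '3750|900|648': 80.80237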
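One caveat when splitting on token IDs with a regular expression: an unanchored alternation such as '3750|900|648' also matches inside longer tokens (for example 13750 or 6480), which inflates the sentence count. The sketch below shows the difference on a made-up line of text; the sample string and variable names are illustrative only.

```python
import re

# Made-up document: only the standalone tokens 900 and 3750 are "punctuation"
text = "13750 22 900 45 6480 7 3750"

# Unanchored pattern: also splits inside 13750 and 6480
loose_pieces = re.split("3750|900|648", text)

# Word boundaries restrict matches to whole tokens
strict_pieces = re.split(r"\b(?:3750|900|648)\b", text)

# Token-based count as a cross-check that avoids regular expressions entirely
punct = {"3750", "900", "648"}
n_punct_tokens = sum(tok in punct for tok in text.split())

print(len(loose_pieces), len(strict_pieces), n_punct_tokens)
```

Both split-based counts include a trailing empty piece when the text ends with a punctuation token, so depending on the definition of a sentence you may prefer the plain token count.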

2. Find the most frequent character in each news class

from collections import Counter
train=pd.read_csv('./data/train_set.csv',sep='\t')
for i in range(14):
    tmp=train[train['label']==i]['text']
    word_count = Counter(" ".join(tmp.values.tolist()).split())
    print(i,word_count.most_common(1)[0])
# Output:
# 0 ('3750', 1267331)
# 1 ('3750', 1200686)
# 2 ('3750', 1458331)
# 3 ('3750', 774668)
# 4 ('3750', 360839)
# 5 ('3750', 715740)
# 6 ('3750', 469540)
# 7 ('3750', 428638)
# 8 ('3750', 242367)
# 9 ('3750', 178783)
# 10 ('3750', 180259)
# 11 ('3750', 83834)
# 12 ('3750', 87412)
# 13 ('3750', 33796)
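Since 3750 is the most frequent character in every class, the ranking above says little about what distinguishes the classes. A possible follow-up, sketched here on made-up data, is to skip the characters assumed to be punctuation in question 1 before counting. The toy DataFrame, the punct set and the top_token dictionary are all hypothetical names for illustration.

```python
from collections import Counter

import pandas as pd

# Made-up training rows in the same label/text layout
toy = pd.DataFrame({
    "label": [0, 0, 1],
    "text": ["3750 5 5 900", "5 3750 3750 7", "8 8 3750 648 8"],
})

# Characters assumed to be punctuation (the assumption from question 1)
punct = {"3750", "900", "648"}

top_token = {}
for label, texts in toy.groupby("label")["text"]:
    counts = Counter(
        tok for line in texts for tok in line.split() if tok not in punct
    )
    top_token[label] = counts.most_common(1)[0]

print(top_token)
```

Looking at the top few characters per class (most_common(5), say) rather than only the first one would probably be more informative on the real data.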