“达观杯” (DataGrand Cup) Text Intelligent Processing Challenge

Latest recommended article published on 2019-07-08 22:58:14

jassy_shan

Category: Data Mining and Algorithm Competitions. Tags: 达观杯, algorithm competitions

Article link: https://blog.csdn.net/weixin_38966454/article/details/89046445

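Competition information

1. URL
http://www.dcjingsai.com/common/cmpt/“达观杯”文本智能处理挑战赛_竞赛信息.html
2. Task
Build a model that predicts the category (class) of a text from its long-form body (article).
3. Data
The data consists of two CSV files:
train_set.csv: used to train the model; each row corresponds to one article. The articles have been anonymized at both the character level and the word level. There are four columns:
The first column is the article index (id); the second is the body at the character level, i.e. characters separated by spaces (article); the third is the body at the word level, i.e. words separated by spaces (word_seg); the fourth is the article's label (class).
Note: each number corresponds to one character, one word, or one punctuation mark. Character IDs and word IDs are numbered independently!
test_set.csv: used for testing. Same format as train_set.csv, but without the class column.
Note: article ids in test_set and train_set are numbered independently.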
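4. Scoring

The score is the arithmetic mean of the per-class F1 scores, where each class's F1 is the harmonic mean of its Precision and Recall:
$$F1_{score} = \frac{1}{n}\sum_{i=1}^{n}\frac{2 P_i R_i}{P_i + R_i}$$
Here $P_i$ is the Precision for class $i$, $R_i$ is the Recall for class $i$, and $n$ is the number of classes.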
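
To make the scoring rule concrete, here is a minimal pure-Python sketch of a macro-averaged F1. The function name `macro_f1` and the toy labels are made up for illustration, and averaging only over classes that appear in `y_true` is an assumption; the competition page does not say how edge cases such as empty classes are handled.

```python
def macro_f1(y_true, y_pred):
    """Arithmetic mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0  # assumption: a class with no hits contributes 0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, not taken from the competition data.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes a macro-averaged F1 as well and is the more convenient choice once scikit-learn is already imported.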

Day 1: A first look at the data

# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the data
test_path=r"D:\Desktop\new_data\test_set.csv"
train_path=r"D:\Desktop\new_data\train_set.csv"
train_data=pd.read_csv(train_path)
test_data=pd.read_csv(test_path)

# Describe the data

train_data.head(5)
id	article	word_seg	class
0	0	7368 1252069 365865 755561 1044285 129532 1053...	816903 597526 520477 1179558 1033823 758724 63...	14
1	1	581131 165432 7368 957317 1197553 570900 33659...	90540 816903 441039 816903 569138 816903 10343...	3
2	2	7368 87936 40494 490286 856005 641588 145611 1...	816903 1012629 957974 1033823 328210 947200 65...	12
3	3	299237 760651 299237 887082 159592 556634 7489...	563568 1239563 680125 780219 782805 1033823 19...	13
4	4	7368 7368 7368 865510 7368 396966 995243 37685...	816903 816903 816903 139132 816903 312320 1103...	12

test_data.head(5)

id	article	word_seg
0	0	7368 146447 316564 42610 55736 297797 93042 53...	816903 565958 726082 764656 335008 75094 20282...
1	1	985531 473628 1044285 1121849 206763 462208 11...	729468 520477 529032 101368 335130 520477 1113...
2	2	7368 7368 7368 7368 7368 7368 7368 7368 7368 7...	816903 816903 816903 816903 816903 816903 8169...
3	3	529819 1226459 856005 1177293 663773 272235 93...	231664 1033823 524850 330478 507199 520477 618...
4	4	42610 1252069 1077049 955883 1125260 1044285 2...	545370 379223 162767 520477 1194630 1197475 11...
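
Each `article` and `word_seg` cell shown above is a single string of space-separated IDs, so it has to be split before lengths or vocabularies can be counted. The sketch below shows one way to do that; the helper name `tokenize` and the sample strings are hypothetical and only mimic the column format.

```python
from collections import Counter


def tokenize(cell):
    """Split one space-separated cell into a list of ID tokens (kept as strings)."""
    return cell.split()


# Hypothetical cells in the same format as the `article` column.
sample_cells = [
    "7368 1252069 365865 7368",
    "7368 7368 865510",
]

token_lists = [tokenize(cell) for cell in sample_cells]
lengths = [len(tokens) for tokens in token_lists]  # tokens per article
vocab = Counter(token for tokens in token_lists for token in tokens)

print(lengths)
print(vocab.most_common(1))
```

On the real DataFrame the same split could be applied per row, for example with `train_data['article'].str.split()`. The IDs stay as strings here because they are opaque identifiers, not quantities.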

# Next, look at the full data summary: the training set has 102277 rows, and the test set also has 102277 rows

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102277 entries, 0 to 102276
Data columns (total 4 columns):
id          102277 non-null int64
article     102277 non-null object
word_seg    102277 non-null object
class       102277 non-null int64
dtypes: int64(2), object(2)
memory usage: 3.1+ MB

test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102277 entries, 0 to 102276
Data columns (total 3 columns):
id          102277 non-null int64
article     102277 non-null object
word_seg    102277 non-null object
dtypes: int64(1), object(2)
memory usage: 2.3+ MB

train_data['class'].describe()

count    102277.000000
mean         10.262356
std           5.370785
min           1.000000
25%           6.000000
50%          10.000000
75%          15.000000
max          19.000000
Name: class, dtype: float64

train_data.isnull().any()

id          False
article     False
word_seg    False
class       False
dtype: bool
test_data.isnull().any()
id          False
article     False
word_seg    False
dtype: bool
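# From the above: neither dataset has missing values, and the class labels range from 1 to 19.
# Note that describe() treats the label as a number, so its mean and quartiles do not show how many samples each class has; per-class counts are needed to judge whether the classes are balanced.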
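
Whether the classes are balanced can only be read from per-class counts. A minimal sketch of that check follows, using a hypothetical helper `class_shares` and made-up labels rather than the real `class` column.

```python
from collections import Counter


def class_shares(labels):
    """Return {class: fraction of samples}, ordered by class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Made-up labels for illustration; pass train_data['class'] for the real check.
labels = [1, 1, 1, 2, 2, 3]
shares = class_shares(labels)
for cls, share in shares.items():
    print(f"class {cls}: {share:.2%}")
```

With pandas, `train_data['class'].value_counts(normalize=True)` returns the same proportions directly.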

# Split the training data into a training set and a validation set (70% / 30%)
X_train, X_valid, y_train, y_valid = train_test_split(train_data[['article','word_seg']],train_data['class'],test_size=0.3, random_state=2019)
print(X_train.shape,y_train.shape,X_valid.shape,y_valid.shape)
(71593, 2) (71593,) (30684, 2) (30684,)

X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71593 entries, 96040 to 40008
Data columns (total 2 columns):
article     71593 non-null object
word_seg    71593 non-null object
dtypes: object(2)
memory usage: 1.6+ MB

X_valid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30684 entries, 82049 to 73655
Data columns (total 2 columns):
article     30684 non-null object
word_seg    30684 non-null object
dtypes: object(2)
memory usage: 719.2+ KB
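
The split above is purely random, so the class proportions of the two parts may drift apart slightly. `train_test_split` also accepts a `stratify` argument (for example `stratify=train_data['class']`) that keeps the proportions aligned. As an illustration of what stratification does, here is a standard-library sketch; `stratified_split` is a hypothetical helper and the labels are made up, so this is not the code used in the post.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=2019):
    """Return (train_idx, valid_idx), splitting each class at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return train_idx, valid_idx


# Made-up, imbalanced labels: 10 samples of class 1 and 20 of class 2.
labels = [1] * 10 + [2] * 20
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

Every class contributes about 30% of its samples to the validation set, so a rare class is not left out of validation by chance.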