Competition information
1. URL
http://www.dcjingsai.com/common/cmpt/“达观杯”文本智能处理挑战赛_竞赛信息.html
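2. Task
Build a model that predicts the category (class) of a text from its long-form body (article).
3. Data
The data consists of two CSV files:
train_set.csv: used to train the model; each row corresponds to one article. The articles have been anonymized at both the character level and the word level. There are four columns:
The first column is the article index (id); the second is the article body at the character level, i.e. space-separated character ids (article); the third is the body at the word level, i.e. space-separated word ids (word_seg); the fourth is the article's label (class).
Note: each number corresponds to one character, one word, or one punctuation mark. Character ids and word ids are numbered independently!
test_set.csv: used for testing. The format is the same as train_set.csv, but without the class column.
Note: article ids in test_set and train_set are numbered independently.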
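Because every article and word_seg field is a space-separated string of anonymized ids, a natural first step is to split each field into tokens. The sketch below shows this on a toy frame; the rows and the column names n_chars and n_words are made up for illustration and are not taken from the competition files.

```python
import pandas as pd

# Toy rows in the same four-column layout as train_set.csv.
# The token ids below are illustrative only.
toy = pd.DataFrame({
    "id": [0, 1],
    "article": ["7368 1252069 365865", "581131 165432"],
    "word_seg": ["816903 597526", "90540 816903 441039"],
    "class": [14, 3],
})

# Split on whitespace and count tokens per row.
# n_chars / n_words are hypothetical helper columns.
toy["n_chars"] = toy["article"].str.split().str.len()
toy["n_words"] = toy["word_seg"].str.split().str.len()

print(toy[["id", "n_chars", "n_words", "class"]])
```

The same two lines should work unchanged on the real train_data once it is loaded, although on roughly 100,000 long articles the split may take noticeable time and memory.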
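4. Scoring
Submissions are scored by the arithmetic mean of the per-class F1 scores (macro-averaged F1), where each class's F1 is the harmonic mean of its Precision and Recall:
F1 = (1/n) * sum_{i=1..n} 2*Pi*Ri / (Pi + Ri)
Here Pi is the Precision for class i, Ri is the Recall for class i, and n is the number of classes.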
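To make the metric concrete, here is a minimal sketch that computes the per-class Precision and Recall, averages the per-class F1 by hand, and compares the result with scikit-learn's f1_score(average="macro"). The labels y_true and y_pred are toy values, and it is an assumption that the organizers' scorer matches scikit-learn's macro average exactly.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels for three classes (illustrative only).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
labels = [1, 2, 3]

# Per-class precision P_i and recall R_i.
p = precision_score(y_true, y_pred, labels=labels, average=None)
r = recall_score(y_true, y_pred, labels=labels, average=None)

# F1_i is the harmonic mean of P_i and R_i; the score is their arithmetic mean.
per_class_f1 = [2 * pi * ri / (pi + ri) if (pi + ri) > 0 else 0.0
                for pi, ri in zip(p, r)]
manual_macro_f1 = sum(per_class_f1) / len(per_class_f1)

# scikit-learn's built-in macro average should agree with the manual value.
sklearn_macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(per_class_f1, manual_macro_f1, sklearn_macro_f1)
```

Because every class contributes equally to the average, a rare class that is predicted poorly pulls the score down as much as a frequent one.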
Day 1: A first look at the data
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Read the data
test_path=r"D:\Desktop\new_data\test_set.csv"
train_path=r"D:\Desktop\new_data\train_set.csv"
train_data=pd.read_csv(train_path)
test_data=pd.read_csv(test_path)
# Inspect the data
train_data.head(5)
   id                                            article                                           word_seg  class
0   0  7368 1252069 365865 755561 1044285 129532 1053...  816903 597526 520477 1179558 1033823 758724 63...     14
1   1  581131 165432 7368 957317 1197553 570900 33659...  90540 816903 441039 816903 569138 816903 10343...      3
2   2  7368 87936 40494 490286 856005 641588 145611 1...  816903 1012629 957974 1033823 328210 947200 65...     12
3   3  299237 760651 299237 887082 159592 556634 7489...  563568 1239563 680125 780219 782805 1033823 19...     13
4   4  7368 7368 7368 865510 7368 396966 995243 37685...  816903 816903 816903 139132 816903 312320 1103...     12
test_data.head(5)
   id                                            article                                           word_seg
0   0  7368 146447 316564 42610 55736 297797 93042 53...  816903 565958 726082 764656 335008 75094 20282...
1   1  985531 473628 1044285 1121849 206763 462208 11...  729468 520477 529032 101368 335130 520477 1113...
2   2  7368 7368 7368 7368 7368 7368 7368 7368 7368 7...  816903 816903 816903 816903 816903 816903 8169...
3   3  529819 1226459 856005 1177293 663773 272235 93...  231664 1033823 524850 330478 507199 520477 618...
4   4  42610 1252069 1077049 955883 1125260 1044285 2...  545370 379223 162767 520477 1194630 1197475 11...
# Next, look at the full dataset summary: the training set has 102277 rows, and the test set also has 102277 rows
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102277 entries, 0 to 102276
Data columns (total 4 columns):
id 102277 non-null int64
article 102277 non-null object
word_seg 102277 non-null object
class 102277 non-null int64
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102277 entries, 0 to 102276
Data columns (total 3 columns):
id 102277 non-null int64
article 102277 non-null object
word_seg 102277 non-null object
dtypes: int64(1), object(2)
memory usage: 2.3+ MB
train_data['class'].describe()
count 102277.000000
mean 10.262356
std 5.370785
min 1.000000
25% 6.000000
50% 10.000000
75% 15.000000
max 19.000000
Name: class, dtype: float64
train_data.isnull().any()
id False
article False
word_seg False
class False
dtype: bool
test_data.isnull().any()
id False
article False
word_seg False
dtype: bool
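# From the output above: neither set has missing values, and the labels run from 1 to 19 with quartiles at 6, 10 and 15, so they look roughly evenly spread.
# Note that describe() treats class as a number, so it does not show the per-class counts; those are needed to confirm that the classes are balanced.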
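A more direct check of class balance is to count the rows per label. The sketch below does this on a small stand-in series; the values in labels are invented for illustration, and on the real data one would use train_data['class'] instead.

```python
import pandas as pd

# Hypothetical stand-in for train_data['class'] (illustrative values only).
labels = pd.Series([1, 1, 1, 2, 2, 3], name="class")

# Absolute count and relative share of each class, ordered by label.
counts = labels.value_counts().sort_index()
share = labels.value_counts(normalize=True).sort_index()

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(share.round(3))
print(imbalance_ratio)
```

A ratio well above 1 would suggest that a stratified split, and a metric such as macro F1, deserve extra attention.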
# Split the training data into training and validation sets (70% / 30%)
X_train, X_valid, y_train, y_valid = train_test_split(
    train_data[['article', 'word_seg']],
    train_data['class'],
    test_size=0.3,
    random_state=2019,
)
print(X_train.shape,y_train.shape,X_valid.shape,y_valid.shape)
(71593, 2) (71593,) (30684, 2) (30684,)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 71593 entries, 96040 to 40008
Data columns (total 2 columns):
article 71593 non-null object
word_seg 71593 non-null object
dtypes: object(2)
memory usage: 1.6+ MB
X_valid.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30684 entries, 82049 to 73655
Data columns (total 2 columns):
article 30684 non-null object
word_seg 30684 non-null object
dtypes: object(2)
memory usage: 719.2+ KB
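The split above is purely random. As an optional variation that the original notebook does not use, passing stratify to train_test_split keeps the class proportions the same in both parts, which can matter when the score is a per-class average. The frame toy below is a made-up stand-in for train_data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_data: 20 rows, two classes of 10 rows each.
toy = pd.DataFrame({
    "article": [f"a{i}" for i in range(20)],
    "word_seg": [f"w{i}" for i in range(20)],
    "class": [1] * 10 + [2] * 10,
})

# stratify=... asks scikit-learn to preserve the label proportions
# in both the training and the validation part.
X_tr, X_va, y_tr, y_va = train_test_split(
    toy[["article", "word_seg"]],
    toy["class"],
    test_size=0.3,
    random_state=2019,
    stratify=toy["class"],
)

print(X_tr.shape, X_va.shape)
print(y_va.value_counts().sort_index())
```

With a random split on a dataset of this size the proportions are usually close anyway, so stratification is mainly a safeguard for the rarer classes.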