Loading the Data
Reading the data with pandas
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
train_df = pd.read_csv(r'/content/drive/My Drive/天池比赛/train_set.csv', sep='\t')
train_df.head()
| | label | text |
|---|---|---|
| 0 | 2 | 2967 6758 339 2021 1854 3731 4109 3792 4149 15... |
| 1 | 11 | 4464 486 6352 5619 2465 4802 1452 3137 5778 54... |
| 2 | 3 | 7346 4068 5074 3747 5681 6093 1777 2226 7354 6... |
| 3 | 2 | 7159 948 4866 2109 5520 2490 211 3956 5520 549... |
| 4 | 3 | 3646 3055 3055 2490 4659 6065 3370 5814 2465 5... |
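The read_csv call here involves three kinds of arguments:
- the path of the file to read
- the delimiter sep, which separates the columns; set it to \t because the file is tab-separated
- the number of rows to read, nrows (optional; it is not passed in the call above, so the whole file is read)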
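As a minimal sketch of the row-limit argument, which pandas spells `nrows`, the snippet below reads only the first two data rows of a small tab-separated sample. The in-memory `sample` string and the name `preview_df` are made up for illustration; the real competition file is not used here.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like train_set.csv:
# a header line, then tab-separated label and text columns.
sample = (
    "label\ttext\n"
    "2\t2967 6758 339\n"
    "11\t4464 486 6352 5619\n"
    "3\t7346 4068\n"
)

# nrows=2 stops after the first two data rows (the header is not counted).
preview_df = pd.read_csv(io.StringIO(sample), sep="\t", nrows=2)
print(preview_df.shape)
```

Limiting the rows this way can be handy for a quick look at a large file before loading it in full.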
Data Analysis
Sentence length analysis
The length of each sentence can be obtained directly by counting its space-separated tokens.
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split()))
train_df.head()
| | label | text | text_len |
|---|---|---|---|
| 0 | 2 | 2967 6758 339 2021 1854 3731 4109 3792 4149 15... | 1057 |
| 1 | 11 | 4464 486 6352 5619 2465 4802 1452 3137 5778 54... | 486 |
| 2 | 3 | 7346 4068 5074 3747 5681 6093 1777 2226 7354 6... | 764 |
3 | 2 |