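Topic: paper category classification
Build a model on existing data to classify new papers by category
Use the paper titles
Perform the category classification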
import seaborn as sns
from bs4 import BeautifulSoup
import re
import json
import pandas as pd
import matplotlib.pyplot as plt
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
                                 'report-no', 'categories', 'license', 'abstract', 'versions',
                                 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file (one JSON record per line) into a DataFrame.
    path: path to the file
    columns: columns to keep
    count: number of lines to read (None reads the whole file)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            # Stop once `count` lines have been read; never true when count is None
            if idx == count:
                break
            d = json.loads(line)
            # Keep only the requested columns
            d = {col: d[col] for col in columns}
            data.append(d)
    data = pd.DataFrame(data)
    return data
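The reader above treats the file as JSON Lines: each line is parsed on its own, only the requested keys are kept, and reading stops after `count` lines. A minimal standard-library sketch of the same idea, run on three made-up records rather than the real snapshot, might look like this. The helper name `read_json_lines` and the sample records are assumptions for illustration only; the helper returns a list of dicts, which is what the notebook's function passes to `pd.DataFrame`.

```python
import io
import json
from itertools import islice


def read_json_lines(handle, columns, count=None):
    """Hypothetical helper: parse at most `count` JSON records, keeping only `columns`."""
    records = []
    # islice(handle, None) yields every line, so count=None means "read everything"
    for line in islice(handle, count):
        record = json.loads(line)
        records.append({col: record[col] for col in columns})
    return records


# Three made-up records standing in for the real metadata file
sample = io.StringIO(
    '{"id": "a1", "title": "First", "categories": "cs.CV", "doi": null}\n'
    '{"id": "a2", "title": "Second", "categories": "math.CO cs.CG", "doi": null}\n'
    '{"id": "a3", "title": "Third", "categories": "hep-ph", "doi": null}\n'
)

rows = read_json_lines(sample, ["id", "categories"], count=2)
print(rows)
```

Reading line by line keeps memory use proportional to the number of rows kept, not to the size of the file, which matters when only the first 200000 records are needed.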
data = readArxivFile('arxiv-metadata-oai-snapshot.json',
                     ['id', 'title', 'categories', 'abstract'],
                     200000)
data.head(2)
| | id | title | categories | abstract |
|---|---|---|---|---|
| 0 | 0704.0001 | Calculation of prompt diphoton production cros... | hep-ph | A fully differential calculation in perturba... |
| 1 | 0704.0002 | Sparsity-certifying Graph Decompositions | math.CO cs.CG | We describe a new algorithm, the $(k,\ell)$-... |
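The second row shows that `categories` can hold several space-separated labels (`math.CO cs.CG`), so a classifier needs a rule for turning that string into targets. One possible approach, sketched here as an assumption and not as the notebook's own method, is to split the string into individual labels and optionally reduce each label to its top-level group. Both helper names are hypothetical.

```python
def split_categories(categories):
    """Hypothetical helper: split the space-separated category string into a list."""
    return categories.split()


def top_level_group(category):
    """Hypothetical helper: keep the part before the first dot, e.g. 'math.CO' -> 'math'."""
    return category.split(".")[0]


# The two `categories` values shown in the output above
for raw in ["hep-ph", "math.CO cs.CG"]:
    labels = split_categories(raw)
    groups = [top_level_group(label) for label in labels]
    print(raw, "->", labels, groups)
```

Labels without a dot, such as `hep-ph`, are left unchanged by `top_level_group`.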
1. Data processing
data1 = data  # binds a second name to the same DataFrame; use data.copy() for an independent copy