Data Analysis Assignment 3
I. Read the data and take a quick look
import json
import pandas as pd

data = []
# Advantages of the with statement: 1. the file handle is closed automatically;
# 2. the file is closed even if an exception is raised while reading.
with open('arxiv-metadata-oai-snapshot.json', 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'abstract': d['abstract'], 'categories': d['categories'], 'comments': d['comments']}
        data.append(d)
data = pd.DataFrame(data)
data.shape
(1796911, 3)
data.head()
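Before running the loop over the full 1.7-million-line snapshot, the same pattern can be checked on a couple of synthetic records fed through io.StringIO (the field names below match the real file, but the values are invented for illustration):

```python
import io
import json

import pandas as pd

# Two made-up JSON lines standing in for the arXiv snapshot file.
raw = io.StringIO(
    '{"abstract": "A result.", "categories": "math.AG cs.LG", "comments": "10 pages"}\n'
    '{"abstract": "Another.", "categories": "hep-th", "comments": null}\n'
)

data = []
for idx, line in enumerate(raw):
    d = json.loads(line)
    data.append({'abstract': d['abstract'], 'categories': d['categories'], 'comments': d['comments']})

df = pd.DataFrame(data)
print(df.shape)  # (2, 3)
```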
II. Compute statistics on the papers' page counts
df_pages = data['comments'].astype('string').str.extract(r'(\d+) pages')
df_pages
Convert the comments column to the string dtype, then call the extract method of its .str accessor to pull out the page count. Note that the return value is a DataFrame, not a Series.
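This return type is easy to verify on a toy Series (the comment strings here are invented): with a single capture group, str.extract still returns a one-column DataFrame whose column is labeled 0.

```python
import pandas as pd

s = pd.Series(['10 pages, 3 figures', 'no page info', '5 pages'], dtype='string')
out = s.str.extract(r'(\d+) pages')  # one capture group -> DataFrame with column 0

print(type(out).__name__)  # DataFrame
print(out[0].tolist())
```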
After dropping the missing values, look at the summary statistics:
df_pages.dropna(how='all')[0].astype('int').describe().astype('int')
count 1089180
mean 17
std 22
min 1
25% 8
50% 13
75% 22
max 11232
Name: 0, dtype: int32
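The same dropna/astype/describe chain can be traced on a tiny hand-made column (values invented); note that the final astype('int') truncates the float statistics:

```python
import pandas as pd

pages = pd.DataFrame({0: ['10', None, '5', '13']})  # extracted page counts with one miss
stats = pages.dropna(how='all')[0].astype('int').describe().astype('int')

print(stats['count'])  # 3
print(stats['mean'])   # 9  (28 / 3, truncated)
print(stats['min'])    # 5
print(stats['max'])    # 13
```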
III. Compute the mean page count by category
1. Preprocessing the pages column
First insert the pages column prepared above into the DataFrame:
data['pages'] = data['comments'].astype('string').str.extract(r'(\d+) pages')
data.head()
Then drop the rows with missing pages and convert the pages column to the int dtype for the later steps (taking a .copy() so that the assignments below do not trigger a SettingWithCopyWarning):
df_demo = data.dropna(how='all', subset=['pages']).copy()
df_demo['pages'] = df_demo['pages'].astype('int')
df_demo
2. Preprocessing the categories column
def myfunc(x):
    # Keep only the primary archive: first space-separated tag, then the part before the dot.
    return x.split(' ')[0].split('.')[0]

df_demo['categories'] = df_demo['categories'].apply(myfunc)
df_demo
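A quick sanity check of myfunc on hand-picked category strings (when a paper carries several tags, only the first archive prefix survives):

```python
def myfunc(x):
    # First space-separated tag, then the archive name before the first dot.
    return x.split(' ')[0].split('.')[0]

print(myfunc('math.AG cs.LG'))  # math
print(myfunc('hep-th'))         # hep-th
print(myfunc('astro-ph.CO'))    # astro-ph
```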
3. Mean page count per category, with a visualization
s_cat_pages = df_demo.groupby(['categories'])['pages'].mean()
s_cat_pages
categories
acc-phys 14.634146
adap-org 15.296137
alg-geom 24.200000
ao-sci 16.125000
astro-ph 15.822272
atom-ph 13.015625
bayes-an 28.222222
chao-dyn 16.471174
chem-ph 18.100000
cmp-lg 16.546961
comp-gas 20.381818
cond-mat 12.083426
cs 16.020549
dg-ga 23.909091
econ 31.352399
eess 12.221323
funct-an 23.888489
gr-qc 17.420338
hep-ex 15.991570
hep-lat 14.285822
hep-ph 17.576237
hep-th 23.547554
math 24.435306
math-ph 25.385919
mtrl-th 12.829787
nlin 16.107926
nucl-ex 12.938907
nucl-th 14.746383
patt-sol 14.975342
physics 15.129162
plasm-ph 14.652174
q-alg 20.921588
q-bio 19.522364
q-fin 22.188853
quant-ph 13.606077
solv-int 17.011445
stat 24.078195
supr-con 11.981481
Name: pages, dtype: float64
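The groupby/mean pattern itself can be illustrated on a tiny invented frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'categories': ['math', 'math', 'cs'],
    'pages': [10, 30, 16],
})

# One mean page count per category, as a Series indexed by category.
means = toy.groupby('categories')['pages'].mean()
print(means['math'])  # 20.0
print(means['cs'])    # 16.0
```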
Visualization:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
s_cat_pages.plot(kind='bar')
plt.show()