20200216_re data processing

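Because of limited time and working from home, I only got halfway through this job before running out of time. I also ran into a problem: after modifying the regular expressions, the output format became hard to read, so the results need to be appended to a list and printed in a loop, which makes further changes easier.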

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
raw=pd.read_excel('汇总.xlsx')
raw.head()
   txt
0  PT P
1  PN JP2019213676-A
2  TI Odor modulation agent useful for modulating...
3  AU BAO X X
4  TSUJIMOTO H
# Record detection: each "PT P" line starts a new record ("chapter")
chapnum = 0
for i in range(len(raw)):
    if raw['txt'][i] == "PT P":
        chapnum += 1
    raw.loc[i, 'chap'] = chapnum
rawgrp = raw.groupby('chap')
chapter = rawgrp.agg(sum) # with only string columns, sum concatenates the strings
chapter.head()
      txt
chap
1.0   PT PPN JP2019213676-ATI Odor modulation agent ...
2.0   PT PPN WO2019237159-A1TI Treating solid waste ...
3.0   PT PPN US2019382319-A1; EP3581551-A1TI Bacteri...
4.0   PT PPN WO2019237134-A1TI Aerobic waste compost...
5.0   PT PPN CN110550724-ATI Biological filler usefu...
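The row-by-row loop above can also be written without an explicit loop. The following is only a sketch on made-up rows, assuming (as in the post) that every record starts with a "PT P" line; the toy DataFrame stands in for the spreadsheet.

```python
import pandas as pd

# Toy stand-in for the 'txt' column read from the spreadsheet (made-up rows).
raw = pd.DataFrame({"txt": ["PT P", "PN AA1-A", "TI First title",
                            "PT P", "PN BB2-A", "TI Second title"]})

# Every "PT P" line opens a new record, so a running count of those
# lines gives the record number without a Python loop.
raw["chap"] = (raw["txt"] == "PT P").cumsum()

# Concatenate the lines of each record into one string.
chapter = raw.groupby("chap")["txt"].agg("".join).to_frame()
print(chapter)
```

One difference from the loop version: the record number here stays an integer (1, 2, ...) rather than becoming a float (1.0, 2.0, ...).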
# Each helper returns the list of all matches of one field in a record's text
def pn(x):
    # patent number field: text between "PN " and "TI"
    pattern = re.compile('PN (.*?)TI', re.S)
    return re.findall(pattern, x)
chapter['pn'] = chapter['txt'].apply(pn)
def UT(x):
    # UT field: text between "UT " and "ER"
    pattern = re.compile('UT (.*?)ER', re.S)
    return re.findall(pattern, x)
chapter['UT'] = chapter['txt'].apply(UT)
def TI(x):
    # title field: text between "TI " and "AU"
    pattern = re.compile('TI (.*?)AU', re.S)
    return re.findall(pattern, x)
chapter['TI'] = chapter['txt'].apply(TI)
# chapter['TI']=chapter['txt'].str.extract("(TI (.*?)AU)")
chapter.head()
      txt                                                 pn                               UT                  TI
chap
1.0   PT PPN JP2019213676-ATI Odor modulation agent ...   [JP2019213676-A]                 [DIIDW:2019A6103W]  [Odor modulation agent useful for modulating s...
2.0   PT PPN WO2019237159-A1TI Treating solid waste ...   [WO2019237159-A1]                [DIIDW:2019A6029L]  [Treating solid waste extracted from liquid wa...
3.0   PT PPN US2019382319-A1; EP3581551-A1TI Bacteri...   [US2019382319-A1; EP3581551-A1]  [DIIDW:2019A6044K]  [Bacterially decomposing organic waste materia...
4.0   PT PPN WO2019237134-A1TI Aerobic waste compost...   [WO2019237134-A1]                [DIIDW:2019A4913N]  [Aerobic waste composting chamber used for sol...
5.0   PT PPN CN110550724-ATI Biological filler usefu...   [CN110550724-A]                  [DIIDW:2019A5191S]  [Biological filler useful for treating livesto...
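The bracketed values appear because re.findall always returns a list. One possible way to get plain strings instead is re.search, which returns only the first match. The helper below is a hypothetical sketch (extract_field is not part of the original code) and the record string is made up.

```python
import re

def extract_field(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if absent."""
    # re.escape guards against tags that contain regex metacharacters.
    pattern = re.escape(start_tag) + r" (.*?)" + re.escape(end_tag)
    match = re.search(pattern, text, flags=re.S)
    return match.group(1) if match else None

record = "PT PPN AA1-ATI First titleAU DOE J"  # made-up record
print(extract_field(record, "PN", "TI"))  # patent number as a plain string
print(extract_field(record, "TI", "AU"))  # title as a plain string
print(extract_field(record, "UT", "ER"))  # missing field gives None
```

Like the original patterns, this stops at the first occurrence of the end tag, so a field whose own text happens to contain that two-letter tag would be cut short.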
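# chapter['pn']=chapter['pn'].str.extract(r'([^\[\]]+)')
# Note: 'pn' holds lists (the re.findall results), not strings, so .str.replace
# returns NaN for every row, which is why the pn column is NaN in the output that follows
chapter['pn']=chapter['pn'].str.replace('[','')
# chapter['pn']=chapter['pn'].str.replace(']','')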
chapter.head()
      txt                                                 pn   UT                  TI
chap
1.0   PT PPN JP2019213676-ATI Odor modulation agent ...   NaN  [DIIDW:2019A6103W]  [Odor modulation agent useful for modulating s...
2.0   PT PPN WO2019237159-A1TI Treating solid waste ...   NaN  [DIIDW:2019A6029L]  [Treating solid waste extracted from liquid wa...
3.0   PT PPN US2019382319-A1; EP3581551-A1TI Bacteri...   NaN  [DIIDW:2019A6044K]  [Bacterially decomposing organic waste materia...
4.0   PT PPN WO2019237134-A1TI Aerobic waste compost...   NaN  [DIIDW:2019A4913N]  [Aerobic waste composting chamber used for sol...
5.0   PT PPN CN110550724-ATI Biological filler usefu...   NaN  [DIIDW:2019A5191S]  [Biological filler useful for treating livesto...
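The NaN values in the pn column come from calling a string method on a column whose cells are lists. A small sketch on made-up data shows the behavior, together with one possible alternative: taking the first list element with .str[0].

```python
import pandas as pd

# Made-up column shaped like re.findall output: one list per row.
pn = pd.Series([["AA1-A"], ["BB2-A1; CC3-A1"], []])

# String methods such as replace() expect str elements; lists come back as NaN.
print(pn.str.replace("[", "", regex=False).isna().all())

# Element access via .str[0] does work on lists; an empty list gives NaN.
first = pn.str[0]
print(first.tolist())
```

With the first element pulled out this way, the column holds ordinary strings and the later bracket-stripping step would not be needed.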
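def UT(x):
    # find every run of four digits, then keep the text before the first comma
    pattern = re.compile(r'\d\d\d\d', re.S)
    a = re.findall(pattern, str(x))
    return str(a).split(",")[0]
# The call is repeated: the first pass can leave a truncated "['2019'",
# the second pass re-extracts the digits and yields "['2019']"
chapter['UT']=chapter.apply(lambda x: UT(x['UT']), axis=1)
chapter['UT']=chapter.apply(lambda x: UT(x['UT']), axis=1)
# chapter['UT']=chapter['UT']
# def UT(x):
#     return str(x).split(";")[0]
# chapter['pn']=chapter.apply(lambda x: UT(x['pn']), axis=1)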
chapter=chapter.drop(['txt'],axis=1)
chapter.head()
      pn   UT        TI
chap
1.0   NaN  ['2019']  [Odor modulation agent useful for modulating s...
2.0   NaN  ['2019']  [Treating solid waste extracted from liquid wa...
3.0   NaN  ['2019']  [Bacterially decomposing organic waste materia...
4.0   NaN  ['2019']  [Aerobic waste composting chamber used for sol...
5.0   NaN  ['2019']  [Biological filler useful for treating livesto...
# keep the first run of characters that are not brackets or quotes
chapter['pn']=chapter['pn'].str.extract(r"([^\[\]']+)")
chapter['UT']=chapter['UT'].str.extract(r"([^\[\]']+)")
chapter.head()
      pn   UT    TI
chap
1.0   NaN  2019  [Odor modulation agent useful for modulating s...
2.0   NaN  2019  [Treating solid waste extracted from liquid wa...
3.0   NaN  2019  [Bacterially decomposing organic waste materia...
4.0   NaN  2019  [Aerobic waste composting chamber used for sol...
5.0   NaN  2019  [Biological filler useful for treating livesto...
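The year can probably be pulled out in one step with Series.str.extract. This sketch assumes the UT values are plain strings shaped like the ones shown earlier ("DIIDW:" followed by a four-digit year); the sample values are made up apart from the first.

```python
import pandas as pd

# Accession strings shaped like "DIIDW:<year><suffix>"; None marks a missing value.
ut = pd.Series(["DIIDW:2019A6103W", "DIIDW:1984B0001X", None])

# Capture the four digits right after the colon; expand=False returns a Series.
year = ut.str.extract(r":(\d{4})", expand=False)
print(year.tolist())
```

Anchoring on the colon avoids picking up other four-digit runs further along the string, such as the "6103" in the first value.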

Extract the number of patents per year from 1963 to 2019 and list them in an Excel table.

chapter.dropna(inplace=True)
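# '5575' is apparently a spurious four-digit match, not a year; after the drop and
# reassignment those rows become NaN, so groupby leaves them out
chapter['UT']=chapter['UT'].drop(chapter[chapter['UT']=='5575'].index)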
request1=chapter.groupby(by='UT')['pn'].count()
request1.head()
UT
1981    1
1984    1
1990    3
1991    1
1992    2
Name: pn, dtype: int64
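The grouped counts only list years that actually occur. To get a row for every year from 1963 to 2019, one option is to reindex the counts over the full range. This is a sketch on made-up years, and the output file name is hypothetical.

```python
import pandas as pd

# Made-up year strings, one per patent record.
years = pd.Series(["1981", "1990", "1990", "2019", "1990"])

# Count per year, then reindex over the full span so empty years show 0.
counts = years.astype(int).value_counts().sort_index()
full = counts.reindex(range(1963, 2020), fill_value=0)
full.index.name = "year"

print(full.loc[[1963, 1981, 1990, 2019]])
# full.to_excel("patents_per_year.xlsx")  # hypothetical file name; needs an Excel writer such as openpyxl
```

Converting the years to integers first keeps the index sortable and lets range(1963, 2020) line up with it.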
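from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['FangSong'] # set the default font (needed to render Chinese text)
mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from showing as a box in saved figures
# set up the canvas
fig, ax = plt.subplots(1,1,dpi=100)
# bar chart of the patent count per year (the title reads "count distribution")
request1.plot(kind='bar',title='数量分布',ax=ax)
# plt.legend(['数量'])
plt.show()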


How many patents each country, and each of the top ten organizations, applied for.

# the first two characters of the patent number are the issuing-office code (CN, KR, WO, ...)
chapter['organization']=chapter['pn'].str[:2]
chapter['organization'].value_counts()
CN    275
KR     22
WO     15
ID     11
JP     11
EP      7
IN      6
DE      5
FR      4
US      4
RU      4
BR      3
NL      2
GB      1
PL      1
PH      1
TW      1
Name: organization, dtype: int64
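Taking the first two characters counts only the first publication of each record, although some records list several numbers separated by ";". A possible refinement, sketched here on patent numbers taken from the sample output above, is to split and explode the field before counting.

```python
import pandas as pd

# Patent-number field; one record lists two publications.
pn = pd.Series(["JP2019213676-A", "WO2019237159-A1",
                "US2019382319-A1; EP3581551-A1",
                "WO2019237134-A1", "CN110550724-A"])

# Split on ';', give each publication its own row, trim spaces.
numbers = pn.str.split(";").explode().str.strip()

# The first two characters are treated as the issuing-office code.
codes = numbers.str[:2].value_counts()
print(codes.head(10))  # head(10) keeps only the ten most frequent codes
```

Note that WO and EP are office codes (WIPO and the European Patent Office) rather than countries, so they may need separate handling in a per-country table.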
# import pandas as pd
# f = pd.ExcelFile('汇总.xlsx')
# f.sheet_names  # get the worksheet names
# data = pd.DataFrame()
# for i in f.sheet_names:
#     d = pd.read_excel('汇总.xlsx', sheet_name=i)
#     data = pd.concat([data, d])
