Data Wrangling with Python

Latest recommended article published on 2020-12-16 23:44:04

lynn_Dai

Tags: python, json, regular expressions, strings

Article link: https://blog.csdn.net/lynn_dai/article/details/105595507

1. Merging datasets

from pandas import DataFrame,Series
import pandas as pd
import numpy as np
df1 = DataFrame({'key':['b','b','a','c','a','a','b'],
                'data1':range(7)})
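df2 = DataFrame({'key':['a','b','d'],
                'data2':range(3)})

A many-to-one merge: df1 has several rows per key, while each key appears only once in df2. Specify the join column explicitly with on: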

pd.merge(df1,df2,on='key')

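By default, merge performs an inner join, which keeps only the intersection of the keys. An outer join (how='outer') takes the union of the keys.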
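
The difference between the join types is easiest to see on small frames. The following is a minimal sketch; the frames `left` and `right` and their values are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical small frames whose key sets only partly overlap.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [0, 1, 2]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': [10, 11, 12]})

# Inner join (the default): only keys present in both frames.
inner = pd.merge(left, right, on='key')
# Outer join: the union of the keys; missing cells become NaN.
outer = pd.merge(left, right, on='key', how='outer')
# Left join: every key from the left frame.
left_join = pd.merge(left, right, on='key', how='left')

print(sorted(inner['key']))
print(sorted(outer['key']))
print(sorted(left_join['key']))
```

In the outer result, the row for 'c' has no data2 value and the row for 'd' has no data1 value, so both cells are NaN.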

df3 = DataFrame({'lkey':['b','b','a','c','a','a','b'],
                'data1':range(7)})
df4 = DataFrame({'rkey':['a','b','d'],
                'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')

2. Concatenating along an axis

arr=np.arange(12).reshape((3,4))

Concatenate column-wise (axis=1):

np.concatenate([arr,arr],axis=1)
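
To make the effect of the axis argument concrete, this short sketch compares the resulting shapes for both axes. The variable names `rows` and `cols` are illustrative only.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 stacks the arrays vertically: more rows, same number of columns.
rows = np.concatenate([arr, arr], axis=0)
# axis=1 places them side by side: same number of rows, more columns.
cols = np.concatenate([arr, arr], axis=1)

print(rows.shape)
print(cols.shape)
```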

Concatenating three Series whose indexes do not overlap:

s1 = Series([0,1],index=['a','b'])
s2 = Series([2,3],index=['c','d'])
s3 = Series([4,5],index=['e','f'])
pd.concat([s1,s2,s3])
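
pd.concat also accepts axis and keys arguments. The sketch below reuses the three Series from above; the key labels 'one', 'two', 'three' are arbitrary names chosen for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
s3 = pd.Series([4, 5], index=['e', 'f'])

# Default axis=0: the values are glued together into one longer Series.
stacked = pd.concat([s1, s2, s3])
# axis=1: the result is a DataFrame with one column per input Series;
# because the indexes do not overlap, most cells are NaN.
table = pd.concat([s1, s2, s3], axis=1)
# keys adds an outer index level that records where each piece came from.
labeled = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

print(len(stacked))
print(table.shape)
print(labeled.loc['two'].tolist())
```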

Combining data with overlapping indexes

a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b = Series(np.arange(len(a),dtype=np.float64),index=['f','e','d','c','b','a'])

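The np.where call below expresses b if pd.isnull(a) else a, evaluated element by element: wherever a is missing, the value from b is used. Series.combine_first performs the same kind of patching with index alignment, filling the calling object's missing data with data from the object passed as the argument.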

np.where(pd.isnull(a),b,a)
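
As a rough comparison of the two approaches, the sketch below runs both on the same data. The names `patched_array` and `patched_series` are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])

# np.where works by position and returns a plain NumPy array.
patched_array = np.where(pd.isnull(a), b, a)

# combine_first aligns on index labels and returns a Series.
patched_series = a.combine_first(b)

print(patched_array)
print(patched_series['f'], patched_series['e'])
```

Because np.where ignores the index, it is only safe when both Series are already in the same order, as they are here.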

3. Removing duplicate rows

data = DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
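# Boolean Series marking each row that duplicates an earlier row
data.duplicated()
# Return a DataFrame with the duplicate rows removed
data.drop_duplicates()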
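
Both methods consider all columns by default. A minimal sketch of the subset and keep parameters follows; the extra column `v1` is hypothetical and added only so that the rows differ.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
# Hypothetical extra column, so that every full row is unique.
data['v1'] = range(7)

# Deduplicate on k1 only: the first row seen for each k1 value is kept.
by_k1 = data.drop_duplicates(subset=['k1'])
# Deduplicate on (k1, k2), keeping the last occurrence instead of the first.
last = data.drop_duplicates(subset=['k1', 'k2'], keep='last')

print(by_k1['v1'].tolist())
print(last['v1'].tolist())
```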

4. Transforming data with a function or mapping

data = DataFrame({'food':['bacon','pulled pork','bacon','Pastrami',
                          'corned beef','Bacon','pastrami','honey ham',
                          'nova lox'],
                  'ounces':[4,3,12,6,7.5,8,3,5,6]})

Add a column indicating the type of animal each meat comes from:

meat_to_animal = {
    'bacon':'pig',
    'pulled pork':'pig',
    'pastrami':'cow',
    'corned beef':'cow',
    'honey ham':'pig',
    'nova lox':'salmon'
}

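First convert each value to lowercase, since some entries are capitalized. Series.map accepts either a function or a dict-like object containing the mapping:

# Lowercase each value, then look it up in the dict
data['animal']=data['food'].map(str.lower).map(meat_to_animal)
# Equivalent: a single function that does both steps
data['food'].map(lambda x: meat_to_animal[x.lower()])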
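
The two forms behave differently when a value has no entry in the mapping. The sketch below uses a shortened dict and a made-up entry, 'Tofu', that is deliberately absent from it.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow', 'nova lox': 'salmon'}
# 'Tofu' is a hypothetical value with no mapping.
food = pd.Series(['bacon', 'Pastrami', 'nova lox', 'Tofu'])

# Mapping with a dict: values without an entry become NaN.
animal = food.str.lower().map(meat_to_animal)
animal_filled = animal.fillna('unknown')

# Mapping with a function that indexes the dict: a missing key raises KeyError.
missing = None
try:
    food.map(lambda x: meat_to_animal[x.lower()])
except KeyError as exc:
    missing = exc.args[0]

print(animal_filled.tolist())
print(missing)
```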

5. Replacing values

data = Series([1,-999,2,-1000,3,0])
data.replace([-999,-1000],np.nan)
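
replace can also assign a different substitute to each value, either as two parallel lists or as a dict. A short sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, -999, 2, -1000, 3, 0])

# Both sentinel values become NaN.
same = data.replace([-999, -1000], np.nan)
# One substitute per sentinel: -999 -> NaN, -1000 -> 0.
paired = data.replace([-999, -1000], [np.nan, 0])
# The same substitution written as a dict.
mapped = data.replace({-999: np.nan, -1000: 0})

print(same.isna().sum())
print(paired.tolist())
```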

6. Permutation and random sampling

Call np.random.permutation with the length of the axis you want to permute. It returns an array of integers describing the new order:

df = DataFrame(np.arange(5*4).reshape((5,4)))
sampler = np.random.permutation(5)
df.take(sampler)
# The same thing in a single expression
df.take(np.random.permutation(len(df)))
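
The code above only reorders rows. One way to draw a random sample, sketched here under the assumption that positional sampling with take is what is wanted, is to slice the permutation (without replacement) or draw random positions (with replacement). The names `k`, `subset`, `draws` and `resampled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Without replacement: keep only the first k entries of a permutation.
k = 3
subset = df.take(np.random.permutation(len(df))[:k])

# With replacement: draw random row positions; repeats are allowed.
draws = np.random.randint(0, len(df), size=10)
resampled = df.take(draws)

print(subset)
print(resampled.index.tolist())
```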

7. String manipulation

A comma-separated string can be broken into pieces with split:

val = 'a,b, guido'
val.split(',')
Out: ['a', 'b', ' guido']

Trim the surrounding whitespace with strip:

pieces = [x.strip() for x in val.split(',')]
pieces
Out: ['a', 'b', 'guido']
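
The split and strip steps can be wrapped in a small helper. `split_clean` is a hypothetical name, and dropping empty pieces is an added assumption that the text above does not require.

```python
def split_clean(text, sep=','):
    """Split text on sep, strip each piece, and drop empty pieces."""
    return [piece.strip() for piece in text.split(sep) if piece.strip()]


val = 'a,b, guido'
print(split_clean(val))
print(split_clean('a,, b ,'))
```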

The substrings can be concatenated with a double-colon delimiter using addition:

first,second,third = pieces
first+'::'+second+'::'+third
# Alternatively, pass the list to the join method of the delimiter string
"::".join(pieces)

8. Regular expressions (regex)

The functions in the `re` module fall into three categories: pattern matching, substitution, and splitting.

Split a string on a variable number of whitespace characters (tabs, spaces, newlines):

import re
text = "foo bar \t baz \tqux"
re.split(r'\s+',text)

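re.split first compiles the regular expression and then calls its split method on the text. You can compile the pattern yourself with re.compile to get a reusable regex object:

regex=re.compile(r'\s+')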
regex.split(text)

To get a list of all substrings that match the regex, use findall:

regex.findall(text)

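If you intend to apply the same expression to many strings, creating a regex object with re.compile is recommended, since reusing the compiled object can save CPU time.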
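
A minimal sketch of that pattern: compile once, then reuse the object in a loop. The list `lines` is made-up sample input.

```python
import re

# Compile the pattern once, outside the loop.
whitespace = re.compile(r'\s+')

# Hypothetical input lines.
lines = ['foo bar \t baz', 'one\ttwo', 'a  b   c']

# Reuse the compiled object for every string.
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```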

The re.IGNORECASE flag makes the regex case-insensitive.

text = """ Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

A regular expression that recognizes most email addresses:

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern,flags=re.IGNORECASE)
regex.findall(text)

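match only matches a pattern at the beginning of the string. By contrast, search returns the first match anywhere in the string, and findall returns all matches.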
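
The difference between search and match can be sketched on a shortened version of the sample text. The variable names `first` and `no_match` are illustrative.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# search scans the string and returns a match object for the first address.
m = regex.search(text)
first = text[m.start():m.end()]

# match succeeds only if the pattern matches at position 0.
# The text starts with a name, not an address, so the result is None.
no_match = regex.match(text)

print(first)
print(no_match)
```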

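To split each address into three components (username, domain name, and domain suffix), put parentheses around those parts of the pattern: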

pattern1 = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern1,flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
Out: ('wesm', 'bright', 'net')

regex.findall(text)
Out:
[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]
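
The substitution category mentioned at the start of this section has no example above. As a sketch, sub replaces every match with a new string, and \1, \2, \3 in the replacement refer to the captured groups. The sample text is shortened and the replacement strings are arbitrary.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# Replace every address with a fixed string.
redacted = regex.sub('REDACTED', text)
# Refer to the captured groups in the replacement with \1, \2, \3.
labeled = regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)

print(redacted)
print(labeled)
```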