Data Integration and Data Cleaning (Chapter 5)

#coding:utf-8

import pandas as pd
import numpy as np
sample = pd.DataFrame(np.random.randn(4,5),
                      columns=['a', 'b', 'c', 'd', 'e'])
print(sample)

choose column a

sample['a']          # returns a Series
sample.ix[:, 'a']    # accepts labels or positions; removed in pandas 1.0, prefer loc/iloc
sample.iloc[1, :]    # positional (integer) indexing only
sample.loc[:, 'a']   # label-based indexing only
sample[['a']]        # returns a DataFrame

Create a new column

sample['new_col1'] = sample['a'] - sample['b']          # create a new column in place
sample.assign(new_col2 = sample['a'] - sample['b'])     # assign returns a new DataFrame
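
assign returns a copy rather than modifying sample in place, so bind the result if you want to keep the new column:

sample = sample.assign(new_col2 = sample['a'] - sample['b'])   # reassign to keep new_col2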

Delete columns

sample.drop('a', axis=1)          # delete one column
sample.drop(['a', 'b'], axis=1)   # delete two columns
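
Like most DataFrame methods, drop also returns a new object; reassign the result (or, in recent pandas, use the columns keyword) to make the deletion stick:

sample = sample.drop(['a', 'b'], axis=1)   # keep the result by reassigning
# equivalently in newer pandas: sample.drop(columns=['a', 'b'])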

Conditional selection

sample = pd.DataFrame({'name': ['Bob', 'Lindy', 'Mark', 'Miki', 'Sully', 'Rose'],
                       'score': [98, 78, 87, 77, 65, 67],
                       'group': [1, 1, 1, 2, 1, 2]})

one condition

sample[sample.score > 70]

more conditions

sample[(sample.score > 70) & (sample.group == 1)] # and
sample[~(sample.group == 1)] # not
sample[(sample.group == 1) | (sample.group == 2)] # or

use query

sample.query('score > 90')
sample.query('(group == 2) | (group == 1)')
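
query can also reference local Python variables with the @ prefix (threshold below is just an illustrative name):

threshold = 80
sample.query('score > @threshold')   # @ pulls in the local variable threshold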

Other selection methods

sample[sample['score'].between(70, 80, inclusive=True)]   # between; inclusive controls the endpoints (newer pandas uses 'both'/'neither')
sample[sample['name'].isin(['Bob', 'Lindy'])]             # isin: membership test
sample[sample['name'].str.contains('[M]+')]               # str.contains with a regular expression

JOIN

The first kind works like a SQL join on key columns.

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'col1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [4, 3],
                    'col2': ['d', 'e']})

inner join

df1.merge(df2, how='inner', on='id')
df1.merge(df2, how='inner', left_on='id', right_on='id')

left join

df1.merge(df2, how='left', on='id')

right join

df1.merge(df2, how='right', on='id')

outer join

df1.merge(df2, how='outer', on='id')
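
For the df1 and df2 defined above, the only id present in both frames is 3, so the four joins give:

# inner: id 3 only                   -> one row (col1='c', col2='e')
# left : ids 1, 2, 3 from df1        -> col2 is NaN for ids 1 and 2
# right: ids 4, 3 from df2           -> col1 is NaN for id 4
# outer: ids 1, 2, 3, 4 (the union)  -> NaN wherever a side has no match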

Row index join

df1 = pd.DataFrame({'id1': [1, 2, 3],
                    'col1': ['a', 'b', 'c']},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'id2': [1, 2, 3],
                    'col2': ['aa', 'bb', 'cc']},
                   index=[1, 3, 2])

pd.concat([df1,df2],axis = 1)
df1.join(df2)
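
Both calls align rows on the index labels, not on row order, so df2's shuffled index matters:

# index 1 -> id1=1, col1='a', id2=1, col2='aa'
# index 2 -> id1=2, col1='b', id2=3, col2='cc'   (index 2 is listed last in df2)
# index 3 -> id1=3, col1='c', id2=2, col2='bb'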

Column index join

pd.concat([df1, df2], ignore_index=True, axis=0)
pd.concat([df1, df2], ignore_index=True, axis=0).drop_duplicates()
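
Because df1 and df2 above use different column names, stacking them fills the non-shared columns with NaN. Renaming the columns first (the shared names below are just for illustration) gives a clean row union:

pd.concat([df1.rename(columns={'id1': 'id', 'col1': 'col'}),
           df2.rename(columns={'id2': 'id', 'col2': 'col'})],
          ignore_index=True, axis=0).drop_duplicates()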

Sort

The main sorting methods are sort_values, sort_index, and sortlevel (sortlevel is deprecated; use sort_index with the level argument instead).
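
A few typical calls, using the name/score/group sample DataFrame from above:

sample.sort_values('score', ascending=False)   # sort by one column, descending
sample.sort_values(['group', 'score'])         # sort by several columns
sample.sort_index()                            # sort by the row index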
