Data Integration and Data Cleaning (Chapter 5)

#coding:utf-8

import pandas as pd
import numpy as np
sample = pd.DataFrame(np.random.randn(4,5),
                      columns=['a', 'b', 'c', 'd', 'e'])
print(sample)

choose column a

sample['a']          # returns a Series
sample.ix[:, 'a']    # accepts labels or positions; removed in pandas 1.0, prefer loc/iloc
sample.iloc[1, :]    # positional (integer) indexing only
sample.loc[:, 'a']   # label-based indexing only
sample[['a']]        # returns a DataFrame

Create a new column

sample['new_col1'] = sample['a'] - sample['b']          # create a new column in place
sample.assign(new_col2 = sample['a'] - sample['b'])     # assign returns a new DataFrame
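
assign returns a copy rather than modifying sample in place, so bind the result if you want to keep the new column:

sample = sample.assign(new_col2 = sample['a'] - sample['b'])   # reassign to keep new_col2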

Delete columns

sample.drop('a', axis=1)          # delete one column
sample.drop(['a', 'b'], axis=1)   # delete two columns
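
Like most DataFrame methods, drop also returns a new object; reassign the result (or, in recent pandas, use the columns keyword) to make the deletion stick:

sample = sample.drop(['a', 'b'], axis=1)   # keep the result by reassigning
# equivalently in newer pandas: sample.drop(columns=['a', 'b'])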

Conditional selection

sample = pd.DataFrame({'name': ['Bob', 'Lindy', 'Mark', 'Miki', 'Sully', 'Rose'],
                       'score': [98, 78, 87, 77, 65, 67],
                       'group': [1, 1, 1, 2, 1, 2]})

one condition

sample[sample.score > 70]

more conditions

sample[(sample.score > 70) & (sample.group == 1)] # and
sample[~(sample.group == 1)] # not
sample[(sample.group == 1) | (sample.group == 2)] # or

use query

sample.query('score > 90')
sample.query('(group == 2) | (group == 1)')
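
query can also reference local Python variables with the @ prefix (threshold below is just an illustrative name):

threshold = 80
sample.query('score > @threshold')   # @ pulls in the local variable threshold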

Other selection methods

sample[sample['score'].between(70, 80, inclusive=True)]   # between; inclusive controls the endpoints (newer pandas uses 'both'/'neither')
sample[sample['name'].isin(['Bob', 'Lindy'])]             # isin: membership test
sample[sample['name'].str.contains('[M]+')]               # str.contains with a regular expression

JOIN

The first kind works like a SQL join on key columns.

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'col1': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [4, 3],
                    'col2': ['d', 'e']})

inner join

df1.merge(df2, how='inner', on='id')
df1.merge(df2, how='inner', left_on='id', right_on='id')

left join

df1.merge(df2, how='left', on='id')

right join

df1.merge(df2, how='right', on='id')

outer join

df1.merge(df2, how='outer', on='id')
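
For the df1 and df2 defined above, the only id present in both frames is 3, so the four joins give:

# inner: id 3 only                   -> one row (col1='c', col2='e')
# left : ids 1, 2, 3 from df1        -> col2 is NaN for ids 1 and 2
# right: ids 4, 3 from df2           -> col1 is NaN for id 4
# outer: ids 1, 2, 3, 4 (the union)  -> NaN wherever a side has no match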

Row index join

df1 = pd.DataFrame({'id1': [1, 2, 3],
                    'col1': ['a', 'b', 'c']},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'id2': [1, 2, 3],
                    'col2': ['aa', 'bb', 'cc']},
                   index=[1, 3, 2])

pd.concat([df1,df2],axis = 1)
df1.join(df2)
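
Both calls align rows on the index labels, not on row order, so df2's shuffled index matters:

# index 1 -> id1=1, col1='a', id2=1, col2='aa'
# index 2 -> id1=2, col1='b', id2=3, col2='cc'   (index 2 is listed last in df2)
# index 3 -> id1=3, col1='c', id2=2, col2='bb'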

Column index join

pd.concat([df1, df2], ignore_index=True, axis=0)
pd.concat([df1, df2], ignore_index=True, axis=0).drop_duplicates()
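
Because df1 and df2 above use different column names, stacking them fills the non-shared columns with NaN. Renaming the columns first (the shared names below are just for illustration) gives a clean row union:

pd.concat([df1.rename(columns={'id1': 'id', 'col1': 'col'}),
           df2.rename(columns={'id2': 'id', 'col2': 'col'})],
          ignore_index=True, axis=0).drop_duplicates()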

Sort

The main sorting methods are sort_values, sort_index, and sortlevel (sortlevel is deprecated; use sort_index with the level argument instead).
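
A few typical calls, using the name/score/group sample DataFrame from above:

sample.sort_values('score', ascending=False)   # sort by one column, descending
sample.sort_values(['group', 'score'])         # sort by several columns
sample.sort_index()                            # sort by the row index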
