1. venn diagram
full outer join (union)
inner join (intersection)
2. merge (join horizontally)
staff_df = pd.DataFrame([{
'Name': 'Kelly', 'Role': 'HR'}, {
'Name': 'Sally', 'Role': 'TA'}, {
'Name': 'James', 'Role': 'GD'}])
staff_df = staff_df.set_index('Name')
stu_df = pd.DataFrame([{
'Name': 'James', 'School': 'Com'}, {
'Name': 'Mike', 'School': 'Bus'}, {
'Name': 'Sally', 'School': 'Eng'}])
stu_df = studf.set_index('Name')
merge
- use indices
union
pd.merge(staff_df, stu_df, how='outer', left_index=True, right_indtx=True)
intersection
pd.merge(staff_df, stu_df, how='inner', left_index=True, right_indtx=True)
left join
pd.merge(staff_df, stu_df, how='left', left_index=True, right_indtx=True)
Name | Role | School |
---|---|---|
Kelly | HR | NaN |
Sally | TA | Eng |
James | GD | Bus |
right join (ommited)
- use cloumns
pd.merge(staff_df, stu_df, how='right', on='Name')
- when 2 dataframes have conflitc
staff_df = pd.DataFrame([{
'Name': 'Kelly', 'Role': 'HR', 'Location': 'KKK'},
{
'Name': 'Sally', 'Role': 'TA', 'Location': 'SSS'},
{
'Name': 'James', 'Role': 'GD', 'Location': 'JJJ'}])
stu_df = pd.DataFrame([{
'Name': 'James', 'School': 'Bus','Location': '1024jjj'},
{
'Name': 'Mike', 'School': 'Law', 'Location': 'mmm22'},
{
'Name': 'Sally', 'School': 'Eng', 'Location': '512sss'}])
pd.merge(staff_df, stu_df, how='left', on='Name')
Name Role Location_x School Location_y
multi-indexing
pd.merge(staff_df, stu_df, how='inner', on=['FName','LName'])
3. concat (join vertically)
3.1 cell magic “%%capture”
3.2 concat
frames = [df_2011, df_2012, df_2013]
pd.concat(frames, keys=['2011', '2-12', '2013'] )
#set an extra level of indices with 'keys'
Concatenation also has inner and outer method. If two dataframes do not have identical columns, and choose the outer method, some cells will be NaN. choose to do inner, then some observations will be dropped due to NaN values.
4. pandas idioms - pandorable
4.1 method chaining
def first_approach():
global df
return(df.where(df['SUMLEV']==50)
.dropna()
.set_index()['STNAEME','CTYNAME']
.rename(columns={
'ESTIMATESBASE2010':'Estimates Base 2010'}))
df = pd.read_csv('datasets/census.csv')
timeit.timeit(first_approac