【pandas cookbook】study notes, Chapter 11: Combining Pandas Objects

Introduction

A wide variety of options are available to combine two or more DataFrames or Series together:
the .append method is the least flexible and only allows new rows to be appended to a DataFrame;
the concat function is very versatile and can combine any number of DataFrames or Series on either axis;
the .join method provides fast lookups by aligning a column of one DataFrame to the index of another;
the .merge method provides SQL-like capabilities to join two DataFrames together.

appending new rows to DataFrames

crop up: to occur unexpectedly
Start by appending rows to a small dataset with the .loc attribute, then transition to using the .append method.
sneaky: adj., stealthy, furtive

import pandas as pd
import numpy as np
names=pd.read_csv('names.csv')
names
Out[5]: 
       Name  Age
0  Cornelia   70
1     Abbas   69
2  Penelope    4
3      Niko    2
# create a list that contains the new data
# and use the .loc attribute to set a single row label equal to this new data
new_data=['Aria',1]
names.loc[4]=new_data
# the .loc attribute refers to rows by label; here the new label 4 just happens to match the next integer position
# we can also append more rows with non-integer labels
names.loc['five']=['Zach',3]
names
Out[14]: 
          Name  Age
0     Cornelia   70
1        Abbas   69
2     Penelope    4
3         Niko    2
4         Aria    1
five      Zach    3
# to be more explicit in associating variables to values, 
# you may use a dictionary
# dynamically choose the new row label by using the current length of the DataFrame
names.loc[len(names)]=['Dean',32]
names
Out[19]: 
          Name  Age
0     Cornelia   70
1        Abbas   69
2     Penelope    4
3         Niko    2
4         Aria    1
five      Zach    3
6         Dean   32
names.loc[len(names)]={Name:'Dean',Age:32}
Traceback (most recent call last):
 ... ....
NameError: name 'Name' is not defined

names.loc[len(names)]={'Name':'Dean','Age':32}
# a Series can hold the new data as well
names.loc[len(names)]=pd.Series({'Name':'Dean','Age':32})
# using the .loc attribute modifies the `names` DataFrame in place;
# no separate copy of the DataFrame is returned
# the .append method does not modify the calling DataFrame;
# it returns a new copy of the DataFrame with the appended rows
# the first parameter to .append must be another DataFrame, Series, dictionary, or a list of these, but not a plain list like ['Aria',1]
names.append({'Name':'Tom','Age':3},ignore_index=True)
Out[29]: 
       Name  Age
0  Cornelia   70
1     Abbas   69
2  Penelope    4
3      Niko    2
4      Aria    1
5      Zach    3
6      Dean   32
7      Dean   32
8      Dean   32
9       Tom    3
# with ignore_index=True, the old index labels are removed completely
# and replaced with a RangeIndex from 0 to n-1
# now, let's specify a custom index for the names DataFrame
names.index=list('abcdefghij')
Traceback (most recent call last):
    raise ValueError(
ValueError: Length mismatch: Expected axis has 9 elements, new values have 10 elements

names.index=list('abcdefghi')
names
Out[35]: 
       Name  Age
a  Cornelia   70
b     Abbas   69
c  Penelope    4
d      Niko    2
e      Aria    1
f      Zach    3
g      Dean   32
h      Dean   32
i      Dean   32
# append a Series that has a name attribute with the .append method
s=pd.Series({'Name':'Jame','Age':10},name='j')
names.append(s)
Out[38]: 
       Name  Age
a  Cornelia   70
b     Abbas   69
c  Penelope    4
d      Niko    2
e      Aria    1
f      Zach    3
g      Dean   32
h      Dean   32
i      Dean   32
j      Jame   10
# the .append method is more flexible than the .loc attribute
# since it supports appending multiple rows at the same time
s1=pd.Series({'Name':'Zaa','Age':3},name=len(names))
s2=pd.Series({'Name':'Zrr','Age':9},name=len(names)+1)
names.append([s1,s2])
Out[43]: 
        Name  Age
a   Cornelia   70
b      Abbas   69
c   Penelope    4
d       Niko    2
e       Aria    1
f       Zach    3
g       Dean   32
h       Dean   32
i       Dean   32
9        Zaa    3
10       Zrr    9
# for a DataFrame with many columns, writing out a new-row dictionary by hand is tedious
# a trick: select a single row as a Series and chain the .to_dict method to it
# to get an example row as a dictionary
bball_16=pd.read_csv('baseball16.csv')
bball_16
Out[48]: 
     playerID  yearID  stint teamID lgID  ...   IBB   HBP   SH   SF  GIDP
0   altuvjo01    2016      1    HOU   AL  ...  11.0   7.0  3.0  7.0  15.0
1   bregmal01    2016      1    HOU   AL  ...   0.0   0.0  0.0  1.0   1.0
2   castrja01    2016      1    HOU   AL  ...   0.0   1.0  1.0  0.0   9.0
3   correca01    2016      1    HOU   AL  ...   5.0   5.0  0.0  3.0  12.0
4   gattiev01    2016      1    HOU   AL  ...   6.0   4.0  0.0  5.0  12.0
5   gomezca01    2016      1    HOU   AL  ...   2.0   4.0  3.0  0.0  11.0
6   gonzama01    2016      1    HOU   AL  ...   1.0   5.0  6.0  1.0  16.0
7   gourryu01    2016      1    HOU   AL  ...   0.0   1.0  0.0  1.0   7.0
8    kempto01    2016      1    HOU   AL  ...   0.0   0.0  1.0  1.0   5.0
9   marisja01    2016      1    HOU   AL  ...   0.0   3.0  4.0  1.0   4.0
10  rasmuco01    2016      1    HOU   AL  ...   0.0   0.0  1.0  4.0   5.0
11   reedaj01    2016      1    HOU   AL  ...   0.0   0.0  0.0  1.0   1.0
12  springe01    2016      1    HOU   AL  ...   2.0  11.0  0.0  1.0  12.0
13  tuckepr01    2016      1    HOU   AL  ...   0.0   2.0  0.0  0.0   2.0
14  valbulu01    2016      1    HOU   AL  ...   2.0   1.0  3.0  2.0   5.0
15  whitety01    2016      1    HOU   AL  ...   1.0   2.0  0.0  2.0   6.0
[16 rows x 22 columns]

data_dict=bball_16.iloc[0].to_dict()
data_dict
Out[51]: 
{'playerID': 'altuvjo01',
 'yearID': 2016,
 'stint': 1,
 'teamID': 'HOU',
 'lgID': 'AL',
 'G': 161,
 'AB': 640,
 'R': 108,
 'H': 216,
 '2B': 42,
 '3B': 5,
 'HR': 24,
 'RBI': 96.0,
 'SB': 30.0,
 'CS': 10.0,
 'BB': 60,
 'SO': 70.0,
 'IBB': 11.0,
 'HBP': 7.0,
 'SH': 3.0,
 'SF': 7.0,
 'GIDP': 15.0}
# clear the old values with a dictionary comprehension,
# replacing any previous string value with an empty string and all other values with missing values
new_data_dict={k: '' if isinstance(v,str) else np.nan 
               for (k,v) in data_dict.items()}
new_data_dict
Out[55]: 
{'playerID': '',
 'yearID': nan,
 'stint': nan,
 'teamID': '',
 'lgID': '',
 'G': nan,
 'AB': nan,
 'R': nan,
 'H': nan,
 '2B': nan,
 '3B': nan,
 'HR': nan,
 'RBI': nan,
 'SB': nan,
 'CS': nan,
 'BB': nan,
 'SO': nan,
 'IBB': nan,
 'HBP': nan,
 'SH': nan,
 'SF': nan,
 'GIDP': nan}
# this dictionary can serve as a template for any new data you would like to enter
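This template can then be copied and filled in to build a new row. A minimal sketch (the player values below are made up for illustration):

# copy the template, fill in only the fields we know, and append it as a new row
new_row=new_data_dict.copy()
new_row.update({'playerID':'doejo01','yearID':2016,'teamID':'HOU','G':50})
bball_16.append(new_row,ignore_index=True)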

Appending a single row to a DataFrame is a fairly expensive operation, so don't write a loop that appends rows to a DataFrame one at a time.


# a quick aside on Python's list.append (used below to build the list of Series):
a =[1,2]
a.append(3)
a
Out[60]: [1, 2, 3]
# !!! list.append modifies the list in place, unlike DataFrame.append, which returns a new DataFrame
# now create 1000 rows of new data as a list of Series
random_data=[]
for i in range(1000):
    d=dict()
    for k,v in data_dict.items():
        if isinstance(v,str):
            d[k]=np.random.choice(list('abcde'))   # random letter for the string columns
        else:
            d[k]=np.random.randint(10)             # random small integer for the numeric columns
    random_data.append(pd.Series(d,name=i+len(bball_16)))   # the name becomes the new row label
    
random_data[0]
Out[66]: 
playerID    e
yearID      9
stint       9
teamID      c
lgID        b
G           0
AB          3
R           6
H           0
2B          3
3B          4
HR          4
RBI         8
SB          6
CS          3
BB          6
SO          5
IBB         7
HBP         1
SH          6
SF          3
GIDP        5
Name: 16, dtype: object
# time how long it takes to loop through the items, making one append at a time
%%timeit
bball_16_copy=bball_16.copy()
for row in random_data:
    bball_16_copy=bball_16_copy.append(row)
6.2 s ± 376 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# passing in the entire list of Series gives an enormous speed increase
%%timeit
bball_16_copy=bball_16.append(random_data)
67 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# internally, pandas converts the list of Series to a single DataFrame 
# and then appends the data
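Note: in newer pandas releases, DataFrame.append has been deprecated (1.4) and removed (2.0); the recommended replacement is the concat function. A minimal sketch of the equivalent call, under that assumption:

# equivalent of bball_16.append(random_data) without using .append:
# build one DataFrame from the list of Series, then concatenate once
pd.concat([bball_16,pd.DataFrame(random_data)])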

Concatenating multiple DataFrames together

as per usual: as usual
watered-down: diluted, weakened
haphazardly: by chance, at random, in a disorderly way
the .append method is a heavily watered-down version of concat that can only append new rows to a DataFrame;
internally, .append just calls the concat function.

# the concat function enables concatenating two or more DataFrames (or Series) together, either vertically or horizontally
import pandas as pd
import numpy as np
stock_2016=pd.read_csv('stocks_2016.csv',index_col='Symbol')
stock_2017=pd.read_csv('stocks_2017.csv',index_col='Symbol')
stock_2016
Out[9]: 
        Shares  Low  High
Symbol                   
AAPL        80   95   110
TSLA        50   80   130
WMT         40   55    70
stock_2017
Out[10]: 
        Shares  Low  High
Symbol                   
AAPL        50  120   140
GE         100   30    40
IBM         87   75    95
SLB         20   55    85
TXN        500   15    23
TSLA       100  100   300
# place all the datasets into a single list
s_list=[stock_2016,stock_2017]
s_list
Out[13]: 
[        Shares  Low  High
 Symbol                   
 AAPL        80   95   110
 TSLA        50   80   130
 WMT         40   55    70,
         Shares  Low  High
 Symbol                   
 AAPL        50  120   140
 GE         100   30    40
 IBM         87   75    95
 SLB         20   55    85
 TXN        500   15    23
 TSLA       100  100   300]
pd.concat(s_list)
Out[14]: 
        Shares  Low  High
Symbol                   
AAPL        80   95   110
TSLA        50   80   130
WMT         40   55    70
AAPL        50  120   140
GE         100   30    40
IBM         87   75    95
SLB         20   55    85
TXN        500   15    23
TSLA       100  100   300
# call the concat function to concatenate them along the default axis (axis=0)
# in other words, by default the concat function concatenates DataFrames vertically,
# one on top of the other
type(pd.concat(s_list))
Out[18]: pandas.core.frame.DataFrame
# in the preceding DataFrame there is no way to identify the year of each row
# the `keys` parameter of the concat function labels each piece
pd.concat(s_list,keys=['2016','2017'])
Out[22]: 
             Shares  Low  High
     Symbol                   
2016 AAPL        80   95   110
     TSLA        50   80   130
     WMT         40   55    70
2017 AAPL        50  120   140
     GE         100   30    40
     IBM         87   75    95
     SLB         20   55    85
     TXN        500   15    23
     TSLA       100  100   300
pd.concat(s_list,keys=['2016','2017'],names=['Years','Symbol'])
Out[23]: 
              Shares  Low  High
Years Symbol                   
2016  AAPL        80   95   110
      TSLA        50   80   130
      WMT         40   55    70
2017  AAPL        50  120   140
      GE         100   30    40
      IBM         87   75    95
      SLB         20   55    85
      TXN        500   15    23
      TSLA       100  100   300
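Because the keys become the outer level of a MultiIndex, a single year can be pulled back out with .loc. A small usage example:

stocks_labeled=pd.concat(s_list,keys=['2016','2017'],names=['Years','Symbol'])
stocks_labeled.loc['2016']           # just the 2016 rows
stocks_labeled.loc[('2017','TSLA')]  # one specific row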
# change the axis from 0 to 1 to concatenate the DataFrames horizontally
pd.concat(s_list,axis=1)
Out[25]: 
        Shares   Low   High  Shares    Low   High
Symbol                                           
AAPL      80.0  95.0  110.0    50.0  120.0  140.0
TSLA      50.0  80.0  130.0   100.0  100.0  300.0
WMT       40.0  55.0   70.0     NaN    NaN    NaN
GE         NaN   NaN    NaN   100.0   30.0   40.0
IBM        NaN   NaN    NaN    87.0   75.0   95.0
SLB        NaN   NaN    NaN    20.0   55.0   85.0
TXN        NaN   NaN    NaN   500.0   15.0   23.0
pd.concat(s_list,axis=1,keys=['2016','2017'],names=['Year',None])
Out[26]: 
Year     2016                2017              
       Shares   Low   High Shares    Low   High
Symbol                                         
AAPL     80.0  95.0  110.0   50.0  120.0  140.0
TSLA     50.0  80.0  130.0  100.0  100.0  300.0
WMT      40.0  55.0   70.0    NaN    NaN    NaN
GE        NaN   NaN    NaN  100.0   30.0   40.0
IBM       NaN   NaN    NaN   87.0   75.0   95.0
SLB       NaN   NaN    NaN   20.0   55.0   85.0
TXN       NaN   NaN    NaN  500.0   15.0   23.0
pd.concat(s_list,axis='columns',keys=['2016','2017'],names=['Year',None])
Out[27]: 
Year     2016                2017              
       Shares   Low   High Shares    Low   High
Symbol                                         
AAPL     80.0  95.0  110.0   50.0  120.0  140.0
TSLA     50.0  80.0  130.0  100.0  100.0  300.0
WMT      40.0  55.0   70.0    NaN    NaN    NaN
GE        NaN   NaN    NaN  100.0   30.0   40.0
IBM       NaN   NaN    NaN   87.0   75.0   95.0
SLB       NaN   NaN    NaN   20.0   55.0   85.0
TXN       NaN   NaN    NaN  500.0   15.0   23.0
# missing values appear whenever a stock symbol is 
# present in one year but not the other 
# the concat function, by default, uses an outer join,
# keeping all rows from each DataFrame in the list
# set the `join` parameter to change the behavior
pd.concat(s_list,join='inner',keys=['2016','2017'],names=['Year',None])
Out[33]: 
           Shares  Low  High
Year                        
2016 AAPL      80   95   110
     TSLA      50   80   130
     WMT       40   55    70
2017 AAPL      50  120   140
     GE       100   30    40
     IBM       87   75    95
     SLB       20   55    85
     TXN      500   15    23
     TSLA     100  100   300
pd.concat(s_list,join='inner',keys=['2016','2017'],names=['Year',None],axis='columns')
Out[34]: 
Year     2016            2017          
       Shares Low High Shares  Low High
Symbol                                 
AAPL       80  95  110     50  120  140
TSLA       50  80  130    100  100  300
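Vertically, join='inner' only makes a visible difference when the DataFrames do not share all of their columns. A minimal sketch (the dropped column is hypothetical):

# drop 'High' from the 2016 data so the two frames no longer share every column
partial_2016=stock_2016.drop(columns='High')
pd.concat([partial_2016,stock_2017],join='inner')  # keeps only the shared Shares and Low columns
pd.concat([partial_2016,stock_2017],join='outer')  # keeps High too, with NaN for the 2016 rows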
# the concat function accepts a list as its first parameter
# this list must be a sequence of pandas objects,
# typically a list of DataFrames or Series
# by default, these objects are stacked vertically
# when concatenating vertically, the DataFrames align by their column names (a small sketch of this follows the output below)
stock_2016.append(stock_2017)
Out[40]: 
        Shares  Low  High
Symbol                   
AAPL        80   95   110
TSLA        50   80   130
WMT         40   55    70
AAPL        50  120   140
GE         100   30    40
IBM         87   75    95
SLB         20   55    85
TXN        500   15    23
TSLA       100  100   300
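As noted above, vertical concatenation aligns by column name rather than by position. A minimal sketch (the reordered frame is just for illustration):

# reorder the 2017 columns; concat still lines the values up under the right headers
reordered_2017=stock_2017[['High','Shares','Low']]
pd.concat([stock_2016,reordered_2017])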

Understanding the differences between concat, join, and merge

the .merge and .join DataFrame (not Series) methods and the concat function all provide very similar functionality to combine multiple pandas objects together.

.merge and .join are very similar: both are DataFrame methods, and both combine DataFrames horizontally. The difference is that .merge defaults to an inner join while .join defaults to a left join, though both can be changed via the `how` parameter.
concat is a function, and it can combine objects either vertically or horizontally.
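A minimal sketch of those defaults on two tiny hypothetical frames (left and right are made up for illustration):

left=pd.DataFrame({'key':['a','b'],'x':[1,2]}).set_index('key')
right=pd.DataFrame({'key':['b','c'],'y':[3,4]}).set_index('key')
left.merge(right,left_index=True,right_index=True)  # inner join by default -> only 'b'
left.join(right)                                    # left join by default -> 'a' and 'b'
pd.concat([left,right],axis='columns')              # outer join by default -> 'a', 'b', 'c'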
suffix: something appended at the end (e.g. '_2016')
steak: a cut of beef
coconut: a large tropical fruit
A method is a far better choice than B method: A is much better than B
switch gears to focus on ...: shift one's attention to ...
sanity check: a quick check that the results make sense
align data by the values in their columns: match rows by column values rather than by labels

import pandas as pd
import numpy as np
# read three years of stock data into a list of DataFrames with a list comprehension instead of three separate calls to the read_csv function
years=2016,2017,2018
# years is a tuple
stock_tables=[pd.read_csv(f'stocks_{year}.csv',index_col='Symbol') for year in years]
stock_tables
Out[9]: 
[        Shares  Low  High
 Symbol                   
 AAPL        80   95   110
 TSLA        50   80   130
 WMT         40   55    70,
         Shares  Low  High
 Symbol                   
 AAPL        50  120   140
 GE         100   30    40
 IBM         87   75    95
 SLB         20   55    85
 TXN        500   15    23
 TSLA       100  100   300,
         Shares  Low  High
 Symbol                   
 AAPL        40  135   170
 AMZN         8  900  1125
 TSLA        50  220   400]
# o_o .... I didn't know tuple unpacking could be used like this
stock_2016,stock_2017,stock_2018=stock_tables
stock_2016
Out[12]: 
        Shares  Low  High
Symbol                   
AAPL        80   95   110
TSLA        50   80   130
WMT         40   55    70
pd.concat(stock_tables,keys=['2016','2017','2018'])
Out[13]: 
             Shares  Low  High
     Symbol                   
2016 AAPL        80   95   110
     TSLA        50   80   130
     WMT         40   55    70
2017 AAPL        50  120   140
     GE         100   30    40
     IBM         87   75    95
     SLB         20   55    85
     TXN        500   15    23
     TSLA       100  100   300
2018 AAPL        40  135   170
     AMZN         8  900  1125
     TSLA        50  220   400
pd.concat(stock_tables,axis=1,keys=['2016','2017','2018'])
Out[14]: 
         2016                2017                 2018               
       Shares   Low   High Shares    Low   High Shares    Low    High
Symbol                                                               
AAPL     80.0  95.0  110.0   50.0  120.0  140.0   40.0  135.0   170.0
TSLA     50.0  80.0  130.0  100.0  100.0  300.0   50.0  220.0   400.0
WMT      40.0  55.0   70.0    NaN    NaN    NaN    NaN    NaN     NaN
GE        NaN   NaN    NaN  100.0   30.0   40.0    NaN    NaN     NaN
IBM       NaN   NaN    NaN   87.0   75.0   95.0    NaN    NaN     NaN
SLB       NaN   NaN    NaN   20.0   55.0   85.0    NaN    NaN     NaN
TXN       NaN   NaN    NaN  500.0   15.0   23.0    NaN    NaN     NaN
AMZN      NaN   NaN    NaN    NaN    NaN    NaN    8.0  900.0  1125.0
pd.concat(dict(zip(years,stock_tables)),axis='columns')
Out[15]: 
         2016                2017                 2018               
       Shares   Low   High Shares    Low   High Shares    Low    High
Symbol                                                               
AAPL     80.0  95.0  110.0   50.0  120.0  140.0   40.0  135.0   170.0
TSLA     50.0  80.0  130.0  100.0  100.0  300.0   50.0  220.0   400.0
WMT      40.0  55.0   70.0    NaN    NaN    NaN    NaN    NaN     NaN
GE        NaN   NaN    NaN  100.0   30.0   40.0    NaN    NaN     NaN
IBM       NaN   NaN    NaN   87.0   75.0   95.0    NaN    NaN     NaN
SLB       NaN   NaN    NaN   20.0   55.0   85.0    NaN    NaN     NaN
TXN       NaN   NaN    NaN  500.0   15.0   23.0    NaN    NaN     NaN
AMZN      NaN   NaN    NaN    NaN    NaN    NaN    8.0  900.0  1125.0

(pd.concat(dict(zip(years,stock_tables)),axis='columns')
 ==pd.concat(stock_tables,axis=1,keys=['2016','2017','2018']))
Traceback (most recent call last):
  ...
ValueError: Can only compare identically-labeled DataFrame objects

(pd.concat(dict(zip(years,stock_tables)),axis='columns')
 .equals(concat(stock_tables,axis=1,keys=['2016','2017','2018'])))
Traceback (most recent call last):
  ... ...
NameError: name 'concat' is not defined
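Both attempts fail for different reasons: the == comparison requires identically labeled objects (the dict version's keys are the integers 2016, 2017, 2018, while keys=['2016','2017','2018'] are strings), and the second attempt spells the function concat instead of pd.concat. A sketch that should compare equal once the labels match:

# use string keys in the dict so the column labels match keys=['2016','2017','2018']
d={str(year):table for year,table in zip(years,stock_tables)}
pd.concat(d,axis='columns').equals(
    pd.concat(stock_tables,axis=1,keys=['2016','2017','2018']))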
# when combining DataFrames horizontally,
# we can use the .join and .merge methods to replicate the functionality of the concat function
# here, use the .join method to combine the stock_2016 and stock_2017 DataFrames
# by default, the DataFrames align on their index
# for any columns with the same name, use the 'lsuffix' and
# 'rsuffix' parameters to distinguish them in the result
stock_2016.join(stock_2017,lsuffix='_2016',rsuffix='_2017',how='outer')
Out[25]: 
        Shares_2016  Low_2016  High_2016  Shares_2017  Low_2017  High_2017
Symbol                                                                    
AAPL           80.0      95.0      110.0         50.0     120.0      140.0
GE              NaN       NaN        NaN        100.0      30.0       40.0
IBM             NaN       NaN        NaN         87.0      75.0       95.0
SLB             NaN       NaN        NaN         20.0      55.0       85.0
TSLA           50.0      80.0      130.0        100.0     100.0      300.0
TXN             NaN       NaN        NaN        500.0      15.0       23.0
WMT            40.0      55.0       70.0          NaN       NaN        NaN
# to replicate the output of the concat function, pass a list
# of DataFrames to the .join method
other=[stock_2016.add_suffix('_2016'),stock_2017.add_suffix('_2017')]
other
Out[29]: 
[        Shares_2016  Low_2016  High_2016
 Symbol                                  
 AAPL             80        95        110
 TSLA             50        80        130
 WMT              40        55         70,
         Shares_2017  Low_2017  High_2017
 Symbol                                  
 AAPL             50       120        140
 GE              100        30         40
 IBM              87        75         95
 SLB              20        55         85
 TXN             500        15         23
 TSLA            100       100        300]
stock_2018.add_suffix('_2018').join(other,how='outer')
Out[30]: 
        Shares_2018  Low_2018  High_2018  ...  Shares_2017  Low_2017  High_2017
Symbol                                    ...                                  
AAPL           40.0     135.0      170.0  ...         50.0     120.0      140.0
AMZN            8.0     900.0     1125.0  ...          NaN       NaN        NaN
TSLA           50.0     220.0      400.0  ...        100.0     100.0      300.0
WMT             NaN       NaN        NaN  ...          NaN       NaN        NaN
GE              NaN       NaN        NaN  ...        100.0      30.0       40.0
IBM             NaN       NaN        NaN  ...         87.0      75.0       95.0
SLB             NaN       NaN        NaN  ...         20.0      55.0       85.0
TXN             NaN       NaN        NaN  ...        500.0      15.0       23.0
[8 rows x 9 columns]
# check whether they are equal 

stock_join=stock_2016.add_suffix('_2016').join(other,how='outer')
Traceback (most recent call last):
  ......
ValueError: Indexes have overlapping values: Index(['Shares_2016', 'Low_2016', 'High_2016'], dtype='object')
# because 'other' contains stock_2016 already.

stock_join=stock_2018.add_suffix('_2018').join(other,how='outer')
stock_concat=(
    pd.concat(
        dict(zip(years,stock_tables)),axis='columns')
    .swaplevel(axis=1)
    .pipe(lambda df_:df_.set_axis(df_.columns.to_flat_index(),axis=1))
    .rename(lambda label:'_'.join([str[x] for x in label),axis=1)))
  File "<ipython-input-34-62ed7569d7c4>", line 6
    .rename(lambda label:'_'.join([str[x] for x in label),axis=1)))
                                                        ^
SyntaxError: closing parenthesis ')' does not match opening parenthesis '['
stock_concat=(
    pd.concat(
        dict(zip(years,stock_tables)),axis='columns')
    .swaplevel(axis=1)
    .pipe(lambda df_:df_.set_axis(df_.columns.to_flat_index(),axis=1))
    .rename(lambda label:'_'.join([str(x) for x in label]),axis=1)))
  File "<ipython-input-35-1b47d0c47ebf>", line 6
    .rename(lambda label:'_'.join([str(x) for x in label]),axis=1)))
                                                                   ^
SyntaxError: unmatched ')'
stock_concat=(
    pd.concat(
        dict(zip(years,stock_tables)),axis='columns')
    .swaplevel(axis=1)
    .pipe(lambda df_:df_.set_axis(df_.columns.to_flat_index(),axis=1))
    .rename(lambda label:'_'.join([str(x) for x in label]),axis=1))
stock_concat.equals(stock_join)
Out[37]: False
stock_concat
Out[38]: 
        Shares_2016  Low_2016  High_2016  ...  Shares_2018  Low_2018  High_2018
Symbol                                    ...                                  
AAPL           80.0      95.0      110.0  ...         40.0     135.0      170.0
TSLA           50.0      80.0      130.0  ...         50.0     220.0      400.0
WMT            40.0      55.0       70.0  ...          NaN       NaN        NaN
GE              NaN       NaN        NaN  ...          NaN       NaN        NaN
IBM             NaN       NaN        NaN  ...          NaN       NaN        NaN
SLB             NaN       NaN        NaN  ...          NaN       NaN        NaN
TXN             NaN       NaN        NaN  ...          NaN       NaN        NaN
AMZN            NaN       NaN        NaN  ...          8.0     900.0     1125.0
[8 rows x 9 columns]
stock_join
Out[39]: 
        Shares_2018  Low_2018  High_2018  ...  Shares_2017  Low_2017  High_2017
Symbol                                    ...                                  
AAPL           40.0     135.0      170.0  ...         50.0     120.0      140.0
AMZN            8.0     900.0     1125.0  ...          NaN       NaN        NaN
TSLA           50.0     220.0      400.0  ...        100.0     100.0      300.0
WMT             NaN       NaN        NaN  ...          NaN       NaN        NaN
GE              NaN       NaN        NaN  ...        100.0      30.0       40.0
IBM             NaN       NaN        NaN  ...         87.0      75.0       95.0
SLB             NaN       NaN        NaN  ...         20.0      55.0       85.0
TXN             NaN       NaN        NaN  ...        500.0      15.0       23.0
[8 rows x 9 columns]
# Because I joined the years in the wrong order when building stock_join (2018 first), the result is False; with matching column order it would be True.
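To make the comparison meaningful, reorder stock_join's columns to match stock_concat before calling .equals; under that assumption the two should be identical:

# put the columns back into 2016/2017/2018 order, and sort both row indexes
stock_join[stock_concat.columns].sort_index().equals(stock_concat.sort_index())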
# unlike `concat` and `.join`, `.merge` can only combine two DataFrames at a time
# by default, `.merge` attempts to align on the values in the columns that have the same name in both DataFrames
# to have it align on the index instead, set the Boolean parameters `left_index` and `right_index` to True
stock_2016.index[:2]
Out[44]: Index(['AAPL', 'TSLA'], dtype='object', name='Symbol')
stock_2016.merge(stock_2017,right_index=True,left_index=True)
Out[45]: 
        Shares_x  Low_x  High_x  Shares_y  Low_y  High_y
Symbol                                                  
AAPL          80     95     110        50    120     140
TSLA          50     80     130       100    100     300
# by default, .merge uses an inner join and automatically supplies suffixes for identically named columns
# to replicate the concat function, change to an outer join and then perform another outer join with the 2018 data
stock_2016.merge(stock_2017,right_index=True,left_index=True,how='outer',suffixes=('_2016','_2017')).merge(stock_2018.add_suffix('_2018'),right_index=True,left_index=True,how='outer')
Out[48]: 
        Shares_2016  Low_2016  High_2016  ...  Shares_2018  Low_2018  High_2018
Symbol                                    ...                                  
AAPL           80.0      95.0      110.0  ...         40.0     135.0      170.0
AMZN            NaN       NaN        NaN  ...          8.0     900.0     1125.0
GE              NaN       NaN        NaN  ...          NaN       NaN        NaN
IBM             NaN       NaN        NaN  ...          NaN       NaN        NaN
SLB             NaN       NaN        NaN  ...          NaN       NaN        NaN
TSLA           50.0      80.0      130.0  ...         50.0     220.0      400.0
TXN             NaN       NaN        NaN  ...          NaN       NaN        NaN
WMT            40.0      55.0       70.0  ...          NaN       NaN        NaN
[8 rows x 9 columns]
stock_merge=stock_2016.merge(stock_2017,right_index=True,left_index=True,how='outer',suffixes=('_2016','_2017')).merge(stock_2018.add_suffix('_2018'),right_index=True,left_index=True,how='outer')
stock_merge.equals(stock_concat.sort_index())
Out[50]: True
# the .merge method is built for situations where you want to
# align on the values of columns, not on the index or column labels themselves
names=['prices','transactions']
food_table=[pd.read_csv("food_{}.csv".format(name)) for name in names]

food_table
Out[57]: 
[     item store  price  Date
 0    pear     A   0.99  2017
 1    pear     B   1.99  2017
 2   peach     A   2.99  2017
 3   peach     B   3.49  2017
 4  banana     A   0.39  2017
 5  banana     B   0.49  2017
 6   steak     A   5.99  2017
 7   steak     B   6.99  2017
 8   steak     B   4.99  2015,
    custid     item store  quantity
 0       1     pear     A         5
 1       1   banana     A        10
 2       2    steak     B         3
 3       2     pear     B         1
 4       2    peach     B         2
 5       2    steak     B         1
 6       2  coconut     B         4]
food_prices,food_transactions=food_table
food_prices
Out[60]: 
     item store  price  Date
0    pear     A   0.99  2017
1    pear     B   1.99  2017
2   peach     A   2.99  2017
3   peach     B   3.49  2017
4  banana     A   0.39  2017
5  banana     B   0.49  2017
6   steak     A   5.99  2017
7   steak     B   6.99  2017
8   steak     B   4.99  2015
# to find the total amount of each transaction, join these DataFrames on the `item` and `store` columns
food_transactions.merge(food_prices,on=['item','store'])
Out[62]: 
   custid    item store  quantity  price  Date
0       1    pear     A         5   0.99  2017
1       1  banana     A        10   0.39  2017
2       2   steak     B         3   6.99  2017
3       2   steak     B         3   4.99  2015
4       2   steak     B         1   6.99  2017
5       2   steak     B         1   4.99  2015
6       2    pear     B         1   1.99  2017
7       2   peach     B         2   3.49  2017
# the prices are aligned correctly with their corresponding items and stores
# but because the steak item appears twice for store B in each table,
# a Cartesian product takes place between those rows, resulting in four steak/B rows
# also, the coconut item is missing because there is no corresponding price for it
food_transactions.merge(food_prices.query("Date==2017"),on=['item','store'],how='left')
Out[67]: 
   custid     item store  quantity  price    Date
0       1     pear     A         5   0.99  2017.0
1       1   banana     A        10   0.39  2017.0
2       2    steak     B         3   6.99  2017.0
3       2     pear     B         1   1.99  2017.0
4       2    peach     B         2   3.49  2017.0
5       2    steak     B         1   6.99  2017.0
6       2  coconut     B         4    NaN     NaN
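# without the `on` parameter, .merge defaults to joining on all columns common to both DataFrames, so the result below is the same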
food_transactions.merge(food_prices.query("Date==2017"),how='left')
Out[68]: 
   custid     item store  quantity  price    Date
0       1     pear     A         5   0.99  2017.0
1       1   banana     A        10   0.39  2017.0
2       2    steak     B         3   6.99  2017.0
3       2     pear     B         1   1.99  2017.0
4       2    peach     B         2   3.49  2017.0
5       2    steak     B         1   6.99  2017.0
6       2  coconut     B         4    NaN     NaN

# to replicate this result with the .join method,
# we first need to put the joining columns of `food_prices` into its index. Why? let's explore.
food_price_join=food_prices.query('Date==2017')
food_transactions.join(food_price_join,on=["item",'store'])
Traceback (most recent call last):
  .. ...
ValueError: len(left_on) must equal the number of levels in the index of "right"
# don't delete this error: it reminds us that when using the .join method,
# the passed DataFrame needs its joining columns set as the index
# so, put the joining columns of the passed DataFrame into the index
food_prices_join=food_prices.query('Date==2017')\
    .set_index(['item','store'])
food_transactions.join(food_prices_join,on=['item','store'])
Out[77]: 
   custid     item store  quantity  price    Date
0       1     pear     A         5   0.99  2017.0
1       1   banana     A        10   0.39  2017.0
2       2    steak     B         3   6.99  2017.0
3       2     pear     B         1   1.99  2017.0
4       2    peach     B         2   3.49  2017.0
5       2    steak     B         1   6.99  2017.0
6       2  coconut     B         4    NaN     NaN

food_transactions.join(food_prices_join)
Traceback (most recent call last):
 ... ...
ValueError: cannot join with no overlapping index names

food_transactions.set_index(['item','store']).join(food_prices_join)
Out[79]: 
               custid  quantity  price    Date
item    store                                 
banana  A           1        10   0.39  2017.0
coconut B           2         4    NaN     NaN
peach   B           2         2   3.49  2017.0
pear    A           1         5   0.99  2017.0
        B           2         1   1.99  2017.0
steak   B           2         3   6.99  2017.0
        B           2         1   6.99  2017.0
food_prices_join
Out[80]: 
              price  Date
item   store             
pear   A       0.99  2017
       B       1.99  2017
peach  A       2.99  2017
       B       3.49  2017
banana A       0.39  2017
       B       0.49  2017
steak  A       5.99  2017
       B       6.99  2017
# the .join method only aligns with the index of the passed DataFrame,
# but it can use either the index or the columns of the calling DataFrame
# to use the columns of the calling DataFrame for alignment,
# pass them to the `on` parameter
# to replicate this with the concat function,
# you would need to put the `item` and `store` columns into the index of both DataFrames
# however, in this particular case an error is produced, because a duplicate index value occurs in at least one of the DataFrames (item steak, store B)
pd.concat([food_prices.query('Date==2017').set_index(['item','store']),food_transactions.set_index(['item','store'])],axis=1)
Traceback (most recent call last):
    ......
ValueError: Reindexing only valid with uniquely valued Index objects
# this is presumably because food_transactions contains two records with the same index (steak, store B)
food_prices.query('Date==2017') 
Out[89]: 
     item store  price  Date
0    pear     A   0.99  2017
1    pear     B   1.99  2017
2   peach     A   2.99  2017
3   peach     B   3.49  2017
4  banana     A   0.39  2017
5  banana     B   0.49  2017
6   steak     A   5.99  2017
7   steak     B   6.99  2017
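A quick sanity check confirms where the duplicate index values come from (assuming the (steak, B) rows are the culprit):

# the duplicated (item, store) pairs in food_transactions are what break the concat alignment
food_transactions.set_index(['item','store']).index.is_unique                 # False
food_prices.query('Date==2017').set_index(['item','store']).index.is_unique   # True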

it is possible to use .join in these instances, but all the columns in the passed DataFrame must first be moved into the index (this seems slightly off ... surely it should be only the joining columns, not all of them);
concat is going to be a poor choice whenever you intend to align data by the values in their columns.

# task: read all the files from a particular directory into a DataFrame without knowing their names
# Python provides a few ways to iterate through directories, with the `glob` module being a popular choice
# the `glob` module has the `glob` function,
# which takes a single parameter: the location of the directory you would like to iterate through, as a string
# to match every file in the directory you could use the wildcard '*'; here '*.csv' matches only the CSV files
# the glob function returns a list of string filenames
import glob
import pandas as pd
import numpy as np
df_list=[]
for filename in glob.glob('gas prices/*.csv'):
    df_list.append(pd.read_csv(filename,index_col='Week',parse_dates=['Week']))    
    
df_list
Out[21]: 
[            All Grades
 Week                  
 2017-09-25       2.701
 2017-09-18       2.750
 2017-09-11       2.800
 2017-09-04       2.794
 2017-08-28       2.513
 ...                ...
 2007-01-29       2.213
 2007-01-22       2.216
 2007-01-15       2.280
 2007-01-08       2.354
 2007-01-01       2.382
 
 [561 rows x 1 columns],
... ...
 
 
             Regular
 Week               
 2017-09-25    2.583
 2017-09-18    2.634
 2017-09-11    2.685
 2017-09-04    2.679
 2017-08-28    2.399
 ...             ...
 2007-01-29    2.165
 2007-01-22    2.165
 2007-01-15    2.229
 2007-01-08    2.306
 2007-01-01    2.334
 
 [561 rows x 1 columns]]
 
gas=pd.concat(df_list,axis='columns')
gas
Out[23]: 
            All Grades  Diesel  Midgrade  Premium  Regular
Week                                                      
2017-09-25       2.701   2.788     2.859    3.105    2.583
2017-09-18       2.750   2.791     2.906    3.151    2.634
2017-09-11       2.800   2.802     2.953    3.197    2.685
2017-09-04       2.794   2.758     2.946    3.191    2.679
2017-08-28       2.513   2.605     2.668    2.901    2.399
                ...     ...       ...      ...      ...
2007-01-29       2.213   2.413     2.277    2.381    2.165
2007-01-22       2.216   2.430     2.285    2.391    2.165
2007-01-15       2.280   2.463     2.347    2.453    2.229
2007-01-08       2.354   2.537     2.418    2.523    2.306
2007-01-01       2.382   2.580     2.442    2.547    2.334
[561 rows x 5 columns]
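Since each file stores the weeks newest-first, the combined result can be sorted chronologically; the whole read-and-concatenate step can also be written as a single expression (a sketch, assuming the same directory layout):

gas=pd.concat(
    (pd.read_csv(f,index_col='Week',parse_dates=['Week'])
     for f in glob.glob('gas prices/*.csv')),
    axis='columns').sort_index()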

Connecting to SQL databases

Skipping this for now; I'll come back and fill it in later 😭😭😭
