pandas learning notes

API overview

What kind of data does pandas handle?

How do I read and write tabular data?

How do I select a subset of a DataFrame?

How to select a subset of a DataFrame

In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

Selecting one or more columns

# select the Age column
In [4]: ages = titanic["Age"]

In [5]: ages.head()
Out[5]: 
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
In [7]: titanic["Age"].shape
Out[7]: (891,)
# select two or more columns
In [8]: age_sex = titanic[["Age", "Sex"]]

In [9]: age_sex.head()
Out[9]: 
    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male
In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
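
Note that selecting with a single label returns a Series, while a list of labels returns a DataFrame; a quick check on the same titanic DataFrame:

# single brackets -> Series, double brackets -> DataFrame
print(type(titanic["Age"]))           # pandas.core.series.Series
print(type(titanic[["Age", "Sex"]]))  # pandas.core.frame.DataFrame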

Filtering: selecting the rows (samples) that satisfy a condition

In [12]: above_35 = titanic[titanic["Age"] > 35]

In [13]: above_35.head()
Out[13]: 
    PassengerId  Survived  Pclass                                               Name  ...    Ticket     Fare  Cabin  Embarked
1             2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...  PC 17599  71.2833    C85         C
6             7         0       1                            McCarthy, Mr. Timothy J  ...     17463  51.8625    E46         S
11           12         1       1                           Bonnell, Miss. Elizabeth  ...    113783  26.5500   C103         S
13           14         0       3                        Andersson, Mr. Anders Johan  ...    347082  31.2750    NaN         S
15           16         1       2                   Hewlett, Mrs. (Mary D Kingcome)   ...    248706  16.0000    NaN         S

[5 rows x 12 columns]

The condition itself produces a boolean Series that is True where Age is greater than 35:

In [14]: titanic["Age"] > 35
Out[14]: 
0      False
1       True
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool
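
Several conditions can be combined, as long as each one is wrapped in parentheses and the element-wise operators &, | and ~ are used instead of and/or/not; a small sketch:

# passengers older than 35 travelling in first class
older_first_class = titanic[(titanic["Age"] > 35) & (titanic["Pclass"] == 1)]

# everyone who is not in third class
not_third = titanic[~(titanic["Pclass"] == 3)]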

Selecting rows whose value in a column belongs to a given set

In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]

In [17]: class_23.head()
Out[17]: 
   PassengerId  Survived  Pclass                            Name  ...            Ticket     Fare  Cabin  Embarked
0            1         0       3         Braund, Mr. Owen Harris  ...         A/5 21171   7.2500    NaN         S
2            3         1       3          Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250    NaN         S
4            5         0       3        Allen, Mr. William Henry  ...            373450   8.0500    NaN         S
5            6         0       3                Moran, Mr. James  ...            330877   8.4583    NaN         Q
7            8         0       3  Palsson, Master. Gosta Leonard  ...            349909  21.0750    NaN         S

[5 rows x 12 columns]

The above is equivalent to combining two equality conditions with OR (|):

In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

In [19]: class_23.head()
Out[19]: 
   PassengerId  Survived  Pclass                            Name  ...            Ticket     Fare  Cabin  Embarked
0            1         0       3         Braund, Mr. Owen Harris  ...         A/5 21171   7.2500    NaN         S
2            3         1       3          Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250    NaN         S
4            5         0       3        Allen, Mr. William Henry  ...            373450   8.0500    NaN         S
5            6         0       3                Moran, Mr. James  ...            330877   8.4583    NaN         Q
7            8         0       3  Palsson, Master. Gosta Leonard  ...            349909  21.0750    NaN         S

[5 rows x 12 columns]

Selecting rows where a column's value is not missing (not NaN)

In [20]: age_no_na = titanic[titanic["Age"].notna()]

In [21]: age_no_na.head()
Out[21]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
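
The complementary method isna() selects the rows where the value is missing, and dropna() is a shortcut that drops them; a small sketch:

age_missing = titanic[titanic["Age"].isna()]   # rows with a missing Age
age_no_na_2 = titanic.dropna(subset=["Age"])   # same rows as the notna() filter above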

Selecting the values of one column for rows where another column meets a condition: e.g. the Name of all passengers older than 35

In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [24]: adult_names.head()
Out[24]: 
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

Selecting rows and columns by position with slicing

In [25]: titanic.iloc[9:25, 2:5]
Out[25]: 
    Pclass                                 Name     Sex
9        2  Nasser, Mrs. Nicholas (Adele Achem)  female
10       3      Sandstrom, Miss. Marguerite Rut  female
11       1             Bonnell, Miss. Elizabeth  female
12       3       Saundercock, Mr. William Henry    male
13       3          Andersson, Mr. Anders Johan    male
..     ...                                  ...     ...
20       2                 Fynney, Mr. Joseph J    male
21       2                Beesley, Mr. Lawrence    male
22       3          McGowan, Miss. Anna "Annie"  female
23       1         Sloper, Mr. William Thompson    male
24       3        Palsson, Miss. Torborg Danira  female

[16 rows x 3 columns]

Position-based slicing: titanic.iloc[9:25, 2:5]
Label/condition-based selection: titanic.loc[titanic["Age"] > 35, "Name"]
Assigning new values to a slice

In [26]: titanic.iloc[0:3, 3] = "anonymous"
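
A quick check of the assignment (in this dataset, column position 3 is the Name column):

# the first three rows of the Name column are now "anonymous"
print(titanic.iloc[0:3, 3])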

How to create plots in pandas?

Plotting directly from a pandas object with .plot()

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt
In [3]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

In [4]: air_quality.head()
Out[4]: 
                     station_antwerp  station_paris  station_london
datetime                                                           
2019-05-07 02:00:00              NaN            NaN            23.0
2019-05-07 03:00:00             50.5           25.0            19.0
2019-05-07 04:00:00             45.0           27.7            19.0
2019-05-07 05:00:00              NaN           50.4            16.0
2019-05-07 06:00:00              NaN           61.9             NaN
In [5]: air_quality.plot()
Out[5]: <AxesSubplot:xlabel='datetime'>

By default, plot() creates one line for each column that contains numerical data.
Plotting a single column

In [6]: air_quality["station_paris"].plot()
Out[6]: <AxesSubplot:xlabel='datetime'>

Visualizing the relationship between two columns with a scatter plot

In [7]: air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
Out[7]: <AxesSubplot:xlabel='station_london', ylabel='station_paris'>

Listing the available plot methods:

In [8]: [
   ...:     method_name
   ...:     for method_name in dir(air_quality.plot)
   ...:     if not method_name.startswith("_")
   ...: ]
   ...: 
Out[8]: 
['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']

Picking one of these methods, e.g. DataFrame.plot.box() for a box plot.
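
A minimal sketch of that box plot, assuming air_quality is still loaded:

air_quality.plot.box()   # one box per numerical column
plt.show()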
Plotting each column in its own subplot

In [10]: axs = air_quality.plot.area(figsize=(12, 4), subplots=True)

Saving the figure

In [11]: fig, axs = plt.subplots(figsize=(12, 4))# Create an empty matplotlib Figure and Axes

In [12]: air_quality.plot.area(ax=axs)# Use pandas to put the area plot on the prepared Figure/Axes
Out[12]: <AxesSubplot:xlabel='datetime'>

In [13]: axs.set_ylabel("NO$_2$ concentration")# Do any matplotlib customization you like
Out[13]: Text(0, 0.5, 'NO$_2$ concentration')

In [14]: fig.savefig("no2_concentrations.png")# Save the Figure/Axes using the existing matplotlib method.


How to create new columns derived from existing columns?

In [1]: import pandas as pd
In [2]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

In [3]: air_quality.head()
Out[3]: 
                     station_antwerp  station_paris  station_london
datetime                                                           
2019-05-07 02:00:00              NaN            NaN            23.0
2019-05-07 03:00:00             50.5           25.0            19.0
2019-05-07 04:00:00             45.0           27.7            19.0
2019-05-07 05:00:00              NaN           50.4            16.0
2019-05-07 06:00:00              NaN           61.9             NaN

Creating a new column

To add a new column, assign the result of an expression to a new column label, e.g. converting the London station values to another unit:

In [4]: air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

In [5]: air_quality.head()
Out[5]: 
                     station_antwerp  station_paris  station_london  london_mg_per_cubic
datetime                                                                                
2019-05-07 02:00:00              NaN            NaN            23.0               43.286
2019-05-07 03:00:00             50.5           25.0            19.0               35.758
2019-05-07 04:00:00             45.0           27.7            19.0               35.758
2019-05-07 05:00:00              NaN           50.4            16.0               30.112
2019-05-07 06:00:00              NaN           61.9             NaN                  NaN

A new column can also be the result of element-wise arithmetic between existing columns; the calculation is applied row by row without an explicit loop:

In [6]: air_quality["ratio_paris_antwerp"] = (
   ...:     air_quality["station_paris"] / air_quality["station_antwerp"]
   ...: )
   ...: 

In [7]: air_quality.head()
Out[7]: 
                     station_antwerp  station_paris  station_london  london_mg_per_cubic  ratio_paris_antwerp
datetime                                                                                                     
2019-05-07 02:00:00              NaN            NaN            23.0               43.286                  NaN
2019-05-07 03:00:00             50.5           25.0            19.0               35.758             0.495050
2019-05-07 04:00:00             45.0           27.7            19.0               35.758             0.615556
2019-05-07 05:00:00              NaN           50.4            16.0               30.112                  NaN
2019-05-07 06:00:00              NaN           61.9             NaN                  NaN                  NaN
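
As a side note, DataFrame.assign() can add several derived columns in one call; a small sketch that produces the same two columns (not part of the original tutorial):

air_quality_derived = air_quality.assign(
    london_mg_per_cubic=air_quality["station_london"] * 1.882,
    ratio_paris_antwerp=air_quality["station_paris"] / air_quality["station_antwerp"],
)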

Renaming existing columns

In [8]: air_quality_renamed = air_quality.rename(
   ...:     columns={
   ...:         "station_antwerp": "BETR801",
   ...:         "station_paris": "FR04014",
   ...:         "station_london": "London Westminster",
   ...:     }
   ...: )
   ...: 
   
In [9]: air_quality_renamed.head()
Out[9]: 
                     BETR801  FR04014  London Westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime                                                                                           
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN

Converting the column names to lowercase (rename() also accepts a function applied to each column name)

In [10]: air_quality_renamed = air_quality_renamed.rename(columns=str.lower)

In [11]: air_quality_renamed.head()
Out[11]: 
                     betr801  fr04014  london westminster  london_mg_per_cubic  ratio_paris_antwerp
datetime                                                                                           
2019-05-07 02:00:00      NaN      NaN                23.0               43.286                  NaN
2019-05-07 03:00:00     50.5     25.0                19.0               35.758             0.495050
2019-05-07 04:00:00     45.0     27.7                19.0               35.758             0.615556
2019-05-07 05:00:00      NaN     50.4                16.0               30.112                  NaN
2019-05-07 06:00:00      NaN     61.9                 NaN                  NaN                  NaN

How to calculate summary statistics?

In [1]: import pandas as pd
In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

Computing the mean of a column

In [4]: titanic["Age"].mean()
Out[4]: 29.69911764705882

Computing the median of several columns

In [5]: titanic[["Age", "Fare"]].median()
Out[5]: 
Age     28.0000
Fare    14.4542
dtype: float64

Getting a statistical summary with describe()

In [6]: titanic[["Age", "Fare"]].describe()
Out[6]: 
              Age        Fare
count  714.000000  891.000000
mean    29.699118   32.204208
std     14.526497   49.693429
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     80.000000  512.329200

Using agg() to compute custom statistics per column

In [7]: titanic.agg(
   ...:     {
   ...:         "Age": ["min", "max", "median", "skew"],
   ...:         "Fare": ["min", "max", "median", "mean"],
   ...:     }
   ...: )
   ...: 
Out[7]: 
              Age        Fare
min      0.420000    0.000000
max     80.000000  512.329200
median  28.000000   14.454200
skew     0.389108         NaN
mean          NaN   32.204208

Grouping by a column and computing the mean per group

In [8]: titanic[["Sex", "Age"]].groupby("Sex").mean()
Out[8]: 
              Age
Sex              
female  27.915709
male    30.726645


In [9]: titanic.groupby("Sex").mean()
Out[9]: 
        PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
Sex                                                                              
female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682  44.479818
male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702  25.523893

In [10]: titanic.groupby("Sex")["Age"].mean()
Out[10]: 
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

Grouping by two columns (two-level grouping)

In [11]: titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
Out[11]: 
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
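
The result above has a two-level index (Sex, Pclass); reset_index() turns the group keys back into ordinary columns, which is often handier for further processing; a small sketch:

fare_per_group = titanic.groupby(["Sex", "Pclass"])["Fare"].mean().reset_index()
print(fare_per_group)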

Counting the occurrences of each value in a column

In [12]: titanic["Pclass"].value_counts()
Out[12]: 
3    491
1    216
2    184
Name: Pclass, dtype: int64

value_counts() is a shortcut for a groupby combined with count (note that this variant is ordered by group key rather than by count):

In [13]: titanic.groupby("Pclass")["Pclass"].count()
Out[13]: 
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
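
value_counts() also has a few useful options; a small sketch:

titanic["Pclass"].value_counts(normalize=True)   # fractions of the total instead of raw counts
titanic["Pclass"].value_counts(ascending=True)   # smallest count first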

How to reshape the layout of tables?

In [1]: import pandas as pd
In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

In [4]: air_quality = pd.read_csv(
   ...:     "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
   ...: )
   ...: 

In [5]: air_quality.head()
Out[5]: 
                                city country location parameter  value   unit
date.utc                                                                     
2019-06-18 06:00:00+00:00  Antwerpen      BE  BETR801      pm25   18.0  µg/m³
2019-06-17 08:00:00+00:00  Antwerpen      BE  BETR801      pm25    6.5  µg/m³
2019-06-17 07:00:00+00:00  Antwerpen      BE  BETR801      pm25   18.5  µg/m³
2019-06-17 06:00:00+00:00  Antwerpen      BE  BETR801      pm25   16.0  µg/m³
2019-06-17 05:00:00+00:00  Antwerpen      BE  BETR801      pm25    7.5  µg/m³

Sorting by a column (ascending or descending)

In [6]: titanic.sort_values(by="Age").head()
Out[6]: 
     PassengerId  Survived  Pclass                             Name     Sex  ...  Parch  Ticket     Fare Cabin  Embarked
803          804         1       3  Thomas, Master. Assad Alexander    male  ...      1    2625   8.5167   NaN         C
755          756         1       2        Hamalainen, Master. Viljo    male  ...      1  250649  14.5000   NaN         S
644          645         1       3           Baclini, Miss. Eugenie  female  ...      1    2666  19.2583   NaN         C
469          470         1       3    Baclini, Miss. Helene Barbara  female  ...      1    2666  19.2583   NaN         C
78            79         1       2    Caldwell, Master. Alden Gates    male  ...      2  248738  29.0000   NaN         S

[5 rows x 12 columns]

Sorting by ['Pclass', 'Age'] in descending order

In [7]: titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[7]: 
     PassengerId  Survived  Pclass                       Name     Sex  ...  Parch  Ticket    Fare Cabin  Embarked
851          852         0       3        Svensson, Mr. Johan    male  ...      0  347060  7.7750   NaN         S
116          117         0       3       Connors, Mr. Patrick    male  ...      0  370369  7.7500   NaN         Q
280          281         0       3           Duane, Mr. Frank    male  ...      0  336439  7.7500   NaN         Q
483          484         1       3     Turkula, Mrs. (Hedwig)  female  ...      0    4134  9.5875   NaN         S
326          327         0       3  Nysveen, Mr. Johan Hansen    male  ...      0  345364  6.2375   NaN         S

[5 rows x 12 columns]

Keeping only the rows where a column meets a condition

# filter for no2 data only
In [8]: no2 = air_quality[air_quality["parameter"] == "no2"]

Let's work with a small subset of the air quality dataset: we focus on the NO2 data and only use the first two measurements of each location (i.e. the head of each group). The resulting subset is called no2_subset.


# use 2 measurements (head) for each location (groupby)
# i.e. take the first two rows of each group
In [9]: no2_subset = no2.sort_index().groupby(["location"]).head(2)

In [10]: no2_subset
Out[10]: 
                                city country            location parameter  value   unit
date.utc                                                                                
2019-04-09 01:00:00+00:00  Antwerpen      BE             BETR801       no2   22.5  µg/m³
2019-04-09 01:00:00+00:00      Paris      FR             FR04014       no2   24.4  µg/m³
2019-04-09 02:00:00+00:00     London      GB  London Westminster       no2   67.0  µg/m³
2019-04-09 02:00:00+00:00  Antwerpen      BE             BETR801       no2   53.5  µg/m³
2019-04-09 02:00:00+00:00      Paris      FR             FR04014       no2   27.4  µg/m³
2019-04-09 03:00:00+00:00     London      GB  London Westminster       no2   67.0  µg/m³

I want the values for the three stations as separate columns next to each other

In [11]: no2_subset.pivot(columns="location", values="value")
Out[11]: 
location                   BETR801  FR04014  London Westminster
date.utc                                                       
2019-04-09 01:00:00+00:00     22.5     24.4                 NaN
2019-04-09 02:00:00+00:00     53.5     27.4                67.0
2019-04-09 03:00:00+00:00      NaN      NaN                67.0

In [12]: no2.head()
Out[12]: 
                            city country location parameter  value   unit
date.utc                                                                 
2019-06-21 00:00:00+00:00  Paris      FR  FR04014       no2   20.0  µg/m³
2019-06-20 23:00:00+00:00  Paris      FR  FR04014       no2   21.8  µg/m³
2019-06-20 22:00:00+00:00  Paris      FR  FR04014       no2   26.5  µg/m³
2019-06-20 21:00:00+00:00  Paris      FR  FR04014       no2   24.9  µg/m³
2019-06-20 20:00:00+00:00  Paris      FR  FR04014       no2   21.4  µg/m³
In [13]: no2.pivot(columns="location", values="value").plot()
Out[13]: <AxesSubplot:xlabel='date.utc'>

I want the mean value for each combination of location and parameter, i.e. a pivot table with an aggregation:

In [14]: air_quality.pivot_table(
   ....:     values="value", index="location", columns="parameter", aggfunc="mean"
   ....: )
   ....: 
Out[14]: 
parameter                 no2       pm25
location                                
BETR801             26.950920  23.169492
FR04014             29.374284        NaN
London Westminster  29.740050  13.443568
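
pivot_table() can also append overall aggregates via the margins argument; a small sketch on the same long-format table:

# adds an "All" row/column containing the mean over all locations / parameters
air_quality.pivot_table(
    values="value", index="location", columns="parameter",
    aggfunc="mean", margins=True,
)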

How to combine data from multiple tables?

In [1]: import pandas as pd
In [2]: air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
   ...:                               parse_dates=True)
   ...: 

In [3]: air_quality_no2 = air_quality_no2[["date.utc", "location",
   ...:                                    "parameter", "value"]]
   ...: 

# the NO2 dataset
In [4]: air_quality_no2.head()
Out[4]: 
                    date.utc location parameter  value
0  2019-06-21 00:00:00+00:00  FR04014       no2   20.0
1  2019-06-20 23:00:00+00:00  FR04014       no2   21.8
2  2019-06-20 22:00:00+00:00  FR04014       no2   26.5
3  2019-06-20 21:00:00+00:00  FR04014       no2   24.9
4  2019-06-20 20:00:00+00:00  FR04014       no2   21.4

# the PM25 dataset
In [5]: air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
   ...:                                parse_dates=True)
   ...: 

In [6]: air_quality_pm25 = air_quality_pm25[["date.utc", "location",
   ...:                                      "parameter", "value"]]
   ...: 

In [7]: air_quality_pm25.head()
Out[7]: 
                    date.utc location parameter  value
0  2019-06-18 06:00:00+00:00  BETR801      pm25   18.0
1  2019-06-17 08:00:00+00:00  BETR801      pm25    6.5
2  2019-06-17 07:00:00+00:00  BETR801      pm25   18.5
3  2019-06-17 06:00:00+00:00  BETR801      pm25   16.0
4  2019-06-17 05:00:00+00:00  BETR801      pm25    7.5

How do we combine these two tables?
Since both tables have the same structure (columns), concat() can stack them vertically (along axis=0):

In [8]: air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

In [9]: air_quality.head()
Out[9]: 
                    date.utc location parameter  value
0  2019-06-18 06:00:00+00:00  BETR801      pm25   18.0
1  2019-06-17 08:00:00+00:00  BETR801      pm25    6.5
2  2019-06-17 07:00:00+00:00  BETR801      pm25   18.5
3  2019-06-17 06:00:00+00:00  BETR801      pm25   16.0
4  2019-06-17 05:00:00+00:00  BETR801      pm25    7.5

Comparing the shapes shows that the concatenated table contains the rows of both inputs:

In [10]: print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)
Shape of the ``air_quality_pm25`` table:  (1110, 4)

In [11]: print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)
Shape of the ``air_quality_no2`` table:  (2068, 4)

In [12]: print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)
Shape of the resulting ``air_quality`` table:  (3178, 4)

Sorting by the values of a column

In [13]: air_quality = air_quality.sort_values("date.utc")

In [14]: air_quality.head()
Out[14]: 
                       date.utc            location parameter  value
2067  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0
1003  2019-05-07 01:00:00+00:00             FR04014       no2   25.0
100   2019-05-07 01:00:00+00:00             BETR801      pm25   12.5
1098  2019-05-07 01:00:00+00:00             BETR801       no2   50.5
1109  2019-05-07 01:00:00+00:00  London Westminster      pm25    8.0

The keys argument adds an extra (hierarchical) index level that records which table each row came from (the origin of the sample).
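
Judging from the output below, air_quality_ is presumably created by passing keys to concat(), along these lines:

air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])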

In [16]: air_quality_.head()
Out[16]: 
                         date.utc location parameter  value
PM25 0  2019-06-18 06:00:00+00:00  BETR801      pm25   18.0
     1  2019-06-17 08:00:00+00:00  BETR801      pm25    6.5
     2  2019-06-17 07:00:00+00:00  BETR801      pm25   18.5
     3  2019-06-17 06:00:00+00:00  BETR801      pm25   16.0
     4  2019-06-17 05:00:00+00:00  BETR801      pm25    7.5

If two tables share a common column (a primary/foreign key relationship), they can be joined with merge():
https://blog.csdn.net/brucewong0516/article/details/82707492

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)

left: the left DataFrame to join.
right: the right DataFrame to join.
on: column or index level names to join on; they must be present in both DataFrames. If not given and left_index/right_index are False, the intersection of the two tables' columns is used as the join key.
how: one of 'left', 'right', 'outer', 'inner'; the default is 'inner'. 'inner' keeps only keys present in both tables, 'outer' keeps the union. For example, with left keys ['A', 'B', 'C'] and right keys ['A', 'C', 'D']: an inner join matches each 'A' on the left with each 'A' on the right and drops 'B' (no match on the right); an outer join keeps all keys and fills the unmatched side with missing values (NaN).
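
A minimal sketch of merge() on two made-up DataFrames (the names and values are hypothetical, just to show 'inner' vs 'outer'):

import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["A", "C", "D"], "rval": [10, 30, 40]})

# inner join: only keys present in both tables survive (A and C)
inner = pd.merge(left, right, how="inner", on="key")

# outer join: all keys are kept, missing values become NaN (B and D)
outer = pd.merge(left, right, how="outer", on="key")

print(inner)
print(outer)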

How to handle time series data with ease?

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt
In [3]: air_quality = pd.read_csv("data/air_quality_no2_long.csv")

In [4]: air_quality = air_quality.rename(columns={"date.utc": "datetime"})

In [5]: air_quality.head()
Out[5]: 
    city country                   datetime location parameter  value   unit
0  Paris      FR  2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³
1  Paris      FR  2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³
2  Paris      FR  2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³
3  Paris      FR  2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³
4  Paris      FR  2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³
In [6]: air_quality.city.unique()
Out[6]: array(['Paris', 'Antwerpen', 'London'], dtype=object)

Converting the datetime strings to pandas datetime objects

In [7]: air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

In [8]: air_quality["datetime"]
Out[8]: 
0      2019-06-21 00:00:00+00:00
1      2019-06-20 23:00:00+00:00
2      2019-06-20 22:00:00+00:00
3      2019-06-20 21:00:00+00:00
4      2019-06-20 20:00:00+00:00
                  ...           
2063   2019-05-07 06:00:00+00:00
2064   2019-05-07 04:00:00+00:00
2065   2019-05-07 03:00:00+00:00
2066   2019-05-07 02:00:00+00:00
2067   2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]

Q: What are the start and end timestamps of the dataset?

In [9]: air_quality["datetime"].min(), air_quality["datetime"].max()
Out[9]: 
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
 Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))

Q: How long is the covered time span?

In [10]: air_quality["datetime"].max() - air_quality["datetime"].min()
Out[10]: Timedelta('44 days 23:00:00')

Q: How do we add a column with the month number of each timestamp?

In [11]: air_quality["month"] = air_quality["datetime"].dt.month

In [12]: air_quality.head()
Out[12]: 
    city country                  datetime location parameter  value   unit  month
0  Paris      FR 2019-06-21 00:00:00+00:00  FR04014       no2   20.0  µg/m³      6
1  Paris      FR 2019-06-20 23:00:00+00:00  FR04014       no2   21.8  µg/m³      6
2  Paris      FR 2019-06-20 22:00:00+00:00  FR04014       no2   26.5  µg/m³      6
3  Paris      FR 2019-06-20 21:00:00+00:00  FR04014       no2   24.9  µg/m³      6
4  Paris      FR 2019-06-20 20:00:00+00:00  FR04014       no2   21.4  µg/m³      6

Besides month, the dt accessor also provides year, weekofyear, quarter and other datetime properties.
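
A couple of these, as a small sketch on the same column:

air_quality["datetime"].dt.year      # calendar year of each timestamp
air_quality["datetime"].dt.weekday   # day of the week, Monday=0
air_quality["datetime"].dt.hour      # hour of the day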

Q: What is the average NO2 concentration for each day of the week at each measurement location?

In [13]: air_quality.groupby(
   ....:     [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
   ....: 
Out[13]: 
datetime  location          
0         BETR801               27.875000
          FR04014               24.856250
          London Westminster    23.969697
1         BETR801               22.214286
          FR04014               30.999359
                                  ...    
5         FR04014               25.266154
          London Westminster    24.977612
6         BETR801               21.896552
          FR04014               23.274306
          London Westminster    24.859155
Name: value, Length: 21, dtype: float64

Q: Plot the typical NO2 pattern over the hours of the day, i.e. the mean value per hour across all days.

In [14]: fig, axs = plt.subplots(figsize=(12, 4))

In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
   ....:     kind='bar', rot=0, ax=axs
   ....: )
   ....: 
Out[15]: <AxesSubplot:xlabel='datetime'>

In [16]: plt.xlabel("Hour of the day");  # custom x label using matplotlib

In [17]: plt.ylabel("$NO_2 (µg/m^3)$");

Using pivot() to put each station's value at each timestamp into its own column

In [18]: no_2 = air_quality.pivot(index="datetime", columns="location", values="value")

In [19]: no_2.head()
Out[19]: 
location                   BETR801  FR04014  London Westminster
datetime                                                       
2019-05-07 01:00:00+00:00     50.5     25.0                23.0
2019-05-07 02:00:00+00:00     45.0     27.7                19.0
2019-05-07 03:00:00+00:00      NaN     50.4                19.0
2019-05-07 04:00:00+00:00      NaN     61.9                16.0
2019-05-07 05:00:00+00:00      NaN     72.4                 NaN

With a DatetimeIndex we can read the year or weekday of each timestamp directly from the index

In [20]: no_2.index.year, no_2.index.weekday
Out[20]: 
(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
             ...
             2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
            dtype='int64', name='datetime', length=1033),
 Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             ...
             3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
            dtype='int64', name='datetime', length=1033))

We can plot the values for a specific date range by slicing the index with date strings

In [21]: no_2["2019-05-20":"2019-05-21"].plot();

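A related convenience is resample(), which aggregates a time series to another frequency; a small sketch on the same no_2 table:

# aggregate to monthly frequency, keeping the maximum value per month
# (recent pandas versions prefer the alias "ME" over "M")
monthly_max = no_2.resample("M").max()
print(monthly_max)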

How to manipulate textual data?

In [1]: import pandas as pd
In [2]: titanic = pd.read_csv("data/titanic.csv")

In [3]: titanic.head()
Out[3]: 
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

Converting all strings in a column to lowercase

In [4]: titanic["Name"].str.lower()
Out[4]: 
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

Splitting each passenger's Name into a list of its parts

In [5]: titanic["Name"].str.split(",")
Out[5]: 
0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

Taking the first element of the split (the surname) as a new column

In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)

In [7]: titanic["Surname"]
Out[7]: 
0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Surname, Length: 891, dtype: object

Checking whether a string contains a given substring

In [8]: titanic["Name"].str.contains("Countess")
Out[8]: 
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Name, Length: 891, dtype: bool
In [9]: titanic[titanic["Name"].str.contains("Countess")]
Out[9]: 
     PassengerId  Survived  Pclass                                               Name  ...  Fare  Cabin  Embarked  Surname
759          760         1       1  Rothes, the Countess. of (Lucy Noel Martha Dye...  ...  86.5    B77         S   Rothes

[1 rows x 13 columns]

Computing string lengths and finding the row with the longest name

# length of each Name string
In [10]: titanic["Name"].str.len()
Out[10]: 
0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64

# row label of the longest name
In [11]: titanic["Name"].str.len().idxmax()
Out[11]: 307

# look up the corresponding Name value
In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Out[12]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'

Replacing values with abbreviations

In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

In [14]: titanic["Sex_short"]
Out[14]: 
0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex_short, Length: 891, dtype: object
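
Note that Series.replace() maps whole values; for replacing substrings inside each string there is Series.str.replace(); a small sketch:

# replace a substring within each name (plain text, no regex)
titanic["Name"].str.replace("Miss.", "Ms.", regex=False)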

https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
