What kind of data does pandas handle?
How do I read and write tabular data?
How do I select a subset of a DataFrame?
如何选择df的子集
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
选取某一列或者某几列
# 选取Age这一列
In [4]: ages = titanic["Age"]
In [5]: ages.head()
Out[5]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
In [7]: titanic["Age"].shape
Out[7]: (891,)
# 选取两列或者更多列
In [8]: age_sex = titanic[["Age", "Sex"]]
In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
In [11]: titanic[["Age", "Sex"]].shape
Out[11]: (891, 2)
过滤器Filter,选取符合条件的行(样本)
In [12]: above_35 = titanic[titanic["Age"] > 35]
In [13]: above_35.head()
Out[13]:
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... ... PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J ... 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth ... 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan ... 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) ... 248706 16.0000 NaN S
[5 rows x 12 columns]
判断行的Age列大于35
In [14]: titanic["Age"] > 35
Out[14]:
0 False
1 True
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Age, Length: 891, dtype: bool
选取某一列的值在自定义集合的
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]
In [17]: class_23.head()
Out[17]:
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard ... 349909 21.0750 NaN S
[5 rows x 12 columns]
上面等价于或语句
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
In [19]: class_23.head()
Out[19]:
PassengerId Survived Pclass Name ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris ... A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina ... STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry ... 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James ... 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard ... 349909 21.0750 NaN S
[5 rows x 12 columns]
筛选某些列的值不为空
In [20]: age_no_na = titanic[titanic["Age"].notna()]
In [21]: age_no_na.head()
Out[21]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
筛选某些列符合条件的另一个列的值:例如筛选Age大于35的名字Name
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
使用切片形式
In [25]: titanic.iloc[9:25, 2:5]
Out[25]:
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
.. ... ... ...
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female
[16 rows x 3 columns]
切片语句: titanic.iloc[9:25, 2:5]
条件筛选语句:titanic.loc[titanic["Age"] > 35, "Name"]
使用切片赋值
In [26]: titanic.iloc[0:3, 3] = "anonymous"
How to create plots in pandas?
直接使用plot对pandas画图
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
In [3]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
In [4]: air_quality.head()
Out[4]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN
In [5]: air_quality.plot()
Out[5]: <AxesSubplot:xlabel='datetime'>
直接对有数据的画图
对某些列画图
In [6]: air_quality["station_paris"].plot()
Out[6]: <AxesSubplot:xlabel='datetime'>
可视化某几列的相关性
In [7]: air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
Out[7]: <AxesSubplot:xlabel='station_london', ylabel='station_paris'>
显示plot的
In [8]: [
...: method_name
...: for method_name in dir(air_quality.plot)
...: if not method_name.startswith("_")
...: ]
...:
Out[8]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
选用其中一个方法DataFrame.plot.box()
使用盒子画图
对每一列画图展示
In [10]: axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
保存图片
In [11]: fig, axs = plt.subplots(figsize=(12, 4))# Create an empty matplotlib Figure and Axes
In [12]: air_quality.plot.area(ax=axs)# Use pandas to put the area plot on the prepared Figure/Axes
Out[12]: <AxesSubplot:xlabel='datetime'>
In [13]: axs.set_ylabel("NO$_2$ concentration")# Do any matplotlib customization you like
Out[13]: Text(0, 0.5, 'NO$_2$ concentration')
In [14]: fig.savefig("no2_concentrations.png")# Save the Figure/Axes using the existing matplotlib method.
How to create new columns derived from existing columns?
In [1]: import pandas as pd
In [2]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
In [3]: air_quality.head()
Out[3]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN
创建一个新列new columns
添加某几列
In [4]: air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
In [5]: air_quality.head()
Out[5]:
station_antwerp station_paris station_london london_mg_per_cubic
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286
2019-05-07 03:00:00 50.5 25.0 19.0 35.758
2019-05-07 04:00:00 45.0 27.7 19.0 35.758
2019-05-07 05:00:00 NaN 50.4 16.0 30.112
2019-05-07 06:00:00 NaN 61.9 NaN NaN
添加某几列是原有列之间的加减乘除运算结果
In [6]: air_quality["ratio_paris_antwerp"] = (
...: air_quality["station_paris"] / air_quality["station_antwerp"]
...: )
...:
In [7]: air_quality.head()
Out[7]:
station_antwerp station_paris station_london london_mg_per_cubic ratio_paris_antwerp
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN
对原有的列名 重命名
In [8]: air_quality_renamed = air_quality.rename(
...: columns={
...: "station_antwerp": "BETR801",
...: "station_paris": "FR04014",
...: "station_london": "London Westminster",
...: }
...: )
...:
In [9]: air_quality_renamed.head()
Out[9]:
BETR801 FR04014 London Westminster london_mg_per_cubic ratio_paris_antwerp
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN
将列名大写小写化
In [10]: air_quality_renamed = air_quality_renamed.rename(columns=str.lower)
In [11]: air_quality_renamed.head()
Out[11]:
betr801 fr04014 london westminster london_mg_per_cubic ratio_paris_antwerp
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN
How to calculate summary statistics?
In [1]: import pandas as pd
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
对列求均值
In [4]: titanic["Age"].mean()
Out[4]: 29.69911764705882
对列求中位数
In [5]: titanic[["Age", "Fare"]].median()
Out[5]:
Age 28.0000
Fare 14.4542
dtype: float64
对数据摘要描述
In [6]: titanic[["Age", "Fare"]].describe()
Out[6]:
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200
使用agg()
自定义展示统计数据
In [7]: titanic.agg(
...: {
...: "Age": ["min", "max", "median", "skew"],
...: "Fare": ["min", "max", "median", "mean"],
...: }
...: )
...:
Out[7]:
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208
对列分组排序再求均值
In [8]: titanic[["Sex", "Age"]].groupby("Sex").mean()
Out[8]:
Age
Sex
female 27.915709
male 30.726645
In [9]: titanic.groupby("Sex").mean()
Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893
In [10]: titanic.groupby("Sex")["Age"].mean()
Out[10]:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
二级分组排序
In [11]: titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
Out[11]:
Sex Pclass
female 1 106.125798
2 21.970121
3 16.118810
male 1 67.226127
2 19.741782
3 12.661633
Name: Fare, dtype: float64
统计列当中各个值得个数
In [12]: titanic["Pclass"].value_counts()
Out[12]:
3 491
1 216
2 184
Name: Pclass, dtype: int64
或者按统计值排序
In [13]: titanic.groupby("Pclass")["Pclass"].count()
Out[13]:
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
How to reshape the layout of tables?
In [1]: import pandas as pd
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
In [4]: air_quality = pd.read_csv(
...: "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
...: )
...:
In [5]: air_quality.head()
Out[5]:
city country location parameter value unit
date.utc
2019-06-18 06:00:00+00:00 Antwerpen BE BETR801 pm25 18.0 µg/m³
2019-06-17 08:00:00+00:00 Antwerpen BE BETR801 pm25 6.5 µg/m³
2019-06-17 07:00:00+00:00 Antwerpen BE BETR801 pm25 18.5 µg/m³
2019-06-17 06:00:00+00:00 Antwerpen BE BETR801 pm25 16.0 µg/m³
2019-06-17 05:00:00+00:00 Antwerpen BE BETR801 pm25 7.5 µg/m³
按某列排序(升序或降序)
In [6]: titanic.sort_values(by="Age").head()
Out[6]:
PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked
803 804 1 3 Thomas, Master. Assad Alexander male ... 1 2625 8.5167 NaN C
755 756 1 2 Hamalainen, Master. Viljo male ... 1 250649 14.5000 NaN S
644 645 1 3 Baclini, Miss. Eugenie female ... 1 2666 19.2583 NaN C
469 470 1 3 Baclini, Miss. Helene Barbara female ... 1 2666 19.2583 NaN C
78 79 1 2 Caldwell, Master. Alden Gates male ... 2 248738 29.0000 NaN S
[5 rows x 12 columns]
按['Pclass', 'Age']
降序
In [7]: titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[7]:
PassengerId Survived Pclass Name Sex ... Parch Ticket Fare Cabin Embarked
851 852 0 3 Svensson, Mr. Johan male ... 0 347060 7.7750 NaN S
116 117 0 3 Connors, Mr. Patrick male ... 0 370369 7.7500 NaN Q
280 281 0 3 Duane, Mr. Frank male ... 0 336439 7.7500 NaN Q
483 484 1 3 Turkula, Mrs. (Hedwig) female ... 0 4134 9.5875 NaN S
326 327 0 3 Nysveen, Mr. Johan Hansen male ... 0 345364 6.2375 NaN S
[5 rows x 12 columns]
只筛选某列符合条件的数据
# filter for no2 data only
In [8]: no2 = air_quality[air_quality["parameter"] == "no2"]
让我们使用一小部分空气质量数据集。 我们专注于𝑁𝑂2数据,并且仅使用每个位置(即每组的头)的前两个测量值。 数据子集将称为no2_subset
# filter for no2 data only
In [8]: no2 = air_quality[air_quality["parameter"] == "no2"]
# use 2 measurements (head) for each location (groupby)
# 选取每一类的头两行数据
In [9]: no2_subset = no2.sort_index().groupby(["location"]).head(2)
In [10]: no2_subset
Out[10]:
city country location parameter value unit
date.utc
2019-04-09 01:00:00+00:00 Antwerpen BE BETR801 no2 22.5 µg/m³
2019-04-09 01:00:00+00:00 Paris FR FR04014 no2 24.4 µg/m³
2019-04-09 02:00:00+00:00 London GB London Westminster no2 67.0 µg/m³
2019-04-09 02:00:00+00:00 Antwerpen BE BETR801 no2 53.5 µg/m³
2019-04-09 02:00:00+00:00 Paris FR FR04014 no2 27.4 µg/m³
2019-04-09 03:00:00+00:00 London GB London Westminster no2 67.0 µg/m³
I want the values for the three stations as separate columns next to each other
我希望三个站的值彼此相邻地作为单独的列
In [11]: no2_subset.pivot(columns="location", values="value")
Out[11]:
location BETR801 FR04014 London Westminster
date.utc
2019-04-09 01:00:00+00:00 22.5 24.4 NaN
2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2019-04-09 03:00:00+00:00 NaN NaN 67.0
In [12]: no2.head()
Out[12]:
city country location parameter value unit
date.utc
2019-06-21 00:00:00+00:00 Paris FR FR04014 no2 20.0 µg/m³
2019-06-20 23:00:00+00:00 Paris FR FR04014 no2 21.8 µg/m³
2019-06-20 22:00:00+00:00 Paris FR FR04014 no2 26.5 µg/m³
2019-06-20 21:00:00+00:00 Paris FR FR04014 no2 24.9 µg/m³
2019-06-20 20:00:00+00:00 Paris FR FR04014 no2 21.4 µg/m³
In [13]: no2.pivot(columns="location", values="value").plot()
Out[13]: <AxesSubplot:xlabel='date.utc'>
我要统计某列不同类别在不同列的值得均值
In [14]: air_quality.pivot_table(
....: values="value", index="location", columns="parameter", aggfunc="mean"
....: )
....:
Out[14]:
parameter no2 pm25
location
BETR801 26.950920 23.169492
FR04014 29.374284 NaN
London Westminster 29.740050 13.443568
How to combine data from multiple tables?如何联合多张表的数据
In [1]: import pandas as pd
In [2]: air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv",
...: parse_dates=True)
...:
In [3]: air_quality_no2 = air_quality_no2[["date.utc", "location",
...: "parameter", "value"]]
...:
# NO2的数据集
In [4]: air_quality_no2.head()
Out[4]:
date.utc location parameter value
0 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
1 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
2 2019-06-20 22:00:00+00:00 FR04014 no2 26.5
3 2019-06-20 21:00:00+00:00 FR04014 no2 24.9
4 2019-06-20 20:00:00+00:00 FR04014 no2 21.4
# PM25的数据集
In [5]: air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv",
...: parse_dates=True)
...:
In [6]: air_quality_pm25 = air_quality_pm25[["date.utc", "location",
...: "parameter", "value"]]
...:
In [7]: air_quality_pm25.head()
Out[7]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
如何联合两张不同的表呢?
假设这两张表有相似的结构
In [8]: air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)
In [9]: air_quality.head()
Out[9]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
看看合并之后这三张表的关系吧
In [10]: print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)
Shape of the ``air_quality_pm25`` table: (1110, 4)
In [11]: print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)
Shape of the ``air_quality_no2`` table: (2068, 4)
In [12]: print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)
Shape of the resulting ``air_quality`` table: (3178, 4)
按某一列的值排序
In [13]: air_quality = air_quality.sort_values("date.utc")
In [14]: air_quality.head()
Out[14]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0
使用keys
参数来增加一列表明这条数据来自于哪一张表(数据样本的来源)
In [16]: air_quality_.head()
Out[16]:
date.utc location parameter value
PM25 0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
如果两张表有相同的特征(主键和外键)可以使用merge()
https://blog.csdn.net/brucewong0516/article/details/82707492
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
left: 拼接的左侧DataFrame对象
right: 拼接的右侧DataFrame对象
on: 要加入的列或索引级别名称。 必须在左侧和右侧DataFrame对象中找到。 如果未传递且left_index和right_index为False,则DataFrame中的列的交集将被推断为连接键。
how: One of ‘left’, ‘right’, ‘outer’, ‘inner’. 默认inner。inner是取交集,outer取并集。比如left:[‘A’,‘B’,‘C’];right[’'A,‘C’,‘D’];inner取交集的话,left中出现的A会和right中出现的买一个A进行匹配拼接,如果没有是B,在right中没有匹配到,则会丢失。'outer’取并集,出现的A会进行一一匹配,没有同时出现的会将缺失的部分添加缺失值。
How to handle time series data with ease?
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
In [3]: air_quality = pd.read_csv("data/air_quality_no2_long.csv")
In [4]: air_quality = air_quality.rename(columns={"date.utc": "datetime"})
In [5]: air_quality.head()
Out[5]:
city country datetime location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³
In [6]: air_quality.city.unique()
Out[6]: array(['Paris', 'Antwerpen', 'London'], dtype=object)
将时间字符串转化为时间序列
In [7]: air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
In [8]: air_quality["datetime"]
Out[8]:
0 2019-06-21 00:00:00+00:00
1 2019-06-20 23:00:00+00:00
2 2019-06-20 22:00:00+00:00
3 2019-06-20 21:00:00+00:00
4 2019-06-20 20:00:00+00:00
...
2063 2019-05-07 06:00:00+00:00
2064 2019-05-07 04:00:00+00:00
2065 2019-05-07 03:00:00+00:00
2066 2019-05-07 02:00:00+00:00
2067 2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]
Q:如何查看时间起始点和终止时间
In [9]: air_quality["datetime"].min(), air_quality["datetime"].max()
Out[9]:
(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))
Q:查看这个时间段多久
In [10]: air_quality["datetime"].max() - air_quality["datetime"].min()
Out[10]: Timedelta('44 days 23:00:00')
Q:添加一列为该时间的月份数字
In [11]: air_quality["month"] = air_quality["datetime"].dt.month
In [12]: air_quality.head()
Out[12]:
city country datetime location parameter value unit month
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 6
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 6
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 6
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 6
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 6
除了month
还有其他的year, weekofyear, quarter,
Q:每一个测量位置在一周中的每一天平均NO2浓度是多少
In [13]: air_quality.groupby(
....: [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
....:
Out[13]:
datetime location
0 BETR801 27.875000
FR04014 24.856250
London Westminster 23.969697
1 BETR801 22.214286
FR04014 30.999359
...
5 FR04014 25.266154
London Westminster 24.977612
6 BETR801 21.896552
FR04014 23.274306
London Westminster 24.859155
Name: value, Length: 21, dtype: float64
Q:画出每天每个小时段的NO2浓度均值波动情况
In [14]: fig, axs = plt.subplots(figsize=(12, 4))
In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
....: kind='bar', rot=0, ax=axs
....: )
....:
Out[15]: <AxesSubplot:xlabel='datetime'>
In [16]: plt.xlabel("Hour of the day"); # custom x label using matplotlib
In [17]: plt.ylabel("$NO_2 (µg/m^3)$");
使用pivot()
数据透视查看某个时刻时间每个站点的值
In [18]: no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
In [19]: no_2.head()
Out[19]:
location BETR801 FR04014 London Westminster
datetime
2019-05-07 01:00:00+00:00 50.5 25.0 23.0
2019-05-07 02:00:00+00:00 45.0 27.7 19.0
2019-05-07 03:00:00+00:00 NaN 50.4 19.0
2019-05-07 04:00:00+00:00 NaN 61.9 16.0
2019-05-07 05:00:00+00:00 NaN 72.4 NaN
我们可以查看时间序列对应的年份\周几
In [20]: no_2.index.year, no_2.index.weekday
Out[20]:
(Int64Index([2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,
...
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
dtype='int64', name='datetime', length=1033),
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
3, 3, 3, 3, 3, 3, 3, 3, 3, 4],
dtype='int64', name='datetime', length=1033))
我们可以画出某一天到某一天的数据波动情况
In [21]: no_2["2019-05-20":"2019-05-21"].plot();
How to manipulate textual data?
In [1]: import pandas as pd
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
将所有的字符串小写化
In [4]: titanic["Name"].str.lower()
Out[4]:
0 braund, mr. owen harris
1 cumings, mrs. john bradley (florence briggs th...
2 heikkinen, miss. laina
3 futrelle, mrs. jacques heath (lily may peel)
4 allen, mr. william henry
...
886 montvila, rev. juozas
887 graham, miss. margaret edith
888 johnston, miss. catherine helen "carrie"
889 behr, mr. karl howell
890 dooley, mr. patrick
Name: Name, Length: 891, dtype: object
创建一个新列包含游客的名字列表
In [5]: titanic["Name"].str.split(",")
Out[5]:
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object
选取第一个名字作为子字符串
In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
In [7]: titanic["Surname"]
Out[7]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Surname, Length: 891, dtype: object
判断某个字符串是否包含特定的子字符串
In [8]: titanic["Name"].str.contains("Countess")
Out[8]:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Name: Name, Length: 891, dtype: bool
In [9]: titanic[titanic["Name"].str.contains("Countess")]
Out[9]:
PassengerId Survived Pclass Name ... Fare Cabin Embarked Surname
759 760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... ... 86.5 B77 S Rothes
[1 rows x 13 columns]
统计字符串的长度以及找出对应的下标
# 统计字符串的长度
In [10]: titanic["Name"].str.len()
Out[10]:
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64
# 找出最长字符串的下标
In [11]: titanic["Name"].str.len().idxmax()
Out[11]: 307
# 获取其元素content
In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Out[12]: 'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'
使用简写来代替元素内容
In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
In [14]: titanic["Sex_short"]
Out[14]:
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex_short, Length: 891, dtype: object
https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html