Applications of Pandas in Atmospheric Science

一航亦航

Last modified: 2024-06-02 12:51:29

First published: 2024-05-30 21:29:13

Article link: https://blog.csdn.net/lvyihang200411/article/details/139329469

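Introduction to Pandas

Pandas is built on top of NumPy rather than replacing it: it adds labeled indexes and tabular data structures, which makes it one of the most basic and useful Python libraries for data processing. In my own field, atmospheric science, almost every Python workflow involves reading data and then processing it, so learning Pandas matters a lot to me. Below I share what I have learned about Pandas.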

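pd.Series ------ Series

A Series is essentially a one-dimensional numpy.ndarray with labels attached, so it supports most array operations. (Note that a pd.Series is always one-dimensional.)

pd.Series(data, index, dtype, name) # data is a one-dimensional list or array, index holds the label for each value, dtype is the data type, and name names the object (the index can be a time index)

For example: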

import pandas as pd
import datetime as dt
# With an ordinary label index
data = pd.Series([0.1, 9.8, 7.6, 5.5], index=["a", "b", "c", "d"], name="xiaolv")
# With a time index
# The timestamps are meant to be in UTC (Coordinated Universal Time); to convert Beijing time (CST, China Standard Time, UTC+8) to UTC, subtract 8 hours as below
index_time = pd.to_datetime(["2020-2-19", "2020-2-20", "2020-2-21", "2020-2-22"])
index_time -= dt.timedelta(hours=8)
data_time = pd.Series([0.1, 9.8, 7.6, 5.5],
                      index=index_time, name="xiaolv")
print(data)
print(data_time)



output:
a    0.1
b    9.8
c    7.6
d    5.5
Name: xiaolv, dtype: float64
2020-02-18 16:00:00    0.1
2020-02-19 16:00:00    9.8
2020-02-20 16:00:00    7.6
2020-02-21 16:00:00    5.5
Name: xiaolv, dtype: float64
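
As an alternative to subtracting 8 hours by hand, the same conversion can be sketched with pandas' time zone support. This is only a sketch: it assumes the original timestamps are Beijing local time, and the zone name "Asia/Shanghai" is my choice, not something from the example above. Note that the resulting index is timezone-aware, unlike the naive index produced by the subtraction.

```python
import pandas as pd

# Naive timestamps that we assume were recorded in Beijing time (UTC+8).
local_index = pd.to_datetime(["2020-2-19", "2020-2-20", "2020-2-21", "2020-2-22"])

# Attach the time zone, then convert to UTC; the index stays timezone-aware.
utc_index = local_index.tz_localize("Asia/Shanghai").tz_convert("UTC")

data_utc = pd.Series([0.1, 9.8, 7.6, 5.5], index=utc_index, name="xiaolv")
print(data_utc)
```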

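Arithmetic on pd.Series objects

Arithmetic on pd.Series objects works almost exactly like arithmetic on numpy.ndarray objects, including broadcasting with scalars. The difference is that when two pd.Series objects are combined, values are matched by index label rather than by position (labels present in both objects are computed normally; for a label present in only one of them, pandas fills in NaN, and the result index is the union of the two indexes).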

import pandas as pd
a = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], name="xiaolv")
b = pd.Series([2, 3, 4, 6], index=["a", "c", "e", "d"], name="patience")
print(a+b)


output:
a     3.0
b     NaN
c     6.0
d    10.0
e     NaN
dtype: float64
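
If the NaN values from unmatched labels are not wanted, one option is the .add() method with fill_value. The sketch below reuses the two Series from the example above and treats a missing label as 0; whether 0 is a sensible default depends on the data, so take it as an assumption.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], name="xiaolv")
b = pd.Series([2, 3, 4, 6], index=["a", "c", "e", "d"], name="patience")

# Labels missing from one side are treated as 0 before adding.
total = a.add(b, fill_value=0)
print(total)
```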

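Commonly used attributes of pd.Series

.dtype # the data type of the values in the pd.Series

.shape # a tuple (n,) giving the number of elements in the pd.Series

.at["a"] # access a single element by label

.iat[0] # access a single element by integer position

.loc[[]] # access multiple elements by labels or by a boolean Series (similar to NumPy)

.iloc[[]] # access multiple elements by integer positions

.values # the underlying np.ndarray holding the data

Note: pd.Series objects also support slicing.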

import pandas as pd
a = pd.Series([9.1, 9.5, 10.0, 11.2], index=["a", "b", "c", "d"], name="xiaolv")
print(a.at["c"])
print(a.iat[0])
print(a.loc[a > 9.6])
print(a.loc[["a", "b"]])
print(a.iloc[[0, 2, 3]])
print(a.values)



output:
10.0
9.1
c    10.0
d    11.2
Name: xiaolv, dtype: float64
a    9.1
b    9.5
Name: xiaolv, dtype: float64
a     9.1
c    10.0
d    11.2
Name: xiaolv, dtype: float64
[ 9.1  9.5 10.  11.2]
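
The slicing mentioned in the note is not shown in the example above, so here is a small sketch on the same Series. One thing to watch: a label slice with .loc includes the end label, while a position slice with .iloc excludes the end position.

```python
import pandas as pd

a = pd.Series([9.1, 9.5, 10.0, 11.2], index=["a", "b", "c", "d"], name="xiaolv")

# Label-based slice: the end label "c" is included.
by_label = a.loc["a":"c"]
# Position-based slice: the end position 2 is excluded, as in NumPy.
by_position = a.iloc[0:2]

print(by_label)
print(by_position)
```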

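Commonly used methods of pd.Series

.dropna() # drop NaN values from the pd.Series

.groupby() # group the data by index level or by a key, returning a GroupBy object with methods such as min, max and mean; for example b=a.groupby(level=0).mean(), or c=a.groupby(a > 9.6).mean()

.sum()/.mean()/.max()/.min() # sum, mean, maximum, minimum (all skip NaN by default)

.std() # standard deviation (the sample standard deviation, ddof=1, by default)

.abs() # absolute value of every element

.idxmax()/.idxmin()/.argmax()/.argmin() # index label (idx) or integer position (arg) of the maximum/minimum

.to_list() # convert to a list

.astype() # convert all elements to another type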

import pandas as pd
import numpy as np
a = pd.Series([9.1, 9.5, 10.0, 11.2], index=["a", "a", "c", "d"], name="xiaolv")
b = a.groupby(level=0).mean()
c = a.groupby(a > 9.6).mean()
d = a.std()
e = a.idxmax()
f = a.argmax()
g = a.astype(np.int_)
h = a.to_list()
print(b)
print(c)
print(d)
print(e)
print(f)
print(g)
print(h)


output:
a     9.3
c    10.0
d    11.2
Name: xiaolv, dtype: float64
xiaolv
False     9.3
True     10.6
Name: xiaolv, dtype: float64
0.9110433579144297
d
3
a     9
a     9
c    10
d    11
Name: xiaolv, dtype: int32
[9.1, 9.5, 10.0, 11.2]
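
The example above contains no missing values, so the NaN behavior of .dropna() and the statistics is not visible there. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

print(s.sum())               # NaN is skipped
print(s.mean())              # mean of the two valid values
print(s.sum(skipna=False))   # NaN propagates when skipping is disabled
print(s.dropna())            # only labels "a" and "c" remain
```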

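pd.DataFrame ------ DataFrame

Unlike a NumPy array, each column of a pd.DataFrame can have a different data type (which is why CSV files are usually read with pandas rather than NumPy).

pd.DataFrame(data, index, columns, dtype) # columns gives the column labels; in practice the time index usually goes in index, not in columns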

import pandas as pd
a = pd.DataFrame([[21.7, 983, 0.64], 
                  [19.2, 991, 0.75],
                  [13.4, 973, 0.83]],
                  index=pd.to_datetime(["2020-02-19", "2020-02-20", "2020-02-22"]),
                  columns=["t", "p", "rh"])
print(a)


output:
               t    p    rh
2020-02-19  21.7  983  0.64
2020-02-20  19.2  991  0.75
2020-02-22  13.4  973  0.83
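
To see the per-column data types in practice, the frame from the example above can be inspected with .dtypes. This is just a quick check rather than anything specific to weather data.

```python
import pandas as pd

a = pd.DataFrame([[21.7, 983, 0.64],
                  [19.2, 991, 0.75],
                  [13.4, 973, 0.83]],
                 index=pd.to_datetime(["2020-02-19", "2020-02-20", "2020-02-22"]),
                 columns=["t", "p", "rh"])

# One dtype per column: t and rh are floats, p stays an integer column.
print(a.dtypes)
```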

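Basic pd.DataFrame operations

I. Arithmetic on pd.DataFrame

1. With a scalar: the operation is applied to every element (this is rarely useful, because the columns usually have different units)

2. With a pd.Series: the index of the pd.Series must correspond to the columns of the pd.DataFrame

3. Between two pd.DataFrame objects: values are aligned on both index and columns, so the two should share the same index and columns (labels that do not match produce NaN)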
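
The Series and DataFrame cases can be sketched as follows. The offsets are hypothetical: I assume t is in degrees Celsius and shift it to Kelvin, which the original data does not state.

```python
import pandas as pd

a = pd.DataFrame([[21.7, 983, 0.64],
                  [19.2, 991, 0.75]],
                 index=["d1", "d2"],
                 columns=["t", "p", "rh"])

# Hypothetical per-column offsets; the Series index matches the column labels.
offset = pd.Series([273.15, 0.0, 0.0], index=["t", "p", "rh"])
shifted = a + offset   # t is shifted by 273.15, p and rh are unchanged

# A second frame that shares only row "d2" and column "t" with a.
b = pd.DataFrame([[1.0]], index=["d2"], columns=["t"])
aligned = a + b        # every cell without a partner in b becomes NaN

print(shifted)
print(aligned)
```

Only the cell at row "d2", column "t" has a partner in both frames, so it is the only non-NaN value in aligned.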

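II. Selecting rows that satisfy a condition

1. Selecting by data values

This works the same way as selecting rows in NumPy: it uses boolean (logical) indexing, where a boolean array picks the rows. With several conditions, combine them with & and | as with NumPy arrays, and wrap each condition in parentheses. (Here a["t"] returns a pd.Series, and the comparison yields a boolean pd.Series.)

For example: print(a[(a["t"] < 20) & (a["p"] > 10)])

2. Selecting by a time-index condition

Use a form such as b = a[a.index.year == ?] (or a.index.month / a.index.day) to select rows; in essence this is still NumPy-style boolean-array selection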
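
Both kinds of selection can be sketched on the time-indexed frame used earlier. The thresholds are arbitrary values picked for illustration.

```python
import pandas as pd

a = pd.DataFrame([[21.7, 983, 0.64],
                  [19.2, 991, 0.75],
                  [13.4, 973, 0.83]],
                 index=pd.to_datetime(["2020-02-19", "2020-02-20", "2020-02-22"]),
                 columns=["t", "p", "rh"])

# Condition on data values: each condition sits in its own parentheses.
cool = a[(a["t"] < 20) & (a["p"] > 980)]

# Condition on the time index: keep rows whose day of month is 20 or later.
late = a[a.index.day >= 20]

print(cool)
print(late)
```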

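III. Commonly used attributes of pd.DataFrame

.dtypes # the data type of each column

.at["s1", "t"] # select a single element by row/column label

.iat[0,0] # select a single element by row/column position

.loc[] # access multiple rows or columns by label; the result may be a Series or a DataFrame

.iloc[] # access multiple rows or columns by position; usage is much the same

.values # as with pd.Series, returns an np.ndarray

Note: pd.DataFrame also supports slicing.

Examples: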
import numpy as np
import pandas as pd
data = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])
datahh = pd.Series([1, 2, 3, 4, 5], index=["2", "3", "4", "5", "6"])
hh = pd.DataFrame(data)
print(hh.loc[0:1, 1:3])
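# The result is a DataFrame only when both the row and column selectors are lists such as [[0, 1], [1, 2, 3]] (slices count as lists here); if one selector is a single label the result is a Series, and if both are single labels it is a scalar.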
print(hh.loc[[0, 1], [1, 2, 3]])
print(hh.loc[:, [1, 2]])
print(hh.loc[:, 1])


output:
   1  2  3
0  2  3  4
1  2  3  4
   1  2  3
0  2  3  4
1  2  3  4
   1  2
0  2  3
1  2  3
2  3  4
0    2
1    2
2    3
Name: 1, dtype: int32

import pandas as pd
a = pd.DataFrame([[21.7, 983, 0.64], 
                  [19.2, 991, 0.75],
                  [13.4, 973, 0.83]],
                  index=["k", "w", "s"],
                  columns=["t", "p", "rh"])
print(a.loc[["k"]])   # DataFrame
print(a.loc["k"])     # Series
print(a.loc[:, ["t"]]) # DataFrame
print(a.loc[:, "t"])   # Series
print(a.iloc[:, [0, 1]]) # DataFrame
print(a.iloc[0])  # Series


output:
      t    p    rh
k  21.7  983  0.64

t      21.70
p     983.00
rh      0.64
Name: k, dtype: float64

      t
k  21.7
w  19.2
s  13.4

k    21.7
w    19.2
s    13.4
Name: t, dtype: float64
      t    p
k  21.7  983
w  19.2  991
s  13.4  973

t      21.70
p     983.00
rh      0.64
Name: k, dtype: float64

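IV. Commonly used methods of pd.DataFrame

1. .dropna()

# Drops rows containing NaN; add axis=1 to drop columns containing NaN instead, and add how="all" to drop only rows or columns that are entirely NaN

2. .groupby()

# Groups the data. Pass level (for example level=0) to group by the index, or by (for example by="kind" to use the kind column) to group by a column name. The result is a DataFrameGroupBy object with sum(), max(), min() and mean() methods; after aggregation the original index is gone and the group keys (the values of kind) become the new index

Note: by can also be a custom function. Each element of the index is passed to the function in turn, and rows whose function output is the same end up in the same group.

3. .max()/min()/idxmax()/idxmin()

# Similar to pd.Series, except that axis=0 gives the result for each column and axis=1 gives the result for each row

4. .to_csv(fname)

# fname is the path of the output file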
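
Grouping by a column and grouping by a function could look like the sketch below. The data and the column name "kind" are made up for illustration.

```python
import pandas as pd

# Hypothetical observations; "kind" is an illustrative column name.
obs = pd.DataFrame({"kind": ["rain", "fog", "rain", "fog"],
                    "p": [991, 973, 995, 977]},
                   index=pd.to_datetime(["2020-02-19", "2020-02-20",
                                         "2020-03-01", "2020-03-02"]))

# Group by a column: the group keys become the index of the result.
mean_by_kind = obs.groupby(by="kind")["p"].mean()

# Group by a function: it receives each index value (a Timestamp here).
mean_by_month = obs.groupby(by=lambda ts: ts.month)["p"].mean()

print(mean_by_kind)
print(mean_by_month)
```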

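V. Reading CSV files into a pd.DataFrame

pd.read_csv(fname, index_col=0) # fname is the file path and index_col is the position of the column to use as the index. If the file has no header row with column names, pass header=None. parse_dates=[] lists the columns to parse as dates, and can be combined with index_col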
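
A self-contained sketch of reading a time-indexed CSV, using an in-memory string in place of a real file. The column names and values are invented for the example; with a real file you would pass its path instead of the StringIO object.

```python
import io
import pandas as pd

# An in-memory stand-in for a CSV file; the contents are made up.
csv_text = "time,t,p\n2020-02-19,21.7,983\n2020-02-20,19.2,991\n"

# Use the "time" column as the index and parse it as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col="time", parse_dates=["time"])
print(df.index)

# Round trip: write the frame back out as CSV text.
out = io.StringIO()
df.to_csv(out)
print(out.getvalue())
```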

VI. Adding columns to a pd.DataFrame, and the merge() and concat() functions

For a pd.DataFrame object data, data["column name"] = pd.Series adds a column, where the right-hand side is a pd.Series object.
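
A small sketch of adding a column from a Series, with made-up values. The point worth noticing is that the Series is aligned to the DataFrame by index label, so the order of the Series does not matter and rows without a matching label get NaN.

```python
import pandas as pd

data = pd.DataFrame({"p": [983, 991, 973]}, index=["d1", "d2", "d3"])

# The Series is matched to rows by label, not by order; "d3" has no value.
rh = pd.Series([0.75, 0.64], index=["d2", "d1"])
data["rh"] = rh

print(data)
```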

1. merge()

# merge supports left, right, inner and outer joins

2. concat()

# Combines several pd.DataFrame objects, either by rows or by columns.

import pandas as pd
a = pd.DataFrame([["sunny", 983, 0.64],
                  ["rain", 991, 0.75],
                  ["fog", 973, 0.83],
                  ["haze", 1001, 0.93]],
                  index=["d1", "d2", "d3", "d4"],
                  columns=["weather", "p", "rh"]
                 )
b = pd.DataFrame([["rain", "0121"],
                  ["windy", "1123"],
                  ["fog", "1234"],
                  ["sunny", "2234"]],
                  index=["d1", "d2", "d3", "d4"],
                  columns=["weather", "code"]
                 )
print(a)
print(b)
# concat simply stacks the two DataFrames together; axis=0/1 chooses whether they are combined by rows or by columns
c = pd.concat([a, b], axis=0)  # axis=0 concatenates by rows
d = pd.concat([a, b], axis=1)  # axis=1 concatenates by columns
print(c)
print(d)
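# After merging on columns the index is reset to 0, 1, 2, ...; left_on and right_on name the key columns of the left and right DataFrame used for matching
# left keeps all keys of the left DataFrame, right keeps all keys of the right one, inner keeps only the shared keys, outer keeps all keys
# Missing values are filled with NaN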
e = pd.merge(a, b, left_on="weather", right_on="weather", how="left")
f = pd.merge(a, b, left_on="weather", right_on="weather", how="right")
g = pd.merge(a, b, left_on="weather", right_on="weather", how="inner")
h = pd.merge(a, b, left_on="weather", right_on="weather", how="outer")
print(e)
print(f)
print(g)
print(h)




output:
   weather     p    rh
d1   sunny   983  0.64
d2    rain   991  0.75
d3     fog   973  0.83
d4    haze  1001  0.93
   weather  code
d1    rain  0121
d2   windy  1123
d3     fog  1234
d4   sunny  2234
   weather       p    rh  code
d1   sunny   983.0  0.64   NaN
d2    rain   991.0  0.75   NaN
d3     fog   973.0  0.83   NaN
d4    haze  1001.0  0.93   NaN
d1    rain     NaN   NaN  0121
d2   windy     NaN   NaN  1123
d3     fog     NaN   NaN  1234
d4   sunny     NaN   NaN  2234
   weather     p    rh weather  code
d1   sunny   983  0.64    rain  0121
d2    rain   991  0.75   windy  1123
d3     fog   973  0.83     fog  1234
d4    haze  1001  0.93   sunny  2234
  weather     p    rh  code
0   sunny   983  0.64  2234
1    rain   991  0.75  0121
2     fog   973  0.83  1234
3    haze  1001  0.93   NaN
  weather      p    rh  code
0    rain  991.0  0.75  0121
1   windy    NaN   NaN  1123
2     fog  973.0  0.83  1234
3   sunny  983.0  0.64  2234
  weather    p    rh  code
0   sunny  983  0.64  2234
1    rain  991  0.75  0121
2     fog  973  0.83  1234
  weather       p    rh  code
0     fog   973.0  0.83  1234
1    haze  1001.0  0.93   NaN
2    rain   991.0  0.75  0121
3   sunny   983.0  0.64  2234
4   windy     NaN   NaN  1123
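
When both frames use the same key column name, on="weather" can replace the left_on/right_on pair, and indicator=True adds a _merge column showing where each row came from. The sketch below uses cut-down versions of the two frames above (the rh column and the d1..d4 index are left out for brevity).

```python
import pandas as pd

a = pd.DataFrame({"weather": ["sunny", "rain", "fog", "haze"],
                  "p": [983, 991, 973, 1001]})
b = pd.DataFrame({"weather": ["rain", "windy", "fog", "sunny"],
                  "code": ["0121", "1123", "1234", "2234"]})

# on="weather" is shorthand for left_on="weather", right_on="weather".
h = pd.merge(a, b, on="weather", how="outer", indicator=True)
print(h[["weather", "_merge"]])
```

Rows found only in a are marked left_only, rows found only in b are marked right_only, and matched rows are marked both.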