Notes: Pandas

DanTules

Last modified 2022-10-28 20:02:42

Tags: pandas, python, data analysis

First published 2022-10-28 20:00:57

Article link: https://blog.csdn.net/m0_74154585/article/details/127577977

Pandas

I. Series

A Series is a one-dimensional array-like object that holds data together with an index.

1. Lists and Series

Creating a Series from a list

import pandas as pd

a = pd.Series([1, 1, 2, 3, 2, 1, 1])

print(a)

The first column of the output is the index and the second column is the data (values):

0    1
1    1
2    2
3    3
4    2
5    1
6    1
dtype: int64

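Setting the index

import pandas as pd

x = ['a', 'b', 'c', 'd']
y = [0, 1, 2, 3]
s = pd.Series(y, index=x)
print(s)
print(s[2])    # integer key on a string index: positional fallback, deprecated in recent pandas (prefer s.iloc[2])
print(s['c'])  # lookup by label

Output: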

a    0
b    1
c    2
d    3
dtype: int64
2
2

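Although an index was set when the Series was created, the output shows that s[2] and s['c'] return the same element, so pandas still keeps track of positions behind the labels. A Series therefore supports two kinds of lookup: by position and by index label. The positional fallback for a plain s[2] on a non-integer index has been deprecated since pandas 2.1; s.iloc[2] (position) and s.loc['c'] (label) state the intent explicitly.

**Attention:** if the index labels are themselves integers, s[2] is treated as a label lookup, so it no longer returns the element at that position!

For example: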

import pandas as pd

x = [5, 4, 6, 2]
y = [0, 1, 2, 3]
s = pd.Series(y, index=x)

print(s[2])

Output:

3

The result is 3, the value stored under the label 2, rather than 2, the value at the third position.
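To make the label/position distinction above concrete, here is a minimal sketch using the explicit accessors .loc (label) and .iloc (position) on the same integer-labelled Series. The variable names by_label and by_position are only illustrative.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0..3
s = pd.Series([0, 1, 2, 3], index=[5, 4, 6, 2])

by_label = s.loc[2]      # the element whose label is 2
by_position = s.iloc[2]  # the third element, whatever its label is

print(by_label, by_position)
```

Using .loc and .iloc avoids having to remember how a bare s[2] is interpreted for a given index type.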

2. Dictionaries and Series

Creating a Series from a dictionary

import pandas as pd

dict1 = {'a': 1, 'b': 5, 'c': 7, 'd': 3}
s = pd.Series(dict1)

print(s)

For a Series created from a dictionary, the index consists of the dictionary's keys.

Output:

a    1
b    5
c    7
d    3
dtype: int64

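Keys that do not match the specified index

If a label in the specified index has no matching key in the dictionary, its value is NaN (and the dtype becomes float64 to accommodate it).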

import pandas as pd

dict1 = {'a': 1, 'b': 5, 'c': 7, 'd': 3}
l = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(dict1, index=l)

print(s)

Output:

a    1.0
b    5.0
c    7.0
d    3.0
e    NaN
dtype: float64

Automatic alignment of data with different indexes

import pandas as pd

dict1 = {'a': 1, 'b': 5, 'c': 7, 'd': 3}
s1 = pd.Series(dict1)
dict2 = {'a': 5, 'd': 8, 'c': 9, 'f': 3, 'e': 1}
s2 = pd.Series(dict2)

print(s1 + s2)

Output:

a     6.0
b     NaN
c    16.0
d    11.0
e     NaN
f     NaN
dtype: float64

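Adding Series

Where both Series share a label, the values are added; where a label exists in only one of them, the data is aligned and a missing value (NaN) is introduced.

Example: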

import pandas as pd

dict1 = {'a': 1, 'b': 5, 'c': 7, 'd': 3}
s1 = pd.Series(dict1)

dict2 = {'a': 2, 'e': 5, 'd': 7, 'f': 3, 'c': 8}
s2 = pd.Series(dict2)

print(s1, '\n')
print(s2, '\n')
print(s1 + s2)

Output:

a    1
b    5
c    7
d    3
dtype: int64 

a    2
e    5
d    7
f    3
c    8
dtype: int64 

a     3.0
b     NaN
c    15.0
d    10.0
e     NaN
f     NaN
dtype: float64
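If the NaN values for non-shared labels are not wanted, one option is the add method with fill_value, which treats a label missing from one side as the given value. The following is a small sketch reusing the two dictionaries above; the name total is only illustrative.

```python
import pandas as pd

s1 = pd.Series({'a': 1, 'b': 5, 'c': 7, 'd': 3})
s2 = pd.Series({'a': 2, 'e': 5, 'd': 7, 'f': 3, 'c': 8})

# A label missing from one Series counts as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)

print(total)
```

The result is still float64, because alignment introduces missing values before they are filled.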

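II. DataFrame

A DataFrame is a tabular data structure in which each column can hold a different data type.

A DataFrame has both a row index and a column index.

1. Creating a DataFrame

pd.DataFrame(dict1, columns=..., index=...)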

import pandas as pd

dict1 = {'city':['北京', '金华', '深圳'],
         'weather':['晴', '雨', '阴'],
         'temperature':['25℃', '27℃', '30℃']}
         
a = pd.DataFrame(dict1)

print(a)

Output:

    city    weather   temperature
0   北京       晴         25℃
1   金华       雨         27℃
2   深圳       阴         30℃

In the DataFrame constructor, the columns parameter specifies the column labels (and their order), while index sets the row labels.
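As a quick sketch of these two parameters, the example below builds the same data with an explicit column selection and custom row labels. The row labels 'x', 'y', 'z' are made up for illustration.

```python
import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': ['25℃', '27℃', '30℃']}

# columns picks and orders the dictionary keys; index labels the rows
a = pd.DataFrame(dict1, columns=['temperature', 'city'], index=['x', 'y', 'z'])

print(a)
```

When the data is a dictionary, columns acts as a selection: keys that are not listed (weather here) are left out.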

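2. DataFrame attributes

Attribute	Description
df.values	All values (as an array)
df.columns	Column labels
df.size	Number of elements
df.ndim	Number of dimensions
df.shape	Shape (rows, columns)
df.index	Row index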
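The attributes in the table can be checked on a small frame. This is only a sketch; the two-column DataFrame is a cut-down version of the example data.

```python
import pandas as pd

df = pd.DataFrame({'city': ['北京', '金华', '深圳'],
                   'temperature': [25, 27, 30]})

print(df.shape)          # (3, 2): 3 rows, 2 columns
print(df.size)           # 6 elements in total
print(df.ndim)           # 2 dimensions
print(list(df.columns))  # column labels
print(list(df.index))    # row labels
print(df.values)         # the underlying values as a 2-D array
```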

3. Querying DataFrame data

Indexing… indexing

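Selecting columns

A single column can be retrieved by its label (a['city']) or as an attribute (a.city); the result is a Series. Passing a list of labels returns a DataFrame. (Slices cannot be used to select columns with []: a slice there selects rows.)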

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': ['25℃', '27℃', '30℃']}

a = pd.DataFrame(dict1)

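# Select the city and weather columns
s1 = a[['city', 'weather']]

# Select the columns of object dtype (the string columns here)
s2 = a.select_dtypes(include='object')

# Select every column except those of int64 dtype
s3 = a.select_dtypes(exclude='int64')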

print(s1, '\n')
print(s2, '\n')
print(s3)

Output:

    city    weather
0   北京       晴
1   金华       雨
2   深圳       阴 

    city    weather   temperature
0   北京       晴         25℃
1   金华       雨         27℃
2   深圳       阴         30℃ 

    city    weather   temperature
0   北京       晴         25℃
1   金华       雨         27℃
2   深圳       阴         30℃
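The difference between selecting one label and selecting a list of labels shows up in the return type. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

a = pd.DataFrame({'city': ['北京', '金华', '深圳'],
                  'weather': ['晴', '雨', '阴'],
                  'temperature': ['25℃', '27℃', '30℃']})

by_label = a['city']     # one label -> Series
by_attr = a.city         # attribute access -> the same Series
as_frame = a[['city']]   # list of labels -> DataFrame, even with one column

print(type(by_label).__name__, type(by_attr).__name__, type(as_frame).__name__)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.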

Selecting rows

Rows can be selected with a slice of row labels or row positions.

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': ['25℃', '27℃', '30℃']}

a = pd.DataFrame(dict1)

w1 = a[:2]

print(w1)

Output:

    city    weather   temperature
0   北京       晴         25℃
1   金华       雨         27℃

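Selecting rows and columns

This can be done with loc and iloc. Both use square brackets, not parentheses.

.loc[row labels or boolean condition, column labels]

.iloc[row positions, column positions]

loc example: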

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

p1 = a.loc[1:2, ['city']]
p2 = a.loc[1, ['city', 'temperature']]
p3 = a.loc[a['temperature'] > 25, ['city', 'temperature']]

print(p1, '\n')
print(p2, '\n')
print(p3)

Output:

    city
1   金华
2   深圳 

city           金华
temperature    27
Name: 1, dtype: object 

  city  temperature
1   金华           27
2   深圳           30

iloc example:

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

p1 = a.iloc[:, 2]        # third column (position 2), returned as a Series
p2 = a.iloc[[1, 2]]      # second and third rows

print(p1, '\n')
print(p2)

Output:

0    25
1    27
2    30
Name: temperature, dtype: int64 

    city    weather    temperature
1   金华       雨           27
2   深圳       阴           30
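Since a.iloc[:, 2] picks a single column, selecting the first two columns needs a slice instead. The sketch below shows a few common iloc forms on the same data; the variable names are illustrative.

```python
import pandas as pd

a = pd.DataFrame({'city': ['北京', '金华', '深圳'],
                  'weather': ['晴', '雨', '阴'],
                  'temperature': [25, 27, 30]})

first_two_cols = a.iloc[:, :2]   # columns at positions 0 and 1
cell = a.iloc[1, 0]              # single value: second row, first column
block = a.iloc[1:3, [0, 2]]      # rows 1-2, columns 0 and 2

print(first_two_cols, '\n')
print(cell, '\n')
print(block)
```

Unlike loc, iloc slices exclude the end position, following normal Python slicing.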

4. Editing DataFrame data

Adding data

Adding rows:

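Older pandas versions offered DataFrame.append, which accepted a dictionary directly, but it was deprecated in 1.4 and removed in 2.0. Wrap the new row's dictionary in a one-row DataFrame and use pd.concat instead; the ignore_index parameter controls whether the original index is discarded and renumbered.

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

dict2 = {'city': '吉安', 'weather': '晴', 'temperature': 29}

# a.append(dict2, ignore_index=True) only works on pandas < 2.0
b = pd.concat([a, pd.DataFrame([dict2])], ignore_index=True)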

print(b)

Output:

    city    weather    temperature
0   北京       晴           25
1   金华       雨           27
2   深圳       阴           30
3   吉安       晴           29

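Adding columns:

Either add the key to the source dictionary before building the DataFrame, or assign values to a new column label. To add a column at a specific position, use the insert method.

Assignment: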

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

a['air'] = ['差', '好', '中']

print(a)

Output:

    city    weather    temperature  air
0   北京       晴           25       差
1   金华       雨           27       好
2   深圳       阴           30       中

Inserting at a specific position:

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

a.insert(1, 'air', ['差', '好', '中'])

print(a)

Output:

    city  air    weather    temperature
0   北京   差       晴           25
1   金华   好       雨           27
2   深圳   中       阴           30

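Deleting data

Use drop to delete data; the axis parameter selects whether rows (0) or columns (1) are removed. To delete from the original object rather than returning a new one, set inplace=True.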

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

print(a, '\n')

a.drop(0, axis=0, inplace=True)

print(a)

Output:

    city    weather    temperature
0   北京       晴           25
1   金华       雨           27
2   深圳       阴           30 

    city    weather    temperature
1   金华       雨           27
2   深圳       阴           30
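The example above removes a row in place. As a complementary sketch, the following drops a column with axis=1 and keeps the original untouched by leaving inplace at its default of False; the name b is illustrative.

```python
import pandas as pd

a = pd.DataFrame({'city': ['北京', '金华', '深圳'],
                  'weather': ['晴', '雨', '阴'],
                  'temperature': [25, 27, 30]})

# Returns a new DataFrame without the weather column; a itself is unchanged
b = a.drop('weather', axis=1)

print(b, '\n')
print(a)
```

Working on the returned copy is a simple way to keep a backup of the original data.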

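Modifying data

Use replace to modify data.

Attention: with inplace=True the DataFrame itself is modified and the change cannot be undone, so back up the data before changing it.

df.replace(to_replace=None, value=None, inplace=False)

Here to_replace is the value to be replaced and value is the value that replaces it.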

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

a.replace(to_replace='北京', value='上海', inplace=True)

print(a)

Output:

    city    weather    temperature
0   上海       晴           25
1   金华       雨           27
2   深圳       阴           30
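replace also accepts a dictionary mapping old values to new ones, which allows several substitutions in one call. A minimal sketch that leaves the original untouched by not using inplace; the replacement values are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({'city': ['北京', '金华', '深圳'],
                  'weather': ['晴', '雨', '阴'],
                  'temperature': [25, 27, 30]})

# Keys are the values to look for, values are their replacements
b = a.replace({'北京': '上海', '晴': '多云'})

print(b, '\n')
print(a)
```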

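Renaming rows and columns

Use rename to change row or column labels by passing a dictionary that maps old labels to new ones (index= for rows, columns= for columns).

Renaming a row label: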

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

a.rename(index={1: 'na'}, inplace=True)

print(a)

Output:

   city weather  temperature
0    北京       晴           25
na   金华       雨           27
2    深圳       阴           30

Renaming a column:

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': [25, 27, 30]}

a = pd.DataFrame(dict1)

a.rename(columns={'temperature': 'temp'}, inplace=True)

print(a)

Output:

    city    weather  temp
0   北京       晴      25
1   金华       雨      27
2   深圳       阴      30

5. Adding DataFrames

For DataFrames, alignment happens on both rows and columns at the same time.

import pandas as pd

dict1 = {'a': [1, 2], 'b': [5, 6], 'c': [7, 8], 'd': [3, 4]}
s1 = pd.DataFrame(dict1, index=['A', 'B'])

dict2 = {'a': [2, 3], 'e': [5, 6], 'd': [7, 8], 'f': [3, 4], 'c': [8, 9]}
s2 = pd.DataFrame(dict2, index=['A', 'C'])

print(s1, '\n')
print(s2, '\n')
print(s1 + s2)

Output:

   a  b  c  d
A  1  5  7  3
B  2  6  8  4 

   a  e  d  f  c
A  2  5  7  3  8
C  3  6  8  4  9 

     a   b     c     d   e   f
A  3.0 NaN  15.0  10.0 NaN NaN
B  NaN NaN   NaN   NaN NaN NaN
C  NaN NaN   NaN   NaN NaN NaN
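As with Series, the add method with fill_value can reduce the number of NaN results. The sketch below reuses the two frames above; note that a cell missing from both frames after alignment still comes out as NaN. The name total is illustrative.

```python
import pandas as pd

s1 = pd.DataFrame({'a': [1, 2], 'b': [5, 6], 'c': [7, 8], 'd': [3, 4]},
                  index=['A', 'B'])
s2 = pd.DataFrame({'a': [2, 3], 'e': [5, 6], 'd': [7, 8], 'f': [3, 4], 'c': [8, 9]},
                  index=['A', 'C'])

# A cell present in only one frame is combined with 0;
# a cell present in neither frame stays NaN
total = s1.add(s2, fill_value=0)

print(total)
```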

III. Indexes

1. Index objects

When a Series or DataFrame is constructed, any array or other sequence used as labels is converted to an Index.

import pandas as pd

dict1 = {'city':['北京', '金华', '深圳'],
         'weather':['晴', '雨', '阴'],
         'temperature':['25℃', '27℃', '30℃']}

a = pd.DataFrame(dict1)

print(a)
print(a.index)
print(a.columns)

Output:

    city    weather   temperature
0   北京       晴         25℃
1   金华       雨         27℃
2   深圳       阴         30℃
RangeIndex(start=0, stop=3, step=1)
Index(['city', 'weather', 'temperature'], dtype='object')

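Index objects are immutable: trying to change one of their elements in place (for example a.index[0] = 5) raises a TypeError.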
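A short sketch of what immutability means in practice: assigning to a single element of an Index fails, while replacing the whole index of the DataFrame is allowed. The flag mutated and the labels 'x', 'y', 'z' are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({'city': ['北京', '金华', '深圳'],
                  'temperature': [25, 27, 30]})

try:
    a.index[0] = 'x'          # in-place change of one label
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)

# Assigning a complete new set of labels creates a new Index instead
a.index = ['x', 'y', 'z']
print(a.index)
```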

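Some methods and attributes of Index:

Method / attribute	Description
append	Concatenates another Index object, producing a new Index
difference	Computes the set difference
intersection	Computes the intersection
union	Computes the union
isin	Returns a boolean array indicating whether each value is contained in the passed collection
delete	Returns a new Index with the element at the given position removed
drop	Returns a new Index with the passed values removed
insert	Returns a new Index with an element inserted at the given position
is_monotonic_increasing	True when each element is greater than or equal to the previous one (the older is_monotonic alias was removed in pandas 2.0)
is_unique	True when the Index contains no duplicate values
unique	Returns the unique values in the Index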
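A few of the entries in the table can be tried on two small Index objects. This is only a sketch; i1 and i2 are made-up examples.

```python
import pandas as pd

i1 = pd.Index(['a', 'b', 'c'])
i2 = pd.Index(['b', 'c', 'd'])

print(i1.union(i2))           # labels in either Index
print(i1.intersection(i2))    # labels in both
print(i1.difference(i2))      # labels in i1 but not in i2
print(i1.isin(['a', 'z']))    # one boolean per element of i1
print(i1.insert(1, 'x'))      # new Index with 'x' at position 1
print(i1.delete(0))           # new Index without the first element
print(i1.is_unique, i1.is_monotonic_increasing)
```

Every call returns a new object; i1 and i2 themselves are never changed.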

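2. Reindexing

As noted above, an Index object cannot be changed, so reindexing does not alter the existing index. Instead it returns a new object whose data is rearranged to follow the new index. Labels that did not exist in the original index get the missing value NaN.

Method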

import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'd', 'c', 'b'])

b = a.reindex(['a', 'b', 'c', 'd', 'e'])

print(a, '\n')
print(b)

Output:

a    1
d    2
c    3
b    4
dtype: int64 

a    1.0
b    4.0
c    3.0
d    2.0
e    NaN
dtype: float64

Filling missing values

The fill_value parameter fills the missing values that reindexing introduces.

import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'd', 'c', 'b'])

b = a.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=5)


print(a, '\n')
print(b)

Output:

a    1
d    2
c    3
b    4
dtype: int64 

a    1
b    4
c    3
d    2
e    5
dtype: int64

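Common reindex parameters

Parameter	Purpose
index	The new sequence of labels to conform to
method	Fill method for the gaps (for example ffill or bfill)
fill_value	Value used for the missing entries that are introduced
limit	Maximum number of consecutive entries to fill when method is used
level	Matches the labels against the given level of a MultiIndex
copy	Defaults to True, which always copies; if False, no copy is made when the new index equals the old one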
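The method parameter is easiest to see on a sorted numeric index, which it requires. The following sketch forward-fills the gaps introduced by reindexing; the data values are made up for illustration.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=[0, 2, 4])

# Each new label takes the value of the nearest earlier existing label
b = a.reindex(range(6), method='ffill')

print(b)
```

Because every gap is filled, no NaN is introduced and the dtype stays int64, as in the fill_value example above.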

3. Changing the index

Using a column as the index:

set_index()

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': ['25℃', '27℃', '30℃']}

a = pd.DataFrame(dict1)

b = a.set_index('city')

print(b)

Output:

          weather   temperature
city                    
北京         晴         25℃
金华         雨         27℃
深圳         阴         30℃

Resetting the index:

reset_index()

import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': ['25℃', '27℃', '30℃']}

a = pd.DataFrame(dict1, index=['a', 'b', 'c'])
b = a.reset_index()

print(a,'\n')
print(b)

Output:

    city    weather   temperature
a   北京       晴         25℃
b   金华       雨         27℃
c   深圳       阴         30℃ 

    index city    weather   temperature
0     a   北京       晴         25℃
1     b   金华       雨         27℃
2     c   深圳       阴         30℃

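This method restores the default integer index. The old index is kept as a regular column (named index here) unless drop=True is passed.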
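Two variations on reset_index, sketched on the same data: drop=True discards the old labels instead of turning them into a column, and reset_index after set_index moves the column back out of the index. The names b and c are illustrative.

```python
import pandas as pd

dict1 = {'city': ['北京', '金华', '深圳'],
         'weather': ['晴', '雨', '阴'],
         'temperature': ['25℃', '27℃', '30℃']}

a = pd.DataFrame(dict1, index=['a', 'b', 'c'])

# Discard the old labels entirely
b = a.reset_index(drop=True)

# set_index moves city into the index; reset_index moves it back to a column
c = a.set_index('city').reset_index()

print(b, '\n')
print(c)
```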