Data Analysis (3): Pandas

大龄程序媛

Tags: python, data analysis

Article link: https://blog.csdn.net/weixin_51862488/article/details/119726735

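This article introduces the basics of pandas, the Python data science library, covering how to create, slice, index, and operate on its two core data structures, Series and DataFrame. A Series is a labeled one-dimensional array that can be created from an array or a dictionary, while a DataFrame is a two-dimensional tabular structure with both a row index and a column index. The article also explains how to handle missing data and covers DataFrame's statistical methods, and it discusses boolean indexing, string methods, and reading external data such as CSV files and databases.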

What is pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Common pandas data types

1. Series: one-dimensional, a labeled array
2. DataFrame: two-dimensional, a container of Series

Series

Creating a Series

1. Creating a Series from an array

>>> import pandas as pd
>>> a = pd.Series([1,2,3,4,5])
>>> a
0    1
1    2
2    3
3    4
4    5
dtype: int64
>>> type(a)
<class 'pandas.core.series.Series'>

The default index starts at 0 (0, 1, 2, 3, ...); pass the index parameter to use different index labels.

>>> pd.Series([1,23,2,2,1],index=list('abcde'))
a     1
b    23
c     2
d     2
e     1
dtype: int64

2. Creating a Series from a dictionary, where the dictionary keys become the index

>>> temp_dict = {"name":"xiaohong","age":30,"tel":95598}
>>> t=pd.Series(temp_dict)
>>> t
name    xiaohong
age           30
tel        95598
dtype: object

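If you specify a different index when building the Series from a dictionary, each label that matches a key takes that key's value, and each label with no matching key gets NaN.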

>>> import string
>>> b = {string.ascii_uppercase[i]:i for i in range(10)}
>>> b
{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'J': 9}
>>> pd.Series(b)
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64
>>> pd.Series(b,index=list(string.ascii_uppercase[5:15]))
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64

Series slicing and indexing

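In NumPy, nan is a float, so pandas automatically adjusts the dtype of a Series to fit its data (for example, to float64 once NaN appears). You can change the dtype the same way as in NumPy, with astype.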

>>> t.dtype
dtype('O')
>>> a = pd.Series(range(10))
>>> a.astype(float)
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64

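Slicing: pass start and end, and optionally a step.
Indexing: for one element, pass its position or its index label; for several, pass a list of positions or a list of index labels.
A Series is essentially made of two arrays: one holds the keys (the index) and the other holds the values, so each key maps to a value.
Many ndarray methods also work on a Series, such as argmax and clip.
Series has a where method, but its result differs from the ndarray version and it is not used often; see the official pandas documentation for details.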
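
The slicing and indexing rules above can be sketched as follows. This is a minimal example with made-up values; it uses .iloc for positions and .loc for labels explicitly, which is a choice made here so that it is unambiguous which kind of lookup is meant.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list("abcde"))

# Slicing by position: start, end (exclusive), and an optional step.
first_three = s.iloc[0:3]
every_other = s.iloc[::2]

# A single element, by position or by index label.
by_position = s.iloc[1]
by_label = s.loc["b"]

# Several elements: pass a list of positions or a list of labels.
some_positions = s.iloc[[0, 2]]
some_labels = s.loc[["a", "e"]]

# The two underlying arrays: the index (keys) and the values.
keys = list(s.index)
values = s.values

# ndarray-style methods also work on a Series.
clipped = s.clip(lower=20, upper=40)
largest_pos = s.argmax()

# Series.where keeps values where the condition holds and puts NaN elsewhere.
kept = s.where(s > 20)
```

Note that Series.where returns a Series of the same length with NaN in the positions that fail the condition, which is why the result becomes float64 here.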

Reading external data

To read a CSV file, use pd.read_csv.
To read data from a MySQL database, use pd.read_sql(sql_sentence, connection).
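
A small sketch of both calls follows. The CSV text, the table name people, and its contents are made up for illustration, and an in-memory SQLite database stands in for MySQL so the example runs without a server; with MySQL you would pass your own connection object instead.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content; in practice pass a file path to pd.read_csv.
csv_text = "name,age\nxiaoming,20\nxiaogang,32\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Assumption: an in-memory SQLite database stands in for MySQL here.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("xiaoming", 20), ("xiaogang", 32)],
)
connection.commit()

# The query result comes back as a DataFrame.
sql_sentence = "SELECT name, age FROM people ORDER BY age"
df_sql = pd.read_sql(sql_sentence, connection)
connection.close()
```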

DataFrame

>>> import numpy as np
>>> t = pd.DataFrame(np.arange(12).reshape(3,4))
>>> t
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

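A DataFrame has both a row index and a column index.
Row index: identifies the rows; it is called index and corresponds to axis 0 (axis=0).
Column index: identifies the columns; it is called columns and corresponds to axis 1 (axis=1).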

>>> import numpy as np
>>> t1=pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("WXYZ"))
>>> t1
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

>>> d1={"name":["xiaoming","xiaogang"],"age":[20,32],"tel":[10086,10010]}
>>> pd.DataFrame(d1)
       name  age    tel
0  xiaoming   20  10086
1  xiaogang   32  10010
>>> t3 = pd.DataFrame(d1)
>>> type(t3)
<class 'pandas.core.frame.DataFrame'>


>>> d2 = [{"name":"xiaohong","age":32,"tel":10086},{"name":"xiaogang","tel":10010},{"name":"xiaowang","age":20}]
>>> t2 = pd.DataFrame(d2)
>>> t2
       name   age      tel
0  xiaohong  32.0  10086.0
1  xiaogang   NaN  10010.0
2  xiaowang  20.0      NaN
>>> t2.index
RangeIndex(start=0, stop=3, step=1)
>>> t2.columns
Index(['name', 'age', 'tel'], dtype='object')
>>> t2.values
array([['xiaohong', 32.0, 10086.0],
       ['xiaogang', nan, 10010.0],
       ['xiaowang', 20.0, nan]], dtype=object)
>>> t2.shape
(3, 3)
>>> t2.dtypes
name     object
age     float64
tel     float64
dtype: object
>>> t2.ndim
2
>>> t2.head()
       name   age      tel
0  xiaohong  32.0  10086.0
1  xiaogang   NaN  10010.0
2  xiaowang  20.0      NaN
>>> t2.head(1)
       name   age      tel
0  xiaohong  32.0  10086.0
>>> t2.tail(2)
       name   age      tel
1  xiaogang   NaN  10010.0
2  xiaowang  20.0      NaN
>>> t2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    3 non-null      object 
 1   age     2 non-null      float64
 2   tel     2 non-null      float64
dtypes: float64(2), object(1)
memory usage: 200.0+ bytes
>>> t2.describe()  # only numeric columns are summarized
             age           tel
count   2.000000      2.000000
mean   26.000000  10048.000000
std     8.485281     53.740115
min    20.000000  10010.000000
25%    23.000000  10029.000000
50%    26.000000  10048.000000
75%    29.000000  10067.000000
max    32.000000  10086.000000

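To sort a DataFrame, use df.sort_values(). The ascending parameter defaults to True (ascending order), so pass ascending=False explicitly to sort in descending order.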
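
As a rough illustration of the sorting behavior described above, the sketch below sorts a tiny made-up table. The column names match the ones used later in this article, but the rows themselves are invented.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "ROCKY", "COCO"],
    "Count_AnimalName": [1195, 1153, 823, 852],
})

# Default: ascending order. A new DataFrame is returned; df is unchanged.
ascending_df = df.sort_values(by="Count_AnimalName")

# Descending order must be requested explicitly.
descending_df = df.sort_values(by="Count_AnimalName", ascending=False)
```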

Selecting rows or columns in pandas

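df_sorted = df.sort_values(by="Count_AnimalName")
df_sorted[:100]
How do you select a specific column? df["Count_AnimalName"]
How do you select rows and a column at the same time? df[:100]["Count_AnimalName"]
Notes on selecting rows or columns in pandas:
1. A slice inside the square brackets (such as :100) selects rows.
2. A string inside the square brackets selects a column by its label.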
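
The two bracket rules can be tried on a small table. The data below is invented for illustration, and the list-of-strings case at the end is an extra that the notes above do not mention.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "ROCKY", "COCO"],
    "Count_AnimalName": [1195, 1153, 823, 852],
})

# A slice in the brackets selects rows.
first_two_rows = df[:2]

# A string in the brackets selects one column and returns a Series.
one_column = df["Count_AnimalName"]

# Rows first, then a column.
rows_then_column = df[:2]["Count_AnimalName"]

# A list of strings selects several columns and returns a DataFrame.
two_columns = df[["Row_Labels", "Count_AnimalName"]]
```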

df.loc selects rows by label.
df.iloc selects rows by position.

>>> t1
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
>>> t1.loc["a"]
W    0
X    1
Y    2
Z    3
Name: a, dtype: int32
>>> t1.loc["a":'c',["W","Z"]]  # slices in loc are closed intervals: the label after the colon is included
   W   Z
a  0   3
b  4   7
c  8  11
>>> t1.iloc[1:,:2] = np.nan  # the assignment succeeds; pandas converts the affected columns to float automatically
>>> t1
     W    X   Y   Z
a  0.0  1.0   2   3
b  NaN  NaN   6   7
c  NaN  NaN  10  11

Boolean indexing in pandas

Single condition: df = df[df["Count_AnimalName"]>50]
Multiple conditions: df[(df["Count_AnimalName"]>50) & (df["Row_Labels"].str.len()<6)]
Operators:
& and
| or
Note: each condition must be wrapped in its own parentheses.
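
Here is a minimal sketch of single and combined conditions. The rows and the thresholds are made up so that each filter keeps a different subset.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "ROCKY", "COCO", "LUCKY"],
    "Count_AnimalName": [1195, 1153, 823, 852, 40],
})

# Single condition.
popular = df[df["Count_AnimalName"] > 850]

# Two conditions joined with & (and); each one sits in its own parentheses.
popular_and_short = df[
    (df["Count_AnimalName"] > 850) & (df["Row_Labels"].str.len() < 5)
]

# Two conditions joined with | (or).
popular_or_short = df[
    (df["Count_AnimalName"] > 1000) | (df["Row_Labels"].str.len() < 4)
]
```

The parentheses matter because & and | bind more tightly than comparison operators in Python, so leaving them out changes how the expression is grouped.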

String methods in pandas

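
String methods are reached through the .str accessor of a Series, as in the .str.len() call used for boolean indexing. The sketch below shows a few commonly used ones on made-up names; it is a small selection chosen for this example, not a complete list.

```python
import pandas as pd

# Hypothetical names, for illustration only.
names = pd.Series(["Bella", "max", "Rocky Jr"])

upper = names.str.upper()                 # upper-case every string
lengths = names.str.len()                 # length of every string
has_space = names.str.contains(" ")       # boolean Series
parts = names.str.split(" ")              # list of pieces per element
replaced = names.str.replace("Jr", "Junior", regex=False)
```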

Handling missing data

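Check whether values are NaN: pd.isnull(df), pd.notnull(df)

Option 1: drop the rows or columns that contain NaN with dropna(axis=0, how='any', inplace=False)
how sets the condition: 'any' drops a row or column as soon as it contains one NaN, while 'all' drops it only when every value is NaN. inplace controls whether the object is modified in place; the default is False.
Option 2: fill in values with t.fillna(t.mean()), t.fillna(t.median()), or t.fillna(0)

Handling values that are 0: t[t==0]=np.nan
Of course, zeros do not need this treatment every time.
When computing the mean and similar statistics, NaN is left out of the calculation, but 0 is included.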
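
The difference between 0 and NaN in a mean can be checked on a tiny example. The column name score and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
t = pd.DataFrame({"score": [0.0, 10.0, 20.0]})

# 0 counts as a value: (0 + 10 + 20) / 3
mean_with_zero = t["score"].mean()

# Treat zeros as missing.
t[t == 0] = np.nan

# NaN is skipped: (10 + 20) / 2
mean_without_zero = t["score"].mean()
```

Whether a zero should be treated as missing depends on what the column means, so this replacement is a judgment call per dataset.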

>>> import pandas as pd
>>> import numpy as np
>>> t = pd.DataFrame(np.arange(24).reshape((4,6)).astype(float))
>>> t.index=list("ABCD")
>>> t.columns=list("UVWXYZ")
>>> t
      U     V     W     X     Y     Z
A   0.0   1.0   2.0   3.0   4.0   5.0
B   6.0   7.0   8.0   9.0  10.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0  20.0  21.0  22.0  23.0
>>> t.loc["A",["U","Z"]] = np.nan
>>> t.loc["D","W"] = np.nan
>>> t.loc["B","Y"] = 0
>>> t
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0
>>> pd.isnull(t)
       U      V      W      X      Y      Z
A   True  False  False  False  False   True
B  False  False  False  False  False  False
C  False  False  False  False  False  False
D  False  False   True  False  False  False
>>> pd.notnull(t)
       U     V      W     X     Y      Z
A  False  True   True  True  True  False
B   True  True   True  True  True   True
C   True  True   True  True  True   True
D   True  True  False  True  True   True
>>> t[pd.notnull(t["U"])]
      U     V     W     X     Y     Z
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0
>>> t.dropna(axis=0,how='all')
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0
>>> t.dropna(axis=0,how='any')
      U     V     W     X     Y     Z
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
>>> t
      U     V     W     X     Y     Z
A   NaN   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0
>>> t.fillna(t.mean())
      U     V     W     X     Y     Z
A  12.0   1.0   2.0   3.0   4.0  17.0
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   8.0  21.0  22.0  23.0
>>> t.fillna(0)
      U     V     W     X     Y     Z
A   0.0   1.0   2.0   3.0   4.0   0.0
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   0.0  21.0  22.0  23.0
>>> t['U'] = t["U"].fillna(t["U"].median())
>>> t
      U     V     W     X     Y     Z
A  12.0   1.0   2.0   3.0   4.0   NaN
B   6.0   7.0   8.0   9.0   0.0  11.0
C  12.0  13.0  14.0  15.0  16.0  17.0
D  18.0  19.0   NaN  21.0  22.0  23.0

Common pandas statistical methods

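
A few of the statistical methods available on a DataFrame or Series are sketched below. The data is invented, and the methods shown are a small selection chosen for this example.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "age": [20, 32, 26],
    "tel": [10086, 10010, 10000],
})

column_means = df.mean()             # one value per column (axis=0 is the default)
row_sums = df.sum(axis=1)            # one value per row
age_max = df["age"].max()
age_max_label = df["age"].idxmax()   # index label of the largest value
age_median = df["age"].median()
```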
