pandas Handbook: Data Analysis Essentials for 2020 (Part 7): Getting Started with pandas and Its Basic Data Structures

weixin_40007804

Article link: https://blog.csdn.net/weixin_40007804/article/details/111287784

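This article is the seventh in the pandas handbook series. It covers installing pandas and the ways to create a DataFrame (from CSV, Excel, dictionaries, lists, and so on), and shows DataFrame attributes and operations such as shape, data types, and index. It also looks at creating and working with the Series data structure from lists, NumPy arrays, and dictionaries.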

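Edited by: 远方Github (please credit the source when reposting)

A careful, systematic study of data analysis

This article continues the series on data analysis with Python. The earlier installments, for review:

A First Look at Data Analysis: A Simple Application (2019/11/04)
Data Analysis Essentials for 2020 (1): NumPy Arrays
Data Analysis Essentials for 2020 (2): NumPy Arrays (Python attached at the end of the article)
Data Analysis Essentials for 2020 (3): Array Shapes and Attributes (with a bonus giveaway)
Data Analysis Essentials (4): Array Conversion, Views, Copies, Indexing, and Broadcasting (the "broadcasting" example is an array application: processing an old mobile-phone ringtone)
Data Analysis Essentials for 2020 (5): Statistics and Linear Algebra (with NumPy and SciPy)
Data Analysis Essentials for 2020 (6): Creating Masked Arrays (using recent pork prices in Beijing as the example)

Without further ado, here is the main text.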

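1. pandas

The name pandas derives from "panel data", a term from econometrics. pandas is an open-source data analysis library for Python.

The pandas project asks that its name be written in all lowercase, and by convention the library is imported as: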

import pandas as pd

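To install pandas:

pip install pandas

or

python -m pip install pandas

On Linux or macOS, installing into the system Python may require sudo; installing into a virtual environment or with pip's --user option avoids that.

To check the installed pandas version:

import pandas as pd
pd.show_versions()
pd.__version__

Output: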

pandas : 0.25.
numpy : 1.16.
pytz : 2019.
dateutil : 2.8.0
pip : 19.3.
setuptools : 41.0.
pytest : 5.1.1
sphinx : 2.2.1
lxml.etree : 4.4.1
pymysql : 0.9.3
jinja2 : 2.10.
IPython : 7.9.0
bs4 : 4.8.0
lxml.etree : 4.4.1
matplotlib : 3.1.1
openpyxl : 2.6.2
scipy : 1.3.0
>>> pd.__version__
'0.25.3'

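The pandas version used here is 0.25.3.

pandas 0.25.3 has the following required dependencies:

NumPy: covered in the earlier articles, so skipped here.
python-dateutil: a library for handling date data.
pytz: a library for handling time zones.

In the output above, their versions are:

numpy : 1.16.
pytz : 2019.
dateutil : 2.8.0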

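2. DataFrame, the workhorse of pandas

The pandas DataFrame is a labeled two-dimensional data structure, much like an Excel spreadsheet or a database table. The DataFrame concept originally comes from the R language. A DataFrame can be created from several sources:

CSV
Excel
Python dictionary
List of tuples
List of dictionaries

To show what a DataFrame looks like, we start by creating one from a dictionary.

Creating a DataFrame from a dictionary of lists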

from pandas import Series, DataFrame
import pandas as pd

# Column keys: 书籍 = book, 数量 = quantity, 价格 = price
data = {'书籍': ['高等数学', '深度学习', '算法手册'],
        '数量': [5, 6, 3],
        '价格': [108.9, 98.8, 88.6]}
df = DataFrame(data)
print(df)

Output:

     书籍  数量     价格
0  高等数学   5  108.9
1  深度学习   6   98.8
2  算法手册   3   88.6

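Creating a DataFrame from a nested dictionary

Here the keys of the outer dictionary become the column labels and the keys of the inner dictionaries become the row labels.

data1 = {'数量': {'高等数学': 5, '深度学习': 6, '算法手册': 3},
         '价格': {'高等数学': 108.9, '深度学习': 98.8, '算法手册': 88.6}}
df1 = DataFrame(data1)
print(df1)

Output:

      数量     价格
高等数学   5  108.9
深度学习   6   98.8
算法手册   3   88.6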

Creating a DataFrame from a dictionary of Series

data2 = {'书籍': Series(['高等数学', '深度学习', '算法手册']),
         '数量': Series([5, 6, 3]),
         '价格': Series([108.9, 98.8, 88.6])}
df2 = DataFrame(data2)
print(df2)

Output:

     书籍  数量     价格
0  高等数学   5  108.9
1  深度学习   6   98.8
2  算法手册   3   88.6
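The list-based creation routes named at the start of this section (list of tuples, list of dictionaries) are not demonstrated above. The following is a minimal sketch that reuses the same book data; the variable names `rows`, `records`, `df_tuples`, and `df_records` are chosen here for illustration only.

```python
import pandas as pd

# One tuple per row; column labels must be supplied separately.
rows = [('高等数学', 5, 108.9), ('深度学习', 6, 98.8), ('算法手册', 3, 88.6)]
df_tuples = pd.DataFrame(rows, columns=['书籍', '数量', '价格'])

# One dictionary per row; the keys become the column labels.
records = [
    {'书籍': '高等数学', '数量': 5, '价格': 108.9},
    {'书籍': '深度学习', '数量': 6, '价格': 98.8},
    {'书籍': '算法手册', '数量': 3, '价格': 88.6},
]
df_records = pd.DataFrame(records)

print(df_tuples)
# Both routes should describe the same table.
print(df_tuples.equals(df_records))
```

With a list of tuples the column order follows the `columns` argument, while with a list of dictionaries it follows the key order of the records.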

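Now that you have a first impression of DataFrame, we go through its features and attributes step by step.

3. The pandas DataFrame and its attributes

We use a CSV file as the example. It can be obtained from:

http://www.exploredata.net/Downloads/WHO-Data-Set

Keep the first nine columns of the downloaded CSV data and delete the rest.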

(1) Load the CSV file into a DataFrame:

from pandas.io.parsers import read_csv   # the same function as pd.read_csv

# path to the CSV file
df = read_csv("C:/Users/Administrator/Desktop/CSV_DATA/3358OS_04_Code/code4/WHO_first9cols.csv")
print("DataFrame", df)   # print the data

Output:

DataFrame                 Country  CountryID  ...  Net primary school enrolment ratio male (%)  Population (in thousands) total
0           Afghanistan          1  ...                                          NaN                          26088.0
1               Albania          2  ...                                         94.0                           3172.0
2               Algeria          3  ...                                         96.0                          33351.0
3               Andorra          4  ...                                         83.0                             74.0
4                Angola          5  ...                                         51.0                          16557.0
..                  ...        ...  ...                                          ...                              ...
197             Vietnam        198  ...                                         96.0                          86206.0
198  West Bank and Gaza        199  ...                                          NaN                              NaN
199               Yemen        200  ...                                         85.0                          21732.0
200              Zambia        201  ...                                         90.0                          11696.0
201            Zimbabwe        202  ...                                         87.0                          13228.0

[202 rows x 9 columns]
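For readers who do not have the WHO file at hand, the loading step can be tried on an in-memory CSV. This is only a sketch: the three rows below are a hypothetical cut-down sample that reuses a few values visible in the output above, and `csv_text` and `sample` are names introduced here.

```python
import io
import pandas as pd

# Hypothetical miniature of the WHO file: three columns, three rows.
# The last row leaves the population field empty on purpose.
csv_text = """Country,CountryID,Population (in thousands) total
Afghanistan,1,26088
Albania,2,3172
West Bank and Gaza,199,
"""

# read_csv accepts any file-like object, not only a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.shape)
print(sample.dtypes)
```

The empty field is read as NaN, which is why the population column comes back as a floating-point column even though the other values are whole numbers.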

(2) The shape attribute holds the DataFrame's dimensions as a tuple, giving the number of rows and columns; len() returns the number of rows.

from pandas.io.parsers import read_csv   # load the CSV and display it

# path to the CSV file
df = read_csv("C:/Users/Administrator/Desktop/CSV_DATA/3358OS_04_Code/code4/WHO_first9cols.csv")
print("DataFrame", df)   # print the data
print("shape:", df.shape)
print("Length", len(df))

Output:

shape: (202, 9)
Length 202

(3) Other attributes give the column headers and the data type of each column:

print("Column Headers:", df.columns)
print("Data types:", df.dtypes)

Output:

Column Headers: Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total'],
      dtype='object')
Data types: Country                                                    object
CountryID                                                   int64
Continent                                                   int64
Adolescent fertility rate (%)                             float64
Adult literacy rate (%)                                   float64
Gross national income per capita (PPP international $)    float64
Net primary school enrolment ratio female (%)             float64
Net primary school enrolment ratio male (%)               float64
Population (in thousands) total                           float64
dtype: object
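Once the data types are known, they can be used to pick out columns. The sketch below is an illustration on a hypothetical two-row frame (`small`) built from values shown earlier, not on the full WHO file; it uses `select_dtypes` to keep only the numeric columns.

```python
import pandas as pd

# Hypothetical two-row frame with the same kinds of columns as the WHO data.
small = pd.DataFrame({
    'Country': ['Albania', 'Algeria'],
    'CountryID': [2, 3],
    'Continent': [2, 3],
    'Population (in thousands) total': [3172.0, 33351.0],
})

# Keep only integer and floating-point columns.
numeric = small.select_dtypes(include='number')
numeric_columns = list(numeric.columns)

print(small.dtypes)
print(numeric_columns)
```

The text column `Country` is dropped because it is not a numeric dtype.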

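(4) Print the index

print("index:", df.index)

Output:

index: RangeIndex(start=0, stop=202, step=1)

Note: this index wraps a range of integers. It starts at 0 and increases in steps of 1; the stop value 202 is exclusive, so the last label is 201.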
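The exclusive stop value is easy to check directly. A small sketch, constructing a `RangeIndex` with the same parameters as the one printed above (the name `idx` is introduced here):

```python
import pandas as pd

# Same parameters as the index of the 202-row DataFrame.
idx = pd.RangeIndex(start=0, stop=202, step=1)

print(len(idx))          # number of labels
print(idx[0], idx[-1])   # first and last label
print(202 in idx)        # the stop value itself is not a label
```

This mirrors Python's built-in `range(0, 202)`, which also stops before 202.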

(5) To traverse a DataFrame, extract its values as the underlying NumPy array and then process them as needed.

print("values:", df.values)

Output:

values: [['Afghanistan' 1 1 ... nan nan 26088.0]
 ['Albania' 2 2 ... 93.0 94.0 3172.0]
 ['Algeria' 3 3 ... 94.0 96.0 33351.0]
 ...
 ['Yemen' 200 1 ... 65.0 85.0 21732.0]
 ['Zambia' 201 3 ... 94.0 90.0 11696.0]
 ['Zimbabwe' 202 3 ... 88.0 87.0 13228.0]]

Note: nan marks an empty field or a missing value.
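As a rough illustration of traversal and of how NaN behaves, the sketch below walks a hypothetical three-row frame (`frame`, built from values shown above) row by row with `itertuples`, converts it to a NumPy array with `to_numpy()`, and sums a column that contains a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the last population value is missing.
frame = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'West Bank and Gaza'],
    'Population (in thousands) total': [26088.0, 3172.0, np.nan],
})

# Row-wise traversal: collect the countries whose population is missing.
# Fields are read by position because the column names are not valid identifiers.
missing = [row[0] for row in frame.itertuples(index=False) if pd.isna(row[1])]

# The underlying values as a NumPy array (to_numpy() is the method form of .values).
arr = frame.to_numpy()

# pandas reductions skip NaN by default.
total = frame['Population (in thousands) total'].sum()

print(missing)
print(arr.shape)
print(total)
```

For column-wise arithmetic, vectorized operations such as `sum()` are usually preferable to an explicit loop over rows.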

Putting it all together, the complete code is:

from pandas.io.parsers import read_csv   # load the CSV and display it

# path to the CSV file
df = read_csv("C:/Users/Administrator/Desktop/CSV_DATA/3358OS_04_Code/code4/WHO_first9cols.csv")
print("DataFrame", df)   # print the data
print("shape:", df.shape)
print("Length", len(df))
print("Column Headers:", df.columns)
print("Data types:", df.dtypes)
print("index:", df.index)
print("values:", df.values)

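4. The Series data structure

A Series can be created from:

a Python list
a NumPy array, for example one produced by arange
a Python dictionary

We try these first and then return to the CSV data.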

(1) Creating a Series from a list

In a terminal:

C:\Users\Administrator>python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> ser1 = pd.Series([2019,2020,2021,2022])
>>> ser1
0    2019
1    2020
2    2021
3    2022
dtype: int64

In the Visual Studio Python terminal:

import pandas as pd
ser1 = pd.Series([2019, 2020, 2021, 2022])
print("ser1:", ser1)

Output:

>>> print("ser1:",ser1)
ser1: 0    2019
1    2020
2    2021
3    2022
dtype: int64

With an explicit index:

ser1 = pd.Series([2019, 2020, 2021, 2022], index=['a', 'b', 'c', 'd'])

Output:

>>> print("ser1:",ser1)
ser1: a    2019
b    2020
c    2021
d    2022
dtype: int64

(2) Creating a Series from a NumPy arange array

import numpy as np
ser2 = pd.Series(np.arange(6))
print("ser2:", ser2)

Output:

>>> print("ser2:",ser2)
ser2: 0    0
1    1
2    2
3    3
4    4
5    5
dtype: int32

With an explicit index:

ser2 = pd.Series(np.arange(6), index=['A', 'B', 'C', 'D', 'E', 'F'])
print("ser2:", ser2)

Output:

>>> print("ser2:",ser2)
ser2: A    0
B    1
C    2
D    3
E    4
F    5
dtype: int32
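The `int32` in the output above comes from the default integer type that `np.arange` produced on the Windows Python build used in this article; on other platforms the same code may print `int64`. A minimal sketch of fixing the type explicitly, with `ser_default` and `ser_int64` as names introduced here:

```python
import numpy as np
import pandas as pd

# dtype follows NumPy's platform default integer type.
ser_default = pd.Series(np.arange(6))

# Requesting the dtype explicitly gives the same result everywhere.
ser_int64 = pd.Series(np.arange(6), dtype='int64')

print(ser_default.dtype)
print(ser_int64.dtype)
```

Stating the dtype is mainly useful when results need to be identical across machines.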

(3) Creating a Series from a Python dictionary

ser3 = pd.Series({'Python': 1, "Java": 2, "C": 3})
print("ser3:", ser3)

Output:

>>> print("ser3:",ser3)
ser3: Python    1
Java      2
C         3
dtype: int64
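When a Series is built from a dictionary, the keys become the index labels, so values can be looked up by key. The sketch below also shows what happens when an `index` argument is passed together with a dictionary; `langs` and `ser4` are names introduced here, and `'Go'` is a made-up key that is deliberately absent from the dictionary.

```python
import pandas as pd

langs = {'Python': 1, 'Java': 2, 'C': 3}
ser3 = pd.Series(langs)

# Label-based lookup, as with a dictionary.
java_value = ser3['Java']

# With a dictionary, index selects and reorders the keys;
# labels that are not keys of the dictionary get NaN.
ser4 = pd.Series(langs, index=['C', 'Python', 'Go'])

print(java_value)
print(ser4)
```

Because NaN is a floating-point value, `ser4` is stored as float64 rather than int64.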

Next we load the CSV file again (the same CSV data as before):

(4) Select the first column of the CSV (or any other column) and print the type of the resulting object

from pandas.io.parsers import read_csv   # load the CSV and display it

# path to the CSV file
df = read_csv("C:/Users/Administrator/Desktop/CSV_DATA/3358OS_04_Code/code4/WHO_first9cols.csv")
print("DataFrame", df)   # print the data
print("shape:", df.shape)
print("Length", len(df))
print("Column Headers:", df.columns)
print("Data types:", df.dtypes)
print("index:", df.index)
print("values:", df.values)

# show the types of the DataFrame and of a single column
country_col = df["Country"]
print("Type df", type(df))
print("Type country col:", type(country_col))

Output:

>>> country_col = df["Country"]
>>> print("Type df",type(df))
Type df <class 'pandas.core.frame.DataFrame'>
>>> print("Type country col:",type(country_col))
Type country col: <class 'pandas.core.series.Series'>

(5) A Series shares attributes such as shape, index, and values with the DataFrame, and additionally provides a name attribute

# Series attributes, including the name attribute
print("Series shape", country_col.shape)
print("Series index", country_col.index)
print("Series values", country_col.values)
print("Series name", country_col.name)

Output:

>>> # Series attributes, including the name attribute
...
>>> print("Series shape", country_col.shape)
Series shape (202,)
>>> print("Series index", country_col.index)
Series index RangeIndex(start=0, stop=202, step=1)
>>> print("Series values", country_col.values)
Series values ['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola' 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' ... 'China' ... (too many to show, so part of the output is omitted here) ... 'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
>>> print("Series name", country_col.name)
Series name Country
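The name of a Series is what ties it back to a column label. As a small sketch on a hypothetical three-row frame (`df_small`, `col`, and `renamed` are names introduced here), the name is taken from the column on selection and is used again as the column label when the Series is turned back into a DataFrame:

```python
import pandas as pd

df_small = pd.DataFrame({'Country': ['Yemen', 'Zambia', 'Zimbabwe']})

# Selecting a column gives a Series named after that column.
col = df_small['Country']
print(col.name)

# rename() returns a new Series with a different name; col is unchanged.
renamed = col.rename('Nation')

# The name becomes the column label again in to_frame().
new_columns = renamed.to_frame().columns.tolist()
print(new_columns)
```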

(6) Slicing a Series

As an illustration, take the last 4 countries from the Series country_col.

# take the last 4 countries
print("Last 4 countries", country_col[-4:])
print("Last 4 countries type", type(country_col[-4:]))

Output:

>>> # take the last 4 countries
...
>>> print("Last 4 countries", country_col[-4:])
Last 4 countries 198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, dtype: object
>>> print("Last 4 countries type", type(country_col[-4:]))
Last 4 countries type <class 'pandas.core.series.Series'>
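Slicing can also be written with the explicit indexers `iloc` (by position) and `loc` (by label), which makes the intent clearer when the index consists of integers. The following sketch uses a hypothetical five-element Series (`countries`) carrying the last five labels of the WHO data:

```python
import pandas as pd

# Hypothetical tail of the Country column, with labels 197..201.
countries = pd.Series(
    ['Vietnam', 'West Bank and Gaza', 'Yemen', 'Zambia', 'Zimbabwe'],
    index=range(197, 202),
    name='Country',
)

# Positional slice: the last two elements, end exclusive as in Python lists.
last2 = countries.iloc[-2:]

# Label slice: both end points are included.
by_label = countries.loc[200:201]

print(last2)
print(by_label.equals(last2))
```

Note the difference in end-point handling: `iloc` follows Python's exclusive convention, while `loc` includes the last label.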

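(7) Use NumPy's sign() function to get the sign of each number: it returns 1 for positive numbers, -1 for negative numbers, and 0 for zero.

First, a quick review of the mathematical sign function:

# import the numpy library
import numpy as np

# input data
dataArr = [-2017, -2018, 0, 2019, 2020, 2021]
print("Input data:")
print(dataArr)

# use numpy's sign(x) to get the sign of each input value
signResult = np.sign(dataArr)

# print the result of sign()
print("\nSigns returned by sign():")
print(signResult)

Output:

>>> dataArr = [-2017, -2018, 0, 2019, 2020, 2021]
>>> print("Input data:")
Input data:
>>> print(dataArr)
[-2017, -2018, 0, 2019, 2020, 2021]
>>> print("\nSigns returned by sign():")

Signs returned by sign():
>>> print(signResult)
[-1 -1  0  1  1  1]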

Now apply it to the last column of the CSV data above, that is, each country's population.

# apply sign() to the last column
last_col = df.columns[-1]
print("Last df column signs", last_col, np.sign(df[last_col]))

Output:

>>> # applying sign()
...
>>> last_col = df.columns[-1]
>>> print("Last df column signs", last_col,np.sign(df[last_col]))
Last df column signs Population (in thousands) total 0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ...
197    1.0
198    NaN
199    1.0
200    1.0
201    1.0
Name: Population (in thousands) total, Length: 202, dtype: float64
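Two details of this result are worth checking on a smaller input: the return value of `np.sign` applied to a Series is again a Series (index and name are kept), and a missing value stays missing. A minimal sketch with a hypothetical three-element Series `pop` built from population values shown earlier:

```python
import numpy as np
import pandas as pd

# Hypothetical three-element population column; the middle value is missing.
pop = pd.Series([26088.0, np.nan, 21732.0], name='Population (in thousands) total')

# NumPy universal functions applied to a Series return a Series.
signs = np.sign(pop)

print(type(signs))
print(signs)
```

This is why row 198 (West Bank and Gaza, whose population is NaN) shows NaN in the output above instead of a sign.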

Coming next: querying data and statistical analysis with pandas