Pandas Quick Start: Class Notes

Salary_Salary_Salary

Last modified: 2023-01-03 21:31:26

Tags: pandas, python, data analysis

First published: 2023-01-03 21:28:41

Article link: https://blog.csdn.net/beiyin0141/article/details/128531213

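This post covers the basics of Pandas, the Python data analysis library: loading csv and tsv datasets, understanding the DataFrame and Series data structures, and using loc and iloc to retrieve specified rows and columns. It also covers getting and setting row and column labels, slicing, and the [] syntax.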

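Learning objectives:

Class notes for: Heima Programmer (黑马程序员) Python big data course, Pandas Quick Start 01
Understand the DataFrame and Series data structures
Load csv and tsv datasets
Distinguish a DataFrame's row and column labels from its row and column positions
Retrieve specified rows and columns from a DataFrame
    loc
    iloc
    Slicing with loc and iloc
    [] syntax

The original post runs to 8,497 characters and is aimed at beginners. I hope you find it helpful!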

Course content:

pandas version 1.1.3 (the version used in the examples)

# pip install pandas==1.1.3
import pandas as pd

Link to the example datasets:

Netdisk link (extraction code: q74d): https://pan.baidu.com/s/10gTn1ve26_za6KgL_7DUDA?pwd=q74d

Loading a .csv dataset

# Load CSV data
# The quoted string is the path to the dataset; change it to your own path
tips = pd.read_csv('pandas数据包/tips.csv')
print(tips)
print(type(tips))

'''     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2
'''
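If you do not have the dataset at hand, a minimal sketch like the one below lets you try read_csv on an in-memory string instead. The three rows are copied from the output above; the name tips_sample and the use of io.StringIO are my own choices for illustration, not part of the course material.

```python
import io
import pandas as pd

# Three rows copied from the tips output above, inlined so no file is needed
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

# read_csv accepts a file-like object as well as a path
tips_sample = pd.read_csv(io.StringIO(csv_text))

print(tips_sample.shape)    # (number of rows, number of columns)
print(tips_sample.dtypes)   # column types inferred by read_csv
print(tips_sample.head(2))  # first two rows
```

The same calls (shape, dtypes, head) work on the full tips DataFrame and are a quick way to check that a file loaded as expected.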

Loading a .tsv dataset

# Load TSV data
# Unlike csv, the fields are tab-separated, so sep='\t' is required
China_tsv = pd.read_csv('pandas数据包/china.tsv', sep='\t')
print(China_tsv)
print(type(China_tsv))

'''
          country continent  year  lifeExp       pop   gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333  779.445314
1     Afghanistan      Asia  1957   30.332   9240934  820.853030
2     Afghanistan      Asia  1962   31.997  10267083  853.100710
3     Afghanistan      Asia  1967   34.020  11537966  836.197138
4     Afghanistan      Asia  1972   36.088  13079460  739.981106
...           ...       ...   ...      ...       ...         ...
1699     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701     Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702     Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703     Zimbabwe    Africa  2007   43.487  12311143  469.709298
'''
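To see why sep='\t' matters, here is a small sketch that parses the same tab-separated text with and without it. The two-row text and the names wrong and right are made up for the demonstration.

```python
import io
import pandas as pd

# A tiny tab-separated text standing in for china.tsv (illustrative only)
tsv_text = "country\tcontinent\tyear\nAfghanistan\tAsia\t1952\nZimbabwe\tAfrica\t2007\n"

wrong = pd.read_csv(io.StringIO(tsv_text))            # default sep=',': tabs are not split
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # split on tabs

print(wrong.shape)          # everything lands in a single column
print(right.shape)          # three proper columns
print(list(right.columns))
```

Without sep='\t' no error is raised; the whole line simply ends up in one column, so checking the shape after loading is worthwhile.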

Getting the row labels of a DataFrame

# Get the row labels of the DataFrame
print(China_tsv.index)
'''RangeIndex(start=0, stop=1704, step=1)'''

Getting the column labels of a DataFrame

# Get the column labels of the DataFrame
print(China_tsv.columns)
'''Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')'''

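Setting the row labels of a DataFrame (in effect, promoting an existing column to be the row index)

# Set the row labels of the DataFrame
# *** Note: set_index does not modify the original DataFrame; it returns a new DataFrame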
China_df = China_tsv.set_index('year')
print(China_df)
'''      
          country continent  lifeExp       pop   gdpPercap
year                                                      
1952  Afghanistan      Asia   28.801   8425333  779.445314
1957  Afghanistan      Asia   30.332   9240934  820.853030
1962  Afghanistan      Asia   31.997  10267083  853.100710
1967  Afghanistan      Asia   34.020  11537966  836.197138
1972  Afghanistan      Asia   36.088  13079460  739.981106
...           ...       ...      ...       ...         ...
1987     Zimbabwe    Africa   62.351   9216418  706.157306
1992     Zimbabwe    Africa   60.377  10704340  693.420786
1997     Zimbabwe    Africa   46.809  11404948  792.449960
2002     Zimbabwe    Africa   39.989  11926563  672.038623
2007     Zimbabwe    Africa   43.487  12311143  469.709298
'''
# Fetching the row labels again shows they are now the values of the year column
print(China_df.index)
'''
Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,
            ...
            1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007],
           dtype='int64', name='year', length=1704)
'''
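The point that set_index returns a new DataFrame can be checked directly. The sketch below uses a small hand-built frame (the names df, indexed and restored are mine) and also shows reset_index, which moves the index back into an ordinary column.

```python
import pandas as pd

# Small frame built by hand; values taken from the China rows used later in these notes
df = pd.DataFrame({
    'year': [1952, 1957, 1962],
    'country': ['China', 'China', 'China'],
    'lifeExp': [44.0, 50.54896, 44.50136],
})

indexed = df.set_index('year')   # new DataFrame; df itself is untouched

print(list(df.index))             # original still has positions 0, 1, 2 as labels
print('year' in df.columns)       # year is still a column of the original
print(indexed.index.name)         # the new frame uses year as its index
print('year' in indexed.columns)  # and year is no longer a column there

restored = indexed.reset_index()  # turn the index back into a column
print(list(restored.columns))
```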

Four ways to select data

Using loc to select specified rows and columns

# loc selects by row and column labels
# Select the rows labelled 1952, 1962, 1972 and the columns country, pop, gdpPercap
print(China_df.loc[[1952, 1962, 1972], ['country', 'pop', 'gdpPercap']])
'''
                 country       pop    gdpPercap
year                                           
1952         Afghanistan   8425333   779.445314
1952             Albania   1282697  1601.056136
1952             Algeria   9279525  2449.008185
1952              Angola   4232095  3520.610273
1952           Argentina  17876956  5911.315053
...                  ...       ...          ...
1972             Vietnam  44655014   699.501644
1972  West Bank and Gaza   1089572  3133.409277
1972         Yemen, Rep.   7407075  1265.047031
1972              Zambia   4506497  1773.498265
1972            Zimbabwe   5861135   799.362176

[426 rows x 3 columns]
'''

Using loc to select the country, pop, gdpPercap columns for all rows

# Select the country, pop, gdpPercap columns for all rows
print(China_df.loc[:, ['country', 'pop', 'gdpPercap']])
'''
          country       pop   gdpPercap
year                                   
1952  Afghanistan   8425333  779.445314
1957  Afghanistan   9240934  820.853030
1962  Afghanistan  10267083  853.100710
1967  Afghanistan  11537966  836.197138
1972  Afghanistan  13079460  739.981106
...           ...       ...         ...
1987     Zimbabwe   9216418  706.157306
1992     Zimbabwe  10704340  693.420786
1997     Zimbabwe  11404948  792.449960
2002     Zimbabwe  11926563  672.038623
2007     Zimbabwe  12311143  469.709298

[1704 rows x 3 columns]

'''

Using loc to select all columns of the rows labelled 1957

# Select all columns of the rows labelled 1957
print(China_df.loc[[1957]])
print(type(China_df.loc[[1957]]))

'''
                 country continent  lifeExp       pop    gdpPercap
year                                                              
1957         Afghanistan      Asia   30.332   9240934   820.853030
1957             Albania    Europe   59.280   1476505  1942.284244
1957             Algeria    Africa   45.685  10270856  3013.976023
1957              Angola    Africa   31.999   4561361  3827.940465
1957           Argentina  Americas   64.399  19610538  6856.856212
...                  ...       ...      ...       ...          ...
1957             Vietnam      Asia   42.887  28998543   676.285448
1957  West Bank and Gaza      Asia   45.671   1070439  1827.067742
1957         Yemen, Rep.      Asia   33.970   5498090   804.830455
1957              Zambia    Africa   44.077   3016000  1311.956766
1957            Zimbabwe    Africa   50.469   3646340   518.764268

[142 rows x 5 columns]

<class 'pandas.core.frame.DataFrame'>
'''

Using loc to select lifeExp (life expectancy) for the rows labelled 1957

# Select lifeExp (life expectancy) for the rows labelled 1957
print(China_df.loc[[1957],['lifeExp']])
'''
      lifeExp
year         
1957   30.332
1957   59.280
1957   45.685
1957   31.999
1957   64.399
...       ...
1957   42.887
1957   45.671
1957   33.970
1957   44.077
1957   50.469

[142 rows x 1 columns]

'''
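All the loc calls above pass lists, which always gives back a DataFrame. As a rough sketch of what changes when a bare label is passed instead, consider the small frame below (built by hand from values shown in the outputs above; the variable names are mine).

```python
import pandas as pd

# 1952 appears once, 1957 appears twice, mimicking the repeated years in china.tsv
df = pd.DataFrame(
    {'country': ['Afghanistan', 'Afghanistan', 'Albania'],
     'lifeExp': [28.801, 30.332, 59.280]},
    index=pd.Index([1952, 1957, 1957], name='year'),
)

a = df.loc[[1952]]               # list of labels: DataFrame, even for a single row
b = df.loc[1952]                 # bare label that occurs once: Series
c = df.loc[1957]                 # bare label that occurs twice: DataFrame with two rows
d = df.loc[[1957], ['lifeExp']]  # lists on both axes: DataFrame
e = df.loc[1952, 'lifeExp']      # bare row label and bare column label: single value

for name, obj in [('a', a), ('b', b), ('c', c), ('d', d), ('e', e)]:
    print(name, type(obj).__name__)
```

Using lists on both axes is the most predictable choice when later code expects a DataFrame.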

Using iloc to select specified rows and columns

# iloc selects by position: rows at positions 0, 2, 4 and columns at positions 0, 1, 2
print(China_df.iloc[[0, 2, 4], [0, 1, 2]])
'''
          country continent  lifeExp
year                                
1952  Afghanistan      Asia   28.801
1962  Afghanistan      Asia   31.997
1972  Afghanistan      Asia   36.088
'''

Using iloc to select all columns of the rows at positions 0, 2, 4

# iloc: all columns of the rows at positions 0, 2, 4
print(China_df.iloc[[0,2,4]])
'''
          country continent  lifeExp       pop   gdpPercap
year                                                      
1952  Afghanistan      Asia   28.801   8425333  779.445314
1962  Afghanistan      Asia   31.997  10267083  853.100710
1972  Afghanistan      Asia   36.088  13079460  739.981106
'''

Using iloc to select the columns at positions 0, 1, 2 for all rows

# iloc: the columns at positions 0, 1, 2 for all rows
print(China_df.iloc[:, [0, 1, 2]])
'''
          country continent  lifeExp
year                                
1952  Afghanistan      Asia   28.801
1957  Afghanistan      Asia   30.332
1962  Afghanistan      Asia   31.997
1967  Afghanistan      Asia   34.020
1972  Afghanistan      Asia   36.088
...           ...       ...      ...
1987     Zimbabwe    Africa   62.351
1992     Zimbabwe    Africa   60.377
1997     Zimbabwe    Africa   46.809
2002     Zimbabwe    Africa   39.989
2007     Zimbabwe    Africa   43.487

[1704 rows x 3 columns]
'''

Using iloc to select all columns of the row at position 1

# iloc: all columns of the row at position 1
print(China_df.iloc[[1]])
print(type(China_df.iloc[[1]]))
'''
          country continent  lifeExp      pop  gdpPercap
year                                                    
1957  Afghanistan      Asia   30.332  9240934  820.85303
<class 'pandas.core.frame.DataFrame'>
'''
# Without the inner brackets the result is a Series rather than a DataFrame
print(China_df.iloc[1])
print(type(China_df.iloc[1]))
'''
country      Afghanistan
continent           Asia
lifeExp           30.332
pop              9240934
gdpPercap      820.85303
Name: 1957, dtype: object
<class 'pandas.core.series.Series'>
'''

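Using iloc to select the value at row position 1, column position 2

# iloc: the value at row position 1, column position 2
# Result when the brackets around the row position are omitted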
print(China_df.iloc[1,[2]])
print(type(China_df.iloc[1,[2]]))
'''
lifeExp    30.332
Name: 1957, dtype: object
<class 'pandas.core.series.Series'>
'''
# Result when the brackets around the column position are omitted
print(China_df.iloc[[1],2])
print(type(China_df.iloc[[1],2]))
'''
year
1957    30.332
Name: lifeExp, dtype: float64
<class 'pandas.core.series.Series'>
'''
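The bracket combinations above can be summarised in one small experiment. This is only a sketch on a two-row frame built by hand (names are mine); it lists the four ways of writing row position 1 and column position 2 and what each returns.

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['Afghanistan', 'Afghanistan'],
     'continent': ['Asia', 'Asia'],
     'lifeExp': [28.801, 30.332]},
    index=pd.Index([1952, 1957], name='year'),
)

both_lists = df.iloc[[1], [2]]  # DataFrame with one row and one column
row_scalar = df.iloc[1, [2]]    # Series indexed by column name (cut from a row)
col_scalar = df.iloc[[1], 2]    # Series indexed by row label (cut from a column)
both_scalar = df.iloc[1, 2]     # the single value itself

for obj in (both_lists, row_scalar, col_scalar, both_scalar):
    print(type(obj).__name__)
print(col_scalar.dtype)         # the column's own dtype is preserved
```

This also explains the dtype difference in the outputs above: a Series cut from a row mixes strings and numbers, while a Series cut from the lifeExp column keeps that column's float64 type.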

Note: for the slicing examples below to run correctly, replace the dataset with the following data (copy, paste, and save it as china.csv):

country  continent  year  lifeExp   pop         gdpPercap
China    Asia       1952  44        556263527   400.448611
China    Asia       1957  50.54896  637408000   575.9870009
China    Asia       1962  44.50136  665770000   487.6740183
China    Asia       1967  58.38112  754550000   612.7056934
China    Asia       1972  63.11888  862030000   676.9000921
China    Asia       1977  63.96736  943455000   741.2374699
China    Asia       1982  65.525    1000281000  962.4213805
China    Asia       1987  67.274    1084035000  1378.904018
China    Asia       1992  68.69     1164970000  1655.784158
China    Asia       1997  70.426    1230075000  2289.234136
China    Asia       2002  72.028    1280400000  3119.280896
China    Asia       2007  72.961    1318683096  4959.114854

The original dataset contains many repeated years, which makes label slicing fail with: KeyError: 'Cannot get left slice bound for non-unique label: xxxx'
With the original china.tsv the error reports that 1952 is not unique. A label slice with loc can only be located when its start and end labels each pin down a single position in the index.
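The notes do not show how the replacement data is loaded, so the following is only a sketch under my own assumptions: the first three rows are inlined as whitespace-separated text and parsed with sep=r'\s+', and the names China_sample and dup are made up. It then reproduces the KeyError on an index with repeated, unsorted labels.

```python
import io
import pandas as pd

# First three rows of the China data listed above, whitespace-separated
china_text = """country continent year lifeExp pop gdpPercap
China Asia 1952 44 556263527 400.448611
China Asia 1957 50.54896 637408000 575.9870009
China Asia 1962 44.50136 665770000 487.6740183
"""
China_sample = pd.read_csv(io.StringIO(china_text), sep=r'\s+').set_index('year')
print(China_sample.loc[1952:1957])  # unique labels: the label slice works

# Repeated, unsorted labels, like the years in the full china.tsv (values are placeholders)
dup = pd.DataFrame({'value': [1, 2, 3, 4]}, index=[1952, 1957, 1952, 1957])
try:
    dup.loc[1952:1957]
    slice_failed = False
except KeyError as exc:
    slice_failed = True
    print('KeyError:', exc)

# After sorting, equal labels sit next to each other and the slice can be located
sorted_rows = dup.sort_index().loc[1952:1957]
print(len(sorted_rows))
```

Sorting the index is therefore an alternative to swapping the dataset, although the result then contains every row for every year in the range.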

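Slicing with loc and iloc

Selecting the first three rows and first three columns of China_df

# Example: select the first three rows and first three columns of China_df
#    With loc: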
print(China_df.loc[1952:1962,'country':'lifeExp'])
'''
     country continent   lifeExp
year                            
1952   China      Asia  44.00000
1957   China      Asia  50.54896
1962   China      Asia  44.50136
'''
#    With iloc:
print(China_df.iloc[0:3, 0:3])

'''
     country continent   lifeExp
year                            
1952   China      Asia  44.00000
1957   China      Asia  50.54896
1962   China      Asia  44.50136
'''
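Note the difference between the two slices above: the loc slice 1952:1962 includes its end label, while the iloc slice 0:3 stops before position 3. A small sketch on a hand-built four-row frame (values taken from the China data; names are mine) makes this explicit.

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['China'] * 4,
     'continent': ['Asia'] * 4,
     'lifeExp': [44.0, 50.54896, 44.50136, 58.38112],
     'pop': [556263527, 637408000, 665770000, 754550000]},
    index=pd.Index([1952, 1957, 1962, 1967], name='year'),
)

by_label = df.loc[1952:1962, 'country':'lifeExp']  # label slice: both ends included
by_position = df.iloc[0:3, 0:3]                    # position slice: end excluded

print(by_label)
print(by_position)
print(by_label.equals(by_position))  # the two selections hold the same data
```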

[] syntax for selecting specified rows and columns

Select the country, pop, gdpPercap columns for all rows

print(China_df[['country', 'pop', 'gdpPercap']])
'''
     country         pop    gdpPercap
year                                 
1952   China   556263527   400.448611
1957   China   637408000   575.987001
1962   China   665770000   487.674018
1967   China   754550000   612.705693
1972   China   862030000   676.900092
1977   China   943455000   741.237470
1982   China  1000281000   962.421381
1987   China  1084035000  1378.904018
1992   China  1164970000  1655.784158
1997   China  1230075000  2289.234136
2002   China  1280400000  3119.280896
2007   China  1318683096  4959.114854
'''

Select the pop column for all rows

print(China_df[['pop']])
'''
             pop
year            
1952   556263527
1957   637408000
1962   665770000
1967   754550000
1972   862030000
1977   943455000
1982  1000281000
1987  1084035000
1992  1164970000
1997  1230075000
2002  1280400000
2007  1318683096
'''

Dropping the inner brackets from print(China_df[['pop']]) returns a Series instead of a DataFrame

print(China_df['pop'])
print(type(China_df['pop']))
'''
year
1952     556263527
1957     637408000
1962     665770000
1967     754550000
1972     862030000
1977     943455000
1982    1000281000
1987    1084035000
1992    1164970000
1997    1230075000
2002    1280400000
2007    1318683096
Name: pop, dtype: int64
<class 'pandas.core.series.Series'>
'''
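A quick way to confirm the difference between the two bracket forms is to compare their types, as in this sketch on a two-row frame built by hand (names are mine). It also touches on a related pitfall that is my own addition rather than part of the course notes: pop is the name of a DataFrame method, so the attribute shortcut df.pop does not give the column.

```python
import pandas as pd

df = pd.DataFrame(
    {'country': ['China', 'China'], 'pop': [556263527, 637408000]},
    index=pd.Index([1952, 1957], name='year'),
)

as_frame = df[['pop']]   # list of column names: DataFrame with one column
as_series = df['pop']    # single column name: Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.name)

# df.pop is the DataFrame.pop method, not the 'pop' column,
# so bracket syntax is needed for this column
print(callable(df.pop))
```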

Select the first three rows

print(China_df[0:3])
'''
     country continent   lifeExp        pop   gdpPercap
year                                                   
1952   China      Asia  44.00000  556263527  400.448611
1957   China      Asia  50.54896  637408000  575.987001
1962   China      Asia  44.50136  665770000  487.674018
'''

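Starting from the first row, take every other row, three rows in total

China_df[start:stop:step] selects all columns of the rows in the given positional range; the row at the stop position is excluded, and the optional step sets the stride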

print(China_df[0:5:2])

'''
     country continent   lifeExp        pop   gdpPercap
year                                                   
1952   China      Asia  44.00000  556263527  400.448611
1962   China      Asia  44.50136  665770000  487.674018
1972   China      Asia  63.11888  862030000  676.900092
'''
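Even though the index here holds integers (years), a slice inside [] is read as row positions, not labels. The sketch below, on a hand-built six-row frame (names are mine), checks that df[0:5:2] matches the equivalent iloc slice.

```python
import pandas as pd

df = pd.DataFrame(
    {'lifeExp': [44.0, 50.54896, 44.50136, 58.38112, 63.11888, 63.96736]},
    index=pd.Index([1952, 1957, 1962, 1967, 1972, 1977], name='year'),
)

stepped = df[0:5:2]      # positions 0, 2, 4; position 5 is excluded
same = df.iloc[0:5:2]    # the explicit positional form

print(stepped)
print(stepped.equals(same))
```

Writing the iloc form spells out that positions are meant, which can be easier to read when the index itself contains integers.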