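Learning objectives:
Course notes: Itheima (黑马程序员) Python big data course, Pandas Quick Start 01
- Know the DataFrame and Series data structures
- Load CSV and TSV datasets
- Distinguish a DataFrame's row/column labels from its row/column positions
- Select specific rows and columns of a DataFrame with loc, iloc, loc/iloc slicing, and the [] syntax
The original post is 8,497 characters long and is aimed at beginners only. If it helps you, I am honored!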
Contents:
pandas version 1.1.3 (the version used in the examples)
'''pip install pandas==1.1.3'''
import pandas as pd
Example dataset link:
Netdisk link (extraction code: q74d): https://pan.baidu.com/s/10gTn1ve26_za6KgL_7DUDA?pwd=q74d
Loading a .csv dataset
# Load the CSV data
tips = pd.read_csv('pandas数据包/tips.csv')
# The quoted string is the path to the dataset; replace it with your own path
print(tips)
print(type(tips))
''' total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
'''
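If you do not have the dataset at hand, the same call can be tried on an in-memory string. This is only a minimal sketch: the three rows are copied from the output above, and the names csv_text and sample_tips are made up for illustration.

```python
import io

import pandas as pd

# Three sample rows in the same layout as tips.csv (illustrative subset only)
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

# read_csv accepts a file path or any file-like object
sample_tips = pd.read_csv(io.StringIO(csv_text))

print(type(sample_tips))   # a DataFrame, just like when reading from a path
print(sample_tips.shape)   # (number of rows, number of columns)
print(sample_tips.dtypes)  # pandas infers a dtype for every column
```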
Loading a .tsv dataset
# Load the TSV data
China_tsv = pd.read_csv('pandas数据包/china.tsv', sep='\t')  # unlike CSV, the fields are separated by tabs, so pass sep='\t'
print(China_tsv)
print(type(China_tsv))
'''
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe Africa 1987 62.351 9216418 706.157306
1700 Zimbabwe Africa 1992 60.377 10704340 693.420786
1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298
'''
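To see why sep='\t' matters, here is a small sketch that reads the same tab-separated text with and without it. The text and the names tsv_text, with_sep and without_sep are assumptions made for this illustration, not part of the course material.

```python
import io

import pandas as pd

# Two sample rows separated by tabs (illustrative subset only)
tsv_text = "country\tcontinent\tyear\nAfghanistan\tAsia\t1952\nZimbabwe\tAfrica\t2007\n"

# With sep='\t' every tab starts a new column
with_sep = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default separator (a comma) each line stays in a single column
without_sep = pd.read_csv(io.StringIO(tsv_text))

print(with_sep.shape)     # three columns
print(without_sep.shape)  # one column
```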
Getting the row labels of a DataFrame
# Get the row labels of the DataFrame
print(China_tsv.index)
'''RangeIndex(start=0, stop=1704, step=1)'''
Getting the column labels of a DataFrame
# Get the column labels of the DataFrame
print(China_tsv.columns)
'''Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')'''
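Setting the row labels of a DataFrame (more precisely, using the values of one column as the row labels)
# Set the row labels of the DataFrame
# *** Note: set_index does not modify the original DataFrame; it returns a new DataFrame that uses the new row labels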
China_df = China_tsv.set_index('year')
print(China_df)
'''
country continent lifeExp pop gdpPercap
year
1952 Afghanistan Asia 28.801 8425333 779.445314
1957 Afghanistan Asia 30.332 9240934 820.853030
1962 Afghanistan Asia 31.997 10267083 853.100710
1967 Afghanistan Asia 34.020 11537966 836.197138
1972 Afghanistan Asia 36.088 13079460 739.981106
... ... ... ... ... ...
1987 Zimbabwe Africa 62.351 9216418 706.157306
1992 Zimbabwe Africa 60.377 10704340 693.420786
1997 Zimbabwe Africa 46.809 11404948 792.449960
2002 Zimbabwe Africa 39.989 11926563 672.038623
2007 Zimbabwe Africa 43.487 12311143 469.709298
'''
# Reading the row labels again shows that they are now the values of the year column
print(China_df.index)
'''
Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,
...
1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007],
dtype='int64', name='year', length=1704)
'''
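The point that set_index leaves the original DataFrame untouched can be checked with a small sketch. The three rows come from the output above; the names df and indexed are made up for this illustration.

```python
import pandas as pd

# Three sample rows (illustrative subset only)
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

# set_index returns a new DataFrame; df itself is not changed
indexed = df.set_index("year")

print(df.index)                   # still the default RangeIndex
print(list(df.columns))           # year is still an ordinary column in df
print(indexed.index.name)         # year
print(list(indexed.columns))      # year is no longer a column here
```

Because a new object is returned, the result has to be assigned to a variable, as China_df is above.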
Four ways to select data
loc: selecting rows and columns by label
# loc selects rows and columns by label
# Get the rows labeled 1952, 1962, 1972 and the columns country, pop, gdpPercap
print(China_df.loc[[1952, 1962, 1972], ['country', 'pop', 'gdpPercap']])
'''
country pop gdpPercap
year
1952 Afghanistan 8425333 779.445314
1952 Albania 1282697 1601.056136
1952 Algeria 9279525 2449.008185
1952 Angola 4232095 3520.610273
1952 Argentina 17876956 5911.315053
... ... ... ...
1972 Vietnam 44655014 699.501644
1972 West Bank and Gaza 1089572 3133.409277
1972 Yemen, Rep. 7407075 1265.047031
1972 Zambia 4506497 1773.498265
1972 Zimbabwe 5861135 799.362176
[426 rows x 3 columns]
'''
loc: the country, pop, gdpPercap columns for all rows
# Get the country, pop, gdpPercap columns for all rows
print(China_df.loc[:, ['country', 'pop', 'gdpPercap']])
'''
country pop gdpPercap
year
1952 Afghanistan 8425333 779.445314
1957 Afghanistan 9240934 820.853030
1962 Afghanistan 10267083 853.100710
1967 Afghanistan 11537966 836.197138
1972 Afghanistan 13079460 739.981106
... ... ... ...
1987 Zimbabwe 9216418 706.157306
1992 Zimbabwe 10704340 693.420786
1997 Zimbabwe 11404948 792.449960
2002 Zimbabwe 11926563 672.038623
2007 Zimbabwe 12311143 469.709298
[1704 rows x 3 columns]
'''
loc: all columns of the rows labeled 1957
# Get all columns of the rows labeled 1957
print(China_df.loc[[1957]])
print(type(China_df.loc[[1957]]))
'''
country continent lifeExp pop gdpPercap
year
1957 Afghanistan Asia 30.332 9240934 820.853030
1957 Albania Europe 59.280 1476505 1942.284244
1957 Algeria Africa 45.685 10270856 3013.976023
1957 Angola Africa 31.999 4561361 3827.940465
1957 Argentina Americas 64.399 19610538 6856.856212
... ... ... ... ... ...
1957 Vietnam Asia 42.887 28998543 676.285448
1957 West Bank and Gaza Asia 45.671 1070439 1827.067742
1957 Yemen, Rep. Asia 33.970 5498090 804.830455
1957 Zambia Africa 44.077 3016000 1311.956766
1957 Zimbabwe Africa 50.469 3646340 518.764268
[142 rows x 5 columns]
<class 'pandas.core.frame.DataFrame'>
'''
loc: the lifeExp (life expectancy) values of the rows labeled 1957
# Get the lifeExp (life expectancy) values of the rows labeled 1957
print(China_df.loc[[1957],['lifeExp']])
'''
lifeExp
year
1957 30.332
1957 59.280
1957 45.685
1957 31.999
1957 64.399
... ...
1957 42.887
1957 45.671
1957 33.970
1957 44.077
1957 50.469
[142 rows x 1 columns]
'''
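The loc examples above return many rows for a single label because the year labels repeat. A minimal sketch of that behavior, using four rows taken from the outputs above (the names df, rows_1957, pop_1957 and pop_frame are made up for illustration):

```python
import pandas as pd

# Four sample rows whose year labels repeat (illustrative subset only)
df = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
        "pop": [8425333, 9240934, 1282697, 1476505],
    },
    index=pd.Index([1952, 1957, 1952, 1957], name="year"),
)

# A list of labels always gives a DataFrame with every matching row
rows_1957 = df.loc[[1957]]

# A bare label plus a bare column name gives a Series here,
# because the label 1957 matches more than one row
pop_1957 = df.loc[1957, "pop"]

# Lists on both sides give a DataFrame with a single column
pop_frame = df.loc[[1957], ["pop"]]

print(rows_1957)
print(type(pop_1957))
print(pop_frame.shape)
```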
iloc: selecting rows and columns by position
# iloc: get the columns at positions 0, 1, 2 of the rows at positions 0, 2, 4
print(China_df.iloc[[0, 2, 4], [0, 1, 2]])
'''
country continent lifeExp
year
1952 Afghanistan Asia 28.801
1962 Afghanistan Asia 31.997
1972 Afghanistan Asia 36.088
'''
iloc: all columns of the rows at positions 0, 2, 4
# iloc: get all columns of the rows at positions 0, 2, 4
print(China_df.iloc[[0,2,4]])
'''
country continent lifeExp pop gdpPercap
year
1952 Afghanistan Asia 28.801 8425333 779.445314
1962 Afghanistan Asia 31.997 10267083 853.100710
1972 Afghanistan Asia 36.088 13079460 739.981106
'''
iloc: the columns at positions 0, 1, 2 for all rows
# iloc: get the columns at positions 0, 1, 2 for all rows
print(China_df.iloc[:, [0, 1, 2]])
'''
country continent lifeExp
year
1952 Afghanistan Asia 28.801
1957 Afghanistan Asia 30.332
1962 Afghanistan Asia 31.997
1967 Afghanistan Asia 34.020
1972 Afghanistan Asia 36.088
... ... ... ...
1987 Zimbabwe Africa 62.351
1992 Zimbabwe Africa 60.377
1997 Zimbabwe Africa 46.809
2002 Zimbabwe Africa 39.989
2007 Zimbabwe Africa 43.487
[1704 rows x 3 columns]
'''
iloc: all columns of the row at position 1
# iloc: get all columns of the row at position 1
print(China_df.iloc[[1]])
print(type(China_df.iloc[[1]]))
'''
country continent lifeExp pop gdpPercap
year
1957 Afghanistan Asia 30.332 9240934 820.85303
<class 'pandas.core.frame.DataFrame'>
'''
# Without the inner brackets the result is a Series, so the type is slightly different
print(China_df.iloc[1])
print(type(China_df.iloc[1]))
'''
country Afghanistan
continent Asia
lifeExp 30.332
pop 9240934
gdpPercap 820.85303
Name: 1957, dtype: object
<class 'pandas.core.series.Series'>
'''
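iloc: the row at position 1 and the column at position 2
# iloc: get the row at position 1 and the column at position 2
# Dropping the brackets around the row position gives a Series indexed by column name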
print(China_df.iloc[1,[2]])
print(type(China_df.iloc[1,[2]]))
'''
lifeExp 30.332
Name: 1957, dtype: object
<class 'pandas.core.series.Series'>
'''
# Dropping the brackets around the column position gives a Series indexed by row label
print(China_df.iloc[[1],2])
print(type(China_df.iloc[[1],2]))
'''
year
1957 30.332
Name: lifeExp, dtype: float64
<class 'pandas.core.series.Series'>
'''
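The bracket rules for iloc can be summarized in one small sketch: a list keeps that dimension, a bare integer drops it. The rows come from the outputs above, and the names df, as_frame, as_series and as_scalar are made up for illustration.

```python
import pandas as pd

# Three sample rows (illustrative subset only)
df = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
        "continent": ["Asia", "Asia", "Asia"],
        "lifeExp": [28.801, 30.332, 31.997],
    },
    index=pd.Index([1952, 1957, 1962], name="year"),
)

as_frame = df.iloc[[1], [2]]   # lists on both sides: DataFrame with one cell
as_series = df.iloc[1]         # bare row position: Series of that row
as_scalar = df.iloc[1, 2]      # bare positions on both sides: a single value

print(type(as_frame))
print(type(as_series))
print(as_scalar)
```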
Note: so that the slicing examples below run correctly, replace the dataset with the data below (copy, paste, and save it to china.csv):
country continent year lifeExp pop gdpPercap
China Asia 1952 44 556263527 400.448611
China Asia 1957 50.54896 637408000 575.9870009
China Asia 1962 44.50136 665770000 487.6740183
China Asia 1967 58.38112 754550000 612.7056934
China Asia 1972 63.11888 862030000 676.9000921
China Asia 1977 63.96736 943455000 741.2374699
China Asia 1982 65.525 1000281000 962.4213805
China Asia 1987 67.274 1084035000 1378.904018
China Asia 1992 68.69 1164970000 1655.784158
China Asia 1997 70.426 1230075000 2289.234136
China Asia 2002 72.028 1280400000 3119.280896
China Asia 2007 72.961 1318683096 4959.114854
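The reason is that the source dataset contains every year many times, and slicing it by label raises: KeyError: 'Cannot get left slice bound for non-unique label: xxxx'
With the original china.tsv the error reports that the label 1952 is not unique: when the row labels are unsorted, a label slice needs unique start and end labels so that pandas can locate the slice boundaries.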
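The error can be reproduced on a tiny frame. The sketch below also shows one alternative to swapping the dataset, namely sorting the index first; that is a different technique from the one used in these notes, shown only for comparison. The names df, error_message and sorted_df are made up for illustration.

```python
import pandas as pd

# Four sample rows with repeated, unsorted year labels (illustrative subset only)
df = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
        "pop": [8425333, 9240934, 1282697, 1476505],
    },
    index=pd.Index([1952, 1957, 1952, 1957], name="year"),
)

error_message = None
try:
    # Label slice on a non-unique, unsorted index
    df.loc[1952:1957]
except KeyError as exc:
    error_message = str(exc)
    print("KeyError:", error_message)

# After sorting, equal labels sit next to each other and the slice is well defined
sorted_df = df.sort_index()
print(sorted_df.loc[1952:1957])
```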
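Slicing with loc and iloc
Getting the first three rows and first three columns of China_df
# Example: get the first three rows and first three columns of China_df
# With loc (label slices include the end label):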
print(China_df.loc[1952:1962,'country':'lifeExp'])
'''
country continent lifeExp
year
1952 China Asia 44.00000
1957 China Asia 50.54896
1962 China Asia 44.50136
'''
# With iloc (position slices exclude the end position):
print(China_df.iloc[0:3, 0:3])
'''
country continent lifeExp
year
1952 China Asia 44.00000
1957 China Asia 50.54896
1962 China Asia 44.50136
'''
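Both slices give the same three rows even though the end values look different, because a loc slice includes its end label while an iloc slice excludes its end position. A minimal sketch using the first four China rows from the data above (the names df, by_label and by_position are made up for illustration):

```python
import pandas as pd

# First four rows of the China data (illustrative subset only)
df = pd.DataFrame(
    {
        "country": ["China", "China", "China", "China"],
        "continent": ["Asia", "Asia", "Asia", "Asia"],
        "lifeExp": [44.0, 50.54896, 44.50136, 58.38112],
        "pop": [556263527, 637408000, 665770000, 754550000],
    },
    index=pd.Index([1952, 1957, 1962, 1967], name="year"),
)

# Label slice: 1962 and lifeExp are both included
by_label = df.loc[1952:1962, "country":"lifeExp"]

# Position slice: position 3 is excluded on both axes
by_position = df.iloc[0:3, 0:3]

print(by_label.shape)
print(by_position.shape)
print(by_label.equals(by_position))
```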
Selecting rows and columns with the [] syntax
Get the country, pop, gdpPercap columns for all rows
print(China_df[['country', 'pop', 'gdpPercap']])
'''
country pop gdpPercap
year
1952 China 556263527 400.448611
1957 China 637408000 575.987001
1962 China 665770000 487.674018
1967 China 754550000 612.705693
1972 China 862030000 676.900092
1977 China 943455000 741.237470
1982 China 1000281000 962.421381
1987 China 1084035000 1378.904018
1992 China 1164970000 1655.784158
1997 China 1230075000 2289.234136
2002 China 1280400000 3119.280896
2007 China 1318683096 4959.114854
'''
Get the pop column for all rows
print(China_df[['pop']])
'''
pop
year
1952 556263527
1957 637408000
1962 665770000
1967 754550000
1972 862030000
1977 943455000
1982 1000281000
1987 1084035000
1992 1164970000
1997 1230075000
2002 1280400000
2007 1318683096
'''
Dropping the inner brackets in print(China_df[['pop']]) gives a Series instead of a DataFrame
print(China_df['pop'])
print(type(China_df['pop']))
'''
year
1952 556263527
1957 637408000
1962 665770000
1967 754550000
1972 862030000
1977 943455000
1982 1000281000
1987 1084035000
1992 1164970000
1997 1230075000
2002 1280400000
2007 1318683096
Name: pop, dtype: int64
<class 'pandas.core.series.Series'>
'''
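The difference between the two spellings is easiest to see from the shape of the result. A minimal sketch with three China rows from the data above (the names df, pop_series and pop_frame are made up for illustration):

```python
import pandas as pd

# Three rows of the China data (illustrative subset only)
df = pd.DataFrame(
    {
        "country": ["China", "China", "China"],
        "pop": [556263527, 637408000, 665770000],
    },
    index=pd.Index([1952, 1957, 1962], name="year"),
)

pop_series = df["pop"]    # column name: one-dimensional Series
pop_frame = df[["pop"]]   # list of column names: two-dimensional DataFrame

print(pop_series.shape)   # one dimension
print(pop_frame.shape)    # rows and columns
```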
Get the first three rows
print(China_df[0:3])
'''
country continent lifeExp pop gdpPercap
year
1952 China Asia 44.00000 556263527 400.448611
1957 China Asia 50.54896 637408000 575.987001
1962 China Asia 44.50136 665770000 487.674018
'''
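Starting from the first row, take every second row, three rows in total
China_df[start:stop:step] selects all columns of the rows in the given position range, taking every step-th row and excluding the stop position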
print(China_df[0:5:2])
'''
country continent lifeExp pop gdpPercap
year
1952 China Asia 44.00000 556263527 400.448611
1962 China Asia 44.50136 665770000 487.674018
1972 China Asia 63.11888 862030000 676.900092
'''
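As a cross-check, slicing rows with [] gives the same result as the corresponding iloc slice on this kind of integer year index. A minimal sketch with five China rows from the data above (the names df, with_brackets and with_iloc are made up for illustration):

```python
import pandas as pd

# First five rows of the China data (illustrative subset only)
df = pd.DataFrame(
    {
        "country": ["China"] * 5,
        "lifeExp": [44.0, 50.54896, 44.50136, 58.38112, 63.11888],
    },
    index=pd.Index([1952, 1957, 1962, 1967, 1972], name="year"),
)

# Positions 0, 2, 4: start at 0, stop before 5, step 2
with_brackets = df[0:5:2]
with_iloc = df.iloc[0:5:2]

print(list(with_brackets.index))
print(with_brackets.equals(with_iloc))
```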
END Note: this article is only a record of my own studies. If you find any mistakes, please point them out; I would be very grateful!