All examples below assume that pandas has already been imported:
import pandas
We use a CSV file, food_info.csv, as an example to show how pandas reads data.
Read the CSV file:
food_info = pandas.read_csv("food_info.csv")
1. Check the data structure: read_csv returns the data as a DataFrame
print(type(food_info))
OUT:
<class 'pandas.core.frame.DataFrame'>
2. Check the data type of each column. Common pandas dtypes include int, float, object, datetime and bool; object columns usually hold string values
print(food_info.dtypes)
OUT:
NDB_No int64
Shrt_Desc object
Water_(g) float64
Energ_Kcal int64
Protein_(g) float64
Lipid_Tot_(g) float64
…………………………
Cholestrl_(mg) float64
dtype: object
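To make the dtype inference easier to reproduce without the full data file, here is a minimal sketch that reads a tiny in-memory CSV. The three rows are copied from the output shown in this article; the names csv_text, sample and numeric_cols are chosen only for this illustration. The exact dtype reported for the text column may vary with the pandas version.

```python
import io
import pandas

# Small stand-in for food_info.csv (three rows taken from the article's output).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
1004,CHEESE BLUE,42.41,353
"""

# read_csv accepts a file-like object as well as a path.
sample = pandas.read_csv(io.StringIO(csv_text))

# One dtype per column, inferred from the values in that column.
print(sample.dtypes)

# select_dtypes keeps only the columns of the requested kind.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

select_dtypes is a convenient way to separate the numeric columns from the descriptive ones before doing any arithmetic.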
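3. View the official help documentation. Note that help() prints the text itself and returns None, so wrapping it in print() is unnecessary and is what produces the trailing None in the output below
print(help(pandas.read_csv))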
OUT:
Help on function read_csv in module pandas.io.parsers:
…………………………
Returns
-------
result : DataFrame or TextParser
None
4. ① View the first 5 rows of the DataFrame; if an argument n is passed, the first n rows are shown instead
food_info.head()
OUT:
② View the last 5 rows of the DataFrame; if an argument n is passed, the last n rows are shown instead
food_info.tail()
OUT:
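Since no output is shown for head() and tail() here, the following sketch illustrates the effect of the n argument on a small made-up frame. The frame demo and its value column are hypothetical and have nothing to do with food_info.

```python
import pandas

# Hypothetical ten-row frame used only for illustration.
demo = pandas.DataFrame({"value": range(10)})

first_three = demo.head(3)   # rows with index 0, 1, 2
last_two = demo.tail(2)      # rows with index 8, 9
default_head = demo.head()   # n defaults to 5

print(first_three)
print(last_two)
```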
5. Return the column names
print(food_info.columns)
OUT:
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
6. Check the shape of the DataFrame as (number of rows, number of columns)
print(food_info.shape)
OUT:
(8618, 36)
7. View the row with a given index label
print(food_info.loc[0])
OUT:
NDB_No 1001
Shrt_Desc BUTTER WITH SALT
Water_(g) 15.87
Energ_Kcal 717
Protein_(g) 0.85
Lipid_Tot_(g) 81.11
Ash_(g) 2.11
Carbohydrt_(g) 0.06
Fiber_TD_(g) 0
Sugar_Tot_(g) 0.06
Calcium_(mg) 24
Iron_(mg) 0.02
Magnesium_(mg) 2
Phosphorus_(mg) 24
Potassium_(mg) 24
Sodium_(mg) 643
Zinc_(mg) 0.09
Copper_(mg) 0
Manganese_(mg) 0
Selenium_(mcg) 1
Vit_C_(mg) 0
Thiamin_(mg) 0.005
Riboflavin_(mg) 0.034
Niacin_(mg) 0.042
Vit_B6_(mg) 0.003
Vit_B12_(mcg) 0.17
Vit_A_IU 2499
Vit_A_RAE 684
Vit_E_(mg) 2.32
Vit_D_mcg 1.5
Vit_D_IU 60
Vit_K_(mcg) 7
FA_Sat_(g) 51.368
FA_Mono_(g) 21.021
FA_Poly_(g) 3.043
Cholestrl_(mg) 215
Name: 0, dtype: object
8. Return a slice of rows by index label (with loc, both endpoints are included)
print(food_info.loc[3:5])
OUT:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \
3 1004 CHEESE BLUE 42.41 353 21.40 28.74
4 1005 CHEESE BRICK 41.11 371 23.24 29.68
5 1006 CHEESE BRIE 48.42 334 20.75 27.68
Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... \
3 5.11 2.34 0.0 0.50 ...
4 3.18 2.79 0.0 0.51 ...
5 2.70 0.45 0.0 0.45 ...
Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) \
3 721.0 198.0 0.25 0.5 21.0 2.4
4 1080.0 292.0 0.26 0.5 22.0 2.5
5 592.0 174.0 0.24 0.5 20.0 2.3
FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
3 18.669 7.778 0.800 75.0
4 18.764 8.598 0.784 94.0
5 17.410 8.013 0.826 100.0
[3 rows x 36 columns]
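The slice above returns three rows because loc slices by label and includes the end label. A minimal sketch of the difference from position-based iloc, using a made-up six-row frame (only the NDB_No values are borrowed from the article):

```python
import pandas

# Hypothetical frame with the default integer index 0..5.
demo = pandas.DataFrame({"NDB_No": [1001, 1002, 1003, 1004, 1005, 1006]})

by_label = demo.loc[3:5]      # labels 3, 4 and 5: the end label is included
by_position = demo.iloc[3:5]  # positions 3 and 4: the end position is excluded

print(by_label)
print(by_position)
```

With the default index the labels and positions coincide, so the only visible difference is whether the last row is included.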
9. Return several specific rows by passing a list of index labels
two_five_ten = [2, 5, 10]
print(food_info.loc[two_five_ten])
OUT:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28
5 1006 CHEESE BRIE 48.42 334 20.75
10 1011 CHEESE COLBY 38.20 394 23.76
Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
2 99.48 0.00 0.00 0.0 0.00
5 27.68 2.70 0.45 0.0 0.45
10 32.11 3.36 2.57 0.0 0.52
... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU \
2 ... 3069.0 840.0 2.80 1.8 73.0
5 ... 592.0 174.0 0.24 0.5 20.0
10 ... 994.0 264.0 0.28 0.6 24.0
Vit_K_(mcg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
2 8.6 61.924 28.732 3.694 256.0
5 2.3 17.410 8.013 0.826 100.0
10 2.7 20.218 9.280 0.953 95.0
[3 rows x 36 columns]
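When a list of labels is passed to loc, the rows come back in the order of the list rather than in the order of the frame. A small sketch on a hypothetical twelve-row frame (demo and picked are illustrative names):

```python
import pandas

# Hypothetical frame: value is simply ten times the index label.
demo = pandas.DataFrame({"value": [v * 10 for v in range(12)]})

# The labels are deliberately out of order.
picked = demo.loc[[10, 2, 5]]

print(picked)
```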
10. ① View a single column of the DataFrame
food_info["NDB_No"]
OUT:
0 1001
1 1002
2 1003
3 1004
4 1005
5 1006
6 1007
7 1008
8 1009
9 1010
10 1011
11 1012
12 1013
13 1014
14 1015
15 1016
16 1017
17 1018
18 1019
19 1020
20 1021
21 1022
22 1023
23 1024
24 1025
25 1026
26 1027
27 1028
28 1029
29 1030
...
8588 43544
8589 43546
8590 43550
8591 43566
8592 43570
8593 43572
8594 43585
8595 43589
8596 43595
8597 43597
8598 43598
8599 44005
8600 44018
8601 44048
8602 44055
8603 44061
8604 44074
8605 44110
8606 44158
8607 44203
8608 44258
8609 44259
8610 44260
8611 48052
8612 80200
8613 83110
8614 90240
8615 90480
8616 90560
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
② View multiple columns by passing a list of column names
col = ["Ash_(g)", "Fiber_TD_(g)"]
food_info[col]
OUT:
Ash_(g) Fiber_TD_(g)
0 2.11 0.0
1 2.11 0.0
2 0.00 0.0
3 5.11 0.0
4 3.18 0.0
5 2.70 0.0
6 3.68 0.0
7 3.28 0.0
8 3.71 0.0
9 3.60 0.0
10 3.36 0.0
11 1.41 0.0
12 1.20 0.2
13 1.71 0.0
14 1.27 0.0
15 1.39 0.0
16 1.32 0.0
17 4.22 0.0
18 5.20 0.0
19 3.79 0.0
20 4.75 0.0
21 3.94 0.0
22 4.30 0.0
23 3.79 0.0
24 3.55 0.0
25 3.28 0.0
26 2.91 0.0
27 3.27 0.0
28 3.80 0.0
29 3.66 0.0
... ... ...
8588 2.00 2.6
8589 0.76 1.6
8590 0.29 1.0
8591 1.85 5.7
8592 1.22 4.2
8593 1.71 14.2
8594 0.52 2.0
8595 3.50 0.0
8596 0.80 2.1
8597 2.40 0.0
8598 0.40 0.0
8599 0.00 0.0
8600 0.00 0.1
8601 4.74 0.0
8602 13.90 27.8
8603 9.90 6.1
8604 0.22 0.1
8605 0.08 0.8
8606 0.35 2.6
8607 0.07 0.0
8608 5.70 10.1
8609 1.86 0.9
8610 6.80 0.8
8611 1.00 0.6
8612 1.40 0.0
8613 13.40 0.0
8614 2.97 0.0
8615 0.86 0.0
8616 1.30 0.0
8617 1.20 0.0
8618 rows × 2 columns
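The two outputs above look different because a single column name returns a Series, while a list of names returns a DataFrame. A minimal sketch, assuming a made-up three-row frame that reuses a few values from the article:

```python
import pandas

# Hypothetical three-row frame for illustration.
demo = pandas.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Ash_(g)": [2.11, 2.11, 0.00],
    "Fiber_TD_(g)": [0.0, 0.0, 0.0],
})

one_col = demo["Ash_(g)"]                        # a Series
one_col_frame = demo[["Ash_(g)"]]                # a one-column DataFrame
two_cols = demo[["Ash_(g)", "Fiber_TD_(g)"]]     # a two-column DataFrame

print(type(one_col).__name__, type(one_col_frame).__name__)
print(two_cols.shape)
```

Wrapping a single name in a list is a handy way to keep the DataFrame type when later code expects one.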
11. Select the columns in food_info whose unit is g (grams)
col_name = food_info.columns.tolist()
print(col_name)
print("__________")
gram_columns = []
for c in col_name:
    # endswith() checks whether a string ends with the given suffix: True if it does, False otherwise
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head())
OUT:
['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']
__________
Water_(g) Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) \
0 15.87 0.85 81.11 2.11 0.06
1 15.87 0.85 81.11 2.11 0.06
2 0.24 0.28 99.48 0.00 0.00
3 42.41 21.40 28.74 5.11 2.34
4 41.11 23.24 29.68 3.18 2.79
Fiber_TD_(g) Sugar_Tot_(g) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g)
0 0.0 0.06 51.368 21.021 3.043
1 0.0 0.06 50.489 23.426 3.012
2 0.0 0.00 61.924 28.732 3.694
3 0.0 0.50 18.669 7.778 0.800
4 0.0 0.51 18.764 8.598 0.784
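The same column filter can be written more compactly. The sketch below shows two equivalent variants on a hypothetical one-row frame: a list comprehension, and a boolean mask built with the string accessor on the column index. Neither is the article's own method; they are offered as alternatives.

```python
import pandas

# Hypothetical one-row frame with a mix of gram and milligram columns.
demo = pandas.DataFrame({
    "NDB_No": [1001],
    "Water_(g)": [15.87],
    "Calcium_(mg)": [24],
    "Protein_(g)": [0.85],
})

# Variant 1: list comprehension over the column names.
gram_columns = [c for c in demo.columns if c.endswith("(g)")]
gram_df = demo[gram_columns]

# Variant 2: boolean mask over the column index, applied with loc.
mask = demo.columns.str.endswith("(g)")
gram_df_alt = demo.loc[:, mask]

print(gram_columns)
print(gram_df_alt.columns.tolist())
```

Note that "Calcium_(mg)" is correctly excluded, because its name ends with "(mg)" and not "(g)".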
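12. Sort in ascending order
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=True)
print(food_info["Sodium_(mg)"])
Parameters:
inplace: whether to sort the existing DataFrame in place (True) instead of returning a new, sorted DataFrame (False, the default).
ascending: whether to sort in ascending order (True, the default); pass False for descending order.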
OUT:
760 0.0
8607 0.0
629 0.0
631 0.0
6470 0.0
654 0.0
8599 0.0
657 0.0
633 0.0
635 0.0
637 0.0
638 0.0
639 0.0
646 0.0
653 0.0
632 0.0
606 0.0
6463 0.0
634 0.0
666 0.0
8387 0.0
611 0.0
434 0.0
655 0.0
661 0.0
3663 0.0
3664 0.0
3665 0.0
656 0.0
3697 0.0
...
8153 NaN
8155 NaN
8156 NaN
8157 NaN
8158 NaN
8159 NaN
8160 NaN
8161 NaN
8163 NaN
8164 NaN
8165 NaN
8167 NaN
8169 NaN
8170 NaN
8172 NaN
8173 NaN
8174 NaN
8175 NaN
8176 NaN
8177 NaN
8178 NaN
8179 NaN
8180 NaN
8181 NaN
8183 NaN
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
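In the output above the NaN rows appear at the end and the original index labels are kept. The following sketch, on a hypothetical four-row frame, shows sorting without inplace, sorting in descending order, moving NaN to the front with na_position, and renumbering the index afterwards:

```python
import pandas

# Hypothetical frame with one missing value.
demo = pandas.DataFrame({"Sodium_(mg)": [643.0, None, 0.0, 2.0]})

# Without inplace=True a new sorted frame is returned and demo is unchanged.
ascending_sorted = demo.sort_values("Sodium_(mg)")
descending_sorted = demo.sort_values("Sodium_(mg)", ascending=False)

# NaN is placed last by default; na_position="first" moves it to the top.
nan_first = demo.sort_values("Sodium_(mg)", na_position="first")

# The old index labels travel with the rows; reset_index renumbers them.
renumbered = ascending_sorted.reset_index(drop=True)

print(ascending_sorted)
print(renumbered)
```

Returning a new frame instead of using inplace=True keeps the original row order available, which matters if earlier label-based lookups such as loc[3:5] are repeated after sorting.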