Basic Machine Learning Libraries: Pandas

莫名其妙

Tags: pandas, machine learning, python, data analysis

Article link: https://blog.csdn.net/qq_60632452/article/details/125881118

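Pandas is the library most commonly used for data handling in machine learning. When you are given a large amount of data, the first step is usually to preprocess it with Pandas and pull out the information you actually need. Let's look at the basic operations Pandas offers.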

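import pandas
food_info = pandas.read_csv("food_info.csv")  # read the data from the CSV file
print(type(food_info))  # the core pandas structure is the DataFrame
print(food_info.head(3))  # print the first 3 rows as a table; head() shows 5 rows by default
print(food_info.tail(4))  # print the last 4 rows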

Output:

<class 'pandas.core.frame.DataFrame'>
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
0 1001 BUTTER WITH SALT 15.87 717 0.85
1 1002 BUTTER WHIPPED WITH SALT 15.87 717 0.85
2 1003 BUTTER OIL ANHYDROUS 0.24 876 0.28

Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... \
0 81.11 2.11 0.06 0.0 0.06 ...
1 81.11 2.11 0.06 0.0 0.06 ...
2 99.48 0.00 0.00 0.0 0.00 ...

Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) \
0 2499.0 684.0 2.32 1.5 60.0 7.0
1 2499.0 684.0 2.32 1.5 60.0 7.0
2 3069.0 840.0 2.80 1.8 73.0 8.6

FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
0 51.368 21.021 3.043 215.0
1 50.489 23.426 3.012 219.0
2 61.924 28.732 3.694 256.0

[3 rows x 36 columns]
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) \
8614 90240 SCALLOP (BAY&SEA) CKD STMD 70.25 111 20.54
8615 90480 SYRUP CANE 26.00 269 0.00
8616 90560 SNAIL RAW 79.20 90 16.10
8617 93600 TURTLE GREEN RAW 78.50 89 19.80

Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) \
8614 0.84 2.97 5.41 0.0 0.0
8615 0.00 0.86 73.14 0.0 73.2
8616 1.40 1.30 2.00 0.0 0.0
8617 0.50 1.20 0.00 0.0 0.0

... Vit_A_IU Vit_A_RAE Vit_E_(mg) Vit_D_mcg Vit_D_IU Vit_K_(mcg) \
8614 ... 5.0 2.0 0.0 0.0 2.0 0.0
8615 ... 0.0 0.0 0.0 0.0 0.0 0.0
8616 ... 100.0 30.0 5.0 0.0 0.0 0.1
8617 ... 100.0 30.0 0.5 0.0 0.0 0.1

FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg)
8614 0.218 0.082 0.222 41.0
8615 0.000 0.000 0.000 0.0
8616 0.361 0.259 0.252 50.0
8617 0.127 0.088 0.170 50.0

[4 rows x 36 columns]
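
If you do not have food_info.csv at hand, the same calls can be tried on a small CSV held in memory. The following is only a sketch: the three rows are copied from the output above, and the names `csv_text` and `sample` are made up for this illustration.

```python
import io

import pandas as pd

# A tiny stand-in for food_info.csv (values copied from the output above).
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1001,BUTTER WITH SALT,15.87,717
1002,BUTTER WHIPPED WITH SALT,15.87,717
1003,BUTTER OIL ANHYDROUS,0.24,876
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # first 2 rows
last_one = sample.tail(1)   # last row only

print(type(sample))
print(first_two)
print(last_one)
```

Both head() and tail() return a new DataFrame, so the results can be stored and processed further instead of only printed.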

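print(food_info.columns)  # show the column names
print(food_info.shape)  # (rows, columns): 8618 samples with 36 features each
# food_info[0] would look for a column named 0, not the first row.
# Use the .loc indexer with a row label; adding a column name picks out a single element.
print(food_info.loc[0, "NDB_No"])
print(food_info["NDB_No"])  # select a whole column by its name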

Output:

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
'Cholestrl_(mg)'],
dtype='object')
(8618, 36)
1001
0 1001
1 1002
2 1003
3 1004
4 1005
...
8613 83110
8614 90240
8615 90480
8616 90560
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
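
The difference between selecting by label and selecting by position is easier to see on a frame whose index is not 0, 1, 2. Here is a minimal sketch; the index labels 10, 20, 30 are hypothetical and chosen only to make that difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"NDB_No": [1001, 1002, 1003], "Energ_Kcal": [717, 717, 876]},
    index=[10, 20, 30],  # hypothetical row labels, not from the dataset
)

by_label = df.loc[20, "Energ_Kcal"]      # row labelled 20, column "Energ_Kcal"
by_position = df.iloc[1, 1]              # second row, second column (the same cell)
one_row = df.loc[10]                     # a whole row, returned as a Series
two_cols = df[["NDB_No", "Energ_Kcal"]]  # a list of names selects several columns

print(by_label, by_position)
print(one_row)
print(two_cols)
```

With the default index produced by read_csv, labels and positions happen to coincide, which is why food_info.loc[0, "NDB_No"] returns the first row's value.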

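# Arithmetic on pandas columns (+, -, *, /) is applied element-wise to every row
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = water_energy / 1000
# Note: despite its name, iron_grams holds Water_(g) * Energ_Kcal / 1000, not an iron content
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams  # assigning to a new column name creates that column
print(food_info.shape)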

Output:

(8618, 36)
(8618, 37)
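
As a small illustration of element-wise arithmetic, the sketch below converts a milligram column to grams and scales another column by its maximum. The numbers are made up for the example, and the column name `Water_frac` is hypothetical.

```python
import pandas as pd

# Made-up values; only the column naming style follows the dataset.
df = pd.DataFrame({
    "Iron_(mg)": [0.02, 0.16, 0.00],
    "Water_(g)": [15.87, 15.87, 0.24],
})

# Column / scalar: every element is divided by 1000 (mg -> g).
df["Iron_(g)"] = df["Iron_(mg)"] / 1000

# Column / scalar computed from the column itself: scale to the range 0..1.
df["Water_frac"] = df["Water_(g)"] / df["Water_(g)"].max()

print(df.shape)
print(df)
```

Operations between two columns work the same way: rows are matched by index label and the operator is applied pair by pair.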

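# Sort by one column; ascending=False sorts from largest to smallest,
# and inplace=True modifies food_info itself instead of returning a copy
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
print(food_info["Sodium_(mg)"])
food_info_reindex = food_info.reset_index(drop=True)  # renumber rows 0..n-1 after sorting and discard the old index
print(food_info_reindex)
# print(help(food_info.sort_values))
# print(help(food_info.reset_index))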

Output:

276     38758.0
5814    27360.0
6192    26050.0
1242    26000.0
1245    24000.0
         ...   
8184        NaN
8185        NaN
8195        NaN
8251        NaN
8267        NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
      NDB_No                                     Shrt_Desc  Water_(g)  \
0       2047                                    SALT TABLE       0.20   
1      18372                  LEAVENING AGENTS BAKING SODA       0.20   
2      19225               DESSERTS RENNIN TABLETS UNSWTND       6.50   
3       6075             SOUP BF BROTH OR BOUILLON PDR DRY       3.27   
4       6081                    SOUP CHICK BROTH CUBES DRY       2.50   
...      ...                                           ...        ...   
8613   35092          WILLOW LEAVES IN OIL (ALASKA NATIVE)      28.00   
8614   35093     WILLOW YOUNG LEAVES CHOPD (ALASKA NATIVE)      68.70   
8615   35139                SQUASH INDIAN CKD BLD (NAVAJO)      96.21   
8616   35199  PRICKLY PEARS BRLD (NORTHERN PLAINS INDIANS)      75.83   
8617   35231          SEA LION STELLER FAT (ALASKA NATIVE)       4.70   

      Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0              0         0.00           0.00     99.8            0.00   
1              0         0.00           0.00     36.9            0.00   
2             84         1.00           0.10     72.5           19.80   
3            213        15.97           8.89     54.5           17.40   
4            198        14.60           4.70     54.7           23.50   
...          ...          ...            ...      ...             ...   
8613         592         2.60          61.00      0.3            8.10   
8614         122         6.10           1.60      2.9           20.70   
8615          16         0.31           0.15      0.1            3.22   
8616          91         0.39           0.31      1.9           21.57   
8617         850         0.90          94.00      0.2            0.00   

      Fiber_TD_(g)  Sugar_Tot_(g)  ...  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  \
0              0.0           0.00  ...        0.0        0.00        0.0   
1              0.0           0.00  ...        0.0        0.00        0.0   
2              0.0            NaN  ...        0.0         NaN        NaN   
3              0.0          16.71  ...        0.0        2.17        0.0   
4              0.0           0.00  ...        0.0        0.09        NaN   
...            ...            ...  ...        ...         ...        ...   
8613           NaN            NaN  ...        NaN         NaN        NaN   
8614           NaN            NaN  ...        NaN         NaN        NaN   
8615           1.5           2.02  ...        NaN         NaN        NaN   
8616           NaN            NaN  ...        NaN         NaN        NaN   
8617           NaN            NaN  ...       97.0         NaN        0.0   

      Vit_D_IU  Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  \
0          0.0          0.0       0.000        0.000        0.000   
1          0.0          0.0       0.000        0.000        0.000   
2          NaN          NaN       0.041        0.038        0.007   
3          0.0          3.2       4.320        3.616        0.332   
4          NaN          0.0       1.200        1.920        1.620   
...        ...          ...         ...          ...          ...   
8613       NaN          NaN         NaN          NaN          NaN   
8614       NaN          NaN         NaN          NaN          NaN   
8615       NaN          NaN         NaN          NaN          NaN   
8616       NaN          NaN         NaN          NaN          NaN   
8617       0.0          NaN         NaN          NaN          NaN   

      Cholestrl_(mg)  Iron_(g)  
0                0.0   0.00000  
1                0.0   0.00000  
2                0.0   0.54600  
3               10.0   0.69651  
4               13.0   0.49500  
...              ...       ...  
8613             NaN  16.57600  
8614             NaN   8.38140  
8615             NaN   1.53936  
8616             NaN   6.90053  
8617            95.0   3.99500  

[8618 rows x 37 columns]
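
Two details of the output above are worth checking on a smaller frame: rows keep their original index labels after sorting, and NaN values are placed at the end by default. A minimal sketch with made-up rows (only the 38758.0 figure echoes the output above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "food": ["a", "b", "c", "d"],                 # hypothetical names
    "Sodium_(mg)": [5.0, np.nan, 38758.0, 643.0],
})

# Without inplace=True, sort_values returns a new frame and leaves df untouched.
sorted_df = df.sort_values("Sodium_(mg)", ascending=False)
print(list(sorted_df.index))  # original labels, now out of order; the NaN row comes last

# drop=True discards the old labels instead of keeping them as a new column.
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```

Returning a new frame, as done here, is often easier to reason about than inplace=True, because the unsorted data stays available.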

import pandas as pd
titanic_survivral = pd.read_csv("titanic_train.csv")
titanic_survivral.head(4)
age = titanic_survivral["Age"]  # select the column to check
age_is_null = pd.isnull(age)  # isnull returns True wherever the value is missing
print(age_is_null)
print(titanic_survivral["Age"].mean())  # mean() skips NaN values by default
# pivot_table is an important function: index is the column to group by,
# values is the column to aggregate, and aggfunc is the aggregation applied to values within each group
passenger_survivor = titanic_survivral.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(passenger_survivor)

Output:

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
29.69911764705882
        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363
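
The same three steps (finding missing values, averaging while skipping them, and aggregating per group) can be sketched on a six-row frame. All values here are made up; only the column names follow titanic_train.csv. The groupby line is shown as an equivalent way to get the per-class mean, not as something the original code uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Age": [38.0, np.nan, 26.0, 22.0, np.nan, 4.0],
})

age_is_null = df["Age"].isnull()  # boolean Series, True where Age is missing
n_missing = age_is_null.sum()     # True counts as 1

mean_age = df["Age"].mean()                   # NaN rows are skipped by default
manual_mean = df.loc[~age_is_null, "Age"].mean()  # same result, filtering by hand

# Mean of Survived within each Pclass, i.e. the survival rate per class.
rate = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
same_rate = df.groupby("Pclass")["Survived"].mean()

print(n_missing, mean_age, manual_mean)
print(rate)
```

Because Survived is 0 or 1, its mean within a group is exactly the fraction of passengers in that group who survived.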

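# dropna with axis=0 drops rows; subset limits the check to the listed columns,
# so a row is removed only if its Age or Sex is NaN
new_titanic_survivor = titanic_survivral.dropna(axis=0, subset=["Age", "Sex"])
print(new_titanic_survivor)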

Output:

 PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
885          886         0       3   
886          887         0       2   
887          888         1       1   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ...    ...   
885               Rice, Mrs. William (Margaret Norton)  female  39.0      0   
886                              Montvila, Rev. Juozas    male  27.0      0   
887                       Graham, Miss. Margaret Edith  female  19.0      0   
889                              Behr, Mr. Karl Howell    male  26.0      0   
890                                Dooley, Mr. Patrick    male  32.0      0   

     Parch            Ticket     Fare Cabin Embarked  
0        0         A/5 21171   7.2500   NaN        S  
1        0          PC 17599  71.2833   C85        C  
2        0  STON/O2. 3101282   7.9250   NaN        S  
3        0            113803  53.1000  C123        S  
4        0            373450   8.0500   NaN        S  
..     ...               ...      ...   ...      ...  
885      5            382652  29.1250   NaN        Q  
886      0            211536  13.0000   NaN        S  
887      0            112053  30.0000   B42        S  
889      0            111369  30.0000  C148        C  
890      0            370376   7.7500   NaN        Q  

[714 rows x 12 columns]
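
Note that rows with a missing Cabin are still present in the output above, because Cabin is not listed in subset. A minimal sketch of that behaviour, using three made-up rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", None],
    "Cabin": [np.nan, "C85", np.nan],
})

# Only Age and Sex are checked; a missing Cabin does not cause a drop.
kept = df.dropna(axis=0, subset=["Age", "Sex"])

# Without subset, a NaN in any column removes the row.
all_complete = df.dropna()

print(kept)
print(len(all_complete))
```

dropna returns a new DataFrame, so the original frame still contains every row afterwards.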

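One more thing to note: Pandas lets you write your own function with def and run it through .apply(functionName), which covers functionality the library does not provide out of the box.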
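
As a rough illustration of .apply with user-defined functions, the sketch below uses two hypothetical helpers, `age_group` and `value_range`, on made-up data. Called on a Series, apply passes each element to the function; called on a DataFrame, it passes each column by default.

```python
import pandas as pd


def age_group(age):
    """Hypothetical helper: label one passenger by age."""
    if pd.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"


def value_range(column):
    """Hypothetical helper: spread between the largest and smallest value of a column."""
    return column.max() - column.min()


# Made-up values; the column names follow titanic_train.csv.
df = pd.DataFrame({"Age": [22.0, None, 4.0], "Fare": [7.25, 71.28, 16.70]})

# Series.apply: age_group is called once per element.
df["AgeGroup"] = df["Age"].apply(age_group)

# DataFrame.apply: value_range is called once per column (axis=0 is the default).
ranges = df[["Age", "Fare"]].apply(value_range)

print(df)
print(ranges)
```

For simple arithmetic, the built-in element-wise operators are usually faster than apply, so a custom function is best kept for logic that cannot be written as column expressions.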

Extraction code for the CSV datasets used in this program: twwi