Basic Machine Learning Libraries: Pandas

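Pandas is the library most commonly used for data processing in machine learning. When you are handed a large amount of data, the first step is usually to preprocess it with Pandas to extract the information you need. Let's look at the operations Pandas provides.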


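import pandas
food_info = pandas.read_csv("food_info.csv")  # read the data from the CSV file
print(type(food_info))  # the core pandas data structure is the DataFrame
print(food_info.head(3))  # print the first 3 rows as a table; head() with no argument shows the first 5
print(food_info.tail(4))  # print the last 4 rows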

Output:

<class 'pandas.core.frame.DataFrame'>
   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
0    1001          BUTTER WITH SALT      15.87         717         0.85   
1    1002  BUTTER WHIPPED WITH SALT      15.87         717         0.85   
2    1003      BUTTER OIL ANHYDROUS       0.24         876         0.28   

   Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  ...  \
0          81.11     2.11            0.06           0.0           0.06  ...   
1          81.11     2.11            0.06           0.0           0.06  ...   
2          99.48     0.00            0.00           0.0           0.00  ...   

   Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  \
0    2499.0      684.0        2.32        1.5      60.0          7.0   
1    2499.0      684.0        2.32        1.5      60.0          7.0   
2    3069.0      840.0        2.80        1.8      73.0          8.6   

   FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  
0      51.368       21.021        3.043           215.0  
1      50.489       23.426        3.012           219.0  
2      61.924       28.732        3.694           256.0  

[3 rows x 36 columns]
      NDB_No                   Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
8614   90240  SCALLOP (BAY&SEA) CKD STMD      70.25         111        20.54   
8615   90480                  SYRUP CANE      26.00         269         0.00   
8616   90560                   SNAIL RAW      79.20          90        16.10   
8617   93600            TURTLE GREEN RAW      78.50          89        19.80   

      Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
8614           0.84     2.97            5.41           0.0            0.0   
8615           0.00     0.86           73.14           0.0           73.2   
8616           1.40     1.30            2.00           0.0            0.0   
8617           0.50     1.20            0.00           0.0            0.0   

      ...  Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  Vit_K_(mcg)  \
8614  ...       5.0        2.0         0.0        0.0       2.0          0.0   
8615  ...       0.0        0.0         0.0        0.0       0.0          0.0   
8616  ...     100.0       30.0         5.0        0.0       0.0          0.1   
8617  ...     100.0       30.0         0.5        0.0       0.0          0.1   

      FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  Cholestrl_(mg)  
8614       0.218        0.082        0.222            41.0  
8615       0.000        0.000        0.000             0.0  
8616       0.361        0.259        0.252            50.0  
8617       0.127        0.088        0.170            50.0  

[4 rows x 36 columns] 
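
To try `read_csv`, `head` and `tail` without the real file, here is a minimal sketch that reads a CSV from an in-memory string. The seven rows are copied from the output above; the `csv_text` and `sample` names and the choice of only three columns are assumptions made for illustration.

```python
import io

import pandas as pd

# A small stand-in for food_info.csv, built from rows shown in the output above.
csv_text = """NDB_No,Shrt_Desc,Water_(g)
1001,BUTTER WITH SALT,15.87
1002,BUTTER WHIPPED WITH SALT,15.87
1003,BUTTER OIL ANHYDROUS,0.24
90240,SCALLOP (BAY&SEA) CKD STMD,70.25
90480,SYRUP CANE,26.00
90560,SNAIL RAW,79.20
93600,TURTLE GREEN RAW,78.50
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

first_default = sample.head()  # no argument: the first 5 rows
first_two = sample.head(2)     # explicit row count
last_two = sample.tail(2)      # the last 2 rows

print(len(first_default), len(first_two), len(last_two))  # 5 2 2
print(last_two)
```

Both `head` and `tail` return a new DataFrame, so the original table is left untouched.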

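print(food_info.columns)  # show the column names
print(food_info.shape)  # (8618, 36): 8618 samples (rows), each with 36 features (columns)
print(food_info.loc[0, "NDB_No"])  # food_info[0] does not select a row; use the .loc indexer with a row label, and add a column name to get a single element
print(food_info["NDB_No"])  # select a whole column by its name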

Output:

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')
(8618, 36)
1001
0        1001
1        1002
2        1003
3        1004
4        1005
        ...  
8613    83110
8614    90240
8615    90480
8616    90560
8617    93600
Name: NDB_No, Length: 8618, dtype: int64
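
The difference between label-based and position-based selection is easier to see on a table whose index does not start at 0. The sketch below uses a three-row DataFrame with values taken from the output above; the index labels 10, 11, 12 are hypothetical and chosen only to make the contrast visible.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "NDB_No": [1001, 1002, 1003],
        "Shrt_Desc": [
            "BUTTER WITH SALT",
            "BUTTER WHIPPED WITH SALT",
            "BUTTER OIL ANHYDROUS",
        ],
        "Energ_Kcal": [717, 717, 876],
    },
    index=[10, 11, 12],  # hypothetical non-default row labels
)

by_label = df.loc[10, "NDB_No"]   # .loc selects by row label and column name
by_position = df.iloc[0, 0]       # .iloc selects by integer position
row = df.loc[12]                  # a whole row, returned as a Series
two_columns = df[["NDB_No", "Energ_Kcal"]]  # a list of names returns a DataFrame
label_slice = df.loc[10:11]       # label slices include both endpoints

print(by_label, by_position)
print(row["Energ_Kcal"])
print(two_columns.shape, len(label_slice))
```

With the default index, as in `food_info`, row labels and row positions coincide, which is why `food_info.loc[0, "NDB_No"]` returns the first row's value.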

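# Arithmetic on pandas columns works directly with + - * /, and is applied element by element
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = water_energy / 1000  # note: despite its name, this is Water_(g) * Energ_Kcal / 1000, not an iron amount
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams  # assigning to a new column name creates that column
print(food_info.shape)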

Output:

(8618, 36)
(8618, 37)
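
As a small sketch of the element-wise arithmetic above: rows are matched by index, and a missing value in either operand gives NaN in the result. The column name `Water_x_Energ_per_1000` is a hypothetical, more descriptive label; the first two rows reuse values from the output above and the third is made up to show the NaN case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Water_(g)": [15.87, 0.24, np.nan],  # last value deliberately missing
        "Energ_Kcal": [717, 876, 90],
    }
)

# Column * column multiplies row by row; column / scalar divides every element.
product = df["Water_(g)"] * df["Energ_Kcal"]
df["Water_x_Energ_per_1000"] = product / 1000  # new column appended on the right

print(df.shape)  # (3, 3)
print(df)
```

The third row of the new column is NaN because its `Water_(g)` value is missing; pandas does not raise an error in that case.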

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # sort by the given column; ascending=False sorts in descending order, and NaN values go last
print(food_info["Sodium_(mg)"])
food_info_reindex = food_info.reset_index(drop=True)  # after sorting, rows keep their old index labels; reset_index(drop=True) renumbers them from 0
print(food_info_reindex)
# help(food_info.sort_values)
# help(food_info.reset_index)

Output:
 

276     38758.0
5814    27360.0
6192    26050.0
1242    26000.0
1245    24000.0
         ...   
8184        NaN
8185        NaN
8195        NaN
8251        NaN
8267        NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
      NDB_No                                     Shrt_Desc  Water_(g)  \
0       2047                                    SALT TABLE       0.20   
1      18372                  LEAVENING AGENTS BAKING SODA       0.20   
2      19225               DESSERTS RENNIN TABLETS UNSWTND       6.50   
3       6075             SOUP BF BROTH OR BOUILLON PDR DRY       3.27   
4       6081                    SOUP CHICK BROTH CUBES DRY       2.50   
...      ...                                           ...        ...   
8613   35092          WILLOW LEAVES IN OIL (ALASKA NATIVE)      28.00   
8614   35093     WILLOW YOUNG LEAVES CHOPD (ALASKA NATIVE)      68.70   
8615   35139                SQUASH INDIAN CKD BLD (NAVAJO)      96.21   
8616   35199  PRICKLY PEARS BRLD (NORTHERN PLAINS INDIANS)      75.83   
8617   35231          SEA LION STELLER FAT (ALASKA NATIVE)       4.70   

      Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0              0         0.00           0.00     99.8            0.00   
1              0         0.00           0.00     36.9            0.00   
2             84         1.00           0.10     72.5           19.80   
3            213        15.97           8.89     54.5           17.40   
4            198        14.60           4.70     54.7           23.50   
...          ...          ...            ...      ...             ...   
8613         592         2.60          61.00      0.3            8.10   
8614         122         6.10           1.60      2.9           20.70   
8615          16         0.31           0.15      0.1            3.22   
8616          91         0.39           0.31      1.9           21.57   
8617         850         0.90          94.00      0.2            0.00   

      Fiber_TD_(g)  Sugar_Tot_(g)  ...  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  \
0              0.0           0.00  ...        0.0        0.00        0.0   
1              0.0           0.00  ...        0.0        0.00        0.0   
2              0.0            NaN  ...        0.0         NaN        NaN   
3              0.0          16.71  ...        0.0        2.17        0.0   
4              0.0           0.00  ...        0.0        0.09        NaN   
...            ...            ...  ...        ...         ...        ...   
8613           NaN            NaN  ...        NaN         NaN        NaN   
8614           NaN            NaN  ...        NaN         NaN        NaN   
8615           1.5           2.02  ...        NaN         NaN        NaN   
8616           NaN            NaN  ...        NaN         NaN        NaN   
8617           NaN            NaN  ...       97.0         NaN        0.0   

      Vit_D_IU  Vit_K_(mcg)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  \
0          0.0          0.0       0.000        0.000        0.000   
1          0.0          0.0       0.000        0.000        0.000   
2          NaN          NaN       0.041        0.038        0.007   
3          0.0          3.2       4.320        3.616        0.332   
4          NaN          0.0       1.200        1.920        1.620   
...        ...          ...         ...          ...          ...   
8613       NaN          NaN         NaN          NaN          NaN   
8614       NaN          NaN         NaN          NaN          NaN   
8615       NaN          NaN         NaN          NaN          NaN   
8616       NaN          NaN         NaN          NaN          NaN   
8617       0.0          NaN         NaN          NaN          NaN   

      Cholestrl_(mg)  Iron_(g)  
0                0.0   0.00000  
1                0.0   0.00000  
2                0.0   0.54600  
3               10.0   0.69651  
4               13.0   0.49500  
...              ...       ...  
8613             NaN  16.57600  
8614             NaN   8.38140  
8615             NaN   1.53936  
8616             NaN   6.90053  
8617            95.0   3.99500  

[8618 rows x 37 columns]
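
The same sort can be written without `inplace=True`, which leaves the original table unchanged and returns a sorted copy. This is a sketch on made-up data: only the `SALT TABLE` sodium value comes from the output above, and the other item names and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Shrt_Desc": ["SALT TABLE", "ITEM B", "ITEM C", "ITEM D"],  # B-D are made up
        "Sodium_(mg)": [38758.0, 58.0, 70.0, np.nan],
    }
)

# Returns a new DataFrame; NaN rows are placed last by default.
sorted_df = df.sort_values("Sodium_(mg)", ascending=False)
print(list(sorted_df.index))  # [0, 2, 1, 3]: old labels travel with their rows

# drop=True discards the old labels instead of keeping them as a column.
reindexed = sorted_df.reset_index(drop=True)
print(list(reindexed.index))  # [0, 1, 2, 3]
print(reindexed)
```

Keeping the sorted result in a new variable makes it easier to compare against the unsorted table while exploring the data.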

 

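import pandas as pd
import numpy as np
titanic_survivral = pd.read_csv("titanic_train.csv")
titanic_survivral.head(4)  # has no visible effect here because the result is not printed
age = titanic_survivral["Age"]  # select the column to check
age_is_null = pd.isnull(age)  # isnull returns True wherever the value is missing (NaN)
print(age_is_null)
print(titanic_survivral["Age"].mean())  # mean() skips NaN values by default
passenger_survivor = titanic_survivral.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
# pivot_table is an important function: index is the column to group by, values is the column to aggregate within each group, and aggfunc is the aggregation applied (here the mean, i.e. the survival rate per passenger class)
print(passenger_survivor)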

Output:

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
29.69911764705882
        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363
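
To make the three steps above concrete, here is a sketch on a six-row made-up table (the values are illustrative, not taken from `titanic_train.csv`). It uses the boolean mask from `isnull` to count and filter missing ages, checks that `mean()` matches a mean computed over the non-missing values only, and shows that this `pivot_table` call gives the same numbers as a `groupby`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3, 3],
        "Survived": [1, 0, 1, 0, 0, 1],
        "Age": [38.0, np.nan, 26.0, 22.0, np.nan, 4.0],
    }
)

age_is_null = df["Age"].isnull()         # boolean Series, True where Age is missing
missing_count = int(age_is_null.sum())   # True counts as 1
known_ages = df.loc[~age_is_null, "Age"] # keep only rows with a known age

manual_mean = known_ages.sum() / len(known_ages)
builtin_mean = df["Age"].mean()          # skips NaN by default
print(missing_count, manual_mean, builtin_mean)  # 2 22.5 22.5

# Mean of Survived within each passenger class; the string "mean" names the aggregation.
by_class = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
same = df.groupby("Pclass")["Survived"].mean()
print(by_class)
```

Because `Survived` is 0 or 1, its mean within a group is the fraction of passengers in that group who survived.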

new_titanic_survivor = titanic_survivral.dropna(axis=0, subset=["Age", "Sex"])  # drop every row that has NaN in any of the listed columns
print(new_titanic_survivor)

Output:

     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
885          886         0       3   
886          887         0       2   
887          888         1       1   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ...    ...   
885               Rice, Mrs. William (Margaret Norton)  female  39.0      0   
886                              Montvila, Rev. Juozas    male  27.0      0   
887                       Graham, Miss. Margaret Edith  female  19.0      0   
889                              Behr, Mr. Karl Howell    male  26.0      0   
890                                Dooley, Mr. Patrick    male  32.0      0   

     Parch            Ticket     Fare Cabin Embarked  
0        0         A/5 21171   7.2500   NaN        S  
1        0          PC 17599  71.2833   C85        C  
2        0  STON/O2. 3101282   7.9250   NaN        S  
3        0            113803  53.1000  C123        S  
4        0            373450   8.0500   NaN        S  
..     ...               ...      ...   ...      ...  
885      5            382652  29.1250   NaN        Q  
886      0            211536  13.0000   NaN        S  
887      0            112053  30.0000   B42        S  
889      0            111369  30.0000  C148        C  
890      0            370376   7.7500   NaN        Q  

[714 rows x 12 columns]
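
The effect of `subset` is easiest to see next to a plain `dropna()`. The sketch below uses a four-row made-up table (names and values are illustrative): with `subset`, only the listed columns are checked, so a row with a missing `Cabin` survives; without it, any missing value removes the row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],            # made-up passengers
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Sex": ["male", "female", None, "female"],
        "Cabin": [np.nan, "C85", np.nan, "C123"],
    }
)

# Only Age and Sex are checked: rows 1 (no Age) and 2 (no Sex) are dropped.
subset_drop = df.dropna(axis=0, subset=["Age", "Sex"])
print(list(subset_drop.index))  # [0, 3]

# Every column is checked: only row 3 has no missing value at all.
any_drop = df.dropna()
print(list(any_drop.index))     # [3]

print(len(df))                  # 4: the original DataFrame is unchanged
```

The surviving rows keep their original index labels, which is why label 888 is absent from the output above; `reset_index(drop=True)` renumbers them if needed.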

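One more thing to note: pandas lets you run your own functions over the data. Define a function with def and pass it to .apply(functionName); pandas then calls it for you, which covers operations the library does not provide out of the box.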
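
A minimal sketch of `.apply` with user-defined functions, on a made-up four-row table. The function names `age_label` and `column_range`, the `AgeLabel` column and the age threshold of 18 are all hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [4.0, 22.0, 67.0, None],   # None becomes NaN in a float column
        "Fare": [7.25, 71.28, 8.05, 13.0],
    }
)


def age_label(age):
    """Classify one age value; called once per element of the column."""
    if pd.isnull(age):
        return "unknown"
    if age < 18:
        return "minor"
    return "adult"


# Series.apply passes each value of the column to the function.
df["AgeLabel"] = df["Age"].apply(age_label)


def column_range(column):
    """Spread of one column; max() and min() skip NaN by default."""
    return column.max() - column.min()


# DataFrame.apply passes each column (as a Series) to the function by default.
ranges = df[["Age", "Fare"]].apply(column_range)

print(df)
print(ranges)
```

Applied to a single column, the function receives one value at a time; applied to a DataFrame, it receives a whole column at a time, or a whole row when `axis=1` is passed.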

CSV datasets used in this program (extraction code: twwi)
