Common pandas Operations

守护安静星空

Last modified: 2024-02-15 16:07:34

Tags: pandas

First published: 2024-02-15 03:34:47

Article link: https://blog.csdn.net/klp1358484518/article/details/136101902

1. Importing libraries, reading a file, and inspecting file information

import pandas as pd
import numpy as np
csvf = pd.read_csv('D:/各个站程序版本说明.csv')
csvf.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       51 non-null     int64
 1   B       51 non-null     int64
 2   C       51 non-null     int64
 3   D       51 non-null     int64
 4   E       51 non-null     int64
 5   F       51 non-null     int64
dtypes: int64(6)
memory usage: 2.5 KB

'''

csvf.head()

To count the rows and columns, check the DataFrame's shape.
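
A minimal sketch of checking the shape. The DataFrame `demo` and its values are made up for illustration and are not the author's CSV file:

```python
import pandas as pd

# Hypothetical stand-in for the CSV data: 3 rows, 2 columns.
demo = pd.DataFrame({'A': [2, 3, 4], 'B': [5, 6, 7]})

n_rows, n_cols = demo.shape  # shape is a (rows, columns) tuple
print(demo.shape)            # (3, 2)
print(len(demo))             # row count only: 3
```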

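2. Adding a column

# xl is a second DataFrame of courier transaction records (307 rows, 18 columns);
# the code that loads it is not shown in this post.
label = pd.Categorical(xl['交易时间'])  # encode the transaction time as categories
print(label)
print(label.codes)                      # integer code of each row's category
print(xl.shape)                         # shape before adding the column
print(xl.columns)
s = pd.DataFrame({'A': label.codes})    # one-column DataFrame holding the codes
xl = xl.join(s)                         # join on the index appends column 'A'
print(xl.shape)                         # shape after adding the column
print(xl.columns)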

(307, 18)
Index(['运单号', '交易时间', '寄件人', '寄件人手机号', '收件人', '收件人手机号', '寄件地', '到件地', '托寄物名称',
       '产品类型', '交易类型', '主卡号', '副卡号', '支付方式', '消费总金额', '消费本金', '消费赠送金', '消费网点'],
      dtype='object')
(307, 19)
Index(['运单号', '交易时间', '寄件人', '寄件人手机号', '收件人', '收件人手机号', '寄件地', '到件地', '托寄物名称',
       '产品类型', '交易类型', '主卡号', '副卡号', '支付方式', '消费总金额', '消费本金', '消费赠送金', '消费网点',
       'A'],
      dtype='object')
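
As an alternative to building a helper DataFrame and joining it, a column can usually be added by direct assignment. This is only a sketch: `demo` and its three timestamps are hypothetical stand-ins for the transaction table.

```python
import pandas as pd

# Hypothetical miniature of the transaction table.
demo = pd.DataFrame({'交易时间': ['2023-06-02 08:11:55',
                                 '2023-06-01 09:11:06',
                                 '2023-06-02 08:11:55']})

# Assign the category codes straight to a new column; no join is needed.
demo['A'] = pd.Categorical(demo['交易时间']).codes
print(demo['A'].tolist())  # [1, 0, 1]
print(demo.shape)          # (3, 2)
```

Direct assignment aligns by position here, so it also works when the index is not the default 0..n-1 range, a case where joining a freshly built DataFrame could misalign rows.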

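3. Deleting a column

# Same steps as above: add the encoded column 'A' first.
label = pd.Categorical(xl['交易时间'])
print(label)
print(label.codes)
print(xl.shape)
print(xl.columns)
s = pd.DataFrame({'A': label.codes})
xl = xl.join(s)
print(xl.shape)
print(xl.columns)
# Delete the original text column in place now that 'A' holds its codes.
del xl['交易时间']
print(xl.shape)
print(xl.columns)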

(307, 18)
Index(['运单号', '交易时间', '寄件人', '寄件人手机号', '收件人', '收件人手机号', '寄件地', '到件地', '托寄物名称',
       '产品类型', '交易类型', '主卡号', '副卡号', '支付方式', '消费总金额', '消费本金', '消费赠送金', '消费网点'],
      dtype='object')
(307, 19)
Index(['运单号', '交易时间', '寄件人', '寄件人手机号', '收件人', '收件人手机号', '寄件地', '到件地', '托寄物名称',
       '产品类型', '交易类型', '主卡号', '副卡号', '支付方式', '消费总金额', '消费本金', '消费赠送金', '消费网点',
       'A'],
      dtype='object')
(307, 18)
Index(['运单号', '寄件人', '寄件人手机号', '收件人', '收件人手机号', '寄件地', '到件地', '托寄物名称', '产品类型',
       '交易类型', '主卡号', '副卡号', '支付方式', '消费总金额', '消费本金', '消费赠送金', '消费网点', 'A'],
      dtype='object')
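
`del` modifies the DataFrame in place. When the original should be kept, `DataFrame.drop` is one possible alternative; the sketch below uses a hypothetical three-column `demo` table rather than the real data.

```python
import pandas as pd

# Hypothetical miniature table with made-up values.
demo = pd.DataFrame({'运单号': ['SF1', 'SF2'],
                     '交易时间': ['2023-06-01', '2023-06-02'],
                     'A': [0, 1]})

# drop returns a new DataFrame and leaves demo unchanged.
trimmed = demo.drop(columns=['交易时间'])
print(trimmed.columns.tolist())  # ['运单号', 'A']
print(demo.shape)                # (2, 3)
```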

2. Accessing rows and columns

1. Viewing column information

csvf.columns
'''
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
'''

2. Accessing columns and rows

(1) Access a single column

csvf['A']
'''
0      2
1      3
2      4
3      5
4      6
5      7
6      8
7      9
......
Name: A, dtype: int64
'''

(2) Access several rows
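
The original gives no code for this step. One common way to do it, sketched on a hypothetical `demo` DataFrame, is `head`, `tail`, or a plain slice:

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
demo = pd.DataFrame({'A': [2, 3, 4, 5, 6], 'B': [10, 20, 30, 40, 50]})

first_three = demo.head(3)   # first three rows
last_two = demo.tail(2)      # last two rows
middle = demo[1:4]           # rows at positions 1, 2, 3 (end excluded)
print(middle['A'].tolist())  # [3, 4, 5]
```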

(3) Access specific rows and columns by position: iloc[rows, columns]
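
A short sketch of `iloc`, assuming a hypothetical `demo` DataFrame. The row selector comes first and the column selector second, separated by a comma:

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
demo = pd.DataFrame({'A': [2, 3, 4, 5],
                     'B': [10, 20, 30, 40],
                     'C': [7, 8, 9, 1]})

cell = demo.iloc[1, 2]       # single value: row 1, column 2
block = demo.iloc[0:2, 0:2]  # first two rows, first two columns
print(cell)                  # 8
print(block.shape)           # (2, 2)
```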

(4) Convert to an ndarray with values
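
A minimal sketch of the conversion on a hypothetical `demo` DataFrame. `to_numpy()` is shown alongside `values` because the pandas documentation recommends it for new code:

```python
import pandas as pd

# Hypothetical data; the values are made up for illustration.
demo = pd.DataFrame({'A': [2, 3, 4], 'B': [5, 6, 7]})

arr = demo.values       # NumPy array with the same shape as the DataFrame
arr2 = demo.to_numpy()  # equivalent result via the recommended method
print(type(arr).__name__)  # ndarray
print(arr.shape)           # (3, 2)
```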

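3. Analyzing the data

1. Grouping with groupby: rows that share the same value in a column are placed in the same group (similar to the filter feature in WPS spreadsheets, where rows with identical entries are collected together). Pass a list of column names to group by several columns at once.

print(list(xl.groupby('交易时间')))
#print(list(xl.groupby(['收件人','消费赠送金'])))  # group by both '收件人' and '消费赠送金'

Accessing the grouped data (x is the group key, y is the DataFrame of rows in that group):

for x, y in xl.groupby(['收件人', '消费赠送金']):
    print(type(x), x)
    print(type(y), y)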


[7 rows x 18 columns]
<class 'tuple'> ('马*', 1.5)
<class 'pandas.core.frame.DataFrame'>                  运单号                 交易时间  寄件人  ...  消费本金 消费赠送金     消费网点
1    SF1439857656942  2023-06-30 08:58:06  程*纪  ...  13.5   1.5  总部基地业务部
14   SF1409118353807  2023-06-25 10:42:45  邓*丝  ...  13.5   1.5  总部基地业务部
91   SF1439946551252  2023-06-09 08:50:48  李*平  ...  13.5   1.5  总部基地业务部
145  SF1439654255452  2023-06-06 18:46:12  郭*超  ...  13.5   1.5  总部基地业务部
156  SF1500739309308  2023-06-06 08:35:51  吴*宁  ...  13.5   1.5  总部基地业务部

[5 rows x 18 columns]
<class 'tuple'> ('马*', 2.3)
<class 'pandas.core.frame.DataFrame'>                 运单号                 交易时间  寄件人  ...  消费本金 消费赠送金     消费网点
29  SF1445516848370  2023-06-15 17:40:43  王*玉  ...  20.7   2.3  总部基地业务部

[1 rows x 18 columns]
<class 'tuple'> ('魏*', 0.0)
<class 'pandas.core.frame.DataFrame'>                  运单号                 交易时间  寄件人  ...  消费本金 消费赠送金     消费网点
219  SF1401907161910  2023-06-04 09:55:04  徐*彤  ...  18.0   0.0  总部基地业务部
226  SF1144693557714  2023-06-03 08:48:05  吴*磊  ...  13.0   0.0  总部基地业务部
249  SF1455224508209  2023-06-02 19:12:41  马*民  ...  13.0   0.0  总部基地业务部
255  SF1455384435201  2023-06-02 17:13:43   廖*  ...  14.0   0.0  总部基地业务部
281  SF1439613136618  2023-06-02 08:14:34   周*  ...  13.0   0.0  总部基地业务部
296  SF1417967024361  2023-06-02 08:11:55  韩*云  ...  13.0   0.0  总部基地业务部

[6 rows x 18 columns]
<class 'tuple'> ('魏*', 1.3)
<class 'pandas.core.frame.DataFrame'>                  运单号                 交易时间  寄件人  ...  消费本金 消费赠送金     消费网点
81   SF1450444235259  2023-06-09 08:54:56   王*  ...  11.7   1.3  总部基地业务部
92   SF1448510238715  2023-06-09 08:50:48   王*  ...  11.7   1.3  总部基地业务部
107  SF1442505158813  2023-06-08 11:06:36   逯*  ...  11.7   1.3  总部基地业务部
108  SF1439924609383  2023-06-08 11:06:36  杨*兰  ...  11.7   1.3  总部基地业务部
113  SF1454134351248  2023-06-08 08:22:34   赵*  ...  11.7   1.3  总部基地业务部
128  SF1439652204704  2023-06-07 08:21:44  徐*涛  ...  11.7   1.3  总部基地业务部
167  SF1442573948215  2023-06-06 08:34:23   裴*  ...  11.7   1.3  总部基地业务部
178  SF1429069960969  2023-06-06 08:34:23  韩*宇  ...  11.7   1.3  总部基地业务部

[8 rows x 18 columns]
<class 'tuple'> ('魏*', 1.5)
<class 'pandas.core.frame.DataFrame'>                  运单号                 交易时间  寄件人  ...  消费本金 消费赠送金     消费网点
171  SF1429046446869  2023-06-06 08:34:23  郭*丹  ...  13.5   1.5  总部基地业务部
172  SF1150185008864  2023-06-06 08:34:23  王*宁  ...  13.5   1.5  总部基地业务部

[2 rows x 18 columns]

Simple statistics after grouping:

print(xl.groupby(['收件人', '消费赠送金']).size())

收件人  消费赠送金
刘*   1.1      1
孔*新  0.0      1
孔*燕  0.0      3
     1.3      1
孙*英  1.3      1
             ..
马*   1.5      5
     2.3      1
魏*   0.0      6
     1.3      8
     1.5      2
Length: 100, dtype: int64
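
Beyond `size()`, grouped data can be aggregated or a single group pulled out. The sketch below reuses the column names from the output above, but the `demo` rows and amounts are made up:

```python
import pandas as pd

# Hypothetical rows; names mirror the real columns, values are invented.
demo = pd.DataFrame({'收件人': ['魏*', '魏*', '马*'],
                     '消费赠送金': [1.0, 2.0, 2.0],
                     '消费本金': [10.0, 20.0, 30.0]})

grouped = demo.groupby('收件人')
totals = grouped['消费本金'].sum()  # total principal per recipient
one_group = grouped.get_group('马*')  # rows belonging to a single key
print(totals['魏*'])   # 30.0
print(len(one_group))  # 1
```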

2. Filtering rows by comparing a column against a value

print(xl[xl['消费赠送金']>1.4])
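
The comparison produces a boolean Series, and indexing with it keeps only the rows marked True. A sketch with hypothetical data, including a combined condition (each condition needs its own parentheses and `&` rather than `and`):

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration.
demo = pd.DataFrame({'收件人': ['魏*', '魏*', '马*'],
                     '消费赠送金': [1.3, 1.5, 2.3]})

mask = demo['消费赠送金'] > 1.4  # boolean Series, one entry per row
both = demo[(demo['消费赠送金'] > 1.4) & (demo['收件人'] == '魏*')]
print(mask.tolist())  # [False, True, True]
print(len(both))      # 1
```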

3. Returning the unique values of a column

print(xl['消费赠送金'].unique())
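
Two related calls are often useful next to `unique()`: `nunique()` for the number of distinct values and `value_counts()` for how often each occurs. A sketch on a hypothetical Series `s`:

```python
import pandas as pd

# Hypothetical bonus amounts; the values are made up for illustration.
s = pd.Series([0.0, 1.3, 1.3, 1.5, 0.0])

uniques = s.unique()       # distinct values in order of first appearance
counts = s.value_counts()  # occurrences of each distinct value
print(uniques.tolist())    # [0.0, 1.3, 1.5]
print(s.nunique())         # 3
print(counts.loc[1.3])     # 2
```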

4. Converting data to numeric form

1. Converting a text column to numeric codes

label = pd.Categorical(xl['交易时间'])
print(label)
print(label.codes)

Categories (122, object): ['2023-06-01 09:11:06', '2023-06-01 09:11:46', '2023-06-01 09:12:10',
                           '2023-06-01 09:12:52', ..., '2023-06-30 08:57:39', '2023-06-30 08:57:52',
                           '2023-06-30 08:58:06', '2023-06-30 08:58:36']
[121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104
 103 102 101 100  99  98  97  96  96  95  94  93  92  92  92  92  91  91
  91  91  91  91  90  89  88  87  87  87  86  85  84  84  83  83  82  81
  80  79  79  79  79  79  79  79  79  79  79  79  79  79  78  77  76  75
  75  75  75  75  75  75  75  75  74  73  72  72  72  72  71  71  71  71
  71  71  71  71  71  71  71  71  71  71  70  70  70  69  68  67  67  66
  66  65  64  63  63  62  62  62  62  62  62  62  62  62  62  62  62  62
  62  62  61  61  61  61  61  61  61  61  61  61  61  61  61  61  61  61
  61  60  60  60  59  58  57  56  55  54  53  53  52  51  50  50  49  48
  48  48  48  48  48  48  48  48  48  48  48  48  48  48  48  48  48  48
  48  48  47  46  45  44  43  42  42  42  42  41  40  40  39  38  38  37
  36  35  34  33  33  33  33  33  33  33  33  33  33  33  33  33  32  31
  30  29  28  27  26  25  24  23  23  22  22  22  22  22  22  22  22  22
  22  22  22  22  22  22  22  22  22  22  22  22  22  21  20  19  18  17
  16  16  15  15  15  15  15  15  14  14  13  13  12  11  11  10  10  10
  10  10  10  10  10  10  10  10  10  10  10  10   9   9   9   9   9   9
   9   9   9   9   9   9   9   9   9   9   8   7   6   5   4   3   2   1
   0]
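
The codes are positions in the sorted list of categories, which would explain why they run from 121 down to 0 above if the rows are ordered newest first. A small sketch with three hypothetical timestamps shows the mapping and how to recover the original text:

```python
import pandas as pd

# Hypothetical timestamps, newest first like the real data appears to be.
times = pd.Series(['2023-06-30 08:58:36',
                   '2023-06-01 09:11:06',
                   '2023-06-30 08:58:36'])

label = pd.Categorical(times)
print(label.categories.tolist())  # sorted unique values
print(label.codes.tolist())       # [1, 0, 1]

# Index the categories with the codes to get the original strings back.
restored = label.categories[label.codes]
print(restored.tolist() == times.tolist())  # True
```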

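2. One-hot encoding: find the unique values in the column, then turn each unique value into its own column. In every row, the column that matches the row's value is set to 1 and all other columns are set to 0.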

    t = pd.get_dummies(xl['产品类型'])
    print(xl['产品类型'].unique())
    print(type(t))
    print(t)

['顺丰标快' nan '陆运包裹' '同城半日达' '顺丰即日' '顺丰特快' '便利箱产品']
<class 'pandas.core.frame.DataFrame'>
     便利箱产品  同城半日达  陆运包裹  顺丰即日  顺丰标快  顺丰特快
0        0      0     0     0     1     0
1        0      0     0     0     1     0
2        0      0     0     0     1     0
3        0      0     0     0     1     0
4        0      0     0     0     1     0
..     ...    ...   ...   ...   ...   ...
302      0      0     0     0     1     0
303      0      0     0     0     1     0
304      0      0     1     0     0     0
305      0      0     1     0     0     0
306      0      0     0     0     1     0

[307 rows x 6 columns]
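
Note that `unique()` lists `nan` but the encoded table has only six columns: by default `get_dummies` gives missing values no column of their own. The sketch below, on a hypothetical three-row Series, shows the `dummy_na` option and uses `dtype=int` to request 0/1 output, since recent pandas versions return booleans by default:

```python
import numpy as np
import pandas as pd

# Hypothetical product types with one missing value.
s = pd.Series(['顺丰标快', np.nan, '陆运包裹'])

plain = pd.get_dummies(s, dtype=int)                    # NaN row is all zeros
with_nan = pd.get_dummies(s, dummy_na=True, dtype=int)  # extra column for NaN
print(plain.shape)                 # (3, 2)
print(with_nan.shape)              # (3, 3)
print(plain.sum(axis=1).tolist())  # [1, 0, 1]
```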