像Excel一样使用python进行数据分析 (周末两天的成果)

1: 先把下面表格再EXCEL 中保存:

namecityagesalarybirthday
davidshanghai39120001981-08-20
jerryshanghai40120011981-08-21
tinabeijing41120021981-08-22
zhangsuzhou42120031981-08-23
wangzhejiang43120041981-08-24
rosewuxi44120051981-08-25
Xiao-shengchangzhou45120061981-08-26
beckfujian46120071981-08-27
helloshenzhen47120081981-08-28
mouthdalian48120091981-08-29
eyedalian49120101981-08-30
noseshanghai50120111981-08-31

2: 下面进行python:

import  numpy as np
import pandas as pd

df = pd.DataFrame(pd.read_excel('/home/sheng/Desktop/Python_study/python_data.xlsx'))

print(df.info())

可以看到表的信息:

RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
name        12 non-null object
city        12 non-null object
age         12 non-null int64
salary      12 non-null int64
birthday    12 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 560.0+ bytes
None

 --

2.2: 看一下数据纬度:

print(df.shape)

输出:(12, 5)

2.3: 查看数据表数值

Python中的Values函数用来查看数据表中的数值。以数组的形式返回,不包含表头信息

[['david' 'shanghai' 39 12000 Timestamp('1981-08-20 00:00:00')]
 ['jerry' 'shanghai' 40 12001 Timestamp('1981-08-21 00:00:00')]
 ['tina' 'beijing' 41 12002 Timestamp('1981-08-22 00:00:00')]
 ['zhang' 'suzhou' 42 12003 Timestamp('1981-08-23 00:00:00')]
 ['wang' 'zhejiang' 43 12004 Timestamp('1981-08-24 00:00:00')]
 ['rose' 'wuxi' 44 12005 Timestamp('1981-08-25 00:00:00')]
 ['Xiao-sheng' 'changzhou' 45 12006 Timestamp('1981-08-26 00:00:00')]
 ['beck' 'fujian' 46 12007 Timestamp('1981-08-27 00:00:00')]
 ['hello' 'shenzhen' 47 12008 Timestamp('1981-08-28 00:00:00')]
 ['mouth' 'dalian' 48 12009 Timestamp('1981-08-29 00:00:00')]
 ['eye' 'dalian' 49 12010 Timestamp('1981-08-30 00:00:00')]
 ['nose' 'shanghai' 50 12011 Timestamp('1981-08-31 00:00:00')]]

2.4: 查看表的内容

Pop函数用来单独去掉查看数据表中的列。

print(df)

df.pop('city')

print('\n')

print(df)

看一下output:

          name       city  age  salary   birthday
0        david   shanghai   39   12000 1981-08-20
1        jerry   shanghai   40   12001 1981-08-21
2         tina    beijing   41   12002 1981-08-22
3        zhang     suzhou   42   12003 1981-08-23
4         wang   zhejiang   43   12004 1981-08-24
5         rose       wuxi   44   12005 1981-08-25
6   Xiao-sheng  changzhou   45   12006 1981-08-26
7         beck     fujian   46   12007 1981-08-27
8        hello   shenzhen   47   12008 1981-08-28
9        mouth     dalian   48   12009 1981-08-29
10         eye     dalian   49   12010 1981-08-30
11        nose   shanghai   50   12011 1981-08-31


          name  age  salary   birthday
0        david   39   12000 1981-08-20
1        jerry   40   12001 1981-08-21
2         tina   41   12002 1981-08-22
3        zhang   42   12003 1981-08-23
4         wang   43   12004 1981-08-24
5         rose   44   12005 1981-08-25
6   Xiao-sheng   45   12006 1981-08-26
7         beck   46   12007 1981-08-27
8        hello   47   12008 1981-08-28
9        mouth   48   12009 1981-08-29
10         eye   49   12010 1981-08-30
11        nose   50   12011 1981-08-31

可以看到city 这一列已经removed.

2.5: 查看表的其中一列

df1=pd.DataFrame(pd.read_excel('/home/sheng/Desktop/Python_study/python_data.xlsx'))

print(df1)
print(df1['city'])

看到上面的city 已经被removed, 所以要把它找回来就重新read file,然后再输出。

0      shanghai
1      shanghai
2       beijing
3        suzhou
4      zhejiang
5          wuxi
6     changzhou
7        fujian
8      shenzhen
9        dalian
10       dalian
11     shanghai

2.6: 把数据表是空值的drop掉:

先看一下有空的表格:

namecityagesalarybirthday
davidshanghai39120001981-08-20
jerryshanghai40120011981-08-21
tinabeijing41120021981-08-22
zhangsuzhou4212003 
wangzhejiang43120041981-08-24
rosewuxi44120051981-08-25
Xiao-shengchangzhou4512006 
beckfujian46120071981-08-27
helloshenzhen47120081981-08-28
mouthdalian 120091981-08-29
eyedalian49120101981-08-30
noseshanghai5012011 

再用ipython 处理一下:

In [1]: import pandas as pd                                                     

In [2]: import numpy as np                                                      

In [3]: df = pd.DataFrame(pd.read_excel('/home/sheng/Desktop/Python_study/python
   ...: _data.xlsx'))                                                           

In [4]: df['birthday'].isnull()                                                 
Out[4]:
0     False
1     False
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11     True
Name: birthday, dtype: bool

In [5]: df['birthday'].isnull().value_counts()                                  
Out[5]:
False    9
True     3
Name: birthday, dtype: int64

In [6]: df['birthday'] = df['birthday'].fillna('999')         # 用999 来填充 空值                      

In [7]: df[(df.birthday=='999')].index.tolist                                   
Out[7]: <bound method IndexOpsMixin.tolist of Int64Index([3, 6, 11], dtype='int64')>

In [8]: df[(df.birthday=='999')].index                                     
Out[8]: Int64Index([3, 6, 11], dtype='int64')

In [9]: print(df)                                                               
          name       city   age  salary             birthday
0        david   shanghai  39.0   12000  1981-08-20 00:00:00
1        jerry   shanghai  40.0   12001  1981-08-21 00:00:00
2         tina    beijing  41.0   12002  1981-08-22 00:00:00
3        zhang     suzhou  42.0   12003                  999
4         wang   zhejiang  43.0   12004  1981-08-24 00:00:00
5         rose       wuxi  44.0   12005  1981-08-25 00:00:00
6   Xiao-sheng  changzhou  45.0   12006                  999
7         beck     fujian  46.0   12007  1981-08-27 00:00:00
8        hello   shenzhen  47.0   12008  1981-08-28 00:00:00
9        mouth     dalian   NaN   12009  1981-08-29 00:00:00
10         eye     dalian  49.0   12010  1981-08-30 00:00:00
11        nose   shanghai  50.0   12011                  999

In [10]: df = df.drop([3,6,11])                     #删除第3,第6,第11                           

In [11]: print(df)                                                              
     name      city   age  salary             birthday
0   david  shanghai  39.0   12000  1981-08-20 00:00:00
1   jerry  shanghai  40.0   12001  1981-08-21 00:00:00
2    tina   beijing  41.0   12002  1981-08-22 00:00:00
4    wang  zhejiang  43.0   12004  1981-08-24 00:00:00
5    rose      wuxi  44.0   12005  1981-08-25 00:00:00
7    beck    fujian  46.0   12007  1981-08-27 00:00:00
8   hello  shenzhen  47.0   12008  1981-08-28 00:00:00
9   mouth    dalian   NaN   12009  1981-08-29 00:00:00
10    eye    dalian  49.0   12010  1981-08-30 00:00:00

In [12]:                                                                        

In [12]: df.info()                                                              
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 10
Data columns (total 5 columns):
name        9 non-null object
city        9 non-null object
age         8 non-null float64
salary      9 non-null int64
birthday    9 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 432.0+ bytes

In [13]:                            

利用dropna(how='any') 来删除有风险:(所有的含有空格的行都删除了)

In [7]: df.dropna(how='any')                                                                         
Out[7]:
     name      city   age  salary   birthday
0   david  shanghai  39.0   12000 1981-08-20
1   jerry  shanghai  40.0   12001 1981-08-21
2    tina   beijing  41.0   12002 1981-08-22
4    wang  zhejiang  43.0   12004 1981-08-24
5    rose      wuxi  44.0   12005 1981-08-25
7    beck    fujian  46.0   12007 1981-08-27
8   hello  shenzhen  47.0   12008 1981-08-28
10    eye    dalian  49.0   12010 1981-08-30

2.7: 大小写转化:

df['city']=df['city'].str.upper()

看一下output: 已经全部变成大写啦。

          name       city   age  salary   birthday
0        david   SHANGHAI  39.0   12000 1981-08-20
1        jerry   SHANGHAI  40.0   12001 1981-08-21
2         tina    BEIJING  41.0   12002 1981-08-22
3        zhang     SUZHOU  42.0   12003        NaT
4         wang   ZHEJIANG  43.0   12004 1981-08-24
5         rose       WUXI  44.0   12005 1981-08-25
6   Xiao-sheng  CHANGZHOU  45.0   12006        NaT
7         beck     FUJIAN  46.0   12007 1981-08-27
8        hello   SHENZHEN  47.0   12008 1981-08-28
9        mouth     DALIAN   NaN   12009 1981-08-29
10         eye     DALIAN  49.0   12010 1981-08-30
11        nose   SHANGHAI  50.0   12011        NaT

2.8: 修改列名称:

n [17]: df                                                                                          
Out[17]:
          name       city  age         working time  salary   birthday
0        david   shanghai   39             07:30:00   12000 1981-08-20
1        jerry   shanghai   40             07:32:00   12001 1981-08-21
2         tina    beijing   41  1900-01-01 07:35:00   12002 1981-08-22
3        zhang     suzhou   42  1900-01-02 07:36:00   12003        NaT
4         wang   zhejiang   43  1900-01-03 07:38:00   12004 1981-08-24
5         rose       wuxi   44  1900-01-04 17:36:00   12005 1981-08-25
6   Xiao-sheng  changzhou   45  1900-01-05 17:36:00   12006        NaT
7         beck     fujian   46  1900-01-06 17:36:00   12007 1981-08-27
8        hello   shenzhen   47  1900-01-07 17:36:00   12008 1981-08-28
9        mouth     dalian   30  1900-01-08 17:36:00   12009 1981-08-29
10         eye     dalian   49  1900-01-09 17:36:00   12010 1981-08-30
11       david   shanghai   50  1900-01-10 17:39:00   12011        NaT

In [18]: df.rename(columns={'working time': 'arriving time'})                                        
Out[18]:
          name       city  age        arriving time  salary   birthday
0        david   shanghai   39             07:30:00   12000 1981-08-20
1        jerry   shanghai   40             07:32:00   12001 1981-08-21
2         tina    beijing   41  1900-01-01 07:35:00   12002 1981-08-22
3        zhang     suzhou   42  1900-01-02 07:36:00   12003        NaT
4         wang   zhejiang   43  1900-01-03 07:38:00   12004 1981-08-24
5         rose       wuxi   44  1900-01-04 17:36:00   12005 1981-08-25
6   Xiao-sheng  changzhou   45  1900-01-05 17:36:00   12006        NaT
7         beck     fujian   46  1900-01-06 17:36:00   12007 1981-08-27
8        hello   shenzhen   47  1900-01-07 17:36:00   12008 1981-08-28
9        mouth     dalian   30  1900-01-08 17:36:00   12009 1981-08-29
10         eye     dalian   49  1900-01-09 17:36:00   12010 1981-08-30
11       david   shanghai   50  1900-01-10 17:39:00   12011        NaT

2.9: 删除重复值:

In [20]: df['working time'].drop_duplicates()                                                        
Out[20]:
0                07:30:00
1                07:32:00
2     1900-01-01 07:35:00
3     1900-01-02 07:36:00
4     1900-01-03 07:38:00
5     1900-01-04 17:36:00
6     1900-01-05 17:36:00
7     1900-01-06 17:36:00
8     1900-01-07 17:36:00
9     1900-01-08 17:36:00
10    1900-01-09 17:36:00
11    1900-01-10 17:39:00
Name: working time, dtype: object

发现重复的项没有删除,原因是, 我想删除"david" 的不同时间时间段的,应该是name 关键值:

In [21]: df['name'].drop_duplicates()                                                                
Out[21]:
0          david
1          jerry
2           tina
3          zhang
4           wang
5           rose
6     Xiao-sheng
7           beck
8          hello
9          mouth
10           eye
Name: name, dtype: object

成功。

如果要保留最后一项:  (这个在每天公司考勤中可以用到)

In [23]: df['name'].drop_duplicates(keep='last')                                                     
Out[23]:
1          jerry
2           tina
3          zhang
4           wang
5           rose
6     Xiao-sheng
7           beck
8          hello
9          mouth
10           eye
11         david
Name: name, dtype: object

3.0: 替换值

Python中使用replace函数实现数据替换。数据表中city字段上海存在两种写法,分别为shanghai和SH。我们使用replace函数对shanghai进行替换。

In [24]: df['city'].replace('shanghai','SH')                                                         
Out[24]:
0            SH
1            SH
2       beijing
3        suzhou
4      zhejiang
5          wuxi
6     changzhou
7        fujian
8      shenzhen
9        dalian
10       dalian
11           SH
Name: city, dtype: object

3.1: 对符合条件的表进行查找

In [7]: df.loc[df['city'].isin(['beijing','shanghai'])]                                              
Out[7]:
     name      city  age  salary   birthday
0   david  shanghai   39   12000 1981-08-20
1   jerry  shanghai   40   12001 1981-08-21
2    tina   beijing   41   12002 1981-08-22
11  david  shanghai   50   12011        NaT

3.2: 对特定列提取前一个字符

提取前2个字符,并生成数据表

In [11]: pd.DataFrame(df['city'].str[:2])                                                            
Out[11]:
   city
0    sh
1    sh
2    be
3    su
4    zh
5    wu
6    ch
7    fu
8    sh
9    da
10   da
11   sh

3.3: 对复合条件的表进行查找

In [14]: df.loc[( df['age'] > 45 ) & (df['city'] == 'shanghai')]                                     
Out[14]:
     name      city  age  salary birthday
11  david  shanghai   50   12011      NaT

不在上海的查找:

df.loc[( df['age'] > 45 ) & (df['city'] != 'shanghai')]

Out[20]:
     name      city  age  salary   birthday
7    beck    fujian   46   12007 1981-08-27
8   hello  shenzhen   47   12008 1981-08-28
10    eye    dalian   49   12010 1981-08-30

如果只要显示'name', 'age', 'birthday', 不显示'salary':

In [23]: df.loc[( df['age'] > 45 ) & (df['city'] != 'shanghai'),['name','city','age','birthday']]    
Out[23]:
     name      city  age   birthday
7    beck    fujian   46 1981-08-27
8   hello  shenzhen   47 1981-08-28
10    eye    dalian   49 1981-08-30

 

3.4: 用query 对的表进行过滤

In [22]: df.query('city == ["beijing","shanghai"]' )                                                 
Out[22]:
     name      city  age  salary   birthday
0   david  shanghai   39   12000 1981-08-20
1   jerry  shanghai   40   12001 1981-08-21
2    tina   beijing   41   12002 1981-08-22
11  david  shanghai   50   12011        NaT

3.5: 用query 对的表进行sum, min,max 的功能。

相当于excel中的sumifs函数的功能。

In [33]: df.query('city == ["beijing","shanghai"]' ).salary                                          
Out[33]:
0     12000
1     12001
2     12002
11    12011
Name: salary, dtype: int64

In [35]: df.query('city == ["beijing","shanghai"]' ).salary.sum()                                    
Out[35]: 48014

In [36]: df.query('city == ["beijing","shanghai"]' ).salary.max()                                    
Out[36]: 12011

In [37]: df.query('city == ["beijing","shanghai"]' ).salary.min()                                    
Out[37]: 12000

3.5: 用groupby 对数据进行汇总功能:

看一下table:

namecityagesalarybirthday
davidshanghai39120001981-08-20
jerryshanghai40120011981-08-21
tinabeijing41120021981-08-22
zhangsuzhou42120031981-08-23
wangshenzhen43120041981-08-24
rosewuxi44120051981-08-25
Xiao-shengchangzhou45120061981-08-26
beckfujian46120071981-08-27
helloshenzhen47120081981-08-28
mouthdalian30120091981-08-29
eyedalian49120101981-08-30
davidshanghai50120111981-08-31

统计一下每个城市有多少人:

df.groupby('city').count()

           name  age  salary  birthday
city                                  
beijing       1    1       1         1
changzhou     1    1       1         1
dalian        2    2       2         2
fujian        1    1       1         1
shanghai      3    3       3         3
shenzhen      2    2       2         2
suzhou        1    1       1         1
wuxi          1    1       1         1

 

3.6: 用groupby 对数据进行多纬度计算

对汇总后的数据同时按多个维度进行计算,下面的代码中按城市对salary字段进行汇总,并分别计算salary的数量,总金额和平均金额。

df.groupby('city')['salary'].agg([len,np.sum,np.mean])

           len    sum     mean
city                          
beijing      1  12002  12002.0
changzhou    1  12006  12006.0
dalian       2  24019  12009.5
fujian       1  12007  12007.0
shanghai     3  36012  12004.0
shenzhen     2  24012  12006.0
suzhou       1  12003  12003.0
wuxi         1  12005  12005.0

 

参考资料:https://www.cnblogs.com/nxld/p/6756492.html

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

shenghuiping2001

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值