1: 先把下面表格再EXCEL 中保存:
name | city | age | salary | birthday |
david | shanghai | 39 | 12000 | 1981-08-20 |
jerry | shanghai | 40 | 12001 | 1981-08-21 |
tina | beijing | 41 | 12002 | 1981-08-22 |
zhang | suzhou | 42 | 12003 | 1981-08-23 |
wang | zhejiang | 43 | 12004 | 1981-08-24 |
rose | wuxi | 44 | 12005 | 1981-08-25 |
Xiao-sheng | changzhou | 45 | 12006 | 1981-08-26 |
beck | fujian | 46 | 12007 | 1981-08-27 |
hello | shenzhen | 47 | 12008 | 1981-08-28 |
mouth | dalian | 48 | 12009 | 1981-08-29 |
eye | dalian | 49 | 12010 | 1981-08-30 |
nose | shanghai | 50 | 12011 | 1981-08-31 |
2: 下面进行python:
import numpy as np
import pandas as pd
df = pd.DataFrame(pd.read_excel('/home/sheng/Desktop/Python_study/python_data.xlsx'))
print(df.info())
可以看到表的信息:
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
name 12 non-null object
city 12 non-null object
age 12 non-null int64
salary 12 non-null int64
birthday 12 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 560.0+ bytes
None
--
2.2: 看一下数据纬度:
print(df.shape)
输出:(12, 5)
2.3: 查看数据表数值
Python中的Values函数用来查看数据表中的数值。以数组的形式返回,不包含表头信息。
[['david' 'shanghai' 39 12000 Timestamp('1981-08-20 00:00:00')]
['jerry' 'shanghai' 40 12001 Timestamp('1981-08-21 00:00:00')]
['tina' 'beijing' 41 12002 Timestamp('1981-08-22 00:00:00')]
['zhang' 'suzhou' 42 12003 Timestamp('1981-08-23 00:00:00')]
['wang' 'zhejiang' 43 12004 Timestamp('1981-08-24 00:00:00')]
['rose' 'wuxi' 44 12005 Timestamp('1981-08-25 00:00:00')]
['Xiao-sheng' 'changzhou' 45 12006 Timestamp('1981-08-26 00:00:00')]
['beck' 'fujian' 46 12007 Timestamp('1981-08-27 00:00:00')]
['hello' 'shenzhen' 47 12008 Timestamp('1981-08-28 00:00:00')]
['mouth' 'dalian' 48 12009 Timestamp('1981-08-29 00:00:00')]
['eye' 'dalian' 49 12010 Timestamp('1981-08-30 00:00:00')]
['nose' 'shanghai' 50 12011 Timestamp('1981-08-31 00:00:00')]]
2.4: 查看表的内容
Pop函数用来单独去掉查看数据表中的列。
print(df)
df.pop('city')
print('\n')
print(df)
看一下output:
name city age salary birthday
0 david shanghai 39 12000 1981-08-20
1 jerry shanghai 40 12001 1981-08-21
2 tina beijing 41 12002 1981-08-22
3 zhang suzhou 42 12003 1981-08-23
4 wang zhejiang 43 12004 1981-08-24
5 rose wuxi 44 12005 1981-08-25
6 Xiao-sheng changzhou 45 12006 1981-08-26
7 beck fujian 46 12007 1981-08-27
8 hello shenzhen 47 12008 1981-08-28
9 mouth dalian 48 12009 1981-08-29
10 eye dalian 49 12010 1981-08-30
11 nose shanghai 50 12011 1981-08-31
name age salary birthday
0 david 39 12000 1981-08-20
1 jerry 40 12001 1981-08-21
2 tina 41 12002 1981-08-22
3 zhang 42 12003 1981-08-23
4 wang 43 12004 1981-08-24
5 rose 44 12005 1981-08-25
6 Xiao-sheng 45 12006 1981-08-26
7 beck 46 12007 1981-08-27
8 hello 47 12008 1981-08-28
9 mouth 48 12009 1981-08-29
10 eye 49 12010 1981-08-30
11 nose 50 12011 1981-08-31
可以看到city 这一列已经removed.
2.5: 查看表的其中一列
df1=pd.DataFrame(pd.read_excel('/home/sheng/Desktop/Python_study/python_data.xlsx'))
print(df1)
print(df1['city'])
看到上面的city 已经被removed, 所以要把它找回来就重新read file,然后再输出。
0 shanghai
1 shanghai
2 beijing
3 suzhou
4 zhejiang
5 wuxi
6 changzhou
7 fujian
8 shenzhen
9 dalian
10 dalian
11 shanghai
2.6: 把数据表是空值的drop掉:
先看一下有空的表格:
name | city | age | salary | birthday |
david | shanghai | 39 | 12000 | 1981-08-20 |
jerry | shanghai | 40 | 12001 | 1981-08-21 |
tina | beijing | 41 | 12002 | 1981-08-22 |
zhang | suzhou | 42 | 12003 | |
wang | zhejiang | 43 | 12004 | 1981-08-24 |
rose | wuxi | 44 | 12005 | 1981-08-25 |
Xiao-sheng | changzhou | 45 | 12006 | |
beck | fujian | 46 | 12007 | 1981-08-27 |
hello | shenzhen | 47 | 12008 | 1981-08-28 |
mouth | dalian | 12009 | 1981-08-29 | |
eye | dalian | 49 | 12010 | 1981-08-30 |
nose | shanghai | 50 | 12011 |
再用ipython 处理一下:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(pd.read_excel('/home/sheng/Desktop/Python_study/python
...: _data.xlsx'))
In [4]: df['birthday'].isnull()
Out[4]:
0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
Name: birthday, dtype: bool
In [5]: df['birthday'].isnull().value_counts()
Out[5]:
False 9
True 3
Name: birthday, dtype: int64
In [6]: df['birthday'] = df['birthday'].fillna('999') # 用999 来填充 空值
In [7]: df[(df.birthday=='999')].index.tolist
Out[7]: <bound method IndexOpsMixin.tolist of Int64Index([3, 6, 11], dtype='int64')>
In [8]: df[(df.birthday=='999')].index
Out[8]: Int64Index([3, 6, 11], dtype='int64')
In [9]: print(df)
name city age salary birthday
0 david shanghai 39.0 12000 1981-08-20 00:00:00
1 jerry shanghai 40.0 12001 1981-08-21 00:00:00
2 tina beijing 41.0 12002 1981-08-22 00:00:00
3 zhang suzhou 42.0 12003 999
4 wang zhejiang 43.0 12004 1981-08-24 00:00:00
5 rose wuxi 44.0 12005 1981-08-25 00:00:00
6 Xiao-sheng changzhou 45.0 12006 999
7 beck fujian 46.0 12007 1981-08-27 00:00:00
8 hello shenzhen 47.0 12008 1981-08-28 00:00:00
9 mouth dalian NaN 12009 1981-08-29 00:00:00
10 eye dalian 49.0 12010 1981-08-30 00:00:00
11 nose shanghai 50.0 12011 999
In [10]: df = df.drop([3,6,11]) #删除第3,第6,第11
In [11]: print(df)
name city age salary birthday
0 david shanghai 39.0 12000 1981-08-20 00:00:00
1 jerry shanghai 40.0 12001 1981-08-21 00:00:00
2 tina beijing 41.0 12002 1981-08-22 00:00:00
4 wang zhejiang 43.0 12004 1981-08-24 00:00:00
5 rose wuxi 44.0 12005 1981-08-25 00:00:00
7 beck fujian 46.0 12007 1981-08-27 00:00:00
8 hello shenzhen 47.0 12008 1981-08-28 00:00:00
9 mouth dalian NaN 12009 1981-08-29 00:00:00
10 eye dalian 49.0 12010 1981-08-30 00:00:00
In [12]:
In [12]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 10
Data columns (total 5 columns):
name 9 non-null object
city 9 non-null object
age 8 non-null float64
salary 9 non-null int64
birthday 9 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 432.0+ bytes
In [13]:
利用dropna(how='any') 来删除有风险:(所有的含有空格的行都删除了)
In [7]: df.dropna(how='any')
Out[7]:
name city age salary birthday
0 david shanghai 39.0 12000 1981-08-20
1 jerry shanghai 40.0 12001 1981-08-21
2 tina beijing 41.0 12002 1981-08-22
4 wang zhejiang 43.0 12004 1981-08-24
5 rose wuxi 44.0 12005 1981-08-25
7 beck fujian 46.0 12007 1981-08-27
8 hello shenzhen 47.0 12008 1981-08-28
10 eye dalian 49.0 12010 1981-08-30
2.7: 大小写转化:
df['city']=df['city'].str.upper()
看一下output: 已经全部变成大写啦。
name city age salary birthday
0 david SHANGHAI 39.0 12000 1981-08-20
1 jerry SHANGHAI 40.0 12001 1981-08-21
2 tina BEIJING 41.0 12002 1981-08-22
3 zhang SUZHOU 42.0 12003 NaT
4 wang ZHEJIANG 43.0 12004 1981-08-24
5 rose WUXI 44.0 12005 1981-08-25
6 Xiao-sheng CHANGZHOU 45.0 12006 NaT
7 beck FUJIAN 46.0 12007 1981-08-27
8 hello SHENZHEN 47.0 12008 1981-08-28
9 mouth DALIAN NaN 12009 1981-08-29
10 eye DALIAN 49.0 12010 1981-08-30
11 nose SHANGHAI 50.0 12011 NaT
2.8: 修改列名称:
n [17]: df
Out[17]:
name city age working time salary birthday
0 david shanghai 39 07:30:00 12000 1981-08-20
1 jerry shanghai 40 07:32:00 12001 1981-08-21
2 tina beijing 41 1900-01-01 07:35:00 12002 1981-08-22
3 zhang suzhou 42 1900-01-02 07:36:00 12003 NaT
4 wang zhejiang 43 1900-01-03 07:38:00 12004 1981-08-24
5 rose wuxi 44 1900-01-04 17:36:00 12005 1981-08-25
6 Xiao-sheng changzhou 45 1900-01-05 17:36:00 12006 NaT
7 beck fujian 46 1900-01-06 17:36:00 12007 1981-08-27
8 hello shenzhen 47 1900-01-07 17:36:00 12008 1981-08-28
9 mouth dalian 30 1900-01-08 17:36:00 12009 1981-08-29
10 eye dalian 49 1900-01-09 17:36:00 12010 1981-08-30
11 david shanghai 50 1900-01-10 17:39:00 12011 NaT
In [18]: df.rename(columns={'working time': 'arriving time'})
Out[18]:
name city age arriving time salary birthday
0 david shanghai 39 07:30:00 12000 1981-08-20
1 jerry shanghai 40 07:32:00 12001 1981-08-21
2 tina beijing 41 1900-01-01 07:35:00 12002 1981-08-22
3 zhang suzhou 42 1900-01-02 07:36:00 12003 NaT
4 wang zhejiang 43 1900-01-03 07:38:00 12004 1981-08-24
5 rose wuxi 44 1900-01-04 17:36:00 12005 1981-08-25
6 Xiao-sheng changzhou 45 1900-01-05 17:36:00 12006 NaT
7 beck fujian 46 1900-01-06 17:36:00 12007 1981-08-27
8 hello shenzhen 47 1900-01-07 17:36:00 12008 1981-08-28
9 mouth dalian 30 1900-01-08 17:36:00 12009 1981-08-29
10 eye dalian 49 1900-01-09 17:36:00 12010 1981-08-30
11 david shanghai 50 1900-01-10 17:39:00 12011 NaT
2.9: 删除重复值:
In [20]: df['working time'].drop_duplicates()
Out[20]:
0 07:30:00
1 07:32:00
2 1900-01-01 07:35:00
3 1900-01-02 07:36:00
4 1900-01-03 07:38:00
5 1900-01-04 17:36:00
6 1900-01-05 17:36:00
7 1900-01-06 17:36:00
8 1900-01-07 17:36:00
9 1900-01-08 17:36:00
10 1900-01-09 17:36:00
11 1900-01-10 17:39:00
Name: working time, dtype: object
发现重复的项没有删除,原因是, 我想删除"david" 的不同时间时间段的,应该是name 关键值:
In [21]: df['name'].drop_duplicates()
Out[21]:
0 david
1 jerry
2 tina
3 zhang
4 wang
5 rose
6 Xiao-sheng
7 beck
8 hello
9 mouth
10 eye
Name: name, dtype: object
成功。
如果要保留最后一项: (这个在每天公司考勤中可以用到)
In [23]: df['name'].drop_duplicates(keep='last')
Out[23]:
1 jerry
2 tina
3 zhang
4 wang
5 rose
6 Xiao-sheng
7 beck
8 hello
9 mouth
10 eye
11 david
Name: name, dtype: object
3.0: 替换值
Python中使用replace函数实现数据替换。数据表中city字段上海存在两种写法,分别为shanghai和SH。我们使用replace函数对shanghai进行替换。
In [24]: df['city'].replace('shanghai','SH')
Out[24]:
0 SH
1 SH
2 beijing
3 suzhou
4 zhejiang
5 wuxi
6 changzhou
7 fujian
8 shenzhen
9 dalian
10 dalian
11 SH
Name: city, dtype: object
3.1: 对符合条件的表进行查找
In [7]: df.loc[df['city'].isin(['beijing','shanghai'])]
Out[7]:
name city age salary birthday
0 david shanghai 39 12000 1981-08-20
1 jerry shanghai 40 12001 1981-08-21
2 tina beijing 41 12002 1981-08-22
11 david shanghai 50 12011 NaT
3.2: 对特定列提取前一个字符
提取前2个字符,并生成数据表
In [11]: pd.DataFrame(df['city'].str[:2])
Out[11]:
city
0 sh
1 sh
2 be
3 su
4 zh
5 wu
6 ch
7 fu
8 sh
9 da
10 da
11 sh
3.3: 对复合条件的表进行查找
In [14]: df.loc[( df['age'] > 45 ) & (df['city'] == 'shanghai')]
Out[14]:
name city age salary birthday
11 david shanghai 50 12011 NaT
不在上海的查找:
df.loc[( df['age'] > 45 ) & (df['city'] != 'shanghai')]
Out[20]:
name city age salary birthday
7 beck fujian 46 12007 1981-08-27
8 hello shenzhen 47 12008 1981-08-28
10 eye dalian 49 12010 1981-08-30
如果只要显示'name', 'age', 'birthday', 不显示'salary':
In [23]: df.loc[( df['age'] > 45 ) & (df['city'] != 'shanghai'),['name','city','age','birthday']]
Out[23]:
name city age birthday
7 beck fujian 46 1981-08-27
8 hello shenzhen 47 1981-08-28
10 eye dalian 49 1981-08-30
3.4: 用query 对的表进行过滤
In [22]: df.query('city == ["beijing","shanghai"]' )
Out[22]:
name city age salary birthday
0 david shanghai 39 12000 1981-08-20
1 jerry shanghai 40 12001 1981-08-21
2 tina beijing 41 12002 1981-08-22
11 david shanghai 50 12011 NaT
3.5: 用query 对的表进行sum, min,max 的功能。
相当于excel中的sumifs函数的功能。
In [33]: df.query('city == ["beijing","shanghai"]' ).salary
Out[33]:
0 12000
1 12001
2 12002
11 12011
Name: salary, dtype: int64
In [35]: df.query('city == ["beijing","shanghai"]' ).salary.sum()
Out[35]: 48014
In [36]: df.query('city == ["beijing","shanghai"]' ).salary.max()
Out[36]: 12011
In [37]: df.query('city == ["beijing","shanghai"]' ).salary.min()
Out[37]: 12000
3.5: 用groupby 对数据进行汇总功能:
看一下table:
name | city | age | salary | birthday |
david | shanghai | 39 | 12000 | 1981-08-20 |
jerry | shanghai | 40 | 12001 | 1981-08-21 |
tina | beijing | 41 | 12002 | 1981-08-22 |
zhang | suzhou | 42 | 12003 | 1981-08-23 |
wang | shenzhen | 43 | 12004 | 1981-08-24 |
rose | wuxi | 44 | 12005 | 1981-08-25 |
Xiao-sheng | changzhou | 45 | 12006 | 1981-08-26 |
beck | fujian | 46 | 12007 | 1981-08-27 |
hello | shenzhen | 47 | 12008 | 1981-08-28 |
mouth | dalian | 30 | 12009 | 1981-08-29 |
eye | dalian | 49 | 12010 | 1981-08-30 |
david | shanghai | 50 | 12011 | 1981-08-31 |
统计一下每个城市有多少人:
df.groupby('city').count()
name age salary birthday
city
beijing 1 1 1 1
changzhou 1 1 1 1
dalian 2 2 2 2
fujian 1 1 1 1
shanghai 3 3 3 3
shenzhen 2 2 2 2
suzhou 1 1 1 1
wuxi 1 1 1 1
3.6: 用groupby 对数据进行多纬度计算
对汇总后的数据同时按多个维度进行计算,下面的代码中按城市对salary字段进行汇总,并分别计算salary的数量,总金额和平均金额。
df.groupby('city')['salary'].agg([len,np.sum,np.mean])
len sum mean
city
beijing 1 12002 12002.0
changzhou 1 12006 12006.0
dalian 2 24019 12009.5
fujian 1 12007 12007.0
shanghai 3 36012 12004.0
shenzhen 2 24012 12006.0
suzhou 1 12003 12003.0
wuxi 1 12005 12005.0
参考资料:https://www.cnblogs.com/nxld/p/6756492.html