Pandas preprocessing functions

Yuanling_2

Article link: https://blog.csdn.net/sinat_20263049/article/details/100545058

import pandas as pd
import numpy as np

position=pd.read_csv('dataAnalyst_sql.csv')

company=pd.read_csv('company_sql.csv',encoding='utf-8')

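(1) str: strip stray characters such as [] and '' (text functions)

position.positionLables.str.count('分析师')    # how many times '分析师' appears in each value; .str operates on the strings inside the Series

position.positionLables.str[1:-1]    # drop the first and last character

position.positionLables.str[1:-1].str.replace("'", "", regex=False)    # drop the single quotes; replacing '' with '' changes nothing

position.replace( 89024,'')     # replace a specific value wherever it occurs in the whole table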

Signature: position.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
Docstring:Replace values given in to_replace with value.
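The three .str steps above can be tried without the CSV file. This is a minimal sketch on a hand-made Series; the name `labels` and its values are assumptions standing in for position.positionLables, not data from the original file.

```python
import pandas as pd

# Toy stand-in for position.positionLables: each value is a list rendered as text.
labels = pd.Series(["['金融', '分析师']", "['数据分析']", "['分析师', '高级分析师']"])

counts = labels.str.count('分析师')                 # occurrences of the substring in each value
inner = labels.str[1:-1]                            # drop the surrounding [ and ]
cleaned = inner.str.replace("'", "", regex=False)   # remove the single quotes literally

print(counts.tolist())
print(cleaned.tolist())
```

Passing regex=False makes the replacement a plain string match, which avoids surprises when the character being removed has a special meaning in regular expressions.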

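(2) Handling missing values

fillna, dropna

position.loc[position.city=='深圳','city'] = np.nan         # np.nan is the preferred spelling (the np.NaN alias was removed in NumPy 2.0)

position.city.fillna('abc')    # fill missing values; returns a new Series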

Signature: position.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Docstring:Fill NA/NaN values using the specified method

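position.city=position.city.fillna('abc')    # assign the result back to keep it

position.dropna()          # drop every row that contains a missing value
position.dropna(axis=1)    # drop every column that contains a missing value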

Signature: position.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Docstring:Remove missing values.
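A small self-contained sketch of the same calls, assuming a made-up three-row frame (the column values are illustrative). It also shows the subset parameter from the dropna signature, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the position table.
df = pd.DataFrame({
    'city': ['上海', np.nan, '北京'],
    'salary': ['3K-6K', '2k-4k', np.nan],
    'positionId': [1, 2, 3],
})

filled = df['city'].fillna('unknown')      # new Series; df itself is unchanged
rows_kept = df.dropna()                    # rows with no missing value in any column
cols_kept = df.dropna(axis=1)              # columns with no missing value in any row
city_known = df.dropna(subset=['city'])    # only require `city` to be present

print(rows_kept)
print(city_known)
```

Because fillna and dropna return new objects by default, the result has to be assigned back (or inplace=True used) for the change to persist.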

(3) Removing duplicates

duplicated, drop_duplicates

position.positionId.duplicated()    # returns a boolean Series

Signature: position.positionId.duplicated(keep='first')   
Docstring:Indicate duplicate Series values.

s=pd.Series([1,1,2,3,4])
s.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool

s    # duplicated() does not flag the first occurrence of a value, only the later ones

0    1
1    1
2    2
3    3
4    4
dtype: int64

s[s.duplicated()]    # select the duplicated values together with their positions

1    1
dtype: int64

s[~s.duplicated()]    # select the non-duplicates: repeats are dropped and the first occurrence is kept

0    1
2    2
3    3
4    4
dtype: int64

s[~s.duplicated(keep='last')]    # keep the last occurrence instead

1    1
2    2
3    3
4    4
dtype: int64

s.drop_duplicates()    # returns the deduplicated values directly, whereas duplicated() returns a boolean mask

0    1
2    2
3    3
4    4
dtype: int64
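The Series examples carry over to a whole DataFrame, where the subset parameter decides which columns define a duplicate. A minimal sketch, assuming an invented four-row frame in which one positionId appears twice:

```python
import pandas as pd

# Hypothetical data: positionId 101 appears twice with different cities.
df = pd.DataFrame({
    'positionId': [101, 101, 102, 103],
    'city': ['上海', '北京', '上海', '上海'],
})

dup_mask = df.duplicated(subset=['positionId'])                      # flags the second 101 only
first_kept = df.drop_duplicates(subset=['positionId'])               # keep='first' is the default
last_kept = df.drop_duplicates(subset=['positionId'], keep='last')   # keep the later row instead

print(first_kept)
print(last_kept)
```

Without subset, two rows count as duplicates only when every column matches, so neither of the 101 rows would be dropped here.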

(3) apply: applying a function to each row or each column

position.companyId.astype('str')+'k'  # convert the int column to strings, then append 'k'

0 8581k
1 23177k
2 57561k
3 7502k

position.companyId.apply(lambda x:str(x)+'k')  # anonymous function applied to each value: takes x, returns str(x)+'k'

0 8581k
1 23177k
2 57561k
3 7502k

def func(x):
    return str(x)+ 'k'

position.companyId.apply(func)  # a named function is simpler and more convenient to reuse

0 8581k
1 23177k
2 57561k
3 7502k

# Note: the threshold is 130000, although the labels read 13000.
def func(x):
    if x> 130000:
        return '13000+k'
    else:
        return '0~13000'

position.companyId.apply(func)

0 0~13000
1 0~13000
2 0~13000
3 0~13000
4 13000+k
5 0~13000

Signature: position.companyId.apply(func, convert_dtype=True, args=(), **kwds)
Docstring:Invoke function on values of Series. Can be ufunc (a NumPy function
that applies to the entire Series) or a Python function that only works

position.apply(func)    # axis=0 (the default): func receives each column
position.apply(func,axis=1)    # axis=1: func receives each row

def func(x):
    if x.companyId > 130000:
        return '13000+k'
    else:
        return '0~13000'

position.apply(func,axis=1)

0 0~13000
1 0~13000
2 0~13000
3 0~13000
4 13000+k
5 0~13000
6 0~13000

def func(x):
    if x > 130000:
        return '13000+k'
    else:
        return '0~13000'

position.apply(lambda x: func(x.companyId), axis=1)    # func takes one value, so pass each row's companyId to it
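For a simple threshold like this one, a vectorized expression gives the same labels as apply without calling a Python function per value. A rough sketch under stated assumptions: the frame, the helper name `bucket`, and the label strings 'above' and 'below' are invented for illustration; only the 130000 threshold comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical companyId values; 130000 is the threshold used in the text.
df = pd.DataFrame({'companyId': [8581, 23177, 157744, 7502]})

def bucket(x):
    # Illustrative labels, not the ones used in the original post.
    return 'above' if x > 130000 else 'below'

via_apply = df['companyId'].apply(bucket)           # one Python call per value
via_where = pd.Series(
    np.where(df['companyId'] > 130000, 'above', 'below'),   # whole column at once
    index=df.index,
)

print(via_apply.tolist())
print(via_where.tolist())
```

apply remains the more general tool when the logic cannot be written as array operations, for example when a row-wise function needs several columns and branching.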

(4) Using apply with grouped aggregation

# top n rows within each city (the code ranks by companyId)

def func(x,n,asc):
    r = x.sort_values('companyId',ascending=asc)
    return r[:n]
position.groupby('city').apply(func,n=3,asc=True)    # each group is passed in as a DataFrame; each city keeps n rows, so the overall shape changes

position.sort_values('companyId',ascending=False)[:2]

	positionId	city	companyId	firstType	secondType	education	industryField	positionAdvantage	positionName	positionLables	salary	workYear
4718	2580536	成都	157744	市场/商务/销售类	销售	不限	金融、电子商务	环境优美，工作氛围轻松，充满激情的团队	网络销售/客服/分析师助理	['金融', '实习生', '在线', '经理', '销售']	3K-6K	不限
3577	2579118	北京	157665	产品/需求/项目类	数据分析	本科	数据服务	晋升空间大周末双休公司前景好优秀团队	分析师助理	['企业信用风险分']	2k-4k	不限
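The per-city top-n idea from this section can be checked on toy data. The sketch below assumes a made-up five-row frame and a helper named `top_n` (both hypothetical); it also shows an alternative that sorts once and uses groupby(...).head(n) instead of apply.

```python
import pandas as pd

# Hypothetical data: three rows for one city, two for another.
df = pd.DataFrame({
    'city': ['上海', '上海', '上海', '北京', '北京'],
    'companyId': [5, 9, 1, 7, 3],
})

def top_n(group, n, asc):
    # Each group arrives as a DataFrame holding one city's rows.
    return group.sort_values('companyId', ascending=asc)[:n]

# Selecting the column explicitly keeps the grouping key out of each group.
via_apply = df.groupby('city')[['companyId']].apply(top_n, n=2, asc=False)

# Alternative without apply: sort once, then take the first n rows of every group.
via_head = df.sort_values('companyId', ascending=False).groupby('city').head(2)

print(via_apply)
print(via_head)
```

The apply version returns a result indexed by city and original row label, while the head version keeps the original flat index and all columns.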

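position.groupby('city')[['positionId','companyId']].agg(['sum','mean'])[:6]    ## aggregate; the numeric columns are selected explicitly because newer pandas versions no longer drop non-numeric columns silently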

	positionId		companyId
	sum	mean	sum	mean
city
上海	2110318627	2.155586e+06	57673139	58910.254341
北京	5172979443	2.204082e+06	129070908	54993.995739
南京	187518208	2.259256e+06	5298264	63834.506024
厦门	58501438	1.950048e+06	1951895	65063.166667
天津	44432139	2.221607e+06	1451047	72552.350000
广州	742563610	2.216608e+06	20170238	60209.665672

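## aggregate reduces each group to one value per column, so the result always has one row per group and its shape is predictable; apply can also combine or split each group's data (more flexible, and used more often)

position.groupby('city')[['positionId','companyId']].agg(lambda x:max(x)-min(x))[:6]   # any function works here as long as it returns one value per column and does not change the table's shape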

	positionId	companyId
city
上海	2406115	157090
北京	2482622	157622
南京	1476696	151545
厦门	2382465	144494
天津	1938475	128827
广州	2433163	156545

Signature: position.agg(func, axis=0, *args, **kwargs)
Docstring: Aggregate using one or more operations over the specified axis.
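A compact sketch of the two agg calls on invented data, to make the "one row per group" behaviour concrete. The frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with two cities.
df = pd.DataFrame({
    'city': ['上海', '上海', '北京', '北京', '北京'],
    'positionId': [10, 30, 5, 25, 15],
    'companyId': [1, 4, 2, 8, 6],
})

grouped = df.groupby('city')[['positionId', 'companyId']]

summary = grouped.agg(['sum', 'mean'])                     # one row per city, two-level columns
spread = grouped.agg(lambda col: col.max() - col.min())    # custom reducer: range of each column

print(summary)
print(spread)
```

Each function passed to agg must reduce a column to a single value; a function that returns several rows per group belongs in apply instead.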

(5) Pivot tables

position.pivot_table( values= 'companyId',index=['city','education'],columns='workYear', aggfunc=[np.sum, np.mean])[:4]     # index accepts several keys (MultiIndex)
 # decide up front what table layout you want; aggfunc defaults to mean
 # requires: import numpy as np

		sum							mean
	workYear	1-3年	10年以上	1年以下	3-5年	5-10年	不限	应届毕业生	1-3年	10年以上	1年以下	3-5年	5-10年	不限	应届毕业生
city	education
上海	不限	946395.0	102820.0	224569.0	597364.0	202081.0	2195421.0	NaN	52577.500000	102820.0	112284.500000	49780.333333	67360.333333	68606.906250	NaN
	博士	187978.0	NaN	NaN	NaN	NaN	131806.0	NaN	93989.000000	NaN	NaN	NaN	NaN	131806.000000	NaN
	大专	3156389.0	NaN	160336.0	2819379.0	533659.0	966104.0	47063.0	73404.395349	NaN	80168.000000	62652.866667	76237.000000	80508.666667	47063.000000
	本科	12624586.0	50020.0	580385.0	14567700.0	7084628.0	4126611.0	1331155.0	53494.008475	25010.0	52762.272727	54357.089552	62145.859649	62524.409091	51198.269231

Signature: position.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Docstring: Create a spreadsheet-style pivot table as a DataFrame. The levels in

position.pivot_table( values= ['companyId','positionId'],
                     columns='workYear', 
                     index=['city','education'],
                     aggfunc={'companyId':np.sum, 'positionId':np.mean})[:5]  # slice the first 5 rows
# sum companyId and average positionId: pass a dict to aggfunc instead of a list

		companyId							positionId
	workYear	1-3年	10年以上	1年以下	3-5年	5-10年	不限	应届毕业生	1-3年	10年以上	1年以下	3-5年	5-10年	不限	应届毕业生
city	education
上海	不限	946395.0	102820.0	224569.0	597364.0	202081.0	2195421.0	NaN	2.087630e+06	2294237.0	2385121.0	1.947482e+06	1.864679e+06	2.239489e+06	NaN
	博士	187978.0	NaN	NaN	NaN	NaN	131806.0	NaN	2.529990e+06	NaN	NaN	NaN	NaN	1.896700e+06	NaN
	大专	3156389.0	NaN	160336.0	2819379.0	533659.0	966104.0	47063.0	2.311422e+06	NaN	2291928.0	2.158873e+06	1.964906e+06	2.275759e+06	2.463114e+06
	本科	12624586.0	50020.0	580385.0	14567700.0	7084628.0	4126611.0	1331155.0	2.088047e+06	2328147.0	2312697.0	2.138806e+06	2.234212e+06	2.207387e+06	2.271003e+06
	硕士	1130985.0	48294.0	98495.0	1122603.0	973406.0	1425358.0	237549.0	1.859358e+06	1793759.0	2574210.0	2.106698e+06	2.174119e+06	2.262192e+06	2.471457e+06

position.pivot_table( values= ['companyId','positionId'],
                     columns='workYear', 
                     index=['city','education'],
                     aggfunc={'companyId':np.sum, 'positionId':np.mean}).reset_index().to_csv('test.csv')   # write the result to a CSV file
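The dict form of aggfunc can be verified on a frame small enough to compute by hand. This is a sketch with invented rows; it uses the string names 'sum' and 'mean', which pandas accepts in place of the NumPy functions, and stops at reset_index() rather than writing a file.

```python
import pandas as pd

# Hypothetical rows; column names follow the position table.
df = pd.DataFrame({
    'city': ['上海', '上海', '上海', '北京'],
    'education': ['本科', '本科', '硕士', '本科'],
    'workYear': ['1-3年', '1-3年', '3-5年', '1-3年'],
    'companyId': [10, 20, 30, 40],
    'positionId': [1, 3, 5, 7],
})

table = df.pivot_table(
    values=['companyId', 'positionId'],
    index=['city', 'education'],
    columns='workYear',
    aggfunc={'companyId': 'sum', 'positionId': 'mean'},   # a different function per column
)

flat = table.reset_index()   # index levels become ordinary columns again

print(table)
```

Combinations that never occur in the data, such as 上海 / 硕士 / 1-3年 here, show up as NaN; the fill_value parameter in the signature above can replace them.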