Pandas

Latest recommended article published on 2024-02-06 14:00:31

某僧


Tags: python

Article link: https://blog.csdn.net/weixin_45656737/article/details/109297757

NumPy & Pandas (√) & SciPy

First up, W3Cschool. Standing on the shoulders of giants~

Pandas

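Common attributes and helpers: index (for a default RangeIndex it shows start, stop, and step), columns, dtypes, the built-in type(), and describe() (summary statistics: count, mean, std, min, max, and the quartiles). IPython magic command: %timeit runs a statement many times and reports the average time per run.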
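Data structures
There are two:

Dimensions	Name	Description
1	Series	A labeled one-dimensional homogeneous array; one row or one column of a DataFrame. pd.Series can turn a dict into a Series.
2	DataFrame	A labeled, size-mutable, two-dimensional heterogeneous table.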

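A DataFrame is a container of Series, and a Series is a container of scalars. In Pandas the two axes are the rows, index (axis 0), and the columns, columns (axis 1). When working with data, prefer creating new objects over modifying the original data.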
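A minimal sketch of the two structures and the attributes listed above. The data and the variable names (prices, df, stats) are made up for illustration.

```python
import pandas as pd

# A Series built from a dict: the keys become the index labels.
prices = pd.Series({"apple": 3, "pear": 5})

# A DataFrame built from a dict of lists: the keys become the column names.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Each column (and each row) of a DataFrame is itself a Series.
col_a = df["a"]
row_0 = df.loc[0]

# The default index is a RangeIndex, which carries start, stop and step.
idx = df.index
column_names = list(df.columns)
dtypes = df.dtypes

# describe() is a method that returns summary statistics per numeric column.
stats = df.describe()
```

Here df.dtypes shows that columns may have different types (heterogeneous), while each single column has exactly one type (homogeneous).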

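Import two packages, pandas and numpy: numpy provides operations on arrays, and pandas provides operations on its own data structures. For how the two relate, see: relationship. numpy's data structure is the array (ndarray).
as plus an alias gives the imported package a short name for convenience.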

import numpy as np
import pandas as pd

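Querying a Series works like a dict: seriesname['labelname']. To select columns (Series) from a DataFrame, use df[['column1','column2']], or loc (see the loc entry below).
Use pd.DataFrame to turn a dict into a DataFrame.
loc: label-based query. Syntax:
One row: df.loc[index];
Several rows: df.loc[beginindex:endindex];
A single value: df.loc['indexname','columnname'] (note: a list slice excludes the end element, but a loc slice includes it).
Summary: loc accepts a list, a column name, a row label, or a range (delimited with a colon :); names must be quoted.
Advanced: query with a conditional expression. Syntax: df.loc[df['columnname']>num,:]. Join several conditions with &, and wrap each condition in parentheses.
Query with a function: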

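# Method 1: use a lambda
df.loc[lambda df: (condition), :]
# Method 2: use a named function
def functionname(df):
    # The parentheses are required: & binds more tightly than ==
    return df.index.str.startswith("string") & (df['columnname'] == condition)
# Call it
df.loc[functionname, :]

# All rows where the columnname column is not null
df.loc[df['columnname'].notnull(), :]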
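The loc forms above can be tried on a small table. This is only a sketch: the city, temp, and aqi columns and the row labels are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Beijing", "Shanghai", "Shenzhen", "Suzhou"],
        "temp": [30, 33, 35, 31],
        "aqi": [80.0, None, 40.0, 55.0],
    },
    index=["d1", "d2", "d3", "s1"],
)

one_row = df.loc["d1"]              # a single row, returned as a Series
row_range = df.loc["d1":"d3"]       # label slice: the end label "d3" IS included
one_value = df.loc["d2", "temp"]    # a single cell

# Several conditions: each one in parentheses, joined with &.
hot = df.loc[(df["temp"] > 30) & (df["aqi"] < 60), :]

# A callable receives the DataFrame and returns a boolean mask.
d_rows = df.loc[lambda d: d.index.str.startswith("d"), :]

# Rows where aqi is present.
has_aqi = df.loc[df["aqi"].notnull(), :]
```

A comparison against a missing value is False, which is why row d2 drops out of hot even though its temperature is above 30.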

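enumerate() wraps an iterable (such as a list, tuple, or string) into an indexed sequence that yields each index together with its value; it is typically used in a for loop.
iloc (slicing by integer position). Syntax: dataframe.iloc[startindex:endindex]. It can split a large table into smaller ones; use concat to merge the small tables back into a large one. See also at and iat.
Set the index: set_index('name of the column to use as the index', inplace=True, drop=False), where drop=False keeps the column that becomes the index;
.index.is_monotonic_increasing: whether the index is increasing;
.index.is_unique: whether the index is unique;
.sort_index(inplace=True): sort by the index.
Distinct values: unique()
Size: .size (an attribute, not a method; the number of elements)
Read Excel: pd.read_excel('src/sec…', skiprows=2), where skiprows=2 skips the first two rows;
Read a .dat file: pd.read_csv('xxx.dat', sep='::', engine='python', names='column1::column2::column3::…'.split('::'))
Write Excel: df.to_excel('src', index=False); index=False leaves out the index column.
Replace: df.loc[:,'columnname'] = df['columnname'].str.replace('text to replace','replacement').astype('int32'). replace is a method of the str accessor; astype('int32') converts the data type.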
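Adding new columns
Direct assignment: df.loc[:,'newcolumnname'] = expression
The apply method takes an axis; with axis=1 the function receives one row at a time (it works across the columns). Syntax: df.loc[:,'newcolumnname'] = df.apply(functionname, axis=1).
assign: adds several columns at once, like several direct assignments in one call. It returns a new object with the columns added and leaves the original data unchanged. Syntax: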

df.assign(newcolumnname1=lambda x:expression1,
		newcolumnname2= lambda x:expression2)

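Assigning values group by group, selected by condition

# Broadcast: create the column with a default value
df['newcolumnname'] = ''
# Select the rows that meet the condition and assign a value to that column
df.loc[expression, 'newcolumnname'] = 'expression'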
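The four ways of adding a column described above, side by side. A sketch only: the high and low columns and the label helper are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"high": [30, 12, 25], "low": [20, 2, 18]})

# 1. Direct assignment of a vectorised expression.
df.loc[:, "diff"] = df["high"] - df["low"]

# 2. apply with axis=1: the function receives one row at a time.
def label(row):
    return "hot" if row["high"] > 28 else "mild"

df.loc[:, "label"] = df.apply(label, axis=1)

# 3. assign returns a NEW DataFrame; df itself is left unchanged.
df2 = df.assign(
    high_f=lambda x: x["high"] * 9 / 5 + 32,
    low_f=lambda x: x["low"] * 9 / 5 + 32,
)

# 4. Conditional assignment: set a default first, then overwrite matching rows.
df["range_type"] = "small"
df.loc[df["diff"] > 8, "range_type"] = "large"
```

For simple arithmetic the vectorised forms (1, 3, 4) are usually much faster than apply with axis=1, which calls a Python function once per row.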

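value_counts() counts how many times each value occurs in a column.
Covariance: df['columnname1'].cov(df['columnname2'])
Correlation coefficient: df['columnname1'].corr(df['columnname2'])
Relationship between covariance and the correlation coefficient: the correlation coefficient is the covariance divided by the product of the two standard deviations.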
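A quick check of that relationship on made-up numbers (the x and y columns are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 5.0, 9.0]})

cov_xy = df["x"].cov(df["y"])     # sample covariance
corr_xy = df["x"].corr(df["y"])   # Pearson correlation coefficient

# The correlation is the covariance rescaled by both standard deviations,
# which is why it always lies between -1 and 1.
rescaled = cov_xy / (df["x"].std() * df["y"].std())
```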
Functions for handling missing values

isnull() returns True/False and works on a whole table or on a single Series. The opposite function is notnull().
Using dropna

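# Drop by column (axis='index' drops by row); a column is dropped only if all of its values are null; modifies the table in place
df.dropna(axis='columns', how='all', inplace=True)

Using fillna

# na means a missing value; fill the missing values with 0
df.fillna({'columnname': 0})
df.loc[:, 'columnname'] = df['columnname'].fillna(0)

Forward fill

# ffill fills each gap with the previous value (older code spells this fillna(method='ffill'), which is deprecated in recent pandas)
df.loc[:, 'columnname'] = df['columnname'].ffill()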
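Putting the missing-value functions together on a tiny table. A sketch with invented columns (name, score, empty):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["a", None, "c", None],
    "score": [1.0, np.nan, np.nan, 4.0],
    "empty": [np.nan] * 4,
})

mask = df["score"].isnull()                  # True where the value is missing
df = df.dropna(axis="columns", how="all")    # drops only the all-missing column
df["score"] = df["score"].fillna(0)          # fill with a constant
df["name"] = df["name"].ffill()              # fill with the previous value
```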

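SettingWithCopyWarning: raised when a value is assigned to an object obtained by chained indexing, which may be a view or a copy of the original data. Fix: do the assignment in a single loc call on the original, or call copy() first and then modify the copy.
Sorting: series.sort_values(ascending=True, inplace=False); ascending defaults to True (ascending order), and inplace=True modifies the original Series;
df.sort_values(by, ascending=True, inplace=False); by is a string or a list, for sorting by one or by several columns; ascending may be a list that sets ascending or descending order for each column.
String handling: get the str accessor first; only a Series has str, a DataFrame does not. Several str methods can be chained by writing them one after another (.str.method().str.method()…).
Each str method returns a new Series, so .str has to be written again before every method in the chain.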

str method	Description
Series.str.capitalize(*args, **kwargs)	Convert strings in the Series/Index to be capitalized.
Series.str.casefold(*args, **kwargs)	Convert strings in the Series/Index to be casefolded.
Series.str.cat(*args, **kwargs)	Concatenate strings in the Series/Index with given separator.
Series.str.center(*args, **kwargs)	Pad left and right side of strings in the Series/Index.
Series.str.contains(*args, **kwargs)	Test if pattern or regex is contained within a string of a Series or Index. Returns True or False for each element.
Series.str.count(*args, **kwargs)	Count occurrences of pattern in each string of the Series/Index.
Series.str.decode(encoding[, errors])	Decode character string in the Series/Index using indicated encoding.
Series.str.encode(*args, **kwargs)	Encode character string in the Series/Index using indicated encoding.
Series.str.endswith(*args, **kwargs)	Test if the end of each string element matches a pattern.
Series.str.extract(*args, **kwargs)	Extract capture groups in the regex pat as columns in a DataFrame.
Series.str.extractall(*args, **kwargs)	Extract capture groups in the regex pat as columns in DataFrame.
Series.str.find(*args, **kwargs)	Return lowest indexes in each strings in the Series/Index.
Series.str.findall(*args, **kwargs)	Find all occurrences of pattern or regular expression in the Series/Index. Returns a list of matches for each element.
Series.str.get(i)	Extract element from each component at specified position.
Series.str.index(*args, **kwargs)	Return lowest indexes in each string in Series/Index.
Series.str.join(*args, **kwargs)	Join lists contained as elements in the Series/Index with passed delimiter.
Series.str.len(*args, **kwargs)	Compute the length of each element in the Series/Index.
Series.str.ljust(*args, **kwargs)	Pad right side of strings in the Series/Index.
Series.str.lower(*args, **kwargs)	Convert strings in the Series/Index to lowercase.
Series.str.lstrip(*args, **kwargs)	Remove leading characters.
Series.str.match(*args, **kwargs)	Determine if each string starts with a match of a regular expression.
Series.str.normalize(*args, **kwargs)	Return the Unicode normal form for the strings in the Series/Index.
Series.str.pad(*args, **kwargs)	Pad strings in the Series/Index up to width.
Series.str.partition(*args, **kwargs)	Split the string at the first occurrence of sep.
Series.str.repeat(*args, **kwargs)	Duplicate each string in the Series or Index.
Series.str.replace(*args, **kwargs)	Replace each occurrence of pattern/regex in the Series/Index. E.g. .str.replace('[年月日]', '', regex=True): with a regular expression, every match of the pattern is replaced.
Series.str.rfind(*args, **kwargs)	Return highest indexes in each strings in the Series/Index.
Series.str.rindex(*args, **kwargs)	Return highest indexes in each string in Series/Index.
Series.str.rjust(*args, **kwargs)	Pad left side of strings in the Series/Index.
Series.str.rpartition(*args, **kwargs)	Split the string at the last occurrence of sep.
Series.str.rstrip(*args, **kwargs)	Remove trailing characters.
Series.str.slice([start, stop, step])	Slice substrings from each element in the Series or Index. E.g. .str.slice(0, 6), or .str[0:6] in slice notation.
Series.str.slice_replace(*args, **kwargs)	Replace a positional slice of a string with another value.
Series.str.split(*args, **kwargs)	Split strings around given separator/delimiter. E.g. inside a function applied to each row x: year, month, day = x['ymd'].split('-'), then join the parts with an f-string and {} placeholders.
Series.str.rsplit(*args, **kwargs)	Split strings around given separator/delimiter, starting from the right.
Series.str.startswith(*args, **kwargs)	Test if the start of each string element matches a pattern.
Series.str.strip(*args, **kwargs)	Remove leading and trailing characters.
Series.str.swapcase(*args, **kwargs)	Convert strings in the Series/Index to be swapcased.
Series.str.title(*args, **kwargs)	Convert strings in the Series/Index to titlecase.
Series.str.translate(*args, **kwargs)	Map all characters in the string through the given mapping table.
Series.str.upper(*args, **kwargs)	Convert strings in the Series/Index to uppercase.
Series.str.wrap(*args, **kwargs)	Wrap strings in Series/Index at specified line width.
Series.str.zfill(*args, **kwargs)	Pad strings in the Series/Index by prepending '0' characters.
Series.str.isalnum(*args, **kwargs)	Check whether all characters in each string are alphanumeric.
Series.str.isalpha(*args, **kwargs)	Check whether all characters in each string are alphabetic.
Series.str.isdigit(*args, **kwargs)	Check whether all characters in each string are digits.
Series.str.isspace(*args, **kwargs)	Check whether all characters in each string are whitespace.
Series.str.islower(*args, **kwargs)	Check whether all characters in each string are lowercase.
Series.str.isupper(*args, **kwargs)	Check whether all characters in each string are uppercase.
Series.str.istitle(*args, **kwargs)	Check whether all characters in each string are titlecase.
Series.str.isnumeric(*args, **kwargs)	Check whether all characters in each string are numeric; returns True or False.
Series.str.isdecimal(*args, **kwargs)	Check whether all characters in each string are decimal.
Series.str.get_dummies(*args, **kwargs)	Return DataFrame of dummy/indicator variables for Series.
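A few of the str methods from the table, chained on sample data. This is a sketch: the ymd and temp columns are assumptions made for the example, and regex is passed explicitly because its default differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"ymd": ["2018-01-01", "2018-02-15"], "temp": ["3℃", "-6℃"]})

# Strip the unit, then convert the remaining text to integers.
temp_int = df["temp"].str.replace("℃", "", regex=False).astype("int32")

# Boolean result, usable as a row filter.
is_feb = df["ymd"].str.startswith("2018-02")

# Chaining: .str must be written again before every method.
month_compact = df["ymd"].str.replace("-", "", regex=False).str.slice(0, 6)
same_by_index = df["ymd"].str.replace("-", "", regex=False).str[0:6]

# split with expand=True spreads the parts over columns.
parts = df["ymd"].str.split("-", expand=True)

# A regular expression replaces every character that matches the class.
stripped = pd.Series(["2018年01月01日"]).str.replace("[年月日]", "", regex=True)
```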

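axis=0 / 'index': rows
axis=1 / 'columns': columns
For aggregations (such as mean), the axis you name is the one that gets traversed and collapsed, while the other axis is kept. Put simply: for an n*m df, the dimension along the named axis disappears and the other one stays the same.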
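The axis rule is easiest to see on a 3*4 table. A small sketch with generated numbers:

```python
import numpy as np
import pandas as pd

# 3 rows x 4 columns holding the numbers 0..11.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

# axis=0: walk down the rows, so the 3 rows collapse; one value per column.
col_means = df.mean(axis=0)

# axis=1: walk across the columns, so the 4 columns collapse; one value per row.
row_means = df.mean(axis=1)

# drop follows the same naming: axis=1 removes a column, axis=0 removes a row.
without_a = df.drop("A", axis=1)
without_row0 = df.drop(0, axis=0)
```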
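Index: used for data lookup and for automatic alignment (operations act on the data whose index labels match). Kinds of index: categorical index (CategoricalIndex), multi-level index (MultiIndex, for example the result of a groupby over several keys), and datetime index (DatetimeIndex).
Merge: joins different tables on a key into one table (one on the left, one on the right). It can also produce a Cartesian product.
Note: inner: a row is kept only if its key exists in both tables. left: everything on the left is kept; where a left key has no match on the right, the right-hand values are set to null. right: everything on the right is kept; where a right key has no match on the left, the left-hand values are set to null. outer: both sides are kept, and data that does not exist is set to null. When the two tables share a non-key column name, merging with on='key' appends the suffixes _x and _y to the clashing column names; custom suffixes can be given with on='key', suffixes=('suffix1','suffix2').
Relationships between the two tables: one-to-one, one-to-many, many-to-many; one-to-many and many-to-many merges duplicate rows.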
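The join types can be compared on two small tables. A sketch with invented data; how='cross' assumes pandas 1.2 or later.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "c", "d"], "val": [20, 30, 31, 40]})

# inner: only keys present on both sides; "c" matches twice (one-to-many).
inner = pd.merge(left, right, on="key", how="inner")

# left: every left row is kept; "a" has no partner, so val_y is NaN there.
left_join = pd.merge(left, right, on="key", how="left")

# outer: keys from both sides, with custom suffixes for the clashing column.
outer = pd.merge(left, right, on="key", how="outer", suffixes=("_left", "_right"))

# Cartesian product: every left row paired with every right row.
cross = pd.merge(left[["key"]], right[["val"]], how="cross")
```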
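concat, append

By default concat merges by rows, putting the second table below the first. The default join is outer, so fields that exist in one table but not in the other are all kept, with the missing values set to NaN. With ignore_index=True a new index is generated. DataFrame.append was removed in pandas 2.0; use concat instead. Example: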

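# Separate the items with commas
pd.concat([dataframe1, dataframe2])
# One DataFrame and one Series: axis=1 adds a column. Several Series in the brackets add several columns;
# the list may hold only Series, or a mix of DataFrames and Series
pd.concat([dataframe1, series], axis=1)

# Aside: constructing a DataFrame
# Note: the argument is a dict in braces; column names are quoted, a colon separates each column name from its data, and the data goes in square brackets
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]})
data = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  # two rows; the first inner [] is the first row, columns gives the column names
data = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))  # column names built with list()

# To add rows to a DataFrame one at a time, append inside a for loop was the slow way; the second way is
# to pass the DataFrames to concat
pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)], ignore_index=True)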
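A runnable version of the concat cases above, on made-up frames (df1, df2, and extra are illustrative names):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5], "C": [6]})

# Row-wise, outer join on the columns: B and C are kept and padded with NaN.
stacked = pd.concat([df1, df2], ignore_index=True)

# join="inner" keeps only the columns both frames share.
common = pd.concat([df1, df2], ignore_index=True, join="inner")

# Column-wise: a named Series becomes a new column, aligned on the index.
extra = pd.Series([7, 8], name="D")
wider = pd.concat([df1, extra], axis=1)

# Building a frame row by row: collect the pieces, then concat once.
rows = pd.concat(
    [pd.DataFrame([i], columns=["A"]) for i in range(5)],
    ignore_index=True,
)
```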

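Aggregation functions: sum(), mean(), std(), max(), min()
df.groupby('columnName'): groups by that column. Iterating over the result yields two things per group: the group name and the group's data. get_group('groupname') fetches one group.
Grouping by a combination of columns: df.groupby(['columnName1','columnName2'], as_index=False); as_index=False keeps the grouping columns as ordinary columns instead of turning them into a multi-level index. get_group then takes a tuple with one value per grouping column: get_group(('value1','value2')).
Using aggregation functions flexibly: aggregation functions (sum, mean, std) called after groupby aggregate each group separately. 1. Simple use: .functionname(). 2. Several aggregations at once: .agg([np.sum, np.mean, np.std]). 3. Aggregate one selected column after groupby: df.groupby('columnname')['columnnameSelected'].agg(…); to name the resulting columns, pass newcolumnname=np.functionname. 4. Different functions for different columns: df.groupby('columnName').agg({'columnname1': np.sum, 'columnname2': np.mean})
Hierarchical index: a MultiIndex expresses higher-dimensional data and makes filtering easier. groupby with several keys produces one; the index is then a combination of the keys.
unstack() turns the second-level index into column names;
reset_index() turns the index levels into ordinary columns (filling in the blanks).
Data can be fetched with loc. Syntax: ser.loc['groupname1','groupname2'], ser.loc['groupname1'], ser.loc[:,'groupname2']. When selecting, a tuple (key1,key2) addresses several levels, one key per level; a list [key1,key2] means several keys on the same level, side by side. To select all values of one level, use slice(None).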
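A sketch of groupby, agg, and MultiIndex selection on invented data. Function names are passed as strings here instead of the np.sum style used above, since newer pandas versions warn about the latter.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["BJ", "BJ", "SH", "SH", "SH"],
    "month": [1, 2, 1, 1, 2],
    "temp": [0, 4, 6, 8, 10],
})

# Several aggregations of one column, computed per group.
by_city = df.groupby("city")["temp"].agg(["sum", "mean", "max"])

# One group as a DataFrame.
bj = df.groupby("city").get_group("BJ")

# Two keys give a Series with a two-level MultiIndex (city, month).
ser = df.groupby(["city", "month"])["temp"].mean()

sh_all = ser.loc["SH"]        # every month for SH
sh_jan = ser.loc[("SH", 1)]   # tuple: one key per level
month2 = ser.loc[:, 2]        # month 2 for every city

wide = ser.unstack()          # the inner level (month) becomes the columns
flat = ser.reset_index()      # both levels become ordinary columns
```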
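Data conversion functions: map, apply, applymap.

map (value-to-value mapping): used on a Series; pass in a mapping (dict) to convert the corresponding DataFrame column. A function can be passed too: .map(lambda x: mapping[x]).
apply works on both Series and DataFrame. series.apply(function): the function receives each value and processes the Series value by value. dataframe.apply(function): the function receives the Series along the chosen axis and processes the DataFrame one Series at a time.
applymap works on a DataFrame, element by element. Syntax: df.applymap(lambda x: int(x)) converts every value to an integer. (In pandas 2.1 and later this method is named DataFrame.map.)
groupby + apply: first split the data into groups, then apply the function to each group, and finally combine the results.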

# Code example 1
# The argument is the DataFrame of one group
def ratings_norm(df):
    min_rating = df['rating'].min()
    max_rating = df['rating'].max()
    # New column: the min-max normalized rating
    df['rating_norm'] = df['rating'].apply(lambda x: (x - min_rating) / (max_rating - min_rating))
    # Return df; the returned df may look completely different from the input
    return df
ratings = ratings.groupby('userId').apply(ratings_norm)
# Code example 2
def getWenduTopX(df, topn):
    # For a DataFrame, the first argument of sort_values is by
    # Select two columns with a list
    # The final slice takes the highest values: sort_values sorts ascending by default, so count from the end; the last position is -1
    return df.sort_values(by='bWendu')[['ymd', 'bWendu']][-topn:]
df.groupby('month').apply(getWenduTopX, topn=1).head()  # head() shows only the first few months
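map and apply from the list above, plus a per-group min-max normalization. Note that the last step uses groupby(...).transform, a swapped-in alternative to the apply-based function in code example 1, not the same code. The user, code, and rating columns and the names dict are invented for the sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2"],
    "code": ["a", "b", "a", "b"],
    "rating": [1.0, 5.0, 2.0, 4.0],
})
names = {"a": "Alpha", "b": "Beta"}

# map with a dict, and the same lookup written as a function.
df["name"] = df["code"].map(names)
df["name_fn"] = df["code"].map(lambda x: names[x])

# Series.apply: the function receives one value at a time.
df["stars"] = df["rating"].apply(lambda x: "*" * int(x))

# DataFrame.apply: the function receives a whole column (a Series).
spread = df[["rating"]].apply(lambda col: col.max() - col.min())

# Split-apply-combine: transform returns one value per original row,
# so the per-group min and max line up with df directly.
grouped = df.groupby("user")["rating"]
g_min, g_max = grouped.transform("min"), grouped.transform("max")
df["rating_norm"] = (df["rating"] - g_min) / (g_max - g_min)
```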

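unstack and pivot implement pivoting (turning column-style data into a two-dimensional cross table that is easier to analyze).
unstack does the two-dimensional pivot: it turns one index level into columns (when the result is plotted, the index becomes the x axis and each column is drawn against the y axis). Syntax: dataframe.unstack(level=-1, fill_value=None); level=-1 moves the innermost index level into the columns.
stack is the reverse: two dimensions become one, with the column labels turned into an index level; unstack and stack are inverse operations. Syntax: dataframe.stack(level=-1, dropna=True); level=-1 makes the stacked column labels the innermost level of the resulting multi-level index, and level may be 0, 1, 2 to name a specific level.
pivot is a shortcut for pivoting, equivalent to building a hierarchical index with set_index and then calling unstack. Syntax: df_reset.pivot(index='indexname', columns='columnname', values='values')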
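A sketch showing that groupby + unstack and pivot arrive at the same cross table, and that stack goes back. The month, rating, and count columns are invented.

```python
import pandas as pd

long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "rating": [1, 2, 1, 2],
    "count": [10, 20, 30, 40],
})

# Two-key groupby gives a MultiIndex Series; unstack moves the inner
# level (rating) into the columns.
wide = long_df.groupby(["month", "rating"])["count"].sum().unstack()

# pivot builds the same table directly from the flat frame.
same = long_df.pivot(index="month", columns="rating", values="count")

# stack is the inverse: the columns go back into the index.
back = wide.stack()
```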
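Working with dates:
pd.to_datetime: a function that converts strings, lists, or Series into dates in one uniform format. Example: pd.to_datetime(df['Timestamp'], unit='s')
Timestamp: the pandas object for a single date. Advantage: with dates as the index, loc can select a day, a time range, a month, a range of months, or a year directly, as slice queries. Week, month, and quarter are available as attributes: .week .month .quarter
DatetimeIndex: the pandas object for a list of dates. Month and quarter are available as attributes: .month .quarter; for the week use .isocalendar().week (DatetimeIndex.week was removed in pandas 2.0).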
Attributes

[Image: table of date attributes]

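Handling missing dates
df.reindex: conforms the DataFrame to a new index. First build a continuous run of dates (a DatetimeIndex) with pd.date_range(start='startdate', end='enddate'), then call .reindex(date_range, fill_value=0) to add the missing dates and fill their data with 0.
df.resample: resamples a time series according to a sampling rule, which also fills in missing dates. Principle: the rule changes the time frequency. For example, if one day is missing, sampling every two days with the mean aggregation covers the gap. Example: df.resample('2D').mean(). Another example: df.resample('D').mean().fillna(0) computes the mean per day and fills missing values with 0.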
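Both ways of filling a gap in the dates, on a made-up page-view table where 2019-12-03 is missing (the pv column is an assumption):

```python
import pandas as pd

# Seconds since the epoch converted to a Timestamp.
ts = pd.to_datetime(1575158400, unit="s")

df = pd.DataFrame(
    {"pv": [100, 200, 400]},
    index=pd.to_datetime(["2019-12-01", "2019-12-02", "2019-12-04"]),
)

# reindex: conform to a complete date range and fill the new rows with 0.
full_range = pd.date_range(start="2019-12-01", end="2019-12-05")
filled = df.reindex(full_range, fill_value=0)

# resample: one bucket per day; empty buckets become NaN, then 0.
daily = df.resample("D").mean().fillna(0)

# Two-day buckets: the missing day is absorbed into its bucket.
two_day = df.resample("2D").mean()

# With a DatetimeIndex, loc accepts partial date strings.
december = df.loc["2019-12"]
months = df.index.month
```

reindex extends the table to whatever range is requested (five days here), while resample only spans the dates that are present (four days here).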
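vlookup: use the merge function
Reordering columns: remove + insert + reindex

# remove deletes the column name being moved, insert puts it at the target position, and reindex with the reordered list reorders the table
column_list = list(df.columns)
for name in ['name', 'sex'][::-1]:  # [::-1] walks the list in reverse order
    column_list.remove(name)
    column_list.insert(column_list.index('columnBeforeName') + 1, name)
df = df.reindex(columns=column_list)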

pyecharts, the Python version of ECharts
Note: Python 3.5 can only use pyecharts v0.5.x; pyecharts v1.5.x requires Python 3.6+. So on Python 3.5 you cannot simply run pip install pyecharts.
Querying
Method 1: join several conditions with &, wrapping each condition in ():

df[(df['bwendu']<=30) & (df['ywendu']>=15) & (df['tianqi']=='sun') & (df['aqilevel']==1)]

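Method 2

# Compared with the first method, there is no need to go through df to reach the columns, and no parentheses are needed.
# The expression passed to query is a string and may contain arithmetic
# The @ symbol refers to a variable defined outside the expression
df.query("bwendu<=30 & ywendu>=15 & tianqi=='sun' & aqilevel==1")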
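The two query styles side by side on made-up weather rows. The column names bWendu, yWendu, and tianqi follow the spelling of the groupby example; low is a hypothetical outer variable.

```python
import pandas as pd

df = pd.DataFrame({
    "bWendu": [28, 35, 25],
    "yWendu": [18, 25, 10],
    "tianqi": ["sun", "sun", "rain"],
})

# Method 1: boolean masks, each condition in parentheses.
by_mask = df[(df["bWendu"] <= 30) & (df["yWendu"] >= 15) & (df["tianqi"] == "sun")]

# Method 2: the same filter as a query string; string values use inner quotes.
by_query = df.query("bWendu <= 30 & yWendu >= 15 & tianqi == 'sun'")

# @ pulls in a variable from the surrounding Python code.
low = 15
with_var = df.query("yWendu >= @low")

# Arithmetic is allowed inside the expression.
big_gap = df.query("bWendu - yWendu > 10")
```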

expr: the query expression (a string)

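Iteration: df.iterrows() returns each row as a Series; the slowest option, acceptable for small data
df.itertuples() returns each row as a namedtuple; medium speed. Recommended
for + zip returns plain tuples; the fastest option, suited to large data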
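The three iteration styles computing the same total on a toy frame. No timings are shown; on a real table each one could be wrapped in %timeit to compare.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# iterrows: yields (index, row) pairs, each row being a Series.
total_rows = sum(row["A"] + row["B"] for _, row in df.iterrows())

# itertuples: yields namedtuples; columns are read as attributes.
total_tuples = sum(row.A + row.B for row in df.itertuples())

# zip over the columns: plain Python values, no per-row objects built.
total_zip = sum(a + b for a, b in zip(df["A"], df["B"]))
```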
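Fuzzy matching:
Method 1: brute-force Cartesian product plus filtering (re.search(keyword, sentence, re.IGNORECASE) matches the keyword against the sentence, ignoring case). Add a column holding the same constant to both tables and merge on it, so that every row of the left table is paired with every row of the right table; then filter to keep the rows that match. This uses a lot of memory and is not recommended for large data.
Method 2: turn the small table into a dict, match it against the large table, then explode (after matching, a row holds a list of results; explode turns such a row into one row per element), and finally merge. This suits data so large that the Cartesian product would not fit in memory.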

import re

# Turn the small table into a dict
key_word_dict = {
    row.keyword: row.keyid
    for row in df_keyword.itertuples()
}
# Add a column holding the ids of the dict keywords found in each row
def merge_func(row):
    row['keyids'] = [
        keyid
        for keyword, keyid in key_word_dict.items()
        if re.search(keyword, row['sentence'], re.IGNORECASE)
    ]
    return row
df_merge = df.apply(merge_func, axis=1)
# Explode, then merge
df_explode = df_merge.explode('keyids')
# Map the ids in the large table back to text. left_on and right_on name the matching fields of the left and right tables
df_final = pd.merge(
    left=df_explode,
    right=df_keyword,
    left_on='keyids',
    right_on='keyid'
)
