Python pandas
Keywords 1: Python data import; Python data export; Python DataFrame; Python data processing; Python data sorting; Python duplicate-data handling; Python missing-data handling
Keywords 2: Pandas data import; Pandas data export; Pandas DataFrame; Pandas data processing; Pandas data sorting; Pandas duplicate-data handling; Pandas missing-data handling; Pandas type conversion; DataFrame type conversion
import pandas as pd
import numpy as np
Importing data
CSV / TXT
pandas.read_csv( )
read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', header='infer', names=None, index_col=None, usecols=None, engine=None, converters=None, skiprows=None, nrows=None, na_values=None, na_filter=True, thousands=None,)
Parameter | Description |
---|---|
filepath_or_buffer | file path or file name |
sep | delimiter used in the file |
header | position of the header row; 0 means the first row is the header, None means the file has no header |
names | list, column names to use |
index_col | int/sequence, column(s) to use as row labels; False means do not use the first column as row labels |
usecols | columns to read |
engine | set to 'python' to avoid mojibake when the data contains Chinese text |
converters | dict, functions to convert values in specified columns |
skiprows | list, rows to skip |
nrows | number of rows to read; useful for large files |
na_values | str or list, additional strings to recognize as NA |
na_filter | bool, whether to detect missing values |
thousands | str, thousands separator |
Example
Only the file name/path is required.
pd.read_csv("csv1.csv")
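A self-contained sketch of the common parameters above, reading from an in-memory buffer instead of a file (the column names and data are invented for illustration):

```python
import io
import pandas as pd

# read_csv accepts a path or any file-like buffer, so a StringIO works too.
csv_text = "id,name,score\n1,Ann,90\n2,Bob,85\n3,Cai,78\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",                 # field delimiter
    header=0,                # first line is the header row
    usecols=["id", "name"],  # read only these columns
    nrows=2,                 # read only the first 2 data rows
)
```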
Excel
pandas.read_excel( )
read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, engine=None, converters=None, skiprows=None, nrows=None, na_values=None, thousands=None)
Parameters are largely the same as read_csv( )
Parameter | Description |
---|---|
io | Excel file name or path |
sheet_name | str, int, or list, the sheet(s) to read |
Example
Only the file name is required.
pd.read_excel("test.xlsx",sheet_name='sheet1')
JSON
pandas.read_json()
Example
json_str='{"country":"Netherlands","dma_code":"0","timezone":"Europe\/Amsterdam","area_code":"0","ip":"46.19.37.108",\
"asn":"AS196752","continent_code":"EU","isp":"Tilaa V.O.F."}'
js=pd.read_json(json_str,typ='series')
print(js)
Output:
country Netherlands
dma_code 0
timezone Europe/Amsterdam
area_code 0
ip 46.19.37.108
asn AS196752
continent_code EU
isp Tilaa V.O.F.
dtype: object
js['country']='Brazil'
js
js.to_json()
Output of js:
country Brazil
dma_code 0
timezone Europe/Amsterdam
area_code 0
ip 46.19.37.108
asn AS196752
continent_code EU
isp Tilaa V.O.F.
dtype: object
Output of js.to_json():
'{"country":"Brazil","dma_code":"0","timezone":"Europe\\/Amsterdam","area_code":"0","ip":"46.19.37.108","asn":"AS196752","continent_code":"EU","isp":"Tilaa V.O.F."}'
Exporting data
CSV
pandas.DataFrame.to_csv()
to_csv(path_or_buf,sep: str = ',',index=True, header=True, encoding='UTF8')
Parameter | Description |
---|---|
path_or_buf | file name or path; if omitted, the CSV is returned as a string |
sep | delimiter, defaults to ',' |
index | whether to write the index, default True |
header | whether to write the column names, default True |
encoding | encoding to save the file with, default UTF-8 |
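As the path_or_buf row notes, omitting the path makes to_csv() return the CSV text as a string rather than writing a file — a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# No path given: the CSV content comes back as a string.
csv_str = df.to_csv(index=False)
```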
DataFrame
Selecting data
df['field']
df[df.columns[num]]
df.field
df.loc[ ]
df.iloc[ ]
DataFrame.field_name
is equivalent to
DataFrame['field_name']
Both return a Series containing only the field_name column.
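The selection forms listed above, side by side on a small invented frame (label-based .loc vs. position-based .iloc):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]},
                  index=["r1", "r2"])

col = df["score"]            # column by label -> Series
same = df.score              # attribute access, same Series
cell = df.loc["r1", "name"]  # label-based: row "r1", column "name"
cell2 = df.iloc[0, 0]        # position-based: first row, first column
```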
Modifying
1. Assign the returned result back, e.g. df['field_name'] = <result>, to apply the change;
2. or, for methods that take an inplace parameter, set inplace=True.
Adding a column
df['new_field_name']=list
Adding a row
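There is no single one-liner for adding a row like there is for a column; two common approaches (data invented) are assigning to a new index label with .loc, or appending another frame with pd.concat:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# 1) Assign a list to a new index label.
df.loc[2] = [5, 6]

# 2) Concatenate another DataFrame, renumbering the index.
df = pd.concat([df, pd.DataFrame({"a": [7], "b": [8]})], ignore_index=True)
```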
Data processing
Sorting
pandas.DataFrame.sort_values()
pandas.DataFrame.field_name.sort_values()
sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)
Parameter | Description |
---|---|
by | str, list, Name or list of names to sort by |
axis | 0 or ‘index’, 1 or ‘columns’}, default 0. Axis to be sorted. |
ascending | bool or list of bool, default True Sort ascending vs. descending |
inplace | bool, default False. If True, perform operation in-place. |
kind | {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’. Choice of sorting algorithm. |
na_position | {‘first’, ‘last’}, default ‘last’. Puts NaNs at the beginning if first ; last puts NaNs at the end. |
ignore_index | bool, default False If True, the resulting axis will be labeled 0, 1, …, n - 1. |
Example
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
3 NaN 8 4
4 D 7 2
5 C 4 3
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
col1 col2 col3
1 A 1 1
0 A 2 0
2 B 9 9
5 C 4 3
4 D 7 2
3 NaN 8 4
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3
4 D 7 2
5 C 4 3
2 B 9 9
0 A 2 0
1 A 1 1
3 NaN 8 4
DataFrame.field_name.sort_values()  # can be called with no arguments
Duplicate data
Flagging
pandas.DataFrame.duplicated()
duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first')
Parameter | Description |
---|---|
subset | column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default use all of the columns. |
keep | {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to mark. first: mark duplicates as True except for the first occurrence. last: mark duplicates as True except for the last occurrence. False: mark all duplicates as True. |
Example
df1
df1.duplicated()  # flag whether each row is a duplicate
Output of df1:
f1 f2 f3 f4 f5
0 11 21 31 41 51
1 12 22 32 42 52
2 13 23 33 43 53
3 14 24 34 44 54
4 15 25 35 45 55
5 16 26 36 46 56
6 17 27 37 47 57
7 18 28 38 48 58
8 19 29 39 49 59
9 11 21 31 41 51
Output of df1.duplicated():
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
dtype: bool
Removing
pandas.DataFrame.drop_duplicates()
drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)
Parameter | Description |
---|---|
subset | column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default use all of the columns. |
keep | {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep. first: drop duplicates except for the first occurrence. last: drop duplicates except for the last occurrence. False: drop all duplicates. |
inplace | bool, default False. Whether to drop duplicates in place or to return a copy. |
ignore_index | bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1. |
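A small sketch of drop_duplicates() with keep and subset (data invented; row 2 repeats row 0):

```python
import pandas as pd

df = pd.DataFrame({"f1": [11, 12, 11], "f2": [21, 22, 21]})

first = df.drop_duplicates()               # keep the first occurrence
last = df.drop_duplicates(keep="last")     # keep the last occurrence
by_f1 = df.drop_duplicates(subset=["f1"])  # judge duplicates on f1 only
```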
Missing data
Filling
pandas.DataFrame.fillna(value)
pandas.Series.fillna(value)
pandas.DataFrame.field_name.fillna(value)
where value can be, for example, DataFrame.field_name.mean()
fillna(self, value=None, method=None, axis=None, inplace=False, limit=None)
Parameter | Description |
---|---|
value | scalar, dict, Series, or DataFrame Value to use to fill holes (e.g. 0) This value cannot be a list. |
method | {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use next valid observation to fill gap. |
inplace | bool, default False If True, fill in-place. |
limit | int, default None If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. |
Example
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
... [3, 4, np.nan, 1],
... [np.nan, np.nan, np.nan, 5],
... [np.nan, 3, np.nan, 4]],
... columns=list('ABCD'))
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Replace all NaN elements with 0s.
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 2.0 1
2 0.0 1.0 2.0 5
3 0.0 3.0 2.0 4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 NaN 1
2 NaN 1.0 NaN 5
3 NaN 3.0 NaN 4
Removing
pandas.DataFrame.dropna(axis=0,how='any')
dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter | Description |
---|---|
axis | {0 or ‘index’, 1 or ‘columns’}, default 0 Determine if rows or columns which contain missing values are removed. * 0, or ‘index’ : Drop rows which contain missing values. * 1, or ‘columns’ : Drop columns which contain missing value. |
how | {‘any’, ‘all’}, default ‘any’ Determine if row or column is removed from DataFrame, when we have at least one NA or all NA. * ‘any’ : If any NA values are present, drop that row or column. * ‘all’ : If all values are NA, drop that row or column. |
thresh | int, optional Require that many non-NA values. |
subset | array-like, optional Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include. |
inplace | bool, default False If True, do operation inplace and return None. |
Example
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'born'])
name toy born
1 Batman Batmobile 1940-04-25
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
Trimming leading/trailing characters
read str.strip as a single method name
pandas.Series.str.strip()
A set of characters can be passed to strip characters other than whitespace.
pandas.Series.str.lstrip()
A set of characters can be passed to strip characters other than whitespace.
pandas.Series.str.rstrip()
A set of characters can be passed to strip characters other than whitespace.
Example
>>> s = pd.Series(['1. Ant. ', '2. Bee!\n', '3. Cat?\t', np.nan])
>>> s
0 1. Ant.
1 2. Bee!\n
2 3. Cat?\t
3 NaN
dtype: object
>>> s.str.strip('123.!? \n\t')
0 Ant
1 Bee
2 Cat
3 NaN
dtype: object
Type conversion
Checking types
pandas.DataFrame.dtypes
View the types of all columns.
pandas.DataFrame.field_name.dtype
View the type of the field_name column.
Converting numbers and strings
pandas.DataFrame.field_name.astype(dtype)
dtype:
int: integer
float: floating point
str: string
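A quick astype() sketch converting a made-up string column to int and float and back:

```python
import pandas as pd

df = pd.DataFrame({"qty": ["1", "2", "3"]})  # numbers stored as strings

df["qty"] = df["qty"].astype(int)      # str -> int
df["qty_f"] = df["qty"].astype(float)  # int -> float
df["qty_s"] = df["qty"].astype(str)    # back to str
```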
Converting strings to datetimes
Datetime conversion
pandas.to_datetime()
to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, format=None, unit=None, origin='unix')
Parameter | Description |
---|---|
arg | int, float, str, datetime, list, tuple, 1-d array, Series DataFrame/dict-like The object to convert to a datetime. |
errors | {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’ - If ‘raise’, then invalid parsing will raise an exception. - If ‘coerce’, then invalid parsing will be set as NaT. - If ‘ignore’, then invalid parsing will return the input. |
dayfirst | bool, default False Specify a date parse order if arg is str or its list-likes.If True, parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10. Warning: dayfirst=True is not strict, but will prefer to parse with day first (this is a known bug, based on dateutil behavior). |
yearfirst | bool, default False Specify a date parse order if arg is str or its list-likes.- If True parses dates with the year first, eg 10/11/12 is parsed as 2010-11-12. - If both dayfirst and yearfirst are True, yearfirst is preceded (same as dateutil). Warning: yearfirst=True is not strict, but will prefer to parse with year first (this is a known bug, based on dateutil behavior). |
format | str, default None The strftime to parse time, eg “%d/%m/%Y”, note that “%f” will parse all the way up to nanoseconds. |
unit | str, default ‘ns’ The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. This will be based off the origin. Example, with unit=‘ms’ and origin=‘unix’ (the default), this would calculate the number of milliseconds to the unix epoch start |
origin | scalar, default ‘unix’ Define the reference date. The numeric values would be parsed as number of units (defined by unit ) since this reference date.- If ‘unix’ (or POSIX) time; origin is set to 1970-01-01. - If ‘julian’, unit must be ‘D’, and origin is set to beginning of Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC. - If Timestamp convertible, origin is set to Timestamp identified by origin. |
Example
pandas.to_datetime(df.field_name)
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D',
...                origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
Datetime formatting
pandas.Series.dt.strftime(format)
Converts datetime data to strings in a given format.
format: a strftime pattern such as %Y-%m-%d
'%Y%m%d %H%M%S': year-month-day hour-minute-second
Example
df.field_name.dt.strftime('%Y-%m')
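The strftime patterns above in runnable form (timestamps invented):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-03-05 08:30:00",
                              "2022-12-31 23:59:59"]))

ym = s.dt.strftime("%Y-%m")                # year-month
full = s.dt.strftime("%Y-%m-%d %H:%M:%S")  # full date and time
```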
Data extraction
also called data splitting
Field splitting
Splitting by position
pandas.Series.str.slice(start,stop)
Example
brands=df['tel'].str.slice(0,3)
area=df['tel'].str.slice(3,7)
nums=df['tel'].str.slice(7,11)
df['brands']=brands
df['area']=area
df['nums']=nums
Splitting by delimiter
pandas.Series.str.split(pat=' ',n=-1,expand=False)
Parameter | Description |
---|---|
pat | delimiter, defaults to a space ' ' |
n | split into n+1 columns; -1 returns all split columns |
expand | whether to expand into a DataFrame; defaults to False, usually set to True |
Example
newdf=df['name'].str.split(' ',n=1,expand=True)
newdf.columns=['brand','name']  # rename the columns
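The same split, made self-contained with invented names so it can run on its own:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ford Focus", "Honda Civic"]})

# Split on the first space only (n=1) and expand into a DataFrame.
newdf = df["name"].str.split(" ", n=1, expand=True)
newdf.columns = ["brand", "model"]
```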