Python之pandas

关键字1:Python导入数据;Python导出数据;Python DataFrame;Python数据处理;Python数据排序;Python重复数据处理;Python缺失数据处理;

关键字2:Pandas导入数据;Pandas导出数据;Pandas DataFrame;Pandas数据处理;Pandas数据排序;Pandas重复数据处理;Pandas缺失数据处理;Pandas数据类型转换;DataFrame数据类型转换

import pandas as pd
import numpy as np

导入数据

csv、txt
pandas.read_csv( )
read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', header='infer', names=None, index_col=None, usecols=None, engine=None, converters=None, skiprows=None, nrows=None, na_values=None, na_filter=True, thousands=None,)
参数说明
filepath_or_buffer文件路径或文件名
sep文件使用的分隔符
header指定文件中的表头位置,0表示第一行为表头,None表示文件中无表头
namelist,指定列名
index_colseq,指定行标签;False不使用第一列为行标签
usecols指定显示的列
engine指定为python,数据中有中文时不会乱码
convertersdict,在指定列转换某些值
skiprowslist,跳过某些行
nrows指定读取行数,对大文件有用
na_valuesstr,list,指定空值
na_filterbool,过滤空值
thousandsstr,Thousands separator

示例

只有文件名是必须给定的
pd.read_csv("csv1.csv")
Excel
pandas.read_excel( )
read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, engine=None, converters=None, skiprows=None, nrows=None, na_values=None, thousands=None)

参数基本同read_csv( )

参数说明
ioExcel文件名或路径
sheet_namestr, int, list,指定sheet

示例

只有文件名是必须给定的
pd.read_excel("test.xlsx",sheet_name='sheet1')
json
pandas.read_json()

示例

json_str='{"country":"Netherlands","dma_code":"0","timezone":"Europe\/Amsterdam","area_code":"0","ip":"46.19.37.108",\
"asn":"AS196752","continent_code":"EU","isp":"Tilaa V.O.F.",}'

js=pd.read_json(json_str,typ='series')
print(js)

###up
country                Netherlands
dma_code                         0
timezone          Europe/Amsterdam
area_code                        0
ip                    46.19.37.108
asn                       AS196752
continent_code                  EU
isp                   Tilaa V.O.F.
dtype: object
###down

js['country']='Brazil'
js
js.to_json()

###up
country                     Brazil
dma_code                         0
timezone          Europe/Amsterdam
area_code                        0
ip                    46.19.37.108
asn                       AS196752
continent_code                  EU
isp                   Tilaa V.O.F.
dtype: object

###mid
'{"country":"Brazil","dma_code":"0","timezone":"Europe\\/Amsterdam","area_code":"0","ip":"46.19.37.108","asn":"AS196752","continent_code":"EU","isp":"Tilaa V.O.F."}'

导出数据

csv
pandas.DataFrame.to_csv()
to_csv(path_or_buf,sep: str = ',',index=True, header=True, encoding='UTF8')
参数说明
path_or_buf文件名或文件路径,未给输出字符串
sep指定分隔符,默认为","
index是否输出索引,默认为True,即输出索引
header是否输出列名,默认为True,即输出列名
encoding文件保存的编码格式,默认为UTF8

DataFrame

取数

df[‘field’]

df.column[num]

df.field

df.loc[ ]

df.iloc[ ]

DataFrame.field_name
同
DataFrame['field_name']
类型为Series,只取field_name列
修改
1、令df['field_name']= 下面的返回结果即可完成修改;
2、有inplace参数的设为True;
增列
df['new_field_name']=list
增行

数据处理

排序
pandas.DataFrame.sort_values()
pandas.DataFrame.field_name.sort_values()
sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)
参数说明
bystr, list, Name or list of names to sort by
axis0 or ‘index’, 1 or ‘columns’}, default 0. Axis to be sorted.
ascendingbool or list of bool, default True Sort ascending vs. descending
inplacebool, default False. If True, perform operation in-place.
kind{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’. Choice of sorting algorithm.
na_position{‘first’, ‘last’}, default ‘last’. Puts NaNs at the beginning if first; last puts NaNs at the end.
ignore_indexbool, default False If True, the resulting axis will be labeled 0, 1, …, n - 1.

实例

 >>> df = pd.DataFrame({
    ...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    ...     'col2': [2, 1, 9, 8, 7, 4],
    ...     'col3': [0, 1, 9, 4, 2, 3],
    ... })
    >>> df
        col1 col2 col3
    0   A    2    0
    1   A    1    1
    2   B    9    9
    3   NaN  8    4
    4   D    7    2
    5   C    4    3
    
    Sort by multiple columns
    
    >>> df.sort_values(by=['col1', 'col2'])
        col1 col2 col3
    1   A    1    1
    0   A    2    0
    2   B    9    9
    5   C    4    3
    4   D    7    2
    3   NaN  8    4
    
    Sort Descending
    
    >>> df.sort_values(by='col1', ascending=False)
        col1 col2 col3
    4   D    7    2
    5   C    4    3
    2   B    9    9
    0   A    2    0
    1   A    1    1
    3   NaN  8    4
    
DataFrame.field_name.sort_values()#可不带参数
重复数据
标记
pandas.DataFrame.duplicated()
duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first')
参数说明
subsetcolumn label or sequence of labels, optional. Only consider certain columns for identifying duplicates, by default use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’ Determines which duplicates (if any) to mark.
- first : Mark duplicates as True except for the first occurrence.
- last : Mark duplicates as True except for the last occurrence.
- False : Mark all duplicates as True.
inplacebool, default False
Whether to drop duplicates in place or to return a copy.
ignore_indexbool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.

示例

df1
df1.duplicated() #标记是否为重复行

###up
	f1	f2	f3	f4	f5
0	11	21	31	41	51
1	12	22	32	42	52
2	13	23	33	43	53
3	14	24	34	44	54
4	15	25	35	45	55
5	16	26	36	46	56
6	17	27	37	47	57
7	18	28	38	48	58
8	19	29	39	49	59
9	11	21	31	41	51

###mid
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9     True
dtype: bool
删除
pandas.DataFrame.drop_duplicates()
drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)
参数说明
subsetcolumn label or sequence of labels, optional. Only consider certain columns for identifying duplicates, by default use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
缺失数据
补齐
pandas.DataFrame.fillna(value)
pandas.Series.fillna(value)
pandas.DataFrame.field_name.fillna(value)
其中value可为DataFrame.field_name.mean()
fillna(self, value=None, method=None, axis=None, inplace=False, limit=None)
参数说明
valuescalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0)
This value cannot be a list.
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use next valid observation to fill gap.
inplacebool, default False
If True, fill in-place.
limitint, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill.

示例

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
    ...                    [3, 4, np.nan, 1],
    ...                    [np.nan, np.nan, np.nan, 5],
    ...                    [np.nan, 3, np.nan, 4]],
    ...                   columns=list('ABCD'))
    >>> df
         A    B   C  D
    0  NaN  2.0 NaN  0
    1  3.0  4.0 NaN  1
    2  NaN  NaN NaN  5
    3  NaN  3.0 NaN  4
    
    Replace all NaN elements with 0s.
    
    >>> df.fillna(0)
        A   B   C   D
    0   0.0 2.0 0.0 0
    1   3.0 4.0 0.0 1
    2   0.0 0.0 0.0 5
    3   0.0 3.0 0.0 4
    
    We can also propagate non-null values forward or backward.
    
    >>> df.fillna(method='ffill')
        A   B   C   D
    0   NaN 2.0 NaN 0
    1   3.0 4.0 NaN 1
    2   3.0 4.0 NaN 5
    3   3.0 3.0 NaN 4
    
    Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
    2, and 3 respectively.
    
    >>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
    >>> df.fillna(value=values)
        A   B   C   D
    0   0.0 2.0 2.0 0
    1   3.0 4.0 2.0 1
    2   0.0 1.0 2.0 5
    3   0.0 3.0 2.0 4
    
    Only replace the first NaN element.
    
    >>> df.fillna(value=values, limit=1)
        A   B   C   D
    0   0.0 2.0 2.0 0
    1   3.0 4.0 NaN 1
    2   NaN 1.0 NaN 5
    3   NaN 3.0 NaN 4
删除
pandas.DataFrame.dropna(axis=0,how='any')
dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
参数说明
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
* 0, or ‘index’ : Drop rows which contain missing values.
* 1, or ‘columns’ : Drop columns which contain missing value.
how{‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
* ‘any’ : If any NA values are present, drop that row or column.
* ‘all’ : If all values are NA, drop that row or column.
threshint, optional
Require that many non-NA values.
subsetarray-like, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplacebool, default False
If True, do operation inplace and return None.

示例

    >>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
    ...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
    ...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
    ...                             pd.NaT]})
    >>> df
           name        toy       born
    0    Alfred        NaN        NaT
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Drop the rows where at least one element is missing.
    
    >>> df.dropna()
         name        toy       born
    1  Batman  Batmobile 1940-04-25
    
    Drop the columns where at least one element is missing.
    
    >>> df.dropna(axis='columns')
           name
    0    Alfred
    1    Batman
    2  Catwoman
    
    Drop the rows where all elements are missing.
    
    >>> df.dropna(how='all')
           name        toy       born
    0    Alfred        NaN        NaT
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Keep only the rows with at least 2 non-NA values.
    
    >>> df.dropna(thresh=2)
           name        toy       born
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Define in which columns to look for missing values.
    
    >>> df.dropna(subset=['name', 'born'])
           name        toy       born
    1    Batman  Batmobile 1940-04-25
    
    Keep the DataFrame with valid entries in the same variable.
    
    >>> df.dropna(inplace=True)
    >>> df
         name        toy       born
    1  Batman  Batmobile 1940-04-25
数据头尾
str.strip看成一个函数
pandas.Series.str.strip()
可以指定 value 值来删除非空格字符
pandas.Series.str.lstrip()
可以指定 value 值来删除非空格字符
pandas.Series.str.rstrip()
可以指定 value 值来删除非空格字符

示例

    >>> s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', np.nan])
    >>> s
    0    1. Ant.
    1    2. Bee!\n
    2    3. Cat?\t
    3          NaN
    dtype: object
    
    >>> s.str.strip('123.!? \n\t')
    0    Ant
    1    Bee
    2    Cat
    3    NaN
    dtype: object    
类型转换
查看类型
pandas.DataFrame.dtype
查看所有列的类型
pandas.DataFrame.field_name.dtype
查看field_name列的类型
转数值字符
pandas.DataFrame.field_name.astype(dtype)

dtype:
int:整数
float:小数
str:字符型
字符转时间
时间转换
pandas.to_datetime()
to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, format=None, unit=None, origin='unix')
参数说明
argint, float, str, datetime, list, tuple, 1-d array, Series DataFrame/dict-like
The object to convert to a datetime.
errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
- If ‘raise’, then invalid parsing will raise an exception.
- If ‘coerce’, then invalid parsing will be set as NaT.
- If ‘ignore’, then invalid parsing will return the input.
dayfirstbool, default False
Specify a date parse order if arg is str or its list-likes.
If True, parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10.
Warning: dayfirst=True is not strict, but will prefer to parse with day first (this is a known bug, based on dateutil behavior).
yearfirstbool, default False
Specify a date parse order if arg is str or its list-likes.
- If True parses dates with the year first, eg 10/11/12 is parsed as 2010-11-12.
- If both dayfirst and yearfirst are True, yearfirst is preceded (same as dateutil). Warning: yearfirst=True is not strict, but will prefer to parse with year first (this is a known bug, based on dateutil behavior).
formatstr, default None The strftime to parse time, eg “%d/%m/%Y”, note that “%f” will parse all the way up to nanoseconds.
unitstr, default ‘ns’
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. This will be based off the origin. Example, with unit=‘ms’ and origin=‘unix’ (the default), this would calculate the number of milliseconds to the unix epoch start
originscalar, default ‘unix’
Define the reference date. The numeric values would be parsed as number of units (defined by unit) since this reference date.
- If ‘unix’ (or POSIX) time; origin is set to 1970-01-01.
- If ‘julian’, unit must be ‘D’, and origin is set to beginning of Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC.
- If Timestamp convertible, origin is set to Timestamp identified by origin.

示例

pandas.to_datatime(df.field_name)
    >>> df = pd.DataFrame({'year': [2015, 2016],
    ...                    'month': [2, 3],
    ...                    'day': [4, 5]})
    >>> pd.to_datetime(df)
    0   2015-02-04
    1   2016-03-05
    dtype: datetime64[ns]
    
    >>> pd.to_datetime(1490195805, unit='s')
    Timestamp('2017-03-22 15:16:45')
    >>> pd.to_datetime(1490195805433502912, unit='ns')
    Timestamp('2017-03-22 15:16:45.433502912')
    
    Using a non-unix epoch origin
    
    >>> pd.to_datetime([1, 2, 3], unit='D',
    ...origin=pd.Timestamp('1960-01-01'))
    DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
时间格式化
pandas.Series.dt.strftime(format)
将时间数据按特定格式转为字符串数据
format 时间格式如:Y%-m%-d%

'Y%m%d% H%M%S%' 年月日 时分秒

示例

df.field_name.dt.strftime('Y%-m%')
数据抽取

也成数据拆分

字段拆分
位置拆分
pandas.Series.str.slice(start,stop)

示例

bans=df['tel'].str.slice(0,3)
area=df['tel'].str.slice(3,7)
nums=df['tel'].str.slice(7,11)
data['brands']=brands
data['area']=area
data['nums']=nums
字符拆分
pandas.Series.str.split(pat=' ',n=-1,expand=False)
参数说明
pat分隔符,默认为空格’ ’
n分割为n+1列,-1表返回所有分割列
expand是否展开为DataFrame,默认为False,一般设为True

示例

newdf=df['name'].str.split(' ',1,True)
newdf.columns=['brand','name']#重命名列名
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值