Python pandas
Keywords 1: Python data import; Python data export; Python DataFrame; Python data processing; Python data sorting; Python duplicate-data handling; Python missing-data handling
Keywords 2: Pandas data import; Pandas data export; Pandas DataFrame; Pandas data processing; Pandas data sorting; Pandas duplicate-data handling; Pandas missing-data handling; Pandas type conversion; DataFrame type conversion
import pandas as pd
import numpy as np
Importing data
CSV / TXT
pandas.read_csv( )
read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', header='infer', names=None, index_col=None, usecols=None, engine=None, converters=None, skiprows=None, nrows=None, na_values=None, na_filter=True, thousands=None,)
Parameter | Description |
---|---|
filepath_or_buffer | file path or file name |
sep | delimiter used in the file |
header | position of the header row; 0 means the first row is the header, None means the file has no header |
names | list, column names to use |
index_col | int/sequence, column(s) to use as row labels; False means do not use the first column as row labels |
usecols | columns to read |
engine | set to 'python' to avoid mojibake when the data contains Chinese text |
converters | dict, functions to convert values in specified columns |
skiprows | list, rows to skip |
nrows | number of rows to read; useful for large files |
na_values | str or list, additional strings to recognize as NA |
na_filter | bool, whether to detect missing values |
thousands | str, thousands separator |
Example
Only the file name/path is required.
pd.read_csv("csv1.csv")
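A self-contained sketch of the common parameters above, reading from an in-memory buffer instead of a file (the column names and data are invented for illustration):

```python
import io
import pandas as pd

# read_csv accepts a path or any file-like buffer, so a StringIO works too.
csv_text = "id,name,score\n1,Ann,90\n2,Bob,85\n3,Cai,78\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",                 # field delimiter
    header=0,                # first line is the header row
    usecols=["id", "name"],  # read only these columns
    nrows=2,                 # read only the first 2 data rows
)
```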
Excel
pandas.read_excel( )
read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, engine=None, converters=None, skiprows=None, nrows=None, na_values=None, thousands=None)
Parameters are largely the same as read_csv( )
Parameter | Description |
---|---|
io | Excel file name or path |
sheet_name | str, int, or list, the sheet(s) to read |
Example
Only the file name is required.
pd.read_excel("test.xlsx",sheet_name='sheet1')
JSON
pandas.read_json()
Example
json_str='{"country":"Netherlands","dma_code":"0","timezone":"Europe\/Amsterdam","area_code":"0","ip":"46.19.37.108",\
"asn":"AS196752","continent_code":"EU","isp":"Tilaa V.O.F."}'
js=pd.read_json(json_str,typ='series')
print(js)
Output:
country Netherlands
dma_code 0
timezone Europe/Amsterdam
area_code 0
ip 46.19.37.108
asn AS196752
continent_code EU
isp Tilaa V.O.F.
dtype: object
js['country']='Brazil'
js
js.to_json()
Output of js:
country Brazil
dma_code 0
timezone Europe/Amsterdam
area_code 0
ip 46.19.37.108
asn AS196752
continent_code EU
isp Tilaa V.O.F.
dtype: object
Output of js.to_json():
'{"country":"Brazil","dma_code":"0","timezone":"Europe\\/Amsterdam","area_code":"0","ip":"46.19.37.108","asn":"AS196752","continent_code":"EU","isp":"Tilaa V.O.F."}'
Exporting data
CSV
pandas.DataFrame.to_csv()
to_csv(path_or_buf,sep: str = ',',index=True, header=True, encoding='UTF8')
Parameter | Description |
---|---|
path_or_buf | file name or path; if omitted, the CSV is returned as a string |
sep | delimiter, defaults to ',' |
index | whether to write the index, default True |
header | whether to write the column names, default True |
encoding | encoding to save the file with, default UTF-8 |
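As the path_or_buf row notes, omitting the path makes to_csv() return the CSV text as a string rather than writing a file — a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# No path given: the CSV content comes back as a string.
csv_str = df.to_csv(index=False)
```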
DataFrame
Selecting data
df['field']
df[df.columns[num]]
df.field
df.loc[ ]
df.iloc[ ]
DataFrame.field_name
is equivalent to
DataFrame['field_name']
Both return a Series containing only the field_name column.
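The selection forms listed above, side by side on a small invented frame (label-based .loc vs. position-based .iloc):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]},
                  index=["r1", "r2"])

col = df["score"]            # column by label -> Series
same = df.score              # attribute access, same Series
cell = df.loc["r1", "name"]  # label-based: row "r1", column "name"
cell2 = df.iloc[0, 0]        # position-based: first row, first column
```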
Modifying
1. Assign the returned result back, e.g. df['field_name'] = <result>, to apply the change;
2. or, for methods that take an inplace parameter, set inplace=True.
Adding a column
df['new_field_name']=list
Adding a row
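There is no single one-liner for adding a row like there is for a column; two common approaches (data invented) are assigning to a new index label with .loc, or appending another frame with pd.concat:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# 1) Assign a list to a new index label.
df.loc[2] = [5, 6]

# 2) Concatenate another DataFrame, renumbering the index.
df = pd.concat([df, pd.DataFrame({"a": [7], "b": [8]})], ignore_index=True)
```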
Data processing
Sorting
pandas.DataFrame.sort_values()
pandas.DataFrame.field_name.sort_values()
sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)
Parameter | Description |
---|---|
by | str, list, Name or list of names to sort by |
axis | 0 or ‘index’, 1 or ‘columns’}, default 0. Axis to be sorted. |
ascending | bool or list of bool, default True Sort ascending vs. descending |
inplace | bool, default False. If True, perform operation in-place. |
kind | {‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’. Choice of sorting algorithm. |
na_position | {‘first’, ‘last’}, default ‘last’. Puts NaNs at the beginning if first ; last puts NaNs at the end. |
ignore_index | bool, default False If True, the resulting axis will be labeled 0, 1, …, n - 1. |
Example
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
col1 col2 col3
0 A 2 0
1 A 1 1
2 B 9 9
3 NaN 8 4
4 D 7 2
5 C 4 3
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
col1 col2 col3
1 A 1 1
0 A 2 0
2 B 9 9
5 C 4 3
4 D 7 2
3 NaN 8 4
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3
4 D 7 2
5 C 4 3
2 B 9 9
0 A 2 0
1 A 1 1
3 NaN 8 4
DataFrame.field_name.sort_values()  # can be called with no arguments
Duplicate data
Flagging
pandas.DataFrame.duplicated()
duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first')
Parameter | Description |
---|---|
subset | column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default use all of the columns. |
keep | {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to mark. first: mark duplicates as True except for the first occurrence. last: mark duplicates as True except for the last occurrence. False: mark all duplicates as True. |
Example
df1
df1.duplicated()  # flag whether each row is a duplicate
Output of df1:
f1 f2 f3 f4 f5
0 11 21 31 41 51
1 12 22 32 42 52
2 13 23 33 43 53
3 14 24 34 44 54
4 15 25 35 45 55
5 16 26 36 46 56
6 17 27 37 47 57
7 18 28 38 48 58
8 19 29 39 49 59
9 11 21 31 41 51
Output of df1.duplicated():
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
dtype: bool
Removing
pandas.DataFrame.drop_duplicates()
drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False)
Parameter | Description |
---|---|
subset | column label or sequence of labels, optional. Only consider certain columns for identifying duplicates; by default use all of the columns. |
keep | {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep. first: drop duplicates except for the first occurrence. last: drop duplicates except for the last occurrence. False: drop all duplicates. |
inplace | bool, default False. Whether to drop duplicates in place or to return a copy. |
ignore_index | bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1. |
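A small sketch of drop_duplicates() with keep and subset (data invented; row 2 repeats row 0):

```python
import pandas as pd

df = pd.DataFrame({"f1": [11, 12, 11], "f2": [21, 22, 21]})

first = df.drop_duplicates()               # keep the first occurrence
last = df.drop_duplicates(keep="last")     # keep the last occurrence
by_f1 = df.drop_duplicates(subset=["f1"])  # judge duplicates on f1 only
```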
Missing data
Filling
pandas.DataFrame.fillna(value)
pandas.Series.fillna(value)
pandas.DataFrame.field_name.fillna(value)
where value can be, for example, DataFrame.field_name.mean()
fillna(self, value=None, method=None, axis=None, inplace=False, limit=None)
Parameter | Description |
---|---|
value | scalar, dict, Series, or DataFrame Value to use to fill holes (e.g. 0) This value cannot be a list. |
method | {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use next valid observation to fill gap. |
inplace | bool, default False If True, fill in-place. |
limit | int, default None If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. |
Example
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
... [3, 4, np.nan, 1],
... [np.nan, np.nan, np.nan, 5],
... [np.nan, 3, np.nan, 4]],
... columns=list('ABCD'))
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Replace all NaN elements with 0s.
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 2.0 1
2 0.0 1.0 2.0 5
3 0.0 3.0 2.0 4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 NaN 1
2 NaN 1.0 NaN 5
3 NaN 3.0 NaN 4
Removing
pandas.DataFrame.dropna(axis=0,how='any')
dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameter | Description |
---|---|
axis | {0 or ‘index’, 1 or ‘columns’}, default 0 Determine if rows or columns which contain missing values are removed. * 0, or ‘index’ : Drop rows which contain missing values. * 1, or ‘columns’ : Drop columns which contain missing value. |
how | {‘any’, ‘all’}, default ‘any’ Determine if row or column is removed from DataFrame, when we have at least one NA or all NA. * ‘any’ : If any NA values are present, drop that row or column. * ‘all’ : If all values are NA, drop that row or column. |
thresh | int, optional Require that many non-NA values. |
subset | array-like, optional Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include. |
inplace | bool, default False If True, do operation inplace and return None. |
Example
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'born'])
name toy born
1 Batman Batmobile 1940-04-25
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
Trimming leading/trailing characters
read str.strip as a single method name
pandas.Series.str.strip()
A set of characters can be passed to strip characters other than whitespace.
pandas.Series.str.lstrip()
A set of characters can be passed to strip characters other than whitespace.
pandas.Series.str.rstrip()
A set of characters can be passed to strip characters other than whitespace.
Example
>>> s = pd.Series(['1. Ant. ', '2. Bee!\n', '3. Cat?\t', np.nan])
>>> s
0 1. Ant.
1 2. Bee!\n
2 3. Cat?\t
3 NaN
dtype: object
>>> s.str.strip('123.!? \n\t')
0 Ant
1 Bee
2 Cat
3 NaN
dtype: object
Type conversion
Checking types
pandas.DataFrame.dtypes
View the types of all columns.
pandas.DataFrame.field_name.dtype
View the type of the field_name column.
Converting numbers and strings
pandas.DataFrame.field_name.astype(dtype)
dtype:
int: integer
float: floating point
str: string
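A quick astype() sketch converting a made-up string column to int and float and back:

```python
import pandas as pd

df = pd.DataFrame({"qty": ["1", "2", "3"]})  # numbers stored as strings

df["qty"] = df["qty"].astype(int)      # str -> int
df["qty_f"] = df["qty"].astype(float)  # int -> float
df["qty_s"] = df["qty"].astype(str)    # back to str
```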
Converting strings to datetimes
Datetime conversion
pandas.to_datetime()
to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, format=None, unit=None, origin='unix')
Parameter | Description |
---|---|
arg | int, float, str, datetime, list, tuple, 1-d array, Series DataFrame/dict-like The object to convert to a datetime. |
errors | {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’ - If ‘raise’, then invalid parsing will raise an exception. - If ‘coerce’, then invalid parsing will be set as NaT. - If ‘ignore’, then invalid parsing will return the input. |
dayfirst | bool, default False Specify a date parse order if arg is str or its list-likes.If True, parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10. Warning: dayfirst=True is not strict, but will prefer to parse with day first (this is a known bug, based on dateutil behavior). |
yearfirst | bool, default False Specify a date parse order if arg is str or its list-likes.- If True parses dates with the year first, eg 10/11/12 is parsed as 2010-11-12. - If both dayfirst and yearfirst are True, yearfirst is preceded (same as dateutil). Warning: yearfirst=True is not strict, but will prefer to parse with year first (this is a known bug, based on dateutil behavior). |
format | str, default None The strftime to parse time, eg “%d/%m/%Y”, note that “%f” will parse all the way up to nanoseconds. |
unit | str, default ‘ns’ The unit of the arg (D,s,ms,us,ns) denote the unit, which is an integer or float number. This will be based off the origin. Example, with unit=‘ms’ and origin=‘unix’ (the default), this would calculate the number of milliseconds to the unix epoch start |
origin | scalar, default ‘unix’ Define the reference date. The numeric values would be parsed as number of units (defined by unit ) since this reference date.- If ‘unix’ (or POSIX) time; origin is set to 1970-01-01. - If ‘julian’, unit must be ‘D’, and origin is set to beginning of Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC. - If Timestamp convertible, origin is set to Timestamp identified by origin. |
Example
pandas.to_datetime(df.field_name)
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D',
...                origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
Datetime formatting
pandas.Series.dt.strftime(format)
Converts datetime data to strings in a given format.
format: a strftime pattern such as %Y-%m-%d
'%Y%m%d %H%M%S': year-month-day hour-minute-second
Example
df.field_name.dt.strftime('%Y-%m')
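The strftime patterns above in runnable form (timestamps invented):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-03-05 08:30:00",
                              "2022-12-31 23:59:59"]))

ym = s.dt.strftime("%Y-%m")                # year-month
full = s.dt.strftime("%Y-%m-%d %H:%M:%S")  # full date and time
```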
Data extraction
also called data splitting
Field splitting
Splitting by position
pandas.Series.str.slice(start,stop)
Example
brands=df['tel'].str.slice(0,3)
area=df['tel'].str.slice(3,7)
nums=df['tel'].str.slice(7,11)
df['brands']=brands
df['area']=area
df['nums']=nums
Splitting by delimiter
pandas.Series.str.split(pat=' ',n=-1,expand=False)
Parameter | Description |
---|---|
pat | delimiter, defaults to a space ' ' |
n | split into n+1 columns; -1 returns all split columns |
expand | whether to expand into a DataFrame; defaults to False, usually set to True |
Example
newdf=df['name'].str.split(' ',n=1,expand=True)
newdf.columns=['brand','name']  # rename the columns
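The same split, made self-contained with invented names so it can run on its own:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ford Focus", "Honda Civic"]})

# Split on the first space only (n=1) and expand into a DataFrame.
newdf = df["name"].str.split(" ", n=1, expand=True)
newdf.columns = ["brand", "model"]
```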