A Simple pandas Tutorial

星火流明

Last modified: 2023-06-03 12:01:57


First published: 2023-05-27 13:01:17

Article link: https://blog.csdn.net/A2233776/article/details/130900054


Pandas Tools

xlrd
openpyxl
tabulate
pandas_profiling

File reading and writing functions

Read function

pandas can read many kinds of files, such as CSV, Excel, and plain-text files.

import pandas as pd
df_csv = pd.read_csv('file.csv')
df_table = pd.read_table('file.txt') 
df_excel = pd.read_excel('file.xlsx')

Shared parameters

header selects the row that holds the column names. The default ("infer" in read_csv and read_table) takes them from the first line; header=None denotes that the first line is not column names.
index_col sets one or more columns as the row index.
usecols is the subset of columns to read.
parse_dates lists the columns to parse as datetimes.
nrows is the number of rows to read.
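As a rough illustration of the shared parameters, the sketch below reads a small in-memory CSV. The column names and values are made up for the example, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up sample data; io.StringIO stands in for a real file path.
csv_text = (
    "id,date,city,price\n"
    "1,2023-01-01,Paris,10.5\n"
    "2,2023-01-02,Rome,8.0\n"
    "3,2023-01-03,Oslo,12.25\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",                   # use the id column as the row index
    usecols=["id", "date", "price"],  # read only these columns
    parse_dates=["date"],             # parse this column as datetimes
    nrows=2,                          # read only the first two data rows
)

# header=None tells pandas that the first line is data, not column names,
# so the columns get integer labels 0, 1, ...
no_header = pd.read_csv(io.StringIO("1,2\n3,4\n"), header=None)
```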

Function-specific parameters

`pd.read_table()`

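sep is the separator (a tab by default in read_table); it can be a regular expression.
When sep is a regular expression (longer than one character and not '\s+'), we should pass engine='python' at the same time.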
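A minimal sketch of a regular-expression separator, assuming a made-up text in which fields are split by " |||| ". The pipe characters are escaped because sep is treated as a regex here.

```python
import io
import pandas as pd

# Made-up sample text; io.StringIO stands in for 'file.txt'.
raw = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# A multi-character separator is interpreted as a regular expression,
# which only the Python parsing engine supports.
df = pd.read_table(io.StringIO(raw), sep=r" \|\|\|\| ", engine="python")
```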

Writing function

df.to_csv('file.csv', sep='\t')
df.to_excel('file.xlsx')
df.to_markdown('file.md')
df.to_latex()

index=False omits the row index from the output.
The function df.to_csv() has a sep parameter to select the separator.
The function df.to_markdown() requires the tabulate package. df.to_latex() does not need tabulate; in pandas 2.0 and later it relies on jinja2 instead.
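The effect of sep and index can be seen without touching the disk, since to_csv returns the text when no path is given. A small sketch with a made-up frame:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With no path, to_csv returns the text instead of writing a file.
tsv_text = df.to_csv(sep="\t", index=False)  # tab-separated, no index column
csv_with_index = df.to_csv()                 # default: comma-separated, index kept

# Reading the text back recovers the same table.
round_trip = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```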

Series and DataFrame

Series

pd.Series is a one-dimensional data structure consisting of four parts:

data, the values of the series
index, the index (with a name or without)
dtype, the data type
name, the name of the series
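The four parts map directly onto the constructor arguments. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series(
    data=[100, 200, 300],                           # the values
    index=pd.Index(["a", "b", "c"], name="key"),    # the index, here with a name
    dtype="float64",                                # the data type
    name="score",                                   # the name of the series
)

# Each part can be read back as an attribute.
parts = (s.values, s.index, s.dtype, s.name)
```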

DataFrame

A DataFrame is a two-dimensional table.
On top of the row index it shares with Series, a DataFrame adds a column index, i.e. columns.

Common Attributes

df.values # return all values as an array
df.index # return the row index
df.columns # return the column index
df.dtypes # return the dtype of each column
df.shape # return the shape (rows, columns)
df.T # return the transpose of df
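A quick sketch of these attributes on a made-up two-by-two frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}, index=["x", "y"])

values = df.values        # 2-D NumPy array of the data
row_labels = df.index     # the row index: x, y
col_labels = df.columns   # the column index: a, b
col_types = df.dtypes     # Series mapping column name to dtype
shape = df.shape          # (number of rows, number of columns)
transposed = df.T         # rows and columns swapped
```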

Common function

df.set_index(cols)
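For instance, set_index moves a column into the row index and returns a new frame. A small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "v": [10, 20]})

# The id column becomes the row index; df itself is left unchanged.
indexed = df.set_index("id")
value_for_id_2 = indexed.loc[2, "v"]
```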

Aggregation function

Sampling

df.head(n) # take the first n rows
df.tail(n) # take the last n rows

Global features

df.info() # information overview
df.describe() # main statistics

For a more detailed report, we need the package pandas-profiling (since renamed ydata-profiling).
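The built-in sampling and overview calls can be tried on a small made-up frame; this sketch uses only pandas itself, not the profiling package:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": list("abcde")})

first_two = df.head(2)    # first two rows
last_two = df.tail(2)     # last two rows
summary = df.describe()   # count, mean, std, min, quartiles, max (numeric columns by default)
df.info()                 # prints column names, non-null counts, and dtypes
```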

Statistics function

sum, mean
median, quantile
var, std
max, min, idxmax, idxmin
count

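Each of these aggregates along the axis parameter: axis=0 (the default) gives one result per column, axis=1 gives one result per row.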
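A short sketch of the axis behaviour on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [6, 5, 4]})

col_sum = df.sum()            # axis=0: one value per column
row_sum = df.sum(axis=1)      # axis=1: one value per row
col_q = df.quantile(0.25)     # first quartile of each column
col_var = df.var()            # sample variance of each column
col_idxmax = df.idxmax()      # row label of the maximum in each column
col_idxmin = df.idxmin()      # row label of the minimum in each column
col_count = df.count()        # number of non-missing values per column
```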

Unique function

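df[col].unique() # unique values (a Series method, so select a single column)
df[cols].nunique() # number of unique values
df[cols].value_counts() # how often each value occurs
df[cols].drop_duplicates(keep='first'|'last'|False) # remove duplicates
df[cols].duplicated(keep='first'|'last'|False) # bool sequence, True marks a duplicate

keep='first' or keep='last' keeps one occurrence of each duplicated value; keep=False treats every occurrence as a duplicate.
drop_duplicates removes the elements that duplicated marks as True.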
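The relationship between duplicated and drop_duplicates is easiest to see on a small Series. A sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "b"])

uniques = s.unique()                        # array of distinct values, in order of appearance
n_unique = s.nunique()                      # number of distinct values
counts = s.value_counts()                   # occurrences of each value

dup_first = s.duplicated(keep="first")      # True for every repeat after the first occurrence
kept_last = s.drop_duplicates(keep="last")  # keep only the last occurrence of each value
kept_none = s.drop_duplicates(keep=False)   # drop every value that occurs more than once

# drop_duplicates is equivalent to filtering with the negated duplicated mask.
same_as_mask = s[~s.duplicated(keep="last")]
```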

Replace function

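s.replace(to_replace: list, value: list) # replace each listed value with its counterpart
s.replace(d: dict) # keys are the values to replace, dict values are the replacements

s.where(cond: bool series, other) # keep values where cond is True, replace the rest with other
s.mask(cond: bool series, other) # replace values where cond is True with other

s.round() # round to the nearest integer, or to a given number of decimals
s.abs() # absolute value
s.clip(lower: num, upper: num) # values below lower become lower, values above upper become upper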
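A minimal sketch of the replacement functions, using made-up values:

```python
import pandas as pd

colors = pd.Series(["red", "green", "red"])
by_list = colors.replace(["red"], ["R"])     # list form: red -> R
by_dict = colors.replace({"green": "G"})     # dict form: green -> G

s = pd.Series([-2.7, 0.4, 3.6])
kept_positive = s.where(s > 0, 0.0)   # values where the condition is False become 0.0
zeroed_positive = s.mask(s > 0, 0.0)  # values where the condition is True become 0.0

rounded = s.round()                   # nearest integer, still stored as floats
absolute = s.abs()                    # absolute values
clipped = s.clip(-1, 1)               # limit every value to the range [-1, 1]
```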

Sort function

df.sort_index(level=levels, ascending: bool | list[bool]) # sort by the row index; level applies to a MultiIndex
df.sort_values(cols, ascending: bool | list[bool]) # sort by the values in cols
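A sketch of both sorts on a made-up frame, including a per-column ascending list:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["b", "a", "c"], "score": [2, 2, 1]},
    index=[2, 0, 1],
)

by_index = df.sort_index()                    # rows ordered by index label
by_index_desc = df.sort_index(ascending=False)

# Sort by score descending, then by name ascending to break ties.
by_values = df.sort_values(["score", "name"], ascending=[False, True])
```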

`apply` function

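df.apply(func)
df.apply(lambda col: ...)

input: every column one by one, as a pd.Series (axis=0, the default); axis=1 passes every row instead
output: a Series with one value per column when func returns a scalar, or a DataFrame when func returns a Series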
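A small sketch of the two output shapes, with made-up data and lambdas:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# func returns a scalar per column: the result is a Series indexed by column name.
col_range = df.apply(lambda col: col.max() - col.min())

# func returns a Series per column: the result is a DataFrame of the same shape.
centered = df.apply(lambda col: col - col.mean())

# axis=1 passes each row instead of each column.
row_total = df.apply(lambda row: row["a"] + row["b"], axis=1)
```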

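Window function

s.rolling(window: int) # sliding window of fixed length
s.expanding() # window growing from the first element to the current one
s.ewm(alpha: num) # exponentially weighted window; alpha is the smoothing coefficient

Each call returns a window object; chain an aggregation such as .mean() or .sum() to get a result.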
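A minimal sketch of the three window types on a made-up Series. adjust=False is chosen here only so that the exponentially weighted mean follows the simple recursion y[t] = (1 - alpha) * y[t-1] + alpha * x[t]; the pandas default is adjust=True.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Mean over a sliding window of 2 elements; the first result is NaN
# because the window is not yet full.
rolled = s.rolling(window=2).mean()

# Cumulative sum: the window grows from the first element onward.
expanded = s.expanding().sum()

# Exponentially weighted mean with smoothing coefficient alpha.
smoothed = s.ewm(alpha=0.5, adjust=False).mean()
```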