Pandas Basics

迷糊小财迷

Category: pandas  Tags: python

Article link: https://blog.csdn.net/weixin_41660160/article/details/105644106

This article collects my notes from learning pandas.

Importing pandas

# Import pandas
import pandas as pd
# Import numpy
import numpy as np

# Check the pandas version
pd.__version__

Note: these notes use pandas 1.0.3. If you have an older version, you can install this exact version with pip install --upgrade pandas==1.0.3.

Reading and writing files

Reading

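# CSV
df = pd.read_csv('data/table.csv')
# Plain text files, or the .log / acc logs common in day-to-day operations work
df_txt = pd.read_table('data/table.txt')
df_txt = pd.read_table('data/acc.log', sep=r'\s+')  # sep sets the delimiter; a raw string keeps the regex backslash intact
# xls or xlsx files
df_excel = pd.read_excel('data/table.xlsx')
# All of the paths above are relative; absolute paths work as well. A Windows example:
df_win = pd.read_excel('c:\\users\\downloads\\data\\table.xlsx')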
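To make the sep=r'\s+' case concrete without needing a file on disk, here is a minimal sketch that parses a made-up log excerpt from memory. The log lines and the column names are assumptions for illustration only, not the format of any real acc.log.

```python
import io
import pandas as pd

# Hypothetical log excerpt; with a real file you would pass its path instead.
log_text = """2020-04-20 10:00:01 GET /index.html 200
2020-04-20 10:00:02   POST /login 302
"""

# r"\s+" treats any run of whitespace as one delimiter.
# header=None plus names= supplies labels because the sample has no header row.
df_log = pd.read_table(
    io.StringIO(log_text),
    sep=r"\s+",
    header=None,
    names=["date", "time", "method", "path", "status"],
)
print(df_log)
```

The second line deliberately contains several spaces in a row, which a single-space separator would split into empty fields.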

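# Reading from SQL
# sqlite3 example
import sqlite3
conn = sqlite3.connect('test.db')  # relative path: the file is created in the directory Jupyter was started from
# Create a table
conn.execute('create table person(id varchar(8) primary key,name varchar(8))')
conn.commit()
# Insert data (double quotes outside, so the SQL string literals can use single quotes)
conn.execute("insert into person values('1','jerry'),('2','tom')")
conn.commit()
# Read the query result into a DataFrame
sql_ln = 'select id,name from person'
pd.read_sql(sql_ln,conn)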

Writing

# CSV
df.to_csv('data/new_index_table.csv')
df.to_csv('data/new_noindex_table.csv',index=False)  # do not write the index

# xlsx
df.to_excel('data/new_table2.xlsx', sheet_name='Sheet1')
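The effect of index=False is easiest to see by reading the output back. The sketch below keeps everything in memory (to_csv returns the CSV text when no path is given); the small frame df_demo is made up for the illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

with_index = df_demo.to_csv()                # no path given: returns the CSV text
without_index = df_demo.to_csv(index=False)

# The saved index comes back as an extra "Unnamed: 0" column
# unless index_col=0 tells read_csv to use it as the index again.
back_default = pd.read_csv(io.StringIO(with_index))
back_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)
back_clean = pd.read_csv(io.StringIO(without_index))

print(back_default.columns.tolist())
print(back_indexed.columns.tolist())
print(back_clean.columns.tolist())
```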

Basic data structures

Series

A Series is a labeled, one-dimensional array whose elements share a single type.

s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'],name='a Series array',dtype='float64')
# s looks like this
a    0.486934
b    0.005319
c    0.041949
d   -0.500936
e    0.404433
Name: a Series array, dtype: float64

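# The most commonly used attributes are values, index, name, and dtype
s.values
# Output (the numbers differ from the display above, presumably because the random data came from a separate run)
array([ 1.06995138, -0.10068972,  0.00838377, -1.13360582, -0.88613285])
# The available dtypes are
1. float
2. int
3. bool
4. datetime64[ns]
5. datetime64[ns, tz]
6. timedelta64[ns]
7. category
8. object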

# Convert a Series to a DataFrame
s.to_frame()
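Because the Series above is random, its output changes on every run. The following sketch uses fixed values (my own, chosen only for illustration) to show the four attributes and what to_frame does with the name.

```python
import pandas as pd

s_demo = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"], name="demo")

print(s_demo.index.tolist())      # the labels
print(s_demo.to_numpy())          # the underlying data, same content as .values
print(s_demo.name, s_demo.dtype)

# to_frame turns the Series into a one-column DataFrame;
# the Series name becomes the column name.
frame = s_demo.to_frame()
print(frame.columns.tolist())
```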

DataFrame

A DataFrame is a labeled, size-mutable, two-dimensional table whose columns may have different types.

df = pd.DataFrame({'col1':list('abcde'),'col2':range(5,10),'col3':[1.3,2.5,3.6,4.6,5.8]},
                 index=list('一二三四五'))

# df looks like this
 	col1 	col2 	col3
一 	a 	5 	1.3
二 	b 	6 	2.5
三 	c 	7 	3.6
四 	d 	8 	4.6
五 	e 	9 	5.8

# Select specific columns
df['col2']

一    5
二    6
三    7
四    8
五    9
Name: col2, dtype: int64

df[['col1','col3']]

 	col1 	col3
一 	a 	1.3
二 	b 	2.5
三 	c 	3.6
四 	d 	4.6
五 	e 	5.8

df.loc[:,['col1','col2']]
 	col1 	col2
一 	a 	5
二 	b 	6
三 	c 	7
四 	d 	8
五 	e 	9

# Selecting a single column returns a Series
type(df['col1'])
pandas.core.series.Series
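The difference between single and double brackets is worth a short sketch: a string selects one column as a Series, while a list of names always yields a DataFrame, even with one name in it. The frame df_sel and its row labels are made up for this example.

```python
import pandas as pd

df_sel = pd.DataFrame(
    {"col1": list("abc"), "col2": [5, 6, 7]},
    index=["r1", "r2", "r3"],
)

one_col = df_sel["col1"]            # a string key gives a Series
one_col_frame = df_sel[["col1"]]    # a list of keys gives a DataFrame

# loc takes row labels first, then column labels
block = df_sel.loc[["r1", "r3"], ["col2"]]

print(type(one_col).__name__, type(one_col_frame).__name__)
print(block)
```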

# Rename columns and index labels
df.rename(index={'一':'one','二':'two'},columns={'col1':'new_col1'})
 	new_col1 	col2 	col3
one 	a 	5 	1.3
two 	b 	6 	2.5
三 	c 	7 	3.6
四 	d 	8 	4.6
五 	e 	9 	5.8


# Index alignment
# The subtraction below is matched on the index labels 1, 2, 3, not on row position
df1 = pd.DataFrame({'A':[1,2,3]},index=[1,2,3])
df2 = pd.DataFrame({'A':[1,2,3]},index=[3,1,2])
df1-df2
	A
1 	-1
2 	-1
3 	2
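Alignment also decides what happens when the two indexes only partly overlap. The sketch below, with made-up frames left and right, shows the default NaN result, the fill_value option of sub, and a position-based subtraction that bypasses labels.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=[1, 2, 3])
right = pd.DataFrame({"A": [10, 20, 30]}, index=[2, 3, 4])

# Labels 1 and 4 exist on one side only, so those rows become NaN
aligned = left - right

# fill_value substitutes 0 for the side that is missing a label
filled = left.sub(right, fill_value=0)

# A plain NumPy array carries no labels, so this subtracts by position
positional = left - right.to_numpy()

print(aligned)
print(filled)
print(positional)
```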

# Adding and deleting columns
df['col4'] = [1,2,3,4,5]
 	col1 	col2 	col3 	col4
一 	a 	5 	1.3 	1
二 	b 	6 	2.5 	2
三 	c 	7 	3.6 	3
四 	d 	8 	4.6 	4
五 	e 	9 	5.8 	5

df.pop('col4')
一    1
二    2
三    3
四    4
五    5
Name: col4, dtype: int64

del df['col1']
df
 	col2 	col3
一 	5 	1.3
二 	6 	2.5
三 	7 	3.6
四 	8 	4.6
五 	9 	5.8
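Both pop and del modify the DataFrame in place. When the original should stay untouched, drop(columns=...) is an alternative that returns a new frame. A small sketch with a made-up frame df_cols:

```python
import pandas as pd

df_cols = pd.DataFrame({"col1": list("ab"), "col2": [5, 6], "col3": [1.3, 2.5]})

# drop returns a new DataFrame and leaves df_cols as it was
trimmed = df_cols.drop(columns=["col1"])
print(df_cols.columns.tolist())

# pop removes the column in place and hands it back as a Series
popped = df_cols.pop("col3")
print(df_cols.columns.tolist())
print(popped.tolist())
```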

# assign also aligns on the index, so label 3 has no match and gets NaN (a missing value); the original DataFrame is not modified
pd.Series(list('def'))
0    d
1    e
2    f
dtype: object

df1.assign(C=pd.Series(list('def')))
 	A 	C
1 	1 	e
2 	2 	f
3 	3 	NaN
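If the intent is to attach the values by position rather than by label, one option is to pass a plain array, which has no index to align on. This sketch rebuilds df1 from the example above so it runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3]}, index=[1, 2, 3])
letters = pd.Series(list("def"))              # default index 0, 1, 2

by_label = df1.assign(C=letters)              # aligned on labels: e, f, NaN
by_position = df1.assign(C=letters.to_numpy())  # no labels: d, e, f

print(by_label)
print(by_position)
print("C" in df1.columns)  # False: assign returned a new DataFrame
```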

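Common basic functions

info and describe

These give a quick overview of a dataset. In the examples below, df is a 35-row table of student records (presumably the one read from data/table.csv earlier), not the small frame built above.

# Which columns exist, how many non-null values each has, and each column's dtype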
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    35 non-null     object 
 2   ID       35 non-null     int64  
 3   Gender   35 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   35 non-null     int64  
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object 
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB

# Summary statistics for the numeric columns
df.describe()
 	ID 	Height 	Weight 	Math
count 	35.00000 	35.000000 	35.000000 	35.000000
mean 	1803.00000 	174.142857 	74.657143 	61.351429
std 	536.87741 	13.541098 	12.895377 	19.915164
min 	1101.00000 	155.000000 	53.000000 	31.500000
25% 	1204.50000 	161.000000 	63.000000 	47.400000
50% 	2103.00000 	173.000000 	74.000000 	61.700000
75% 	2301.50000 	187.500000 	82.000000 	77.100000
max 	2405.00000 	195.000000 	100.000000 	97.000000

# Summary of a non-numeric column
df['Physics'].describe()
count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object
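describe also accepts a percentiles argument when the default quartiles are not the ones you need. The sketch below uses a five-row stand-in for the student table; the rows are invented and only the column names follow the output above.

```python
import pandas as pd

# Hypothetical stand-in for the student table
students = pd.DataFrame({
    "Height": [173, 192, 186, 167, 159],
    "Math": [34.0, 32.5, 87.2, 80.4, 84.8],
    "Physics": ["A+", "B+", "B+", "B-", "B+"],
})

# Ask for the 10th and 90th percentiles; the median (50%) is always included
numeric_summary = students.describe(percentiles=[0.1, 0.9])
print(numeric_summary.index.tolist())

# For a text column, describe reports count, unique, top, and freq
text_summary = students["Physics"].describe()
print(text_summary["top"], text_summary["freq"])
```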

head and tail

df.head(3)  # shows the first 5 rows by default; pass a number for a different count

	School 	Class 	ID 	Gender 	Address 	Height 	Weight 	Math 	Physics
0 	S_1 	C_1 	1101 	M 	street_1 	173 	63 	34.0 	A+
1 	S_1 	C_1 	1102 	F 	street_2 	192 	73 	32.5 	B+
2 	S_1 	C_1 	1103 	M 	street_2 	186 	82 	87.2 	B+

df.tail()
 	School 	Class 	ID 	Gender 	Address 	Height 	Weight 	Math 	Physics
30 	S_2 	C_4 	2401 	F 	street_2 	192 	62 	45.3 	A
31 	S_2 	C_4 	2402 	M 	street_7 	166 	82 	48.7 	B
32 	S_2 	C_4 	2403 	F 	street_6 	158 	60 	59.7 	B+
33 	S_2 	C_4 	2404 	F 	street_2 	160 	84 	67.7 	B
34 	S_2 	C_4 	2405 	F 	street_6 	193 	54 	47.6 	B

unique and nunique

# nunique returns the number of distinct values
df['Physics'].nunique()
7

# unique returns all of the distinct values
df['Physics'].unique()
array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)
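One detail that the Physics column does not show, since it has no missing values: unique keeps NaN as a value, while nunique skips it unless dropna=False is passed. A sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

grades = pd.Series(["A", "B", "A", np.nan, "C"])

distinct = grades.unique()               # NaN is included in the result
n_default = grades.nunique()             # NaN is not counted
n_with_nan = grades.nunique(dropna=False)

print(len(distinct), n_default, n_with_nan)
```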

count and value_counts

# count returns the number of non-missing elements
df['Physics'].count()
35

# value_counts returns how many times each value occurs
df['Physics'].value_counts()


B+    9
B     8
B-    6
A     4
A+    3
A-    3
C     2
Name: Physics, dtype: int64
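value_counts has two options that come up often: normalize=True for proportions instead of counts, and dropna=False to include missing values. The Series below is invented for the illustration.

```python
import pandas as pd

grades = pd.Series(["B+", "B", "B+", "A", "B+", None])

counts = grades.value_counts()                  # missing values are skipped
shares = grades.value_counts(normalize=True)    # fractions of the non-missing total
with_missing = grades.value_counts(dropna=False)

print(counts)
print(shares)
print(grades.count(), with_missing.sum())
```

Here count() and the default value_counts() agree on 5 non-missing entries, while dropna=False accounts for all 6 rows.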

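idxmax and nlargest

# idxmax returns the index label of the largest value (the same as the row number here, because the index is the default 0 to 34); idxmin does the same for the smallest value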
df['Math'].idxmax()
5
df['Math'].idxmin()
10

# nlargest returns the n largest values; nsmallest returns the n smallest
df['Math'].nlargest(3)
5     97.0
28    95.5
11    87.7
Name: Math, dtype: float64

df['Math'].nsmallest(2)
10    31.5
1     32.5
Name: Math, dtype: float64
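With a default integer index, label and position coincide, which hides what idxmax really returns. The sketch below uses string labels (made up for the example) to separate the two.

```python
import pandas as pd

scores = pd.Series([61.0, 97.0, 31.5, 87.7], index=["ann", "bob", "cat", "dan"])

best_label = scores.idxmax()   # the index label of the maximum
best_pos = scores.argmax()     # the integer position of the maximum
top2 = scores.nlargest(2)      # keeps the original labels

print(best_label, best_pos)
print(top2)
```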

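clip and replace

# clip truncates values that fall above or below the given bounds
# Values are limited to the range [lower, upper]: anything greater than upper is set to upper, and anything less than lower is set to lower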
df['Math'].clip(33,80).head()
0    34.0
1    33.0
2    80.0
3    80.0
4    80.0
Name: Math, dtype: float64

# replace substitutes new values for the specified ones
df['Address'].head()
0    street_1
1    street_2
2    street_2
3    street_2
4    street_4
Name: Address, dtype: object

df['Address'].replace(['street_1','street_2'],['one','two']).head()
0         one
1         two
2         two
3         two
4    street_4
Name: Address, dtype: object
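Two small variations on the calls above, as a sketch: clip can take just one bound through the lower or upper keyword, and replace accepts a dict, which keeps each old value next to its new one. The data reuses the first few Math and Address values shown earlier.

```python
import pandas as pd

math = pd.Series([34.0, 32.5, 87.2])
floor_only = math.clip(lower=33)   # only raises values below 33
cap_only = math.clip(upper=80)     # only lowers values above 80

address = pd.Series(["street_1", "street_2", "street_4"])
# Values that are not keys of the dict are left unchanged
renamed = address.replace({"street_1": "one", "street_2": "two"})

print(floor_only.tolist())
print(cap_only.tolist())
print(renamed.tolist())
```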

The apply function

apply() works on both Series and DataFrame.

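Take the following DataFrame
 	name 	nationality 	score
0 	张 	汉 	400
1 	李 	回 	450
2 	王 	汉 	470

# Ethnic minority candidates get 5 extra points ('汉' is Han, the majority group; '回' is Hui)
df['extrascore'] = df['nationality'].apply(lambda x: 5 if x != '汉' else 0)
 	name 	nationality 	score 	extrascore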
0 	张 	汉 	400 	0
1 	李 	回 	450 	5
2 	王 	汉 	470 	0

df['totalscore'] = df['score']+ df['extrascore']
 	name 	nationality 	score 	extrascore 	totalscore
0 	张 	汉 	400 	0 	400
1 	李 	回 	450 	5 	455
2 	王 	汉 	470 	0 	470
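For a simple condition like this one, a vectorized expression is a common alternative to apply with a lambda. The sketch below swaps in np.where, which is not what the example above uses, and rebuilds the same three rows under the name applicants so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same three rows as the table above
applicants = pd.DataFrame({
    "name": ["张", "李", "王"],
    "nationality": ["汉", "回", "汉"],
    "score": [400, 450, 470],
})

# np.where evaluates the condition for the whole column at once
applicants["extrascore"] = np.where(applicants["nationality"] != "汉", 5, 0)
applicants["totalscore"] = applicants["score"] + applicants["extrascore"]
print(applicants)
```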

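DataFrame.apply passes each column (or each row, with axis=1) to the function as a Series, rather than visiting elements one at a time. np.square works below because it operates on the whole column at once.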

matrix = [[1,2,3,],[4,5,6],[7,8,9]]
df = pd.DataFrame(matrix,columns=list('xyz'),index=list('abc'))
df.apply(np.square)
 	x 	y 	z
a 	1 	4 	9
b 	16 	25 	36
c 	49 	64 	81

df.apply(lambda x:np.square(x) if x.name in ['x','y'] else x)
 	x 	y 	z
a 	1 	4 	3
b 	16 	25 	6
c 	49 	64 	9
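The axis argument controls what the function receives. A minimal sketch on the same 3x3 matrix, using sums so the direction is easy to check by hand; the names df_m, col_sums, and row_sums are my own.

```python
import pandas as pd

df_m = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=list("xyz"),
    index=list("abc"),
)

# Default axis=0: the function receives one column at a time
col_sums = df_m.apply(lambda col: col.sum())

# axis=1: the function receives one row at a time
row_sums = df_m.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

The column-wise result is indexed by x, y, z and the row-wise result by a, b, c, which is a quick way to confirm which direction was used.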
