Day02_Pandas

White Hsiao

Tags: python numpy data analysis big data

Article link: https://blog.csdn.net/Hiraphael/article/details/108133093

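Day02_Pandas Basics

1. What is pandas

pandas (Python Data Analysis Library) is a tool built on NumPy, created to solve data analysis tasks
pandas incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large datasets efficiently
pandas provides a large number of functions and methods that let us process data quickly and conveniently
It is one of the key factors that make Python a powerful and efficient data analysis environment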

2. Importing

Series is an array-like, one-dimensional data structure

DataFrame: a data frame
Similar to an Excel sheet, a DataFrame organizes and processes data in rows and columns

# pandas is built on numpy; the two are almost always used together
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

3. Series

Creating a Series

Created from a list or a numpy array
The default index is the integers 0 to N-1

obj = Series([1,2,3,4])
print(obj)

# Output:
0    1
1    2
2    3
3    4
dtype: int64

# The index can also be specified with the index parameter
obj2 = Series([1,2,3,4],index=['a','b','c','d'])
obj2
# Output:
a    1
b    2
c    3
d    4
dtype: int64

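# Note: a Series created from an ndarray references the array's data rather than copying it, so changing a Series element also changes the element in the original ndarray (this does not happen with a list). This is the behavior in the pandas version used here; newer versions with Copy-on-Write enabled may behave differently.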
a = np.array([1,2,3])
obj = Series(a)
obj
# Output:
0    1
1    2
2    3
dtype: int64


obj[0]=0
print(a)
print(obj)
# Output:
[0 2 3]
0    0
1    2
2    3
dtype: int64
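
If the sharing described above is not wanted, one option is to ask pandas for a copy when the Series is built. The following is a minimal sketch, assuming the copy parameter of the Series constructor; the variable names are made up for illustration. Whether the un-copied form shares memory can depend on the pandas version, so an explicit copy is the safer choice when independence matters.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

# copy=True asks pandas to copy the data instead of wrapping the ndarray
independent = pd.Series(arr, copy=True)
independent.iloc[0] = 0

print(arr)          # the original array is left untouched
print(independent)  # only the Series sees the new value
```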

Created from a dictionary

obj = Series({'a':1,'b':2})
obj

# Output:
a    1
b    2
dtype: int64

Exercise 1:
Create the following Series in more than one way and name it sss: 语文 (Chinese) 150, 数学 (Math) 150, 英语 (English) 150, 理综 (Science) 300

sss = pd.Series([150,150,150,300],index=['语文','数学','英语','理综'])
sss

sss = pd.Series({'语文':150,'数学':150,'英语':150,'理综':300})
sss

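Indexing and slicing a Series

(1) Explicit index:

Use the elements of index as the index values
Use .loc[] (recommended)
Note that slices here are closed intervals (both endpoints are included)

obj = Series({'a':10,'b':12,'c':17})
obj.loc['a':'b']
# Output:
a    10
b    12
dtype: int64

obj['a']
# Output:
10

obj['a':'c']
# Output: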
a    10
b    12
c    17
dtype: int64

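(2) Implicit index:

Use integers as the index values

Use .iloc[] (recommended)
Note that slices here are half-open intervals (the end position is excluded)

obj[0:1]
# Output:
a    10
dtype: int64

obj.iloc[0]
# Output:
10

obj.iloc[0:1]
# Output: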
a    10
dtype: int64
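
The difference between the closed interval of .loc[] and the half-open interval of .iloc[] is easiest to see side by side. A small sketch on the same data as above (the variable names by_label and by_position are illustrative only):

```python
import pandas as pd

obj = pd.Series({'a': 10, 'b': 12, 'c': 17})

by_label = obj.loc['a':'b']     # label slice: both endpoints are included
by_position = obj.iloc[0:2]     # position slice: the stop position is excluded

print(by_label)
print(by_position)
```

Both slices select the rows labeled a and b, even though one is written with the end label 'b' and the other with the stop position 2.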

Exercise 2:
Index and slice the Series sss created in Exercise 1 in more than one way:
Index: 数学 150
Slice: 语文 150, 数学 150, 英语 150

print(sss.loc['数学'])
print(sss.iloc[1])
print(sss.loc['语文':'英语'])
print(sss.iloc[0:3])

# Output:
150
150
语文    150
数学    150
英语    150
dtype: int64
语文    150
数学    150
英语    150
dtype: int64

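Basic concepts of Series

A Series can be thought of as a fixed-length, ordered dictionary
Its attributes are available through shape, size, index, values, and so on
head() and tail() give a quick look at the contents of a Series
When an index label has no corresponding value, the missing data is shown as NaN (not a number)

Missing data can be detected with pd.isnull() and pd.notnull(), or with the Series methods isnull() and notnull()

obj = Series([10,4,np.nan])
# notnull is True where a value is present and False where it is missing
notnull = pd.notnull(obj)
# set every missing value to 0
for i,d in enumerate(notnull):
    if not d:
        obj[i] = 0
print(obj)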

obj.isnull()
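
The loop above can usually be replaced by a single call. A minimal sketch using fillna(), which this article covers later for missing data; the name filled is illustrative:

```python
import numpy as np
import pandas as pd

obj = pd.Series([10, 4, np.nan])

# fillna returns a new Series with every missing value replaced;
# obj itself is left unchanged
filled = obj.fillna(0)

print(filled)
print(obj.isnull())
```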

Both the Series object and its index have a name attribute

obj = Series([1,2,np.nan],index=['a','b','d'])
obj.name='123'
print(obj)
obj.index.name = "Hello World"
print(obj.index.name)

# Output:
a    1.0
b    2.0
d    NaN
Name: 123, dtype: float64
Hello World

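Series arithmetic

The array operations that work on numpy arrays also work on Series

Operations between Series
Data with different indexes is aligned automatically during the operation
Where an index label has no counterpart, the result is NaN

Note: to keep a value for every index label instead of NaN, use the .add() method with fill_value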

A = pd.Series([2,4,6],index=[0,1,2])
B = pd.Series([1,3,5],index=[1,2,3])
display(A,B)

# Output:
0    2
1    4
2    6
dtype: int64
1    1
2    3
3    5
dtype: int64


A+B
# Output:
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
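
The note about .add() can be illustrated with the same A and B. In this sketch fill_value=0 treats a label that is missing on one side as 0, so every label keeps a value; the name total is illustrative:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# label 0 exists only in A and label 3 only in B;
# fill_value=0 substitutes 0 for the missing side before adding
total = A.add(B, fill_value=0)

print(total)
```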

4. DataFrame

Creating a DataFrame

The most common way is to pass in a dictionary. The DataFrame uses the dictionary keys as the [column] names and the dictionary values (each an array) as the column data.
In addition, the DataFrame automatically adds an index for each row (just like a Series).
As with a Series, if a requested column does not match any dictionary key, its values are NaN.

import pandas as pd
from pandas import Series,DataFrame
data = {'color':['blue','green','yellow','red','white'],
       'object':['ball','pen','pencil','paper','mug'],
       'price':[1.2,1.0,0.6,0.9,1.7]}
frame = DataFrame(data,columns=['color','object','price','weight'],
                 index = ['one','two','three','four','five'])
frame

# Output:
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN

Exercise 4:
Create a DataFrame from the following exam score table:
      张三  李四
语文  150    0
数学  150    0
英语  150    0
理综  300    0

import numpy as np
dic = {'张三':[150,150,150,300],'李四':[0,0,0,0]}
ind = np.array(['语文','数学','英语','理综'])
ddd = DataFrame(dic,index=ind)
ddd

# Output:
      张三  李四
语文  150    0
数学  150    0
英语  150    0
理综  300    0

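Indexing a DataFrame

Indexing columns

- dictionary-style access
- attribute-style access
A column of a DataFrame can be retrieved as a Series. The returned Series has the same index as the original DataFrame, and its name attribute is already set to the column name.


print(frame)
frame['color']

# Output: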
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN

one        blue
two       green
three    yellow
four        red
five      white
Name: color, dtype: object

# Method 2: attribute access
frame.color

# Output:
one        blue
two       green
three    yellow
four        red
five      white
Name: color, dtype: object

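Indexing rows

- .ix[] was used for row indexing in older pandas; it has been removed from current versions, so use .loc[] or .iloc[] instead
- use .loc[] with an index label for row indexing
- use .iloc[] with an integer position for row indexing
Each of these also returns a Series, whose index is the original columns.

frame.loc['one']

# Output:
color     blue
object    ball
price      1.2
weight     NaN
Name: one, dtype: object

type(frame.loc['one'])
# Output:
pandas.core.series.Series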


frame.loc["two"]
# Output:
color     green
object      pen
price         1
weight      NaN
Name: two, dtype: object

print(frame)
# a slice that runs past the last row simply returns all available rows
frame.iloc[0:10]
# Output:
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN

        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN

Indexing elements

- index the column first
- index the row first
- use the values attribute (a two-dimensional numpy array)

print(frame)
print("Column index first")
print(frame['color']['one'])
print(frame.color['one'])
print("Row index first")
print(frame.loc['one']['color'])
print(frame.loc['one','color'])
print(frame.iloc[0][0:2])
print("Using the values attribute")
print(frame.values[0])
print(frame.values[0][1:3])

# Output:
        color  object  price weight
one      blue    ball    1.2    NaN
two     green     pen    1.0    NaN
three  yellow  pencil    0.6    NaN
four      red   paper    0.9    NaN
five    white     mug    1.7    NaN
Column index first
blue
blue
Row index first
blue
blue
color     blue
object    ball
Name: one, dtype: object
Using the values attribute
['blue' 'ball' 1.2 nan]
['ball' 1.2]

[Note] When plain square brackets are used directly:
a single label selects a column
a slice selects rows

# This is column indexing
print(frame['color'])

# Output:
one        blue
two       green
three    yellow
four        red
five      white
Name: color, dtype: object


# A slice ---------> selects rows

frame['one':'two']
# Output:
     color object  price weight
one   blue   ball    1.2    NaN
two  green    pen    1.0    NaN
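
Because plain brackets mean different things for a single label and for a slice, the same selections can be written with .loc[] to make the intent explicit. A minimal sketch on a small frame made up for illustration (the names col, rows and cell are not from the original):

```python
import pandas as pd

frame = pd.DataFrame(
    {'color': ['blue', 'green', 'yellow'],
     'price': [1.2, 1.0, 0.6]},
    index=['one', 'two', 'three'])

col = frame.loc[:, 'color']        # all rows, one column
rows = frame.loc['one':'two']      # a row slice, both ends included
cell = frame.loc['two', 'price']   # row label first, then column label

print(col)
print(rows)
print(cell)
```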

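DataFrame arithmetic

Operations between DataFrames

As with Series:
Data with different indexes is aligned automatically during the operation
Where an index label has no counterpart, the result is NaN

# The values are random; the outputs below show one sample run
A = DataFrame(np.random.randint(0,20,(2,2)),columns = list('ab'))
A
# Output:
    a   b
0   1   1
1  15  10

B = DataFrame(np.random.randint(0,10,(3,3)),columns = list('abc'))
B

# Output:
   a  b  c
0  6  3  8
1  2  7  4
2  6  7  4

A+B

# Output:
      a     b   c
0   7.0   4.0 NaN
1  17.0  17.0 NaN
2   NaN   NaN NaN

# fill_value=0 treats a value missing on one side as 0 before adding
A.add(B,fill_value=0)

# Output:
      a     b    c
0   7.0   4.0  8.0
1  17.0  17.0  4.0
2   6.0   7.0  4.0

# Convert to int:
A.add(B,fill_value=0).astype('int')

# Output:
    a   b  c
0   7   4  8
1  17  17  4
2   6   7  4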

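Operations between a Series and a DataFrame

[Important]
With Python operators, the operation works row by row (the operand must match a row) and is applied to every row. (This is similar to an operation between a two-dimensional and a one-dimensional numpy array, except that NaN may appear.)

# The values are random; the outputs below show one sample run
A = np.random.randint(10,size = (3,4))
A
# Output: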
array([[2, 2, 6, 3],
       [1, 3, 4, 1],
       [3, 8, 3, 6]])

B = np.random.randint(10,size = (4))
B

# Output:
array([4, 6, 9, 0])


A-B
# Output:
array([[-2, -4, -3,  3],
       [-3, -3, -5,  1],
       [-1,  2, -6,  6]])


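# With pandas arithmetic methods:
#    axis=0: operate column by column (the operand must match a column); applied to every column.
#    axis=1: operate row by row (the operand must match a row); applied to every row.

#   axis=0: operate column by column (the operand must match a column); applied to every column.
# df is rebuilt here from the output shown below
df = DataFrame([[7,9,3,5],[2,4,7,6],[8,8,1,6]],columns = list('QWER'))
display(df)
display(df['Q'])
df.sub(df['Q'],axis = 0)

# Output:
   Q  W  E  R
0  7  9  3  5
1  2  4  7  6
2  8  8  1  6

0    7
1    2
2    8
Name: Q, dtype: int64

   Q  W  E  R
0  0  2 -4 -2
1  0  2  5  4
2  0  0 -7 -2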

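#   axis=1: operate row by row (the operand must match a row); applied to every row
display(df)
display(df.iloc[0:2])
# df.iloc[0,::2] holds only Q and E of the first row, so W and R have no counterpart and become NaN
df.sub(df.iloc[0,::2],axis = 1)

# Output:
   Q  W  E  R
0  7  9  3  5
1  2  4  7  6
2  8  8  1  6

   Q  W  E  R
0  7  9  3  5
1  2  4  7  6

     E    Q   R   W
0  0.0  0.0 NaN NaN
1  4.0 -5.0 NaN NaN
2 -2.0  1.0 NaN NaN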
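
To contrast the two axes without any NaN, the following sketch subtracts a whole column and a whole row from the same frame. The frame reuses the values shown in the output above; the names by_col and by_row are illustrative:

```python
import pandas as pd

df = pd.DataFrame([[7, 9, 3, 5],
                   [2, 4, 7, 6],
                   [8, 8, 1, 6]], columns=list('QWER'))

# axis=0: the Series is matched against the row index,
# so column Q is subtracted from every column
by_col = df.sub(df['Q'], axis=0)

# axis=1: the Series is matched against the column labels,
# so the first row is subtracted from every row
by_row = df.sub(df.iloc[0], axis=1)

print(by_col)
print(by_row)
```

The plain operator form df - df.iloc[0] matches on column labels as well, so it gives the same result as by_row.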

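5. Handling missing data in a DataFrame

There are two kinds of missing data

None

None is built into Python, and its type is a Python object. As a result, None cannot take part in arithmetic: an operation such as None + 1 raises a TypeError

Operations on object-dtype arrays are much slower than operations on int arrays
%timeit sum_int = np.arange(1E6,dtype=int).sum()
%timeit sum_float = np.arange(1E6,dtype=float).sum()
%timeit sum_object = np.arange(1E6,dtype=object).sum()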

np.nan (NaN)

# np.nan is a floating-point value and can take part in arithmetic, but the result of that arithmetic is always NaN
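
A short sketch of how NaN propagates through arithmetic, together with NumPy's NaN-aware sum as one way around it. The array values are made up for illustration:

```python
import numpy as np

values = np.array([1.0, np.nan, 2.0])

plain_sum = values.sum()        # NaN propagates through the ordinary sum
safe_sum = np.nansum(values)    # nansum skips NaN entries

print(np.nan + 1)
print(plain_sum)
print(safe_sum)
```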

None and NaN in pandas

In pandas, None and np.nan are both treated as missing values; in numeric data, None is converted to np.nan

a = Series([1,np.nan,2,None])
a

# Output:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

Operating on None and np.nan in pandas

Detection functions: isnull() and notnull()

data = Series([1,np.nan,'hello',None])
data

# Output:
0        1
1      NaN
2    hello
3     None
dtype: object


data.isnull()

# Output:
0    False
1     True
2    False
3     True
dtype: bool

data[data.notnull()]

# Output:
0        1
2    hello
dtype: object

dropna(): filter out missing data

df = DataFrame([[1,np.nan,2],
               [2,3,5],
               [np.nan,4,6]],columns = ['昨天','今天','明天'],index = ['吃饭','睡觉','过家家'])
df

# Output:
        昨天   今天  明天
吃饭    1.0  NaN   2
睡觉    2.0  3.0   5
过家家  NaN  4.0   6


df.dropna()

# Output:
      昨天   今天  明天
睡觉  2.0  3.0   5

# You can choose whether rows or columns are filtered (rows by default)

df.dropna(axis=1)

# Output:
        明天
吃饭      2
睡觉      5
过家家    6
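
dropna() can also be made less strict than dropping every row that contains a NaN. A minimal sketch, assuming the how and thresh parameters of DataFrame.dropna; the frame and its column names x, y, z are hypothetical and not taken from the example above:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame([[1, np.nan, 2],
                       [np.nan, np.nan, np.nan],
                       [np.nan, np.nan, 6],
                       [2, 3, 5]], columns=['x', 'y', 'z'])

any_missing = scores.dropna()            # default: drop rows with at least one NaN
all_missing = scores.dropna(how='all')   # drop only rows in which every value is NaN
at_least_two = scores.dropna(thresh=2)   # keep rows with at least 2 non-NaN values

print(any_missing)
print(all_missing)
print(at_least_two)
```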

fillna(): fill in missing data

data = Series([1,np.nan,2,None,4],index = list('abcdf'))
data

# Output:
a    1.0
b    NaN
c    2.0
d    NaN
f    4.0
dtype: float64


data.fillna(10)
# Output:
a     1.0
b    10.0
c     2.0
d    10.0
f     4.0
dtype: float64

For a DataFrame you also choose the axis along which to fill. Remember that for a DataFrame:
axis=0: index / rows
axis=1: columns
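
The effect of the axis choice shows up when a gap is filled from its neighbor. A minimal sketch using DataFrame.ffill (forward fill) on the df from the dropna() example; the names down, across and constant are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]],
                  columns=['昨天', '今天', '明天'],
                  index=['吃饭', '睡觉', '过家家'])

down = df.ffill(axis=0)      # each gap takes the value from the row above
across = df.ffill(axis=1)    # each gap takes the value from the column to the left
constant = df.fillna(0)      # a constant fill does not depend on the axis

print(down)
print(across)
print(constant)
```

A gap with nothing before it along the chosen axis stays NaN: the first row of 今天 when filling down, and the first column of 过家家 when filling across.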
