python dataframe isin_Python 数据科学系列 の Numpy、Series 和 DataFrame介绍

本課主題

Numpy 的介绍和操作实战

Series 的介绍和操作实战

DataFrame 的介绍和操作实战

Numpy 的介绍和操作实战

numpy 是 Python 在数据计算领域里很常用的模块

import numpy as np

np.array([11,22,33]) #接受一个列表数据

创建 numpy array

>>> importnumpy as np>>> mylist = [1,2,3]>>> x =np.array(mylist)>>>x

array([1, 2, 3])>>> y = np.array([4,5,6])>>>y

array([4, 5, 6])>>> m = np.array([[7,8,9],[10,11,12]])>>>m

array([[7, 8, 9],

[10, 11, 12]])

创建 numpy array(例子)

查看 numpy array 的

>>> m.shape #array([1, 2, 3])

(2, 3)>>> x.shape #array([4, 5, 6])

(3,)>>> y.shape #array([[ 7, 8, 9], [10, 11, 12]])

(3,)

View Code

numpy.arrange

>>> n = np.arange(0,30,2)>>>n

array([ 0,2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

numpy.arrange( )(例子)

改变numpy array的位置

>>> n = np.arange(0,30,2)>>>n

array([ 0,2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])>>>n.shape

(15,)>>> n = n.reshape(3,5) #从15列改成3列5行

>>>n

array([[ 0,2, 4, 6, 8],

[10, 12, 14, 16, 18],

[20, 22, 24, 26, 28]])

numpy.reshape( )(例子一)

>>> o = np.linspace(0,4,9)>>>o

array([ 0. ,0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])>>> o.resize(3,3)>>>o

array([[ 0. ,0.5, 1. ],

[1.5, 2. , 2.5],

[3. , 3.5, 4. ]])

numpy.reshape( )(例子二)

numpy.ones( ) ,numpy.zeros( ),numpy.eye( )

>>> r1 = np.ones((3,2))>>>r1

array([[1., 1.],

[1., 1.],

[1., 1.]])>>> r1 = np.zeros((2,3))>>>r1

array([[ 0., 0., 0.],

[ 0., 0., 0.]])>>> r2 = np.eye(3)>>>r2

array([[1., 0., 0.],

[ 0.,1., 0.],

[ 0., 0.,1.]])

numpy.ones/zeros/eye( )(例子)

可以定义整数

>>> r5 = np.ones([2,3], int)>>>r5

array([[1, 1, 1],

[1, 1, 1]])>>> r5 = np.ones([2,3])>>>r5

array([[1., 1., 1.],

[1., 1., 1.]])

numpy.ones(x,int)(例子)

numpy.diag( )

>>> y = np.array([4,5,6])>>>y

array([4, 5, 6])>>>np.diag(y)

array([[4, 0, 0],

[0,5, 0],

[0, 0,6]])

diag( )(例子)

复制 numpy array

>>> r3 = np.array([1,2,3] * 3)>>>r3

array([1, 2, 3, 1, 2, 3, 1, 2, 3])>>> r4 = np.repeat([1,2,3],3)>>>r4

array([1, 1, 1, 2, 2, 2, 3, 3, 3])

复制numpy array(例子)

numpy中的 vstack和 hstack

>>> r5 = np.ones([2,3], int)>>>r5

array([[1, 1, 1],

[1, 1, 1]])>>> r6 = np.vstack([r5,2*r5])>>>r6

array([[1, 1, 1],

[1, 1, 1],

[2, 2, 2],

[2, 2, 2]])>>> r7 = np.hstack([r5,2*r5])>>>r7

array([[1, 1, 1, 2, 2, 2],

[1, 1, 1, 2, 2, 2]])

numpy.vstack( )和np.hstack( )(例子)

numpy 中的加减乘除操作一 (+-*/)

>>> mylist = [1,2,3]>>> x =np.array(mylist)>>> y = np.array([4,5,6])>>> x+y

array([5, 7, 9])>>> x-y

array([-3, -3, -3])>>> x*y

array([4, 10, 18])>>> x**2array([1, 4, 9])>>>x.dot(y)32

numpy中的加减乘除(例子一)

numpy 中的加减乘除操作二:sum( )、max( )、min( )、mean( )、std( )

>>> a = np.array([1,2,3,4,5])>>>a.sum()15

>>>a.max()5

>>>a.min()1

>>>a.mean()3.0

>>>a.std()1.4142135623730951

>>>a.argmax()4

>>>a.argmin()

0

numpy中的加减乘除(例子二)

查看numpy array 的数据类型

>>> y = np.array([4,5,6])>>> z = np.array([y, y**2])>>>z

array([[4, 5, 6],

[16, 25, 36]])>>>z.shape

(2, 3)>>>z.T.shape

(3, 2)>>>z.dtype

dtype('int64')>>> z = z.astype('f')>>>z.dtype

dtype('float32')

numpy array 的数据类型

numpy 中的索引和切片

>>> s = np.arange(13)>>>s

array([ 0,1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])>>> s = np.arange(13) ** 2

>>>s

array([ 0,1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])>>> s[0],s[4],s[0:3]

(0,16, array([0, 1, 4]))>>> s[1:5]

array([1, 4, 9, 16])>>> s[-4:]

array([81, 100, 121, 144])>>> s[-5:-2]

array([64, 81, 100])

numpy索引和切片(例子一)

>>> r = np.arange(36)>>> r.resize((6,6))>>>r

array([[ 0,1, 2, 3, 4, 5],

[6, 7, 8, 9, 10, 11],

[12, 13, 14, 15, 16, 17],

[18, 19, 20, 21, 22, 23],

[24, 25, 26, 27, 28, 29],

[30, 31, 32, 33, 34, 35]])>>> r[2,2]14

>>> r[3,3:6]

array([21, 22, 23])>>> r[:2,:-1]

array([[ 0,1, 2, 3, 4],

[6, 7, 8, 9, 10]])>>> r[-1,::2]

array([30, 32, 34])>>> r[r > 30] #取r大于30的数据

array([31, 32, 33, 34, 35])>>> re2 = r[r > 30] = 30

>>>re230

>>> r8 = r[:3,:3]>>>r8

array([[ 0,1, 2],

[6, 7, 8],

[12, 13, 14]])>>> r8[:] =0>>>r8

array([[0, 0, 0],

[0, 0, 0],

[0, 0, 0]])>>>r

array([[ 0, 0, 0,3, 4, 5],

[ 0, 0, 0,9, 10, 11],

[ 0, 0, 0,15, 16, 17],

[18, 19, 20, 21, 22, 23],

[24, 25, 26, 27, 28, 29],

[30, 30, 30, 30, 30, 30]])

numpy索引和切片(例子二)

copy numpy array 的数组

>>> r = np.arange(36)>>> r.resize((6,6))>>> r_copy =r.copy()>>>r

array([[ 0,1, 2, 3, 4, 5],

[6, 7, 8, 9, 10, 11],

[12, 13, 14, 15, 16, 17],

[18, 19, 20, 21, 22, 23],

[24, 25, 26, 27, 28, 29],

[30, 31, 32, 33, 34, 35]])>>>r_copy

array([[ 0,1, 2, 3, 4, 5],

[6, 7, 8, 9, 10, 11],

[12, 13, 14, 15, 16, 17],

[18, 19, 20, 21, 22, 23],

[24, 25, 26, 27, 28, 29],

[30, 31, 32, 33, 34, 35]])>>> r_copy[:] = 10

>>>r_copy

array([[10, 10, 10, 10, 10, 10],

[10, 10, 10, 10, 10, 10],

[10, 10, 10, 10, 10, 10],

[10, 10, 10, 10, 10, 10],

[10, 10, 10, 10, 10, 10],

[10, 10, 10, 10, 10, 10]])

copy( )例子

其他操作

>>> test = np.random.randint(0,10,(4,3))>>>test

array([[3, 5, 2],

[7, 7, 9],

[8, 9, 2],

[2, 9, 1]])>>> for row intest:

...print(row)

...

[3 5 2]

[7 7 9]

[8 9 2]

[2 9 1]>>> for i inrange(len(test)):

...print(test[i])

...

[3 5 2]

[7 7 9]

[8 9 2]

[2 9 1]>>> for i, row inenumerate(test):

...print('row', i, 'is', row)

...

row 0is [3 5 2]

row1 is [7 7 9]

row2 is [8 9 2]

row3 is [2 9 1]>>> test2 = test ** 2

>>>test2

array([[9, 25, 4],

[49, 49, 81],

[64, 81, 4],

[4, 81, 1]])>>> for i,j, inzip(test,test2):

...print(i, '+', j, '=', i +j)

...

[3 5 2] + [ 9 25 4] = [12 30 6]

[7 7 9] + [49 49 81] = [56 56 90]

[8 9 2] + [64 81 4] = [72 90 6]

[2 9 1] + [ 4 81 1] = [ 6 90 2]>>>

numpy array 的其他操作例子

Series 的介绍和操作实战

如果是输入一个字典类型的话,字典的键会自动变成 Index,然后它的值是Value

from pandas import Series, DataFrame

import pandas as pd

pd.Series(['Dog','Bear','Tiger','Moose','Giraffe','Hippopotamus','Mouse'], name='Animals') #接受一个列表类型的数据

def __init__(self, data=None, index=None, dtype=None, name=None,

copy=False, fastpath=False):

Series的__init__方法

创建 Series 类型

第一:你可以传入一个列表或者是字典来创建 Series,如果传入的是列表,Python会自动把 [0,1,2] 作为 Series 的索引。

第二:如果你传入的是字符串类型的数据,Series 返回的dtype是object;如果你传入的是数字类型的数据,Series 返回的dtype是int64

>>> from pandas importSeries, DataFrame>>> importpandas as pd>>> animals = ['Tiger','Bear','Moose']>>> s1 =pd.Series(animals)>>>s1

0 Tiger1Bear2Moose

dtype: object>>> s2 = pd.Series([1,2,3])>>>s2

01

1 2

2 3dtype: int64

创建 Series

Series如何处理 NaN的数据?

>>> animals2 = ['Tiger','Bear',None]>>> s3 =pd.Series(animals2)>>>s3

0 Tiger1Bear2None

dtype: object>>> s4 = pd.Series([1,2,None])>>>s4

01.0

1 2.0

2NaN

dtype: float64

Series NaN数据(范例)

Series 中的 NaN数据和如何检查 NaN数据是否相等,这时候需要调用 np.isnan( )方法

>>> importnumpy as np>>> np.nan ==None

False>>> np.nan ==np.nan

False>>>np.isnan(np.nan)

True

np.isnan( )

Series 默应 Index 是 [0,1,2],但也可以自定义 Series 中的Index

>>> importnumpy as np>>> sports ={

...'Archery':'Bhutan',

...'Golf':'Scotland',

...'Sumo':'Japan',

...'Taekwondo':'South Korea'... }>>> s5 =pd.Series(sports)>>>s5

Archery Bhutan

Golf Scotland

Sumo Japan

Taekwondo South Korea

dtype: object>>>s5.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

自定义 Series 中的Index(例子一)

>>> from pandas importSeries, DataFrame>>> importpandas as pd>>> s6 = pd.Series(['Tiger','Bear','Moose'], index=['India','America','Canada'])>>>s6

India Tiger

America Bear

Canada Moose

dtype: object

自定义 Series 中的Index(例子一)

查询 Series 的数据有两种方法,第一是通过index方法 e.g. s.iloc[2];第二是通过label方法 e.g. s.loc['America']

>>> from pandas importSeries, DataFrame>>> importpandas as pd>>>s6

India Tiger

America Bear

Canada Moose

dtype: object>>> s6.iloc[2] #获取 index2位置的数据

'Moose'

>>> s6.loc['America'] #获取 label: America 的值

'Bear'

>>> s6[1] #底层调用了 s6.iloc[1]

'Bear'

>>> s6['India'] #底层调用了 s6.loc['India']

'Tiger'

查询Series(例子)

Series 的数据操作: sum( ),它底层也是调用 numpy 的方法

>>> s7 = pd.Series([100.00,120.00,101.00,3.00])>>>s7

0100.0

1 120.0

2 101.0

3 3.0dtype: float64>>> total =0>>> for item ins7:

... total+=item

...>>>total324.0

>>> total2 =np.sum(s7)>>>total2324.0

np.sum(s7)

>>> s8 = pd.Series(np.random.randint(0,1000,10000))>>>s8.head()

025

1 399

2 326

3 479

4 603dtype: int64>>>len(s8)10000

head( )例子

Series 也可以存储混合型数据

>>> s9 = pd.Series([1,2,3])>>> s9.loc['Animals'] = 'Bears'

>>>s9

01

1 2

2 3Animals Bears

dtype: object

混合型存储数据(例子)

Series 中的 append( ) 用法

>>> original_sports = pd.Series({'Archery':'Bhutan',

...'Golf':'Scotland',

...'Sumo':'Japan',

...'Taekwondo':'South Korea'})>>> cricket_loving_countries = pd.Series(['Australia', 'Barbados','Pakistan','England'],

... index=['Cricket','Cricket','Cricket','Cricket'])>>> all_countries =original_sports.append(cricket_loving_countries)>>>original_sports

Archery Bhutan

Golf Scotland

Sumo Japan

Taekwondo South Korea

dtype: object>>>cricket_loving_countries

Cricket Australia

Cricket Barbados

Cricket Pakistan

Cricket England

dtype: object>>>all_countries

Archery Bhutan

Golf Scotland

Sumo Japan

Taekwondo South Korea

Cricket Australia

Cricket Barbados

Cricket Pakistan

Cricket England

dtype: object

Series类型的append( )

DataFrame

这是创建一个DataFrame对象的基本语句:接受字典类型的数据;字典中的Key (e.g. Animals, Owners) 对应 DataFrame中的Columns,它的 Value 也相当于数据库表中的每一行数据。

data = {

'Animals':['Dog','Bear','Tiger','Moose','Giraffe','Hippopotamus','Mouse'],

'Owners':['Chris','Kevyn','Bob','Vinod','Daniel','Fil','Stephanie']

}

df = DataFrame(data, columns=['Animals','Owners'])

基础操作

创建DataFrame

>>> from pandas importSeries, DataFrame>>> importpandas as pd>>> data = {'name':['yahoo','google','facebook'],

...'marks':[200,400,800],

...'price':[9,3,7]}>>> df =DataFrame(data)>>>df

marks name price

0200 yahoo 9

1 400 google 3

2 800 facebook 7

创建DataFrame(例子一)

>>> df2 = DataFrame(data, columns=['name','price','marks'])>>>df2

name price marks

0 yahoo9 200

1 google 3 400

2 facebook 7 800

>>> df3 = DataFrame(data, columns=['name','price','marks'], index=['a','b','c'])>>>df3

name price marks

a yahoo9 200b google3 400c facebook7 800

>>> df4 = DataFrame(data, columns=['name','price','marks', 'debt'], index=['a','b','c'])>>>df4

name price marks debt

a yahoo9 200NaN

b google3 400NaN

c facebook7 800 NaN

创建DataFrame(例子二)

>>> importpandas as pd>>> purchase_1 = pd.Series({'Name':'Chris','Item Purchased':'Dog Food','Cost':22.50})>>> purchase_2 = pd.Series({'Name':'Kelvin','Item Purchased':'Kitty Litter','Cost':2.50})>>> purchase_3 = pd.Series({'Name':'Vinod','Item Purchased':'Bird Seed','Cost':5.00})>>>

>>> df = pd.DataFrame([purchase_1,purchase_2,purchase_3],index=['Store 1','Store 2','Store 1'])>>>df

Cost Item Purchased Name

Store1 22.5Dog Food Chris

Store2 2.5Kitty Litter Kelvin

Store1 5.0 Bird Seed Vinod

创建DataFrame(例子三)

查询 dataframe 的index:df.loc['index']

>>> df.loc['Store 2']

Cost2.5Item Purchased Kitty Litter

Name Kelvin

Name: Store2, dtype: object

df.loc['Store 2']

>>> df.loc['Store 1']

Cost Item Purchased Name

Store1 22.5Dog Food Chris

Store1 5.0 Bird Seed Vinod

df.loc['Store 1']

>>> df['Item Purchased']

Store1Dog Food

Store2Kitty Litter

Store1Bird Seed

Name: Item Purchased, dtype: object

df['Item Purchased']

查 store1 的 cost 是多少

>>> df.loc['Store 1', 'Cost']

Store1 22.5Store1 5.0Name: Cost, dtype: float64

df.loc['Store 1', 'Cost']

查询Cost大于3的Name

>>> df['Name'][df['Cost']>3]

Store1Chris

Store1Vinod

Name: Name, dtype: object

df['Name'][df['Cost']>3]

查询DataFrame 的类型

>>> type(df.loc['Store 2'])

type( )例子

drop dataframe (但这不会把原来的 dataframe drop 掉)

>>> df.drop('Store 1')

Cost Item Purchased Name

Store2 2.5Kitty Litter Kelvin>>>df

Cost Item Purchased Name

Store1 22.5Dog Food Chris

Store2 2.5Kitty Litter Kelvin

Store1 5.0 Bird Seed Vinod

df.drop('Store 1')

>>> copy_df =df.copy()>>>copy_df

Cost Item Purchased Name

Store1 22.5Dog Food Chris

Store2 2.5Kitty Litter Kelvin

Store1 5.0Bird Seed Vinod>>> copy_df = df.drop('Store 1')>>>copy_df

Cost Item Purchased Name

Store2 2.5 Kitty Litter Kelvin

把dataframe数据drop的例子

也可以用 del 把 Column 列删除掉

>>> del copy_df['Name']>>>copy_df

Cost Item Purchased

Store2 2.5 Kitty Litter

del copy_df['Name']

set_index

rename column

可以修改dataframe里的数据

>>> df = pd.DataFrame([purchase_1,purchase_2,purchase_3],index=['Store 1','Store 2','Store 1'])>>>df

Cost Item Purchased Name

Store1 22.5Dog Food Chris

Store2 2.5Kitty Litter Kelvin

Store1 5.0Bird Seed Vinod>>> df['Cost'] = df['Cost'] * 0.8

>>>df

Cost Item Purchased Name

Store1 18.0Dog Food Chris

Store2 2.0Kitty Litter Kelvin

Store1 4.0 Bird Seed Vinod

df['Cost'] * 0.8

>>> df = pd.DataFrame([purchase_1,purchase_2,purchase_3],index=['Store 1','Store 2','Store 1'])>>> costs = df['Cost']>>>costs

Store1 22.5Store2 2.5Store1 5.0Name: Cost, dtype: float64>>> costs += 2

>>>costs

Store1 24.5Store2 4.5Store1 7.0Name: Cost, dtype: float64

costs = df['Cost']

进阶操作

Merge

Full Outer Join

Inner Join

Left Join

Right Join

apply

group by

agg

astype

cut

s = pd.Series([168, 180, 174, 190, 170, 185, 179, 181, 175, 169, 182, 177, 180, 171])

pd.cut(s,3)

pd.cut(s,3, labels=['Small', 'Medium', 'Large'])

cut( )

pivot table

Date in DataFrame

Timestampe

period

DatetimeINdex

PeriodIndex

to_datetime

Timedelta

date_range

difference between date value

resample

asfreq - changing the frequency of the date

读取 csv 文件

import pandas as pd

pd.read_csv('student.csv')

读取csv

>>> from pandas importSeries, DataFrame>>> importpandas as pd>>> df_student = pd.read_csv('student.csv')>>>df_student

nameclassmarks age

janice python80 22alex python95 21peter python85 25ken java75 28lawerance java50 22

pd.read_csv('student.csv')(例子一)

df_student = pd.read_csv('student.csv', index_col=0, skiprows=1)

pd.read_csv('student.csv')(例子二)

获取分数大于70的数据

>>> df_student['marks'] > 70True

True

True

True

False

Name: marks, dtype: bool

方法一: df_student['marks'] > 70

>>> df_student.where(df_student['marks']>70)

nameclassmarks age

janice python80.0 22.0alex python95.0 21.0peter python85.0 25.0ken java75.0 28.0NaN NaN NaN NaN

方法二: df_student.where(df_student['marks']>70)

>>> df_student[df_student['marks'] > 70]

nameclassmarks age

0 janice python80 22

1 alex python 95 21

2 peter python 85 25

3 ken java 75 28

方法三: df_student[df_student['marks'] > 70]

获取class = 'python' 的数据,df.count( ) 是不会把 NaN数据计算在其中

>>> df2 = df_student.where(df_student['class'] == 'python')>>>df2

nameclassmarks age

0 janice python80.0 22.0

1 alex python 95.0 21.0

2 peter python 85.0 25.0

3NaN NaN NaN NaN4NaN NaN NaN NaN>>> df2 = df_student[df_student['class'] == 'python']>>>df2

nameclassmarks age

0 janice python80 22

1 alex python 95 21

2 peter python 85 25

df_student.where( )例子

计算 class 的数目 e.g. count( )

>>> df2['class'].count() #不会把 NaN也计算

3

>>> df_student['class'].count() #会把 NaN也计算

5

df.count( )例子

删取NaN数据

>>> df3 =df2.dropna()>>>df3

nameclassmarks age

0 janice python80.0 22.0

1 alex python 95.0 21.0

2 peter python 85.0 25.0

df2.dropna()

获取age大于23 学生的数据

>>>df_student

nameclassmarks age

0 janice python80 22

1 alex python 95 21

2 peter python 85 25

3 ken java 75 28

4 lawerance java 50 22

>>> df_student[df_student['age'] > 23]

nameclassmarks age2 peter python 85 25

3 ken java 75 28

>>> df_student['age'] > 230 False1False2True3True4False

Name: age, dtype: bool>>> len(df_student[df_student['age'] > 23])2

df_student[df_student['age'] > 23]

获取age大于23和分数大于80分学生的数据

>>>df_student

nameclassmarks age

0 janice python80 22

1 alex python 95 21

2 peter python 85 25

3 ken java 75 28

4 lawerance java 50 22

>>> df_and = df_student[(df_student['age'] > 23) & (df_student['marks'] > 80)]>>>df_and

nameclassmarks age2 peter python 85 25

df_student[(df_student['age'] > 23) & (df_student['marks'] > 80)]

获取age大于23或分数大于80分学生的数据

>>>df_student

nameclassmarks age

0 janice python80 22

1 alex python 95 21

2 peter python 85 25

3 ken java 75 28

4 lawerance java 50 22

>>> df_or = df_student[(df_student['age'] > 23) | (df_student['marks'] > 80)]>>>df_or

nameclassmarks age1 alex python 95 21

2 peter python 85 25

3 ken java 75 28

df_student[(df_student['age'] > 23) | (df_student['marks'] > 80)]

重新定义index的数值 df.set_index( )

>>> df_student = pd.read_csv('student.csv')>>>df_student

nameclassmarks age

0 janice python80 22

1 alex python 95 21

2 peter python 85 25

3 ken java 75 28

4 lawerance java 50 22

>>> df_student['order_id'] =df_student.index>>>df_student

nameclassmarks age order_id

0 janice python80 2201 alex python 95 21 1

2 peter python 85 25 2

3 ken java 75 28 3

4 lawerance java 50 22 4

>>> df_student = df_student.set_index('class')>>>df_student

name marks age order_idclasspython janice80 220

python alex95 21 1python peter85 25 2java ken75 28 3java lawerance50 22 4

df_student.set_index( )例子

获取在 dataframe column 中唯一的数据

>>> df_student = pd.read_csv('student.csv')>>> df_student['class'].unique()

array(['python', 'java'], dtype=object)

df.unique( )例子

python 的可视化 matplotlib

plot

參考資料

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值