2. Introduction to Pandas

You can think of pandas as an extremely powerful version of Excel, with a lot more features

Series

A Series is very similar to a NumPy array. What differentiates a NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

import numpy as np
import pandas as pd

Creating a Series

You can convert a list, numpy array, or dictionary to a Series

labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
pd.Series(data=my_list)
#0 10
#1 20
#2 30
#dtype: int64
pd.Series(data = my_list, index = labels)
pd.Series(my_list, labels)
#a 10
#b 20
#c 30
#dtype: int64
pd.Series(arr)
#0 10
#1 20
#2 30
#dtype: int64
pd.Series(arr,labels)
#a 10
#b 20
#c 30
#dtype: int64
pd.Series(d)
#a 10
#b 20
#c 30
#dtype: int64

Data in a Series

A pandas Series can hold a variety of object types

pd.Series(data = labels)
#0 a
#1 b
#2 c
#dtype: object
#even functions (although unlikely that you will use this)
pd.Series([sum,print,len])
#0 <built-in function sum>
#1 <built-in function print>
#2 <built-in function len>
#dtype: object

Using an Index

ser1 = pd.Series([1,2,3,4], index = ['USA','Germany','USSR','Japan'])
#USA     1
#Germany 2
#USSR    3
#Japan   4
#dtype: int64
ser2 = pd.Series([1,2,3,4], index = ['USA', 'Germany','Italy', 'Japan'])
#USA     1
#Germany 2
#Italy   3
#Japan   4
#dtype: int64
ser1['USA'] #1
ser1+ser2
# Operations align on index labels; labels missing from either Series give NaN
#Germany 4.0
#Italy   NaN
#Japan   8.0
#USA     2.0
#USSR    NaN
#dtype: float64
DataFrames

DataFrames are the workhorse of pandas and directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.

import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4), index = 'A B C D E'.split(), columns = 'W X Y Z'.split())


Selection and Indexing

df['W']
#A 2.706850
#B 0.651118
#C -2.018168
#D 0.188695
#E 0.190794
#Name: W, dtype: float64

# Pass a list of column names
df[['W','Z']]


#SQL Syntax (not recommended)
df.W
#A 2.706850
#B 0.651118
#C -2.018168
#D 0.188695
#E 0.190794
#Name: W, dtype: float64

DataFrame Columns are just Series

type(df['W']) #pandas.core.series.Series

Creating a new column

df['new'] = df['W'] + df['Y']


Removing Columns

df.drop('new',axis=1)

Not in place unless specified:

df

df.drop('new',axis=1,inplace=True)
# Can also drop rows in this way:
df.drop('E', axis=0)

Selecting Rows

df.loc['A']
#W 2.706850
#X 0.628133
#Y 0.907969
#Z 0.503826
#Name: A, dtype: float64

#Or select based off of position instead of label
df.iloc[2]
#W   -2.018168
#X 0.740122
#Y 0.528813
#Z -0.589001
#Name: C, dtype: float64

Selecting subset of rows and columns

df.loc['B','Y']
#-0.84807698340363147
df.loc[['A','B'], ['W','Y']]


Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy.

df


df>0


df[df>0]


df[df['W']>0]


df[df['W']>0]['Y']
#A    0.907969
#B -0.848077
#D -0.933237
#E 2.605967
#Name: Y, dtype: float64
df[df['W']>0][['Y','X']]

For two conditions you can use & and | with parentheses:

df[(df['W']>0)&(df['Y']>1)]
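
The same bracket pattern works for an "or" condition. A minimal sketch (not in the original notes), selecting rows where either condition holds:

df[(df['W']>0) | (df['Y']>1)]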

More Index Details

df


# reset to default 0,1,... n index
df.reset_index()


newind = 'CA NY WY OR CO'.split()
df['States'] = newind


df.set_index('States')

Not in place unless specified:

df

df.set_index('States', inplace = True)
df

Multi-Index and Index Hierarchy

outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
#MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
#           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2), index = hier_index, columns = ['A','B'])
df

For index hierarchy we use df.loc[]; if this were on the column axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:

df.loc['G1']

df.loc['G1'].loc[1]
#A    0.153661
#B    0.167638
#Name: 1, dtype: float64
#Cross Section function df.xs()
df.index.names
#FrozenList([None, None])
df.index.names = ['Group', 'Num']
df

df.xs('G1')

df.xs(('G1',1))
#A    0.153661
#B    0.167638
#Name: (G1, 1), dtype: float64
df.xs(1,level = 'Num')


Supplement: NumPy's np.random.seed() function

np.random.seed() makes the generated random numbers predictable.

Predictable here means that the same seed value produces the same sequence of random numbers. If no seed is set, the seed is derived from the current time, so each run produces different numbers.

Syntax:

numpy.random.seed(seed=None)

Input:

The seed parameter is optional; it defaults to None and accepts an integer or an array of integers.

Note, however, that a seed is only effective once: to get the same random numbers each time, set the same seed again before each call to the random-number function. A sketch of this behavior is given below.
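
A minimal sketch (not from the original post) illustrating the one-shot nature of the seed:

import numpy as np

np.random.seed(10)
print(np.random.rand(3))   # same three numbers on every run
print(np.random.rand(3))   # different numbers: the seed was consumed by the first call

np.random.seed(10)
print(np.random.rand(3))   # re-seeding reproduces the first output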

In machine learning and deep learning, this technique can be used to make the random initialization of parameters (for example the weight matrix W) reproducible.


Missing Data
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
                'B':[5,np.nan,np.nan],
                'C':[1,2,3]})

df.dropna()

Not in place by default; drops rows (axis=0) by default:

df.dropna(axis=1)

df.dropna(thresh=2) # keep rows with at least 2 non-NaN values

df.fillna(value='FILL VALUE')

df['A'].fillna(value=df['A'].mean())
#0 1.0
#1 2.0
#2 1.5
#Name: A, dtype: float64
Groupby

The groupby method allows you to group rows of data together and call aggregate functions

import pandas as pd
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)

Now you can use the .groupby() method to group rows together based on a column name.

by_comp = df.groupby("Company")
by_comp.mean()

# same as df.groupby('Company').mean()
by_comp.std()
by_comp.min()
by_comp.max()
by_comp.count()
by_comp.describe()


by_comp.describe().transpose()


by_comp.describe().transpose()['GOOG']


Merging, Joining, and Concatenating
Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together.

import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])


df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                     index=[4, 5, 6, 7])


df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])


pd.concat([df1,df2,df3])


pd.concat([df1,df2,df3],axis=1)


Merging

The merge function allows you to merge DataFrames together using similar logic to merging SQL tables.

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})


right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})    


pd.merge(left,right,how='inner', on='key')


left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})


right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})


pd.merge(left, right, on=['key1', 'key2'])


pd.merge(left, right, how='outer', on=['key1', 'key2'])


pd.merge(left, right, how='right', on=['key1', 'key2'])


pd.merge(left, right, how='left', on=['key1', 'key2'])


Joining

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2'])


right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])


left.join(right)


left.join(right, how='outer')


Operations
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head() # return the first n rows (5 by default)


Info on unique values

df['col2'].unique()
#array([444, 555, 666])
df['col2'].nunique()
#3
df['col2'].value_counts()
#444 2
#555 1
#666 1
#Name: col2, dtype: int64

Selecting Data

newdf = df[(df['col1']>2) & (df['col2']==444)]


Applying Functions

def times2(x):
    return x*2
df['col1'].apply(times2)
#0 2
#1 4
#2 6
#3 8
#Name: col1, dtype: int64
#same as: df['col1'].apply(lambda x:x*2)
df['col3'].apply(len)
#0 3
#1 3
#2 3
#3 3
#Name: col3, dtype: int64
df['col1'].sum()
#10

Use .apply to send every value in a column to a function.

Use .apply with axis=1 to send every row to a function, as in the sketch below.
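
A minimal sketch (not part of the original notes) of the axis=1 form, combining two numeric columns row by row:

df.apply(lambda row: row['col1'] + row['col2'], axis=1)
#0 445
#1 557
#2 669
#3 448
#dtype: int64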

Permanently Removing a Column

del df['col1']

Get column and index names

df.columns
#Index(['col2', 'col3'], dtype='object')
df.index
#RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame
df.sort_values(by='col2') # inplace=False by default

Pivot Table
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)


df.pivot_table(values='D',index=['A', 'B'],columns=['C'])


Data Input and Output
import numpy as np
import pandas as pd

CSV

df = pd.read_csv('example')


df.to_csv('example',index=False)
# index=True (the default) writes the index as the first column; index=False omits it

Excel

pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

#Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]

SQL

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and reduce dependency on DB-specific APIs. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python's standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter that respects the Python DB-API.

See also some cookbook examples for some advanced strategies.

The key functions are:

  • read_sql_table(table_name, con[, schema, …])
    • Read SQL database table into a DataFrame.
  • read_sql_query(sql, con[, index_col, …])
    • Read SQL query into a DataFrame.
  • read_sql(sql, con[, index_col, …])
    • Read SQL query or database table into a DataFrame.
  • DataFrame.to_sql(name, con[, flavor, …])
    • Write records stored in a DataFrame to a SQL database.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
df.to_sql('data',engine)
sql_df = pd.read_sql('data', con=engine)
sql_df

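As a small sketch (not in the original post), read_sql_query can run an arbitrary SQL statement against the same in-memory engine:

pd.read_sql_query('SELECT * FROM data', con=engine)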
