I. Getting Started with Pandas
1. Installation
pip install pandas
2. Checking the pandas version
import pandas
print(pandas.__version__)  # show the installed version
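II. Pandas Data Structures: Series
A Series is like a single column of a table: a one-dimensional array that can hold any data type.
pandas.Series(data,index,dtype,name,copy)
data: the values to store
index: the index labels for the values
dtype: the data type; inferred automatically by default
name: a name for the Series
copy: whether to copy the input data; defaults to False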
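As a minimal sketch of how these parameters fit together, the snippet below passes data, index, dtype, and name explicitly. The values and labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical values and labels, chosen only for illustration.
temps = pd.Series(
    data=[21, 23, 19],
    index=["mon", "tue", "wed"],  # custom labels instead of 0, 1, 2
    dtype="float64",              # force floats instead of the inferred int64
    name="temperature",           # shown when the Series is printed
)
print(temps)
```

Leaving out dtype would let pandas infer int64 here; leaving out index would give the labels 0, 1, 2.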
1. A simple Series example
import pandas as pd
a=[1,2,3]
myvar=pd.Series(a)
print(myvar)
2. When no index is specified, the labels start at 0, and values can be read by their index label
import pandas as pd
a=[1,2,3]
myvar=pd.Series(a)
print(myvar[1])
Output:
2
3. Specifying an index
import pandas as pd
a=["Google","runoob","wiki"]
myvar=pd.Series(a,index=["x","y","z"])
print(myvar)
Output:
x Google
y runoob
z wiki
dtype: object
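Once the labels are no longer 0, 1, 2, it helps to say explicitly whether a lookup is by label or by position. The sketch below uses the standard .loc and .iloc accessors on a small made-up Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["x", "y", "z"])

by_label = s.loc["y"]      # look up by index label
by_position = s.iloc[1]    # look up by 0-based integer position
print(by_label, by_position)
```

Both lookups reach the same element here, but only .iloc keeps working if the labels are later renamed.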
4. Creating a Series from a key/value object such as a dict:
import pandas as pd
sites={1:"Google",2:"Runoob",3:"Wiki"}
myvar=pd.Series(sites)
print(myvar)
Output (the dict keys become the index labels):
1 Google
2 Runoob
3 Wiki
dtype: object
Example: to keep only part of the dict, pass just the index labels you need
import pandas as pd
sites={1:"Google",2:"Runoob",3:"Wiki"}
myvar=pd.Series(sites,index=[1,2])
print(myvar)
Output:
1 Google
2 Runoob
dtype: object
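A related case worth knowing: if the index asks for a label that is not a key of the dict, pandas keeps the label and fills the value with NaN. In this small sketch the label 4 is a hypothetical key that does not exist in the dict.

```python
import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

# Label 4 is not a key of the dict (hypothetical label for illustration).
subset = pd.Series(sites, index=[2, 4])
print(subset)
```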
Example: setting the name parameter of a Series
import pandas as pd
sites={1:"Google",2:"Runoob",3:"Wiki"}
myvar=pd.Series(sites,index=[1,2],name="Runoob-Series-TEST")
print(myvar)
Output:
1 Google
2 Runoob
Name: Runoob-Series-TEST, dtype: object
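III. DataFrame
A DataFrame is a tabular data structure that has both a row index and a column index.
It can be viewed as a dict of Series that share the same index.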
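Format:
pandas.DataFrame(data,index,columns,dtype,copy)
data: the data to store (for example a list, dict, ndarray, or Series)
index: the row labels; defaults to RangeIndex (0, 1, 2, ..., n)
columns: the column labels; defaults to RangeIndex (0, 1, 2, ..., n)
dtype: the data type
copy: whether to copy the input data; defaults to False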
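To make the "dict of Series" view concrete, here is a small sketch that builds a DataFrame from two Series with made-up values. The names calories and duration are illustrative; note how rows are aligned on the index labels and a missing label becomes NaN.

```python
import pandas as pd

# Two Series with overlapping but not identical labels (made-up values).
calories = pd.Series([420, 380], index=["day1", "day2"])
duration = pd.Series([50, 40, 45], index=["day1", "day2", "day3"])

# Each Series becomes a column; rows are aligned on the index labels.
df = pd.DataFrame({"calories": calories, "duration": duration})
print(df)
```

Because calories has no entry for day3, that cell is NaN and the column is stored as float.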
1. A DataFrame is a two-dimensional structure, similar to a 2D array, and can be built from a list of rows
import pandas as pd
data=[['Google',10],['Runoob',12],['Wiki',13]]
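df=pd.DataFrame(data,columns=['Sites','Age'])
# Convert only the numeric column; newer pandas versions raise an error when dtype=float is applied to a frame that contains strings
df['Age']=df['Age'].astype(float)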
print(df)
Output:
Sites Age
0 Google 10.0
1 Runoob 12.0
2 Wiki 13.0
2. Creating from a dict of equal-length lists (or ndarrays)
import pandas as pd
data={'Site':['Google','Runoob','Wiki'],'Age':[10,12,13]}
df=pd.DataFrame(data)
print(df)
Output:
Site Age
0 Google 10
1 Runoob 12
2 Wiki 13
3. Creating from a list of dicts (key/value), where the dict keys become the column names
import pandas as pd
data=[{'a':1,'b':2},{'a':5,'b':10,'c':20}]
df=pd.DataFrame(data)
print(df)
Output (cells with no matching key are filled with NaN):
a b c
0 1 2 NaN
1 5 10 20.0
4. The loc attribute returns the row with a given label; when no index is set, the first row has label 0 and the second row has label 1
import pandas as pd
data={
"calories":[420,380,390],
"duration":[50,40,45]
}
df=pd.DataFrame(data)
# return the first row
print(df.loc[0])
# return the second row
print(df.loc[1])
Output:
calories 420
duration 50
Name: 0, dtype: int64
calories 380
duration 40
Name: 1, dtype: int64
5. loc can also return several rows: use the [[...]] form, where ... is the list of row labels separated by commas
import pandas as pd
data={
"calories":[420,380,390],
"duration":[50,40,45]
}
df=pd.DataFrame(data)
# return the first and second rows
print(df.loc[[0,1]])
Output (the result is itself a pandas DataFrame):
calories duration
0 420 50
1 380 40
6. Specifying custom index labels
import pandas as pd
data={
"calories":[420,380,390],
"duration":[50,40,45]
}
df=pd.DataFrame(data,index=["day1","day2","day3"])
print(df)
Output:
calories duration
day1 420 50
day2 380 40
day3 390 45
7. With custom labels, loc returns the row that matches a given index label
import pandas as pd
data={
"calories":[420,380,390],
"duration":[50,40,45]
}
df=pd.DataFrame(data,index=["day1","day2","day3"])
# look up by index label
print(df.loc["day2"])
Output:
calories 380
duration 40
Name: day2, dtype: int64
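Beyond whole rows, loc also accepts a column label and label slices, and iloc selects by position. The following is a short sketch on the same calories/duration data.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

one_cell = df.loc["day2", "calories"]   # single value by row and column label
label_slice = df.loc["day1":"day2"]     # label slices include both endpoints
first_row = df.iloc[0]                  # first row by integer position

print(one_cell)
print(label_slice)
print(first_row)
```

Unlike ordinary Python slices, a loc label slice keeps the end label, so the slice above returns two rows.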
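IV. Pandas CSV
to_string() renders the entire DataFrame as text. Without it, print(df) on a long DataFrame shows by default only the first 5 and last 5 rows, with ... in between.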
import pandas as pd
df=pd.read_csv('nba.csv')
print(df.to_string())
Output:
Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
5 Amir Johnson Boston Celtics 90.0 PF 29.0 6-9 240.0 NaN 12000000.0
1. Saving a DataFrame to a CSV file with to_csv()
import pandas as pd
# three fields: name, site, age
nme = ["Google", "Runoob", "Taobao", "Wiki"]
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ag = [90, 40, 80, 98]
# build a dict of columns (named data so it does not shadow the built-in dict)
data = {'name': nme, 'site': st, 'age': ag}
df = pd.DataFrame(data)
# save the DataFrame
df.to_csv('site.csv')
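By default to_csv() also writes the row index as an unnamed first column. A minimal sketch of turning that off with index=False follows; it writes to an in-memory io.StringIO buffer instead of a real file, so it can be run without touching the disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Google", "Runoob"], "age": [90, 40]})

buffer = io.StringIO()           # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)   # index=False omits the 0, 1, ... row labels
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing a file name such as 'site.csv' instead of the buffer works the same way.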
2. Data processing
**head(n)** returns the first n rows; if n is omitted, it returns 5 rows by default
import pandas as pd
df=pd.read_csv('nba.csv')
print(df.head())
Output:
Name Team Number ... Weight College Salary
0 Avery Bradley Boston Celtics 0.0 ... 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 ... 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 ... 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 ... 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 ... 231.0 NaN 5000000.0
[5 rows x 9 columns]
3. Example: reading the first 10 rows
import pandas as pd
df=pd.read_csv('nba.csv')
print(df.head(10))
Output:
Name Team Number ... Weight College Salary
0 Avery Bradley Boston Celtics 0.0 ... 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 ... 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 ... 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 ... 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 ... 231.0 NaN 5000000.0
5 Amir Johnson Boston Celtics 90.0 ... 240.0 NaN 12000000.0
6 Jordan Mickey Boston Celtics 55.0 ... 235.0 LSU 1170960.0
7 Kelly Olynyk Boston Celtics 41.0 ... 238.0 Gonzaga 2165160.0
8 Terry Rozier Boston Celtics 12.0 ... 190.0 Louisville 1824360.0
9 Marcus Smart Boston Celtics 36.0 ... 220.0 Oklahoma State 3431040.0
[10 rows x 9 columns]
4. tail(n)
tail(n) returns the last n rows; if n is omitted, it returns 5 rows by default. For an empty row, every field is returned as NaN
import pandas as pd
df=pd.read_csv('nba.csv')
print(df.tail())
Output:
Name Team Number Position ... Height Weight College Salary
453 Shelvin Mack Utah Jazz 8.0 PG ... 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG ... 6-1 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C ... 7-3 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 C ... 7-0 231.0 Kansas 947276.0
457 NaN NaN NaN NaN ... NaN NaN NaN NaN
[5 rows x 9 columns]
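The all-NaN last row above is what pandas produces for a line that contains only separators; that nba.csv ends with such a line is an assumption, since the file itself is not shown. The sketch below reproduces the effect with a tiny made-up CSV and removes the row with dropna(how="all").

```python
import io
import pandas as pd

# Tiny made-up stand-in for nba.csv; the last line holds only a separator.
csv_text = "Name,Team\nTom,Jazz\nAnn,Celtics\n,\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.tail(1))                 # a row where every field is NaN

cleaned = df.dropna(how="all")    # drop rows in which all fields are NaN
print(cleaned)
```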
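5. info()
info() prints basic information about the table: the number of rows, the column names, the non-null count of each column, and the dtypes. It writes the summary itself and returns None, which is why print(df.info()) ends with a None line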
import pandas as pd
df=pd.read_csv('nba.csv')
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 457 non-null object
1 Team 457 non-null object
2 Number 457 non-null float64
3 Position 457 non-null object
4 Age 457 non-null float64
5 Height 457 non-null object
6 Weight 457 non-null float64
7 College 373 non-null object
8 Salary 446 non-null float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB
None
V. JSON
Pandas makes it easy to work with JSON data; read_json() loads a JSON file into a DataFrame.
import pandas as pd
df=pd.read_json('sites.json')
print(df.to_string())
Output:
id name url likes
0 A001 菜鸟教程 www.runoob.com 61
1 A002 Google www.google.com 124
2 A003 淘宝 www.taobao.com 45
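The contents of sites.json are not shown, so the following sketch assumes it is a JSON array of records with the four fields seen in the output. It reads the same kind of data from an in-memory string, which makes the example runnable without the file.

```python
import io
import pandas as pd

# Assumed layout of sites.json: a list of records (not shown in the original).
json_text = """
[
  {"id": "A001", "name": "Runoob", "url": "www.runoob.com", "likes": 61},
  {"id": "A002", "name": "Google", "url": "www.google.com", "likes": 124}
]
"""

df = pd.read_json(io.StringIO(json_text))
print(df.to_string())
```

Each record becomes a row and each key becomes a column.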
1. A JSON object has the same shape as a Python dict, so a Python dict can be converted directly into a DataFrame
import pandas as pd
# JSON in dict form
s = {
"col1":{"row1":1,"row2":2,"row3":3},
"col2":{"row1":"x","row2":"y","row3":"z"}
}
# convert the dict into a DataFrame
df = pd.DataFrame(s)
print(df)
Output:
col1 col2
row1 1 x
row2 2 y
row3 3 z
2. Reading JSON data from a URL
import pandas as pd
URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)
Output:
id name url likes
0 A001 菜鸟教程 www.runoob.com 61
1 A002 Google www.google.com 124
2 A003 淘宝 www.taobao.com 45
3. Nested JSON data
import pandas as pd
df = pd.read_json('nested_list.json')
print(df)
Output:
school_name class students
0 ABC primary school Year 1 {'id': 'A001', 'name': 'Tom', 'math': 60, 'phy...
1 ABC primary school Year 1 {'id': 'A002', 'name': 'James', 'math': 89, 'p...
2 ABC primary school Year 1 {'id': 'A003', 'name': 'Jenny', 'math': 79, 'p...
4. json_normalize() fully flattens the nested data
import pandas as pd
import json
# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())
# flatten the data
df_nested_list = pd.json_normalize(data, record_path =['students'])
print(df_nested_list)
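Example: the call above passes record_path=['students'] to expand the nested students list. The version below also passes meta, so that school_name and class are repeated on every row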
import pandas as pd
import json
# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())
# flatten the data
df_nested_list = pd.json_normalize(
    data,
    record_path =['students'],
    meta=['school_name', 'class']
)
print(df_nested_list)
Output:
id name math physics chemistry school_name class
0 A001 Tom 60 66 61 ABC primary school Year 1
1 A002 James 89 76 51 ABC primary school Year 1
2 A003 Jenny 79 90 78 ABC primary school Year 1
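For readers without nested_list.json at hand, the same call can be tried on an inline dict. The structure below is an assumption inferred from the printed output (trimmed to two students and one subject), not the actual file.

```python
import pandas as pd

# Assumed shape of nested_list.json, trimmed for illustration.
data = {
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "math": 60},
        {"id": "A002", "name": "James", "math": 89},
    ],
}

flat = pd.json_normalize(
    data,
    record_path=["students"],         # one output row per student
    meta=["school_name", "class"],    # top-level fields copied onto every row
)
print(flat)
```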
5. Reading one field from nested data
The example reads the nested math field from the following file (nested_deep.json):
{
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "grade": {
            "math": 60,
            "physics": 66,
            "chemistry": 61
        }
    },
    {
        "id": "A002",
        "name": "James",
        "grade": {
            "math": 89,
            "physics": 76,
            "chemistry": 51
        }
    },
    {
        "id": "A003",
        "name": "Jenny",
        "grade": {
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }
    }]
}
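This uses the glom module, which lets you reach attributes of nested objects with a dotted path such as grade.math.
Install it first:
pip3 install glom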
import pandas as pd
from glom import glom
df = pd.read_json('nested_deep.json')
data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)
Output:
0 60
1 89
2 79
Name: students, dtype: int64
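If installing glom is not an option, a pandas-only alternative (a different technique from the glom call above) is to let json_normalize() flatten the inner dict into dotted column names. The sketch below uses an inline copy of the data, trimmed to two students.

```python
import pandas as pd

# Inline, trimmed copy of the nested structure shown above.
data = {
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "grade": {"math": 60, "physics": 66}},
        {"id": "A002", "name": "James", "grade": {"math": 89, "physics": 76}},
    ],
}

# Nested dicts are expanded into columns named parent.child.
flat = pd.json_normalize(data["students"])
math_scores = flat["grade.math"]
print(math_scores)
```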
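VI. Data Cleaning
Data cleaning is the process of handling data that is unusable as-is.
Many datasets contain missing data, wrongly formatted values, incorrect data, or duplicates.
1. Cleaning empty values
To remove rows that contain empty fields, use dropna().
Format: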
DataFrame.dropna(axis=0,how='any',thresh=None,subset=None,inplace=False)
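axis: defaults to 0, which drops a row when it contains an empty value; axis=1 drops the column instead
how: defaults to 'any', which drops the row (or column) as soon as any one value in it is NA; how='all' drops it only when every value is NA
thresh: the minimum number of non-empty values a row (or column) must have to be kept
subset: the columns to check; pass a list of column names to check several
inplace: if True, the original data is modified and None is returned instead of a new DataFrame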
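The parameters above are easier to compare side by side. This sketch applies how, thresh, and subset to a small made-up DataFrame; the column names a, b, c are arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 is complete, row 3 is entirely empty.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 9.0, np.nan],
})

any_missing = df.dropna()                # drop rows with at least one NaN
all_missing = df.dropna(how="all")       # drop rows where every value is NaN
at_least_two = df.dropna(thresh=2)       # keep rows with 2 or more non-NaN values
only_check_b = df.dropna(subset=["b"])   # only look at column "b"

for result in (any_missing, all_missing, at_least_two, only_check_b):
    print(list(result.index))
```

None of these calls uses inplace=True, so df itself is left unchanged.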
Example: use isnull() to check whether each cell is empty
import pandas as pd
df=pd.read_csv('property-data.csv')
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
2. Pandas treats n/a and NA as empty values, but not na or --. When that does not match the data, we can tell read_csv() which strings count as empty
import pandas as pd
missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values = missing_values)
print (df['NUM_BEDROOMS'])
print (df['NUM_BEDROOMS'].isnull())
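Since property-data.csv is not included here, the sketch below uses a small made-up CSV string to show the difference between the default markers and a custom na_values list.

```python
import io
import pandas as pd

# Made-up stand-in for property-data.csv with several missing-value markers.
csv_text = "PID,NUM_BEDROOMS\n1,3\n2,n/a\n3,na\n4,--\n5,NA\n"

default_df = pd.read_csv(io.StringIO(csv_text))
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

default_null = default_df["NUM_BEDROOMS"].isnull().tolist()
custom_null = custom_df["NUM_BEDROOMS"].isnull().tolist()
print(default_null)   # only n/a and NA are recognized
print(custom_null)    # na and -- are recognized as well
```

The strings in na_values are added to the default markers, so n/a and NA stay empty in both reads.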
3. Removing rows that contain empty values
import pandas as pd
df = pd.read_csv('property-data.csv')
new_df = df.dropna()
print(new_df.to_string())
To modify the original DataFrame instead of getting a new one, use inplace=True
import pandas as pd
df = pd.read_csv('property-data.csv')
df.dropna(inplace = True)
print(df.to_string())
Removing rows that have an empty value in a specific column
import pandas as pd
df = pd.read_csv('property-data.csv')
df.dropna(subset=['ST_NUM'], inplace = True)
print(df.to_string())
4. Replacing empty fields with fillna()
Example: replace every empty field with 12345
import pandas as pd
df = pd.read_csv('property-data.csv')
df.fillna(12345, inplace = True)
print(df.to_string())
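Example: replacing empty values in a single column
Replace the empty values in PID with 12345
import pandas as pd
df = pd.read_csv('property-data.csv')
df['PID'] = df['PID'].fillna(12345)  # assign back; inplace=True on a selected column is unreliable in recent pandas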
print(df.to_string())
Example: use mean() to compute the column average and fill the empty cells with it
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mean()
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign the filled column back
print(df.to_string())
Example: use median() to compute the column median and fill the empty cells with it
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].median()
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign the filled column back
print(df.to_string())
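Example: use mode() to compute the most frequent value of the column and fill the empty cells with it
import pandas as pd
df = pd.read_csv('property-data.csv')
x = df["ST_NUM"].mode()[0]  # mode() returns a Series; take its first value
df["ST_NUM"] = df["ST_NUM"].fillna(x)  # assign the filled column back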
print(df.to_string())
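When several columns need different replacement values, fillna() also accepts a dict that maps column names to fill values. The sketch below uses made-up numbers in place of property-data.csv.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for two columns of property-data.csv.
df = pd.DataFrame({
    "PID": [100.0, np.nan, 300.0],
    "ST_NUM": [104.0, 197.0, np.nan],
})

# One fill value per column: a constant for PID, the mean for ST_NUM.
fill_values = {"PID": 12345, "ST_NUM": df["ST_NUM"].mean()}
filled = df.fillna(fill_values)
print(filled.to_string())
```

This returns a new DataFrame and leaves df untouched, which avoids the pitfalls of inplace=True.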
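5. Cleaning wrongly formatted data
Cells in the wrong format make data analysis difficult.
We can either remove the rows that contain such cells or convert every cell in the column to the same format.
Example: normalizing dates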
import pandas as pd
# the third date uses a different format
data = {
"Date": ['2020/12/01', '2020/12/02' , '20201226'],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
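# format='mixed' (pandas >= 2.0) parses each value on its own; older versions do so without this argument
df['Date'] = pd.to_datetime(df['Date'], format='mixed')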
print(df.to_string())
Output:
Date duration
day1 2020-12-01 50
day2 2020-12-02 40
day3 2020-12-26 45
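Some values cannot be parsed as dates at all. One way to handle them, sketched here with a made-up bad value, is errors="coerce", which turns unparseable entries into NaT so they can then be dropped or filled like any other empty cell.

```python
import pandas as pd

# The last entry is a made-up value that is not a date.
dates = pd.Series(["2020/12/01", "2020/12/02", "not a date"])

parsed = pd.to_datetime(dates, format="%Y/%m/%d", errors="coerce")
print(parsed)

valid_only = parsed.dropna()   # remove the entries that failed to parse
print(valid_only)
```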
6. Cleaning incorrect data
Incorrect values can be replaced or removed.
Example: replacing an incorrect age
import pandas as pd
person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 40, 12345] # an age of 12345 is clearly wrong
}
df = pd.DataFrame(person)
df.loc[2, 'age'] = 30 # overwrite the bad value
print(df.to_string())
Output:
name age
0 Google 50
1 Runoob 40
2 Taobao 30
Example: capping every age above 120 at 120
import pandas as pd
person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 200, 12345]
}
df = pd.DataFrame(person)
for x in df.index:
    if df.loc[x, "age"] > 120:
        df.loc[x, "age"] = 120
print(df.to_string())
Output:
name age
0 Google 50
1 Runoob 120
2 Taobao 120
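The row-by-row loop works, but the same cap can be written without a loop. The sketch below shows two common vectorized forms on the same data, a boolean mask with loc and Series.clip(); it is an alternative to the loop, not the method used above.

```python
import pandas as pd

person = {"name": ["Google", "Runoob", "Taobao"], "age": [50, 200, 12345]}
df = pd.DataFrame(person)

# Variant 1: a boolean mask selects the rows to overwrite.
masked = df.copy()
masked.loc[masked["age"] > 120, "age"] = 120

# Variant 2: clip() caps every value at the given upper bound.
clipped = df["age"].clip(upper=120)

print(masked.to_string())
print(clipped.tolist())
```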
Example: removing the rows where age is above 120
import pandas as pd
person = {
"name": ['Google', 'Runoob' , 'Taobao'],
"age": [50, 40, 12345] # an age of 12345 is clearly wrong
}
df = pd.DataFrame(person)
for x in df.index:
    if df.loc[x, "age"] > 120:
        df.drop(x, inplace = True)
print(df.to_string())
Output:
name age
0 Google 50
1 Runoob 40
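Removing rows can likewise be expressed as a filter: keep the rows whose age is plausible instead of dropping the others one at a time. A minimal sketch on the same data:

```python
import pandas as pd

person = {"name": ["Google", "Runoob", "Taobao"], "age": [50, 40, 12345]}
df = pd.DataFrame(person)

# Keep only the rows whose age is at most 120; df itself is not modified.
valid = df[df["age"] <= 120]
print(valid.to_string())
```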
7. Cleaning duplicate data
Use duplicated() and drop_duplicates().
duplicated() returns True for a row that repeats an earlier row, and False otherwise.
import pandas as pd
person = {
"name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(person)
print(df.duplicated())
Output:
0 False
1 False
2 True
3 False
dtype: bool
Example: removing duplicate rows directly with drop_duplicates()
import pandas as pd
persons = {
"name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
"age": [50, 40, 40, 23]
}
df = pd.DataFrame(persons)
df.drop_duplicates(inplace = True)
print(df)
Output:
name age
0 Google 50
1 Runoob 40
3 Taobao 23
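Note the gap in the index above (0, 1, 3): dropped rows keep their original labels. The sketch below, on slightly changed made-up data, shows the subset and keep parameters of drop_duplicates() together with reset_index(drop=True) to renumber the rows.

```python
import pandas as pd

# Made-up variation: the two Runoob rows differ in age, so they are
# duplicates by name only, not as whole rows.
persons = {
    "name": ["Google", "Runoob", "Runoob", "Taobao"],
    "age": [50, 40, 41, 23],
}
df = pd.DataFrame(persons)

# Compare only the name column and keep the last occurrence of each name.
by_name = df.drop_duplicates(subset=["name"], keep="last")
by_name = by_name.reset_index(drop=True)   # renumber rows as 0, 1, 2
print(by_name)
```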