Pandas Notes

This article covers getting started with the Pandas library, including creating and manipulating the Series and DataFrame data structures and reading and writing CSV files. It also explains how to handle missing values and format errors, using cleaning methods such as dropna() and fillna().

I. Getting to Know Pandas

1. Installation

pip install pandas

2. Checking the pandas version

import pandas
pandas.__version__  # check the installed version

II. Pandas Data Structure: Series

A Series is like a single column in a table: a one-dimensional, array-like structure that can hold data of any type.

pandas.Series(data, index, dtype, name, copy)

data: the data
index: index labels for the data
dtype: data type; inferred automatically by default (see the sketch after this list)
name: a name for the Series
copy: whether to copy the input data; defaults to False
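
The examples below exercise data, index, and name; as a minimal sketch (not part of the original), dtype can be passed explicitly as well:

import pandas as pd

# force a float dtype and attach a name to the Series
s = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype=float, name="demo")
print(s)       # values are stored as 1.0, 2.0, 3.0
print(s.name)  # demo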

1. A simple Series example

import pandas as pd

a=[1,2,3]
myvar=pd.Series(a)
print(myvar)

Output:

0    1
1    2
2    3
dtype: int64

2. If no index is specified, it starts from 0, and we can read values by their index

import pandas as pd

a=[1,2,3]
myvar=pd.Series(a)
print(myvar[1])

Output:

2

3. Specifying an index

import pandas as pd

a=["Google","runoob","wiki"]
myvar=pd.Series(a,index=["x","y","z"])
print(myvar)

Output:

x    Google
y    runoob
z      wiki
dtype: object

4. Creating a Series from a key/value object, such as a dict:

import pandas as pd

sites={1:"Google",2:"Runoob",3:"Wiki"}
myvar=pd.Series(sites)
print(myvar)

Output (the dict keys become the index):

1    Google
2    Runoob
3      Wiki
dtype: object

Example: if only part of the dict is needed, just specify the indexes of the data you want

import pandas as pd

sites={1:"Google",2:"Runoob",3:"Wiki"}
myvar=pd.Series(sites,index=[1,2])

print(myvar)

Output:

1    Google
2    Runoob
dtype: object

Example: setting the Series name parameter

import pandas as pd

sites={1:"Google",2:"Runoob",3:"Wiki"}
myvar=pd.Series(sites,index=[1,2],name="Runoob-Series-TEST")

print(myvar)

Output:

1    Google
2    Runoob
Name: Runoob-Series-TEST, dtype: object

III. DataFrame

A DataFrame is a tabular data structure with both a row index and column labels.
It can be thought of as a dict of Series objects.
Signature:

pandas.DataFrame(data, index, columns, dtype, copy)

data: the data
index: row labels; defaults to RangeIndex(0, 1, 2, …, n)
columns: column labels; defaults to RangeIndex(0, 1, 2, …, n)
dtype: the data type
copy: whether to copy the input data; defaults to False

1. A DataFrame is a two-dimensional, table-like structure, similar to a 2D array

import pandas as pd

data=[['Google',10],['Runoob',12],['Wiki',13]]
df=pd.DataFrame(data,columns=['Sites','Age'],dtype=float)
print(df)

Output:

    Sites   Age
0  Google  10.0
1  Runoob  12.0
2    Wiki  13.0


2. Creating from a dict of lists (ndarrays work the same way)

import pandas as pd

data={'Site':['Google','Runoob','Wiki'],'Age':[10,12,13]}
df=pd.DataFrame(data)
print(df)

Output:

     Site  Age
0  Google   10
1  Runoob   12
2    Wiki   13

3. Creating from a list of dicts, where the dict keys become the column names

import pandas as pd

data=[{'a':1,'b':2},{'a':5,'b':10,'c':20}]
df=pd.DataFrame(data)
print(df)

Output (entries without a corresponding value become NaN):

   a   b     c
0  1   2   NaN
1  5  10  20.0

4. Pandas can return the data of a specified row with the loc attribute. If no index is set, the first row has index 0, the second row index 1

import pandas as pd

data={
    "calories":[420,380,390],
    "duration":[50,40,45]
}
df=pd.DataFrame(data)

# return the first row
print(df.loc[0])
# return the second row
print(df.loc[1])

Output:

calories    420
duration     50
Name: 0, dtype: int64
calories    380
duration     40
Name: 1, dtype: int64

5. Multiple rows can also be returned with the [[…]] format, where … is a comma-separated list of row indexes

import pandas as pd

data={
    "calories":[420,380,390],
    "duration":[50,40,45]
}
df=pd.DataFrame(data)

# return the first and second rows
print(df.loc[[0,1]])

Output (the result is itself a Pandas DataFrame):

   calories  duration
0       420        50
1       380        40

6. We can also specify index labels

import pandas as pd

data={
    "calories":[420,380,390],
    "duration":[50,40,45]
}
df=pd.DataFrame(data,index=["day1","day2","day3"])

print(df)

Output:

      calories  duration
day1       420        50
day2       380        40
day3       390        45

7. loc can also return the row corresponding to a specified index label

import pandas as pd

data={
    "calories":[420,380,390],
    "duration":[50,40,45]
}
df=pd.DataFrame(data,index=["day1","day2","day3"])

# look up a row by its index label
print(df.loc["day2"])

Output:

calories    380
duration     40
Name: day2, dtype: int64

IV. Pandas and CSV

to_string() returns the entire DataFrame as a string. Without it, printing a large DataFrame shows only the leading and trailing rows, with the middle elided:

import pandas as pd

df=pd.read_csv('nba.csv')
print(df.to_string())

Output (truncated here):

                         Name                    Team  Number Position   Age Height  Weight                College      Salary
0               Avery Bradley          Boston Celtics     0.0       PG  25.0    6-2   180.0                  Texas   7730337.0
1                 Jae Crowder          Boston Celtics    99.0       SF  25.0    6-6   235.0              Marquette   6796117.0
2                John Holland          Boston Celtics    30.0       SG  27.0    6-5   205.0      Boston University         NaN
3                 R.J. Hunter          Boston Celtics    28.0       SG  22.0    6-5   185.0          Georgia State   1148640.0
4               Jonas Jerebko          Boston Celtics     8.0       PF  29.0   6-10   231.0                    NaN   5000000.0
5                Amir Johnson          Boston Celtics    90.0       PF  29.0    6-9   240.0                    NaN  12000000.0
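
For comparison, a small sketch (not from the original): printing the DataFrame without to_string() summarizes it according to pandas' display options, showing only the head and tail rows with ... in between.

import pandas as pd

df = pd.read_csv('nba.csv')

# large frames are elided when printed directly
print(df)        # head and tail rows with "..." in the middle
print(len(df))   # total number of rows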

1. Use to_csv() to store a DataFrame as a CSV file

import pandas as pd
   
# three fields: name, site, age
nme = ["Google", "Runoob", "Taobao", "Wiki"]
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ag = [90, 40, 80, 98]
   
# assemble the fields into a dict (renamed from "dict" to avoid shadowing the built-in)
data = {'name': nme, 'site': st, 'age': ag}

df = pd.DataFrame(data)
 
# save the DataFrame to site.csv
df.to_csv('site.csv')

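By default to_csv() also writes the row index as an unnamed first column. As a small sketch (not from the original), index=False omits it, and an encoding can be set if the data contains non-ASCII text:

import pandas as pd

df = pd.DataFrame({'name': ['Google', 'Runoob'], 'age': [90, 40]})

# omit the row index column; utf-8-sig keeps Excel happy with non-ASCII text
df.to_csv('site_no_index.csv', index=False, encoding='utf-8-sig')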

2. Data processing

**head(n)** reads the first n rows; if the parameter n is omitted, the default is 5 rows.

import pandas as pd

df=pd.read_csv('nba.csv')
print(df.head())

Output:

            Name            Team  Number  ... Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0  ...  180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0  ...  235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0  ...  205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0  ...  185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0  ...  231.0                NaN  5000000.0

[5 rows x 9 columns]


3. Example: read the first 10 rows

import pandas as pd

df=pd.read_csv('nba.csv')
print(df.head(10))

Output:

            Name            Team  Number  ... Weight            College      Salary
0  Avery Bradley  Boston Celtics     0.0  ...  180.0              Texas   7730337.0
1    Jae Crowder  Boston Celtics    99.0  ...  235.0          Marquette   6796117.0
2   John Holland  Boston Celtics    30.0  ...  205.0  Boston University         NaN
3    R.J. Hunter  Boston Celtics    28.0  ...  185.0      Georgia State   1148640.0
4  Jonas Jerebko  Boston Celtics     8.0  ...  231.0                NaN   5000000.0
5   Amir Johnson  Boston Celtics    90.0  ...  240.0                NaN  12000000.0
6  Jordan Mickey  Boston Celtics    55.0  ...  235.0                LSU   1170960.0
7   Kelly Olynyk  Boston Celtics    41.0  ...  238.0            Gonzaga   2165160.0
8   Terry Rozier  Boston Celtics    12.0  ...  190.0         Louisville   1824360.0
9   Marcus Smart  Boston Celtics    36.0  ...  220.0     Oklahoma State   3431040.0

[10 rows x 9 columns]


4. tail(n)

Reads the last n rows; if the parameter n is omitted, the default is 5 rows. Fields in empty rows are returned as NaN.

import pandas as pd

df=pd.read_csv('nba.csv')
print(df.tail())

Output:

             Name       Team  Number Position  ...  Height Weight  College     Salary
453  Shelvin Mack  Utah Jazz     8.0       PG  ...     6-3  203.0   Butler  2433333.0
454     Raul Neto  Utah Jazz    25.0       PG  ...     6-1  179.0      NaN   900000.0
455  Tibor Pleiss  Utah Jazz    21.0        C  ...     7-3  256.0      NaN  2900000.0
456   Jeff Withey  Utah Jazz    24.0        C  ...     7-0  231.0   Kansas   947276.0
457           NaN        NaN     NaN      NaN  ...     NaN    NaN      NaN        NaN

[5 rows x 9 columns]


5.info()

Returns basic information about the table: column names, non-null counts, dtypes, and memory usage.

import pandas as pd

df=pd.read_csv('nba.csv')
print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   373 non-null    object 
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB
None


V. JSON

Pandas handles JSON data very conveniently. For example, reading a local sites.json file:

import pandas as pd

df=pd.read_json('sites.json')
print(df.to_string())

Output:

     id    name             url  likes
0  A001    菜鸟教程  www.runoob.com     61
1  A002  Google  www.google.com    124
2  A003      淘宝  www.taobao.com     45


1. JSON objects have the same format as Python dicts, so a Python dict can be converted to a DataFrame directly

import pandas as pd


# JSON in dict form
s = {
    "col1":{"row1":1,"row2":2,"row3":3},
    "col2":{"row1":"x","row2":"y","row3":"z"}
}

# convert the dict to a DataFrame
df = pd.DataFrame(s)
print(df)

Output:

      col1 col2
row1     1    x
row2     2    y
row3     3    z


2. Reading JSON data from a URL

import pandas as pd

URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)

Output:

     id    name             url  likes
0  A001    菜鸟教程  www.runoob.com     61
1  A002  Google  www.google.com    124
2  A003      淘宝  www.taobao.com     45


3. Nested JSON data

import pandas as pd

df = pd.read_json('nested_list.json')

print(df)

Output (the nested students objects are not expanded):

         school_name   class                                           students
0  ABC primary school  Year 1  {'id': 'A001', 'name': 'Tom', 'math': 60, 'phy...
1  ABC primary school  Year 1  {'id': 'A002', 'name': 'James', 'math': 89, 'p...
2  ABC primary school  Year 1  {'id': 'A003', 'name': 'Jenny', 'math': 79, 'p...

4. json_normalize() fully expands the nested data

import pandas as pd
import json

# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())

# flatten the nested data
df_nested_list = pd.json_normalize(data, record_path =['students'])
print(df_nested_list)

Example: json_normalize() uses the record_path parameter set to ['students'] to expand the nested students data, and meta to keep the top-level school_name and class fields:

import pandas as pd
import json

# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())

# flatten the nested data, keeping school_name and class as metadata columns
df_nested_list = pd.json_normalize(
    data,
    record_path =['students'],
    meta=['school_name', 'class']
)
print(df_nested_list)

Output:

     id   name  math  physics  chemistry         school_name   class
0  A001    Tom    60       66         61  ABC primary school  Year 1
1  A002  James    89       76         51  ABC primary school  Year 1
2  A003  Jenny    79       90         78  ABC primary school  Year 1

5. Reading one group of values from nested data

Read the nested math field. The file nested_deep.json used below contains:

{
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "grade": {
            "math": 60,
            "physics": 66,
            "chemistry": 61
        }
 
    },
    {
        "id": "A002",
        "name": "James",
        "grade": {
            "math": 89,
            "physics": 76,
            "chemistry": 51
        }
       
    },
    {
        "id": "A003",
        "name": "Jenny",
        "grade": {
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }
    }]
}

This requires the glom module, which allows accessing attributes of nested objects with the . notation.

Install it first:

pip3 install glom

import pandas as pd
from glom import glom

df = pd.read_json('nested_deep.json')

data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)

Output:

0    60
1    89
2    79
Name: students, dtype: int64
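
The same field can also be pulled out without glom; a minimal sketch (not from the original) using plain dict access on each nested object:

import pandas as pd

df = pd.read_json('nested_deep.json')

# each cell in the students column is a dict, so index into it directly
math_scores = df['students'].apply(lambda s: s['grade']['math'])
print(math_scores)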

VI. Data Cleaning

Data cleaning is the process of dealing with unusable data.
Many datasets contain missing values, format errors, wrong values, or duplicate data.

1. Cleaning missing values with Pandas

Rows containing empty fields can be deleted with dropna().
Signature:

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

axis: defaults to 0, dropping the whole row when it contains a missing value; axis=1 drops the whole column instead
how: defaults to 'any', dropping the row (or column) if any value in it is NA; with how='all', the row (or column) is dropped only if all of its values are NA
thresh: the minimum number of non-NA values required to keep a row
subset: the columns to check; pass a list of column names to check several columns
inplace: if True, replaces the data in place and returns None, modifying the source data. (how, thresh, and axis=1 are sketched right after this list.)
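
The examples below use only the defaults and subset; as a minimal sketch (with made-up data, not from the original), the other parameters behave like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, np.nan, np.nan],
    "B": [4, np.nan, 6],
    "C": [7, np.nan, 9],
})

print(df.dropna(how='all'))  # drop only rows where every value is NaN (row 1)
print(df.dropna(thresh=2))   # keep rows with at least 2 non-NaN values
print(df.dropna(axis=1))     # drop columns containing any NaN (here, all of them)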

Example: use isnull() to check whether cells are empty

import pandas as pd

df=pd.read_csv('property-data.csv')
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())


2. Pandas treats n/a and NA as missing data, but na is not treated as missing. If that does not meet our needs, we can specify which values should count as missing:

import pandas as pd

missing_values = ["n/a", "na", "--"]
df = pd.read_csv('property-data.csv', na_values = missing_values)

print (df['NUM_BEDROOMS'])
print (df['NUM_BEDROOMS'].isnull())


3. Deleting rows that contain missing data

import pandas as pd

df = pd.read_csv('property-data.csv')

new_df = df.dropna()

print(new_df.to_string())

To modify the source DataFrame itself, use inplace=True:

import pandas as pd

df = pd.read_csv('property-data.csv')

df.dropna(inplace = True)

print(df.to_string())

To remove rows where specified columns contain missing values, use subset:

import pandas as pd

df = pd.read_csv('property-data.csv')

df.dropna(subset=['ST_NUM'], inplace = True)

print(df.to_string())


4. Using fillna() to replace empty fields

Example: replace empty fields with 12345

import pandas as pd

df = pd.read_csv('property-data.csv')

df.fillna(12345, inplace = True)

print(df.to_string())

Example: replace data in a specified column only; use 12345 to replace missing values in the PID column:

import pandas as pd

df = pd.read_csv('property-data.csv')

# in newer pandas versions, prefer: df['PID'] = df['PID'].fillna(12345)
df['PID'].fillna(12345, inplace = True)

print(df.to_string())

Example: use mean() to compute the column average and fill empty cells with it:

import pandas as pd

df = pd.read_csv('property-data.csv')

x = df["ST_NUM"].mean()

df["ST_NUM"].fillna(x, inplace = True)

print(df.to_string())

Example: use median() to compute the column median and fill empty cells with it:

import pandas as pd

df = pd.read_csv('property-data.csv')

x = df["ST_NUM"].median()

df["ST_NUM"].fillna(x, inplace = True)

print(df.to_string())

Example: use mode() to compute the column mode and fill empty cells with it:

import pandas as pd

df = pd.read_csv('property-data.csv')

x = df["ST_NUM"].mode()

df["ST_NUM"].fillna(x, inplace = True)

print(df.to_string())


5. Cleaning badly formatted data with Pandas

Badly formatted cells make data analysis difficult.
We can either remove the rows containing such cells or convert all cells in the column to the same format.
Example: normalizing dates

import pandas as pd

# the third date is in a different format
data = {
  "Date": ['2020/12/01', '2020/12/02' , '20201226'],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

Output:

           Date  duration
day1 2020-12-01        50
day2 2020-12-02        40
day3 2020-12-26        45
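
If a date cannot be parsed at all, one option (a sketch, not part of the original) is to let to_datetime mark it as NaT and then drop that row:

import pandas as pd

data = {
  "Date": ['2020/12/01', '2020/12/02', 'not-a-date'],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# errors='coerce' turns unparseable values into NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df.dropna(subset=['Date'], inplace=True)

print(df.to_string())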

6. Cleaning wrong data with Pandas

Wrong data can be replaced or removed.
Example: replacing a wrong age value

import pandas as pd

person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 40, 12345]    # 12345 年龄数据是错误的
}

df = pd.DataFrame(person)

df.loc[2, 'age'] = 30  # fix the wrong value

print(df.to_string())

Output:

     name  age
0  Google   50
1  Runoob   40
2  Taobao   30

Example: set any age greater than 120 to 120

import pandas as pd

person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 200, 12345]    
}

df = pd.DataFrame(person)

for x in df.index:
  if df.loc[x, "age"] > 120:
    df.loc[x, "age"] = 120

print(df.to_string())

Output:

     name  age
0  Google   50
1  Runoob  120
2  Taobao  120

Example: delete rows where age is greater than 120

import pandas as pd

person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 40, 12345]    # 12345 年龄数据是错误的
}

df = pd.DataFrame(person)

for x in df.index:
  if df.loc[x, "age"] > 120:
    df.drop(x, inplace = True)

print(df.to_string())

Output:

     name  age
0  Google   50
1  Runoob   40

7. Cleaning duplicate data with Pandas

Use duplicated() and drop_duplicates().
duplicated() returns True for rows that duplicate an earlier row, and False otherwise.

import pandas as pd

person = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]  
}
df = pd.DataFrame(person)

print(df.duplicated())

Output:

0    False
1    False
2     True
3    False
dtype: bool

Example: delete duplicate rows directly with drop_duplicates()

import pandas as pd

persons = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]  
}

df = pd.DataFrame(persons)

df.drop_duplicates(inplace = True)
print(df)

Output:

     name  age
0  Google   50
1  Runoob   40
3  Taobao   23
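
drop_duplicates() also accepts subset and keep; a small sketch (not from the original) that deduplicates on a single column and keeps the last occurrence:

import pandas as pd

df = pd.DataFrame({
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 41, 23]
})

# rows count as duplicates when name matches; keep the last of each group
print(df.drop_duplicates(subset=['name'], keep='last'))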