[Python数据分析] 异常数据处理

导入pandas和numpy

import pandas as pd
import numpy as np

1.导入数据并查看属性

data = pd.read_csv("data/Online_Retail_Fake.csv")  # 导入数据
data.shape # 查看shape属性

outputs:

(541910, 8)
data.size # 查看size属性

outputs:

4335280
data  # 查看数据

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:26NaN17850.0United Kingdom
253636584406BNaN82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

data.info() # 查看元数据信息

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541910 entries, 0 to 541909
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541910 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540454 non-null  object 
 3   Quantity     541910 non-null  int64  
 4   InvoiceDate  541910 non-null  object 
 5   UnitPrice    541907 non-null  float64
 6   CustomerID   406828 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

2.异常数据处理

2.1空值

data.isnull().sum() # 查看是否有空值
InvoiceNo           0
StockCode           1
Description      1456
Quantity            0
InvoiceDate         0
UnitPrice           3
CustomerID     135082
Country             1
dtype: int64

可以看到各个属性列的空值个数

2.2 重复数据

data.duplicated()

output:

0         False
1         False
2         False
3         False
4         False
          ...  
541905    False
541906    False
541907    False
541908    False
541909    False
Length: 541910, dtype: bool
data.duplicated().sum()

outputs:

5268

可以看出有较多重复数据

data[data.duplicated()].head()

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
51753640921866UNION JACK FLAG LUGGAGE TAG12010/12/1 11:451.2517908.0United Kingdom
52753640922866HAND WARMER SCOTTY DOG DESIGN12010/12/1 11:452.1017908.0United Kingdom
53753640922900SET 2 TEA TOWELS I LOVE LONDON12010/12/1 11:452.9517908.0United Kingdom
53953640922111SCOTTIE DOG HOT WATER BOTTLE12010/12/1 11:454.9517908.0United Kingdom
55553641222327ROUND SNACK BOXES SET OF 4 SKULLS12010/12/1 11:492.9517920.0United Kingdom

可以看出InvoiceNo为536409的信息较多,需要单独分析

查看InvoiceNo为536409的数据

data[data["InvoiceNo"] == "536409"]

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
48353640990199C5 STRAND GLASS NECKLACE CRYSTAL32010/12/1 11:456.3517908.0United Kingdom
48453640921479WHITE SKULL HOT WATER BOTTLE12010/12/1 11:453.7517908.0United Kingdom
48553640922111SCOTTIE DOG HOT WATER BOTTLE12010/12/1 11:454.9517908.0United Kingdom
48653640922785SQUARECUSHION COVER PINK UNION FLAG12010/12/1 11:456.7517908.0United Kingdom
........................
53753640922900SET 2 TEA TOWELS I LOVE LONDON12010/12/1 11:452.9517908.0United Kingdom
53853640921980PACK OF 12 RED RETROSPOT TISSUES122010/12/1 11:450.2917908.0United Kingdom
53953640922111SCOTTIE DOG HOT WATER BOTTLE12010/12/1 11:454.9517908.0United Kingdom
54053640922112CHOCOLATE HOT WATER BOTTLE12010/12/1 11:454.9517908.0United Kingdom
data[data.duplicated(keep=False)][data[data.duplicated(keep=False)]["InvoiceNo"] == "536409"]

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
48553640922111SCOTTIE DOG HOT WATER BOTTLE12010/12/1 11:454.9517908.0United Kingdom
48953640922866HAND WARMER SCOTTY DOG DESIGN12010/12/1 11:452.1017908.0United Kingdom
49453640921866UNION JACK FLAG LUGGAGE TAG12010/12/1 11:451.2517908.0United Kingdom
51753640921866UNION JACK FLAG LUGGAGE TAG12010/12/1 11:451.2517908.0United Kingdom
52153640922900SET 2 TEA TOWELS I LOVE LONDON12010/12/1 11:452.9517908.0United Kingdom
52753640922866HAND WARMER SCOTTY DOG DESIGN12010/12/1 11:452.1017908.0United Kingdom
53753640922900SET 2 TEA TOWELS I LOVE LONDON12010/12/1 11:452.9517908.0United Kingdom
53953640922111SCOTTIE DOG HOT WATER BOTTLE12010/12/1 11:454.9517908.0United Kingdom

可直接获取如上重复数据

2.3 异常处理

2.3.1 空值处理

data.dropna()

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
553636522752SET 7 BABUSHKA NESTING BOXES22010/12/1 8:267.6517850.0United Kingdom
653636521730GLASS STAR FROSTED T-LIGHT HOLDER62010/12/1 8:264.2517850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

406825 rows × 8 columns

如上实现某一行有一个空值则去掉

data.dropna(how="all")

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:26NaN17850.0United Kingdom
253636584406BNaN82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

如上实现某一行全为空值才去掉

data.dropna(axis=1)

outputs:

InvoiceNoQuantityInvoiceDate
053636562010/12/1 8:26
153636562010/12/1 8:26
253636582010/12/1 8:26
353636562010/12/1 8:26
453636562010/12/1 8:26
............
54190558158762011/12/9 12:50
54190658158742011/12/9 12:50
54190758158742011/12/9 12:50
54190858158732011/12/9 12:50
54190958158732011/12/9 12:50

541910 rows × 3 columns

如上实现按列去除

data.dropna(axis=1, how="all")

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:26NaN17850.0United Kingdom
253636584406BNaN82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

如上实现按列去除全为空值的列

data.fillna(value=0)
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:260.0017850.0United Kingdom
253636584406B082010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

如上实现空值填充为0

data.fillna(method="ffill")

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:262.5517850.0United Kingdom
253636584406BWHITE METAL LANTERN82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

如上实现空值填充为上面的数据

data.fillna(method="bfill")

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:262.7517850.0United Kingdom
253636584406BKNITTED UNION FLAG HOT WATER BOTTLE82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

如上实现空值填充为下面的数据

2.3.2 重复值处理

data.drop_duplicates()

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:26NaN17850.0United Kingdom
253636584406BNaN82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

去除重复数据

data

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:26NaN17850.0United Kingdom
253636584406BNaN82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

data = data.fillna(method="bfill")
data

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:262.7517850.0United Kingdom
253636584406BKNITTED UNION FLAG HOT WATER BOTTLE82010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010/12/1 8:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011/12/9 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011/12/9 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France
54190958158722138Wrong booking32011/12/9 12:504.9512680.0France

541910 rows × 8 columns

data.drop_duplicates(["InvoiceNo", "UnitPrice"])

outputs:

InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010/12/1 8:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010/12/1 8:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010/12/1 8:263.3917850.0United Kingdom
553636522752SET 7 BABUSHKA NESTING BOXES22010/12/1 8:267.6517850.0United Kingdom
653636521730GLASS STAR FROSTED T-LIGHT HOLDER62010/12/1 8:264.2517850.0United Kingdom
...........................
54189758158722728ALARM CLOCK BAKELIKE PINK42011/12/9 12:503.7512680.0France
54190358158723256CHILDRENS CUTLERY SPACEBOY42011/12/9 12:504.1512680.0France
54190458158722613PACK OF 20 SPACEBOY NAPKINS122011/12/9 12:500.8512680.0France
54190558158722899CHILDREN'S APRON DOLLY GIRL62011/12/9 12:502.1012680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011/12/9 12:504.9512680.0France

204727 rows × 8 columns

如上实现去除 InvoiceNoUnitPrice相同的数据

  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

是易安吖

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值