Preface
This article walks through the first Kaggle Playground monthly competition of 2025, Forecasting Sticker Sales. Like the earlier entries in this series it is suitable for beginners, but unlike them the dataset is a time series: the task is to predict future sales from past sales. This installment therefore focuses on how to fill in missing values.
Problem description
The dataset has just four feature columns plus one label: date, country, store, product, and num_sold (the label, i.e. units sold). Very simple.
Loading the data
import pandas as pd
train=pd.read_csv('/kaggle/input/playground-series-s5e1/train.csv')
train.head()
_ | id | date | country | store | product | num_sold |
---|---|---|---|---|---|---|
0 | 0 | 2010-01-01 | Canada | Discount Stickers | Holographic Goose | NaN |
1 | 1 | 2010-01-01 | Canada | Discount Stickers | Kaggle | 973.0 |
2 | 2 | 2010-01-01 | Canada | Discount Stickers | Kaggle Tiers | 906.0 |
3 | 3 | 2010-01-01 | Canada | Discount Stickers | Kerneler | 423.0 |
4 | 4 | 2010-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | 491.0 |
test=pd.read_csv('/kaggle/input/playground-series-s5e1/test.csv')
test.head()
_ | id | date | country | store | product | num_sold |
---|---|---|---|---|---|---|
0 | 230130 | 2017-01-01 | Canada | Discount Stickers | Holographic Goose | |
1 | 230131 | 2017-01-01 | Canada | Discount Stickers | Kaggle | |
2 | 230132 | 2017-01-01 | Canada | Discount Stickers | Kaggle Tiers | |
3 | 230133 | 2017-01-01 | Canada | Discount Stickers | Kerneler | |
4 | 230134 | 2017-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | |
Inspecting the data
train["date"] = pd.to_datetime(train["date"])  # infer_datetime_format is deprecated in pandas 2.x
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230130 entries, 0 to 230129
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   id        230130 non-null  int64
 1   date      230130 non-null  datetime64[ns]
 2   country   230130 non-null  object
 3   store     230130 non-null  object
 4   product   230130 non-null  object
 5   num_sold  221259 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 10.5+ MB
test["date"] = pd.to_datetime(test["date"])
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98550 entries, 0 to 98549
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       98550 non-null  int64
 1   date     98550 non-null  datetime64[ns]
 2   country  98550 non-null  object
 3   store    98550 non-null  object
 4   product  98550 non-null  object
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 3.8+ MB
Dropping unneeded columns
train=train.drop(['id'],axis=1)
test=test.drop(['id'],axis=1)
Checking that the train and test sets contain the same category values
counter = 0
for i in test.select_dtypes(include=['object']).columns.tolist():
    if len(set(train[i].unique()) ^ set(test[i].unique())) != 0:
        print(i, 'need to be worked on')
        counter += 1
if counter == 0:
    print('No work needed')
No work needed
So the categorical values in the two datasets match.
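The `^` used in the check above is Python's set symmetric difference: it returns the values present in exactly one of the two sets, so an empty result means the category values match. A minimal illustration:

```python
# Symmetric difference keeps elements that appear in exactly one of the two sets
a = {"Canada", "Kenya"}
b = {"Canada", "Norway"}
print(a ^ b)       # {'Kenya', 'Norway'} (set ordering is not guaranteed)
print(len(a ^ a))  # 0 -> identical sets, i.e. "No work needed"
```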
Checking the class balance in the training set
country=train['country'].value_counts().to_frame()
country.reset_index()
- | country | count |
---|---|---|
0 | Canada | 38355 |
1 | Finland | 38355 |
2 | Italy | 38355 |
3 | Kenya | 38355 |
4 | Norway | 38355 |
5 | Singapore | 38355 |
train.groupby('country').size().plot(kind='pie', autopct='%.2f%%');
product=train['product'].value_counts().to_frame()
product.reset_index()
- | product | count |
---|---|---|
0 | Holographic Goose | 46026 |
1 | Kaggle | 46026 |
2 | Kaggle Tiers | 46026 |
3 | Kerneler | 46026 |
4 | Kerneler Dark Mode | 46026 |
train.groupby('product').size().plot(kind='pie', autopct='%.2f%%');
From the above, the category combinations are distributed perfectly evenly.
counts = train.groupby(["country","store","product"])["num_sold"].count()
counts
Canada Discount Stickers Holographic Goose 0
Kaggle 2557
Kaggle Tiers 2557
Kerneler 2556
Kerneler Dark Mode 2557
...
Singapore Stickers for Less Holographic Goose 2557
Kaggle 2557
Kaggle Tiers 2557
Kerneler 2557
Kerneler Dark Mode 2557
Name: num_sold, Length: 90, dtype: int64
Checking missing values in the training set
missing_data = counts.loc[counts != 2557]
missing_data
Canada Discount Stickers Holographic Goose 0
Kerneler 2556
Premium Sticker Mart Holographic Goose 2177
Stickers for Less Holographic Goose 1249
Kenya Discount Stickers Holographic Goose 0
Kerneler 2494
Kerneler Dark Mode 2556
Premium Sticker Mart Holographic Goose 1911
Stickers for Less Holographic Goose 1199
Name: num_sold, dtype: int64
import matplotlib.pyplot as plt

# Plot setup
f, axs = plt.subplots(9, 1, figsize=(20, 50))
for i, (country, store, product) in enumerate(missing_data.index):
    # Filter the data for the current combination of country, store, and product
    plot = train.loc[
        (train["country"] == country) &
        (train["store"] == store) &
        (train["product"] == product)
    ]
    if plot.empty:
        print(f"No data available for {country} - {store} - {product}")
        continue
    # Rows where num_sold is missing
    missing_vals = plot.loc[plot["num_sold"].isna()]
    # Plot the num_sold data
    axs[i].plot(plot["date"], plot["num_sold"], marker='o', linestyle='-', color='blue', label='num_sold')
    # Add vertical lines for missing dates
    for missing_date in missing_vals["date"].dropna():
        axs[i].axvline(missing_date, color='red', linestyle='-', linewidth=1, alpha=0.5)
    # Add title and labels
    axs[i].set_title(f"{country} - {store} - {product}")
    axs[i].set_xlabel("Date")
    axs[i].set_ylabel("Number Sold")
    axs[i].legend(loc='upper right')
# Adjust layout
plt.tight_layout()
plt.show()
The figure shows where the missing values fall: the blue line is the observed series, and each red vertical line marks a date with a missing value.
The missingness patterns differ widely; some series are missing entirely.
Checking the date ranges of train and test
print("Train - Earliest date:", train["date"].min())
print("Train - Latest date:", train["date"].max())
print("Test - Earliest date:", test["date"].min())
print("Test - Latest date:", test["date"].max())
Train - Earliest date: 2010-01-01 00:00:00
Train - Latest date: 2016-12-31 00:00:00
Test - Earliest date: 2017-01-01 00:00:00
Test - Latest date: 2019-12-31 00:00:00
Aggregating weekly, monthly, and yearly sales per country, store, and product
weekly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
yearly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="Y")])["num_sold"].sum().rename("num_sold").reset_index()
Total sales for each category combination
def plot_all(df):
    import matplotlib.pyplot as plt
    # Set up the figure and axes
    f, axes = plt.subplots(3, 2, figsize=(20, 30), sharex=True, sharey=True)
    f.tight_layout(pad=5.0)  # Adjust layout for better spacing
    # Iterate over each unique product
    for n, prod in enumerate(df["product"].unique()):
        plot = df.loc[df["product"] == prod]
        # Get the current subplot axis
        ax = axes[n // 2, n % 2]
        # Plot data for each combination of country and store
        for country in plot["country"].unique():
            for store in plot["store"].unique():
                subset = plot.loc[
                    (plot["country"] == country) &
                    (plot["store"] == store)
                ]
                if not subset.empty:
                    ax.plot(
                        subset["date"],
                        subset["num_sold"],
                        marker='o',
                        label=f"{country}-{store}"
                    )
        # Set title and labels
        ax.set_title(f"Product: {prod}", fontsize=16)
        ax.set_xlabel("Date", fontsize=12)
        ax.set_ylabel("Number Sold", fontsize=12)
        ax.legend(title="Country-Store", loc='upper left', fontsize=10)
    # Hide unused subplots if any
    for idx in range(n + 1, len(axes.flatten())):
        axes.flatten()[idx].set_visible(False)
    # Show the plot
    plt.show()
Weekly view
plot_all(weekly)
Monthly view
plot_all(monthly)
Yearly view
plot_all(yearly)
Each country's share of total sales
country_weights = train.groupby("country")["num_sold"].sum() / train["num_sold"].sum()
country_weights.head()
country
Canada 0.172087
Finland 0.172362
Italy 0.124908
Kenya 0.004139
Norway 0.327695
Name: num_sold, dtype: float64
Each country's share of daily sales
country_ratio_over_time = (train.groupby(["date","country"])["num_sold"].sum() / train.groupby(["date"])["num_sold"].sum()).reset_index()
country_ratio_over_time.head()
- | date | country | num_sold |
---|---|---|---|
0 | 2010-01-01 | Canada | 0.176460 |
1 | 2010-01-01 | Finland | 0.179511 |
2 | 2010-01-01 | Italy | 0.136094 |
3 | 2010-01-01 | Kenya | 0.003424 |
4 | 2010-01-01 | Norway | 0.326333 |
Notice anything?
Each country's share of daily sales matches its overall share: the country proportions are essentially constant over time.
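One quick way to sanity-check that claim is to compute the spread of each country's daily share over time. A sketch on a made-up two-day frame (the toy data is mine, not the competition data) using the same groupby-division idiom as above:

```python
import pandas as pd

# Hypothetical mini-frame standing in for `train`: two days, two countries,
# with sales in a fixed 3:1 proportion on both days.
toy = pd.DataFrame({
    "date": ["2010-01-01"] * 2 + ["2010-01-02"] * 2,
    "country": ["Norway", "Kenya"] * 2,
    "num_sold": [300, 100, 600, 200],
})
# Per-(date, country) total divided by the per-date total: the daily share.
daily_ratio = (
    toy.groupby(["date", "country"])["num_sold"].sum()
    / toy.groupby("date")["num_sold"].sum()
)
# If the shares are stable, their standard deviation across days is ~0.
spread = daily_ratio.groupby("country").std()
print(spread.max())  # 0.0 for this toy data
```

On the real data the same statistic stays small, which is what the flat lines in the next plot show.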
Plotting the daily shares
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the countries and plot each one
for country in country_ratio_over_time['country'].unique():
    country_data = country_ratio_over_time[country_ratio_over_time['country'] == country]
    ax.plot(country_data['date'], country_data['num_sold'], label=country)
ax.set_ylabel("Proportion of sales")
ax.legend(title='Country')
plt.show()
Getting per-country GDP data
gdp_per_capita = pd.read_csv("/kaggle/input/world-gdpgdp-gdp-per-capita-and-annual-growths/gdp_per_capita.csv")
years = ["2010", "2011", "2012", "2013", "2014", "2015", "2016"]
gdp_per_capita = gdp_per_capita.loc[gdp_per_capita["Country Name"].isin(train["country"].unique()), ["Country Name"] + years].set_index("Country Name")
gdp_per_capita.head()
Country Name | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
---|---|---|---|---|---|---|---|
Canada | 47562.083425 | 52223.696112 | 52669.089963 | 52635.174958 | 50955.998323 | 43596.135537 | 42315.603706 |
Finland | 46459.973255 | 51081.997670 | 47710.790217 | 49878.043244 | 50260.299859 | 42784.698362 | 43784.283962 |
Italy | 36000.520118 | 38599.062207 | 35053.526244 | 35549.974697 | 35518.415292 | 30230.226302 | 30939.714246 |
Kenya | 1080.296184 | 1085.487152 | 1271.815383 | 1354.820833 | 1462.220052 | 1464.554009 | 1525.235192 |
Norway | 87693.790066 | 100600.562408 | 101524.141852 | 102913.450844 | 97019.182753 | 74355.515858 | 70460.560532 |
There are many ways to obtain GDP data; it can also be fetched directly via an API, which the next article covers.
Computing each country's GDP share
# The steps below also repeat each year's value from the first day to the
# last day of the year, purely to make the plot below easier to draw.
for year in years:
    gdp_per_capita[f"{year}_ratio"] = gdp_per_capita[year] / gdp_per_capita[year].sum()
gdp_per_capita_ratios = gdp_per_capita[[i+"_ratio" for i in years]]
gdp_per_capita_ratios.columns = [int(i) for i in years]
gdp_per_capita_ratios = gdp_per_capita_ratios.unstack().reset_index().rename(columns = {"level_0": "year", 0: "ratio", "Country Name": "country"})
gdp_per_capita_ratios['year'] = pd.to_datetime(gdp_per_capita_ratios['year'], format='%Y')
gdp_per_capita_ratios_nextyear = gdp_per_capita_ratios.copy()
gdp_per_capita_ratios_nextyear["year"] = pd.to_datetime(gdp_per_capita_ratios_nextyear['year'].astype(str)) + pd.offsets.YearEnd(1)
gdp_per_capita_ratios = pd.concat([gdp_per_capita_ratios, gdp_per_capita_ratios_nextyear],ignore_index=True)
gdp_per_capita_ratios=gdp_per_capita_ratios.sort_values(by=['year'], ascending=True)
gdp_per_capita_ratios.reset_index(drop=True, inplace=True)
gdp_per_capita_ratios.head()
- | year | country | ratio |
---|---|---|---|
0 | 2010-01-01 | Canada | 0.178782 |
1 | 2010-01-01 | Finland | 0.174639 |
2 | 2010-01-01 | Italy | 0.135323 |
3 | 2010-01-01 | Kenya | 0.004061 |
4 | 2010-01-01 | Norway | 0.329634 |
Notice anything else?
These GDP shares line up with the sales-share weights computed earlier.
f, ax = plt.subplots(figsize=(20, 20))
# Plot the first dataset (country_ratio_over_time)
for country in country_ratio_over_time['country'].unique():
    country_data = country_ratio_over_time[country_ratio_over_time['country'] == country]
    ax.plot(country_data['date'], country_data['num_sold'], label=country)
# Plot the second dataset (gdp_per_capita_ratios) in black for all countries
for country in gdp_per_capita_ratios['country'].unique():
    country_data = gdp_per_capita_ratios[gdp_per_capita_ratios['country'] == country]
    ax.plot(country_data['year'], country_data['ratio'], color='black')
# Set ylabel
ax.set_ylabel("Proportion of sales")
# Show legend for the first dataset (country_ratio_over_time)
ax.legend(title='Country')
plt.show()
The figure above overlays each country's GDP share (black lines) on its share of daily sales.
gdp_per_capita_ratios_nextyear["year"] = gdp_per_capita_ratios_nextyear["year"].dt.year
def plot_adjust_country(df):
    new = df.copy()
    new["year"] = new["date"].dt.year
    for country in new["country"].unique():
        for year in new["year"].unique():
            mask = (new["country"] == country) & (new["year"] == year)
            ratio = gdp_per_capita_ratios_nextyear.loc[
                (gdp_per_capita_ratios_nextyear["country"] == country) &
                (gdp_per_capita_ratios_nextyear["year"] == year), "ratio"
            ].values[0]
            new.loc[mask, "num_sold"] = new.loc[mask, "num_sold"] / ratio
    plot_all(new)
Plotting the data after adjusting by GDP share
plot_adjust_country(monthly)
Counting the missing values
print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
# Missing values remaining: 8871
train["year"] = train["date"].dt.year
Filling the remaining gaps by GDP-share matching against another country
The implementation is a bit involved.
For example: take the GDP shares of Canada and Norway, look up Norway's sales of the same product, and scale them by the ratio of the two shares to fill in Canada's sales.
Note that, as the code below admits, completely replacing a couple of Kenyan stores that have a lot of missing data with Norway's data is a bit risky.
for year in train["year"].unique():
    # Impute the Canada series (Holographic Goose, all three stores) from Norway's
    target_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Norway"), "ratio"].values[0]
    current_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Canada"), "ratio"].values[0]
    ratio_can = current_ratio / target_ratio
    train.loc[(train["country"] == "Canada") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    train.loc[(train["country"] == "Canada") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    train.loc[(train["country"] == "Canada") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    # Impute the Kenya series the same way
    current_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Kenya"), "ratio"].values[0]
    ratio_ken = current_ratio / target_ratio
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    # Going to be a bit risky and completely replace a couple of stores for Kenya with a lot of missing data with Norway's data:
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Discount Stickers") & (train["product"] == "Kerneler") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Kerneler") & (train["year"] == year), "num_sold"] * ratio_ken).values
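The loop above repeats the same `.loc` assignment pattern many times; the core idea could be wrapped in a helper. This is a hypothetical refactor of the same GDP-ratio imputation (the function name and signature are mine, not from the original notebook):

```python
import pandas as pd

def impute_from_reference(df, ratios, target_country, ref_country,
                          store, product, year):
    """Rebuild one (country, store, product, year) slice by scaling the
    reference country's daily sales with the ratio of the two GDP shares.
    `ratios` is assumed to have columns year / country / ratio."""
    r_ref = ratios.loc[(ratios["year"] == year) &
                       (ratios["country"] == ref_country), "ratio"].values[0]
    r_tgt = ratios.loc[(ratios["year"] == year) &
                       (ratios["country"] == target_country), "ratio"].values[0]
    scale = r_tgt / r_ref
    src = ((df["country"] == ref_country) & (df["store"] == store) &
           (df["product"] == product) & (df["year"] == year))
    dst = ((df["country"] == target_country) & (df["store"] == store) &
           (df["product"] == product) & (df["year"] == year))
    # Both masks select the same number of day-rows, so the scaled values
    # can be assigned positionally via .values.
    df.loc[dst, "num_sold"] = (df.loc[src, "num_sold"] * scale).values
    return df
```

With such a helper, each imputed slice becomes one call, e.g. `impute_from_reference(train, gdp_per_capita_ratios_nextyear, "Canada", "Norway", "Discount Stickers", "Holographic Goose", year)`.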
Checking missing values after the replacement
print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
# Missing values remaining: 2
Only 2 missing values remain.
# Locate the two remaining missing rows
missing_rows = train.loc[train["num_sold"].isna()]
missing_rows
- | date | country | store | product | num_sold | year |
---|---|---|---|---|---|---|
23719 | 2010-09-21 | Kenya | Discount Stickers | Kerneler Dark Mode | NaN | 2010 |
207003 | 2016-04-19 | Canada | Discount Stickers | Kerneler | NaN | 2016 |
Inspecting the days around each missing value
a = train.loc[(train.country == 'Kenya') & (train.store == 'Discount Stickers') &
              (train['product'] == 'Kerneler Dark Mode')]
a.loc[a.date >= '2010-09-11'].head(20)
- | date | country | store | product | num_sold | year |
---|---|---|---|---|---|---|
22819 | 2010-09-11 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
22909 | 2010-09-12 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
22999 | 2010-09-13 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23089 | 2010-09-14 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23179 | 2010-09-15 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23269 | 2010-09-16 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23359 | 2010-09-17 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23449 | 2010-09-18 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23539 | 2010-09-19 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23629 | 2010-09-20 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23719 | 2010-09-21 | Kenya | Discount Stickers | Kerneler Dark Mode | NaN | 2010 |
23809 | 2010-09-22 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23899 | 2010-09-23 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23989 | 2010-09-24 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24079 | 2010-09-25 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
24169 | 2010-09-26 | Kenya | Discount Stickers | Kerneler Dark Mode | 7.0 | 2010 |
24259 | 2010-09-27 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24349 | 2010-09-28 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24439 | 2010-09-29 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24529 | 2010-09-30 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
a = train.loc[(train.country == 'Canada') & (train.store == 'Discount Stickers') &
              (train['product'] == 'Kerneler')]
a.loc[a.date >= '2016-04-15'].head(20)
- | date | country | store | product | num_sold | year |
---|---|---|---|---|---|---|
206643 | 2016-04-15 | Canada | Discount Stickers | Kerneler | 245.0 | 2016 |
206733 | 2016-04-16 | Canada | Discount Stickers | Kerneler | 239.0 | 2016 |
206823 | 2016-04-17 | Canada | Discount Stickers | Kerneler | 290.0 | 2016 |
206913 | 2016-04-18 | Canada | Discount Stickers | Kerneler | 229.0 | 2016 |
207003 | 2016-04-19 | Canada | Discount Stickers | Kerneler | NaN | 2016 |
207093 | 2016-04-20 | Canada | Discount Stickers | Kerneler | 245.0 | 2016 |
207183 | 2016-04-21 | Canada | Discount Stickers | Kerneler | 228.0 | 2016 |
207273 | 2016-04-22 | Canada | Discount Stickers | Kerneler | 246.0 | 2016 |
207363 | 2016-04-23 | Canada | Discount Stickers | Kerneler | 265.0 | 2016 |
207453 | 2016-04-24 | Canada | Discount Stickers | Kerneler | 298.0 | 2016 |
207543 | 2016-04-25 | Canada | Discount Stickers | Kerneler | 217.0 | 2016 |
207633 | 2016-04-26 | Canada | Discount Stickers | Kerneler | 226.0 | 2016 |
207723 | 2016-04-27 | Canada | Discount Stickers | Kerneler | 221.0 | 2016 |
207813 | 2016-04-28 | Canada | Discount Stickers | Kerneler | 223.0 | 2016 |
207903 | 2016-04-29 | Canada | Discount Stickers | Kerneler | 256.0 | 2016 |
207993 | 2016-04-30 | Canada | Discount Stickers | Kerneler | 215.0 | 2016 |
208083 | 2016-05-01 | Canada | Discount Stickers | Kerneler | 304.0 | 2016 |
208173 | 2016-05-02 | Canada | Discount Stickers | Kerneler | 222.0 | 2016 |
208263 | 2016-05-03 | Canada | Discount Stickers | Kerneler | 256.0 | 2016 |
208353 | 2016-05-04 | Canada | Discount Stickers | Kerneler | 229.0 | 2016 |
Filling the last two missing values with fixed values
missing_rows = train.loc[train["num_sold"].isna()]
# Fill each gap with a value consistent with the surrounding days
train.loc[train.index == 23719, "num_sold"] = 5
train.loc[train.index == 207003, "num_sold"] = 300
print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
Missing values remaining: 0
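Hand-picking the two values works here because each gap is a single isolated day. A less manual alternative would be time-based interpolation from the neighbouring days; a sketch on a toy series (the dates mimic the Kenya gap, this is not the notebook's actual code):

```python
import numpy as np
import pandas as pd

# Toy daily series with one gap, mimicking the isolated NaN on 2010-09-21.
s = pd.Series([6.0, 5.0, np.nan, 5.0, 6.0],
              index=pd.date_range("2010-09-19", periods=5, freq="D"))
# method="time" requires a DatetimeIndex and weights by elapsed time,
# so an isolated gap gets the midpoint of its two neighbours.
filled = s.interpolate(method="time")
print(filled.loc["2010-09-21"])  # 5.0
```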
Each store's share of sales
weekly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
store_weights = train.groupby("store")["num_sold"].sum()/train["num_sold"].sum()
store_weights
store
Discount Stickers 0.184728
Premium Sticker Mart 0.441569
Stickers for Less 0.373702
Name: num_sold, dtype: float64
store_ratio_over_time = (train.groupby(["date","store"])["num_sold"].sum() / train.groupby(["date"])["num_sold"].sum()).reset_index()
Plotting each store's share of sales over time
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the unique stores and plot them
for store in store_ratio_over_time['store'].unique():
    store_data = store_ratio_over_time[store_ratio_over_time['store'] == store]
    ax.plot(store_data['date'], store_data['num_sold'], label=store)
# Set the ylabel
ax.set_ylabel("Proportion of sales")
plt.grid()
Plotting after rescaling the stores
def plot_adjusted_store(df):
    new = df.copy()
    weights = store_weights.loc["Premium Sticker Mart"] / store_weights
    print(weights)
    for store in weights.index:
        new.loc[new["store"] == store, "num_sold"] = new.loc[new["store"] == store, "num_sold"] * weights[store]
    plot_all(new)

plot_adjusted_store(monthly)
store
Discount Stickers 2.390369
Premium Sticker Mart 1.000000
Stickers for Less 1.181606
Name: num_sold, dtype: float64
Sales by product
product_ = train.groupby(["date","product"])["num_sold"].sum().reset_index()
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the unique products and plot them
for product in product_['product'].unique():
    product_data = product_[product_['product'] == product]
    ax.plot(product_data['date'], product_data['num_sold'], label=product)
# Set labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Add a legend
ax.legend(title="Product")
# Display the plot
plt.grid()
Plotting each product's share of sales (converted to ratios)
product_ratio = product_.pivot(index="date", columns="product", values="num_sold")
product_ratio = product_ratio.apply(lambda x: x/x.sum(),axis=1)
product_ratio = product_ratio.stack().rename("ratios").reset_index()
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the unique products and plot them
for product in product_ratio['product'].unique():
    product_data = product_ratio[product_ratio['product'] == product]
    ax.plot(product_data['date'], product_data['ratios'], label=product)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Ratios")
# Add a legend
ax.legend(title="Product")
# Display the plot
plt.grid()
Comparing the two figures, the ratio curves are noticeably smoother than the raw counts.
Daily total sales
train_ = train.groupby(["date"])["num_sold"].sum().reset_index()
f, ax = plt.subplots(figsize=(20, 10))
# Plot the data
ax.plot(train_['date'], train_['num_sold'], linewidth=1)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Display the plot
plt.grid()
weekly = train_.groupby([pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train_.groupby([pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
Monthly totals
f, ax = plt.subplots(figsize=(20, 10))
# Plot the data
ax.plot(monthly['date'], monthly['num_sold'], linewidth=1)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Display the plot
plt.grid()
Weekly totals
f, ax = plt.subplots(figsize=(20, 10))
# Plot the data
ax.plot(weekly['date'][1:-1], weekly['num_sold'][1:-1], linewidth=1)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Display the plot
plt.grid()
Seasonality
def plot_seasonality(df, x_axis):
    # Create new columns for seasonality analysis
    df["month"] = df["date"].dt.month
    df["day_of_week"] = df["date"].dt.dayofweek
    df["day_of_year"] = df['date'].apply(
        lambda x: x.timetuple().tm_yday if not (x.is_leap_year and x.month > 2) else x.timetuple().tm_yday - 1
    )
    # Create the plot
    f, ax = plt.subplots(1, 1, figsize=(20, 8))
    # Group the data by the specified x-axis value and calculate the mean of num_sold
    grouped_data = df.groupby(x_axis)["num_sold"].mean().reset_index()
    # Plot the data
    ax.plot(grouped_data[x_axis], grouped_data["num_sold"], linewidth=1)
    # Set title and labels
    ax.set_title(f"{x_axis} Seasonality")
    ax.set_xlabel(x_axis)
    ax.set_ylabel("Number of Units Sold")
    # Display the plot
    plt.grid()
plot_seasonality(train_, "month")
plot_seasonality(train_, "day_of_week")
plot_seasonality(train_, "day_of_year")
def plot_individual_ts(df):
    for country in df["country"].unique():
        # Create subplots with the appropriate number of rows
        fig, axes = plt.subplots(
            nrows=df["store"].nunique() * df["product"].nunique(),
            figsize=(20, 70),
            constrained_layout=True
        )
        count = 0
        for store in df["store"].unique():
            for product in df["product"].unique():
                # Filter the DataFrame for the current country, store, and product
                plot_df = df.loc[
                    (df["product"] == product) &
                    (df["country"] == country) &
                    (df["store"] == store)
                ]
                if len(plot_df) > 0:  # Only plot if data is available
                    axes[count].plot(plot_df["date"], plot_df["num_sold"], linewidth=0.5)
                    axes[count].set_title(f"{country} - {store} - {product}")
                    count += 1
        # Remove any unused subplots (if any)
        for idx in range(count, len(axes)):
            fig.delaxes(axes[idx])
        plt.grid()
plot_individual_ts(train.reset_index(drop=True))
The plots above show every country / store / product series individually.
Saving the dataset with the missing values filled in
df=pd.read_csv('/kaggle/input/playground-series-s5e1/train.csv')
df['num_sold']=train['num_sold']
df.to_csv('train.csv', index = False)
Summary
The dataset has the following characteristics:
- It is a time series, and product sales show clear periodicity
- The seasonal curves are roughly sinusoidal
- Product sales are strongly positively correlated with each country's GDP per capita
- Sales depend on whether the day falls on a weekend
These characteristics point the way for feature engineering in the next modeling step.
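As a starting point for that feature engineering, the weekday/weekend signal noted above can be encoded directly from the date column (a sketch; the column names follow the competition data, `is_weekend` is my own name):

```python
import pandas as pd

# Fri 2017-01-06, Sat 2017-01-07, Sun 2017-01-08
df = pd.DataFrame({"date": pd.to_datetime(
    ["2017-01-06", "2017-01-07", "2017-01-08"])})
df["day_of_week"] = df["date"].dt.dayofweek           # Monday=0 ... Sunday=6
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
print(df["is_weekend"].tolist())  # [0, 1, 1]
```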