[Kaggle] Playground Competition: Forecasting Sticker Sales (Part 1)

Preface

This article covers the first Kaggle Playground monthly competition of 2025, Forecasting Sticker Sales. Like the earlier Playground competitions it is beginner-friendly, but this time the dataset is a time series: the task is to predict future sales from historical sales. This first part therefore focuses on how to fill in the missing values.

Problem Description

The dataset has just four feature columns plus one label: date, country, store, product, and num_sold (the label). Very simple.

Loading the Data

import pandas as pd
train=pd.read_csv('/kaggle/input/playground-series-s5e1/train.csv')
train.head()
   id       date country              store             product  num_sold
0   0 2010-01-01  Canada  Discount Stickers   Holographic Goose       NaN
1   1 2010-01-01  Canada  Discount Stickers              Kaggle     973.0
2   2 2010-01-01  Canada  Discount Stickers        Kaggle Tiers     906.0
3   3 2010-01-01  Canada  Discount Stickers            Kerneler     423.0
4   4 2010-01-01  Canada  Discount Stickers  Kerneler Dark Mode     491.0
test=pd.read_csv('/kaggle/input/playground-series-s5e1/test.csv')
test.head()
       id       date country              store             product
0  230130 2017-01-01  Canada  Discount Stickers   Holographic Goose
1  230131 2017-01-01  Canada  Discount Stickers              Kaggle
2  230132 2017-01-01  Canada  Discount Stickers        Kaggle Tiers
3  230133 2017-01-01  Canada  Discount Stickers            Kerneler
4  230134 2017-01-01  Canada  Discount Stickers  Kerneler Dark Mode

Inspecting the Data

train["date"] = pd.to_datetime(train["date"])
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230130 entries, 0 to 230129
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   id        230130 non-null  int64
 1   date      230130 non-null  datetime64[ns]
 2   country   230130 non-null  object
 3   store     230130 non-null  object
 4   product   230130 non-null  object
 5   num_sold  221259 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 10.5+ MB

test["date"] = pd.to_datetime(test["date"])
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98550 entries, 0 to 98549
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       98550 non-null  int64
 1   date     98550 non-null  datetime64[ns]
 2   country  98550 non-null  object
 3   store    98550 non-null  object
 4   product  98550 non-null  object
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 3.8+ MB

Dropping Unneeded Columns

train=train.drop(['id'],axis=1)
test=test.drop(['id'],axis=1)

Check That the Train and Test Sets Share the Same Category Values

counter = 0
for col in test.select_dtypes(include=['object']).columns:
    # Symmetric difference of the category values in train vs. test
    if set(train[col].unique()) ^ set(test[col].unique()):
        print(col, 'needs to be worked on')
        counter += 1
if counter == 0:
    print('No work needed')

No work needed
That is, the category values in the two datasets are identical.

Category Counts in the Training Set

country=train['country'].value_counts().to_frame()
country.reset_index()
     country  count
0     Canada  38355
1    Finland  38355
2      Italy  38355
3      Kenya  38355
4     Norway  38355
5  Singapore  38355
train.groupby('country').size().plot(kind='pie', autopct='%.2f%%');

p1

store=train['store'].value_counts().to_frame()
store.reset_index()
train.groupby('store').size().plot(kind='pie', autopct='%.2f%%');

p2

product=train['product'].value_counts().to_frame()
product.reset_index()
              product  count
0   Holographic Goose  46026
1              Kaggle  46026
2        Kaggle Tiers  46026
3            Kerneler  46026
4  Kerneler Dark Mode  46026
train.groupby('product').size().plot(kind='pie', autopct='%.2f%%');

p3

From the analysis above, the rows are distributed perfectly evenly across every category.

counts = train.groupby(["country","store","product"])["num_sold"].count()
counts
Canada     Discount Stickers  Holographic Goose        0
                              Kaggle                2557
                              Kaggle Tiers          2557
                              Kerneler              2556
                              Kerneler Dark Mode    2557
                                                    ... 
Singapore  Stickers for Less  Holographic Goose     2557
                              Kaggle                2557
                              Kaggle Tiers          2557
                              Kerneler              2557
                              Kerneler Dark Mode    2557
Name: num_sold, Length: 90, dtype: int64

Checking for Missing Values in the Training Set

missing_data = counts.loc[counts != 2557]
missing_data
Canada   Discount Stickers     Holographic Goose        0
                               Kerneler              2556
         Premium Sticker Mart  Holographic Goose     2177
         Stickers for Less     Holographic Goose     1249
Kenya    Discount Stickers     Holographic Goose        0
                               Kerneler              2494
                               Kerneler Dark Mode    2556
         Premium Sticker Mart  Holographic Goose     1911
         Stickers for Less     Holographic Goose     1199
Name: num_sold, dtype: int64
import matplotlib.pyplot as plt

# Plot setup
f, axs = plt.subplots(9, 1, figsize=(20, 50))

for i, (country, store, product) in enumerate(missing_data.index):
    # Filter the data for the current combination of country, store, and product
    plot = train.loc[
        (train["country"] == country) & 
        (train["store"] == store) & 
        (train["product"] == product)
    ]
    
    if plot.empty:
        print(f"No data available for {country} - {store} - {product}")
        continue

    # Handle missing values in num_sold
    missing_vals = plot.loc[plot["num_sold"].isna()]
    
    # Plot the num_sold data
    axs[i].plot(plot["date"], plot["num_sold"], marker='o', linestyle='-', color='blue', label='num_sold')

    # Add vertical lines for missing dates
    for missing_date in missing_vals["date"].dropna():
        axs[i].axvline(missing_date, color='red', linestyle='-', linewidth=1, alpha=0.5)

    # Add title and labels
    axs[i].set_title(f"{country} - {store} - {product}")
    axs[i].set_xlabel("Date")
    axs[i].set_ylabel("Number Sold")
    axs[i].legend(loc='upper right')

# Adjust layout
plt.tight_layout()
plt.show()

The figure below shows how the missing values are distributed: the blue line is the observed series, and each red vertical line marks a date where num_sold is missing.

p4

As the figure shows, the missing values are unevenly distributed, and some series are missing entirely.
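
A quick way to list the completely missing series (a small sketch reusing the counts series computed above):

# Combinations with zero non-null observations, i.e. series that are entirely missing
print(counts.loc[counts == 0])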

Date Ranges of the Training and Test Sets

print("Train - Earliest date:", train["date"].min())
print("Train - Latest date:", train["date"].max())

print("Test - Earliest date:", test["date"].min())
print("Test - Latest date:", test["date"].max())

Train - Earliest date: 2010-01-01 00:00:00
Train - Latest date: 2016-12-31 00:00:00
Test - Earliest date: 2017-01-01 00:00:00
Test - Latest date: 2019-12-31 00:00:00

Aggregating Sales by Week, Month, and Year per Country, Store, and Product

weekly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
yearly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="Y")])["num_sold"].sum().rename("num_sold").reset_index()

Sales for Every Category Combination

def plot_all(df):
    import matplotlib.pyplot as plt

    # Set up the figure and axes
    f, axes = plt.subplots(3, 2, figsize=(20, 30), sharex=True, sharey=True)
    f.tight_layout(pad=5.0)  # Adjust layout for better spacing
    
    # Iterate over each unique product
    for n, prod in enumerate(df["product"].unique()):
        plot = df.loc[df["product"] == prod]
        
        # Get the current subplot axis
        ax = axes[n // 2, n % 2]
        
        # Plot data for each combination of country and store
        for country in plot["country"].unique():
            for store in plot["store"].unique():
                subset = plot.loc[
                    (plot["country"] == country) & 
                    (plot["store"] == store)
                ]
                if not subset.empty:
                    ax.plot(
                        subset["date"], 
                        subset["num_sold"], 
                        marker='o', 
                        label=f"{country}-{store}"
                    )
        
        # Set title and labels
        ax.set_title(f"Product: {prod}", fontsize=16)
        ax.set_xlabel("Date", fontsize=12)
        ax.set_ylabel("Number Sold", fontsize=12)
        ax.legend(title="Country-Store", loc='upper left', fontsize=10)
    
    # Hide unused subplots if any
    for idx in range(n + 1, len(axes.flatten())):
        axes.flatten()[idx].set_visible(False)
    
    # Show the plot
    plt.show()

Weekly view

plot_all(weekly)

p5

Monthly view

plot_all(monthly)

p6

Yearly view

plot_all(yearly)

p7

Country Share of Total Sales

country_weights = train.groupby("country")["num_sold"].sum() / train["num_sold"].sum()
country_weights.head()

country
Canada 0.172087
Finland 0.172362
Italy 0.124908
Kenya 0.004139
Norway 0.327695
Name: num_sold, dtype: float64

Daily Country Share of Sales

country_ratio_over_time = (train.groupby(["date","country"])["num_sold"].sum() / train.groupby(["date"])["num_sold"].sum()).reset_index()
country_ratio_over_time.head()
        date  country  num_sold
0 2010-01-01   Canada  0.176460
1 2010-01-01  Finland  0.179511
2 2010-01-01    Italy  0.136094
3 2010-01-01    Kenya  0.003424
4 2010-01-01   Norway  0.326333

Notice anything?
Each country's daily share of sales is essentially constant over time and matches its overall share computed above.
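
As a quick sanity check (a small sketch reusing country_ratio_over_time and country_weights from above), averaging the daily shares per country should reproduce the overall weights almost exactly:

# Mean daily share per country; compare with country_weights computed earlier
print(country_ratio_over_time.groupby("country")["num_sold"].mean())
print(country_weights)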

Plotting the Daily Shares

f, ax = plt.subplots(figsize=(20, 10))

# Loop through the countries and plot each one
for country in country_ratio_over_time['country'].unique():
    country_data = country_ratio_over_time[country_ratio_over_time['country'] == country]
    ax.plot(country_data['date'], country_data['num_sold'], label=country)

ax.set_ylabel("Proportion of sales")
ax.legend(title='Country')

plt.show()

p8

Fetching GDP-per-Capita Data for Each Country

gdp_per_capita = pd.read_csv("/kaggle/input/world-gdpgdp-gdp-per-capita-and-annual-growths/gdp_per_capita.csv")
years =  ["2010", "2011", "2012", "2013", "2014", "2015", "2016"]
gdp_per_capita = gdp_per_capita.loc[gdp_per_capita["Country Name"].isin(train["country"].unique()), ["Country Name"] + years].set_index("Country Name")
gdp_per_capita.head()
                      2010           2011           2012           2013          2014          2015          2016
Country Name
Canada        47562.083425   52223.696112   52669.089963   52635.174958  50955.998323  43596.135537  42315.603706
Finland       46459.973255   51081.997670   47710.790217   49878.043244  50260.299859  42784.698362  43784.283962
Italy         36000.520118   38599.062207   35053.526244   35549.974697  35518.415292  30230.226302  30939.714246
Kenya          1080.296184    1085.487152    1271.815383    1354.820833   1462.220052   1464.554009   1525.235192
Norway        87693.790066  100600.562408  101524.141852  102913.450844  97019.182753  74355.515858  70460.560532

There are many ways to obtain per-country GDP data; it can also be fetched directly through an API (covered in more detail in the next article), roughly as in the sketch below.
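
For reference only, here is a rough sketch of pulling GDP per capita (World Bank indicator NY.GDP.PCAP.CD) from the public World Bank API with requests; the ISO-3 country codes and the reshaping are my own assumptions for illustration, not part of the original notebook.

import requests
import pandas as pd

# ISO-3 codes for the six countries in the dataset (assumed mapping)
codes = ["CAN", "FIN", "ITA", "KEN", "NOR", "SGP"]

url = ("https://api.worldbank.org/v2/country/" + ";".join(codes)
       + "/indicator/NY.GDP.PCAP.CD")
resp = requests.get(url, params={"format": "json", "date": "2010:2016", "per_page": 500})
records = resp.json()[1]  # element 0 is paging metadata, element 1 holds the data

gdp_api = pd.DataFrame(
    {"Country Name": r["country"]["value"],
     "year": r["date"],
     "gdp_per_capita": r["value"]}
    for r in records
)
gdp_api = gdp_api.pivot(index="Country Name", columns="year", values="gdp_per_capita")
print(gdp_api)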

Computing Each Country's GDP-per-Capita Share

# Duplicate each year's ratio so it appears at both the first and the last day of the year,
# which makes the step-style plot below easier to draw
for year in years:
    gdp_per_capita[f"{year}_ratio"] = gdp_per_capita[year] / gdp_per_capita.sum()[year]
gdp_per_capita_ratios = gdp_per_capita[[i+"_ratio" for i in years]]
gdp_per_capita_ratios.columns = [int(i) for i in years]
gdp_per_capita_ratios = gdp_per_capita_ratios.unstack().reset_index().rename(columns = {"level_0": "year", 0: "ratio", "Country Name": "country"})
gdp_per_capita_ratios['year'] = pd.to_datetime(gdp_per_capita_ratios['year'], format='%Y')

gdp_per_capita_ratios_nextyear = gdp_per_capita_ratios.copy()
gdp_per_capita_ratios_nextyear["year"] = pd.to_datetime(gdp_per_capita_ratios_nextyear['year'].astype(str)) + pd.offsets.YearEnd(1)

gdp_per_capita_ratios = pd.concat([gdp_per_capita_ratios, gdp_per_capita_ratios_nextyear],ignore_index=True)
gdp_per_capita_ratios=gdp_per_capita_ratios.sort_values(by=['year'], ascending=True)
gdp_per_capita_ratios.reset_index(drop=True, inplace=True)
gdp_per_capita_ratios.head()
        year  country     ratio
0 2010-01-01   Canada  0.178782
1 2010-01-01  Finland  0.174639
2 2010-01-01    Italy  0.135323
3 2010-01-01    Kenya  0.004061
4 2010-01-01   Norway  0.329634

Have you spotted it again?
These GDP-per-capita shares line up almost exactly with the sales shares computed earlier.
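
A quick numeric check (a small sketch reusing train and gdp_per_capita_ratios from above; not part of the original notebook): compare each country's yearly share of sales with its GDP-per-capita share.

# Yearly share of total sales per country
sales_share = (
    train.assign(year=train["date"].dt.year)
         .groupby(["year", "country"])["num_sold"].sum()
)
sales_share = sales_share / sales_share.groupby(level="year").transform("sum")

# GDP-per-capita share per country and year (drop the duplicated year-end rows)
gdp_share = (
    gdp_per_capita_ratios.assign(year=gdp_per_capita_ratios["year"].dt.year)
                         .drop_duplicates(subset=["year", "country"])
                         .set_index(["year", "country"])["ratio"]
)

comparison = pd.concat(
    [sales_share.rename("sales_share"), gdp_share.rename("gdp_share")], axis=1
)
print((comparison["sales_share"] - comparison["gdp_share"]).abs().describe())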

f, ax = plt.subplots(figsize=(20, 20))

# Plot the first dataset (country_ratio_over_time)
for country in country_ratio_over_time['country'].unique():
    country_data = country_ratio_over_time[country_ratio_over_time['country'] == country]
    ax.plot(country_data['date'], country_data['num_sold'], label=country)

# Plot the second dataset (gdp_per_capita_ratios) with black color for all countries
for country in gdp_per_capita_ratios['country'].unique():
    country_data = gdp_per_capita_ratios[gdp_per_capita_ratios['country'] == country]
    ax.plot(country_data['year'], country_data['ratio'], color='black')

# Set ylabel
ax.set_ylabel("Proportion of sales")

# Show legend for the first dataset (country_ratio_over_time)
ax.legend(title='Country')

plt.show()

p9

The figure above overlays each country's GDP-per-capita share (black lines) on its share of sales.

gdp_per_capita_ratios_nextyear["year"] = gdp_per_capita_ratios_nextyear["year"].dt.year
def plot_adjust_country(df):
    new = df.copy()
    new["year"] = new["date"].dt.year
    
    for country in new["country"].unique():
        for year in new["year"].unique():
            new.loc[(new["country"] == country) & (new["year"] == year), "num_sold"] = new.loc[(new["country"] == country) & (new["year"] == year), "num_sold"] / gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["country"] == country) & (gdp_per_capita_ratios_nextyear["year"] == year), "ratio"].values[0]
            
    plot_all(new)

Plotting the Raw Data After Adjusting by GDP Weights

plot_adjust_country(monthly)

(figure: monthly sales per country/store after GDP adjustment)

Checking the Remaining Missing Values

print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
# Missing values remaining: 8871
train["year"] = train["date"].dt.year

Imputing the Missing Sales Using GDP Ratios from a Matching Country

The implementation is a bit long-winded.
As an example: take the GDP-per-capita ratios of Canada and Norway, look up Norway's sales of the same product, and scale them by that ratio to estimate Canada's sales (for 2010, ratio_can ≈ 0.178782 / 0.329634 ≈ 0.542).
The original notebook also notes that it is a bit risky to completely replace a couple of Kenyan stores, which are missing a lot of data, with scaled Norwegian data.

for year in train["year"].unique():
    # Impute Time Series 1 (Canada, Discount Stickers, Holographic Goose)
    target_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Norway"), "ratio"].values[0]
    current_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Canada"), "ratio"].values[0]
    ratio_can = current_ratio / target_ratio
    train.loc[(train["country"] == "Canada") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values

    train.loc[(train["country"] == "Canada") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    train.loc[(train["country"] == "Canada") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    
    # Impute Time Series 2 (Kenya, Discount Stickers, Holographic Goose)
    current_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Kenya"), "ratio"].values[0]
    ratio_ken = current_ratio / target_ratio
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose")& (train["year"] == year), "num_sold"] * ratio_ken).values

    # Going to be a bit risky and completely replace a couple of stores for Kenya with a lot of missing data with Norways data:
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Discount Stickers") & (train["product"] == "Kerneler") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Kerneler") & (train["year"] == year), "num_sold"] * ratio_ken).values
 

Missing Values After the Replacement

print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
# Missing values remaining: 2

Only two missing values remain.

# Locate the two remaining missing rows
missing_rows = train.loc[train["num_sold"].isna()]
missing_rows
             date country              store             product  num_sold  year
23719  2010-09-21   Kenya  Discount Stickers  Kerneler Dark Mode       NaN  2010
207003 2016-04-19  Canada  Discount Stickers            Kerneler       NaN  2016

Inspecting the Data Around Each Missing Value

a=train.loc[(train.country=='Kenya') & (train.store=='Discount Stickers') &
             (train['product']=='Kerneler Dark Mode')]
a.loc[a.date >= '2010-09-11'].head(20)
            date country              store             product  num_sold  year
22819 2010-09-11   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
22909 2010-09-12   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
22999 2010-09-13   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
23089 2010-09-14   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
23179 2010-09-15   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
23269 2010-09-16   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
23359 2010-09-17   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
23449 2010-09-18   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
23539 2010-09-19   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
23629 2010-09-20   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
23719 2010-09-21   Kenya  Discount Stickers  Kerneler Dark Mode       NaN  2010
23809 2010-09-22   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
23899 2010-09-23   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
23989 2010-09-24   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
24079 2010-09-25   Kenya  Discount Stickers  Kerneler Dark Mode       6.0  2010
24169 2010-09-26   Kenya  Discount Stickers  Kerneler Dark Mode       7.0  2010
24259 2010-09-27   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
24349 2010-09-28   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
24439 2010-09-29   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
24529 2010-09-30   Kenya  Discount Stickers  Kerneler Dark Mode       5.0  2010
a=train.loc[(train.country=='Canada') & (train.store=='Discount Stickers') &
             (train['product']=='Kerneler')]
a.loc[a.date >= '2016-04-15'].head(20)
             date country              store   product  num_sold  year
206643 2016-04-15  Canada  Discount Stickers  Kerneler     245.0  2016
206733 2016-04-16  Canada  Discount Stickers  Kerneler     239.0  2016
206823 2016-04-17  Canada  Discount Stickers  Kerneler     290.0  2016
206913 2016-04-18  Canada  Discount Stickers  Kerneler     229.0  2016
207003 2016-04-19  Canada  Discount Stickers  Kerneler       NaN  2016
207093 2016-04-20  Canada  Discount Stickers  Kerneler     245.0  2016
207183 2016-04-21  Canada  Discount Stickers  Kerneler     228.0  2016
207273 2016-04-22  Canada  Discount Stickers  Kerneler     246.0  2016
207363 2016-04-23  Canada  Discount Stickers  Kerneler     265.0  2016
207453 2016-04-24  Canada  Discount Stickers  Kerneler     298.0  2016
207543 2016-04-25  Canada  Discount Stickers  Kerneler     217.0  2016
207633 2016-04-26  Canada  Discount Stickers  Kerneler     226.0  2016
207723 2016-04-27  Canada  Discount Stickers  Kerneler     221.0  2016
207813 2016-04-28  Canada  Discount Stickers  Kerneler     223.0  2016
207903 2016-04-29  Canada  Discount Stickers  Kerneler     256.0  2016
207993 2016-04-30  Canada  Discount Stickers  Kerneler     215.0  2016
208083 2016-05-01  Canada  Discount Stickers  Kerneler     304.0  2016
208173 2016-05-02  Canada  Discount Stickers  Kerneler     222.0  2016
208263 2016-05-03  Canada  Discount Stickers  Kerneler     256.0  2016
208353 2016-05-04  Canada  Discount Stickers  Kerneler     229.0  2016

Filling the Last Two Gaps with Fixed Values

missing_rows = train.loc[train["num_sold"].isna()]
train.loc[train.index == 23719, "num_sold"] = 5
train.loc[train.index == 207003, "num_sold"] = 300
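
For reference, these two gaps could also be filled by interpolating within their own series instead of hard-coding values; a minimal sketch, assuming the train dataframe from above (the helper name is mine, not from the notebook):

def fill_by_time_interpolation(df, country, store, product):
    # Interpolate num_sold over time within a single country/store/product series
    mask = (
        (df["country"] == country)
        & (df["store"] == store)
        & (df["product"] == product)
    )
    series = df.loc[mask].set_index("date")["num_sold"]
    df.loc[mask, "num_sold"] = series.interpolate(method="time").values
    return df

# Usage (instead of the fixed values above):
# train = fill_by_time_interpolation(train, "Kenya", "Discount Stickers", "Kerneler Dark Mode")
# train = fill_by_time_interpolation(train, "Canada", "Discount Stickers", "Kerneler")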

print(f"Missing values remaining: {train['num_sold'].isna().sum()}")

Missing values remaining: 0

Store Share of Total Sales

weekly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
store_weights = train.groupby("store")["num_sold"].sum()/train["num_sold"].sum()
store_weights

store
Discount Stickers 0.184728
Premium Sticker Mart 0.441569
Stickers for Less 0.373702
Name: num_sold, dtype: float64

store_ratio_over_time = (train.groupby(["date","store"])["num_sold"].sum() / train.groupby(["date"])["num_sold"].sum()).reset_index()

Plotting the Store Shares Over Time

f, ax = plt.subplots(figsize=(20, 10))

# Loop through the unique stores and plot them
for store in store_ratio_over_time['store'].unique():
    store_data = store_ratio_over_time[store_ratio_over_time['store'] == store]
    ax.plot(store_data['date'], store_data['num_sold'], label=store)

# Set the ylabel
ax.set_ylabel("Proportion of sales")
plt.grid()

p11

Plot After Rescaling by Store Weights

def plot_adjusted_store(df):
    new = df.copy()
    weights = store_weights.loc["Premium Sticker Mart"] / store_weights
    print(weights)
    for store in weights.index:
        new.loc[new["store"] == store, "num_sold"] = new.loc[new["store"] == store, "num_sold"] * weights[store]
    plot_all(new)
plot_adjusted_store(monthly)

store
Discount Stickers 2.390369
Premium Sticker Mart 1.000000
Stickers for Less 1.181606
Name: num_sold, dtype: float64

p12

Sales by Product

product_ = train.groupby(["date","product"])["num_sold"].sum().reset_index()
f, ax = plt.subplots(figsize=(20, 10))

# Loop through the unique products and plot them
for product in product_['product'].unique():
    product_data = product_[product_['product'] == product]
    ax.plot(product_data['date'], product_data['num_sold'], label=product)

# Set labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")

# Add a legend
ax.legend(title="Product")

# Display the plot
plt.grid()

p13

Plotting Each Product's Share of Daily Sales

product_ratio = product_.pivot(index="date", columns="product", values="num_sold")
product_ratio = product_ratio.apply(lambda x: x/x.sum(),axis=1)
product_ratio = product_ratio.stack().rename("ratios").reset_index()

f, ax = plt.subplots(figsize=(20, 10))

# Loop through the unique products and plot them
for product in product_ratio['product'].unique():
    product_data = product_ratio[product_ratio['product'] == product]
    ax.plot(product_data['date'], product_data['ratios'], label=product)

# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Ratios")

# Add a legend
ax.legend(title="Product")

# Display the plot
plt.grid()

p14

Comparing the two figures above, the curves are noticeably smoother once sales are converted to shares.

Total Daily Sales

train_ = train.groupby(["date"])["num_sold"].sum().reset_index()
f, ax = plt.subplots(figsize=(20, 10))

# Plot the data
ax.plot(train_['date'], train_['num_sold'], linewidth=1)

# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")

# Display the plot
plt.grid()

p15

weekly = train_.groupby([pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train_.groupby([pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()

Monthly Totals

f, ax = plt.subplots(figsize=(20, 10))

# Plot the data
ax.plot(monthly['date'], monthly['num_sold'], linewidth=1)

# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")

# Display the plot
plt.grid()

p16

Weekly Totals

f, ax = plt.subplots(figsize=(20, 10))

# Plot the data
ax.plot(weekly['date'][1:-1], weekly['num_sold'][1:-1], linewidth=1)

# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")

# Display the plot
plt.grid()

p17

Seasonality Analysis

def plot_seasonality(df, x_axis):
    # Create new columns for seasonality analysis
    df["month"] = df["date"].dt.month
    df["day_of_week"] = df["date"].dt.dayofweek
    df["day_of_year"] = df['date'].apply(
        lambda x: x.timetuple().tm_yday if not (x.is_leap_year and x.month > 2) else x.timetuple().tm_yday - 1
    )

    # Create the plot
    f, ax = plt.subplots(1, 1, figsize=(20, 8))

    # Group the data by the specified x-axis value and calculate the mean of num_sold
    grouped_data = df.groupby(x_axis)["num_sold"].mean().reset_index()

    # Plot the data
    ax.plot(grouped_data[x_axis], grouped_data["num_sold"], linewidth=1)

    # Set title and labels
    ax.set_title(f"{x_axis} Seasonality")
    ax.set_xlabel(x_axis)
    ax.set_ylabel("Number of Units Sold")

    # Display the plot
    plt.grid()
plot_seasonality(train_, "month")

p18

plot_seasonality(train_, "day_of_week")

p19

plot_seasonality(train_, "day_of_year")

p20

def plot_individual_ts(df):
    for country in df["country"].unique():
        # Create subplots with the appropriate number of rows
        fig, axes = plt.subplots(
            nrows=df["store"].nunique() * df["product"].nunique(),
            figsize=(20, 70),
            constrained_layout=True
        )
        
        count = 0
        for store in df["store"].unique():
            for product in df["product"].unique():
                # Filter the DataFrame for the current country, store, and product
                plot_df = df.loc[
                    (df["product"] == product) & 
                    (df["country"] == country) & 
                    (df["store"] == store)
                ]
                
                if len(plot_df) > 0:  # Only plot if data is available
                    axes[count].plot(plot_df["date"], plot_df["num_sold"], linewidth=0.5)
                    axes[count].set_title(f"{country} - {store} - {product}")
                
                count += 1
        
        # Remove any unused subplots (if any)
        for idx in range(count, len(axes)):
            fig.delaxes(axes[idx])

        plt.grid()

plot_individual_ts(train.reset_index(drop=True))

Sales plots for each country, store, and product:

(figure)

Update the Missing Values and Save the New Dataset

df=pd.read_csv('/kaggle/input/playground-series-s5e1/train.csv')
df['num_sold']=train['num_sold']
df.to_csv('train.csv', index = False)

Summary

The dataset has the following characteristics:

  1. It is a time-series dataset, and product sales follow a clear periodic pattern.
  2. The seasonal curves look roughly sinusoidal.
  3. Product sales are strongly positively correlated with each country's GDP per capita.
  4. Sales also depend on whether the day falls on a weekend.

These observations point the way for the feature engineering in the next modeling step; see the sketch below.
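
As a starting point for that feature engineering, here is a minimal sketch (my own illustration, not code from the competition notebook) of the features these observations suggest: cyclic encodings of the date, a weekend flag, and the country's GDP-per-capita share for the year. The helper name add_basic_features is hypothetical.

import numpy as np

def add_basic_features(df, gdp_ratios):
    """Add simple features suggested by the EDA above.

    gdp_ratios is assumed to be a DataFrame with columns
    ['year', 'country', 'ratio'] (integer year), e.g.
    gdp_per_capita_ratios_nextyear[['year', 'country', 'ratio']].
    """
    out = df.copy()
    out["year"] = out["date"].dt.year
    out["day_of_week"] = out["date"].dt.dayofweek
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)

    # Sinusoidal encoding of the yearly cycle (observation 2)
    day_of_year = out["date"].dt.dayofyear
    out["sin_doy"] = np.sin(2 * np.pi * day_of_year / 365.25)
    out["cos_doy"] = np.cos(2 * np.pi * day_of_year / 365.25)

    # Country GDP-per-capita share for that year (observation 3)
    out = out.merge(gdp_ratios, on=["year", "country"], how="left")
    return out

# Example usage (assuming train and gdp_per_capita_ratios_nextyear from above):
# train_fe = add_basic_features(train, gdp_per_capita_ratios_nextyear[["year", "country", "ratio"]])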