Preface
This article walks through the first Kaggle Playground monthly competition of 2025, Forecasting Sticker Sales. Like the earlier entries in this series it is suitable for beginners, but unlike them the dataset is a time series: the task is to predict future sales from past sales. This installment therefore focuses on how to fill in missing values.
Problem description
The dataset has just four feature columns plus one label: date, country, store, product, and num_sold (the label, i.e. units sold). Very simple.
Loading the data
import pandas as pd
train=pd.read_csv('/kaggle/input/playground-series-s5e1/train.csv')
train.head()
_ | id | date | country | store | product | num_sold |
---|---|---|---|---|---|---|
0 | 0 | 2010-01-01 | Canada | Discount Stickers | Holographic Goose | NaN |
1 | 1 | 2010-01-01 | Canada | Discount Stickers | Kaggle | 973.0 |
2 | 2 | 2010-01-01 | Canada | Discount Stickers | Kaggle Tiers | 906.0 |
3 | 3 | 2010-01-01 | Canada | Discount Stickers | Kerneler | 423.0 |
4 | 4 | 2010-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | 491.0 |
test=pd.read_csv('/kaggle/input/playground-series-s5e1/test.csv')
test.head()
_ | id | date | country | store | product | num_sold |
---|---|---|---|---|---|---|
0 | 230130 | 2017-01-01 | Canada | Discount Stickers | Holographic Goose | |
1 | 230131 | 2017-01-01 | Canada | Discount Stickers | Kaggle | |
2 | 230132 | 2017-01-01 | Canada | Discount Stickers | Kaggle Tiers | |
3 | 230133 | 2017-01-01 | Canada | Discount Stickers | Kerneler | |
4 | 230134 | 2017-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | |
Inspecting the data
train["date"] = pd.to_datetime(train["date"])  # infer_datetime_format is deprecated in pandas 2.x
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230130 entries, 0 to 230129
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   id        230130 non-null  int64
 1   date      230130 non-null  datetime64[ns]
 2   country   230130 non-null  object
 3   store     230130 non-null  object
 4   product   230130 non-null  object
 5   num_sold  221259 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 10.5+ MB
test["date"] = pd.to_datetime(test["date"])
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98550 entries, 0 to 98549
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       98550 non-null  int64
 1   date     98550 non-null  datetime64[ns]
 2   country  98550 non-null  object
 3   store    98550 non-null  object
 4   product  98550 non-null  object
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 3.8+ MB
Dropping unneeded columns
train=train.drop(['id'],axis=1)
test=test.drop(['id'],axis=1)
Checking that the train and test sets contain the same category values
counter = 0
for i in test.select_dtypes(include=['object']).columns.tolist():
    if len(set(train[i].unique()) ^ set(test[i].unique())) != 0:
        print(i, 'need to be worked on')
        counter += 1
if counter == 0:
    print('No work needed')
No work needed
So the categorical values in the two datasets match.
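The `^` used in the check above is Python's set symmetric difference: it returns the values present in exactly one of the two sets, so an empty result means the category values match. A minimal illustration:

```python
# Symmetric difference keeps elements that appear in exactly one of the two sets
a = {"Canada", "Kenya"}
b = {"Canada", "Norway"}
print(a ^ b)       # {'Kenya', 'Norway'} (set ordering is not guaranteed)
print(len(a ^ a))  # 0 -> identical sets, i.e. "No work needed"
```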
Checking the class balance in the training set
country=train['country'].value_counts().to_frame()
country.reset_index()
- | country | count |
---|---|---|
0 | Canada | 38355 |
1 | Finland | 38355 |
2 | Italy | 38355 |
3 | Kenya | 38355 |
4 | Norway | 38355 |
5 | Singapore | 38355 |
train.groupby('country').size().plot(kind='pie', autopct='%.2f%%');
product=train['product'].value_counts().to_frame()
product.reset_index()
- | product | count |
---|---|---|
0 | Holographic Goose | 46026 |
1 | Kaggle | 46026 |
2 | Kaggle Tiers | 46026 |
3 | Kerneler | 46026 |
4 | Kerneler Dark Mode | 46026 |
train.groupby('product').size().plot(kind='pie', autopct='%.2f%%');
From the above, the category combinations are distributed perfectly evenly.
counts = train.groupby(["country","store","product"])["num_sold"].count()
counts
Canada Discount Stickers Holographic Goose 0
Kaggle 2557
Kaggle Tiers 2557
Kerneler 2556
Kerneler Dark Mode 2557
...
Singapore Stickers for Less Holographic Goose 2557
Kaggle 2557
Kaggle Tiers 2557
Kerneler 2557
Kerneler Dark Mode 2557
Name: num_sold, Length: 90, dtype: int64
Checking missing values in the training set
missing_data = counts.loc[counts != 2557]
missing_data
Canada Discount Stickers Holographic Goose 0
Kerneler 2556
Premium Sticker Mart Holographic Goose 2177
Stickers for Less Holographic Goose 1249
Kenya Discount Stickers Holographic Goose 0
Kerneler 2494
Kerneler Dark Mode 2556
Premium Sticker Mart Holographic Goose 1911
Stickers for Less Holographic Goose 1199
Name: num_sold, dtype: int64
import matplotlib.pyplot as plt

# Plot setup
f, axs = plt.subplots(9, 1, figsize=(20, 50))
for i, (country, store, product) in enumerate(missing_data.index):
    # Filter the data for the current combination of country, store, and product
    plot = train.loc[
        (train["country"] == country) &
        (train["store"] == store) &
        (train["product"] == product)
    ]
    if plot.empty:
        print(f"No data available for {country} - {store} - {product}")
        continue
    # Rows where num_sold is missing
    missing_vals = plot.loc[plot["num_sold"].isna()]
    # Plot the num_sold data
    axs[i].plot(plot["date"], plot["num_sold"], marker='o', linestyle='-', color='blue', label='num_sold')
    # Add vertical lines for missing dates
    for missing_date in missing_vals["date"].dropna():
        axs[i].axvline(missing_date, color='red', linestyle='-', linewidth=1, alpha=0.5)
    # Add title and labels
    axs[i].set_title(f"{country} - {store} - {product}")
    axs[i].set_xlabel("Date")
    axs[i].set_ylabel("Number Sold")
    axs[i].legend(loc='upper right')
# Adjust layout
plt.tight_layout()
plt.show()
The figure shows where the missing values fall: the blue line is the observed series, and each red vertical line marks a date with a missing value.
The missingness patterns differ widely; some series are missing entirely.
Checking the date ranges of train and test
print("Train - Earliest date:", train["date"].min())
print("Train - Latest date:", train["date"].max())
print("Test - Earliest date:", test["date"].min())
print("Test - Latest date:", test["date"].max())
Train - Earliest date: 2010-01-01 00:00:00
Train - Latest date: 2016-12-31 00:00:00
Test - Earliest date: 2017-01-01 00:00:00
Test - Latest date: 2019-12-31 00:00:00
Aggregating weekly, monthly, and yearly sales per country, store, and product
weekly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
yearly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="Y")])["num_sold"].sum().rename("num_sold").reset_index()
Total sales for each category combination
def plot_all(df):
    import matplotlib.pyplot as plt
    # Set up the figure and axes
    f, axes = plt.subplots(3, 2, figsize=(20, 30), sharex=True, sharey=True)
    f.tight_layout(pad=5.0)  # Adjust layout for better spacing
    # Iterate over each unique product
    for n, prod in enumerate(df["product"].unique()):
        plot = df.loc[df["product"] == prod]
        # Get the current subplot axis
        ax = axes[n // 2, n % 2]
        # Plot data for each combination of country and store
        for country in plot["country"].unique():
            for store in plot["store"].unique():
                subset = plot.loc[
                    (plot["country"] == country) &
                    (plot["store"] == store)
                ]
                if not subset.empty:
                    ax.plot(
                        subset["date"],
                        subset["num_sold"],
                        marker='o',
                        label=f"{country}-{store}"
                    )
        # Set title and labels
        ax.set_title(f"Product: {prod}", fontsize=16)
        ax.set_xlabel("Date", fontsize=12)
        ax.set_ylabel("Number Sold", fontsize=12)
        ax.legend(title="Country-Store", loc='upper left', fontsize=10)
    # Hide unused subplots if any
    for idx in range(n + 1, len(axes.flatten())):
        axes.flatten()[idx].set_visible(False)
    # Show the plot
    plt.show()
Weekly view
plot_all(weekly)
Monthly view
plot_all(monthly)
Yearly view
plot_all(yearly)
Each country's share of total sales
country_weights = train.groupby("country")["num_sold"].sum() / train["num_sold"].sum()
country_weights.head()
country
Canada 0.172087
Finland 0.172362
Italy 0.124908
Kenya 0.004139
Norway 0.327695
Name: num_sold, dtype: float64
Each country's share of daily sales
country_ratio_over_time = (train.groupby(["date","country"])["num_sold"].sum() / train.groupby(["date"])["num_sold"].sum()).reset_index()
country_ratio_over_time.head()
- | date | country | num_sold |
---|---|---|---|
0 | 2010-01-01 | Canada | 0.176460 |
1 | 2010-01-01 | Finland | 0.179511 |
2 | 2010-01-01 | Italy | 0.136094 |
3 | 2010-01-01 | Kenya | 0.003424 |
4 | 2010-01-01 | Norway | 0.326333 |
Notice anything?
Each country's share of daily sales matches its overall share: the country proportions are essentially constant over time.
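One quick way to sanity-check that claim is to compute the spread of each country's daily share over time. A sketch on a made-up two-day frame (the toy data is mine, not the competition data) using the same groupby-division idiom as above:

```python
import pandas as pd

# Hypothetical mini-frame standing in for `train`: two days, two countries,
# with sales in a fixed 3:1 proportion on both days.
toy = pd.DataFrame({
    "date": ["2010-01-01"] * 2 + ["2010-01-02"] * 2,
    "country": ["Norway", "Kenya"] * 2,
    "num_sold": [300, 100, 600, 200],
})
# Per-(date, country) total divided by the per-date total: the daily share.
daily_ratio = (
    toy.groupby(["date", "country"])["num_sold"].sum()
    / toy.groupby("date")["num_sold"].sum()
)
# If the shares are stable, their standard deviation across days is ~0.
spread = daily_ratio.groupby("country").std()
print(spread.max())  # 0.0 for this toy data
```

On the real data the same statistic stays small, which is what the flat lines in the next plot show.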
Plotting the daily shares
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the countries and plot each one
for country in country_ratio_over_time['country'].unique():
    country_data = country_ratio_over_time[country_ratio_over_time['country'] == country]
    ax.plot(country_data['date'], country_data['num_sold'], label=country)
ax.set_ylabel("Proportion of sales")
ax.legend(title='Country')
plt.show()
Getting per-country GDP data
gdp_per_capita = pd.read_csv("/kaggle/input/world-gdpgdp-gdp-per-capita-and-annual-growths/gdp_per_capita.csv")
years = ["2010", "2011", "2012", "2013", "2014", "2015", "2016"]
gdp_per_capita = gdp_per_capita.loc[gdp_per_capita["Country Name"].isin(train["country"].unique()), ["Country Name"] + years].set_index("Country Name")
gdp_per_capita.head()
Country Name | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
---|---|---|---|---|---|---|---|
Canada | 47562.083425 | 52223.696112 | 52669.089963 | 52635.174958 | 50955.998323 | 43596.135537 | 42315.603706 |
Finland | 46459.973255 | 51081.997670 | 47710.790217 | 49878.043244 | 50260.299859 | 42784.698362 | 43784.283962 |
Italy | 36000.520118 | 38599.062207 | 35053.526244 | 35549.974697 | 35518.415292 | 30230.226302 | 30939.714246 |
Kenya | 1080.296184 | 1085.487152 | 1271.815383 | 1354.820833 | 1462.220052 | 1464.554009 | 1525.235192 |
Norway | 87693.790066 | 100600.562408 | 101524.141852 | 102913.450844 | 97019.182753 | 74355.515858 | 70460.560532 |
There are many ways to obtain GDP data; it can also be fetched directly via an API, which the next article covers.
Computing each country's GDP share
# The steps below also repeat each year's value from the first day to the
# last day of the year, purely to make the plot below easier to draw.
for year in years:
    gdp_per_capita[f"{year}_ratio"] = gdp_per_capita[year] / gdp_per_capita[year].sum()
gdp_per_capita_ratios = gdp_per_capita[[i+"_ratio" for i in years]]
gdp_per_capita_ratios.columns = [int(i) for i in years]
gdp_per_capita_ratios = gdp_per_capita_ratios.unstack().reset_index().rename(columns = {"level_0": "year", 0: "ratio", "Country Name": "country"})
gdp_per_capita_ratios['year'] = pd.to_datetime(gdp_per_capita_ratios['year'], format='%Y')
gdp_per_capita_ratios_nextyear = gdp_per_capita_ratios.copy()
gdp_per_capita_ratios_nextyear["year"] = pd.to_datetime(gdp_per_capita_ratios_nextyear['year'].astype(str)) + pd.offsets.YearEnd(1)
gdp_per_capita_ratios = pd.concat([gdp_per_capita_ratios, gdp_per_capita_ratios_nextyear],ignore_index=True)
gdp_per_capita_ratios=gdp_per_capita_ratios.sort_values(by=['year'], ascending=True)
gdp_per_capita_ratios.reset_index(drop=True, inplace=True)
gdp_per_capita_ratios.head()
- | year | country | ratio |
---|---|---|---|
0 | 2010-01-01 | Canada | 0.178782 |
1 | 2010-01-01 | Finland | 0.174639 |
2 | 2010-01-01 | Italy | 0.135323 |
3 | 2010-01-01 | Kenya | 0.004061 |
4 | 2010-01-01 | Norway | 0.329634 |
Notice anything else?
These GDP shares line up with the sales-share weights computed earlier.
f, ax = plt.subplots(figsize=(20, 20))
# Plot the first dataset (country_ratio_over_time)
for country in country_ratio_over_time['country'].unique():
    country_data = country_ratio_over_time[country_ratio_over_time['country'] == country]
    ax.plot(country_data['date'], country_data['num_sold'], label=country)
# Plot the second dataset (gdp_per_capita_ratios) in black for all countries
for country in gdp_per_capita_ratios['country'].unique():
    country_data = gdp_per_capita_ratios[gdp_per_capita_ratios['country'] == country]
    ax.plot(country_data['year'], country_data['ratio'], color='black')
# Set ylabel
ax.set_ylabel("Proportion of sales")
# Show legend for the first dataset (country_ratio_over_time)
ax.legend(title='Country')
plt.show()
The figure above overlays each country's GDP share (black lines) on its share of daily sales.
gdp_per_capita_ratios_nextyear["year"] = gdp_per_capita_ratios_nextyear["year"].dt.year
def plot_adjust_country(df):
    new = df.copy()
    new["year"] = new["date"].dt.year
    for country in new["country"].unique():
        for year in new["year"].unique():
            mask = (new["country"] == country) & (new["year"] == year)
            ratio = gdp_per_capita_ratios_nextyear.loc[
                (gdp_per_capita_ratios_nextyear["country"] == country) &
                (gdp_per_capita_ratios_nextyear["year"] == year), "ratio"
            ].values[0]
            new.loc[mask, "num_sold"] = new.loc[mask, "num_sold"] / ratio
    plot_all(new)
Plotting the data after adjusting by GDP share
plot_adjust_country(monthly)
Counting the missing values
print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
# Missing values remaining: 8871
train["year"] = train["date"].dt.year
Filling the remaining gaps by GDP-share matching against another country
The implementation is a bit involved.
For example: take the GDP shares of Canada and Norway, look up Norway's sales of the same product, and scale them by the ratio of the two shares to fill in Canada's sales.
Note that, as the code below admits, completely replacing a couple of Kenyan stores that have a lot of missing data with Norway's data is a bit risky.
for year in train["year"].unique():
    # Impute the Canada series (Holographic Goose, all three stores) from Norway's
    target_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Norway"), "ratio"].values[0]
    current_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Canada"), "ratio"].values[0]
    ratio_can = current_ratio / target_ratio
    train.loc[(train["country"] == "Canada") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    train.loc[(train["country"] == "Canada") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    train.loc[(train["country"] == "Canada") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_can).values
    # Impute the Kenya series the same way
    current_ratio = gdp_per_capita_ratios_nextyear.loc[(gdp_per_capita_ratios_nextyear["year"] == year) & (gdp_per_capita_ratios_nextyear["country"] == "Kenya"), "ratio"].values[0]
    ratio_ken = current_ratio / target_ratio
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    # Going to be a bit risky and completely replace a couple of stores for Kenya with a lot of missing data with Norway's data:
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Premium Sticker Mart") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Stickers for Less") & (train["product"] == "Holographic Goose") & (train["year"] == year), "num_sold"] * ratio_ken).values
    train.loc[(train["country"] == "Kenya") & (train["store"] == "Discount Stickers") & (train["product"] == "Kerneler") & (train["year"] == year), "num_sold"] = (train.loc[(train["country"] == "Norway") & (train["store"] == "Discount Stickers") & (train["product"] == "Kerneler") & (train["year"] == year), "num_sold"] * ratio_ken).values
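The loop above repeats the same `.loc` assignment pattern many times; the core idea could be wrapped in a helper. This is a hypothetical refactor of the same GDP-ratio imputation (the function name and signature are mine, not from the original notebook):

```python
import pandas as pd

def impute_from_reference(df, ratios, target_country, ref_country,
                          store, product, year):
    """Rebuild one (country, store, product, year) slice by scaling the
    reference country's daily sales with the ratio of the two GDP shares.
    `ratios` is assumed to have columns year / country / ratio."""
    r_ref = ratios.loc[(ratios["year"] == year) &
                       (ratios["country"] == ref_country), "ratio"].values[0]
    r_tgt = ratios.loc[(ratios["year"] == year) &
                       (ratios["country"] == target_country), "ratio"].values[0]
    scale = r_tgt / r_ref
    src = ((df["country"] == ref_country) & (df["store"] == store) &
           (df["product"] == product) & (df["year"] == year))
    dst = ((df["country"] == target_country) & (df["store"] == store) &
           (df["product"] == product) & (df["year"] == year))
    # Both masks select the same number of day-rows, so the scaled values
    # can be assigned positionally via .values.
    df.loc[dst, "num_sold"] = (df.loc[src, "num_sold"] * scale).values
    return df
```

With such a helper, each imputed slice becomes one call, e.g. `impute_from_reference(train, gdp_per_capita_ratios_nextyear, "Canada", "Norway", "Discount Stickers", "Holographic Goose", year)`.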
Checking missing values after the replacement
print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
# Missing values remaining: 2
Only 2 missing values remain.
# Locate the two remaining missing rows
missing_rows = train.loc[train["num_sold"].isna()]
missing_rows
- | date | country | store | product | num_sold | year |
---|---|---|---|---|---|---|
23719 | 2010-09-21 | Kenya | Discount Stickers | Kerneler Dark Mode | NaN | 2010 |
207003 | 2016-04-19 | Canada | Discount Stickers | Kerneler | NaN | 2016 |
Inspecting the days around each missing value
a = train.loc[(train.country == 'Kenya') & (train.store == 'Discount Stickers') &
              (train['product'] == 'Kerneler Dark Mode')]
a.loc[a.date >= '2010-09-11'].head(20)
- | date | country | store | product | num_sold | year |
---|---|---|---|---|---|---|
22819 | 2010-09-11 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
22909 | 2010-09-12 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
22999 | 2010-09-13 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23089 | 2010-09-14 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23179 | 2010-09-15 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23269 | 2010-09-16 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23359 | 2010-09-17 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23449 | 2010-09-18 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23539 | 2010-09-19 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
23629 | 2010-09-20 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23719 | 2010-09-21 | Kenya | Discount Stickers | Kerneler Dark Mode | NaN | 2010 |
23809 | 2010-09-22 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23899 | 2010-09-23 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
23989 | 2010-09-24 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24079 | 2010-09-25 | Kenya | Discount Stickers | Kerneler Dark Mode | 6.0 | 2010 |
24169 | 2010-09-26 | Kenya | Discount Stickers | Kerneler Dark Mode | 7.0 | 2010 |
24259 | 2010-09-27 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24349 | 2010-09-28 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24439 | 2010-09-29 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
24529 | 2010-09-30 | Kenya | Discount Stickers | Kerneler Dark Mode | 5.0 | 2010 |
a = train.loc[(train.country == 'Canada') & (train.store == 'Discount Stickers') &
              (train['product'] == 'Kerneler')]
a.loc[a.date >= '2016-04-15'].head(20)
- | date | country | store | product | num_sold | year |
---|---|---|---|---|---|---|
206643 | 2016-04-15 | Canada | Discount Stickers | Kerneler | 245.0 | 2016 |
206733 | 2016-04-16 | Canada | Discount Stickers | Kerneler | 239.0 | 2016 |
206823 | 2016-04-17 | Canada | Discount Stickers | Kerneler | 290.0 | 2016 |
206913 | 2016-04-18 | Canada | Discount Stickers | Kerneler | 229.0 | 2016 |
207003 | 2016-04-19 | Canada | Discount Stickers | Kerneler | NaN | 2016 |
207093 | 2016-04-20 | Canada | Discount Stickers | Kerneler | 245.0 | 2016 |
207183 | 2016-04-21 | Canada | Discount Stickers | Kerneler | 228.0 | 2016 |
207273 | 2016-04-22 | Canada | Discount Stickers | Kerneler | 246.0 | 2016 |
207363 | 2016-04-23 | Canada | Discount Stickers | Kerneler | 265.0 | 2016 |
207453 | 2016-04-24 | Canada | Discount Stickers | Kerneler | 298.0 | 2016 |
207543 | 2016-04-25 | Canada | Discount Stickers | Kerneler | 217.0 | 2016 |
207633 | 2016-04-26 | Canada | Discount Stickers | Kerneler | 226.0 | 2016 |
207723 | 2016-04-27 | Canada | Discount Stickers | Kerneler | 221.0 | 2016 |
207813 | 2016-04-28 | Canada | Discount Stickers | Kerneler | 223.0 | 2016 |
207903 | 2016-04-29 | Canada | Discount Stickers | Kerneler | 256.0 | 2016 |
207993 | 2016-04-30 | Canada | Discount Stickers | Kerneler | 215.0 | 2016 |
208083 | 2016-05-01 | Canada | Discount Stickers | Kerneler | 304.0 | 2016 |
208173 | 2016-05-02 | Canada | Discount Stickers | Kerneler | 222.0 | 2016 |
208263 | 2016-05-03 | Canada | Discount Stickers | Kerneler | 256.0 | 2016 |
208353 | 2016-05-04 | Canada | Discount Stickers | Kerneler | 229.0 | 2016 |
Filling the last two missing values with fixed values
missing_rows = train.loc[train["num_sold"].isna()]
# Fill each gap with a value consistent with the surrounding days
train.loc[train.index == 23719, "num_sold"] = 5
train.loc[train.index == 207003, "num_sold"] = 300
print(f"Missing values remaining: {train['num_sold'].isna().sum()}")
Missing values remaining: 0
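Hand-picking the two values works here because each gap is a single isolated day. A less manual alternative would be time-based interpolation from the neighbouring days; a sketch on a toy series (the dates mimic the Kenya gap, this is not the notebook's actual code):

```python
import numpy as np
import pandas as pd

# Toy daily series with one gap, mimicking the isolated NaN on 2010-09-21.
s = pd.Series([6.0, 5.0, np.nan, 5.0, 6.0],
              index=pd.date_range("2010-09-19", periods=5, freq="D"))
# method="time" requires a DatetimeIndex and weights by elapsed time,
# so an isolated gap gets the midpoint of its two neighbours.
filled = s.interpolate(method="time")
print(filled.loc["2010-09-21"])  # 5.0
```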
Each store's share of sales
weekly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train.groupby(["country","store", "product", pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
store_weights = train.groupby("store")["num_sold"].sum()/train["num_sold"].sum()
store_weights
store
Discount Stickers 0.184728
Premium Sticker Mart 0.441569
Stickers for Less 0.373702
Name: num_sold, dtype: float64
store_ratio_over_time = (train.groupby(["date","store"])["num_sold"].sum() / train.groupby(["date"])["num_sold"].sum()).reset_index()
Plotting each store's share of sales over time
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the unique stores and plot them
for store in store_ratio_over_time['store'].unique():
    store_data = store_ratio_over_time[store_ratio_over_time['store'] == store]
    ax.plot(store_data['date'], store_data['num_sold'], label=store)
# Set the ylabel
ax.set_ylabel("Proportion of sales")
plt.grid()
Plotting after rescaling the stores
def plot_adjusted_store(df):
    new = df.copy()
    weights = store_weights.loc["Premium Sticker Mart"] / store_weights
    print(weights)
    for store in weights.index:
        new.loc[new["store"] == store, "num_sold"] = new.loc[new["store"] == store, "num_sold"] * weights[store]
    plot_all(new)

plot_adjusted_store(monthly)
store
Discount Stickers 2.390369
Premium Sticker Mart 1.000000
Stickers for Less 1.181606
Name: num_sold, dtype: float64
Sales by product
product_ = train.groupby(["date","product"])["num_sold"].sum().reset_index()
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the unique products and plot them
for product in product_['product'].unique():
    product_data = product_[product_['product'] == product]
    ax.plot(product_data['date'], product_data['num_sold'], label=product)
# Set labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Add a legend
ax.legend(title="Product")
# Display the plot
plt.grid()
Plotting each product's share of sales (converted to ratios)
product_ratio = product_.pivot(index="date", columns="product", values="num_sold")
product_ratio = product_ratio.apply(lambda x: x/x.sum(),axis=1)
product_ratio = product_ratio.stack().rename("ratios").reset_index()
f, ax = plt.subplots(figsize=(20, 10))
# Loop through the unique products and plot them
for product in product_ratio['product'].unique():
    product_data = product_ratio[product_ratio['product'] == product]
    ax.plot(product_data['date'], product_data['ratios'], label=product)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Ratios")
# Add a legend
ax.legend(title="Product")
# Display the plot
plt.grid()
Comparing the two figures, the ratio curves are noticeably smoother than the raw counts.
Daily total sales
train_ = train.groupby(["date"])["num_sold"].sum().reset_index()
f, ax = plt.subplots(figsize=(20, 10))
# Plot the data
ax.plot(train_['date'], train_['num_sold'], linewidth=1)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Display the plot
plt.grid()
weekly = train_.groupby([pd.Grouper(key="date", freq="W")])["num_sold"].sum().rename("num_sold").reset_index()
monthly = train_.groupby([pd.Grouper(key="date", freq="MS")])["num_sold"].sum().rename("num_sold").reset_index()
Monthly totals
f, ax = plt.subplots(figsize=(20, 10))
# Plot the data
ax.plot(monthly['date'], monthly['num_sold'], linewidth=1)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Display the plot
plt.grid()
Weekly totals
f, ax = plt.subplots(figsize=(20, 10))
# Plot the data
ax.plot(weekly['date'][1:-1], weekly['num_sold'][1:-1], linewidth=1)
# Set axis labels
ax.set_xlabel("Date")
ax.set_ylabel("Number of Units Sold")
# Display the plot
plt.grid()
Seasonality
def plot_seasonality(df, x_axis):
    # Create new columns for seasonality analysis
    df["month"] = df["date"].dt.month
    df["day_of_week"] = df["date"].dt.dayofweek
    df["day_of_year"] = df['date'].apply(
        lambda x: x.timetuple().tm_yday if not (x.is_leap_year and x.month > 2) else x.timetuple().tm_yday - 1
    )
    # Create the plot
    f, ax = plt.subplots(1, 1, figsize=(20, 8))
    # Group the data by the specified x-axis value and calculate the mean of num_sold
    grouped_data = df.groupby(x_axis)["num_sold"].mean().reset_index()
    # Plot the data
    ax.plot(grouped_data[x_axis], grouped_data["num_sold"], linewidth=1)
    # Set title and labels
    ax.set_title(f"{x_axis} Seasonality")
    ax.set_xlabel(x_axis)
    ax.set_ylabel("Number of Units Sold")
    # Display the plot
    plt.grid()
plot_seasonality(train_, "month")
plot_seasonality(train_, "day_of_week")
plot_seasonality(train_, "day_of_year")
def plot_individual_ts(df):
    for country in df["country"].unique():
        # Create subplots with the appropriate number of rows
        fig, axes = plt.subplots(
            nrows=df["store"].nunique() * df["product"].nunique(),
            figsize=(20, 70),
            constrained_layout=True
        )
        count = 0
        for store in df["store"].unique():
            for product in df["product"].unique():
                # Filter the DataFrame for the current country, store, and product
                plot_df = df.loc[
                    (df["product"] == product) &
                    (df["country"] == country) &
                    (df["store"] == store)
                ]
                if len(plot_df) > 0:  # Only plot if data is available
                    axes[count].plot(plot_df["date"], plot_df["num_sold"], linewidth=0.5)
                    axes[count].set_title(f"{country} - {store} - {product}")
                    count += 1
        # Remove any unused subplots (if any)
        for idx in range(count, len(axes)):
            fig.delaxes(axes[idx])
        plt.grid()
plot_individual_ts(train.reset_index(drop=True))
The plots above show every country / store / product series individually.
Saving the dataset with the missing values filled in
df=pd.read_csv('/kaggle/input/playground-series-s5e1/train.csv')
df['num_sold']=train['num_sold']
df.to_csv('train.csv', index = False)
Summary
The dataset has the following characteristics:
- It is a time series, and product sales show clear periodicity
- The seasonal curves are roughly sinusoidal
- Product sales are strongly positively correlated with each country's GDP per capita
- Sales depend on whether the day falls on a weekend
These characteristics point the way for feature engineering in the next modeling step.
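As a starting point for that feature engineering, the weekday/weekend signal noted above can be encoded directly from the date column (a sketch; the column names follow the competition data, `is_weekend` is my own name):

```python
import pandas as pd

# Fri 2017-01-06, Sat 2017-01-07, Sun 2017-01-08
df = pd.DataFrame({"date": pd.to_datetime(
    ["2017-01-06", "2017-01-07", "2017-01-08"])})
df["day_of_week"] = df["date"].dt.dayofweek           # Monday=0 ... Sunday=6
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
print(df["is_weekend"].tolist())  # [0, 1, 1]
```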