shopee刷销量
Disclaimer : This article is based on my personal experience and I do not represent anyone or any brand. This article will mostly talk about Python programming language and its Pandas library.
免责声明:本文基于我的个人经验,我不代表任何人或任何品牌。 本文将主要讨论Python编程语言及其Pandas库。
There are hundreds brand at Shopee. Every brand surely want to increase their sales so that they can gain more profit. In order to do that, Shopee Data Analyst would identify three products with highest revenue from each brand. Then, campaign and ads for those three products could be increase in such a way that overall sales for each brand have a high chance to increase.
Shopee有数百个品牌。 每个品牌肯定希望增加销量,以便获得更多利润。 为此,Shopee Data Analyst将从每个品牌中找出收入最高的三种产品。 然后,可以增加这三个产品的广告系列和广告,以使每个品牌的整体销售额都有很大的增长机会。
In this article we will walk through the code one may use to do the task mentioned before. This task is adapted from “I’m The Best Coder ! Challenge 2019”, a data science competition held by Shopee (go to this page https://www.kaggle.com/c/ptr-rd1 to see the original problem).
在本文中,我们将逐步介绍一个人可能用来执行前面提到的任务的代码。 此任务改编自“我是最好的编码器! Shopee举办的数据科学竞赛“挑战2019”(转到此页面https://www.kaggle.com/c/ptr-rd1以查看原始问题)。
Task:
任务:
Find the top three itemids (in a list) from the ‘Official Shop’ of that particular brand that generated the highest Gross Sales Revenue (amount*item_price_usd) from 10th May to 31st May 2019.
从该品牌的'官方商店'中找到前三个最重要的id(在列表中),该ID在2019年5月10日至5月31日产生了最高的总销售收入(amount * item_price_usd)。
Without further ado, let’s jump right into the code. We use Pandas to analyze and modify our data.
事不宜迟,让我们直接进入代码。 我们使用熊猫来分析和修改我们的数据。
In[1]:import pandas as pd brands = pd.read_csv("d:/Extra_material_2.csv")
brands.head()
We store a dataframe in a variable named ‘brands’. Each shop_id belong to certain brand. From this dataframe, we know whether that shop_id is an official shop or not. Below is brands’ first five rows.
我们将数据框存储在名为“ brands”的变量中。 每个shop_id都属于某个品牌。 从这个数据框中,我们知道shop_id是否是官方商店。 以下是品牌的前五行。
Out[1]: shop_id brand shop_type
0 92567513.0 JBL Official Shop
1 1657263.0 Sandisk Official Shop
2 83401047.0 Revlon Official Shop
3 4259661.0 Philips Official Shop
4 4350106.0 Kimberly Clark Official Shop
Before analyzing the data, we should “clean” it first. Three steps that we used:
在分析数据之前,我们应该先“清理”数据。 我们使用了三个步骤:
- Remove duplicated rows 删除重复的行
- Change ‘shop_id’ from float to integer (so that ‘.0’ will be removed)将'shop_id'从float更改为整数(这样将删除'.0')
- Sort the data based on ‘brand’ alphabetically 按字母顺序对数据进行排序
In[2]:
brands.drop_duplicates(keep=’first’, inplace=True)
brands['shop_id'] = dataa.apply(lambda x: int(x['shop_id']),axis=1)
brands.sort_values(‘brand’, inplace=True)
brands.reset_index(inplace=True)
brands
Below is our dataframe ‘brands’ after being cleaned and organized.
以下是经过整理和整理后的数据框“品牌”。
Out[2]:
index shop_id brand shop_type
0 1553 63512637 3M NaN
1 1554 15424243 3M NaN
2 40 13735592 3M Official Shop
3 41 15607060 3M Official Shop
4 42 63512637 3M Official Shop
… … … … …
1388 2237 46944726 Za NaN
1389 2435 78346322 eMart NaN
1390 613 78346970 eMart Shopee Store
1391 2436 78346970 eMart NaN
1392 612 78346322 eMart Shopee Store
Next, we import our second data and saved it to a dataframe named ‘transactions’
接下来,我们导入第二个数据并将其保存到名为“ transactions”的数据框中
In[3]:
transactions = pd.read_csv(”d:/Extra_material_3.csv”)
transactions.drop_duplicates(keep='first', inplace=True)
transactions.reset_index(inplace=True)
transactions.head()
This is how our dataframe ‘transactions’ looks like.
这就是我们的数据框“交易”的样子。
Out[3]:
index orderid itemid date_id amount item_price_usd shopid
0 0 1316773501 1355135789 16/5/2019 1 5.483204 65948755
1 1 1316774684 1257902741 16/5/2019 1 0.387050 61364537
2 2 1316803330 2126770455 16/5/2019 1 0.344044 4980129
3 3 1317196434 2112999088 16/5/2019 1 6.704561 29089750
4 4 1318105348 1832131969 16/5/2019 1 5.354187 94372493
Unfortunately, transactions’ column named ‘date_id’ is a string.
不幸的是,名为“ date_id”的“交易”列是一个字符串。
In[4]:
transactions[‘date_id’].describe()
Out[4]:
count 1013725
unique 12
top 5/5/2019
freq 317810
Name: date_id, dtype: object
We have to convert it to datetime first. Then, remove orders before May 10th 2019 and orders after May 31st 2019. For further interests, we also convert column ‘itemid’ to string.
我们必须先将其转换为日期时间。 然后,在2019年5月10日之前删除订单,在2019年5月31日之后删除订单。为了进一步引起兴趣,我们还将列“ itemid”转换为字符串。
In[5]: transactions[‘date_id’]=pd.to_datetime(transactions[‘date_id’])
transactions = transactions[(transactions[‘date_id’]>=’2019–05–10')&
(transactions[‘date_id’]<=’2019–05–31')]transactions['itemid']=transactions.apply(lambda x:str
(int(x['itemid'])),axis=1)
We created new column named ‘subtotal’. This column contains number came from formula ‘amount’ x ‘item_price_usd’.
我们创建了一个名为“小计”的新列。 此列包含来自公式“金额” x“ item_price_usd”的数字。
In[6]:
transactions[‘subtotal’]=transactions.apply(lambda x: x.amount*
x.item_price_usd,axis=1)
transactions
This is how our dataframe ‘transactions’ looks like.
这就是我们的数据框“交易”的样子。
Out[6]:
... itemid date_id amount item_price_usd shopid subtotal
0 1355135789 2019–05–16 1 5.483204 65948755 5.483204
1 1257902741 2019–05–16 1 0.387050 61364537 0.387050
2 2126770455 2019–05–16 1 0.344044 4980129 0.344044
3 2112999088 2019–05–16 1 6.704561 29089750 6.704561
4 1832131969 2019–05–16 1 5.354187 94372493 5.354187
… … … … … …
658070 1286506704 2019–05–29 1 2.924375 73789223 2.924375
658071 1726943954 2019–05–29 2 18.062319 52323543 36.124638
658072 423611298 2019–05–29 1 4.300552 20187300 4.300552
658073 1420094546 2019–05–29 1 13.976795 46595353 13.976795
658074 237867635 2019–05–29 1 23.179976 21180592 23.179976
Next, we create a new dataframe named ‘products’. This dataframe will tell us how much revenue generated by each itemid. (Each product have unique itemid)
接下来,我们创建一个名为“产品”的新数据框。 该数据框将告诉我们每个itemid产生了多少收入。 (每个产品都有唯一的itemid)
In[7]:
products = pd.DataFrame({‘itemid’:transactions[‘itemid’].unique()},
columns=[‘itemid’])
gross_revenue = []for x in products[‘itemid’]:
gross_revenue.append(transactions[transactions[‘itemid’]==x]
[‘subtotal’].sum(axis=0))products[‘gross_revenue’]=gross_revenue
Next, we labelled each itemid with its corresponding shopid.
接下来,我们用其对应的shopid标记每个itemid。
In[8]:shopid = []for x in products[‘itemid’]:
shopid.append(transactions[transactions[‘itemid’]==x]
[‘shopid’].unique()[0])products[‘shopid’]=shopidproducts.head()
This is how five first row of our dataframe ‘products’ looks like.
这就是我们数据框“产品”的五个第一行的样子。
Out[8]:
itemid gross_revenue shopid
0 1355135789 5.483204 65948755
1 1257902741 1.161149 61364537
2 2126770455 1.720221 4980129
3 2112999088 6.704561 29089750
4 1832131969 67.088614 94372493
Next, we want to find three products with highest revenue. Here are the steps:
接下来,我们想找到三种收入最高的产品。 步骤如下:
- Create dataframe ‘list’ which is a list of ‘Official Shop’ shopid from certain brand 创建数据框“列表”,该列表是特定品牌的“官方商店”购物清单
- Create dataframe ‘product_list’ which is a list of product that is sold from Official Shops listed in ‘list’创建数据框“ product_list”,这是从“ list”中列出的官方商店出售的产品的列表
- Find three products from ‘product_list’ with highest revenue从“ product_list”中找到收入最高的三个产品
- Repeat to other brands重复其他品牌
Store the answer in a dataframe named ‘Answers’
将答案存储在名为“答案”的数据框中
In[9]:
Answers = []for x in dataa[‘brand’].unique():
list = brands[(brands[‘brand’]==x) &
(brands[‘shop_type’]==’Official Shop’)][‘shop_id’]
product_list = datac[datac[‘shopid’].isin(list)]
best3 = product_list.nlargest(3, ‘gross_revenue’).reset_index()
if len(best3) == 3:
Answers.append(x +’, ‘+best3.loc[0,’itemid’]+’, ‘
+best3.loc[1,’itemid’]+’, ‘+
best3.loc[2,’itemid’])
elif len(best3) == 2:
Answers.append(x +’, ‘+best3.loc[0,’itemid’]+’, ‘
+best3.loc[1,’itemid’])
elif len(best3) == 1:
Answers.append(x +’, ‘+best3.loc[0,’itemid’])
else :
Answers.append(x +’, N.A’)
Export ‘Answers’ to csv file and submit it to Kaggle.
将“答案”导出到csv文件并将其提交给Kaggle。
In[10]:Answers = pd.DataFrame(Answers, columns=[‘Answers’])
Index = range(1, 271)
Answers.insert(0, ‘Index’, Index)
Answers.to_csv(”d:/Answers3.csv”, index = False)
Answers
Our ‘Answers’ should look like this
我们的“答案”应如下所示
Out[10]: Index Answers
0 1 3M, N.A
1 2 3M Littmann, N.A
2 3 AHC, 1617870234, 1617870245, 1617870178
3 4 ASUS, N.A
4 5 Abbott, 1437756029, 1414181351, 1414181383
.. ... ...
265 266 Wyeth Nutrition , N.A
266 267 Xiaomi, 1373245117, 967538060, 1913571916
267 268 Yves Rocher, 1719453583, 1719453786, 2109554227
268 269 Za, 365981058, 1074619349, 1867715485
269 270 eMart, N.A
Some brands have no product sold, so we put ‘N.A’ after the brand’s name.
有些品牌没有出售产品,因此我们在品牌名称后加上“ NA”。
This method yielded score 0.88888 out of 1. This method also took a very long time to proceed. Any comments and suggestions for better and faster method will be much appreciated ! :D
该方法得出的满分为0.88888。该方法也花费了很长时间。 任何更好和更快的方法的意见和建议,将不胜感激! :D
In summary, one can boost its profit by increasing campaign and advertisement for products that generated highest revenue. In fact, this simple task can be done by hand. But with thousands of data, doing it by hand is hard and painful. A little knowledge about Python and Pandas will help us a lot.
总之,可以通过增加产生最高收入的产品的广告系列和广告来增加利润。 实际上,这个简单的任务可以手动完成。 但是,要处理成千上万的数据,手工完成将非常困难。 对Python和Pandas的一点了解将对我们有很大帮助。
Thank you for reading this article ! I hope you gained some new knowledge after reading this article. Any comments and suggestions will be much appreciated. Thank you !
感谢您阅读本文! 我希望您阅读本文后能获得一些新知识。 任何意见和建议将不胜感激。 谢谢 !
shopee刷销量