Discovering Viral Food Recipes of TikTok Indonesia

In the past five months of quarantine during the Covid-19 pandemic, I've seen changes in my friends' and relatives' Instagram posts and story updates. Posts about outdoor communal activities (travel, hangouts, parties) have given way to posts about home activities: gardening, Netflix binge-watching, and cooking!

I've noticed that people are trying out similar dishes and posting them on social media. It started with dalgona coffee in March and April, and more recently moved on to Korean garlic cheese buns. Interestingly, people are now also sharing their recipes on TikTok, with demo videos for each recipe.

Since I'm always curious to play around with data, I decided to explore the TikTok posts under some of the hashtags people use to find recipes to try at home. My goal is pretty simple:

  • Scrape some top TikTok posts with these hashtags
  • Extract the captions and likes/views, and check for any interesting trends (literally playing around with data :D)

TikTok Data Collection

Disclaimer: I'm using TikTokApi by David Teather, which is available on RapidAPI.

For this case, I'm calling this API's endpoints through RapidAPI. While the ethics of web scraping are debated, in this exploration I'm using it responsibly: retrieving only publicly available data, and only a limited amount of it.

The full TikTok scraping code can be found here.

1. Get Raw Data from RapidApi endpoints

Here I'm making an HTTP request from Python with the requests library, calling the RapidApi endpoint with the hashtag query that I need. I have set the number of posts to capture to 1000 (out of a maximum of 2000 posts per request).

import requests

url = "https://tiktok2.p.rapidapi.com/hashtag"
querystring = {"hashtag": "masakdirumah", "count": "1000"}
headers = {
    'x-rapidapi-host': "tiktok2.p.rapidapi.com",
    'x-rapidapi-key': "yourkeyhere"
}

response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)

2. Parse the response data

The above script returns the response as JSON data.

import json
import pandas as pd

data = response.text
y = json.loads(data)
raw_df = pd.json_normalize(y['data'])
raw_df
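
To see where the dotted column names used in the next step come from, here is a minimal sketch of how pd.json_normalize flattens nested records. The sample payload is made up for illustration and only mimics the nesting of the response; the real response carries many more fields.

```python
import pandas as pd

# Hypothetical sample that only mimics the nested shape of the API response
sample = {
    "data": [
        {
            "itemInfos": {"text": "resep brownies #masakdirumah", "diggCount": 120, "playCount": 3400},
            "authorInfos": {"uniqueId": "user_a"},
        },
        {
            "itemInfos": {"text": "pie susu #masakdirumah", "diggCount": 45, "playCount": 900},
            "authorInfos": {"uniqueId": "user_b"},
        },
    ]
}

# Each nested key becomes one column, with the path joined by dots
flat = pd.json_normalize(sample["data"])
print(sorted(flat.columns))
```

Each row is one post, and a nested field such as itemInfos -> diggCount ends up in a column named 'itemInfos.diggCount'.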

Next, I'm parsing this data into a data frame, keeping only the columns that I need: creation time, video URL, user name, follower count, caption, video length, and the counts of likes, comments, plays, and shares.

# Keep only the columns needed for the analysis
df_summary = raw_df[[
    'itemInfos.createTime', 'itemInfos.video.urls', 'itemInfos.text',
    'authorInfos.uniqueId', 'authorStats.followerCount',
    'itemInfos.commentCount', 'itemInfos.playCount', 'itemInfos.shareCount',
    'itemInfos.diggCount', 'itemInfos.video.videoMeta.duration']]

# rename() returns a new data frame, so assign the result back
df_summary = df_summary.rename(columns={
    'itemInfos.createTime': 'created_time',
    'itemInfos.video.urls': 'video_url',
    'authorInfos.uniqueId': 'user_name',
    'itemInfos.text': 'video_desc',
    'authorStats.followerCount': 'user_follower_cnt',
    'itemInfos.commentCount': 'cnt_comments',
    'itemInfos.playCount': 'cnt_plays',
    'itemInfos.shareCount': 'cnt_shares',
    'itemInfos.diggCount': 'cnt_likes',
    'itemInfos.video.videoMeta.duration': 'video_length'})
Figure: Parsing result

3. Extract mentions and hashtags from the captions

Here is where I thank John Naujoks for his functions to extract hashtags and mentions from a string. I made some modifications, but without his example script I wouldn't have been able to figure out how to do this.

import re

def find_hashtags(comment):
    """Find hashtags used in comment and return them"""
    hashtags = re.findall('#[A-Za-z]+', comment)
    # Several matches: return the list; one match: return the string itself
    if len(hashtags) > 1:
        return hashtags
    elif len(hashtags) == 1:
        return hashtags[0]
    else:
        return ""

def find_mentions(comment):
    """Find mentions used in comment and return them"""
    mentions = re.findall('@[A-Za-z]+', comment)
    if len(mentions) > 1:
        return mentions
    elif len(mentions) == 1:
        return mentions[0]
    else:
        return ""

df_summary['hashtags'] = [find_hashtags(video_desc) for video_desc in df_summary['video_desc']]
df_summary['mentions'] = [find_mentions(video_desc) for video_desc in df_summary['video_desc']]
df_summary
Figure: Metadata of TikTok posts after converting the list result to a data frame
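
As a variation on the functions above (not the version used in this post), the sketch below always returns a list, so every row has the same type, and it matches digits and underscores as well as letters. The helper name extract_tags and the sample caption are made up for illustration.

```python
import re

def extract_tags(caption, prefix):
    """Return every token that starts with `prefix` ('#' or '@') as a list."""
    # \w+ matches letters, digits and underscores after the prefix
    pattern = re.escape(prefix) + r"\w+"
    return re.findall(pattern, caption)

# Hypothetical caption for illustration
caption = "Brownies kukus 3 bahan #masakdirumah #resep_simple @teman123"
hashtags = extract_tags(caption, "#")
mentions = extract_tags(caption, "@")
print(hashtags)
print(mentions)
```

Returning an empty list for captions without tags keeps later steps simple, since they never have to check whether a cell holds a string or a list. One limitation: TikTok user names may contain periods, which this pattern would cut short.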

There you go, a dataset to play around with!

Data Wrangling Time!

Now on to the fun part: data exploration. With the data above, here's what I have in mind for the exploration:

  • Time-series trends of the posts
  • Distribution of plays/likes/shares across accounts
  • Content of popular recipes

Starting with some time-series analysis, here is the trend of posts over time. The time-series charts below give us some interesting insights:

  • Video posts trend upward from March 2020, peaking in May 2020 (Ramadhan season in Indonesia). Over the past 2 months, the number of videos posted has been quite stable at ~10 posts/day.

  • The length of the posted videos also trends upward, from ~40 seconds in April 2020 to ~60 seconds in August 2020.

  • The afternoon hours, 3 pm to 7 pm, seem to be the peak hours when people post their cooking tutorials. Prepping for an afternoon snack or dinner, perhaps?

Figure: Daily TikTok post counts and video length trends
Figure: TikTok post counts by day of week and hour of the day
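
The counts behind charts like these can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the code used for the figures. It assumes that created_time holds Unix epoch seconds and that Jakarta local time is the right time zone for reading posting hours; both are assumptions.

```python
import pandas as pd

# Hypothetical rows; created_time is assumed to be Unix epoch seconds (UTC)
posts = pd.DataFrame({
    "created_time": [1588323600, 1588327200, 1588410000, 1588413600, 1588500000],
    "video_length": [40, 55, 60, 35, 58],
})

# Convert to timezone-aware timestamps in Jakarta local time (assumption)
posts["created_at"] = (
    pd.to_datetime(posts["created_time"], unit="s", utc=True)
    .dt.tz_convert("Asia/Jakarta")
)

# Posts per calendar day, posts per hour of day, and mean video length per day
daily_counts = posts.groupby(posts["created_at"].dt.date).size()
hourly_counts = posts.groupby(posts["created_at"].dt.hour).size()
daily_length = posts.groupby(posts["created_at"].dt.date)["video_length"].mean()

print(daily_counts)
print(hourly_counts)
```

Converting to local time before grouping matters here: in UTC, an afternoon post from Indonesia would be counted in the morning.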

How about post trends across accounts? Interestingly, for the 15 users with the most posts, we see quite different distributions of likes and shares. Accounts like '2beha10ribu', 'fahmimiasmr', and 'venithyacalistaa' have high like counts, reaching above 1 million likes. On the other hand, 'cookingwithhel' is the winner in terms of shares, with one of their posts even reaching 70k shares.

Figure: Distribution of likes and shares across accounts
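
A per-account comparison like this one can be sketched by first picking the accounts with the most posts and then summarizing their likes and shares. The rows below are made up, and top_n is set to 2 only to keep the example small (the post looks at the top 15).

```python
import pandas as pd

# Hypothetical rows using the column names from the parsing step
posts = pd.DataFrame({
    "user_name": ["a", "a", "a", "b", "b", "c"],
    "cnt_likes": [100, 300, 200, 50, 70, 1000],
    "cnt_shares": [10, 5, 15, 40, 60, 2],
})

top_n = 2  # small value for the example only

# Accounts with the most posts
top_users = posts["user_name"].value_counts().head(top_n).index

# Median and maximum likes/shares for those accounts
summary = (
    posts[posts["user_name"].isin(top_users)]
    .groupby("user_name")[["cnt_likes", "cnt_shares"]]
    .agg(["median", "max"])
)
print(summary)
```

The same filtered data frame could be passed to a box plot per account to show the full distribution rather than just two summary statistics.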

Now on to the highlight of the post: let's find the food recipes with the highest shares, likes, and plays (indicating their popularity).

The biggest challenge here is extracting the dish names from the posts' captions. The reason is that in a TikTok post you can type anything you want, without any structured fields. Also, the videos themselves can be edited to show the dish names instead of using captions. In this case, I applied several data cleansing steps: remove numbers and special characters, filter out word noise (stopwords and words common across the posts), and then extract dish names from the popular bigrams and trigrams in the dataset.
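
The cleansing steps described above can be sketched with the standard library alone. This is a simplified illustration, not the code behind the results in this post; the captions and the noise-word list are made up.

```python
import re
from collections import Counter

# Hypothetical captions and noise-word list, for illustration only
captions = [
    "Resep brownies kukus 3 bahan!! #masakdirumah",
    "brownies kukus anti gagal #resep",
    "Pie susu teflon, resep simple",
    "pie susu tanpa oven :D",
]
noise_words = {"resep", "simple", "anti", "gagal", "tanpa", "bahan", "d"}

def clean(caption):
    """Lowercase, drop hashtags, digits and punctuation, then remove noise words."""
    text = re.sub(r"#\w+", " ", caption.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in noise_words]

def ngrams(tokens, n):
    """Return consecutive n-word tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Count bigrams over all captions; use n=3 for trigrams
bigram_counts = Counter()
for caption in captions:
    bigram_counts.update(ngrams(clean(caption), 2))

top_bigrams = bigram_counts.most_common(2)
print(top_bigrams)
```

Note that removing noise words before building n-grams can join words that were not adjacent in the caption ("susu tanpa oven" yields the bigram "susu oven"), so the top n-grams still need a manual check before they are treated as dish names.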

Here are some word clouds of the food, in unigrams, bigrams, and trigrams. You might want to translate them since they're in Bahasa Indonesia, but the terms here are mostly related to desserts and snacks: oreo, chocolate, martabak, cake, milo, pie, pudding, cheese stick, etc. No wonder the peak hour of the posts is in the afternoon: these are perfect afternoon snacks!

Figure: Unigrams, bigrams, and trigrams of the food

And the popular dishes are:

Figure: Videos with the highest plays
Figure: Videos with the highest likes
Figure: Videos with the highest shares
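
Rankings like these can be produced with DataFrame.nlargest, once per metric. The sketch below uses made-up rows and a top 2 to stay short; it is an illustration, not the code used for the tables above.

```python
import pandas as pd

# Hypothetical rows using the column names from the parsing step
posts = pd.DataFrame({
    "video_desc": ["brownies kukus", "pie susu", "telur gulung", "dessert box"],
    "cnt_plays": [5000, 12000, 8000, 3000],
    "cnt_likes": [400, 900, 1000, 200],
    "cnt_shares": [50, 20, 70, 90],
})

# Top captions for each popularity metric
top_by = {
    metric: posts.nlargest(2, metric)["video_desc"].tolist()
    for metric in ["cnt_plays", "cnt_likes", "cnt_shares"]
}
print(top_by)
```

Ranking each metric separately shows that a video can lead on shares without leading on plays or likes.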

There are various recipes here, and even a few posts referring to the same dishes. To summarize quickly, here are the viral food recipes of TikTok Indonesia:

  • Desserts: brownies, dessert box, cake, smoothie bowl, milk pie
  • Snacks: rolled egg, coffee bread, fried tofu, potato hotdog, mochi
  • Savory dishes: meatballs, Korean fried rice, grilled chicken, chicken katsu

Quite a number of them seem to be desserts and snacks, as opposed to side dishes to be eaten with rice.

Figure: Viral food recipes in TikTok Indonesia: rolled egg stick, brownies, grilled chicken, dessert box, milk pie

Some extra viz to make a fancier word cloud: I'm using PIL and pylab to take the colors of an image as the background of the word cloud.

Figure: Can you guess what the underlying picture of this viz is?

The full data exploration and visualization code can be found here.

Closing Remarks

That sums up my discovery of the viral food recipes of TikTok Indonesia. Try out your own version of the top recipes (brownies pudding, milk pie, martabak, and dessert box) and see for yourself whether they're worthy of the virality :D

Again, although there are limitless possibilities for scraping data all over the web, we still need to be mindful of its ethics. Keep in mind to retrieve only publicly available data, and never in a way that harms the server (e.g. hitting the API up to the server limit).

Happy exploring (and cooking)!

Translated from: https://medium.com/swlh/discovering-viral-food-recipes-of-tiktok-indonesia-7f7e353c52ef
