TikTok User Engagement data EDA
This dataset contains information about TikTok videos, including claim status (whether the video is classified as a claim, which may indicate a video that violates platform rules), author status (verified status, ban status), and engagement metrics (views, likes, shares, downloads, comments). From this dataset, I want to explore and answer the following research questions:
Research Questions:
- What is the relationship between the claim status of a video and other user status?
- Non-verified users, banned users, and users under review post more claims than opinions.
- What is the relationship between a user’s verified status and the engagement metrics (views, likes, shares, downloads, comments) of their videos?
- Non-verified users have higher average engagement metrics.
- What machine learning model can best predict whether a video will go viral based on the video’s features?
- Random Forest.
Data Setting and Methods
Data link: https://www.kaggle.com/datasets/yakhyojon/tiktok
Three ways the context of the dataset might complicate or deepen my analysis:
- The dataset distinguishes “claims” from “opinions.” The two differ in nature: claims can be questioned and challenged, while opinions are subjective. This difference may itself drive different engagement levels, which makes it harder to isolate how content type relates to author characteristics.
- Investigating the relationship between the author’s characteristics and whether the video went viral can provide useful information for video creators and help them make better videos in the future.
- Each video in the dataset has many related factors, such as verified status, ban status, view count, share count, and like count, which makes it hard to identify which factor is most strongly related to the claim status.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
from IPython.display import Image
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
Data preparation
- Read and explore the data
data = pd.read_csv("tiktok_dataset.csv")
data.head()
| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
data.shape
(19382, 12)
data.dtypes
# int64
claim_status object
video_id int64
video_duration_sec int64
video_transcription_text object
verified_status object
author_ban_status object
video_view_count float64
video_like_count float64
video_share_count float64
video_download_count float64
video_comment_count float64
dtype: object
- Clean the data
data.isna().sum()
# 0
claim_status 298
video_id 0
video_duration_sec 0
video_transcription_text 298
verified_status 0
author_ban_status 0
video_view_count 298
video_like_count 298
video_share_count 298
video_download_count 298
video_comment_count 298
dtype: int64
data.duplicated().sum()
0
data = data.dropna()
data.head()
| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
data.shape
(19084, 12)
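Seven columns each report exactly 298 missing values, which suggests (but does not prove) that the same rows are incomplete in every column. A minimal sketch of how that could be checked before calling `dropna()`; the `toy` frame and its values are made up for illustration and are not rows from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the TikTok data.
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", None],
    "video_view_count": [100.0, np.nan, 50.0, np.nan],
    "video_like_count": [10.0, np.nan, 5.0, np.nan],
})

# Number of missing values in each row.
missing_per_row = toy.isna().sum(axis=1)

# Rows that are missing at least one value.
rows_with_missing = int((missing_per_row > 0).sum())

# Number of columns that contain any missing value.
cols_with_missing = int(toy.isna().any().sum())

# True when every incomplete row is missing all of the affected columns.
incomplete = missing_per_row[missing_per_row > 0]
same_rows = bool(incomplete.eq(cols_with_missing).all())

print(rows_with_missing, same_rows)
```

If `same_rows` is true on the real data, dropping the incomplete rows removes the same 298 videos no matter which column is inspected, which is consistent with the shape going from 19382 to 19084 rows.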
- Explore the data
# Create scatter plots for view count vs other metrics
fig, axes = plt.subplots(2, 2, figsize=(16, 16))
fig.suptitle('Relationship between Video View Count and Other Engagement Metrics', fontsize=16)
for i, metric in enumerate(['video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']):
    ax = axes[i // 2, i % 2]
    ax.scatter(data['video_view_count'], data[metric], alpha=0.5)
    ax.set_xlabel('Video View Count')
    ax.set_ylabel(metric.replace('_', ' ').title())
    ax.set_xscale('log')
    ax.set_yscale('log')
plt.tight_layout()
plt.savefig('engagement_scatter_plots.png')
These scatter plots visualize the relationship between video view count and other engagement metrics on a logarithmic scale.
Key observations:
- There’s a clear positive trend between view count and all other engagement metrics.
- The relationship appears to be roughly linear on the log-log scale, suggesting a power-law relationship between views and other engagement types.
- There’s some variation in the strength of these relationships, with likes showing the tightest correlation and downloads showing more spread.
In conclusion, the analysis reveals a strong positive relationship between video view count and other engagement metrics on TikTok. Videos that attract more views are likely to receive more likes, shares, downloads, and comments. This suggests that content creators should focus on strategies that increase view count, as this is likely to lead to higher engagement across all metrics.
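The power-law reading of the log-log plots can be made concrete by fitting a straight line in log space: if likes ≈ c · views^k, the slope of log(likes) against log(views) estimates k. The sketch below uses synthetic data with an assumed exponent (`true_k`), not the TikTok dataset. It also filters out zero counts, which cannot be placed on a log axis (the first row of the dataset, for example, has 0.0 comments).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: likes = c * views^k with multiplicative noise (assumed values).
true_k = 0.9
views = rng.uniform(1e2, 1e6, size=2000)
likes = 0.3 * views**true_k * np.exp(rng.normal(0, 0.2, size=2000))

# Keep strictly positive pairs, since log(0) is undefined.
mask = (views > 0) & (likes > 0)

# The slope of the fitted line in log-log space estimates the exponent k.
slope, intercept = np.polyfit(np.log10(views[mask]), np.log10(likes[mask]), deg=1)
print(f"estimated exponent: {slope:.2f}")
```

On the real data, the same call could be applied to `data['video_view_count']` and each engagement column to compare exponents across metrics.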
Results
Research question 1:
What is the relationship between the claim status of a video and other user status?
# Compare the count between verified status and claim status
verified_vs_claim_count = pd.crosstab(data['verified_status'], data['claim_status'])
print("Verified Status VS Claim Status count:")
print(verified_vs_claim_count)
Verified Status VS Claim Status count:
claim_status claim opinion
verified_status
not verified 9399 8485
verified 209 991
# Compare the percentage between verified status and claim status
verified_vs_claim_percentage = verified_vs_claim_count.div( \
verified_vs_claim_count.sum(axis=1), axis=0) * 100
print("Percentage of Claim Status VS Verified Status:")
print(verified_vs_claim_percentage)
Percentage of Claim Status VS Verified Status:
claim_status claim opinion
verified_status
not verified 52.555357 47.444643
verified 17.416667 82.583333
# Compare the count between author ban status and claim status
ban_vs_claim_count = pd.crosstab(data['author_ban_status'], data['claim_status'])
print("Author Ban Status VS Claim Status Count:")
print(ban_vs_claim_count)
Author Ban Status VS Claim Status Count:
claim_status claim opinion
author_ban_status
active 6566 8817
banned 1439 196
under review 1603 463
# Compare the percentage between author ban status and claim status
ban_vs_claim_percentage = ban_vs_claim_count.div(ban_vs_claim_count.sum(axis=1), axis=0) * 100
print("Percentage of Claim Status VS Author Ban Status:")
print(ban_vs_claim_percentage)
Percentage of Claim Status VS Author Ban Status:
claim_status claim opinion
author_ban_status
active 42.683482 57.316518
banned 88.012232 11.987768
under review 77.589545 22.410455
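The row percentages above are computed by dividing each row by its total. As a sketch of an equivalent shortcut, `pd.crosstab` accepts `normalize="index"`, which performs the same row-wise division. The `toy` frame below is hypothetical; the notebook would pass the columns of `data` instead.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use `data` instead.
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "active", "banned", "banned"],
    "claim_status": ["claim", "opinion", "opinion", "claim", "claim"],
})

# normalize="index" divides each row by its row total.
row_pct = pd.crosstab(
    toy["author_ban_status"], toy["claim_status"], normalize="index"
) * 100
print(row_pct.round(1))
```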
# Visualize the relationships using bar plots
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
verified_vs_claim_count.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Verified Status vs Claim Status')
plt.xlabel('Verified Status')
plt.ylabel('Count')
plt.legend(title='Claim Status', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.subplot(1, 2, 2)
ban_vs_claim_count.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Author Ban Status vs Claim Status')
plt.xlabel('Author Ban Status')
plt.ylabel('Count')
plt.legend(title='Claim Status', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Observations:
- General:
  - There is a nearly even split between claim videos (9,608) and opinion videos (9,476) in this dataset.
- Verified Status vs Claim Status:
  - Most claim videos (9,399 of 9,608) come from non-verified users.
  - Non-verified users have a nearly even split between claims and opinions (about 53% vs. 47%).
  - Verified users also post claims, but about 83% of their videos are opinions.
- Author Ban Status vs Claim Status:
  - Active users lean slightly toward opinions (about 57% opinions vs. 43% claims).
  - Banned users and users under review post mostly claims (about 88% and 78%, respectively).
Possible explanations:
- Non-verified users, banned users, and users under review are the groups most likely to post claim videos, which may be more likely to violate platform rules.
- Authors who are verified and not banned may follow the platform’s rules more closely, so the majority of their videos are opinions.
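Whether the gap between verified and non-verified authors could be due to chance can be checked with a Pearson chi-square test of independence. The sketch below is not part of the original analysis: it computes the statistic by hand with NumPy from the counts printed in the verified-status crosstab and compares it with 3.841, the standard 5% critical value for one degree of freedom.

```python
import numpy as np

# Counts copied from the verified-status crosstab
# (rows: not verified, verified; columns: claim, opinion).
observed = np.array([[9399, 8485],
                     [209, 991]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if verified status and claim status were independent.
expected = row_totals @ col_totals / total

# Pearson chi-square statistic (no continuity correction).
chi2 = ((observed - expected) ** 2 / expected).sum()

# 3.841 is the 5% critical value for 1 degree of freedom.
print(f"chi-square = {chi2:.1f}, exceeds 3.841: {chi2 > 3.841}")
```

A statistic far above the critical value indicates that the association is very unlikely to be a sampling artifact, although it says nothing about why the association exists.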
Research question 2:
What is the relationship between a user’s verified status and the engagement metrics (views, likes, shares, downloads, comments) of their videos?
# Calculate average engagement metrics for verified and non-verified users
engagement_metrics = ['video_view_count', 'video_like_count',
'video_share_count', 'video_download_count', 'video_comment_count']
avg_engagement = data.groupby('verified_status')[engagement_metrics].mean().reset_index()
print("Average Engagement Metrics:")
print(avg_engagement)
Average Engagement Metrics:
verified_status video_view_count video_like_count video_share_count \
0 not verified 265663.785339 87925.772422 17415.888000
1 verified 91439.164167 30337.633333 6591.448333
video_download_count video_comment_count
0 1095.814080 363.700514
1 358.146667 134.877500
# Calculate the ratio of engagement for verified vs. non-verified users
verified_avg = avg_engagement.loc[
    avg_engagement['verified_status'] == 'verified', engagement_metrics].values[0]
non_verified_avg = avg_engagement.loc[
    avg_engagement['verified_status'] == 'not verified', engagement_metrics].values[0]
ratio = verified_avg / non_verified_avg
print("Ratio of Verified VS. Non-Verified Engagement:")
# Round to two decimal places
for metric, r in zip(engagement_metrics, ratio):
    print(f"{metric}: {r:.2f}")
Ratio of Verified VS. Non-Verified Engagement:
video_view_count: 0.34
video_like_count: 0.35
video_share_count: 0.38
video_download_count: 0.33
video_comment_count: 0.37
# Visualize the results
plt.figure(figsize=(12, 6))
avg_engagement_melted = pd.melt(avg_engagement, id_vars=['verified_status'], value_vars=engagement_metrics, var_name='Metric', value_name='Average Count')
sns.barplot(x='Metric', y='Average Count', hue='verified_status', data=avg_engagement_melted)
plt.title('Average Engagement Metrics: Verified vs. Non-Verified Users')
plt.xticks(rotation=45)
plt.ylabel('Average Count (log scale)')
plt.yscale('log')
plt.legend(title='Verified Status')
plt.tight_layout()
plt.show()
Observations:
- Non-verified users have higher average engagement metrics (view, like, share, download, and comment counts) than verified users.
- All of the verified-to-non-verified ratios are below 1, ranging from 0.33 to 0.38, so videos from verified users receive roughly one third of the average engagement of videos from non-verified users.
- The bar plot shows higher engagement for non-verified users across all metrics.
Possible explanations:
- There are far fewer videos from verified users (1,200) than from non-verified users (17,884), so the verified averages rest on a much smaller sample and are more sensitive to skew.
- Non-verified users are more likely to post claim videos, which may attract more user engagement.
- TikTok’s algorithm might favor content from non-verified users so that good content can gain visibility and grow faster.
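One way to probe the second explanation is to compare verified and non-verified authors within each claim status, so that content type is held fixed. The sketch below shows the grouping on a hypothetical `toy` frame. The column names match the dataset, but the values are invented, so it illustrates the mechanics only and makes no statement about the real result.

```python
import pandas as pd

# Hypothetical toy rows; the column names match the dataset, the values do not.
toy = pd.DataFrame({
    "verified_status": ["verified", "verified", "not verified", "not verified"],
    "claim_status": ["claim", "opinion", "claim", "opinion"],
    "video_view_count": [500000.0, 5000.0, 480000.0, 4000.0],
})

# Mean, median, and row count of views within each (claim_status, verified_status) cell.
by_group = (
    toy.groupby(["claim_status", "verified_status"])["video_view_count"]
       .agg(["mean", "median", "count"])
)
print(by_group)
```

Reporting the median next to the mean is a deliberate choice here, since view counts are heavy-tailed and a few very popular videos can dominate a group mean.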
Research question 3:
What machine learning model can best predict whether a video will go viral based on the video’s features?
- Define a viral threshold (here, the top 10% of videos by view count)
viral_threshold = data['video_view_count'].quantile(0.9)
# Create a viral videos variable: 1 represents video is viral, 0 otherwise
data['viral'] = (data['video_view_count'] > viral_threshold).astype(int)
data.head()
| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count | viral |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 | 0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 | 0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 | 1 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 | 0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 | 0 |
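Labeling the top 10% of views as viral makes the target imbalanced by construction: about one video in ten is positive. A small sketch of that check on synthetic, heavy-tailed view counts (the lognormal distribution and its parameters are assumptions chosen only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic, heavy-tailed view counts (assumed distribution, for illustration only).
toy = pd.DataFrame({"video_view_count": rng.lognormal(mean=10, sigma=2, size=1000)})

# Same labeling rule as in the notebook: strictly above the 90th percentile.
threshold = toy["video_view_count"].quantile(0.9)
toy["viral"] = (toy["video_view_count"] > threshold).astype(int)

# Share of each class; about 10% of rows should be labeled viral.
class_share = toy["viral"].value_counts(normalize=True)
print(class_share)
```

On the real data, `data['viral'].value_counts(normalize=True)` would give the same kind of summary and is worth keeping in mind when reading accuracy scores later.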
- Feature selection
features = [ 'video_duration_sec', 'video_like_count', \
'video_share_count', 'video_download_count', 'video_comment_count']
X = pd.get_dummies(data[features], drop_first=True)
y = data['viral']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
- Model Training
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(random_state=42)
}
# Create pipelines
pipelines = {name: Pipeline([
('scaler', StandardScaler()),
('model', model)
]) for name, model in models.items()}
# Train and evaluate models
results = []
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(pipeline, X, y, cv=5)
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'CV Mean': cv_scores.mean()
    })
- Compare the Models
results = pd.DataFrame(results)
print("Compare the Models:")
print(results)
Compare the Models:
Model Accuracy CV Mean
0 Logistic Regression 0.921404 0.906520
1 Decision Tree 0.873199 0.857319
2 Random Forest 0.920618 0.906362
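Because only about 10% of videos are labeled viral, accuracy figures are easier to interpret next to a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data with a 10% positive rate. The arrays `X_toy` and `y_toy` are stand-ins and not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with 10% positives, mimicking the viral label's class balance.
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

A model is only adding value on this metric to the extent that it beats this baseline, so adding a `'Baseline'` entry to the `models` dictionary would be one way to put the comparison in the same results table.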
# Get detailed report for the best model
best_model = results.loc[results['CV Mean'].idxmax(), 'Model']
best_pipeline = pipelines[best_model]
y_pred = best_pipeline.predict(X_test)
print(f"Classification Report for {best_model}:")
print(classification_report(y_test, y_pred))
Classification Report for Logistic Regression:
precision recall f1-score support
0 0.93 0.98 0.96 3432
1 0.72 0.36 0.48 385
accuracy 0.92 3817
macro avg 0.83 0.67 0.72 3817
weighted avg 0.91 0.92 0.91 3817
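Comparing the mean cross-validation scores, logistic regression and random forest are effectively tied (0.9065 vs. 0.9064), and both clearly outperform the decision tree. Logistic regression has the marginally higher score, which is why the classification report above is for that model. I continue with the random forest, which gives comparable accuracy and provides feature importances, to find the most important feature for predicting whether a TikTok video will go viral.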
# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)
RandomForestClassifier(random_state=42)
# Feature importance
feature_importance = pd.DataFrame({'feature': features, 'importance': model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
print("Feature Importance:")
print(feature_importance)
Feature Importance:
feature importance
1 video_like_count 0.332143
2 video_share_count 0.205772
3 video_download_count 0.195245
4 video_comment_count 0.158660
0 video_duration_sec 0.108180
Observation:
From the feature importance, I found that the most important feature for predicting virality is the video like count.
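Impurity-based importances from a random forest can be spread across correlated features, and the engagement counts here are strongly correlated with one another. Permutation importance on held-out data is a common cross-check. The sketch below runs it on synthetic data in which only the first column carries signal. The data and the name `forest` are illustrative assumptions, not the notebook's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: only the first column carries signal (assumed setup).
X_toy = rng.normal(size=(600, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Mean drop in held-out accuracy when each column is shuffled in turn.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
print(perm.importances_mean)
```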
Use the model to predict some examples:
# Example prediction
X_test = pd.DataFrame(X_test)
for i in range(55, 60):
    # Keep the row as a one-row DataFrame so the scaler receives feature names
    example = X_test.iloc[[i]]
    example_scaled = scaler.transform(example)
    prediction = model.predict(example_scaled)
    print(f"Actual viral status: {'Yes' if y_test.iloc[i] == 1 else 'No'}")
    print(f"Predicted viral status: {'Yes' if prediction[0] == 1 else 'No'}")
    print(f"Video features: {example.iloc[0].to_dict()}")
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 12.0, 'video_like_count': 1079.0, 'video_share_count': 12.0, 'video_download_count': 2.0, 'video_comment_count': 0.0}
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 43.0, 'video_like_count': 81.0, 'video_share_count': 21.0, 'video_download_count': 0.0, 'video_comment_count': 0.0}
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 9.0, 'video_like_count': 24684.0, 'video_share_count': 5363.0, 'video_download_count': 283.0, 'video_comment_count': 26.0}
Actual viral status: Yes
Predicted viral status: Yes
Video features: {'video_duration_sec': 33.0, 'video_like_count': 501235.0, 'video_share_count': 117357.0, 'video_download_count': 11126.0, 'video_comment_count': 180.0}
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 38.0, 'video_like_count': 3029.0, 'video_share_count': 705.0, 'video_download_count': 35.0, 'video_comment_count': 0.0}
In summary, I built a Random Forest model to predict whether a TikTok video will go viral based on its duration in seconds and its like, share, download, and comment counts. Like logistic regression, it reached a test accuracy of about 92%. Because only about 10% of videos are labeled viral, always predicting “not viral” would already score about 90%, so accuracy alone overstates how well the models separate viral from non-viral videos.
The analysis found that engagement metrics are important predictors, with like count and share count topping the list. However, the classification report (for logistic regression) shows that the model is conservative in labeling videos as viral: recall for the viral class is only 0.36. The model could be improved by addressing class imbalance and increasing its sensitivity to viral videos.
Overall, the Random Forest model was informative about which engagement signals are associated with a video going viral on TikTok, which may be useful to content creators and marketers trying to reach as many people as possible. Future work might consider more features, handle class imbalance, and try other machine learning algorithms to improve model performance.
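One possible way to act on the class-imbalance point is to reweight the classes and to score models on recall for the viral class instead of accuracy. The sketch below compares an unweighted and a `class_weight="balanced"` logistic regression on synthetic data with roughly 10% positives. The data, the helper `mean_recall`, and the choice of logistic regression are assumptions for illustration, not results from the TikTok dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic imbalanced data (about 10% positives); not the TikTok dataset.
X_toy = rng.normal(size=(2000, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=2000) > 1.8).astype(int)

def mean_recall(class_weight):
    """Mean 5-fold recall of the positive class for a scaled logistic regression."""
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(class_weight=class_weight, random_state=42)),
    ])
    return cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="recall").mean()

recall_plain = mean_recall(None)
recall_balanced = mean_recall("balanced")
print(f"recall unweighted: {recall_plain:.2f}, balanced: {recall_balanced:.2f}")
```

Reweighting typically raises recall at the cost of precision, so the two should be read together (for example via the F1 score) before preferring one setting.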
Implications and Limitations
- Benefits:
- Content creators can use this information to improve their videos’ chances of going viral.
- TikTok can refine its recommendation algorithms using information from this research.
- Data Setting Impacts:
- The data might not represent all TikTok videos over the long term.
- The method of collecting this dataset might over- or under-represent certain types of content or users.
- Limitations:
- The model only considers some quantitative user engagement metrics, such as likes, shares, and duration; it doesn’t account for qualitative factors such as content quality, trends, or cultural relevance.
- Advice: The results for creating a viral video are only for reference; there are many other real-life factors to consider.
- The model doesn’t account for the ethical implications of content, and a video could go viral for negative reasons.
- Advice: Authors should consider the content that they create.
- The model doesn’t capture how virality changes over time.
- Advice: Authors should be aware that the viral patterns of a video can change quickly.
sources = ['https://www.kaggle.com/datasets/yakhyojon/tiktok',
'https://www.geeksforgeeks.org/working-images-python/',
'https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html',
'learning-algorithms.ipynb',
'https://www.kaggle.com/code/yakhyojon/tiktok-misinformation-classification-99-6-accurat',
'model-evaluation.ipynb',
'Chatgpt: How to train multiple machine learning algorithms at the same time?',
'Chatgpt: How to compare which machine learning model is the best?',
'https://machinelearningmastery.com/calculate-feature-importance-with-python/',
'https://www.geeksforgeeks.org/feature-importance-with-random-forests/',
'https://www.geeksforgeeks.org/numpy-quantile-in-python/',
'https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html',
'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html',
'https://scikit-learn.org/stable/modules/cross_validation.html',
'Chatgpt: How to use machine learning model to predict some examples in the dataset?',
'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html'
]