TikTok User Engagement data EDA

驴蹄子的数据库

Article tags: python, data analysis, scikit-learn, machine learning

Article link: https://blog.csdn.net/2301_79179541/article/details/141352836

This dataset contains information about TikTok videos, including claim status (whether the video is classified as a claim, which may indicate content that violates platform rules), author status (verified status and ban status), and engagement metrics (views, likes, shares, downloads, comments). With this dataset, I want to explore and answer the following research questions:

Research Questions:

What is the relationship between the claim status of a video and other user status?
- Videos from non-verified, banned, and under-review authors are more often claims than opinions.
What is the relationship between a user’s verified status and the engagement metrics (views, likes, shares, downloads, comments) of their videos?
- Non-verified users have higher average engagement on all five metrics.
What machine learning model can best predict whether a video will go viral based on the video’s features?
- Logistic Regression and Random Forest are essentially tied on cross-validated accuracy (about 0.906 each), ahead of a single Decision Tree; Random Forest is used for the follow-up feature-importance analysis.

Data Setting and Methods

Data link: https://www.kaggle.com/datasets/yakhyojon/tiktok
Three ways the context of the dataset might complicate or deepen my analysis:

The dataset distinguishes “claims” from “opinions,” which differ in nature: a claim can be questioned and challenged, while an opinion is subjective. This difference may itself drive user engagement, which makes it difficult to study reliably how content type relates to author characteristics.
Analyzing the relationship between author characteristics and whether a video goes viral can give video creators useful information and help them make better videos in the future.
Each video in the dataset has many related factors, such as verified status, ban status, view count, share count, and like count, so it is hard to isolate which factor is actually related to the claim status.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from matplotlib.ticker import PercentFormatter
import seaborn as sns
from IPython.display import Image

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

Data preparation

Read and explore the data

data = pd.read_csv("tiktok_dataset.csv")

data.head()

	#	claim_status	video_id	video_duration_sec	video_transcription_text	verified_status	author_ban_status	video_view_count	video_like_count	video_share_count	video_download_count	video_comment_count
0	1	claim	7017666017	59	someone shared with me that drone deliveries a...	not verified	under review	343296.0	19425.0	241.0	1.0	0.0
1	2	claim	4014381136	32	someone shared with me that there are more mic...	not verified	active	140877.0	77355.0	19034.0	1161.0	684.0
2	3	claim	9859838091	31	someone shared with me that american industria...	not verified	active	902185.0	97690.0	2858.0	833.0	329.0
3	4	claim	1866847991	25	someone shared with me that the metro of st. p...	not verified	active	437506.0	239954.0	34812.0	1234.0	584.0
4	5	claim	7105231098	19	someone shared with me that the number of busi...	not verified	active	56167.0	34987.0	4110.0	547.0	152.0

data.shape

(19382, 12)

data.dtypes

#                             int64
claim_status                 object
video_id                      int64
video_duration_sec            int64
video_transcription_text     object
verified_status              object
author_ban_status            object
video_view_count            float64
video_like_count            float64
video_share_count           float64
video_download_count        float64
video_comment_count         float64
dtype: object

Clean the data

data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

data.duplicated().sum()

data = data.dropna()
data.head()

	#	claim_status	video_id	video_duration_sec	video_transcription_text	verified_status	author_ban_status	video_view_count	video_like_count	video_share_count	video_download_count	video_comment_count
0	1	claim	7017666017	59	someone shared with me that drone deliveries a...	not verified	under review	343296.0	19425.0	241.0	1.0	0.0
1	2	claim	4014381136	32	someone shared with me that there are more mic...	not verified	active	140877.0	77355.0	19034.0	1161.0	684.0
2	3	claim	9859838091	31	someone shared with me that american industria...	not verified	active	902185.0	97690.0	2858.0	833.0	329.0
3	4	claim	1866847991	25	someone shared with me that the metro of st. p...	not verified	active	437506.0	239954.0	34812.0	1234.0	584.0
4	5	claim	7105231098	19	someone shared with me that the number of busi...	not verified	active	56167.0	34987.0	4110.0	547.0	152.0

data.shape

(19084, 12)

Explore the data

# Create scatter plots for view count vs other metrics
fig, axes = plt.subplots(2, 2, figsize=(16, 16))
fig.suptitle('Relationship between Video View Count and Other Engagement Metrics', fontsize=16)

for i, metric in enumerate(['video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']):
    ax = axes[i // 2, i % 2]
    ax.scatter(data['video_view_count'], data[metric], alpha=0.5)
    ax.set_xlabel('Video View Count')
    ax.set_ylabel(metric.replace('_', ' ').title())
    ax.set_xscale('log')
    ax.set_yscale('log')

plt.tight_layout()
plt.savefig('engagement_scatter_plots.png')

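[Figure: scatter plots of video view count against like, share, download, and comment counts, on log-log axes]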

These scatter plots visualize the relationship between video view count and other engagement metrics on a logarithmic scale.

Key observations:

There’s a clear positive trend between view count and all other engagement metrics.
The relationship appears to be roughly linear on the log-log scale, suggesting a power-law relationship between views and other engagement types.
There’s some variation in the strength of these relationships, with likes showing the tightest correlation and downloads showing more spread.

In conclusion, the analysis reveals a strong positive relationship between video view count and the other engagement metrics on TikTok: videos that attract more views tend to receive more likes, shares, downloads, and comments. For content creators, this suggests that strategies which increase view count are likely to coincide with higher engagement across all metrics, although the plots alone do not show which drives which.
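The power-law reading of the log-log plots can be checked numerically. The following is a minimal sketch, not part of the original notebook: it fits a straight line to log10(likes) against log10(views) with `np.polyfit`, whose slope estimates the exponent b in likes ≈ a · views^b. The data is synthetic (generated with an assumed exponent of 0.9) so the block runs on its own, and `loglog_slope` is a hypothetical helper name. Rows with zero counts are dropped because their logarithm is undefined, which is also why such rows do not appear on log-scaled axes.

```python
import numpy as np
import pandas as pd


def loglog_slope(df, x_col, y_col):
    """Return (slope, Pearson r) of log10(y) against log10(x).

    Rows where either value is <= 0 are dropped, since log10 is undefined there.
    """
    positive = df[(df[x_col] > 0) & (df[y_col] > 0)]
    log_x = np.log10(positive[x_col].to_numpy())
    log_y = np.log10(positive[y_col].to_numpy())
    slope, _intercept = np.polyfit(log_x, log_y, deg=1)
    r = np.corrcoef(log_x, log_y)[0, 1]
    return slope, r


# Synthetic stand-in for the TikTok data (assumed exponent 0.9, not a real result).
rng = np.random.default_rng(0)
views = 10 ** rng.uniform(2, 6, size=1000)
likes = 0.2 * views ** 0.9 * 10 ** rng.normal(0, 0.2, size=1000)
demo = pd.DataFrame({"video_view_count": views, "video_like_count": likes})

slope, r = loglog_slope(demo, "video_view_count", "video_like_count")
print(f"estimated exponent: {slope:.2f}, log-log correlation: {r:.2f}")
```

On the real data, the same helper could be called as `loglog_slope(data, 'video_view_count', metric)` for each engagement metric to compare how tight each relationship is.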

Results

Research question 1:

What is the relationship between the claim status of a video and other user status?

# Compare the count between verified status and claim status
verified_vs_claim_count = pd.crosstab(data['verified_status'], data['claim_status'])

print("Verified Status VS Claim Status count:")
print(verified_vs_claim_count)

Verified Status VS Claim Status count:
claim_status     claim  opinion
verified_status                
not verified      9399     8485
verified           209      991

# Compare the percentage between verified status and claim status
verified_vs_claim_percentage = verified_vs_claim_count.div(
    verified_vs_claim_count.sum(axis=1), axis=0) * 100

print("Percentage of Claim Status VS Verified Status:")
print(verified_vs_claim_percentage)

Percentage of Claim Status VS Verified Status:
claim_status         claim    opinion
verified_status                      
not verified     52.555357  47.444643
verified         17.416667  82.583333

# Compare the count between author ban status and claim status
ban_vs_claim_count = pd.crosstab(data['author_ban_status'], data['claim_status'])

print("Author Ban Status VS Claim Status Count:")
print(ban_vs_claim_count)

Author Ban Status VS Claim Status Count:
claim_status       claim  opinion
author_ban_status                
active              6566     8817
banned              1439      196
under review        1603      463

# Compare the percentage between author ban status and claim status
ban_vs_claim_percentage = ban_vs_claim_count.div(ban_vs_claim_count.sum(axis=1), axis=0) * 100

print("Percentage of Claim Status VS Author Ban Status:")
print(ban_vs_claim_percentage)

Percentage of Claim Status VS Author Ban Status:
claim_status           claim    opinion
author_ban_status                      
active             42.683482  57.316518
banned             88.012232  11.987768
under review       77.589545  22.410455
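
As an aside, the row percentages computed with `.div(...)` can also be obtained directly from `pd.crosstab` through its `normalize="index"` argument. A small sketch on made-up rows (the column names mirror the dataset, the values do not):

```python
import pandas as pd

# Tiny made-up sample; only the column names come from the dataset.
toy = pd.DataFrame({
    "verified_status": ["verified", "verified", "not verified",
                        "not verified", "not verified", "verified"],
    "claim_status": ["opinion", "claim", "claim",
                     "opinion", "claim", "opinion"],
})

# normalize="index" divides every row by its row total.
row_pct = pd.crosstab(toy["verified_status"], toy["claim_status"],
                      normalize="index") * 100
print(row_pct.round(1))
```

Each row of `row_pct` sums to 100, so it reads the same way as the percentage tables in this section.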

# Visualize the relationships using bar plots
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
verified_vs_claim_count.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Verified Status vs Claim Status')
plt.xlabel('Verified Status')
plt.ylabel('Count')
plt.legend(title='Claim Status', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.subplot(1, 2, 2)
ban_vs_claim_count.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Author Ban Status vs Claim Status')
plt.xlabel('Author Ban Status')
plt.ylabel('Count')
plt.legend(title='Claim Status', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

[Figure: stacked bar charts of claim status by verified status and by author ban status]

Observations:

General:
- Claim and opinion videos are split almost evenly in this dataset (9,608 claims vs. 9,476 opinions).
Verified Status vs Claim Status:
- Almost all claim videos (9,399 of 9,608) come from non-verified users.
- Non-verified users have a nearly even split between claims (52.6%) and opinions (47.4%).
- Verified users also post claims, but most of their videos (82.6%) are opinions.
Author Ban Status vs Claim Status:
- Active users lean moderately toward opinions (57.3% opinions vs. 42.7% claims).
- Banned users (88.0% claims) and users under review (77.6% claims) post mostly claims.

Possible explanations:

Claim videos come overwhelmingly from non-verified users, and banned or under-review users post claims far more often than opinions; this fits the idea that claim videos are more likely to violate platform rules.
Authors who are verified and not banned presumably follow the platform’s rules more closely, so the majority of their videos are opinions.
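
One way to check that the verified-status pattern is more than chance is a chi-square test of independence on the count table. The sketch below is an addition, not part of the original notebook: it computes the statistic by hand with NumPy, using the counts printed in the crosstab, and compares it with the standard 5% critical value for one degree of freedom. It treats every video as an independent observation, which is an assumption (the dataset has no author identifier to verify it).

```python
import numpy as np

# Observed counts copied from the verified_status x claim_status crosstab.
observed = np.array([[9399, 8485],   # not verified: claim, opinion
                     [209, 991]])    # verified:     claim, opinion

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
# Counts expected if claim status were independent of verified status.
expected = row_totals @ col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

print(f"chi-square statistic: {chi2:.1f}")
print("association beyond chance at the 5% level:", chi2 > CRITICAL_5PCT_DF1)
```

The same calculation applies to the ban-status table, which has two degrees of freedom and therefore a different critical value (5.991 at the 5% level).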

Research question 2:

What is the relationship between a user’s verified status and the engagement metrics (views, likes, shares, downloads, comments) of their videos?

# Calculate average engagement metrics for verified and non-verified users
engagement_metrics = ['video_view_count', 'video_like_count', 
                      'video_share_count', 'video_download_count', 'video_comment_count']

avg_engagement = data.groupby('verified_status')[engagement_metrics].mean().reset_index()

print("Average Engagement Metrics:")
print(avg_engagement)

Average Engagement Metrics:
  verified_status  video_view_count  video_like_count  video_share_count  \
0    not verified     265663.785339      87925.772422       17415.888000   
1        verified      91439.164167      30337.633333        6591.448333   

   video_download_count  video_comment_count  
0           1095.814080           363.700514  
1            358.146667           134.877500

# Calculate the ratio of engagement for verified vs. non-verified users
means = avg_engagement.set_index('verified_status')[engagement_metrics]
ratio = (means.loc['verified'] / means.loc['not verified']).values

print("Ratio of Verified VS. Non-Verified Engagement:")
# Round to two decimal places
for metric, r in zip(engagement_metrics, ratio):
    print(f"{metric}: {r:.2f}")

Ratio of Verified VS. Non-Verified Engagement:
video_view_count: 0.34
video_like_count: 0.35
video_share_count: 0.38
video_download_count: 0.33
video_comment_count: 0.37

# Visualize the results
plt.figure(figsize=(12, 6))
avg_engagement_melted = pd.melt(avg_engagement, id_vars=['verified_status'], value_vars=engagement_metrics, var_name='Metric', value_name='Average Count')

sns.barplot(x='Metric', y='Average Count', hue='verified_status', data=avg_engagement_melted)
plt.title('Average Engagement Metrics: Verified vs. Non-Verified Users')
plt.xticks(rotation=45)
plt.ylabel('Average Count (log scale)')
plt.yscale('log')
plt.legend(title='Verified Status')
plt.tight_layout()
plt.show()

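[Figure: grouped bar chart of average engagement metrics for verified and non-verified users, log-scale y-axis]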

Observations:

Non-verified users have higher average engagement (views, likes, shares, downloads, and comments) than verified users.
All ratios of verified to non-verified engagement are below 1, ranging from 0.33 to 0.38. In other words, videos from verified users receive only about one third of the average engagement that videos from non-verified users do.
The bar plot shows higher engagement for non-verified users across all metrics.

Possible explanations:

Verified users account for far fewer videos (1,200 vs. 17,884), so their averages rest on a smaller sample and are more sensitive to a few extreme values.
Non-verified users are more likely to post claim videos, which may attract more engagement than opinion videos.
TikTok’s algorithm might favor content from non-verified users so that good content can gain visibility and grow faster.
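
The second explanation (claim content driving the gap) can be probed by grouping on both `claim_status` and `verified_status` instead of `verified_status` alone. The block below is only a sketch on made-up rows, deliberately constructed so that the gap largely disappears within each claim status; whether the real data behaves this way is not shown in this analysis. Medians are included because view counts are heavily skewed.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the dataset.
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 4 + ["verified"] * 4,
    "claim_status": ["claim", "claim", "claim", "opinion",
                     "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [500_000, 400_000, 600_000, 5_000,
                         450_000, 4_000, 6_000, 5_000],
})

# Averages by verified status alone show a large gap.
overall = toy.groupby("verified_status")["video_view_count"].mean()

# Averages within each claim status show whether the gap survives.
by_both = (toy.groupby(["claim_status", "verified_status"])["video_view_count"]
              .agg(["mean", "median", "count"]))

print(overall)
print(by_both)
```

On the real data, the same `groupby` on `data` would show whether verified and non-verified authors still differ once claim and opinion videos are compared separately.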

Research question 3:

What machine learning model can best predict whether a video will go viral based on the video’s features?

Define a viral threshold (let's say the top 10% of videos by view count)

viral_threshold = data['video_view_count'].quantile(0.9)

# Create a viral videos variable: 1 represents video is viral, 0 otherwise
data['viral'] = (data['video_view_count'] > viral_threshold).astype(int)

data.head()

	#	claim_status	video_id	video_duration_sec	video_transcription_text	verified_status	author_ban_status	video_view_count	video_like_count	video_share_count	video_download_count	video_comment_count	viral
0	1	claim	7017666017	59	someone shared with me that drone deliveries a...	not verified	under review	343296.0	19425.0	241.0	1.0	0.0	0
1	2	claim	4014381136	32	someone shared with me that there are more mic...	not verified	active	140877.0	77355.0	19034.0	1161.0	684.0	0
2	3	claim	9859838091	31	someone shared with me that american industria...	not verified	active	902185.0	97690.0	2858.0	833.0	329.0	1
3	4	claim	1866847991	25	someone shared with me that the metro of st. p...	not verified	active	437506.0	239954.0	34812.0	1234.0	584.0	0
4	5	claim	7105231098	19	someone shared with me that the number of busi...	not verified	active	56167.0	34987.0	4110.0	547.0	152.0	0

Feature selection

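features = ['video_duration_sec', 'video_like_count',
            'video_share_count', 'video_download_count', 'video_comment_count']
# All selected features are numeric, so get_dummies leaves them unchanged here;
# it would only matter if a categorical column were added to the list.
X = pd.get_dummies(data[features], drop_first=True)
y = data['viral']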

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Model Training

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

# Create pipelines
pipelines = {name: Pipeline([
    ('scaler', StandardScaler()),
    ('model', model)
]) for name, model in models.items()}

# Train and evaluate models
results = []
for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(pipeline, X, y, cv=5)
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'CV Mean': cv_scores.mean()
    })
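
A note on the cross-validation call: with `cv=5` and a classifier, `cross_val_score` uses stratified folds without shuffling. If the rows of the CSV are ordered in some way (the first rows shown by `data.head()` are all claims, so this may be the case), unshuffled folds can differ systematically from one another. The sketch below, on synthetic data from `make_classification`, shows how an explicit shuffled splitter could be passed instead; it is an assumption-laden illustration rather than the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for (X, y); roughly 10% positives.
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     weights=[0.9, 0.1], random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=42)),
])

# Shuffle before splitting so each fold mixes rows from the whole file.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

In the notebook's loop this would amount to replacing `cv=5` with `cv=cv`.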

Compare the Models

results = pd.DataFrame(results)
print("Compare the Models:")
print(results)

Compare the Models:
                 Model  Accuracy   CV Mean
0  Logistic Regression  0.921404  0.906520
1        Decision Tree  0.873199  0.857319
2        Random Forest  0.920618  0.906362

# Get detailed report for the best model
best_model = results.loc[results['CV Mean'].idxmax(), 'Model']
best_pipeline = pipelines[best_model]
y_pred = best_pipeline.predict(X_test)

print(f"Classification Report for {best_model}:")
print(classification_report(y_test, y_pred))

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.93      0.98      0.96      3432
           1       0.72      0.36      0.48       385

    accuracy                           0.92      3817
   macro avg       0.83      0.67      0.72      3817
weighted avg       0.91      0.92      0.91      3817
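
To put the accuracy figures in context, it helps to compare them with a baseline that always predicts the majority class. Because "viral" is defined as the top 10% of views, such a baseline already scores about 90%. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with 10% positives; the labels are made up and only mirror the class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with 10% positives, mirroring the top-10% definition of "viral".
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class (here: not viral).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

print("baseline accuracy:", accuracy_score(y_demo, y_base))
print("baseline recall on the viral class:", recall_score(y_demo, y_base))
```

A model is only informative to the extent that it beats this baseline, which is why recall and F1 for the viral class say more here than overall accuracy.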

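Comparing mean cross-validation scores, Logistic Regression (0.9065) and Random Forest (0.9064) are effectively tied, and both clearly beat the Decision Tree (0.8573); the classification report above is for Logistic Regression, the marginal winner. Because a Random Forest also exposes feature importances, I will use it next to find which feature matters most for predicting whether a TikTok video goes viral.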

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)

RandomForestClassifier(random_state=42)


# Feature importance
feature_importance = pd.DataFrame({'feature': features, 'importance': model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
print("Feature Importance:")
print(feature_importance)

Feature Importance:
                feature  importance
1      video_like_count    0.332143
2     video_share_count    0.205772
3  video_download_count    0.195245
4   video_comment_count    0.158660
0    video_duration_sec    0.108180

Observation:

From the feature importances, the most important feature for predicting virality is the video like count.

Use the model to predict some examples:

# Example prediction
for i in range(55, 60):
    # Keep the row as a one-row DataFrame so the scaler sees the feature names it was fitted with
    example = X_test.iloc[[i]]
    example_scaled = scaler.transform(example)
    prediction = model.predict(example_scaled)
    print(f"Actual viral status: {'Yes' if y_test.iloc[i] == 1 else 'No'}")
    print(f"Predicted viral status: {'Yes' if prediction[0] == 1 else 'No'}")
    print(f"Video features: {example.iloc[0].to_dict()}")

Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 12.0, 'video_like_count': 1079.0, 'video_share_count': 12.0, 'video_download_count': 2.0, 'video_comment_count': 0.0}
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 43.0, 'video_like_count': 81.0, 'video_share_count': 21.0, 'video_download_count': 0.0, 'video_comment_count': 0.0}
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 9.0, 'video_like_count': 24684.0, 'video_share_count': 5363.0, 'video_download_count': 283.0, 'video_comment_count': 26.0}
Actual viral status: Yes
Predicted viral status: Yes
Video features: {'video_duration_sec': 33.0, 'video_like_count': 501235.0, 'video_share_count': 117357.0, 'video_download_count': 11126.0, 'video_comment_count': 180.0}
Actual viral status: No
Predicted viral status: No
Video features: {'video_duration_sec': 38.0, 'video_like_count': 3029.0, 'video_share_count': 705.0, 'video_download_count': 35.0, 'video_comment_count': 0.0}


/opt/conda/lib/python3.10/site-packages/sklearn/base.py:465: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:465: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:465: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:465: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:465: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(

In summary, I built a Random Forest model to predict whether a TikTok video will go viral from its duration in seconds and four engagement metrics (likes, shares, downloads, and comments); view count itself defines the viral label and is not used as a feature. The Random Forest reached about 92% test accuracy, on par with Logistic Regression and above the Decision Tree. Because only 10% of videos are labeled viral, always predicting "not viral" would already score about 90% (3,432 of 3,817 test videos), so accuracy alone overstates how well the models separate viral from non-viral videos.

The analysis found that engagement metrics are the most important predictors, with like count and share count topping the list. However, the classification report (computed for Logistic Regression) shows conservative labeling of viral videos: recall for the viral class is only 0.36. The models would therefore benefit from handling class imbalance explicitly to improve their sensitivity to viral videos.

Overall, the Random Forest model was informative about which metrics accompany virality on TikTok, which may be useful for content creators and marketers trying to reach as many people as possible. Future work might consider more features, address the class imbalance, and try other machine learning algorithms to improve model performance.
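
As one possible starting point for the class-imbalance work mentioned above, scikit-learn classifiers such as `LogisticRegression` accept `class_weight="balanced"`, which reweights classes inversely to their frequency. The sketch below compares viral-class recall with and without it on synthetic data; it is a hedged illustration, and whether recall improves on the real TikTok data (and at what cost in precision) would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the viral / non-viral problem (about 10% positives).
X_demo, y_demo = make_classification(n_samples=2000, n_features=5,
                                     weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=weight, random_state=42, max_iter=1000)
    clf.fit(X_tr, y_tr)
    # Recall on the positive (minority) class.
    recalls[label] = recall_score(y_te, clf.predict(X_te))

print(recalls)
```

Passing `stratify=y_demo` keeps the class ratio the same in the train and test splits, which makes minority-class recall easier to compare between runs.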

Implications and Limitations

Benefits:
- Content creators can use these findings to improve their videos' chances of going viral.
- TikTok can refine its recommendation algorithms using information from this research.
Data Setting Impacts:
- The data might not represent TikTok videos over the long term.
- The way this dataset was collected might over- or under-represent certain types of content or users.
Limitations:
- The model only considers a few quantitative metrics such as likes, shares, and duration; it doesn’t account for qualitative factors such as content quality, trends, or cultural relevance.
  - Advice: Treat the results as a reference only; many other real-world factors affect whether a video goes viral.
- The model doesn’t account for the ethical implications of content, and a video could go viral for negative reasons.
  - Advice: Authors should consider the impact of the content they create.
- The model doesn’t capture how virality changes over time.
  - Advice: Authors should be aware that the viral patterns of a video can change quickly.

sources = ['https://www.kaggle.com/datasets/yakhyojon/tiktok',
           'https://www.geeksforgeeks.org/working-images-python/',
           'https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html',
           'learning-algorithms.ipynb',
           'https://www.kaggle.com/code/yakhyojon/tiktok-misinformation-classification-99-6-accurat',
           'model-evaluation.ipynb',
           'ChatGPT: How to train multiple machine learning algorithms at the same time?',
           'ChatGPT: How to compare which machine learning model is the best?',
           'https://machinelearningmastery.com/calculate-feature-importance-with-python/',
           'https://www.geeksforgeeks.org/feature-importance-with-random-forests/',
           'https://www.geeksforgeeks.org/numpy-quantile-in-python/',
           'https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html',
           'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html',
           'https://scikit-learn.org/stable/modules/cross_validation.html',
           'ChatGPT: How to use a machine learning model to predict some examples in the dataset?',
           'https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html'         
]