Web Scraping Instagram to Build Your Own Profile’s Interactive Dashboard with Instaloader and Google Data Studio

Looking for an excuse to learn a bit more about web scraping and Google Data Studio, I decided to start a project based on my wife’s commercial Instagram profile. The goal was to build an online, updatable dashboard with some useful metrics, like top hashtags, frequently used words, and post distribution per weekday:

[Image: the finished dashboard]

Requirements:

  • Fast and easy to create/update
  • Usable for any Instagram account, as long as it’s public

In this article, I want to share my approach with you.

The whole project can be found at the project’s GitHub repository, and if you are only interested in usage rather than how it works, you might consider going straight to the project’s README.

So, in order to achieve my goal, I would need to do the following:

  1. Extract information from Instagram
  2. Transform this information into useful metrics
  3. Upload these metrics and information to a data source accessible to Google Data Studio
  4. Connect the data source and build the dashboard at Data Studio

It should be noted that I won’t cover step 4 here. In this article, I’ll limit myself to steps 1 to 3, and leave the explanation of how to actually build the dashboard for another time.

To summarize, I used the Instaloader package to extract the information and then processed it in a Python script using Pandas.

As a data source, I decided to use a Google Sheet in my personal Drive account. To manipulate this spreadsheet, I used a Python library called gspread.

The dashboard also uses some images, for the logo and to generate the word cloud (which I will discuss later). At the time I was building the dashboard, Data Studio didn’t recognize image URLs from my Drive account, so I created an Imgur profile and used the Python API imgurpython.

Let’s detail the whole process a bit more.

The Pipeline

I wanted to run all these tasks sequentially, so I wrote a shell script to generate/update the report with a single command:

./insta_pipe.sh <your-profile> <language>

Here, <your-profile> is your Instagram profile name, and <language> is the language you want the report to be generated in (currently en or pt).

The shell script looks something like this:
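
Roughly, it just chains the three steps covered in this article. The sketch below is reconstructed from those steps and is only an approximation; the actual insta_pipe.sh in the repository may differ:

#!/bin/bash
# insta_pipe.sh (sketch): $1 = Instagram profile, $2 = report language

# 1. Scrape post metadata from Instagram into <profile>/*.txt
instaloader --no-pictures --no-videos --no-metadata-json --post-metadata-txt="date:\n{date_local}\ntypename:\n{typename}\nlikes:\n{likes}\ncomments:\n{comments}\ncaption:\n{caption}" $1;

# 2. Repair the escaped "\n" sequences in the downloaded files
python fix_lines.py $1

# 3. Build the metrics and push them to Google Sheets and Imgur
python transform_and_upload.py $1 $2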

I will talk about each command of this script in the following sections.

What we end up with

After running the shell script, the goal is to have a Google spreadsheet updated with the information we need, much like the one below:

[Image: the resulting spreadsheet]

I decided to divide groups of information into different worksheets:

  • ProfileInfo — profile name and Imgur URL for the profile pic

  • WordCloud — Imgur URL for the word cloud image

  • Top_Hash — top 10 hashtags, according to the average number of likes

  • Data — table of one Instagram post per row, with info about media type (image, video, or sidecar), plus the number of likes and comments

  • MediaMetrics — the average number of likes per media type

  • DayMetrics — the average number of likes and number of posts per weekday

  • MainMetrics — the overall average number of likes and comments

Extracting Information from your Insta Profile

For the first task, I decided to use an awesome library called Instaloader. As stated on the library’s website, Instaloader is a tool to download pictures or videos from Instagram, along with other metadata. For this project, I am mainly interested in info such as the number of likes, comments, captions, and hashtags.

Once you pip install instaloader and read the documentation, it turns out that for this project, a single command is all it takes:

instaloader --no-pictures --no-videos --no-metadata-json --post-metadata-txt="date:\n{date_local}\ntypename:\n{typename}\nlikes:\n{likes}\ncomments:\n{comments}\ncaption:\n{caption}" $1;

That will create a folder named after your profile, containing a bunch of .txt files, one for each post. Inside each file, there will be information about:

  • date
  • media type (image, video, or sidecar)
  • number of likes
  • number of comments
  • caption

If you run this command, you’ll see that my attempt at breaking the lines with \n was not successful. The command automatically escapes the backslash, and I end up with a literal “\n” written in the files.

I am certain there is a smarter way to do this, but my workaround was to replace the escaped \\n with a real \n in every text file, which is what fix_lines.py does.
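
For reference, here is a minimal sketch of that workaround; the actual fix_lines.py in the repository may differ:

import sys
import glob

# Replace the literal "\n" sequences Instaloader wrote into each
# post's .txt file with real line breaks
profile_folder = sys.argv[1]
for filename in glob.glob(f"{profile_folder}/*.txt"):
    with open(filename, encoding="utf-8") as f:
        text = f.read()
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text.replace("\\n", "\n"))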

Oh, and Instaloader will also download your profile pic, which I also use as the logo for the dashboard.

Transform and Upload — Preliminary Steps

For this step, I had to make sure I had a few things beforehand:

  • a Google account to use Drive
  • an Imgur account

I also had to follow some instructions to authenticate and authorize the application for both gspread and Imgur.

For gspread, I followed the library’s authentication instructions, ending up with a credentials.json file placed at ~/.config/gspread/credentials.json.

As for Imgur, I followed the imgurpython instructions, working through the registration quickstart section until I had the following: client_id, client_secret, access_token, and refresh_token. These tokens replace the placeholders in the imgur_credentials.json file, along with the username of your Imgur account.
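
The file then looks something like this (the key names follow the tokens listed above; the exact schema expected by the project’s Imgur wrapper is an assumption, so check the repository’s template):

{
    "client_id": "your-client-id",
    "client_secret": "your-client-secret",
    "access_token": "your-access-token",
    "refresh_token": "your-refresh-token",
    "username": "your-imgur-username"
}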

The last thing is that I had to create a blank Google Sheet beforehand and get its key. If you open a Google Sheet, the link will look something like this:

https://docs.google.com/spreadsheets/d/1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw/edit#gid=0

The key is that sequence of letters and numbers in the middle:

1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw

I will use it later to let gspread know which spreadsheet to update.

Transform and Upload — Assemble, Generate and Upload

The script transform_and_upload.py reads the .txt files created by Instaloader, assembles all the information, creates the metrics and dataframes, and then updates the worksheets:

import sys
import json
from gSheet import gSheet
from reportTools import assemble_info, generate_df_per_day, generate_top_hashes, assemble_metrics
from imgur import Imgur


def main():

    path = sys.argv[1]
    language = sys.argv[2]

    # Google Sheet key we want to use to update Data Studio
    sheet_key = 'your-sheet-key'
    g_sheet = gSheet(sheet_key)

    # Imgur client: remove images from past runs before uploading new ones
    with open('imgur_credentials.json', 'r') as f:
        imgur_credentials = json.load(f)
    imgur_conn = Imgur(imgur_credentials)
    print("Cleaning preexisting images...")
    res = imgur_conn.clean_user_images()
    if not res:
        print("Error on cleaning user images!")

    # Assemble the post dataframe and derive the other dataframes from it
    df_posts, caption_text, img_link = assemble_info(path, imgur_conn, language)
    df_hash = generate_top_hashes(df_posts)
    df_day = generate_df_per_day(df_posts, language)
    df_data = df_posts[['media_code', 'media_type', 'likes', 'comments']]

    metrics = assemble_metrics(df_posts)
    mean_likes = metrics['mean_likes']
    mean_comments = metrics['mean_comments']
    likes_image = metrics['likes_image']
    likes_video = metrics['likes_video']
    likes_carousel = metrics['likes_carousel']

    # Push everything to the worksheets
    g_sheet.update_data_sheet(df_data)
    g_sheet.update_wordcloud(caption_text, path, imgur_conn, language)
    g_sheet.update_mainmetrics(mean_likes, mean_comments)
    g_sheet.update_top_hashes(df_hash)
    g_sheet.update_media_metrics(likes_image, likes_video, likes_carousel)
    g_sheet.update_day_metrics(df_day)
    g_sheet.update_profile_info(img_link, path)


if __name__ == "__main__":
    main()

Creating the clients

First, we begin by setting the Google Sheet key we wish to update and creating a sheet object so we can later update its content:

sheet_key = 'your-sheet-key'
g_sheet = gSheet(sheet_key)

g_sheet is an instance of the gSheet class, which contains the methods to authenticate and update the worksheets. Just to show you the beginning of it (you can check the rest of it at the repository):

import gspread


class gSheet:
    def __init__(self, sheet_key):
        # Authenticate via OAuth (reads ~/.config/gspread/credentials.json)
        # and open the target spreadsheet by its key
        self.gc = gspread.oauth()
        self.sh = self.gc.open_by_key(sheet_key)
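
The update methods all follow the same simple pattern. As an illustration, a method like update_mainmetrics could look roughly like the sketch below (an assumption about its layout, not the repository’s exact code):

def update_mainmetrics(self, mean_likes, mean_comments):
    # Header row first, then the two overall averages below it
    sheet = self.sh.worksheet('MainMetrics')
    sheet.update('A1', 'mean_likes')
    sheet.update('B1', 'mean_comments')
    sheet.update('A2', mean_likes)
    sheet.update('B2', mean_comments)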

An Imgur client is also created, using the credentials to perform operations on your account, like removing images from past runs of the application and uploading the new images you want displayed on your dashboard:

with open('imgur_credentials.json', 'r') as f:
    imgur_credentials = json.load(f)
imgur_conn = Imgur(imgur_credentials)
print("Cleaning preexisting images...")
res = imgur_conn.clean_user_images()
if not res:
    print("Error on cleaning user images!")

Generating dataframes and metrics

The assemble_info function is the one that actually reads the text files line by line and assembles the information into an initial dataframe called df_posts:

df_posts,caption_text,img_link = assemble_info(path,imgur_conn,language)

Each row of df_posts refers to a single post, with the following columns:

  • media_type: Video, Image or Sidecar

  • media_code: the above encoded as integers (1, 2, or 3)

  • likes: Number of likes the post currently has

  • comments: Number of comments the post currently has

  • date: Date of creation

  • hashed: List of hashtags used

In addition, assemble_info uploads the profile image to Imgur, returning its URL, and concatenates the text of every caption (except hashtags), which is used later to generate the word cloud.
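
To make the text-file layout concrete, here is a sketch of how one of those .txt files could be parsed into a single record; the actual parsing inside assemble_info may differ:

def parse_post_file(filename):
    # The file alternates "key:" lines and value lines, following the
    # --post-metadata-txt template shown earlier; the caption comes last
    # and may span several lines
    with open(filename, encoding="utf-8") as f:
        lines = f.read().splitlines()
    record = {}
    i = 0
    while i < len(lines):
        key = lines[i].rstrip(":")
        if key == "caption":
            record[key] = "\n".join(lines[i + 1:])
            break
        record[key] = lines[i + 1]
        i += 2
    return record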

df_posts is then used to generate further, more specific dataframes and metrics:

df_hash = generate_top_hashes(df_posts)
df_day = generate_df_per_day(df_posts,language)
df_data = df_posts[['media_code','media_type','likes','comments']]


metrics = assemble_metrics(df_posts)
mean_likes = metrics['mean_likes']
mean_comments = metrics['mean_comments']
likes_image = metrics['likes_image']
likes_video = metrics['likes_video']
likes_carousel = metrics['likes_carousel']

df_hash contains the hashtags in descending order of their average number of likes (hashtags that appear fewer than 5 times are ignored).
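
A ranking like that takes only a few Pandas operations. Here is a sketch of the idea; generate_top_hashes in the repository may be implemented differently:

def top_hashes_sketch(df_posts, min_count=5, top_n=10):
    # One row per (post, hashtag) pair, since 'hashed' holds lists
    exploded = df_posts.explode('hashed')
    stats = exploded.groupby('hashed')['likes'].agg(['mean', 'count'])
    # Drop rarely used hashtags, then rank by average likes
    stats = stats[stats['count'] >= min_count]
    return stats['mean'].sort_values(ascending=False).head(top_n)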

Per-weekday information is stored in df_day, while df_data is just a subset of df_posts that Data Studio will use to create the pie chart by media type.
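
A sketch of how generate_df_per_day could compute those figures with Pandas, ignoring the language argument (which presumably just localizes the weekday names):

import pandas as pd

def df_per_day_sketch(df_posts):
    # Average likes and number of posts, grouped by weekday name
    weekday = pd.to_datetime(df_posts['date']).dt.day_name()
    return df_posts.groupby(weekday)['likes'].agg(
        mean_likes='mean', n_posts='count')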

As for metrics, it contains the overall averages of likes and comments, plus figures broken down by media type.
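
assemble_metrics then boils down to a handful of averages. A sketch, assuming media_type holds the values Image, Video, and Sidecar as listed earlier:

def assemble_metrics_sketch(df_posts):
    # Overall averages, plus average likes per media type
    by_type = df_posts.groupby('media_type')['likes'].mean()
    return {
        'mean_likes': df_posts['likes'].mean(),
        'mean_comments': df_posts['comments'].mean(),
        'likes_image': by_type.get('Image', 0),
        'likes_video': by_type.get('Video', 0),
        'likes_carousel': by_type.get('Sidecar', 0),
    }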

Uploading to Drive

With everything at hand, it is time to update our Google Sheet:

g_sheet.update_data_sheet(df_data)
g_sheet.update_wordcloud(caption_text,path,imgur_conn,language)
g_sheet.update_mainmetrics(mean_likes,mean_comments)
g_sheet.update_top_hashes(df_hash)
g_sheet.update_media_metrics(likes_image,likes_video,likes_carousel)
g_sheet.update_day_metrics(df_day)
g_sheet.update_profile_info(img_link,path)

All these methods basically update each worksheet with the appropriate information. The only exception is update_wordcloud, which also generates the word cloud, using nltk to tokenize the text and remove stopwords, plus the wordcloud package:

from wordcloud import WordCloud
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the punkt data in addition to the stopword lists
nltk.download('stopwords')
nltk.download('punkt')


def generate_wordcloud(caption_text, language):

    if language == 'pt':
        stopwords = nltk.corpus.stopwords.words('portuguese')
    elif language == 'en':
        stopwords = nltk.corpus.stopwords.words('english')

    try:
        # Tokenize the captions, drop the stopwords, and render the cloud
        text_tokens = word_tokenize(caption_text)
        tokens_without_sw = [word for word in text_tokens if word not in stopwords]
        filtered_sentence = " ".join(tokens_without_sw)
        wordcloud = WordCloud(background_color="white").generate(filtered_sentence)
        wordcloud.to_file("wcloud.png")
        return 1
    except Exception:
        return None

and then uploads it to Imgur before sending the URL to the worksheet:

def update_wordcloud(self, caption_text, path, imgur_conn, language):
    res = generate_wordcloud(caption_text, language)
    cloud_sheet = self.sh.worksheet('WordCloud')
    cloud_sheet.update('A1', 'cloud_url')
    cloud_sheet.update('B1', 'profile_name')

    if res:
        # Upload the freshly generated word cloud and store its URL
        cloud_image = imgur_conn.upload_image('wcloud.png')
        cloud_link = cloud_image['link']
    else:
        # Fall back to a blank cell if generation failed
        cloud_link = ' '

    cloud_sheet.update('A2', cloud_link)
    cloud_sheet.update('B2', path)

And that’s it!

For the sake of brevity, I didn’t show every line of code used throughout the script here, but if you are interested, you can check it all out at the project’s repository.

Next Steps

Now that we have an updated Google Sheet, we can use it as a data source for Data Studio. The next step would be, of course, building the dashboard. As I said before, explaining that process here would make this story somewhat long, so I will leave it for another time!

Thank you for letting me share the experience!

Translated from: https://medium.com/analytics-vidhya/web-scraping-instagram-to-build-your-own-profiles-interactive-dashboard-with-instaloader-and-42141575e009
