An Instagram Crawler with Breakpoint Resume: Scraping an AJAX-Loading Site

Task Description

Given a list of Instagram accounts, the task is to crawl all of their posts. Each post should come with:

  • timestamp
  • caption
  • image(s)
  • like count
  • comment count

If the post is a short video, we also need:

  • the video file
  • the view count

Instagram Site Structure Analysis

Instagram serves post data as JSON: each response carries 12 posts together with a cursor that points to the next batch.
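
For orientation, this is roughly the subset of that JSON the crawler reads, sketched as a Python literal (the key names come from the parsing code below; the real payload contains many more fields):

response = {
    'data': {'user': {'edge_owner_to_timeline_media': {
        'count': 1234,                                    # total posts of the account
        'page_info': {'end_cursor': '...', 'has_next_page': True},
        'edges': [                                        # up to 12 posts per response
            {'node': {'shortcode': '...',
                      'taken_at_timestamp': 1574000000,
                      'display_url': 'https://...',
                      'edge_media_to_comment': {'count': 0},
                      'edge_media_preview_like': {'count': 0}}},
        ],
    }}}
}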

Because the GraphQL endpoint identifies each query by a query_hash, a typical query URL looks like this:

https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D

In other words, fetching each batch of posts takes two parameters: user_id and cursor. Fill them into the URL and request it to get the desired JSON. In fact, opening the filled-in URL directly in a browser returns the same JSON.
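
As a sanity check, the hard-coded template above is nothing more than a percent-encoded JSON object, so the same URL can be rebuilt from its two parameters with the standard library (a minimal sketch):

import json
from urllib.parse import quote

QUERY_HASH = 'a5164aed103f24b03e7b7747a2d94e3c'

def build_query_url(user_id, cursor, first=12):
    # Serialize the GraphQL variables compactly, then percent-encode them.
    variables = json.dumps({'id': user_id, 'first': first, 'after': cursor},
                           separators=(',', ':'))
    return ('https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s'
            % (QUERY_HASH, quote(variables, safe='')))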

The user_id and end_cursor in the query URL above are carried over from the previous page. The first page of each account has no previous page to inherit from, so it must be crawled separately. Fortunately, the first-page URL of every account is fixed:

https://www.instagram.com/{account}/

Parsing the first page yields the user_id and the cursor for the second page, as sketched below.
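
A condensed sketch of that first-page extraction (the main program below does the same job with pyquery; the exact window._sharedData pattern is an assumption about the inline script's shape):

import re
import json

def parse_first_page(html):
    # The numeric user id is embedded as "profilePage_<id>" in the page source.
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    # window._sharedData = {...}; carries the first 12 posts plus page_info.
    m = re.search(r'window\._sharedData\s*=\s*(\{.*?\});</script>', html, re.S)
    js_data = json.loads(m.group(1))
    media = js_data['entry_data']['ProfilePage'][0]['graphql']['user'][
        'edge_owner_to_timeline_media']
    page_info = media['page_info']
    return user_id, page_info['end_cursor'], page_info['has_next_page']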

Data Storage and Visualization Strategy

To store the data efficiently, I use a two-layer structure: the outer layer is a list, the inner layer a dict. The dict has one key per variable to be crawled, and a fresh dict is appended for every post. The result looks like this:

[{'img_url': 'https://scontent-frt3-1.cdninstagram.com/vp/2de4fe0ca443d27ead0601306a4d2d9f/5E65FD4A/t51.2885-15/e35/73497417_178613769947279_3483644294279361168_n.jpg?_nc_ht=scontent-frt3-1.cdninstagram.com&_nc_cat=1', 'comment_count': 9042, 'like_count': 565657, 'text': 'Meet today’s #WeeklyFluff, Albert (@pompous.albert), a Selkirk Rex cat who might look faux... but is keeping it real. 😻\\u2063\\n\\u2063\\nPhoto by @pompous.albert'}, 
{'img_url': 'https://scontent-frt3-1.cdninstagram.com/vp/ff83ef12404713e3584ba07441a23913/5E856EC0/t51.2885-15/e35/p1080x1080/72783038_1207153232810009_5652648210556063310_n.jpg?_nc_ht=scontent-frt3-1.cdninstagram.com&_nc_cat=1', 'comment_count': 5506, 'like_count': 637442, 'text': 'For Colombian singer-songwriter Camilo (@camilomusica), the #LatinGRAMMY Awards are a big party of close friends, who just happen to be some of the biggest artists in the world right now. 🔥🌎\\u2063\\n\\u2063\\nSee who Camilo runs into and guess who he’s going to collab with next. It’s #GameOn at the @latingrammys, right now on our story.'}]

To make the results easier to inspect, I wrote a function that converts the data above into tabular form:

def nestedlist2csv(rows, out_file):
    # Write a list of same-keyed dicts as a CSV table, header first.
    with open(out_file, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(rows[0].keys())
        for row in rows:
            w.writerow(row.values())
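
Because every dict carries the same keys, csv.DictWriter is a drop-in alternative that also tolerates an occasional missing field (a minimal sketch):

import csv

def nestedlist2csv_dw(rows, out_file):
    with open(out_file, 'w', newline='') as f:
        w = csv.DictWriter(f, fieldnames=rows[0].keys(), restval='')
        w.writeheader()
        for row in rows:
            w.writerow(row)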

The final result is a flat CSV table with one row per post.

Main Program

import re
import json
import time
import random
import requests
from pyquery import PyQuery as pq
import pandas as pd
import csv
from datetime import datetime
import math

def baseurl(acc):
    # Fixed profile URL of an account; its HTML holds the first page of posts.
    return 'https://www.instagram.com/%s/' % acc

uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

# Account list: whitespace-delimited columns acc, id, postno
idlist = pd.read_table('accidlist.txt', header=0, encoding='gb18030', delim_whitespace=True)
idlist.columns = ['acc', 'id', 'postno']

headers = {
    "Origin": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instragram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

def get_html(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        else:
            print('Error fetching page HTML, status code:', response.status_code)
    except Exception as e:
        print(e)
        return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print('Error fetching JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # Back off for 60+ seconds, then retry; recurses until the request succeeds.
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_pics(picurl,picname):
    picresp = requests.get(picurl, headers=headers, timeout=10)
    with open('%s.png'%picname, 'wb') as f:
        f.write(picresp.content)

def nestedlist2csv(rows, out_file):
    # Write a list of same-keyed dicts as a CSV table, header first.
    with open(out_file, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(rows[0].keys())
        for row in rows:
            w.writerow(row.values())

def get_date(timestamp):
    local_str_time = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
    return local_str_time

def get_samples(html,acc):
    samples = []
    page = 0
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    print("The user id is %s"%user_id)

    doc = pq(html)
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            # Strip the 'window._sharedData = ' prefix (21 chars) and the trailing ';'
            js_data = json.loads(item.text()[21:-1])

            edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]
            totalpost = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["count"]
            totalpage = math.ceil(totalpost/12)
            page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"][
                'page_info']
            cursor = page_info['end_cursor']
            flag = page_info['has_next_page']

            for edge in edges:
                sample = {}

                if edge['node']['display_url']:
                    sample["Influencer"] = acc
                    timestamp = edge['node']['taken_at_timestamp']
                    sample["date"] = get_date(timestamp)
                    sample["comment_count"] = edge['node']['edge_media_to_comment']["count"]
                    sample["like_count"] = edge['node']['edge_liked_by']["count"]

                if edge['node']['shortcode']:
                    shortcode = edge['node']['shortcode']
                    sample['postlink'] = 'https://www.instagram.com/p/%s/'%(shortcode)
                    textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                    textRespose = get_json(headers, textUrl)
                    try:
                        textDict = textRespose['graphql']['shortcode_media'][
                            'edge_media_to_caption']['edges'][0]['node']
                        sample["caption"] = textDict['text']
                    except (IndexError, KeyError, TypeError):
                        sample["caption"] = ""   # post has no caption
                    children = textRespose["graphql"]["shortcode_media"].get('edge_sidecar_to_children')
                    if children:
                        sample['multipic'] = 'True'
                        picurls = ""
                        for child in children['edges']:
                            picurls = picurls + child['node']['display_url'] + ','
                        sample['img_urls'] = picurls
                    else:
                        sample['multipic'] = 'False'
                        sample['img_urls'] = textRespose['graphql']['shortcode_media']['display_url']
                    isvideo = textRespose["graphql"]["shortcode_media"].get('is_video')
                    if isvideo:
                        sample['video_url'] = textRespose["graphql"]["shortcode_media"].get('video_url')
                        sample['video_view_count'] = textRespose["graphql"]["shortcode_media"].get('video_view_count')
                    else:
                        sample['video_url'] = ""
                        sample['video_view_count'] = ""

                samples.append(sample)
                time.sleep(float(random.randint(1, 3)))
            nestedlist2csv(samples,'%s_postlist.csv'%acc)
            page += 1
            print("Finish the %s page of %s, the total page number is %s"%(page,acc,totalpage))

    while flag:
        url = uri.format(user_id=user_id, cursor=cursor)
        print([user_id, cursor])
        js_data = get_json(headers, url)
        infos = js_data['data']['user']['edge_owner_to_timeline_media']['edges']
        cursor = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        flag = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
        for info in infos:
            sample = {}
            sample["Influencer"] = acc
            timestamp = info['node']['taken_at_timestamp']
            sample["date"] = get_date(timestamp)
            sample["comment_count"] = info['node']['edge_media_to_comment']["count"]
            sample["like_count"] = info['node']['edge_media_preview_like']["count"]

            if info['node']['shortcode']:
                time.sleep(1)
                shortcode = info['node']['shortcode']
                sample['postlink'] = 'https://www.instagram.com/p/%s/' % (shortcode)
                textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                textRespose = get_json(headers, textUrl)

                try:
                    textDict = textRespose['graphql']['shortcode_media'][
                        'edge_media_to_caption']['edges'][0]['node']
                    sample["caption"] = textDict['text']
                except (IndexError, KeyError, TypeError):
                    sample["caption"] = ""   # post has no caption

                children = textRespose["graphql"]["shortcode_media"].get('edge_sidecar_to_children')
                if children:
                    sample['multipic'] = 'True'
                    picurls = ""
                    for child in children['edges']:
                        picurls = picurls + child['node']['display_url'] + ','
                    sample['img_urls'] = picurls
                else:
                    sample['multipic'] = 'False'
                    sample['img_urls'] = textRespose['graphql']['shortcode_media']['display_url']
                isvideo = textRespose["graphql"]["shortcode_media"].get('is_video')
                if isvideo:
                    sample['video_url'] = textRespose["graphql"]["shortcode_media"].get('video_url')
                    sample['video_view_count'] = textRespose["graphql"]["shortcode_media"].get('video_view_count')
                else:
                    sample['video_url'] = ""
                    sample['video_view_count'] = ""
            samples.append(sample)
            time.sleep(float(random.randint(1, 3)))
        nestedlist2csv(samples,'%s_postlist.csv'%acc)
        page += 1
        print("Finish the %s page of %s, the total page number is %s" % (page, acc, totalpage))


def main():
    for i in range(len(idlist.loc[:, 'acc'])):
        acc = idlist.loc[i, 'acc']
        url = baseurl(acc)
        print(url)
        html = get_html(url)
        ticks = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("Start processing account %s at %s" % (acc, ticks))
        try:
            get_samples(html, acc)
        except:
            # Take a fresh timestamp; ticks above holds the start time.
            print("Interrupted at %s" % datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
            break
        print("Finished account %s at %s" % (acc, datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
        time.sleep(float(random.randint(1, 4000)/10))

if __name__ == '__main__':
    start = time.time()
    main()
    print('Total run time: %.0f s' % (time.time() - start))

Breakpoint Resume
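
If the crawler dies partway through an account, the last print([user_id, cursor]) line in its log tells you which page it stopped on. The script below takes those values, filled in by hand in the __main__ block at the bottom, and keeps appending pages to the same per-account CSV (note the 'a' mode in nestedlist2csv).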

import time
import random
import requests
import pandas as pd
import csv
from datetime import datetime

uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

# Account list: whitespace-delimited columns acc, id, postno
idlist = pd.read_table('accidlist.txt', header=0, encoding='gb18030', delim_whitespace=True)
idlist.columns = ['acc', 'id', 'postno']

headers = {
    "Origin": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instragram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

def get_html(url):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        else:
            print('Error fetching page HTML, status code:', response.status_code)
    except Exception as e:
        print(e)
        return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print('Error fetching JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # Back off for 60+ seconds, then retry; recurses until the request succeeds.
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_pics(picurl,picname):
    picresp = requests.get(picurl, headers=headers, timeout=10)
    with open('%s.png'%picname, 'wb') as f:
        f.write(picresp.content)

def nestedlist2csv(rows, out_file):
    # Append mode: the header row was already written by the main crawler.
    with open(out_file, 'a', newline='') as f:
        w = csv.writer(f)
        for row in rows:
            w.writerow(row.values())

def get_date(timestamp):
    local_str_time = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
    return local_str_time

def get_breakpoint(breakpage, user_id, cursor, acc, flag):
    page = breakpage
    while flag:
        samples = []
        url = uri.format(user_id=user_id, cursor=cursor)
        print([user_id, cursor])
        js_data = get_json(headers, url)
        infos = js_data['data']['user']['edge_owner_to_timeline_media']['edges']
        cursor = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        flag = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
        for info in infos:
            sample = {}
            sample["Influencer"] = acc
            timestamp = info['node']['taken_at_timestamp']
            sample["date"] = get_date(timestamp)
            sample["comment_count"] = info['node']['edge_media_to_comment']["count"]
            sample["like_count"] = info['node']['edge_media_preview_like']["count"]

            if info['node']['shortcode']:
                time.sleep(1)
                shortcode = info['node']['shortcode']
                sample['postlink'] = 'https://www.instagram.com/p/%s/' % (shortcode)
                textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                textRespose = get_json(headers, textUrl)

                try:
                    textDict = textRespose['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']
                    sample["caption"] = textDict['text']
                except (IndexError, KeyError, TypeError):
                    sample["caption"] = ""   # post has no caption

                children = textRespose["graphql"]["shortcode_media"].get('edge_sidecar_to_children')
                if children:
                    sample['multipic'] = 'True'
                    picurls = ""
                    for child in children['edges']:
                        picurls = picurls + child['node']['display_url'] + ','
                    sample['img_urls'] = picurls
                else:
                    sample['multipic'] = 'False'
                    sample['img_urls'] = textRespose['graphql']['shortcode_media']['display_url']
                isvideo = textRespose["graphql"]["shortcode_media"].get('is_video')
                if isvideo:
                    sample['video_url'] = textRespose["graphql"]["shortcode_media"].get('video_url')
                    sample['video_view_count'] = textRespose["graphql"]["shortcode_media"].get('video_view_count')
                else:
                    sample['video_url'] = ""
                    sample['video_view_count'] = ""
            samples.append(sample)
            time.sleep(float(random.randint(1, 3)))
        nestedlist2csv(samples, '%s_postlist.csv' % acc)
        page += 1
        print("Finish the %s page of %s" % (page, acc))

if __name__ == '__main__':
    # Resume values, copied by hand from the last print([user_id, cursor])
    # line the crawler logged before it stopped.
    breakpage = 58
    user_id = "89899"
    cursor = "QVFBajF5bVdqV0otYUhfSGJHTFZOdDhULTQ3X19kU0J3ZXd5cXJ2UnNkblNkQW5sU3A0UHFNeU1YbjU1Sm5UZ3pkaUphTC1xZVVyeTRaLXFFdDRyc0lXNw=="
    acc = "oliviermorisse"
    flag = True
    ticks = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("Restart time is %s" % ticks)
    get_breakpoint(breakpage, user_id, cursor, acc, flag)
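
Hand-copying those constants from the log is error-prone. One way to automate it, sketched here with hypothetical save_checkpoint/load_checkpoint helpers that are not part of the original scripts, is to persist the state right after each page is written:

import json

CHECKPOINT = 'checkpoint.json'  # hypothetical per-run state file

def save_checkpoint(acc, user_id, page, cursor):
    # Call immediately after nestedlist2csv() so CSV and checkpoint stay in sync.
    with open(CHECKPOINT, 'w') as f:
        json.dump({'acc': acc, 'id': user_id, 'page': page, 'cursor': cursor}, f)

def load_checkpoint():
    with open(CHECKPOINT) as f:
        return json.load(f)

With this in place, the __main__ block above could read its resume values from load_checkpoint() instead of hard-coding them.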

Downloading All Images
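
Each account's post list already contains every image URL, so downloading is a pure post-processing step: the script walks the saved post lists, splits multi-image posts on the comma-separated img_urls field, and names each file <influencer>_<postindex>[_<j>]_<postid>.jpeg.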

import requests
import pandas as pd
import random
import os
import time
from datetime import datetime


picpath = '/Users/mengjiexu/Googledrive/Influencers_pic/'
postpath = '/Users/mengjiexu/Googledrive/Influencers_post/'

headers = {
    "Origin": "https://www.instagram.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "zh-CN,zh;q=0.8",
    "X-Instragram-AJAX": "1",
    "X-Requested-With": "XMLHttpRequest",
    "Upgrade-Insecure-Requests": "1",
}

def parsepics(filename):
    data = pd.read_excel(postpath + filename)
    influencer = filename.split('_postlist')[0]
    print(influencer)
    inpicpath = '%s%s_pic/'%(picpath,influencer)
    print(inpicpath)
    os.makedirs(inpicpath, exist_ok=True)  # create the per-account folder if missing
    os.chdir(inpicpath)
    postlink = data['postlink']
    piclink = data['img_urls']
    multipic = data['multipic']
    for i in range(len(postlink)):
        postid = postlink[i].split('p/')[-1].split('/')[0]
        postindex = len(postlink) - i
        print('This is post %s of %s' % (i, influencer))
        # multipic is stored as 'True'/'False' in the post list; compare as a
        # string so that a literal 'False' is not treated as truthy.
        if str(multipic[i]) == 'True':
            pics = piclink[i].split(',')[:-1]  # img_urls ends with a trailing comma
            for j in range(len(pics)):
                try:
                    picresp = requests.get(pics[j], headers=headers, timeout=10)
                    with open('%s%s_%s_%s_%s.jpeg' % (inpicpath, influencer, postindex, j, postid), 'wb') as f:
                        f.write(picresp.content)
                    time.sleep(float(random.randint(0, 2)))
                except:
                    pass
        else:
            try:
                picresp = requests.get(piclink[i], headers=headers, timeout=10)
                with open('%s%s_%s_%s.jpeg' % (inpicpath, influencer, postindex, postid), 'wb') as f:
                    f.write(picresp.content)
                time.sleep(float(random.randint(0, 1)))
            except:
                pass


for filename in os.listdir(postpath):
    with open(picpath + 'Processinghistory.txt', 'a') as f:
        # Timestamp each step as it happens.
        f.write('Start to process %s, starting time is %s' % (filename, datetime.now().strftime('%Y-%m-%d %H:%M:%S')) + '\r')
        parsepics(filename)
        f.write('End processing %s, ending time is %s' % (filename, datetime.now().strftime('%Y-%m-%d %H:%M:%S')) + '\r')
        time.sleep(float(random.randint(0, 1)))

Next Steps

  • Run the program on a server or on Colab
  • Crawl all followers of each account

Main Reference

https://blog.csdn.net/qq_27297393/article/details/82915102
