Crawler: Scraping Instagram Data


This is an introduction to an Instagram crawler.

GitHub source reference (code and scraped data): https://github.com/hilqiqi0/crawler/tree/master/simple/instagram

 

Each scraped record is saved in the following format: { image URL, comment count, like count, post text }
e.g.:
{
        "img_url": "https://scontent-sin6-2.cdninstagram.com/vp/0e345bfd870f2fb489f091ed5507397f/5C1A8CB6/t51.2885-15/e35/40949123_1104283529724860_6046749716819964824_n.jpg",
        "comment_count": 12932,
        "like_count": 1321753,
        "text": "Featured photo by @maomay__\\nWeekend Hashtag Project: #WHPperspective\\nThis weekend, the goal is to take photos and videos from a different point of view, as in this featured photo by Mao May (@maomay__). Here are some tips to get you started:\\nCapture a familiar subject or scene from an unexpected angle. Get up close and let a face cover the entire frame, or make a puppy look large by shooting from ground-level as she stares down. Find a high vantage point to show the wider context of a festival scene or bustling market.\\nUse geometry to your advantage. Look for graphic lines — in bridges or telephone wires — that converge to a vanishing point in your composition. Find a new way to capture patterns in everyday places, like the wheels of bicycles lined up in a rack, or symmetrical bricks in an unruly garden.\\nPlay an eye trick. Defy gravity with simple editing, like rotating the frame. Recruit a friend to make a well-timed leap, that, when rotated, looks like they’re flying through air. Or turn a dandelion into a human-size parasol by playing with scale and distance.\\n\\nPROJECT RULES: Please add the #WHPperspective hashtag only to photos and videos shared over this weekend and only submit your own visuals to the project. If you include music in your video submissions, please only use music to which you own the rights. Any tagged photo or video shared over the weekend is eligible to be featured next week."
    }

Summary of technical difficulties: 1. Access requires a proxy/VPN (Instagram is blocked in mainland China); 2. until around August/September Instagram had no anti-scraping protection, but since then its AJAX requests require an anti-scraping signature.

The anti-scraping signature (requests must add an 'X-Instagram-GIS' header):
        1. concatenate rhx_gis and queryVariables;
        2. MD5-hash the result.
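
As a concrete illustration, here is a minimal sketch of that signature computation; the rhx_gis and id values below are hypothetical placeholders, while the real ones are extracted from the page (as the full script further down does):

```python
import hashlib

# Hypothetical placeholder values: the real rhx_gis comes from the page's
# window._sharedData, and queryVariables is the exact JSON string that is
# sent with the GraphQL request.
rhx_gis = 'c9d0c7ae048f1ae958f67ab9ea966f76'
queryVariables = '{"id":"12345678","first":12,"after":"SOME_CURSOR"}'

# X-Instagram-GIS is the MD5 hex digest of "rhx_gis:queryVariables".
signature = hashlib.md5((rhx_gis + ':' + queryVariables).encode('utf-8')).hexdigest()
# The request then sends: headers['X-Instagram-GIS'] = signature
```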

 

Code notes and tweaks: 0. By default only 120 posts are downloaded; to fetch more, delete the count check or raise the threshold (the tweak points are sketched right after this list).
       1. The code assumes the Lantern proxy, listening on port 52212; if you use a different proxy/VPN tool, change the proxy port accordingly.
       2. The code scrapes the profile of the user instagram on https://www.instagram.com; to scrape a different user, change the username.
       3. This is test code only; it has not yet been modularized or packaged.
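
For orientation, the three tweak points above map onto these spots in the script below (a sketch; MAX_SAMPLES is an illustrative name, the script itself hard-codes the 120 in `len(samples) > 120`):

```python
# 1. Proxy port: point both schemes at your own local proxy.
proxy = {
    'http': 'http://127.0.0.1:52212',
    'https': 'http://127.0.0.1:52212'
}

# 2. Target account: replace 'instagram' with any other username.
url_base = 'https://www.instagram.com/instagram/'

# 3. Download threshold: the crawl stops once more than this many posts
#    have been collected (illustrative constant; see the check in the loop).
MAX_SAMPLES = 120
```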

 

On the workflow and analysis: 1. see the references at the end of this article; 2. or read through the code directly.

import re
import json
import time
import random
import requests
from pyquery import PyQuery as pq
import hashlib

url_base = 'https://www.instagram.com/instagram/'
# GraphQL endpoint for paginated timeline media; `variables` is the URL-encoded
# JSON {"id":"<user_id>","first":12,"after":"<cursor>"}
uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

headers = {
    'Connection': 'keep-alive',
    'Host': 'www.instagram.com',
    'Referer': 'https://www.instagram.com/instagram/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# local proxy (Lantern listens on 52212 by default; adjust for your own tool)
proxy = {
    'http': 'http://127.0.0.1:52212',
    'https': 'http://127.0.0.1:52212'
}

def hashStr(strInfo):
    # MD5-hash a string; used to compute the X-Instagram-GIS header value
    h = hashlib.md5()
    h.update(strInfo.encode("utf-8"))
    return h.hexdigest()

def get_html(url):
    # fetch the profile page HTML through the proxy
    try:
        response = requests.get(url, headers=headers, proxies=proxy)
        if response.status_code == 200:
            return response.text
        else:
            print('Error requesting page source, status code:', response.status_code)
    except Exception as e:
        print(e)
        return None

def get_json(headers, url):
    # fetch a JSON endpoint; on failure, back off for 60+ seconds and retry
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print('Error requesting JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)

def get_samples(html):
    samples = []
    # the user id and rhx_gis token are embedded in the profile page source
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    GIS_rhx_gis = re.findall('"rhx_gis":"([0-9a-z]+)"', html, re.S)[0]

    print('user_id:' + user_id)
    print(GIS_rhx_gis)
    doc = pq(html)
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            # parse the contents of window._sharedData into a dict
            # (the `encoding` kwarg of json.loads was removed in Python 3.9)
            js_data = json.loads(item.text()[21:-1])
            
            # the 12 posts embedded in the initial page
            edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]
            # pagination info for the page
            page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]['page_info']
            # cursor pointing at the next page, e.g. AQCSnXw1JsoV6LPOD2Of6qQUY7HWyXRc_CBSMWB6WvKlseC-7ibKho3Em0PEG7_EP8vwoXw5zwzsAv_mNMR8yX2uGFZ5j6YXdyoFfdbHc6942w
            cursor = page_info['end_cursor']
            # whether another page exists
            flag = page_info['has_next_page']

            # walk the post nodes
            for edge in edges:

                # skip videos (is_video is a boolean, so test truthiness, not the string "true")
                if edge['node']['is_video']:
                    continue

                time.sleep(1)
                # extract the image fields
                sample = {}
                if edge['node']['display_url']:
                    display_url = edge['node']['display_url']
#                    print(display_url)
                    sample["img_url"] = display_url
                    sample["comment_count"] = edge['node']['edge_media_to_comment']["count"]
                    sample["like_count"] = edge['node']['edge_liked_by']["count"] 
                    print(sample["img_url"])
                    print(sample["comment_count"])
                    print(sample["like_count"])
                                                            
                if edge['node']['shortcode']:
                    shortcode = edge['node']['shortcode']
                    # https://www.instagram.com/p/{shortcode}/?__a=1
                    textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                    textRespose = get_json(headers,textUrl)
#                    print(textRespose)
#                    print(type(textRespose))    
                    textDict = textRespose['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']
                    # read the caption field directly instead of slicing str(textDict)
                    sample["text"] = textDict['text']
                    print(sample["text"])
                    
                samples.append(sample)
                
            print(cursor, flag)
            
    # fetch more posts via AJAX requests
    while flag:
        url = uri.format(user_id=user_id, cursor=cursor)
        print(url)
        queryVariables = '{"id":"' + user_id + '","first":12,"after":"' +cursor+ '"}'
        print(queryVariables)
        headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ":" + queryVariables)
        print(headers)
        js_data = get_json(headers,url)
#        print(js_data)
        infos = js_data['data']['user']['edge_owner_to_timeline_media']['edges']
        cursor = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        flag = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
        
#        print(infos)
        for info in infos:
            if info['node']['is_video']:
                continue
            else:
                sample = {}
                display_url = info['node']['display_url']
#                print(display_url)
                sample["img_url"] = display_url
                sample["comment_count"] = info['node']['edge_media_to_comment']["count"]
                sample["like_count"] = info['node']['edge_media_preview_like']["count"]                    
                                                        
                if info['node']['shortcode']:
                    time.sleep(1)
                    shortcode = info['node']['shortcode']
                    # https://www.instagram.com/p/{shortcode}/?__a=1
                    textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                    textRespose = get_json(headers,textUrl)
#                    print(textRespose)
#                    print(type(textRespose))    
                    textDict = textRespose['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']
                    # read the caption field directly instead of slicing str(textDict)
                    sample["text"] = textDict['text']
                                        
                print(sample["img_url"])
                print(sample["comment_count"])
                print(sample["like_count"])  
                print(sample["text"])
                samples.append(sample)
                
        print(cursor, flag)
        
        # stop and return once more than 120 posts have been collected
        if len(samples) > 120:
            return samples

    return samples

def main():
    url = url_base
    html = get_html(url)
    samples = get_samples(html)
#    print(samples)
    with open("./samples.txt","a",encoding='utf-8') as f:
        f.write(str(samples))

if __name__ == '__main__':
    start = time.time()
    main()
    print('elapsed: %.2f s' % (time.time() - start))
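
One caveat on persistence: main() appends str(samples) to samples.txt, which writes a Python repr rather than valid JSON. A minimal alternative sketch (save_samples is an illustrative helper, not part of the original script) that writes one JSON object per line:

```python
import json

def save_samples(samples, path="./samples.jsonl"):
    # One JSON object per line (JSON Lines); unlike str(samples), each
    # line can be loaded back with json.loads().
    with open(path, "a", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```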

 

Reference 1: https://www.jianshu.com/p/985c2b4e8f6c

Reference 2: https://blog.csdn.net/geng333abc/article/details/79403395

 
