Analyzing Ajax Requests to Scrape Toutiao Street-Snap Data

WindSearcher

Category: python

Article link: https://blog.csdn.net/qq_40511966/article/details/100136524

The goal is to scrape the data behind this link: https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D

1. First, send a request and see what the server returns

import requests


def get_html(url):
    response = requests.get(url)
    print(response.text)

if __name__ == '__main__':
    url = 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D'
    get_html(url)

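You will find that all you get back is JavaScript code. Why? Where did the data go? At this point you should suspect that the page uses Ajax to load and fill in its data dynamically: after the page finishes loading, it sends an Ajax request, and the server returns JSON that is used to populate the page. That is why a plain request like this one cannot get the data. So how do we get it?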
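One rough way to confirm that suspicion before opening the developer tools is to check whether the text you expect appears anywhere in the markup outside of `<script>` tags. The sketch below is only a heuristic, and the helper names `strip_scripts` and `probably_ajax_rendered` are made up for illustration; they are not part of the original code.

```python
import re

# Matches a whole <script>...</script> element, including its body.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.S | re.I)


def strip_scripts(html):
    """Return the HTML with every <script> element removed."""
    return SCRIPT_RE.sub("", html)


def probably_ajax_rendered(html, expected_text):
    """Heuristic: the expected text is missing from the non-script markup,
    which suggests the page fills it in later with Ajax."""
    return expected_text not in strip_scripts(html)


if __name__ == "__main__":
    shell = '<html><body><div id="app"></div><script>var q="街拍";</script></body></html>'
    print(probably_ajax_rendered(shell, "街拍"))
```

If this returns `True` for the text returned by `get_html`, the data is most likely arriving through a separate request, which is what the Network panel is used to find next.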

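First open the page you want to scrape, then open the browser's developer tools and click the Network tab, as shown in the figure.

Here we can analyze the requests. Start by clicking the first request, then open its Response tab, as shown in the figure.

The JSON it returns turns out to be exactly the data we need. Good. Next we will try sending a request to this API ourselves. Click the Headers tab to see the API address: https://www.toutiao.com/api/search/content/?aid=24

Of course, we need to build the corresponding parameters before we can send the request, so open the Params tab to see which ones are required.

Now let's build the request parameters.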

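import time

# Query parameters copied from the request shown in the Network panel
data = {
    'aid': 24,
    'app_name': 'web_search',
    'offset': 20,
    'format': 'json',
    'keyword': '街拍',
    'autoload': 'true',
    'count': 20,
    'en_qc': 1,
    'cur_tab': 1,
    'from': 'search_tab',
    'pd': 'synthesis',
    # Current time in milliseconds
    'timestamp': int(time.time() * 1000)
}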
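To see what these parameters look like once they are encoded into a URL, here is a small sketch using only the standard library. The function name `build_search_url` is hypothetical, and treating `offset` as a paging cursor that advances by `count` is an assumption based on the values seen in the browser, not something the site documents.

```python
import time
from urllib.parse import urlencode

API_URL = "https://www.toutiao.com/api/search/content/"


def build_search_url(keyword, offset=0, count=20):
    """Build the search API URL for one page of results.

    Assumption: `offset` selects the page and grows in steps of `count`.
    """
    params = {
        "aid": 24,
        "app_name": "web_search",
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": count,
        "en_qc": 1,
        "cur_tab": 1,
        "from": "search_tab",
        "pd": "synthesis",
        "timestamp": int(time.time() * 1000),  # milliseconds
    }
    # urlencode percent-encodes the keyword as UTF-8
    return API_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for offset in (0, 20, 40):
        print(build_search_url("街拍", offset=offset))
```

The encoded keyword comes out as `%E8%A1%97%E6%8B%8D`, the same form that appears in the search page URL at the top of the article.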

import json
import time
from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException


def get_page_index():
    data = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': 20,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time() * 1000)
    }

    # urlencode turns the dict into a URL query string
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(data)
    try:
        print(url)
        response = requests.get(url)
        # Let requests guess the encoding from the body to avoid garbled text
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed')
        return None


def parse_page_index(html):
    # Nothing to parse if the request failed
    if html is None:
        print('No response to parse')
        return
    data = json.loads(html)
    print(data)


def get_html(url):
    response = requests.get(url)
    # Let requests guess the encoding from the body to avoid garbled text
    response.encoding = response.apparent_encoding
    print(response.text)


if __name__ == '__main__':
    url = 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D'
    html = get_page_index()
    parse_page_index(html)

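Think you can get the data now? Run the code above and the result is as follows:

No data comes back. Did you think Toutiao had no anti-scraping measures? It checks your request headers, sees that the request came from Python rather than a browser, and of course refuses to return data. It also seems that a Cookie needs to be included in the request headers; you can build yours from the request headers shown in the browser.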
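When copying headers from the browser, the Cookie arrives as one long `name=value; name=value` string, and it is easy to paste it with stray spaces or repeated entries. The sketch below shows one way to tidy it up with the standard library only. The helper names `cookie_header_to_dict` and `build_headers` are hypothetical, and the exact set of headers the server checks is an assumption.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a raw Cookie header value into a {name: value} dict.

    Repeated names collapse to the last value seen.
    """
    cookies = {}
    for part in raw_cookie.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def build_headers(user_agent, raw_cookie):
    """Build browser-like headers for the Ajax request."""
    cookies = cookie_header_to_dict(raw_cookie)
    return {
        "User-Agent": user_agent,
        # Value that browsers' XMLHttpRequest-based libraries commonly send
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": "; ".join(f"{k}={v}" for k, v in cookies.items()),
    }


if __name__ == "__main__":
    raw = "tt_webid=123; csrftoken=abc; tt_webid=123"
    print(build_headers("Mozilla/5.0", raw))
```

The resulting dict can be passed as the `headers` argument of `requests.get`. Alternatively, the dict from `cookie_header_to_dict` can be passed as its `cookies` argument, with the `Cookie` entry left out of the headers.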

import json
import time
from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException


def get_page_index():
    # Headers copied from the browser's request.
    # Replace the Cookie with the one from your own browser session.
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'tt_webid=6730223250619663876; WEATHER_CITY=%E5%8C%97%E4%BA%AC; __tasessionId=zg6odfpo71567002221051; tt_webid=6730223250619663876; csrftoken=8dc53ba17517e71021fdc2d65123a3d8; s_v_web_id=09d77accf1eeb81d903bdbee56ffe9d0',
        # Marks the request as an Ajax request
        'X-Requested-With': 'XMLHttpRequest',
        # The header name must be exactly 'User-Agent', with no spaces
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Firefox/68.0'
    }
    data = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': 20,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(time.time() * 1000)
    }

    # urlencode turns the dict into a URL query string
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(data)
    try:
        print(url)
        response = requests.get(url, headers=headers)
        # Let requests guess the encoding from the body to avoid garbled text
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed')
        return None


def parse_page_index(html):
    # Nothing to parse if the request failed
    if html is None:
        print('No response to parse')
        return
    data = json.loads(html)
    print(data)


def get_html(url):
    response = requests.get(url)
    # Let requests guess the encoding from the body to avoid garbled text
    response.encoding = response.apparent_encoding
    print(response.text)


if __name__ == '__main__':
    url = 'https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D'
    # get_html(url)
    html = get_page_index()
    parse_page_index(html)

Only now can we actually obtain the data that the page loads dynamically through Ajax.
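Printing the whole JSON response is hard to read, so a natural next step is to pull out just the fields of interest. The sketch below assumes the response is an object with a `data` list whose items may carry `title` and `article_url` keys. Those field names are assumptions for illustration, so check them against the Response tab in your own browser. The function name `extract_articles` is also hypothetical.

```python
import json


def extract_articles(payload_text):
    """Return a list of (title, url) pairs from the JSON text.

    Assumed layout: {"data": [{"title": ..., "article_url": ...}, ...]}.
    Items without a title are skipped.
    """
    if not payload_text:
        return []
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    results = []
    for item in payload.get("data") or []:
        if isinstance(item, dict) and item.get("title"):
            results.append((item["title"], item.get("article_url")))
    return results


if __name__ == "__main__":
    sample = '{"data": [{"title": "A", "article_url": "https://example.com/a"}, {"ad": 1}]}'
    for title, url in extract_articles(sample):
        print(title, url)
```

Returning an empty list for a missing or malformed response keeps the calling code simple: a loop over the result just does nothing when the server declines to return data.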
