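Author's note: when I set out to scrape the view count of a Hupu thread, I assumed it was ordinary dynamically loaded content. To my surprise, even selenium could not retrieve it. Only later did I find out that the count is loaded asynchronously through a separate Ajax request.
Once that secret was out, the only remaining question was what the field after tid in the request URL is and how to obtain it. (My guess was a timestamp, and that is exactly what it turned out to be.)
PS: I have scraped a similar site before with the same recipe; see the earlier post "腾讯招聘网页的爬取" (scraping the Tencent recruitment site).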
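As a minimal sketch of the idea above, the asynchronous URL can be assembled from a thread id and the current time. The helper name `build_hit_url` and the `now` parameter are assumptions introduced for illustration; only the endpoint and the two query fields (`tid` and `_`) come from the request URL shown in this post.

```python
import time
from urllib.parse import urlencode

# Endpoint taken from the request URL shown in this post.
HIT_ENDPOINT = "https://msa.hupu.com/thread_hit"


def build_hit_url(tid, now=None):
    """Return the view-count URL for a thread id (hypothetical helper)."""
    if now is None:
        now = time.time()
    # Whole seconds scaled to milliseconds, mirroring the script in this post.
    params = {"tid": tid, "_": int(now) * 1000}
    return HIT_ENDPOINT + "?" + urlencode(params)


# A fixed time is passed here so the result is reproducible.
print(build_hit_url("39096961", now=1605005211))
# → https://msa.hupu.com/thread_hit?tid=39096961&_=1605005211000
```

Passing `now` explicitly is only there to make the function easy to check; in normal use it can be left out so the current time is used.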
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 10 17:54:04 2020
@author: Yuka
"""
import requests
import time
from lxml import etree
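# Example of the asynchronous request URL (for reference only; it is rebuilt below)
url = 'https://msa.hupu.com/thread_hit?tid=39096961&_=1605002160898'
# Thread page from which tid is read
url1 = 'https://bbs.hupu.com/39096961.html'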
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
}
# Fetch the thread page and read tid from the attribute on its <h1> element
res = requests.get(url=url1, headers=headers).text
html = etree.HTML(res)
tid = html.xpath('//div[@class="bbs-hd-h1"]/h1/@tid')[0]
# Convert the current time (whole seconds) to a millisecond timestamp
timestamp = int(time.time()) * 1000
# Build the asynchronous request URL from tid and the timestamp
url_real = "https://msa.hupu.com/thread_hit?tid={}&_={}".format(tid, timestamp)
print(url_real)
look_record = requests.get(url=url_real,headers=headers).text
print(look_record)
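The script above reads tid from an attribute on the page's `<h1>` element. For the thread used here, the same number also appears in the thread URL itself, so a hedged alternative is to take it from the URL without downloading the page. This is only a sketch: `tid_from_thread_url` is a hypothetical helper, and it assumes thread URLs always follow the `/<digits>.html` pattern seen in this post, which has not been checked for other threads.

```python
import re


def tid_from_thread_url(thread_url):
    """Extract the numeric id from a URL like https://bbs.hupu.com/39096961.html."""
    # Assumption: the id is the run of digits right before ".html".
    match = re.search(r"/(\d+)\.html", thread_url)
    if match is None:
        raise ValueError("no thread id found in URL: {!r}".format(thread_url))
    return match.group(1)


print(tid_from_thread_url("https://bbs.hupu.com/39096961.html"))
# → 39096961
```

If the pattern does not hold for some thread, the function raises `ValueError`, and reading the attribute from the page as in the script remains the fallback.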
Final result:
https://msa.hupu.com/thread_hit?tid=39096961&_=1605005211000
913072
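The response body shown above is the view count as plain text. A small sketch of turning it into an integer follows; `parse_view_count` is a hypothetical helper, and it assumes the endpoint always answers with bare digits as in this run, so anything else is treated as an error instead of being guessed at.

```python
import re


def parse_view_count(body):
    """Convert a response body such as '913072' into an int."""
    text = body.strip()
    # Assumption: a valid answer consists of ASCII digits only.
    if re.fullmatch(r"[0-9]+", text) is None:
        raise ValueError("unexpected response body: {!r}".format(text))
    return int(text)


print(parse_view_count("913072\n"))
# → 913072
```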