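Python crawler for Toutiao street-snap photos, development background: analyzing Ajax requests and scraping Toutiao street-snap photos: the crawled detail-page URL does not match what the actual page shows...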

weixin_39729272

Published 2020-12-15 13:23:20


Article tag: Python crawler for Toutiao street-snap photos, development background

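Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.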

Article link: https://blog.csdn.net/weixin_39729272/article/details/111428942


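This post walks through using Python's requests, BeautifulSoup, and regular-expression (re) libraries to scrape index-page and detail-page information from the gallery tab of Toutiao search results. The steps include setting the User-Agent, handling HTTP requests, parsing JSON data, and extracting article URLs and image information.

Summary generated automatically by CSDN.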
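As a small offline illustration of the index-page request, the sketch below only builds the query URL and reads it back, without touching the network. The helper name `build_index_url` is hypothetical and introduced here for illustration; the parameter names and the endpoint are the ones used in the script that follows, and whether that endpoint still answers is not something this sketch checks.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_index_url(offset, keyword, count=20, cur_tab=3):
    """Assemble the search URL for one page of index results.

    Hypothetical helper: it mirrors the parameters used by the
    script's get_page_index, but performs no HTTP request.
    """
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': cur_tab,
    }
    # urlencode percent-encodes non-ASCII values as UTF-8 by default.
    return 'http://www.toutiao.com/search_content/?' + urlencode(params)


url = build_index_url(0, '街拍')
print(url)

# Round-trip the query string to confirm the keyword survives encoding.
query = parse_qs(urlsplit(url).query)
print(query['keyword'])
```

Stepping `offset` by the `count` value (0, 20, 40, ...) is the usual way to page through results with a query like this, though the post itself only requests the first page.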

import json
import re
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

def get_page_index(offset, keyword):
    """Fetch one page of search results (the index page) and return the response text."""
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',  # number of results per request
        'cur_tab': '3'  # 3 is the gallery tab (the third tab); 1 is the general tab
    }
    # urlencode converts the dict into URL query parameters
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the index page')
        return None

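def parse_page_index(html):
    """Parse the index-page JSON and yield each detail-page URL."""
    if not html:  # get_page_index returns None when the request fails
        return
    data = json.loads(html)  # convert the JSON string into a dict
    if data and 'data' in data:  # skip empty responses and ones without a 'data' key
        # data['data'] is a list of up to 20 dicts, one per search result
        for item in data.get('data'):
            url = item.get('article_url')
            if url:  # skip entries that carry no 'article_url'
                yield url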

def get_page_detail(url):
    """Fetch a detail page and return its HTML, or None on failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the detail page')
        return None

def parse_page_detail(html):
    """Parse a detail page and print its title."""
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    # Disabled attempt to pull the gallery data out of the page with a regex:
    '''
    image_pattern = re.compile('gallery: (.*?),\n siblingList',re.S)
    result = re.search(image_pattern,html)
    if result:
        print(result.group(1))
    '''

def main():
    html = get_page_index('0', '街拍')  # '街拍' is the search keyword, "street snap"
    # parse_page_index returns a generator that yields one URL at a time
    for url in parse_page_index(html):
        print(url)
        detail_html = get_page_detail(url)
        if detail_html:
            parse_page_detail(detail_html)


if __name__ == '__main__':
    main()
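The script leaves its gallery-extraction step disabled inside `parse_page_detail`. The sketch below shows one way that step might be completed, run against a made-up HTML fragment rather than a real page. Several things here are assumptions and not taken from the post: that the text captured after `gallery:` is valid JSON, that it holds a `sub_images` list whose items have a `url` key, and the function name `extract_image_urls`. The pattern also uses `\s*` where the original used a literal newline and space, to tolerate different indentation.

```python
import json
import re

# Same idea as the disabled pattern in parse_page_detail, with \s* so the
# match does not depend on the exact whitespace before "siblingList".
GALLERY_PATTERN = re.compile(r'gallery: (.*?),\s*siblingList', re.S)


def extract_image_urls(html):
    """Return image URLs found in the embedded gallery data, or [].

    Assumes the captured text is JSON with a 'sub_images' list of
    dicts that each carry a 'url' key (hypothetical structure).
    """
    match = GALLERY_PATTERN.search(html)
    if not match:
        return []
    try:
        gallery = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []
    if not isinstance(gallery, dict):
        return []
    return [img['url'] for img in gallery.get('sub_images', []) if 'url' in img]


# Made-up fragment for illustration only; a real page may look different.
sample = '''
var BASE_DATA = {
    gallery: {"count": 2, "sub_images": [{"url": "http://example.com/a.jpg"}, {"url": "http://example.com/b.jpg"}]},
    siblingList: []
};
'''
print(extract_image_urls(sample))
```

Returning an empty list when the pattern or the JSON does not match keeps the caller's loop simple: a page without gallery data is skipped instead of raising.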
