Scraping images from Toutiao with Python: a Python crawler for Toutiao street-snap (街拍) photos

Latest recommended article published 2024-08-20 14:01:33

weixin_39797264


# Toutiao: street snaps (街拍)
import os
from hashlib import md5
from multiprocessing.pool import Pool
from urllib.parse import urlencode

import requests

# Fetch the JSON returned by one Ajax request, selected by offset
def get_json(offset):
    base_url = 'https://www.toutiao.com/search_content/?'
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis'
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Errors', e.args)
    # Reached on a non-200 status or a connection error
    return None
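To check which URL get_json actually requests, the query string can be built and inspected without sending anything. This is a minimal sketch: build_search_url is a hypothetical helper name, not part of the original script, and it only reuses the base URL and parameters shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://www.toutiao.com/search_content/?'


def build_search_url(offset, keyword='街拍', count=20):
    """Build the search URL for one page (hypothetical helper, same params as get_json)."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    # urlencode percent-encodes the non-ASCII keyword as UTF-8
    return BASE_URL + urlencode(params)


url = build_search_url(20)
query = parse_qs(urlsplit(url).query)
print(url)
print(query['offset'], query['keyword'])
```

Parsing the URL back with parse_qs is a quick way to confirm that the keyword survives the encoding round trip.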

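# Extract each item's title and image links from the JSON
def get_images(json):
    # json is None when the request failed, so check it before reading 'data'
    if json and json.get('data'):
        for item in json.get('data'):
            # Skip entries that carry a cell_type field
            if item.get('cell_type') is not None:
                continue
            title = item.get('title')
            # image_list may be missing; fall back to an empty list
            images = item.get('image_list') or []
            for image in images:
                # Switch to the large image
                url = image.get('url').replace('list', 'large')
                # Add a scheme only when the URL is protocol-relative
                if url.startswith('//'):
                    url = 'https:' + url
                yield {
                    'title': title,
                    'image': url
                }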
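The URL rewrite in get_images can be isolated into a small function, which makes the edge cases easier to see. The sketch below is an assumption-laden variant, not the original logic: to_large_url is a hypothetical name, the example host img.example.com is made up, and it replaces only the first /list/ path segment rather than every occurrence of "list".

```python
def to_large_url(url):
    """Turn a thumbnail URL into a large-image URL (hypothetical helper)."""
    if not url:
        return None
    # Replace only the first '/list/' path segment, leaving other text alone
    url = url.replace('/list/', '/large/', 1)
    # Protocol-relative URLs start with '//' and need an explicit scheme
    if url.startswith('//'):
        url = 'https:' + url
    return url


print(to_large_url('//img.example.com/list/abc'))
print(to_large_url('http://img.example.com/list/abc'))
```

A URL that already carries a scheme passes through unchanged apart from the path segment, which a blanket replace of '//' would not guarantee.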

# Create a folder named after the item's title; name each image by the MD5 of its content to avoid duplicates.
# One catch: a folder name on Windows cannot contain the ASCII ':'. Most titles use Chinese punctuation,
# but a few contain an ASCII ':', which makes folder creation fail. So the ASCII characters that
# Windows forbids (\/:*?"<>|) are converted to corresponding Chinese punctuation.
def save_images(item):
    title = item.get('title')
    intab = r'\/:*?"<>|'
    outtab = '、、：-？“《》-'
    trantab = str.maketrans(intab, outtab)
    # Replace the ASCII characters Windows forbids (\/:*?"<>|) with Chinese punctuation
    title = title.translate(trantab)
    # exist_ok avoids an error when another worker process creates the folder first
    os.makedirs(title, exist_ok=True)
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(title, md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')
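The character mapping used in save_images can be tried on its own, without downloading anything. This is a minimal sketch with the same intab and outtab strings as above; sanitize_title and the 'untitled' fallback for a missing title are assumptions added for illustration.

```python
# Characters Windows forbids in names, and the replacements used in save_images
INTAB = r'\/:*?"<>|'
OUTTAB = '、、：-？“《》-'
TRANTAB = str.maketrans(INTAB, OUTTAB)


def sanitize_title(title):
    """Map forbidden ASCII characters to Chinese punctuation (hypothetical helper)."""
    if not title:
        # Assumed fallback so that a missing title still yields a usable folder name
        return 'untitled'
    return title.translate(TRANTAB)


print(sanitize_title('a:b?c'))
print(sanitize_title('x/y|z'))
```

str.maketrans requires both strings to have the same length, so each forbidden character needs exactly one replacement.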

def main(offset):
    json = get_json(offset)
    for item in get_images(json):
        print(item)
        save_images(item)

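# First page to fetch
GROUP_START = 1
# Last page to fetch
GROUP_END = 5

if __name__ == '__main__':
    pool = Pool()
    offsets = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    # Download images with the map method of the process pool
    # (multiprocessing.pool.Pool starts worker processes, not threads)
    pool.map(main, offsets)
    pool.close()
    pool.join()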
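The script distributes the offsets with multiprocessing.pool.Pool, which starts worker processes. Because downloading is mostly waiting on the network, a thread pool is a reasonable alternative. The sketch below uses the standard library's concurrent.futures.ThreadPoolExecutor; crawl_page is a stand-in for main that only returns its offset, so nothing is downloaded, and max_workers=5 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

GROUP_START = 1
GROUP_END = 5


def crawl_page(offset):
    """Stand-in for main(offset); a real version would fetch and save images."""
    return offset


# Same offsets as the script: one per page, 20 results per page
offsets = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]

# executor.map returns results in the order of the inputs
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(crawl_page, offsets))

print(results)
```

With GROUP_START = 1 the first offset is 20, so the results at offset 0 are skipped; setting GROUP_START = 0 would include them.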
