Scraping Tieba with Python: Downloading the Images in a Baidu Tieba Thread

weixin_39862985

Published 2020-12-19 14:11:56


Tags: python, Tieba scraping

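Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Reposts must include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39862985/article/details/111529837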


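Summary (auto-generated by CSDN): This article shows how to use Python's Scrapy framework to download the images in a specific Baidu Tieba thread. It creates a Scrapy project, defines a spider that extracts the image URLs, downloads the images with ImagesPipeline, and configures the settings so the files are saved locally. Finally it runs the spider, which fetches and stores the images.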

# I saw a Tieba veteran posting pictures and decided to grab them

# This only scrapes the images from a single thread

1. First, create a new Scrapy project

scrapy startproject TuBaEx

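2. Create a spider (run this inside the project directory; pass the domain here, the thread URL itself goes into the spider in step 4)

scrapy genspider tubaex tieba.baidu.com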

3. Define the item in items.py

import scrapy

class TubaexItem(scrapy.Item):
    # URL of the image to download
    img_url = scrapy.Field()

4. Write the spider

# -*- coding: utf-8 -*-
import scrapy

from TuBaEx.items import TubaexItem


class TubaexSpider(scrapy.Spider):
    name = "tubaex"
    # allowed_domains takes bare domain names, not full URLs
    # allowed_domains = ["tieba.baidu.com"]

    # Base URL; the page number is appended to it for pagination
    baseURL = "https://tieba.baidu.com/p/4092816277?pn="
    # Current page number, starting at the first page
    offset = 1
    # First page to crawl
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Read the number of the last page from the thread header
        end_page = response.xpath("//div[@id='thread_theme_5']/div/ul/li[2]/span[2]/text()").extract()
        # The class name of the images was found with the browser's element inspector
        img_list = response.xpath("//img[@class='BDE_Image']/@src").extract()
        for img in img_list:
            item = TubaexItem()
            item['img_url'] = img
            yield item
        # Follow the next page; xpath returns a list, so use its first element
        if end_page and self.offset < int(end_page[0]):
            self.offset += 1
            yield scrapy.Request(self.baseURL + str(self.offset), callback=self.parse)
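The two ideas in the spider, selecting images by class with XPath and building the next page URL, can be tried out without Scrapy or a network connection. The following is only a rough sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` (which supports a small XPath subset) instead of Scrapy selectors, `SAMPLE_PAGE` is a made-up, well-formed snippet rather than real Tieba markup, and the helper names `extract_image_urls` and `next_page_url` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one page of a thread (not real Tieba markup).
SAMPLE_PAGE = """
<html><body>
  <div class="post">
    <img class="BDE_Image" src="https://example.com/a.jpg" />
    <img class="BDE_Smiley" src="https://example.com/smiley.png" />
    <img class="BDE_Image" src="https://example.com/b.jpg" />
  </div>
</body></html>
""".strip()

# Same base URL as in the spider; the page number is appended to it.
BASE_URL = "https://tieba.baidu.com/p/4092816277?pn="


def extract_image_urls(page_source):
    """Return the src of every <img> whose class attribute is exactly BDE_Image."""
    root = ET.fromstring(page_source)
    return [img.get("src") for img in root.findall(".//img[@class='BDE_Image']")]


def next_page_url(current_page, last_page):
    """Return the URL of the following page, or None when already on the last page."""
    if current_page < last_page:
        return BASE_URL + str(current_page + 1)
    return None


print(extract_image_urls(SAMPLE_PAGE))
print(next_page_url(1, 3))
print(next_page_url(3, 3))
```

Real pages are rarely well-formed XML, which is one reason the spider relies on Scrapy's own selectors rather than on ElementTree.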

5. Use ImagesPipeline. There is not much to say about this one, and I do not fully understand it myself

# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class TubaexPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule a download request for the item's image URL
        img_link = item['img_url']
        yield scrapy.Request(img_link)

    def item_completed(self, results, item, info):
        # The files are stored under IMAGES_STORE from settings; pass the item on unchanged
        return item
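One thing that may help when looking for the downloaded files: as far as I know, ImagesPipeline by default does not keep the original file name but stores each image under `full/` using the SHA-1 hash of its URL. The sketch below only imitates that naming scheme with the standard library; `default_image_path` is a hypothetical helper, not part of Scrapy, and the URL is made up.

```python
import hashlib


def default_image_path(url):
    """Imitate the default naming scheme: full/<SHA-1 hex digest of the URL>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/" + digest + ".jpg"


# Hypothetical image URL, used only for illustration.
print(default_image_path("https://example.com/a.jpg"))
```

Because the name depends only on the URL, the same image URL always maps to the same file under IMAGES_STORE.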

6. Configure settings

IMAGES_STORE = 'C:/Users/ll/Desktop/py/TuBaEx/Images/'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TuBaEx (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
    'TuBaEx.pipelines.TubaexPipeline': 300,
}
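The IMAGES_STORE path above is specific to one machine. A possible alternative, shown here only as a sketch, is to derive the directory from the location of settings.py so the project works wherever it is checked out. The helper `images_store_for` and the `Images` folder placement are assumptions for illustration, not something the original setup uses.

```python
from pathlib import Path


def images_store_for(settings_file):
    """Return the path of an Images/ directory next to the given settings.py."""
    return str(Path(settings_file).resolve().parent / "Images")


# In settings.py this could be used as:
# IMAGES_STORE = images_store_for(__file__)
print(images_store_for("TuBaEx/settings.py"))
```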

7. Run it

scrapy crawl tubaex

8. Enjoy the results

Original article: http://www.cnblogs.com/lljh/p/7341080.html
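To check what the crawl actually produced, one option is to count the image files below the IMAGES_STORE directory. This is a minimal sketch using only the standard library; `count_images` and the extension list are hypothetical, and the demo runs against a temporary directory rather than the real download folder.

```python
import os
import tempfile

# Extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def count_images(store_dir):
    """Count files with an image extension anywhere below store_dir."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(store_dir):
        total += sum(1 for name in filenames if name.lower().endswith(IMAGE_EXTENSIONS))
    return total


# Demo on a temporary directory that stands in for IMAGES_STORE.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "full"))
    for name in ("a.jpg", "b.jpg", "notes.txt"):
        open(os.path.join(tmp, "full", name), "w").close()
    print(count_images(tmp))
```

For the real project, pass the IMAGES_STORE path from settings to `count_images`.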
