Scraping Xici proxy IPs + checking whether they work + storing them in MongoDB

Latest recommended article published 2020-02-17 20:47:19

星空永恒&&卡利达

Category: python-爬虫 (Python web crawlers)

Article link: https://blog.csdn.net/qq_24683561/article/details/53980931

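This post describes a Scrapy crawler project that scrapes proxy IPs from the Xici site (xicidaili.com), tests each one with the requests library, and stores the proxies that pass the test in a MongoDB database. The code consists of the spider file, which fetches and validates the IPs; items.py, which defines the data structure; pipeline.py, which processes the data and writes it to MongoDB; and settings.py, which configures the crawler.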

(Summary generated automatically by CSDN.)
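The summary mentions a pipeline that writes validated proxies to MongoDB, but that file is not shown in this post. The following is only a minimal sketch of what such a pipeline might look like, not the author's code. The class name `MongoProxyPipeline`, the setting names `MONGO_URI` and `MONGO_DATABASE`, the collection name `proxies`, and the item fields `ip` and `port` are all assumptions made for illustration.

```python
class MongoProxyPipeline:
    """Store each validated proxy in a MongoDB collection (illustrative sketch)."""

    collection_name = "proxies"  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed setting names in settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "proxy_pool"),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)  # works for plain dicts and scrapy.Item instances
        # Upsert on ip + port so a proxy seen again does not create a duplicate.
        self.collection.update_one(
            {"ip": doc["ip"], "port": doc["port"]},
            {"$set": doc},
            upsert=True,
        )
        return item
```

To enable a pipeline like this, settings.py would need an `ITEM_PIPELINES` entry pointing at the class, for example `{"myproject.pipelines.MongoProxyPipeline": 300}`, where `myproject` is a placeholder for the real project package. Upserting rather than inserting is a design choice here: proxy lists are re-crawled often, and the same address is likely to appear more than once.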

Code for the spider file:
import scrapy
import requests  # used to test whether a scraped IP is usable


class XiciSpider(scrapy.Spider):
    name = "xici"
    allowed_domains = ["xicidaili.com"]

    def start_requests(self):
        urls = [
            "http://www.xicidaili.com/nn/1/",
            "http://www.xicidaili.com/nn/2",
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Locate the table that holds the IP list.
        table = response.xpath("//table[@id='ip_list']")[0]
        # Skip the header row (country, IP address, port, server location,
        # anonymity, type, speed, connection time, uptime, last verified).
        # The leading "." keeps the query relative to this table; a bare
        # "//tr" would match every row in the whole document.
        trs = table.xpath(".//tr")[1:]
        for tr in trs:
            pagetest = "http://www.baidu.com.cn/"  # page used for the test
            ip = tr.xpath("td[2]/text()").extract()[0]
            port = tr.xpath("td[3]/text()").extract()[0]
            PROXY = "http://" + ip + ":" + port
            proxies = {
                "http": PROXY
            }
            try:
                response = requests.
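The listing above breaks off at the `requests.` call, so the author's actual check is not visible. As a hedged illustration of the general idea (send a request through the proxy and treat any error or non-200 status as a failure), here is a minimal sketch. The helper names `build_proxies` and `proxy_is_usable`, the 3-second timeout, and the injectable `get` parameter are assumptions for illustration and do not come from the original code.

```python
def build_proxies(ip, port):
    """Return a requests-style proxies mapping for an HTTP proxy."""
    proxy_url = "http://{}:{}".format(ip.strip(), port.strip())
    return {"http": proxy_url}


def proxy_is_usable(proxies, test_url, timeout=3, get=None):
    """Return True if test_url answers with HTTP 200 through the proxy.

    `get` defaults to requests.get; it is a parameter only so that the
    check can be exercised without network access.
    """
    if get is None:
        import requests  # third-party; the same library the spider imports

        get = requests.get
    try:
        resp = get(test_url, proxies=proxies, timeout=timeout)
    except Exception:
        # requests raises RequestException subclasses for refused
        # connections, timeouts and proxy errors; a broad except keeps
        # this sketch short and treats every failure as "not usable".
        return False
    return resp.status_code == 200
```

Inside `parse`, a spider could call `proxy_is_usable(build_proxies(ip, port), pagetest)` and yield an item only when it returns True. Note that `requests` is blocking, so each check stalls Scrapy's event loop for up to `timeout` seconds; that is acceptable for a couple of pages but becomes slow for larger crawls.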
