import
requests
from
lxml
import
etree
from
selenium
import
webdriver
from
selenium.webdriver.chrome.options
import
Options
import
time
import
re
from
multiprocessing.dummy
import
Pool
"""
爬取http://www.goubanjia.com/ ip代理网站
此网站的反爬机制有在显示ip的标签中伪造了dispaly:none的误导信息,使用了js来更改端口号
采取的破解策略为使用selenium无头浏览器,然后使用xpath解析过滤掉误导信息
"""
# --- Fetch the rendered listing page with headless Chrome --------------------
# The port numbers are filled in by JavaScript, so a plain requests.get()
# would not see them; Selenium renders the page first.
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
browser = webdriver.Chrome(chrome_options=chrome_options)

url = 'http://www.goubanjia.com'
try:
    browser.get(url)
    # time.sleep(3)  # uncomment if the page needs extra time to render
    page_text = browser.page_source
finally:
    # Always release the Chrome process, even when the fetch raises
    # (the original leaked the browser on error).
    browser.quit()

# Parse the rendered HTML once; the browser is no longer needed.
tree = etree.HTML(page_text)
# Collected results: every scraped proxy, and the ones that pass validation.
ip_list = []
right_list = []

# Plain browser User-Agent for the validation requests made with `requests`.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
}

# Thread pool (multiprocessing.dummy) used later to validate proxies concurrently.
pool = Pool(10)

# Offline-parsing alternatives kept from the original for reference:
# parser = etree.HTMLParser(encoding="utf-8")
# tree = etree.parse("ip.html", parser=parser)  # parse an HTML/XML document into an etree object
# tree = etree.HTML(page_text)  # parse from a string

# Every <tr> of the proxy table; the first row is the header.
tree_list = tree.xpath('//*[@id="services"]/div/div[2]/div/div/div/table//tr')
# --- Parse each table row into a proxy record --------------------------------
# tree_list[0] is the header row, so it is skipped.
for row in tree_list[1:]:
    # Anti-scraping workaround: the IP cell contains decoy nodes styled
    # "display: none".  Keep only the visible pieces: descendants whose
    # style is NOT "display: none", the cell's own text, and the last
    # <span> (the port, which JavaScript rewrites).
    ip = "".join(row.xpath(
        './td[1]//*[@style!="display: none"]/text()'
        ' | ./td[1]/text()'
        ' | ./td[1]/span[last()]/text()'))
    level = "".join(row.xpath('./td[2]//text()'))  # anonymity level
    # Renamed from `type` so the builtin is not shadowed; the record key
    # stays "type" for the callers below.
    proxy_type = "".join(row.xpath('./td[3]//text()'))  # http / https / ...
    address = "".join(row.xpath('./td[4]/a//text()')).replace(" ", "")
    ip_list.append({
        "ip": ip,
        "level": level,
        "type": proxy_type,
        "address": address,
    })
print(ip_list)
# 使用线程池爬取
def
test_ip(dic):
test_url
=
'http://www.baidu.com/s?ie=UTF-8&wd=ip'
try
:
response
=
requests.get(test_url, headers
=
headers, proxies
=
{dic[
"type"
]: dic[
"ip"
]})
tree
=
etree.HTML(response.text)
li
=
tree.xpath(
'//div[@id="1"]/div[1]/div[1]/div[2]/table//tr/td//text()'
)
ip
=
"".join(li).replace(
' '
, '')
if
re.findall(
'[\d\.]+'
, ip)[
0
]
=
=
dic[
"ip"
].split(
":"
)[
0
]:
right_list.append(dic)
except
Exception as e:
print
(e)
# The original sequential loop (one requests.get per proxy, in order) was very
# slow; the thread pool overlaps up to 10 validation requests at a time.
pool.map(test_ip, ip_list)
print(right_list)
# Release the worker threads now that all proxies have been checked.
pool.close()
pool.join()