Target URL:
https://www.xicidaili.com/nn/1
Written and run in VS Code.
Fields extracted with XPath: ip, port, local, hidden, kind, check_time
The program works. It was just that, after too many requests, my machine could no longer reach the Xici proxy site at all. If you're interested, copy the code to your own machine and try it; it should run fine.
Still, please go easy on the site, folks!
On to the code:
Required libraries:
import requests
from lxml import etree
import csv
Fetching a page:
def get_page(url):
    # Pretend to be an ordinary browser so the site is less likely to reject us
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3676.400 QQBrowser/10.4.3469.400"}
    resp = requests.get(url=url, headers=header)
    html = resp.text
    # print(html)
    return html
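Since going easy on the site is the whole point, a gentler variant of this function might add a timeout, an HTTP status check, and a fixed pause between requests. This is only a sketch of my own; the function name, the delay length, and the shortened User-Agent are assumptions, not part of the original script:

import time
import requests

def get_page_politely(url, delay=2.0):
    # Hypothetical helper (not in the original post): same idea as get_page,
    # but with a timeout, a status check, and a pause to throttle requests.
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}
    resp = requests.get(url, headers=header, timeout=10)
    resp.raise_for_status()   # fail loudly on 4xx/5xx instead of parsing an error page
    time.sleep(delay)         # one request every `delay` seconds
    return resp.text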
Parsing the page and extracting the data:
def parse_page(url):
    source = get_page(url)
    item = etree.HTML(source)
    # Each proxy is one <tr> inside the table with id="ip_list"; row 0 is the header
    trs = item.xpath('//*[@id="ip_list"]//tr')
    for tr in trs[1:]:
        ip = tr.xpath('./td[2]/text()')[0]
        port = tr.xpath('./td[3]/text()')[0]
        # The location cell is sometimes empty, so fall back to 'null'
        local = tr.xpath('./td[4]/a/text()')
        if local:
            local = local[0]
        else:
            local = 'null'
        hidden = tr.xpath('./td[5]/text()')[0]       # anonymity level
        kind = tr.xpath('./td[6]/text()')[0]         # HTTP / HTTPS
        check_time = tr.xpath('./td[10]/text()')[0]  # last verification time
        info = [ip, port, local, hidden, kind, check_time]
        save(info)
        # print(ip, port, local, hidden, kind, check_time)
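Note that only local has a fallback for empty cells; the other fields index [0] directly and will raise IndexError on a malformed row. If you want the same guard everywhere, a small helper could do it. This is purely my own sketch; first_or_null is not part of the original code:

def first_or_null(tr, path):
    # Hypothetical helper (not in the original): return the first XPath
    # match, or 'null' when the cell is empty or missing.
    result = tr.xpath(path)
    return result[0] if result else 'null'

# Usage inside the loop, e.g.:
#   ip = first_or_null(tr, './td[2]/text()')
#   local = first_or_null(tr, './td[4]/a/text()')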
Saving the data to CSV:
def save(info):
    # newline='' stops csv from writing blank lines between rows on Windows
    with open('xici_prox.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(info)
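Because the file is opened in append mode, running the spider twice simply keeps adding rows. If you'd rather have a header row written once, here is one possible variant; the file-existence check and the function name are my own additions, and the header names just mirror the fields above:

import csv
import os

def save_with_header(info, path='xici_prox.csv'):
    # Hypothetical variant of save (not in the original post): write the
    # column names when the file is first created, then append rows as before.
    new_file = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(['ip', 'port', 'local', 'hidden', 'kind', 'check_time'])
        writer.writerow(info)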
Crawling multiple pages and running the spider:
if __name__ == "__main__":
    # Pages 1 through 19 of the high-anonymity list
    for i in range(1, 20):
        url = "https://www.xicidaili.com/nn/{}".format(i)
        parse_page(url)
During testing I deleted the saved file several times to re-crawl, eventually deleted it without keeping a backup, and then my IP got banned, so there are no crawl results to show here. If you're interested, run the code yourself, and route it through proxy IPs to avoid the ban!
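As for running through proxy IPs: requests accepts a proxies mapping, so a scraped ip:port pair can be tested before you rely on it. A rough sketch of my own, with httpbin.org as an assumed test endpoint and check_proxy as a hypothetical name:

import requests

def check_proxy(ip, port, kind='HTTP'):
    # Rough sketch (not from the original post): send one request through
    # the scraped proxy and report whether it answered in time.
    proxy_url = '{}://{}:{}'.format(kind.lower(), ip, port)
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# e.g. check_proxy('1.2.3.4', '8080') returns True only if the proxy is alive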