Python Web Scraping Example: Using XPath to Scrape IP and Port Information from the Xici Proxy Site

谷曰十鑫

Category: Python. Tags: xpath

Article link: https://blog.csdn.net/weixin_43636302/article/details/103096023

import requests
from parsel import Selector
# from bs4 import BeautifulSoup

def getOneHtmlPage(page):
    # Build the URL of one listing page
    url=f'https://www.xicidaili.com/nn/{page}'
    header={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    # A timeout keeps the request from hanging indefinitely
    response=requests.get(url,headers=header,timeout=10)
    return response.text

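def parseOneHtmlPage(text):
    selectors=Selector(text=text)
    # Write the XPath by hand where possible; a path copied straight from the browser can be wrong.
    # In this case the copied path contained /tbody, which is not present in the actual page source.
    selectors=selectors.xpath('//table[@id="ip_list"]/tr')
    for selector in selectors:
        ip=selector.xpath('./td[2]/text()').get()
        port=selector.xpath('./td[3]/text()').get()
        print(port,ip)
        # .get() returns None when nothing matches (for example a header row without td cells), so skip such rows
        if ip and port:
            pipelinesCSV(f"{ip},{port}")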

def pipelinesCSV(item):
    with open('filename.csv','a',encoding='utf-8') as fp:
        fp.write(item+'\n')
        return item

def main():
    # Crawling too frequently may get your IP banned; add a delay between requests or rotate IPs
    for page in range(1,10):
        text=getOneHtmlPage(page)
        parseOneHtmlPage(text)

if __name__=='__main__':
    main()
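
The comment in main() suggests spacing out requests but the script does not do so. A minimal sketch of one way to add a delay is shown below. The helper name crawl_with_delay, its parameters, and the stand-in fake_fetch are hypothetical and not part of the original script; in practice getOneHtmlPage would be passed as the fetch argument. The delay bounds are arbitrary illustrative values, not figures known to be safe for any particular site.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Tuple


def crawl_with_delay(
    fetch: Callable[[int], str],
    pages: Iterable[int],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[int, str]]:
    """Yield (page, html) pairs, pausing a random interval between requests."""
    for index, page in enumerate(pages):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        yield page, fetch(page)


if __name__ == "__main__":
    # Hypothetical stand-in for getOneHtmlPage so the demo runs offline.
    def fake_fetch(page: int) -> str:
        return f"<html>page {page}</html>"

    for page, html in crawl_with_delay(fake_fetch, range(1, 4), 0.1, 0.2):
        print(page, html)
```

With the original functions, the loop in main() could then iterate over crawl_with_delay(getOneHtmlPage, range(1,10)) and hand each returned page text to parseOneHtmlPage. A randomized interval is used here only so that requests are not sent at a perfectly regular rhythm; a fixed time.sleep would also fit the comment's suggestion.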
