Python Web Scraping Basics -- A Combined Example

午后阳光送给你

Category: python. Tags: python, web scraping

Article link: https://blog.csdn.net/qq_25022577/article/details/117994573

Task: scrape the IP addresses published by a website and test whether each one is usable.

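Tech list:

requests
re
BeautifulSoup
telnetlib, used to test whether an IP address is usable (standard library; deprecated in Python 3.11 and removed in 3.13)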

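Open the website to be scraped, press F12, and inspect the format of the data we want to extract.
In each row of data, we only care about the IP address and its port number.
Write the regular expression we are likely to need: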

import re

# One named group per table cell; only ip and port are used later
ex = r'<tr><td>(?P<ip>.*?)</td><td>(?P<port>.*?)</td><td>(?P<att1>.*?)</td><td>(?P<att2>.*?)</td><td>(?P<time>.*?)</td></tr>'
regaxEx = re.compile(ex)
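To see what the pattern captures, here is a small sketch that runs it against a hand-written row. The sample HTML and all of its values are made up for illustration (the address comes from a documentation-only range); the real page's markup may differ, so treat this as a way to experiment with the pattern rather than a description of the site.

```python
import re

ex = r'<tr><td>(?P<ip>.*?)</td><td>(?P<port>.*?)</td><td>(?P<att1>.*?)</td><td>(?P<att2>.*?)</td><td>(?P<time>.*?)</td></tr>'
regaxEx = re.compile(ex)

# Hypothetical row, with the newlines and tabs that page source often contains.
sample_row = """<tr>
<td>
\t\t\t192.0.2.10\t\t</td>
<td>
\t\t\t8080\t\t</td>
<td>
\t\t\tSomewhere\t\t</td>
<td>
\t\t\tSomeISP\t\t</td>
<td>
\t\t\t2021/06/17 10:30:02\t\t</td>
</tr>"""

# Drop newlines and tabs so the tags sit next to each other, as the pattern expects.
flat = sample_row.replace('\n', '').replace('\t', '').strip()
data = regaxEx.match(flat)

if data is None:
    # match() returns None when the row does not have the expected layout.
    print('row did not match')
else:
    print(data['ip'], data['port'])
```

Because `match()` returns `None` for rows that do not fit the pattern, checking the result before indexing into it avoids a `TypeError` on unexpected rows.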

Full code

import re
import telnetlib  # deprecated in Python 3.11, removed in 3.13

import requests
from bs4 import BeautifulSoup


def ip_test():
    # Regular expression that parses one table row
    ex = r'<tr><td>(?P<ip>.*?)</td><td>(?P<port>.*?)</td><td>(?P<att1>.*?)</td><td>(?P<att2>.*?)</td><td>(?P<time>.*?)</td></tr>'
    regaxEx = re.compile(ex)

    url = 'https://www.89ip.cn/'  # + index_1.html
    # Spoof the User-Agent
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
    }

    ip_list = []
    # Fetch page by page ---- the number of pages could also be read from the command line
    for i in range(1, 7):
        new_url = url + 'index_' + str(i) + '.html'
        print('current url : ', new_url)
        response = requests.get(url=new_url, headers=header, timeout=10)
        # Parse the page
        soup = BeautifulSoup(response.text, 'lxml')
        ip_table = soup.find('table', class_='layui-table').find('tbody')
        for row in ip_table.find_all('tr'):
            data = regaxEx.match(str(row).replace('\n', '').replace('\t', '').strip())
            if data is None:
                # Row does not have the expected five-cell layout; skip it
                continue
            ip, port = data['ip'].strip(), data['port'].strip()
            # Test whether the ip is usable
            try:
                telnetlib.Telnet(ip, port=int(port), timeout=3).close()
            except (OSError, ValueError):
                print('Invalid ip ', ip)
            else:
                print('Good ip ', ip)
                ip_list.append(ip + ':' + port)

    print(ip_list)
    print('over.')
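Since `telnetlib` is no longer in the standard library as of Python 3.13, the reachability check can be sketched with `socket` instead. This is a swapped-in alternative, not what the original code does, and the helper name `is_port_open` is hypothetical. It only checks that a TCP connection can be opened, which is the same thing the `Telnet(...)` call checks; it does not prove the host actually works as a proxy.

```python
import socket


def is_port_open(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout seconds."""
    try:
        # create_connection raises OSError (timeouts included) when it cannot connect.
        with socket.create_connection((ip, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # ValueError covers a port string that is not a number.
        return False
```

Inside the row loop, the `try`/`except` around `telnetlib.Telnet(...)` could then become a plain `if is_port_open(data['ip'], data['port']):`.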

The example above combines several of the topics covered earlier. The scraped IP addresses can later be used as proxies.
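As a rough sketch of that later use, an `ip:port` string from `ip_list` can be turned into the mapping that `requests` accepts through its `proxies` argument. The helper names `build_proxies` and `fetch_via_proxy` are hypothetical, and the code assumes the scraped entries are plain HTTP proxies, which the article does not confirm.

```python
def build_proxies(ip_port):
    """Turn 'ip:port' into the mapping that requests accepts as proxies=..."""
    # Assumption: the entry is an HTTP proxy (not SOCKS).
    proxy_url = 'http://' + ip_port
    return {'http': proxy_url, 'https': proxy_url}


def fetch_via_proxy(url, ip_port, timeout=5):
    """Fetch url through the proxy; return the status code, or None on failure."""
    import requests  # third-party; imported here so build_proxies works without it

    try:
        response = requests.get(url, proxies=build_proxies(ip_port), timeout=timeout)
    except requests.RequestException:
        # Dead or misbehaving proxies commonly end up here.
        return None
    return response.status_code
```

A call such as `fetch_via_proxy('https://www.89ip.cn/', ip_list[0])` would then show whether a given entry really relays traffic, which is a stricter test than an open port.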
