[Regular Expressions] Matching multi-line web page data with a multi-line regular expression

weixin_34162228

Original link: https://my.oschina.net/MasterLi161307040026/blog/856190

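This post does not cover regular expression syntax in detail. As a pointer for learning it: it is enough to understand what the metacharacters mean, the compile function and its compile flags, the top-level functions of the re module, and the methods of match objects.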
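As a quick orientation, the four areas listed above can be seen together in one short sketch. The sample text and the pattern are made up for illustration and are not taken from the target page.

```python
import re

# Illustrative input only; not the format of the page scraped in this post.
text = 'Proxy: 66.104.77.20 PORT: 3128'

# Metacharacters: \d (digit), {1,3} (repetition), \. (literal dot),
# (...) (capturing group), (?:...) (non-capturing group), \s (whitespace).
# Compile function plus a compile flag: re.I lets "port" match "PORT".
pattern = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3})\s+port:\s*(\d+)', re.I)

# search() returns a match object, or None when nothing matches.
match = pattern.search(text)

# Match object methods: groups(), group(), span().
ip, port = match.groups()
whole = match.group(0)
start, end = match.span(1)

# Top-level function of the re module: no explicit compile step needed.
numbers = re.findall(r'\d+', text)

print(ip, port, whole, (start, end), numbers)
```

Compiling once and reusing the pattern object is mostly a readability choice here; the top-level functions compile and cache patterns internally.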

Goal:

Extract the target data from a page that has been fetched. Here, that means the IP address and port of each proxy server.

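Note: the pattern here is written across multiple lines. It could also be compiled with the re.X flag. Either way, HTML is lenient about whitespace, so pages are often inconsistently indented. To cope with that, this pattern differs from a naive one in that each line of HTML to be matched gets [\s]* at its start and at its end, which absorbs the insignificant whitespace. One more caution on this point: page content cannot be pasted in as a regular expression unchanged, especially when the content to match spans several lines, because the line breaks and indentation would then have to match exactly and any metacharacters in the text would be interpreted rather than matched literally.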
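For comparison, here is a rough sketch of what a re.X version might look like, followed by a small demonstration of why raw page text does not work as a pattern. The HTML fragment is a shortened, hypothetical copy of the page structure described in this post, and the variable names are made up for illustration.

```python
import re

# Hypothetical fragment shaped like the page described in this post.
html = '''
    <span class="tbBottomLine" style="width:140px;">
        66.104.77.20
    </span>
    <span class="tbBottomLine" style="width:50px;">
            3128
    </span>
'''

# Under re.X, whitespace inside the pattern is ignored and "#" starts a
# comment, so spaces that must really match are written as \s+ here, and
# the whitespace of the page still needs an explicit \s*.
verbose = re.compile(r'''
    <span\s+class="tbBottomLine"\s+style="width:140px;">
    \s* (.+?) \s*            # group 1: IP address
    </span> \s*
    <span\s+class="tbBottomLine"\s+style="width:50px;">
    \s* (.+?) \s*            # group 2: port
    </span>
''', re.X)

pairs = verbose.findall(html)
print(pairs)

# Page text used directly as a pattern: "." and "(...)" are metacharacters,
# so this pattern does not even match its own literal text.
raw = '2.70(61票)'
unescaped = re.search(raw, raw)
# re.escape() turns the text into a pattern that matches it literally.
escaped = re.search(re.escape(raw), raw)
print(unescaped, escaped)
```

The verbose form trades compactness for inline comments. Whether that is worth it for a pattern of this size is a matter of taste.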

#!/usr/bin/env python3
# -*- coding:utf-8 -*-

import re
from urllib.request import urlopen

'''
    The HTTP response body has the following format:
    <div class="proxylistitem" name="list_proxy_ip">
            <div style="float:left; display:block; width:630px;">
            <span class="tbBottomLine" style="width:140px;">
                66.104.77.20
            </span>
            <span class="tbBottomLine" style="width:50px;">
                    3128
            </span>
            <span class="tbBottomLine " style="width:70px;">
                高匿
            </span>
            <span class="tbBottomLine " style="width:70px;">
                美国
            </span>
            <span class="tbBottomLine " style="width:80px;">
                09月05日
            </span>
            <span class="tbBottomLine " style="width:80px;">
                2.70(61票)
            </span>
            <span class="tbBottomLine " style="width:60px;">
                2.70
            </span>
            <span class="tbBottomLine " style="width:30px;">
                10天
            </span>
            </div>
        </div>
    Target page:
    http://www.proxy360.cn/Region/America
'''

def get_proxy_from_cnproxy():
    # Adjacent string literals are concatenated, so the indentation of this
    # source file does not end up inside the pattern (a backslash
    # continuation inside a single string literal would include it).
    reStr = (r'<span class="tbBottomLine" style="width:140px;">[\s]*'
             r'[\s]*(.+?)[\s]*'
             r'[\s]*</span>[\s]*'
             r'[\s]*<span class="tbBottomLine" style="width:50px;">[\s]*'
             r'[\s]*(.+?)[\s]*'
             r'[\s]*</span>')
    req_reObj = re.compile(reStr)
    target = 'http://www.proxy360.cn/Region/America'
    seq_page = urlopen(target)
    # read() returns bytes in Python 3; UTF-8 is assumed for this page.
    seq_page_html = seq_page.read().decode('utf-8', errors='replace')
    # findall() returns one (ip, port) tuple per match.
    proxy_address = req_reObj.findall(seq_page_html)
    for address in proxy_address:
        print(address)

if __name__ == '__main__':
    get_proxy_from_cnproxy()
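
A pattern like this can also be tried out offline, without fetching the page. The sketch below runs an equivalent pattern against a made-up fragment with deliberately messy whitespace and joins each result into an ip:port string. The names PROXY_RE, extract_proxies and SAMPLE are hypothetical, and the port.isdigit() filter is an extra assumption that is not part of the original script.

```python
import re

# Same structure as the pattern in this post, with \s* between the parts.
PROXY_RE = re.compile(
    r'<span class="tbBottomLine" style="width:140px;">\s*(.+?)\s*</span>\s*'
    r'<span class="tbBottomLine" style="width:50px;">\s*(.+?)\s*</span>'
)

def extract_proxies(html):
    """Return 'ip:port' strings for every IP/port pair found in html."""
    return ['{}:{}'.format(ip, port)
            for ip, port in PROXY_RE.findall(html)
            if port.isdigit()]  # assumption: skip entries with a non-numeric port

# Made-up fragment: tabs, spaces and blank lines are mixed on purpose.
SAMPLE = (
    '<span class="tbBottomLine" style="width:140px;">\n'
    '\t\t66.104.77.20   \n'
    '</span>\n\n'
    '   <span class="tbBottomLine" style="width:50px;">\n'
    '        3128\n'
    '\t</span>\n'
    '<span class="tbBottomLine " style="width:70px;">\n'
    '    US\n'
    '</span>\n'
)

proxies = extract_proxies(SAMPLE)
print(proxies)
```

Keeping the extraction in a function that takes a string makes it easy to test against saved copies of the page before pointing it at the live site.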

Reposted from: https://my.oschina.net/MasterLi161307040026/blog/856190