Data Extraction: Special Characters

汪宝儿

Last modified 2022-11-24 20:15:05

Category: Web Crawlers. Tags: python, ipython

First published 2022-11-22 21:09:01

Article link: https://blog.csdn.net/weixin_48353691/article/details/127989526

1.^

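(1) Inside square brackets, it negates the character class ->> [^0-9a-zA-Z_]
(2) Outside square brackets, it means the match must begin at the start of the string ->> ^a[a-z]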
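
A minimal sketch of both uses of ^ with Python's re module. The sample strings are made up for illustration and do not come from the original post:

```python
import re

# Inside brackets: [^0-9a-zA-Z_] matches any character that is NOT
# a digit, a letter, or an underscore.
# (Sample input is an illustrative assumption.)
non_word = re.findall(r'[^0-9a-zA-Z_]', 'user_1@mail.com')
print(non_word)            # ['@', '.']

# Outside brackets: ^a[a-z] matches only when the string starts with
# 'a' followed by one lowercase letter.
starts_ok = re.search(r'^a[a-z]', 'ab123')
starts_bad = re.search(r'^a[a-z]', 'xab123')
print(starts_ok.group())   # ab
print(starts_bad)          # None
```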

2.$

Means the match must finish at the end of the string ->> [\w]+@[a-z0-9]+[.]com$
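
As a rough illustration, the pattern above can be tried against two made-up addresses (both are assumptions for the example). Only the one that actually ends in .com matches:

```python
import re

pattern = r'[\w]+@[a-z0-9]+[.]com$'

# Hypothetical sample addresses, chosen only to show the effect of $.
ends_with_com = re.search(pattern, 'alice@example.com')
trailing_text = re.search(pattern, 'alice@example.com.cn')

print(ends_with_com.group())   # alice@example.com
print(trailing_text)           # None, because ".cn" follows ".com"
```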

3.|

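(1) Inside square brackets, the content is treated as a set of single characters, and | is just one more literal character in that set ->> [https|http|ftp]
(2) Inside parentheses, each alternative is treated as a whole string ->> (https|http|ftp)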
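
The difference is easy to see by running both patterns against the same input. This is a small sketch; the URL is a placeholder chosen for the example:

```python
import re

url = 'https://example.com'  # placeholder input, not from the original post

# [https|http|ftp] is a character class: it matches exactly ONE character
# drawn from the set {h, t, p, s, |, f}.
in_brackets = re.match(r'[https|http|ftp]', url)
print(in_brackets.group())   # h

# (https|http|ftp) is alternation: it matches one of the three whole strings,
# trying them from left to right.
in_parens = re.match(r'(https|http|ftp)', url)
print(in_parens.group())     # https
```

Because alternatives are tried in order, listing https before http keeps the longer scheme from being cut short.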

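4. Greedy mode

A quantifier matches as many characters as possible [greedy is the default]

5. Non-greedy mode

A quantifier matches as few characters as possible [add ? after the quantifier, e.g. +? or *?]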

e.g:

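import re

text = \
"""
<tr class="pythons">
    <td class="a">python1</td>
    <td class="b">python2</td>
</tr>   
"""
# Greedy: [\w\W]+ takes as many characters as possible, so the match ends at the last '>'
result = re.match(r'\s<tr[\w\W]+>', text)
print(result.group())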

The regular expression above produces:

<tr class="pythons">
    <td class="a">python1</td>
    <td class="b">python2</td>
</tr>

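What we actually want is the opening tag with its attributes, not the rest of the content. This is where non-greedy mode is very useful, so it is worth mastering: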

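import re

text = \
"""
<tr class="pythons">
    <td class="a">python1</td>
    <td class="b">python2</td>
</tr>   
"""
# Non-greedy: [\w\W]+? takes as few characters as possible, so the match ends at the first '>'
result = re.match(r'\s<tr[\w\W]+?>', text)
print(result.group())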
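
To compare the two modes side by side, the sketch below runs both patterns on the same text. The last step, which pulls out the class value with a capture group, is a possible extension and not part of the original example:

```python
import re

text = """
<tr class="pythons">
    <td class="a">python1</td>
    <td class="b">python2</td>
</tr>
"""

greedy = re.match(r'\s<tr[\w\W]+>', text).group()
lazy = re.match(r'\s<tr[\w\W]+?>', text).group()

# The greedy match runs to the last '>' in the text (the closing </tr>).
print(greedy.endswith('</tr>'))   # True

# The non-greedy match stops at the first '>' (the end of the opening tag).
# The leading '\n' is the whitespace consumed by \s.
print(repr(lazy))                 # '\n<tr class="pythons">'

# Assumed extension: a capture group with a non-greedy quantifier
# extracts only the attribute value.
attr = re.search(r'<tr class="(.+?)">', text)
print(attr.group(1))              # pythons
```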