Methods for Scraping the Main Information from a Page with Python (Part 2)

Latest recommended article published on 2022-08-30 12:16:56

依剑仗天涯



Article link: https://blog.csdn.net/sun_daming/article/details/90238310


There is no website I can't scrape, only ones I don't feel like scraping (admittedly a bit of a boast).

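Python 2 is gradually being replaced by Python 3, so everything here uses Python 3. Without further ado, let's go straight to the techniques.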

Regular expressions with re (hard)

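Get the content between <tr></tr> tags
Get the content between <a href..></a> hyperlink tags
Use the last segment of a URL to name an image or pass a parameter
Scrape all URL links in a page
Two ways to scrape the page title
Locate a table and scrape its attribute/value pairs
Filter out <span></span> and similar tags
Get the content of <script></script> and similar tags
Filter out <br /> tags with the replace function
Get the link inside <img ../> and filter out <img> tags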
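A few of the items above (the page title, table rows, attribute/value pairs) can be sketched as follows. This is only an illustration under assumptions: the HTML sample, the `infobox` class name and the variable names are made up for the example, and the two title patterns are one possible reading of "two ways", not necessarily the ones the author had in mind.

```python
import re

# Hypothetical sample page, made up for this sketch
html = """
<html><head><title>Demo Page</title></head>
<body>
<table class="infobox">
<tr><td>Name</td><td>Python</td></tr>
<tr><td>Version</td><td>3</td></tr>
</table>
</body></html>
"""

# Title, way 1: a capture group between <title> and </title>
title_1 = re.findall(r'<title>(.*?)</title>', html, re.I | re.S)

# Title, way 2: lookbehind/lookahead, so the match itself is the title text
match = re.search(r'(?<=<title>).*?(?=</title>)', html, re.I | re.S)
title_2 = match.group() if match else None

# Locate the table first, then read each <tr> row as an attribute/value pair
table = re.search(r'<table class="infobox">(.*?)</table>', html, re.S).group(1)
rows = re.findall(r'<tr>(.*?)</tr>', table, re.S)
pairs = {}
for row in rows:
    cells = re.findall(r'<td>(.*?)</td>', row, re.S)
    if len(cells) == 2:
        pairs[cells[0]] = cells[1]

print(title_1, title_2)
print(pairs)
```

Regular expressions like these are brittle against nested or malformed markup, which is one reason the parser-based approaches later in the post are usually easier.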

Code:

import re

# Sample HTML fragment to search
content = '''
<td>
<a href="https://www.baidu.com/articles/zj.html" title="山西省">山西煤多</a>
<a href="https://www.baidu.com//articles/gz.html" title="北京市">北京人多</a>
</td>
'''

# Get the text between <a ...> and </a>
print('Link text:')
res = r'<a .*?>(.*?)</a>'
mm = re.findall(res, content, re.S | re.M)
for value in mm:
    print(value)

# Get each complete <a href=...>...</a> element
print('\nComplete links:')
urls = re.findall(r"<a.*?href=.*?</a>", content, re.I | re.S | re.M)
for i in urls:
    print(i)

# Get the URL inside each href attribute (double- or single-quoted)
print('\nURLs in links:')
res_url = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
link = re.findall(res_url, content, re.I | re.S | re.M)
for url in link:
    print(url)

Result:

Link text:
山西煤多
北京人多

Complete links:
<a href="https://www.baidu.com/articles/zj.html" title="山西省">山西煤多</a>
<a href="https://www.baidu.com//articles/gz.html" title="北京市">北京人多</a>

URLs in links:
https://www.baidu.com/articles/zj.html
https://www.baidu.com//articles/gz.html
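The remaining list items (script content, image links, naming a file after the last URL segment, and filtering tags) might look roughly like this. It is a sketch on a made-up fragment: the HTML, the `example.com` URL and the variable names are assumptions for illustration, not taken from the original post.

```python
import re

# Hypothetical fragment, made up for this sketch
html = (
    '<p><span class="hl">Hello</span><br />world'
    '<img src="https://example.com/img/logo.png" alt="logo" /></p>'
    '<script>var a = 1;</script>'
)

# Content between <script> and </script>
scripts = re.findall(r'<script.*?>(.*?)</script>', html, re.I | re.S)

# Link inside each <img ... src="..."> tag
img_urls = re.findall(r'<img[^>]*?src="(.*?)"', html, re.I | re.S)

# Last segment of the URL, usable as a file name for the downloaded image
file_name = img_urls[0].split('/')[-1]

# Filter markup: <br /> via str.replace, then drop <script>, <img>, <span>, <p>
text = html.replace('<br />', '\n')
text = re.sub(r'<script.*?>.*?</script>', '', text, flags=re.I | re.S)
text = re.sub(r'<img[^>]*>', '', text, flags=re.I)
text = re.sub(r'</?(?:span|p)\b[^>]*>', '', text, flags=re.I)

print(scripts)
print(img_urls, file_name)
print(text)
```

Removing the `<script>` block before stripping the other tags keeps its JavaScript source from leaking into the extracted text.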

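The lxml.html module from lxml (medium)

import urllib.request
import lxml.html  # cssselect() also needs the cssselect package installed

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
tree = lxml.html.fromstring(html)
# Select every <a> that is a direct child of <li class="waring">
# ('waring' is the class name exactly as the selector was given)
content = tree.cssselect('li.waring > a')
for n in content:
    link = n.get('href')    # link target, relative to the site root
    title = n.get('title')  # title attribute
    tag = n.text            # visible link text
    print(tag, url + link, title)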
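The code above builds the full link with `url + link`, which works as long as every `href` starts with `/`. As a hedged alternative, the standard library's `urljoin` also copes with other forms of `href`. The paths below are hypothetical examples, not real links from the site.

```python
from urllib.parse import urljoin

base = 'http://www.nmc.cn'

# Hypothetical href values of the kinds a page might contain
hrefs = [
    '/publish/alarm.html',        # root-relative
    'publish/alarm.html',         # relative, no leading slash
    'http://example.com/a.html',  # already absolute
]

for href in hrefs:
    # Plain concatenation vs. urljoin
    print(base + href, '|', urljoin(base, href))

full_links = [urljoin(base, href) for href in hrefs]
```

For the second and third values plain concatenation produces a broken URL, while `urljoin` resolves each one against the base.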

The BeautifulSoup module from bs4 (medium)

from bs4 import BeautifulSoup
import urllib.request

url = 'http://www.nmc.cn'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

# Option 1: collect each field in its own list
link = []
title = []
tag = []
for n in content:
    link.append(url + n.get('href'))
    title.append(n.get('title'))
    tag.append(n.text)

# Option 2: build one dict per link and keep them in a list
records = []
for n in content:
    data = {
        'tag': n.text,
        'link': url + n.get('href'),
        'title': n.get('title'),
    }
    records.append(data)
print(records)
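For comparison, the same tag/link/title dictionaries can be built with nothing but the standard library. This is not one of the author's methods, just a minimal sketch: the `LinkCollector` class and the sample HTML are hypothetical, and unlike the CSS selector above it collects every `<a>` element rather than only those under `li.waring`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect text, absolute link and title for every <a> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.records = []
        self._current = None  # dict for the <a> currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            self._current = {
                'tag': '',
                'link': urljoin(self.base_url, attrs.get('href', '')),
                'title': attrs.get('title'),
            }

    def handle_data(self, data):
        # Accumulate the visible text while inside an <a> element
        if self._current is not None:
            self._current['tag'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self.records.append(self._current)
            self._current = None


# Hypothetical markup shaped like the list the selectors above target
sample = ('<ul><li class="waring">'
          '<a href="/publish/a.html" title="Alert A">Rainstorm</a>'
          '</li></ul>')
parser = LinkCollector('http://www.nmc.cn')
parser.feed(sample)
parser.close()
print(parser.records)
```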

I don't have one go-to method. If you use other approaches, please leave a comment.