Python Web Crawler: Simple Scraping

Latest recommended article published 2023-08-08 14:48:44

默默的沉默者

Category: Python. Tags: web crawler

Article link: https://blog.csdn.net/M_WBCG/article/details/70232780

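Scraping content from a web page breaks down into roughly three steps:

1. Fetch the whole page (its source code).

2. Use a regular expression to pull out the content of the tags you want.

3. Write the extracted content to local disk.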

I. Fetch the whole page (source code)

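# coding: utf-8
import urllib.request

def getHtml(url):
    # urlopen() returns a response object; read() returns the body as bytes
    html = urllib.request.urlopen(url).read()
    return html

html = getHtml("http://www.weather.com.cn/weather/101190401.shtml")
print(html)  # prints a bytes literal (b'...'); decode it to get readable text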

urllib.request.urlopen() opens a URL and returns a response object.

read() reads the data behind that URL and returns it as bytes, not as str.
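Since read() hands back bytes, it can help to wrap the download and the decoding in one place. The following is a minimal sketch rather than the article's own code: the helper name fetch_text, the 10-second timeout, and the fallback to utf-8 are all assumptions made for illustration.

```python
import urllib.request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Download url and return the response body decoded to str.

    fetch_text is a hypothetical helper; timeout and default_charset
    are illustrative defaults, not values from the article.
    """
    # The response object is a context manager, so the connection is
    # closed even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()  # bytes
        # Use the charset from the Content-Type header when the server
        # sends one, otherwise fall back to default_charset.
        charset = resp.headers.get_content_charset() or default_charset
    # errors="replace" keeps going if a few bytes are not valid in charset.
    return raw.decode(charset, errors="replace")
```

Calling fetch_text("http://www.weather.com.cn/weather/101190401.shtml") should then give a str that can be passed straight to a regular expression, provided the site is reachable.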

II. Use a regular expression to pull out the content of the tags you want

# coding: utf-8
import urllib.request
import re
def getHtml(url):
    html = urllib.request.urlopen(url).read()
    return html
def getImg(html):
    # Raw string, so that \. reaches the regex engine unchanged
    reg = r'src="(.+?\.png)"'
    imgre = re.compile(reg)
    # Without this line, findall() raises
    # TypeError: cannot use a string pattern on a bytes-like object
    html = html.decode('utf-8')
    imglist = imgre.findall(html)
    return imglist
html = getHtml("http://www.weather.com.cn/weather/101190401.shtml")
print(getImg(html))

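(Press F12 to open the browser's developer tools. There you can view the page source and check the format of the content you want to filter out.)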

The regular expression filters the HTML and returns the image links.
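The regular expression matches any src="...png" attribute, whatever tag it sits on. If only img tags are wanted, one alternative is the standard library's html.parser module. This is a different technique from the article's regex, shown only as a sketch; the class name ImgSrcCollector and the function png_sources are made up for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that points at a .png file.

    Hypothetical helper class, not part of the original article.
    """

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value and value.lower().endswith(".png"):
                self.sources.append(value)


def png_sources(html_text):
    """Return the .png links found on <img> tags in html_text (a str)."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources
```

The parser expects str, so the page still has to be decoded first, just as with the regular expression.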

re.compile() compiles a regular expression string into a regular expression object.

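The object's findall() method returns every part of html that matches imgre (the regular expression). Because the pattern contains one capture group, each result is just the text inside the parentheses.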
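To see what compile() and findall() do without touching the network, the pattern can be run against a small hand-written string. The sample HTML below is invented for illustration. It also shows a pitfall: .+? can run across a closing quote into the next tag, so a stricter variant that forbids quotes inside the link is included as a suggestion, not as the article's pattern.

```python
import re

# Invented sample; x.test is a placeholder host, not a real site.
SAMPLE = '<img src="/i/a.png"> <img src="/i/b.jpg"> <img src="http://x.test/c.png">'

# The article's pattern: lazy, but "." also matches the quote character.
LOOSE = re.compile(r'src="(.+?\.png)"')
# Suggested variant: [^"] stops the match at the end of the attribute.
STRICT = re.compile(r'src="([^"]+?\.png)"')

loose_links = LOOSE.findall(SAMPLE)
strict_links = STRICT.findall(SAMPLE)

print(loose_links)   # ['/i/a.png', '/i/b.jpg"> <img src="http://x.test/c.png']
print(strict_links)  # ['/i/a.png', 'http://x.test/c.png']

# A str pattern cannot search bytes, which is why the page is decoded first.
try:
    STRICT.findall(SAMPLE.encode("utf-8"))
except TypeError as exc:
    bytes_error = str(exc)
    print(bytes_error)  # cannot use a string pattern on a bytes-like object
```

The second entry of loose_links starts at the .jpg image and swallows the markup up to the next .png, which is rarely what a scraper wants.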

III. Save the filtered data locally

# coding: utf-8
import re
import urllib.request
from urllib.parse import urljoin

URL = "http://www.weather.com.cn/weather/101190401.shtml"

def getHtml(url):
    html = urllib.request.urlopen(url).read()
    return html
def getImg(html):
    reg = r'src="(.+?\.png)"'
    imgre = re.compile(reg)
    html = html.decode('utf-8')
    imglist = imgre.findall(html)
    x = 0
    for imgurl in imglist:
        # urlretrieve() needs an absolute URL, so resolve relative links
        # against the page URL; files are saved as 0.png, 1.png, ...
        urllib.request.urlretrieve(urljoin(URL, imgurl), '%s.png' % x)
        x += 1
    return imglist
html = getHtml(URL)
print(getImg(html))

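urllib.request.urlretrieve() downloads the remote data straight to a local file. Its first argument is the URL and its second is the file name to save to.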
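Links found in a page are often relative (/i/a.png) or protocol-relative (//host/a.png), and urlretrieve() cannot open those as they stand. The sketch below resolves them first and skips any download that fails. The function names resolve_links and save_images and the images output directory are assumptions for illustration, not part of the original code.

```python
import os
import urllib.request
from urllib.parse import urljoin


def resolve_links(page_url, links):
    """Turn relative or protocol-relative links into absolute URLs."""
    return [urljoin(page_url, link) for link in links]


def save_images(page_url, links, out_dir="images"):
    """Download each link into out_dir as 0.png, 1.png, ...

    Returns the list of paths that were actually written.
    Hypothetical helper; out_dir is an illustrative default.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, img_url in enumerate(resolve_links(page_url, links)):
        path = os.path.join(out_dir, "%d.png" % index)
        try:
            urllib.request.urlretrieve(img_url, path)
        except OSError as exc:
            # URLError and HTTPError are subclasses of OSError.
            print("skipped %s: %s" % (img_url, exc))
            continue
        saved.append(path)
    return saved
```

With the article's page, save_images("http://www.weather.com.cn/weather/101190401.shtml", getImg(html)) would be one way to call it, assuming getImg returns the raw links as in the code above.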
