Python 3 Web Scraping [Part 1]: Static Web Pages


xinjiyuan97

Category: Python web scraping. Tags: python, web scraping

Article link: https://blog.csdn.net/xinjiyuan97/article/details/60788935

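1. Packages

1. In Python 3, the Python 2 modules urllib and urllib2 were reorganized into the single urllib package, whose main submodules are urllib.request, urllib.parse, and urllib.error.
2. The bs4 (BeautifulSoup) package is also needed to filter the pages that have been fetched.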
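
As a small illustration of how the work is divided between the submodules, the sketch below uses urllib.parse to build and take apart URLs, and imports names from urllib.request and urllib.error only to show where the fetching function and the exception classes live. The base URL and the query values are made-up placeholders, not taken from the article.

```python
from urllib.error import HTTPError, URLError            # exception classes
from urllib.parse import urlencode, urljoin, urlparse   # URL handling
from urllib.request import urlopen                      # fetching (not called here)

base = "https://example.com/blog/index.html"  # hypothetical page URL

# Resolve a relative link found on a page against the page's own URL.
article = urljoin(base, "article/1.html")

# Build a query string from a dict and attach it to an absolute path.
query = urlencode({"page": 2, "tag": "python"})
search = urljoin(base, "/search") + "?" + query

# Split the finished URL back into its components.
parts = urlparse(search)
print(article)
print(parts.netloc, parts.path, parts.query)
```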

2. Functions

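urlopen function. Use urlopen(url, timeout=...) from urllib.request to open a page. Note that if the page does not exist or cannot be reached, the function raises an exception (HTTPError for an HTTP error status, URLError when the server cannot be contacted), so the call should be made inside a try...except block.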
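
A minimal sketch of that pattern follows. The helper name fetch and its default timeout are assumptions made for illustration; only urlopen, HTTPError, URLError, and socket.timeout come from the standard library.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the raw bytes of url, or None if the page cannot be fetched."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error", e.code, "for", url)
    except URLError as e:
        # The resource could not be reached (bad host, refused connection, ...).
        print("Could not reach", url, "-", e.reason)
    except socket.timeout:
        # The response did not arrive within the timeout.
        print("Timed out reading", url)
    return None
```

HTTPError is a subclass of URLError, so its except clause has to come first; otherwise the URLError branch would catch both.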
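BeautifulSoup class. Calling BeautifulSoup(...) returns a BeautifulSoup object built from the page's HTML, which is read from the response with read(). For example: bsObj = BeautifulSoup(urlopen(url).read(), 'html.parser'). Naming the parser explicitly avoids the warning bs4 emits when it has to guess one.
- Child elements of bsObj can be accessed directly as attributes, such as bsObj.h1 or bsObj.p; each returns the first tag with that name.
- findAll and find. findAll(name, attrs, recursive, string, limit, **kwargs) and find(name, attrs, recursive, string, **kwargs) locate the blocks of markup you need; findAll returns every match and find returns only the first. The string argument was called text in older releases, and current bs4 prefers the spelling find_all, keeping findAll as a legacy alias.
- get_text(). Returns the text content of a tag that has already been found, with the markup stripped.
- children. An iterator over the children of the current node. (Note that "child" here means direct children only, and that text nodes, such as the whitespace between tags, count as children too; use descendants to walk every nested node.)
- next_siblings. Like children, an iterator, here over the sibling nodes that follow the current node. Related attributes include previous_siblings, parent, and parents.
- Regular expressions. Import re alongside BeautifulSoup to filter data with regular expressions; a compiled pattern can be passed to find and findAll in place of a tag name, attribute value, or string.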
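
The calls in this list can be tried without any network access by parsing a small HTML string. The markup below is invented for illustration, and the example assumes a bs4 release that accepts the string argument.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup standing in for a downloaded page.
html = """
<h1>Product list</h1>
<ul id="items">
  <li class="item">Apple <span class="price">$1.20</span></li>
  <li class="item">Pear <span class="price">$0.80</span></li>
  <li class="note">Prices include tax</li>
</ul>
"""

bsObj = BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching tag; get_text() strips the markup.
title = bsObj.h1.get_text()

# find_all with an attribute filter returns every matching tag.
items = bsObj.find_all("li", {"class": "item"})
names = [li.contents[0].strip() for li in items]
prices = [span.get_text() for span in bsObj.find_all("span", class_="price")]

# children yields direct children, including whitespace text nodes,
# so keep only the Tag objects.
ul = bsObj.find("ul", id="items")
child_tags = [c for c in ul.children if isinstance(c, Tag)]

# next_siblings yields only the nodes that come after the current one.
first = ul.find("li")
following = [s["class"][0] for s in first.next_siblings if isinstance(s, Tag)]

# A compiled pattern can be used as a filter: spans whose text starts with "$0."
cheap = [s.get_text() for s in bsObj.find_all("span", string=re.compile(r"^\$0\."))]

print(title, names, prices, len(child_tags), following, cheap)
```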

Example code:

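import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup

def getPages(url):  # Fetch the requested page and parse it
    try:
        page = urlopen(url, timeout=100)
        html = page.read()
    except HTTPError as e:  # The server returned an error status
        print(e)
    except URLError as e:  # The server could not be reached
        print(e)
    except socket.timeout as e:  # The request timed out
        print(e)
    else:
        return BeautifulSoup(html, 'html.parser')
    return None  # Signals failure to the caller

def loadPages(url, max_tries=5):  # Retry so that a single timeout does not lose the page
    # Give up after max_tries attempts so a permanently missing page cannot loop forever
    for _ in range(max_tries):
        print('Downloading ' + url)
        html = getPages(url)
        if html is not None:
            return html
    print(url + ' could not be downloaded')
    return None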
