Python: Web Scraping (Part 1)

Frank_0415

Article link: https://blog.csdn.net/Frank_0415/article/details/84664583

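Use the standard urllib library to request a web page;
use the BeautifulSoup library to parse the page's elements;
consider which errors can occur and harden the code against them.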

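from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse it with Python's built-in HTML parser
html = urlopen("http://www.baidu.com")
bsObj = BeautifulSoup(html.read(), "html.parser")
# Print the first <div> element in the document
print(bsObj.div)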

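Common error 1:

When the page does not exist on the server (or an error occurs while retrieving it), the request fails with an HTTP error such as "404 Page Not Found" or "500 Internal Server Error". In all of these cases urlopen raises an HTTPError exception, which can be handled like this: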

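from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan here
else:
    # The program continues. Note: if the except block above already returns
    # or breaks, the else clause is unnecessary and this code will not run.
    pass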
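Common error 2:

When the server cannot be reached at all (for example, the link http://www.pythonscraping.com/ is down, or the URL is mistyped), no HTTP status code comes back, and urlopen raises a URLError. We can catch this exception and report that the server was not found.

from urllib.request import urlopen
from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/")
except URLError as e:
    print("URL is not found")
else:
    # The program continues
    pass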
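
In the standard library, urllib.error.HTTPError is a subclass of urllib.error.URLError, so code that wants to tell the two failure modes apart has to list the HTTPError handler first. A rough sketch of that ordering follows; the helper name `describe_error` is hypothetical and not part of the article.

```python
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Classify an exception of the kind urlopen raises (hypothetical helper)."""
    try:
        raise exc
    except HTTPError as e:
        # The server answered, but with an error status
        return "HTTP error {}".format(e.code)
    except URLError as e:
        # No usable answer: bad host name, refused connection, and so on
        return "server not reachable: {}".format(e.reason)
```

If the two except clauses were swapped, the URLError handler would also swallow every HTTPError and the status code would never be inspected.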

Putting this together, the complete page request looks like this:

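from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server responded with an error status (404, 500, ...)
        return None
    except URLError as e:
        # The server could not be reached at all
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # First <div> in the page; None if the page has no <div>
        title = bsObj.div
    except AttributeError as e:
        # Raised when a tag lookup is chained on a missing tag (None)
        return None
    return title


title = getTitle("http://www.baidu.com")
if title is None:
    print("Title could not be found")
else:
    print(title)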
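
The AttributeError branch matters when tag lookups are chained: BeautifulSoup returns None for a missing tag, and asking None for a further tag raises AttributeError. Below is a small sketch on inline HTML strings, so no network access is needed. The markup and the function name `first_heading` are made up for illustration, and Python's built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup


def first_heading(markup):
    """Return the text of the <h1> inside <body>, or None when either tag is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    try:
        # soup.body is None when there is no <body>, and soup.body.h1 is None
        # when there is no <h1>; calling into None raises AttributeError
        return soup.body.h1.get_text()
    except AttributeError:
        return None
```

Catching the exception in one place keeps the lookup readable; an alternative would be to check each intermediate result for None before going one level deeper.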
