Python Web Data Collection: A First Look at Web Crawlers

Deep,dark,fantasy

Category: python data collection

Article link: https://blog.csdn.net/qq_43709590/article/details/86513275

1. A Minimal Crawler

from urllib.request import urlopen
html = urlopen("http://baidu.com/pages/page1.html")
print(html.read())

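Running this code prints the full HTML of the page at http://baidu.com/pages/page1.html, provided that page exists. urlopen opens a remote object over the network so that it can be read.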
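
Note that html.read() returns raw bytes rather than text, which is why the printed output starts with b'...'. The sketch below shows one way to decode the response into a string. The helper name fetch_html and the UTF-8 fallback are assumptions made for illustration; they are not part of the original code.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Prefer the charset declared in the Content-Type header.
        # Falling back to UTF-8 is an assumption, not a guarantee.
        charset = response.headers.get_content_charset() or default_encoding
    # Undecodable bytes are replaced instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html with the URL from the example above would return the same page as a str instead of bytes.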

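2. BeautifulSoup
The name BeautifulSoup comes from the poem of the same name in Alice's Adventures in Wonderland; the gist of the poem is turning the plain into the marvelous. BeautifulSoup formats and organizes messy web content by locating HTML tags, and presents the XML structure as simple, easy-to-use Python objects.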

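3. Installing BeautifulSoup
Open PyCharm, create a new Python file, then go to File > Settings > Project > Project Interpreter, click pip in the package list, and type BeautifulSoup into the search box. The package to install is beautifulsoup4, which provides the bs4 module imported in the code below.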

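4. Running BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.baidu.com/pages/page1.html")
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj)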

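This code does the same thing as the earlier example: it returns the page's HTML. With BeautifulSoup, however, the result is a parsed object rather than raw bytes, so the printed output is easier to read.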
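
Once the page is parsed, individual tags can be reached as attributes of the BeautifulSoup object. The following is a minimal sketch that uses a small inline HTML string (an invented sample, not a real page) so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Invented sample document, used here instead of a downloaded page.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1>Hello</h1><p>First paragraph</p></body></html>"
)

bsObj = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(bsObj.title.get_text())  # Demo page
print(bsObj.h1)                # <h1>Hello</h1>
print(bsObj.prettify())        # the whole document, indented one tag per line
```

Attribute access such as bsObj.h1 returns the first matching tag in the document, which is enough for simple pages like this one.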

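5. Handling Exceptions

The web is messy and every site is structured differently, so all sorts of things can go wrong during data collection. To avoid wasted effort, add exception handling to the program so that you can see promptly how the code is running.
Three kinds of problems are typical:
1. The page does not exist, or an error occurs while fetching it.
2. The server does not exist.
3. The page is fetched successfully, but the tag the program needs is missing.
When the first kind occurs, urlopen raises an HTTP error (HTTPError).
It can be handled with the following code: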

from urllib.error import HTTPError

try:
    html = urlopen("http://www.baidu.com/pages/page1.html")
except HTTPError as e:
    print(e)  # return None or fall back to another approach here
else:
    pass  # the program continues here

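When the second kind occurs (the server cannot be reached or the URL is wrong), urlopen does not return None; it raises URLError, which is imported from urllib.error.
It can be handled with the following code:

try:
    html = urlopen("http://www.baidu.com/pages/page1.html")
except URLError as e:
    print("URL is not found")
else:
    pass  # the program continues here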
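
Because HTTPError is a subclass of URLError, both network failures can be handled in a single try block, as long as HTTPError is caught first. The sketch below illustrates this ordering. The helper name try_open, the returned tuples, and the 10-second timeout are assumptions chosen for illustration.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def try_open(url):
    """Return a (status, detail) tuple instead of raising (hypothetical helper)."""
    try:
        response = urlopen(url, timeout=10)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return ("http_error", e.code)
    except URLError as e:
        # The server could not be reached at all (bad host, no connection).
        return ("url_error", str(e.reason))
    return ("ok", response.read())
```

If the two except clauses were swapped, the URLError branch would also swallow every HTTPError, and the status code would be lost.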

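For the third kind, accessing a tag that does not exist returns None, and accessing a child of that None raises AttributeError. Both cases can be checked with the following code:

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)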
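
An alternative to catching AttributeError is to check each lookup for None before going one level deeper. This is only a sketch of that idea; the helper name find_nested and its parameters are hypothetical and do not appear in the original code.

```python
from bs4 import BeautifulSoup


def find_nested(bsObj, outer, inner):
    """Return the first <inner> tag inside the first <outer> tag, or None.

    Hypothetical helper: it never raises AttributeError, because the
    outer lookup is checked before the inner lookup is attempted.
    """
    outer_tag = bsObj.find(outer)
    if outer_tag is None:
        return None
    return outer_tag.find(inner)
```

With this helper, a missing outer tag and a missing inner tag both produce None, so the caller needs only one check.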

6. Final Version of the Code

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getDaima(url):  # "daima" means "code": returns the parsed page, or None
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The page or the server could not be reached
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        daima = bsObj
    except AttributeError as e:
        return None
    return daima


daima = getDaima("https://www.w3.org/1999/xhtml")
if daima is None:
    print("daima could not be found")
else:
    print(daima)

Note: "https://www.w3.org/1999/xhtml" is the W3C XHTML namespace page, not Baidu's address; replace it with any URL you want to fetch.
