Python Data Collection 1: A First Look at Web Crawlers - CSDN Blog

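A First Look at Web Crawlers

Network Connections

Note

When you visit Baidu (www.baidu.com/), the following happens after you type the address and press Enter:

First, the host name in the URL is looked up in the local hosts file; if no matching IP address is found there, a DNS server is queried.

Under the DNS protocol, your PC asks the local DNS server (usually your router) for Baidu's IP address. If that server knows the answer, it returns it; if not, the query is passed on to higher-level DNS servers until Baidu's IP address is found.

The server is located by its IP address and a TCP connection is established.

Establishing a TCP connection takes a three-way handshake with Baidu's server: you tell the server that you want to send it data (SYN); the server acknowledges this and tells you that it wants to send data too (SYN, ACK); then you acknowledge the server (ACK). Three messages are exchanged in total, hence the name three-way handshake.

The request that follows the host name in the URL (the path and query string) is sent to the server.
The server sends the resource that matches the request back to the client.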
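
The four steps above can be sketched with Python's low-level socket module. This is only an illustration of the sequence, not how urlopen is implemented; the helper names build_request and fetch_raw are hypothetical, and the sketch assumes plain HTTP on port 80 with a server that closes the connection after responding.

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.0 GET request as bytes (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",  # blank line that terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, port=80, path="/", timeout=10):
    """Resolve host, open a TCP connection, send a GET, return the raw reply."""
    # Step 1: name resolution. The operating system's resolver typically
    # consults the hosts file first and then asks a DNS server.
    family, socktype, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]

    # Step 2: connect() performs the TCP three-way handshake.
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(address)

        # Step 3: send the request (path and headers) to the server.
        sock.sendall(build_request(host, path))

        # Step 4: read the response until the server closes the connection.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_raw("www.baidu.com") would return the status line, the headers, and the body as one bytes object. In practice urlopen hides all of this behind a single call.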

Let's see how this is done in Python.

# -*- coding: utf-8 -*-
"""
Created on Sun Jan 21 18:47:08 2018

@autho
"""

from urllib.request import urlopen
html = urlopen("http://www.baidu.com")
print(html.read())

The result is as follows:

b'<!DOCTYPE html>\n<!--STATUS OK-->\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t    \r\n\r\n\t\r\n        \r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\r\n\t\t\t        \r\n\t\t\t    \r\n\r\n\r\n\r\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n\n<html>\n<head>\n    \n    <meta http-equiv="content-type" content="text/html;charset=utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta content="always" name="referrer">\n    <meta name="theme-color" content="#2932e1">\n    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />\n    <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" />\n    <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">\n\t\n\t\n\t<link rel="dns-prefetch" href="//s1.bdstatic.com"/>\n\t<link rel="dns-prefetch" href="//t1.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t2.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t3.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t10.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t11.baidu.com"/>\n\t<link rel="dns-prefetch" href="//t12.baidu.com"/>\n\t<link rel="dns-prefetch" href="//b1.bdstatic.com"/>\n    \n    <title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\n    \r\n\r\n<style id="css_index" index="index" type="text/css">html,body{height:100%}\nhtml{overflow-y:auto}\nbody{font:12px 
arial;text-align:;background:#fff}\nbody,p,form,ul,li{margin:0;padding:0;list-style:none}\nbody,form,#fm{position:relative}\ntd{text-align:left}\nimg{border:0}\na{color:#00c}\na:active{color:#f60}\ninput{border:0;padding:0}\n#wrapper{position:relative;_position:;min-height:100%}\n#head{padding-bottom:100px;text-align:center;*z-index:1}\n#ftCon{height:50px;position:absolute;bottom:47px;text-align:left;width:100%;margin:0 auto;z-index:0;overflow:hidden}\n.ftCon-Wrapper{overflow:hidden;margin:0 auto;text-align:center;*width:640px}\n.qrcodeCon{text-align:center;position:absolute;bottom:140px;height:60px;width:100%}\n#qrcode{display:inline-block;*float:left;*margin-top:4px}\n#qrcode .qrcode-item{float:left}\n#qrcode .qrcode-item-2{margin-left:33px}\n#qrcode .qrcode-img{width:60px;height:60px}\n#qrcode .qrcode-item-1 .qrcode-img{background:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/zbios_efde696.png) 0 0 no-repeat}\n#qrcode .qrcode-item-2 .qrcode-img{background:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/nuomi_365eabd.png) 0 0 no-repeat}\n@media only screen and (-webkit-min-device-pixel-ratio:2){#qrcode .qrcode-item-1 .qrcode-img{background-image:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/zbios_x2_9d645d9.png);background-size:60px 60px}\n#qrcode .qrcode-item-2 .qrcode-img{background-image:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/nuomi_x2_55dc5b7.png);background-size:60px 60px}}\n#qrcode .qrcode-text{color:#999;line-height:23px;margin:3px 0 0 5px}\n#qrcode .qrcode-text a{color:#999;text-decoration:none}\n#qrcode .qrcode-text p{text-align:left}\n#qrcode .qrcode-text b{color:#666;font-weight:700}\n#qrcode .qrcode-text span{letter-spacing:1px}\n#ftConw{display:inline-block;text-align:left;margin-left:33px;line-height:22px;position:relative;top:-2px;*float:right;*margin-left:0;*position:static}\n#ftConw,#ftConw 
a{color:#999}\n#ftConw{text-align:center;margin-left:0}\n.bg{background-image:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_5859e57.png);background-repeat:no-repeat;_background-image:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_d5b04cc.gif)}\n.c-icon{display:inline-block;width:14px;height

Because the response is very long, only part of it is shown.

This prints the complete HTML code of the page at www.baidu.com/.

from urllib.request import urlopen

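This line looks up Python's request module (inside the urllib library) and imports only the urlopen function.

urlopen opens and reads a remote object fetched over the network. Because it is so general (it can read HTML files, image files, or any other file stream with ease), we will use it frequently throughout this book.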
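
As a small illustration of reading such an object, the sketch below wraps urlopen in a helper that closes the connection and decodes the bytes into text. The name fetch_text and its default_charset parameter are hypothetical; the sketch assumes the server declares its character set in the Content-Type header and falls back to UTF-8 otherwise.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Download url and return its body as a str (hypothetical helper)."""
    # The with statement closes the connection even if read() fails.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # read() returns bytes, as in the b'...' output above
        # Use the charset from the Content-Type header when one is declared.
        charset = response.headers.get_content_charset() or default_charset
    # Undecodable bytes are replaced instead of raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

With this helper, print(fetch_text("http://www.baidu.com")) would show readable text rather than escaped bytes such as \xe7\x99\xbe.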

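Introduction to BeautifulSoup

The BeautifulSoup library takes its name from the poem of the same name in Lewis Carroll's Alice's Adventures in Wonderland.

BeautifulSoup tries to make sense of the nonsensical. It formats and organizes messy web content by locating HTML tags, and presents the XML structure as easy-to-use Python objects.

Installation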

Linux

$ sudo apt-get install python3-bs4

Mac

$ python3 -m ensurepip --upgrade

$ python3 -m pip install beautifulsoup4

Running


from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.baidu.com")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.img)


The result is as follows:

<img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" usemap="#mp" width="270"/>

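As you can see, the <img> tag we extracted from the page is nested two levels deep in the structure of the BeautifulSoup object bsObj (html → body → img). However, when we extract the img tag from the object, we can call it directly: bsObj.img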

bsObj.html.body.img
bsObj.body.img
bsObj.html.img


These expressions produce the same result.
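
To try this without touching the network, the sketch below parses a small hand-written page. The markup in SAMPLE_HTML is made up for illustration and is not Baidu's real page; the point is only that attribute-style access returns the first matching tag anywhere below the starting point.

```python
from bs4 import BeautifulSoup

# A made-up page used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello</h1>
    <img src="/img/logo.png" width="270" height="129"/>
    <img src="/img/second.png"/>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute access finds the first matching descendant, so every path
# below ends at the same <img> tag.
first_img = soup.img
print(first_img["src"])
print(soup.html.body.img == first_img)
print(soup.body.img == first_img)
print(soup.html.img == first_img)

# find_all returns every match, not just the first one.
print(len(soup.find_all("img")))
```

Because only the first match is returned, soup.img never reaches the second image; find_all is the way to collect all of them.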

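Exception Handling

The web is messy. Badly formatted data, servers that go down, and target tags that are missing all cause trouble. One of the most painful experiences in web scraping is going to bed with the crawler running, dreaming of all the data waiting in your database in the morning, only to wake up to a crawler that hit an unexpected data format and stopped shortly after you stopped watching the screen. At that moment you may want to curse whoever invented the internet (and its bizarre data formats), but the person you should really blame is yourself, for not anticipating the exceptions in the first place!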

html = urlopen("http://www.baidu.com")

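This line can fail in two main ways:

The page does not exist on the server (or an error occurred while retrieving it)
The server does not exist

In the first case, an HTTP error is returned, such as "404 Page Not Found" or "500 Internal Server Error". In all such cases, urlopen raises an HTTPError exception, which can be handled as follows: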

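from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.baidu.com")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan
else:
    # The program continues. Note: if you already return or break in the
    # except block above, the else clause is unnecessary and this code
    # would not run.
    pass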

If the server returns an HTTP error code, the program prints the error and does not run the code under the else clause.

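If the server cannot be reached at all (for example, the domain name does not resolve), urlopen does not return None; it raises URLError, the parent class of HTTPError. It can be caught in the same way:

from urllib.error import URLError

try:
    html = urlopen("http://www.baidu.com")
except URLError as e:
    print("URL is not found")
else:
    pass  # The program continues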
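
The two failure modes listed above can be told apart in one place. The following is a minimal sketch, not a definitive pattern: the helper name open_url and the wording of its messages are hypothetical. It relies on the fact that HTTPError is a subclass of URLError, so the more specific exception has to be caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def open_url(url, timeout=10):
    """Return (response, None) on success or (None, reason) on failure."""
    try:
        response = urlopen(url, timeout=timeout)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # This clause must come first because HTTPError subclasses URLError.
        return None, f"HTTP error {e.code}"
    except URLError as e:
        # No usable answer: unknown host, refused connection, bad scheme, ...
        return None, f"could not reach server: {e.reason}"
    return response, None
```

A caller can then write response, error = open_url("http://www.baidu.com") and decide whether to skip the page or stop the crawler based on error.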

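If the tag you ask for does not exist, BeautifulSoup returns a None object. However, accessing a child tag on that None object raises an AttributeError.

The following line (nonExistentTag is a made-up tag that does not actually exist in the BeautifulSoup object)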

print(bsObj.nonExistentTag)


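returns a None object. It is essential to check for and handle this object. If you skip the check and access a child tag on the None object directly, trouble follows, as shown below.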

print(bsObj.nonExistentTag.someTag)

This raises an exception:


AttributeError: 'NoneType' object has no attribute 'someTag'

To guard against both situations, check for them explicitly:

try:
    badContent = bsObj.nonExistentTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
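
An alternative to wrapping every chained lookup in try/except is to walk the chain one tag at a time and stop at the first missing one. The helper below is a sketch of that idea; find_nested is a hypothetical name and not part of BeautifulSoup. It uses the find method, which returns None when nothing matches.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow tag_names one level at a time; return None if any is missing."""
    current = root
    for name in tag_names:
        if current is None:
            # An earlier tag was missing, so there is nothing left to search.
            return None
        # find() returns the first matching descendant, or None.
        current = current.find(name)
    return current


# A made-up document used only for illustration.
soup = BeautifulSoup(
    "<html><body><div><img src='a.png'/></div></body></html>", "html.parser"
)
print(find_nested(soup, "body", "img"))
print(find_nested(soup, "nonExistentTag", "someTag"))
```

The second call returns None instead of raising AttributeError, so the caller needs only one check on the result.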

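At first these checks and error handlers look cumbersome, but the code can be reorganized so that it is easier to write (and, more importantly, easier to read). For example, the following is another way to write the crawler above: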


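from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # HTTPError: the server answered with an error status;
        # URLError: the server could not be reached at all
        print(e)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # As in the earlier examples, this picks the first <img> tag in the body
        title = bsObj.body.img
    except AttributeError as e:
        # bsObj.body was None, so .img could not be looked up
        return None
    return title

title = getTitle("http://www.baidu.com")
if title is None:
    print("Title could not be found")
else:
    print(title)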

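In this example we create a getTitle function that returns the tag it extracts from the page (here the first img tag in the body, despite the function's name), or a None object if something goes wrong while fetching the page. Inside getTitle we check for errors from urlopen as before, then wrap the two BeautifulSoup lines in a single try statement. Either of those lines can lead to an AttributeError; for example, bsObj.body is None when the page has no body tag, so bsObj.body.img fails. Note that an unreachable server does not make html a None object: urlopen raises URLError in that case, so that exception has to be caught as well if the crawler is to survive it (HTTPError is a subclass of URLError, so catching URLError covers both). We can put as many lines as we like inside the try statement, or call a function that may raise AttributeError at any point.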
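
For comparison, here is a sketch of a variant that returns the text of the page's <title> tag. The name get_page_title is hypothetical, and reading the <title> tag is an assumption about what "the page's title" means here. It checks for None explicitly instead of relying on AttributeError, and it catches URLError, which also covers HTTPError.

```python
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup


def get_page_title(url, timeout=10):
    """Return the text of the page's <title> tag, or None on any failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
    except URLError as e:
        # URLError also covers HTTPError, which is its subclass.
        print(e)
        return None

    soup = BeautifulSoup(raw, "html.parser")
    title_tag = soup.title  # None when the page has no <title> tag
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)
```

Keeping the network call and the parsing in separate steps makes it clear which failure produced the None result.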

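If BeautifulSoup is called without naming a parser, it issues a warning. The following two lines suppress all Python warnings, including that one:

import warnings
warnings.filterwarnings("ignore")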