Python Web Crawling (1)

Latest recommended article published on 2023-01-29 20:55:46

Tbxsx


Category: python crawler. Tags: python, crawler

Article link: https://blog.csdn.net/shengge01/article/details/72797150


1. Using urllib.

Official documentation
This post uses Python 3.5. Note that in Python 3, urllib2 has been merged into urllib and split into urllib.request and urllib.error.
1. Basic usage without Request.

    import urllib.request
    response = urllib.request.urlopen("http://www.baidu.com")
    print(response.read())

Result: the raw HTML of the page, printed as bytes.
Signature of urlopen:

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The first argument is the URL and the second is the data to send to that URL.
The third sets the timeout; timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.
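
A minimal sketch of how the url and timeout arguments fit together, and of turning the returned bytes into text. The helper name fetch_text is hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch a URL and return the body decoded to str (hypothetical helper)."""
    # The response is a context manager, so the connection is closed for us.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)


# Example call (needs network access):
#     print(fetch_text("http://www.baidu.com")[:200])
```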
2. Usage with Request.
The same example, again with the Baidu home page:

    import urllib.request
    req = urllib.request.Request('http://www.baidu.com')
    res = urllib.request.urlopen(req)
    print(res.read())

Function signature:

urllib.request.Request(url, data=None, headers={},origin_req_host=None, unverifiable=False, method=None)

Note that it is print(res.read()), not print(res); otherwise the output is:

>>> print(res)
<http.client.HTTPResponse object at 0x7fb5f2365898>
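
A Request object is also where custom headers go. The sketch below builds two requests and inspects them without sending anything; the User-Agent string is a made-up value for illustration.

```python
import urllib.request

# Hypothetical User-Agent string; some sites reject urllib's default one.
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler/0.1)"}

get_req = urllib.request.Request("http://www.baidu.com", headers=headers)
post_req = urllib.request.Request("http://www.baidu.com", data=b"k=v", headers=headers)

# Nothing is sent until urlopen() is called, so the objects can be inspected offline.
print(get_req.get_method())              # GET
print(post_req.get_method())             # POST, because data is not None
print(get_req.get_header("User-agent"))  # header names are stored capitalized
```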

3. Setting the request data.

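import urllib.parse
import urllib.request

url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
# Form fields to submit; replace the placeholder password with a real string.
value = {"username": "shenge01", "password": "XXXXXX"}
# urlencode() builds the form body; urlopen() needs it as bytes.
data = urllib.parse.urlencode(value).encode('utf-8')
# Passing data turns this into a POST request.
req = urllib.request.Request(url, data)
res = urllib.request.urlopen(req)
print(res.read())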
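
To see what urlencode actually produces, the following sketch encodes a form and decodes it again. The field values are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

form = {"username": "alice", "password": "p@ss word"}  # made-up values
body = urlencode(form)          # percent-encodes special characters
payload = body.encode("utf-8")  # urlopen() requires bytes, not str

print(body)
# Decoding the body gives back the original fields (each value as a list).
print(parse_qs(body))
```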

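4. Setting a proxy:
I do not fully understand this yet, so here are some links for reference; I will update this part later.
python2: Jingmi (静觅)
python3: proxy
Official: examples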
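
As a rough sketch of what those references describe, urllib routes requests through a proxy via ProxyHandler and a custom opener. The proxy address below is a placeholder, not a working proxy.

```python
import urllib.request

# Placeholder proxy address; replace it with a proxy you are allowed to use.
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
})
opener = urllib.request.build_opener(proxy_handler)

# Use the opener directly:
#     response = opener.open("http://www.baidu.com", timeout=10)
# or make every later urlopen() call go through it:
#     urllib.request.install_opener(opener)
```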
5. Setting the timeout
The value is in seconds.

    import urllib.request
    response = urllib.request.urlopen("http://www.baidu.com", timeout=10)
    print(response.read())

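6. Errors:
1. URLError

Possible causes of a URLError:
No network connection, i.e. the machine cannot reach the internet
The specific server cannot be reached
The server does not exist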

from urllib import error
from urllib import request

# A host that should not resolve, to trigger a URLError.
req = request.Request('http://www.xxxxx.com')
try:
    request.urlopen(req)
except error.URLError as e:
    print(e.reason)

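2. HTTPError

HTTPError is a subclass of URLError. When you send a request with urlopen, the server answers with a response object that includes a numeric "status code".
Status codes:
    100: Continue. The client should continue with its request: send the remainder of the request, or ignore this response if the request has already been completed.

    101: Switching Protocols. After sending the final blank line of this response, the server switches to the protocols defined in the Upgrade header. This should only be done when switching to the new protocol is advantageous.

    102: Processing. A status code added by WebDAV (RFC 2518), indicating that processing will continue.

    200: OK, the request succeeded.    Handling: read the response body and process it

    201: Created. The request completed and a new resource was created; its URI is available in the response.    Handling: not encountered when crawling

    202: Accepted. The request was accepted but processing has not finished.    Handling: block and wait

    204: No Content. The server fulfilled the request but has no new information to return. A user agent does not need to update its document view.    Handling: discard

    300: Multiple Choices. Not used directly by HTTP/1.0 applications; it only serves as the default interpretation of 3XX responses. Several resources are available for the request.    Handling: process further if the program can handle it, otherwise discard
    301: Moved Permanently. The requested resource has been assigned a permanent URL, which should be used to access it in the future.    Handling: redirect to the assigned URL

    302: Found. The requested resource temporarily resides at a different URL.    Handling: redirect to the temporary URL

    304: Not Modified. The requested resource has not changed.    Handling: discard

    400: Bad Request.    Handling: discard

    401: Unauthorized.    Handling: discard

    403: Forbidden.    Handling: discard

    404: Not Found.    Handling: discard

    500: Internal Server Error. The server hit an unexpected condition that prevented it from completing the request. This usually points to a bug in the server-side code.

    501: Not Implemented. The server does not support a feature the request needs, for example when it does not recognize the request method and cannot support it for any resource.

    502: Bad Gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to fulfil the request.

    503: Service Unavailable. The server cannot handle the request right now because of temporary maintenance or overload. The condition is temporary and will clear after some time.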
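
The standard library already knows the official name of each status code, which helps when logging crawler failures. This is a small sketch using http.HTTPStatus; the helper name describe is hypothetical.

```python
from http import HTTPStatus


def describe(code):
    """Return 'code phrase' for a known status code (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return "{} (unknown status code)".format(code)
    return "{} {}".format(status.value, status.phrase)


for code in (200, 301, 403, 404, 503):
    print(describe(code))
```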

Example:


from urllib import request
from urllib import error

req = request.Request("http://blog.csdn.net/cqcre")
try:
    res = request.urlopen(req)
    print(res.read())
except error.HTTPError as e1:
    if hasattr(e1, "reason"):
        print("Reason:")
        print(e1.reason)

    if hasattr(e1, "code"):
        print("Code:")
        print(e1.code)

except error.URLError as e2:
    print(e2.reason)


Reason:
Forbidden
Code:
403
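
The same pattern can be wrapped in a helper so that callers never see an exception. This is only a sketch: the name safe_fetch and the (body, message) return convention are assumptions made for illustration. HTTPError must be caught before URLError because it is the subclass.

```python
from urllib import error, request


def safe_fetch(url, timeout=10):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        with request.urlopen(url, timeout=timeout) as res:
            return res.read(), None
    except error.HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        return None, "HTTP {}: {}".format(e.code, e.reason)
    except error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, bad scheme...
        return None, "URL error: {}".format(e.reason)


# Example call (needs network access):
#     body, message = safe_fetch("http://blog.csdn.net/cqcre")
```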

7. Using cookies
I have not used them in practice yet, so refer to Jingmi (静觅).
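
For orientation, this is roughly how cookies are handled with the standard library alone: a CookieJar stores them and an HTTPCookieProcessor attaches the jar to an opener. It is a sketch, not taken from the referenced article.

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from responses and sends them back on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# After a real request the jar can be inspected, for example:
#     opener.open("http://www.baidu.com", timeout=10)
#     for cookie in jar:
#         print(cookie.name, cookie.value)
print(len(jar))  # 0 before any request has been made
```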

2. Requests

Using urllib is really painful. Let's start with sending a request.
Official example:

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

In one word: concise.
1. Sending requests:
With urllib, methods such as HEAD and DELETE have no dedicated helper and must be configured separately, but in Requests:

>>> r = requests.get('https://api.github.com/events')
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('http://httpbin.org/get', params=payload)
>>> print(r.url)
http://httpbin.org/get?key2=value2&key1=value1
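
To check how params ends up in the URL without touching the network, a request can be prepared but not sent. A small sketch, assuming the requests package is installed:

```python
import requests

payload = {"key1": "value1", "key2": "value2"}
# prepare() builds the final request without sending it.
prepared = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://httpbin.org/get?key1=value1&key2=value2
```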

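The methods are simple, and query parameters can be sent with get through params.
Official examples
2. Authentication
Requests provides convenient authentication helpers: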
Basic Authentication

>>> from requests.auth import HTTPBasicAuth
>>> requests.get('https://api.github.com/user', auth=HTTPBasicAuth('user', 'pass'))
<Response [200]>

Because this is so common, it can be shortened to:

>>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
<Response [200]>

netrc Authentication

Digest Authentication

>>> from requests.auth import HTTPDigestAuth
>>> url = 'http://httpbin.org/digest-auth/auth/user/pass'
>>> requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
<Response [200]>

OAuth 1 Authentication:

>>> import requests
>>> from requests_oauthlib import OAuth1

>>> url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
>>> auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
...               'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')

>>> requests.get(url, auth=auth)
<Response [200]>

OAuth 2 and OpenID Connect Authentication:
This requires requests-oauthlib. I have not used it, so I only point to its official site; look it up if you are interested.
3. Cookies
Getting cookies:

>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']
'example_cookie_value'

Sending cookies:

>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'

r.cookies is a RequestsCookieJar object, and a RequestsCookieJar can be used like a dictionary. Cookies can therefore be set like this:

>>> jar = requests.cookies.RequestsCookieJar()
>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
>>> url = 'http://httpbin.org/cookies'
>>> r = requests.get(url, cookies=jar)
>>> r.text
'{"cookies": {"tasty_cookie": "yum"}}'

The returned Response can be read in several forms:
Response Content:

>>> import requests

>>> r = requests.get('https://api.github.com/events')
>>> r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/..

Binary Response Content:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

Json:

>>> import requests

>>> r = requests.get('https://api.github.com/events')
>>> r.json()
[{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...

Raw Response Content:

>>> r = requests.get('https://api.github.com/events', stream=True)

>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>

>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

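This returns the raw bytes.
4. Session objects
In the requests above we called requests.get/post and similar functions directly. Each such call is a brand-new request with no connection to the previous one, as if the browser had been reopened each time. For example: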

import requests

requests.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = requests.get("http://httpbin.org/cookies")
print(r.text)

{
  "cookies": {}
}

The cookies are empty: by the time of the second request, the cookie from the first request is already gone.


s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

With a session, the cookies are kept between requests.
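
A session also carries headers along with its cookies. The sketch below inspects what a session would send without making a real request; the User-Agent string is made up, and the cookie is set by hand to stand in for one received from a server.

```python
import requests

s = requests.Session()
s.headers.update({"User-Agent": "demo-crawler/0.1"})  # hypothetical UA string
s.cookies.set("sessioncookie", "123456789")           # stands in for a server-set cookie

# prepare_request() merges the session state into the request without sending it.
prepared = s.prepare_request(requests.Request("GET", "http://httpbin.org/cookies"))
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```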
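5. Uploading files
Suppose the HTML form has a file input named images that accepts multiple files.
Several files can then be uploaded under that field name like this:

>>> url = 'http://httpbin.org/post'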
>>> multiple_files = [
    ('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),
    ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]
>>> r = requests.post(url, files=multiple_files)
>>> r.text
{
  ...
  'files': {'images': 'data:image/png;base64,iVBORw ....'}
  'Content-Type': 'multipart/form-data; boundary=3131623adb2043caaeb5538cc7aa0b3a',
  ...
 }
To POST a single file as the request body, pass the open file object to the data parameter:

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

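7. Setting a proxy
Simply use the proxies parameter of Requests:

import requests

# One entry per URL scheme; the request below is plain http, so an "http" entry is needed.
proxies = {
  "http": "http://41.118.132.69:4433",
  "https": "http://41.118.132.69:4433"
}
r = requests.post("http://httpbin.org/post", proxies=proxies)
print(r.text)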

3. BeautifulSoup

1. Installing and getting to know BeautifulSoup
BeautifulSoup now lives in the bs4 package, so the correct way to import it is:

from bs4 import BeautifulSoup

Installation:
1. On Debian or Ubuntu systems it can be installed directly with apt:

$ apt-get install python-bs4 (for Python 2)
$ apt-get install python3-bs4 (for Python 3)

2. On other systems, install it with pip or easy_install

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

3. One more option
Download the source and install it from the download directory:

python setup.py install

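What is BeautifulSoup?
The official introduction:

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.
Let's first get a feel for what BeautifulSoup can do: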

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
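
For a crawler, the usual next step is to pull the link targets out of the tags. A minimal sketch on a shortened, made-up version of the document above:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a name="anchor-without-href">skip me</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```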

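Installing a parser:
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with one of the following:

$ apt-get install python-lxml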
$ easy_install lxml
$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a browser does. It can be installed with one of the following:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib

More parser comparisons
Note that:
BeautifulSoup completes and repairs the document tree
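
A quick way to see this repair in action is to feed the parser an unclosed tag. The sketch uses the built-in html.parser; other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph"
soup = BeautifulSoup(broken, "html.parser")

# The missing closing tag is added when the tree is serialized.
print(soup)
```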
The kinds of objects in BeautifulSoup:
1. Tag objects:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Here each of title, a, and p is a tag. Every tag has two main properties, its name and its attributes:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

1.1 Name
This is the name of the tag:
    >>> tag.name
    # u'b'
    >>> tag.name = "blockquote"
    >>> tag
    # <blockquote class="boldest">Extremely bold</blockquote>
As shown, changing the name changes the structure of the soup.
1.2 Attributes
    A tag can have any number of attributes. The tag <b class="boldest"> has a "class" attribute whose value is "boldest". Attributes are accessed the same way as dictionary items:
    >>> tag['class']
    # ['boldest']
    >>> tag['class'] = 'verybold'
    >>> tag['id'] = 1
    >>> tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    >>> del tag['class']
    >>> del tag['id']
    >>> tag
    # <blockquote>Extremely bold</blockquote>

    >>> tag['class']
    # KeyError: 'class'
    >>> print(tag.get('class'))
    # None
    For a multi-valued attribute such as class, the value returned is a list:
    >>> css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
    >>> css_soup.p['class']
    # ["body", "strikeout"]
    >>> css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
    >>> css_soup.p['class']
    # ["body"]
2. The BeautifulSoup object. It can be treated as a tag that represents the whole document

    >>> soup.name
    # u'[document]'

3. NavigableString
A string that can be navigated; it is obtained through the .string attribute of a tag

    >>> print(soup.p.string)
    The Dormouse's story

4. Comment and other types

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> comment = soup.b.string
>>> print(comment)
# Hey, buddy. Want to buy a used parser?
>>> type(comment)
# <class 'bs4.element.Comment'>

As shown, soup.b.string can also return a Comment.
A Comment object is a special kind of NavigableString, and its printed content does not include the comment markers.
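
Because a Comment prints like ordinary text, a crawler that collects strings usually has to filter comments out explicitly. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<p><!--hidden note-->visible text</p>", "html.parser")

# Every text node is a NavigableString; Comment is a subclass of it.
texts = soup.find_all(string=True)
visible = [t for t in texts if not isinstance(t, Comment)]
comments = [t for t in texts if isinstance(t, Comment)]
print(visible)
print(comments)
```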

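3. Traversing the document tree
(1) Accessing child nodes:
Attributes:

.contents and .children: the direct children; the former is a list, the latter is an iterator rather than a list

.descendants: all descendant nodes
(2) Node content
.string: the content of a single node; it may be a NavigableString, a Comment, and so on. If a tag has more than one child, it is unclear which child .string should refer to, so .string is None.
.strings and .stripped_strings: all the strings inside a node; the latter is the former with blank lines and surrounding whitespace removed
(3) Parent nodes
.parent: the direct parent
.parents: all ancestors
(4) Sibling nodes
.next_sibling and .previous_sibling: the next and the previous sibling
.next_siblings and .previous_siblings: all following and all preceding siblings
(5) Next and previous elements
.next_element and .previous_element: the node parsed immediately after or before this one; it is not necessarily a sibling and may be a child or a parent
For example: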

>>> ma = "<head><title>The Dormouse's story</title></head>"
>>> soup = BeautifulSoup(ma,"lxml")
>>> print(soup.head.next_element)
    #<title>The Dormouse's story</title>
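
The navigation attributes listed above can be tried together on a tiny document. This sketch uses made-up markup and the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>One</p><p>Two <b>bold</b></p></div>", "html.parser")
div = soup.div
first, second = div.contents                    # direct children, as a list

print([child.name for child in div.children])   # ['p', 'p']
print(len(list(div.descendants)))               # 6: p, 'One', p, 'Two ', b, 'bold'
print(first.string)                             # One
print(second.string)                            # None: more than one child
print(list(div.stripped_strings))               # ['One', 'Two', 'bold']
print(first.next_sibling is second)             # True
print([p.name for p in soup.b.parents])         # ['p', 'div', '[document]']
print(repr(first.next_element))                 # 'One', the first node parsed after <p>
```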