Learning web scraping in Python 3: urllib

Latest recommended article published 2024-08-17 14:15:26

yangxiaodong88

Category: web scraping

Article link: https://blog.csdn.net/yangxiaodong88/article/details/80729603

Let's talk about urllib, and forget about urllib2

urllib in Python 3 is not the same as in Python 2: the Python 3 urllib package merges the urllib and urllib2 modules of Python 2.

urllib in Python 3 compared with Python 2

Python 2.x: import urllib2 -> Python 3.x: import urllib.request, urllib.error
Python 2.x: import urllib -> Python 3.x: import urllib.request, urllib.error, urllib.parse
Python 2.x: import urlparse -> Python 3.x: import urllib.parse
Python 2.x: urllib2.urlopen -> Python 3.x: urllib.request.urlopen
Python 2.x: urllib.urlencode -> Python 3.x: urllib.parse.urlencode
Python 2.x: urllib.quote -> Python 3.x: urllib.parse.quote
Python 2.x: cookielib.CookieJar -> Python 3.x: http.cookiejar.CookieJar
Python 2.x: urllib2.Request -> Python 3.x: urllib.request.Request
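
To make the mapping concrete, here is a minimal sketch that touches each Python 3 location once. The URL http://example.com and the sample strings are placeholders chosen for illustration, not values from the original article, and nothing is sent over the network.

```python
# Python 3 homes of names that lived in urllib, urllib2, urlparse and cookielib in Python 2.
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request

# urllib.quote / urllib.urlencode (Python 2) -> urllib.parse
quoted = urllib.parse.quote("a b/c")           # "/" is left unescaped by default
query = urllib.parse.urlencode({"q": "a b"})   # spaces become "+"

# urlparse.urlparse (Python 2) -> urllib.parse.urlparse
parts = urllib.parse.urlparse("http://example.com/path?x=1")  # placeholder URL

# cookielib.CookieJar (Python 2) -> http.cookiejar.CookieJar
# urllib2.build_opener / urllib2.HTTPCookieProcessor (Python 2) -> urllib.request
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

print(quoted, query, parts.netloc, parts.query, len(jar))
```

The opener built here would store cookies in jar on later opener.open(...) calls; it is only constructed, not used, so the sketch runs offline.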

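Everything below covers urllib in Python 3.

The official documentation is the best learning resource:
https://docs.python.org/3/library/urllib.request.html#module-urllib.request

Getting the response headers and the page content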

import urllib.request
import random

# List of User-Agent strings, used to make the crawler look like a browser
ua_agent = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
]

# Pick one User-Agent at random
user_agent = random.choice(ua_agent)
header = {'User-Agent': user_agent}
url = "http://www.baidu.com/"
# Build a request object
request = urllib.request.Request(url, headers=header)
# request.add_header("User-Agent", user_agent)
# Request stores header names capitalized, hence "User-agent"
print(request.get_header("User-agent"))
response = urllib.request.urlopen(request)
text = response.read().decode("utf8")
print(text)
print("=" * 80)
print(response.getcode())  # status code, 200 on success
print("=" * 80)
print(response.geturl())  # the URL that was actually retrieved
print("=" * 80)
print(response.info())  # response headers sent by the server
print("=" * 80)

Modules in the urllib package
urllib.request: sends HTTP requests
urllib.error: the exceptions raised while making a request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files
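
The two modules that the rest of this article does not exercise, urllib.robotparser and urllib.error, can be sketched as follows. The robots.txt rules, the crawler name my-crawler, the example.com URLs and the helper try_open are all made up for illustration; the rules are parsed from memory, so no request is sent.

```python
import urllib.error
import urllib.request
import urllib.robotparser

# urllib.robotparser: parse rules held in memory (hypothetical rules).
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_lines)
allowed_public = parser.can_fetch("my-crawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "http://example.com/private/page.html")
print(allowed_public, allowed_private)


def try_open(url):
    """Hypothetical helper: return the body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable answer: DNS failure, refused connection, unknown scheme, ...
        print("URL error:", exc.reason)
    return None
```

HTTPError is a subclass of URLError, which is why it is caught first. For a real site, parser.set_url(...) followed by parser.read() would download the actual robots.txt instead of using in-memory lines.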

From the documentation

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)



class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

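These two work best together. urlopen accepts a URL (and data) directly, but it has no parameter for request headers. Passing a Request object to urlopen is the better approach, and it also lets you add headers dynamically.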

 request = urllib.request.Request(url, headers=header)
 request.add_header("User-Agent", user_agent)
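
As a small illustration of what a Request object holds before it is sent, the sketch below builds two requests and inspects them without any network traffic. The URL, the demo-agent/1.0 string and the form field are placeholders, not values from the original article.

```python
import urllib.parse
import urllib.request

url = "http://example.com/search"  # placeholder URL
request = urllib.request.Request(url, headers={"User-Agent": "demo-agent/1.0"})
request.add_header("Accept-Language", "zh-CN")

# Header names are stored capitalized ("User-agent"), so look them up in that form.
print(request.get_header("User-agent"))
print(request.has_header("Accept-language"))
print(request.get_method())  # GET, because no data was supplied

# Supplying data (as bytes) makes the request a POST unless method= says otherwise.
body = urllib.parse.urlencode({"spam": 1}).encode("ascii")
post_request = urllib.request.Request(url, data=body)
print(post_request.get_method())
```

Passing either object to urllib.request.urlopen would then send it with the headers and method shown.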

response

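The object returned by urllib.request.urlopen() offers these methods for basic response information:
- geturl(): the URL that was actually retrieved (it can differ from the requested one after a redirect)
- info(): the headers returned by the server
- getcode(): the HTTP status code
- read(): the response body as bytes; call decode() to get text

Since Python 3.9, geturl(), info() and getcode() are deprecated in favor of the url, headers and status attributes.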
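
A minimal sketch of reading a response with these methods follows. The helper name fetch_text and its default_charset parameter are hypothetical. A data: URL, which urlopen also accepts, stands in for a real web page so that the sketch runs without network access.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Hypothetical helper: fetch url and return (final_url, headers, text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not str
        # Prefer the charset declared in the Content-Type header, if there is one.
        charset = response.headers.get_content_charset() or default_charset
        return response.geturl(), response.info(), raw.decode(charset)


# A data: URL keeps the sketch runnable offline; a real crawl would pass an http(s) URL.
final_url, headers, text = fetch_text("data:text/plain;charset=utf-8,hello")
print(final_url)
print(headers["Content-Type"])
print(text)
```

Decoding with the declared charset, rather than a hard-coded "utf8", avoids errors on pages served in another encoding, although pages that declare their charset only inside the HTML would still need extra handling.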

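When parameters are passed in a GET request they must be URL-encoded: the dictionary is converted into the query-string form that the URL expects.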

import urllib.parse

wd = {'spam': 1, 'eggs': 2, 'bacon': 0}
wd = urllib.parse.urlencode(wd)
print(wd)  # spam=1&eggs=2&bacon=0
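
The same call also handles non-ASCII values, which matters for Chinese search keywords. The sketch below is illustrative: the pn and tag parameters are sample values, and parse_qs is used only to show that the encoding round-trips.

```python
import urllib.parse

params = {"kw": "权力的游戏", "pn": 50}
query = urllib.parse.urlencode(params)  # non-ASCII text is percent-encoded as UTF-8
full_url = "http://tieba.baidu.com/f?" + query

# parse_qs reverses the encoding; every value comes back as a list of strings.
decoded = urllib.parse.parse_qs(query)

# doseq=True expands a sequence value into repeated keys.
tags = urllib.parse.urlencode({"tag": ["a", "b"]}, doseq=True)

print(full_url)
print(decoded)
print(tags)
```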

Using urllib to crawl pages from Baidu Tieba (the page data is not processed yet)

import urllib.parse
import urllib.request


def loadPage(url):
    """Download one page and return its content as text."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"}
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read().decode("UTF-8")


def writePage(html, filename):
    """Save the page text to a file."""
    with open(filename, 'w', encoding="utf8") as f:
        f.write(html)


def spider(url, beginPage, endPage):
    """Download pages beginPage..endPage (inclusive) and save each to its own file."""
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50  # offset passed as the pn parameter
        filename = "page_" + str(page)
        # Build the page URL from the unchanged base URL; reassigning url here
        # would pile up one more "&pn=" parameter on every iteration
        pageUrl = url + "&pn=" + str(pn)
        html = loadPage(pageUrl)
        writePage(html, filename)
        print("Finished page " + str(page))


if __name__ == '__main__':
    kw = "权力的游戏"
    beginPage = 1
    endPage = 10
    url = "http://tieba.baidu.com/f?"
    data = urllib.parse.urlencode({"kw": kw})
    url = url + data
    spider(url, beginPage, endPage)
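
One detail worth isolating from a crawler like this is how the per-page URLs are built: each one should start from the unchanged base URL, otherwise every iteration appends another pn parameter. A minimal sketch of that step on its own, with the hypothetical names build_page_urls, BASE_URL and POSTS_PER_PAGE and a sample keyword, could look like this:

```python
import urllib.parse

BASE_URL = "http://tieba.baidu.com/f?"
POSTS_PER_PAGE = 50  # offset step, matching pn = (page - 1) * 50


def build_page_urls(keyword, begin_page, end_page):
    """Return one URL per page, each built from the unchanged base URL."""
    urls = []
    for page in range(begin_page, end_page + 1):
        query = urllib.parse.urlencode(
            {"kw": keyword, "pn": (page - 1) * POSTS_PER_PAGE}
        )
        urls.append(BASE_URL + query)
    return urls


for page_url in build_page_urls("python", 1, 3):
    print(page_url)
```

Keeping URL construction in a pure function like this makes it easy to check the generated addresses before any request is sent.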