Crawler Series Extra (3): Writing a Script Crawler from Scratch

taczeng

Category column: Crawlers for Beginners, from Getting Started to Mastery

Article link: https://blog.csdn.net/taczeng/article/details/102914879

Duration: 2h

I. The urllib library:

urllib.request opens and reads URLs. It works much like typing an address into a browser and pressing Enter: pass the URL and any other parameters to the library's functions and it performs the request for you.
urllib.error contains the exceptions raised by urllib.request. Catching them lets you retry or take other action so the program does not terminate unexpectedly.
urllib.parse parses URLs and provides helpers for splitting, parsing, joining, and encoding them.
urllib.robotparser parses robots.txt files, which lets you determine which URLs on a site may be crawled and which may not.
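
As a rough illustration of the two helper modules described above, the sketch below uses urllib.parse to split, join, and encode URLs, and urllib.robotparser to check crawl permissions. The robots.txt rules, the `my-crawler` user agent, the `from=demo` query string, and the example.com URLs are made up for the example; a real crawler would normally download the rules with `set_url()` and `read()`.

```python
from urllib.parse import urlsplit, urljoin, urlencode, parse_qs
from urllib.robotparser import RobotFileParser

# Split a URL into its components (the query string here is invented).
parts = urlsplit("https://blog.csdn.net/taczeng/article/details/102914879?from=demo")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Resolve a relative link against the page it was found on.
print(urljoin("https://blog.csdn.net/taczeng/article/", "details/102914879"))

# Encode a dict as a query string, then decode it again.
query = urlencode({"word": "hello", "page": 2})
print(query)            # word=hello&page=2
print(parse_qs(query))  # {'word': ['hello'], 'page': ['2']}

# Hypothetical robots.txt rules, parsed from a list instead of downloaded.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rules.can_fetch("my-crawler", "https://example.com/index.html"))  # True
print(rules.can_fetch("my-crawler", "https://example.com/private/a"))   # False
```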

1. Opening a web page the basic way

Suitable for GET requests.

import urllib.request
response = urllib.request.urlopen('https://blog.csdn.net/taczeng/article/category/8612467')
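
The call above returns a response object but does not handle failures. A minimal sketch of one way to combine it with urllib.error is shown below; the helper name `fetch`, the 10-second timeout, and the UTF-8 fallback are assumptions for illustration, not part of the original article.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {exc.code} for {url}")
    except urllib.error.URLError as exc:
        # No valid answer at all (DNS failure, refused connection, bad scheme).
        print(f"Failed to reach {url}: {exc.reason}")
    return None
```

Calling `fetch('https://blog.csdn.net/taczeng/article/category/8612467')` would return the page text, or print the failure reason and return None, so the caller can decide whether to retry.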

2. Passing the data parameter

Suitable for POST requests.

import urllib.parse
import urllib.request


def urllib_demo_post():
    """
    Passing the data argument makes urlopen send a POST request.
    The most common request methods are GET and POST.
    :return: None
    """
    # urlopen expects the request body as bytes, so encode the form fields first.
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
    print(data)
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read())
    # Alternative experiment, left commented out:
    # my_data = {"pv_referer": "https://search.jd.com/Search?keyword=%25E6%25B2%2599%25E5%258F%2591&enc=utf-8&wq=%25E6%25B2%2599%25E5%258F%2591&pvid=772b87bf47924bc89ed691dd510db543"}
    # my_data = bytes(urllib.parse.urlencode(my_data), encoding='utf8')
    # response = urllib.request.urlopen('https://blog.csdn.net/taczeng/article/category/8612467', data=my_data)
    # text = response.read().decode('utf-8')
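
To see what the data argument actually changes, the sketch below builds the encoded body and inspects a Request object without sending anything. The helper name `build_form_body` and the extra `q` field are assumptions for the example.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_form_body(fields):
    """Encode a dict as application/x-www-form-urlencoded bytes."""
    return urlencode(fields).encode("utf-8")


body = build_form_body({"word": "hello", "q": "沙发"})
print(body)  # b'word=hello&q=%E6%B2%99%E5%8F%91'

# Decoding the body gives the original fields back.
print(parse_qs(body.decode("utf-8")))

# Without data the request defaults to GET; with data it defaults to POST.
print(Request("http://httpbin.org/post").get_method())             # GET
print(Request("http://httpbin.org/post", data=body).get_method())  # POST
```

Non-ASCII values are percent-encoded as UTF-8 by default, which is why the body can safely be converted to bytes afterwards.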

3. Opening a web page by building a Request

from urllib.request import Request
from urllib.request import urlopen
from urllib.parse import urlencode

url = 'http://httpbin.org/post'
# Custom headers sent with the request; User-Agent identifies the client to the server.
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form rather than dict, to avoid shadowing the built-in dict type.
form = {'name': 'Germey'}
# The request body must be bytes.
data = bytes(urlencode(form), encoding='utf8')
req = Request(url=url, data=data, headers=headers, method='POST')
response = urlopen(req)
print(response.read().decode('utf-8'))
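
A Request object can also be inspected before it is sent, which is handy for checking headers while debugging. The following is only a sketch: the helper `build_request` and the `my-crawler/0.1` User-Agent string are made-up names, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, form, user_agent):
    """Build (but do not send) a POST request carrying form fields."""
    data = urlencode(form).encode("utf-8")
    headers = {"User-Agent": user_agent}
    return Request(url=url, data=data, headers=headers, method="POST")


req = build_request("http://httpbin.org/post", {"name": "Germey"}, "my-crawler/0.1")
print(req.get_method())              # POST
print(req.host)                      # httpbin.org
print(req.data)                      # b'name=Germey'
# Request stores header names capitalized, so look them up as 'User-agent'.
print(req.get_header("User-agent"))  # my-crawler/0.1
```

Passing the result to `urlopen(req)` would then perform the actual request.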

II. The requests library:

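The eight main functions of the requests library

Function	Description
requests.request()	Builds and sends a request; the functions below are all built on it
requests.get()	Sends a GET request to retrieve a page
requests.post()	Sends a POST request to submit data
requests.head()	Sends a HEAD request to retrieve only the response headers
requests.put()	Sends a PUT request to replace a resource
requests.options()	Sends an OPTIONS request to ask which methods the server supports
requests.patch()	Sends a PATCH request to partially modify a resource
requests.delete()	Sends a DELETE request to delete a resource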

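After a request completes, the server's reply is available as a Response object (r below). Its main attributes are:

Attribute	Description
r.status_code	HTTP status code of the response; 200 means the request succeeded
r.text	Response body as a string, that is, the returned page content
r.encoding	Encoding of the response body, guessed from the HTTP headers
r.apparent_encoding	Encoding of the response body, inferred from the content itself (a fallback encoding)
r.content	Response body as raw bytes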
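
The difference between r.encoding and r.apparent_encoding matters mostly for pages whose headers omit a charset, where requests falls back to ISO-8859-1 for text content and Chinese pages come out garbled. One possible way to handle this is sketched below; the helper name `best_text` is an assumption, and it works on any object exposing the three attributes it reads.

```python
def best_text(r):
    """Decode r.content, trusting the header encoding unless it is missing
    or is the ISO-8859-1 default that requests uses when no charset is sent."""
    encoding = r.encoding
    if encoding is None or encoding.lower() == "iso-8859-1":
        # Fall back to the encoding detected from the body itself.
        encoding = r.apparent_encoding or "utf-8"
    return r.content.decode(encoding, errors="replace")


# Typical use with requests (not executed here; the URL is a placeholder):
#   r = requests.get("http://example.com", timeout=10)
#   if r.status_code == 200:
#       print(best_text(r))
```

A common shorter variant is to set `r.encoding = r.apparent_encoding` and then read `r.text`, at the cost of always overriding the header.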

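import json  # only needed if variant 1 below is uncommented

import requests


def url_requests():
    """
    Fetch data with the requests library.
    Douban as an example: the page is rendered by JavaScript, so requesting the page URL itself does not return the data.
    Useful Response attributes: content (needed for binary files), status_code, text.
    POST request example: http://example.com
    :return: the Response object
    """
    # response = requests.get(
    #     'https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0')
    # requests.post('', {})
    url = "http://example.com"
    data = {
        'a': 1,
        'b': 2,
    }
    # 1. Serialize the dict to a JSON string yourself
    # response = requests.post(url, data=json.dumps(data))
    # 2. The json parameter serializes the dict to JSON automatically
    response = requests.post(url, json=data)
    return response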
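
The two variants in the function above are not quite equivalent. The sketch below prepares each request without sending it, so the body and Content-Type header can be compared directly; the helper `prepare_post` is a made-up name, and the example assumes the requests package is installed.

```python
import json

import requests

url = "http://example.com"  # placeholder URL, as in the function above
payload = {"a": 1, "b": 2}


def prepare_post(**kwargs):
    """Build a POST request and prepare it, without sending anything."""
    return requests.Request("POST", url, **kwargs).prepare()


as_form = prepare_post(data=payload)                   # form-encoded body
as_json_text = prepare_post(data=json.dumps(payload))  # variant 1
as_json = prepare_post(json=payload)                   # variant 2

for label, prepared in [
    ("data=dict", as_form),
    ("data=json.dumps(dict)", as_json_text),
    ("json=dict", as_json),
]:
    print(label, "|", prepared.headers.get("Content-Type"), "|", prepared.body)
```

With variant 1 the body is JSON text but no Content-Type header is set, so a server that checks the header may not parse it as JSON; with `json=` requests adds `Content-Type: application/json` itself.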