Web Crawler, Session 8: Task 1

Latest recommended article published on 2022-03-16 18:16:59

weixin_44593278

Category: datawhale

Article link: https://blog.csdn.net/weixin_44593278/article/details/98613421

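Crawler basics

GET and POST requests

GET and POST are two HTTP request methods that a local machine uses to talk to a server. In a crawler, a script imitates the local machine: it sends requests to the server and parses the HTML that comes back. The two methods are used differently.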

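Differences

GET is mostly used for searching and sorting, where the goal is to retrieve data. POST is used to modify or write data.
POST keeps the submitted data in the request body, so it does not appear in the URL or get saved in browser history the way GET parameters do. That makes it less exposed, but the body is still sent in plain text unless the connection uses HTTPS.
GET responses can be cached by default, while POST responses normally are not.
Both methods carry request headers (many servers inspect them to check that the request comes from a browser). What is specific to a request with a body, such as POST, is that the client may send the headers first with Expect: 100-continue, wait for the server to confirm, and only then send the data.
POST can carry more data than GET. HTTP itself sets no limit on URL length; the cap on GET parameters comes from browsers and servers, so the often-quoted 1024-byte figure is not part of the protocol.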
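The difference in where the data travels can be checked without sending anything. The sketch below builds two urllib Request objects and inspects them; the search path /s and the wd and pn parameters are only illustrative assumptions, not taken from the notes above.

```python
from urllib import parse, request

# Illustrative parameters (assumed for this sketch)
params = {'wd': 'python', 'pn': '10'}
query = parse.urlencode(params)  # 'wd=python&pn=10'

# GET: the parameters are appended to the URL, and there is no body
get_req = request.Request('https://www.baidu.com/s?' + query)

# POST: the same parameters are encoded to bytes and placed in the body
post_req = request.Request('https://www.baidu.com/s', data=query.encode('utf-8'))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Nothing is sent over the network here; the Request objects only describe what would be sent.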

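Request flow for each method

GET

(1) The browser requests a TCP connection
(2) The server accepts the TCP connection
(3) The browser acknowledges and sends the GET request line and headers (a GET normally has no body; its parameters travel in the URL)
(4) The server returns a 200 OK response

POST

(1) The browser requests a TCP connection
(2) The server accepts the TCP connection
(3) The browser acknowledges and sends the POST request headers
(4) The server returns a 100 Continue response
(5) The browser sends the data
(6) The server returns a 200 OK response

Steps (4) and (5) are separate only when the client sends an Expect: 100-continue header. Otherwise the headers and the data go out together in step (3) and the server answers directly with its final response.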
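To make step (3) more concrete, here is a minimal sketch of the bytes a client puts on the wire once the TCP connection is open. The helper build_request is hypothetical and written only for this illustration, and the path and parameter are assumed; real clients send more headers than this.

```python
def build_request(method, path, host, body=b''):
    """Assemble a bare-bones HTTP/1.1 request message (illustrative helper)."""
    lines = ['{} {} HTTP/1.1'.format(method, path), 'Host: {}'.format(host)]
    if body:
        # A request with a body must say how long it is and how to parse it
        lines.append('Content-Type: application/x-www-form-urlencoded')
        lines.append('Content-Length: {}'.format(len(body)))
    # A blank line separates the headers from the body
    head = '\r\n'.join(lines) + '\r\n\r\n'
    return head.encode('ascii') + body

get_msg = build_request('GET', '/s?wd=python', 'www.baidu.com')
post_msg = build_request('POST', '/s', 'www.baidu.com', b'wd=python')

print(get_msg.decode())
print(post_msg.decode())
```

The GET message ends at the blank line, while the POST message continues with the form data after it.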

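Common HTTP status codes

100: Continue; the server has received the request headers and the client should go on sending
200: the request succeeded
301: the resource has moved permanently to another URL (redirect)
404: the resource does not exist (a client-side error; a malformed request is reported as 400 instead)
500: internal server error (a server-side error)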
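The standard library already knows these codes, so they do not have to be memorized. A small sketch using http.HTTPStatus that prints the reason phrase for each code listed above and groups it by its first digit:

```python
from http import HTTPStatus

# First digit of the code identifies the class of response
classes = {1: 'informational', 2: 'success', 3: 'redirection',
           4: 'client error', 5: 'server error'}

phrases = {}
for code in (100, 200, 301, 404, 500):
    status = HTTPStatus(code)
    phrases[code] = status.phrase
    print(code, status.phrase, '-', classes[code // 100])
```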

Making a request to Baidu with urllib

GET request

from urllib import request

# GET request: urlopen() with only a URL sends a GET
url = 'https://www.baidu.com/'
resp = request.urlopen(url)
# read() returns the raw response body as bytes
html = resp.read()
print(html)

The result is:

b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

read() returns the data as bytes, so call decode() on it to get a string. The result then becomes:

<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

If the network is down, the request fails with:

urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
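A crawler usually should not crash on this error. The following is a minimal sketch, assuming it is acceptable to return None when the request fails; the function name fetch and the 5-second default are choices made for this example.

```python
from urllib import error, request

def fetch(url, timeout=5):
    """Return the decoded page body, or None if the request cannot be completed."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except error.URLError as exc:
        # Covers DNS failures, refused connections and HTTP error responses
        print('request failed:', exc.reason)
        return None
```

Calling fetch('https://www.baidu.com/') returns the page text when the network is available and None when it is not. Pages that are not UTF-8 encoded would need a different codec in decode().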

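Request headers

urlopen() sends a GET request by default. If a second argument, data, is passed to urlopen(), it sends a POST request instead. The data must be bytes. The third argument, timeout, sets a timeout in seconds; if the request takes longer, an exception is raised.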
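This behaviour can be tried without touching a real site. The sketch below starts a throwaway echo server on 127.0.0.1 (an assumption of this example, not part of the original notes) and calls urlopen() once without data and once with it.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse, request

class EchoHandler(BaseHTTPRequestHandler):
    """Replies with the method it received, plus the path or the body."""

    def _reply(self, text):
        payload = text.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._reply('GET ' + self.path)

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        self._reply('POST ' + self.rfile.read(length).decode('utf-8'))

    def log_message(self, format, *args):
        pass  # keep the console quiet

# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:{}/'.format(server.server_port)

# No data argument: urlopen() sends a GET
get_text = request.urlopen(url, timeout=5).read().decode()

# data must be bytes, so urlencode the form fields and then encode the string
data = parse.urlencode({'wd': 'python'}).encode('utf-8')
post_text = request.urlopen(url, data, timeout=5).read().decode()

print(get_text)
print(post_text)

server.shutdown()
server.server_close()
```

The only difference between the two calls is the data argument, which is what switches the method.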

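Adding a header

Requests made through urllib carry a default header, "User-Agent": "Python-urllib/3.6" (the suffix is the Python version), which tells the server that the request was sent by urllib.
For sites that check the User-Agent, we need to define a custom request header.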

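from urllib import request

# GET request with a custom request header
# (no data argument is passed, so this is still a GET, not a POST)

url = 'https://www.baidu.com/'
header = {'User-Agent': 'python:chrome'}

# Wrap the URL and headers in a Request object, then open it
req = request.Request(url, headers=header)
resp = request.urlopen(req)
html = resp.read().decode()
print(html)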

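Some headers deserve special attention because servers check them, for example:

User-Agent: some servers or proxies use this value to decide whether the request came from a browser.
Content-Type: when using a REST interface, the server checks this value to decide how to parse the content of the HTTP body.
Referer: servers sometimes check it for hotlink protection.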
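The headers urllib will send can be inspected before any request goes out. A small sketch follows; the User-Agent string and the Referer value are placeholders chosen for illustration.

```python
from urllib import request

# The default opener adds a single header identifying urllib
default_headers = request.build_opener().addheaders
print(default_headers)

# Custom headers are attached to the Request object
req = request.Request('https://www.baidu.com/', headers={
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',  # placeholder value
    'Referer': 'https://www.baidu.com/',              # placeholder value
})

# urllib stores header names capitalized, e.g. 'User-agent'
print(req.header_items())
print(req.get_header('User-agent'))
```

Note that get_header() expects the capitalized form of the name, so 'User-agent' works where 'User-Agent' does not.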

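Crawler task: Douban Movie Top 250

Regular expressions

I already know the re module from earlier study, so I am skipping the notes and going straight to the code: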

import csv
import re

import requests

# [\u4e00-\u9fa5] matches Chinese characters

# url = 'https://movie.douban.com/top250'

def movie_Top250(url, writer):
    """Fetch one page of the Top 250 list and write its rows with the given csv writer."""
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/545.31'}

    resp = requests.get(url, headers=header)
    ranks = re.findall(r'<em class="">(\d+)</em>', resp.text, re.S)
    names = re.findall(r'<span class="title">([\u4e00-\u9fa5]+/?)</span>', resp.text, re.S)
    countries = re.findall(r'/&nbsp;([\u4e00-\u9fa5]+/?)&nbsp;/&nbsp', resp.text, re.S)
    # Remove the "导演：" (director) label so the next pattern captures only what follows it
    text = re.sub('导演：', "", resp.text)
    directors = re.findall(r'<p class="">(.*?)&nbsp;&nbsp;&nbsp', text, re.S)
    scores = re.findall(r'<span class="rating_num" property="v:average">(.*?)</span>', resp.text, re.S)

    # zip() stops at the shortest list, so all five lists are assumed to line up
    for rank, name, country, director, score in zip(ranks, names, countries, directors, scores):
        writer.writerow([rank, name, country, director, score])

if __name__ == '__main__':

    # The with block closes the file even if a request fails part-way through
    with open('D:/data learn/practice/untitled/movie.csv', 'w', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['rank', 'name', 'country', 'director', 'score'])

        # 250 movies at 25 per page: start = 0, 25, ..., 225
        for i in range(0, 250, 25):
            url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
            movie_Top250(url, writer)
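The patterns can be tried offline before running them against the site. The sketch below applies three of them to a hand-written HTML fragment; the fragment is an assumption shaped to fit the patterns, with placeholder scores, and is not a copy of Douban's real markup.

```python
import re

# Hand-written fragment for illustration only (not real page source)
sample = '''
<em class="">1</em>
<span class="title">肖申克的救赎</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
<span class="rating_num" property="v:average">9.0</span>
<em class="">2</em>
<span class="title">霸王别姬</span>
<span class="rating_num" property="v:average">8.0</span>
'''

ranks = re.findall(r'<em class="">(\d+)</em>', sample)
# Only spans made up entirely of Chinese characters match, so the English title is skipped
names = re.findall(r'<span class="title">([\u4e00-\u9fa5]+/?)</span>', sample)
scores = re.findall(r'<span class="rating_num" property="v:average">(.*?)</span>', sample)

rows = list(zip(ranks, names, scores))
for row in rows:
    print(row)
```

Comparing len(ranks), len(names) and len(scores) before zipping is a cheap way to notice when one pattern missed an entry on a page.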

The resulting movie.csv has problems: