Python Crawler (1): Using the urllib Library to Scrape Lagou (拉勾网) Data

姑苏冷

Article link: https://blog.csdn.net/A7_A8_A9/article/details/107299437

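1: Introduction

urllib is the most basic network request library in Python. It can imitate a browser: it sends a request to a given server and saves the data the server returns.

2: The urlopen function

In Python 3's urllib, all of the functions related to network requests are gathered in the urllib.request module.

The function signature is: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)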

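1. url: the location of the target resource on the network. It can be a URL string (for example http://www.pythontab.com/) or a urllib.request.Request object.

2. data: extra data to send to the server with the request (for example, the text submitted to an online translator or an online quiz). It defaults to None, in which case the request is sent as GET; when data is supplied, the request is sent as POST. The value has to be bytes (or a file-like object or an iterable of bytes), not str.

3. timeout: how long to wait for the site, in seconds, before giving up.

4. cafile, capath, cadefault: used to supply trusted CA certificates for HTTPS requests (rarely needed). These were deprecated in Python 3.6 and removed in Python 3.13 in favor of context.

5. context: an ssl.SSLContext object that controls the SSL/TLS settings of the connection (rarely needed).

Calling urlopen() from urllib.request fetches the page. The data returned by read() is of type bytes and has to be converted to str with decode().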
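The GET/POST switch and the bytes-to-str step described above can be tried without touching a real site. The following is only a sketch: EchoHandler and the local server on 127.0.0.1 are hypothetical stand-ins added for illustration, and the ProxyHandler({}) line is there only so that a system proxy does not interfere with the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse, request


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: reports the HTTP method and body it received."""

    def _reply(self, body=b""):
        text = f"{self.command} {body.decode('utf-8')}".strip()
        payload = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._reply()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(self.rfile.read(length))

    def log_message(self, *args):  # keep the console quiet
        pass


# Ignore any system proxy settings for this local experiment
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# No data argument: urlopen sends a GET request
with request.urlopen(url, timeout=5) as resp:
    raw = resp.read()               # bytes
    get_text = raw.decode("utf-8")  # str

# data must be bytes; supplying it switches the request to POST
form = parse.urlencode({"kw": "python"}).encode("utf-8")
with request.urlopen(url, data=form, timeout=5) as resp:
    post_text = resp.read().decode("utf-8")

server.shutdown()
server.server_close()
print(type(raw).__name__, "|", get_text, "|", post_text)
```

Against a real site only the two urlopen calls are needed; everything else exists to make the sketch self-contained.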

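2.1 Methods of the returned object
The object returned by urlopen provides the following methods:

read(), readline(), readlines(), close(): read from and close the HTTPResponse object

info(): returns an HTTPMessage object holding the response headers sent by the remote server

getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found

geturl(): returns the URL of the resource that was actually retrieved, which can differ from the requested URL after a redirect

Since Python 3.9, info(), getcode() and geturl() are deprecated in favor of the headers, status and url attributes.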
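A small sketch of these methods in use follows. LinesHandler and the local server are hypothetical stand-ins, assumed here only so the calls have something to talk to; the calls on resp are the ones listed above.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class LinesHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves three short lines of text."""

    def do_GET(self):
        payload = b"line 1\nline 2\nline 3\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the console quiet
        pass


# Ignore any system proxy settings for this local experiment
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), LinesHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/demo"

with request.urlopen(url, timeout=5) as resp:
    status = resp.getcode()                     # same value as resp.status
    final_url = resp.geturl()                   # same value as resp.url
    content_type = resp.info()["Content-Type"]  # info() is the headers object
    first_line = resp.readline()                # one line, as bytes
    remaining = resp.readlines()                # the rest, as a list of bytes

server.shutdown()
server.server_close()
print(status, final_url, content_type, first_line, remaining)
```

The with block calls close() automatically, so an explicit close() is only needed when the response is not used as a context manager.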

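2.2 Version differences and caveats
Python 2 and Python 3 import the request functions differently.

In Python 2: import urllib2

Python 3 split the library into submodules, including urllib.request, urllib.error and urllib.parse. Here we only need urllib.request: from urllib.request import urlopen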
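If a script has to run under both versions, one common pattern is to try the Python 3 import first and fall back to urllib2. This is only a sketch of that pattern, not something the rest of this article relies on:

```python
try:
    # Python 3: request functions and exception classes live in submodules
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
except ImportError:
    # Python 2 fallback: the same names lived in urllib2
    from urllib2 import urlopen, HTTPError, URLError

# Shows which module the function actually came from
print(urlopen.__module__)
```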

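3: The urlretrieve function

This function makes it easy to save a file from the web to local disk.

from urllib import request as req

# Download the image and save it as song.jpg
req.urlretrieve("https://bkimg.cdn.bcebos.com/pic/cdbf6c81800a19d85413042d3cfa828ba61e4682?x-bce-process=image/watermark,g_7,image_d2F0ZXIvYmFpa2UxMTY=,xp_5,yp_5",
                "song.jpg")

The first argument is the URL of the image and the second is the name of the file to save. If the second argument includes a directory path, the file is saved in that directory; otherwise it is saved in the current working directory, which is the script's own directory when the script is run from there.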

req.urlretrieve("https://bkimg.cdn.bcebos.com/pic/cdbf6c81800a19d85413042d3cfa828ba61e4682?x-bce-process=image/watermark,g_7,image_d2F0ZXIvYmFpa2UxMTY=,xp_5,yp_5"
                ,"D:\\1340109116\\workspacet\\songHui.jpg")
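urlretrieve returns a tuple of the local file name and the response headers. The sketch below shows that return value and the effect of a target path that includes a directory. To stay runnable offline it assumes a local file exposed through a file:// URL in place of the remote image; the temporary directory and file names are made up for the example.

```python
import tempfile
from pathlib import Path
from urllib import request

workdir = Path(tempfile.mkdtemp())

# Stand-in for a remote image: a small local file exposed as a file:// URL
source = workdir / "source.bin"
source.write_bytes(b"pretend these are image bytes")

# The target path includes a directory, so the file is saved there
target_dir = workdir / "downloads"
target_dir.mkdir()
target = target_dir / "song.jpg"

saved_path, headers = request.urlretrieve(source.as_uri(), str(target))
print(saved_path)
print(target.read_bytes() == source.read_bytes())
```

The target directory has to exist already; urlretrieve does not create it.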

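4: Encoding and decoding URL parameters with urlencode and parse_qs

When a browser sends a request whose URL contains Chinese or other special characters, it encodes them for us automatically (each byte becomes % followed by two hexadecimal digits). When we send the request from code, we have to encode the parameters ourselves, and that is what the urlencode function is for.

Note that this function lives in urllib.parse.

from urllib import parse

# urlencode() percent-encodes the keys and values and joins the pairs with &
parm={"name":"福尔摩斯","age":14,"tianshi":"hello"}
res= parse.urlencode(parm)
print(res)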
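For reference, here is what the encoding produces, together with quote() and unquote(), which handle a single value instead of a whole dictionary. This is a small sketch; the variable names are only illustrative.

```python
from urllib import parse

params = {"name": "福尔摩斯", "age": 14, "tianshi": "hello"}
query = parse.urlencode(params)
print(query)
# name=%E7%A6%8F%E5%B0%94%E6%91%A9%E6%96%AF&age=14&tianshi=hello

# quote() encodes one string; unquote() reverses it
keyword = parse.quote("刘德华")
print(keyword)                 # %E5%88%98%E5%BE%B7%E5%8D%8E
print(parse.unquote(keyword))  # 刘德华
```

Each Chinese character becomes three %XX groups because it occupies three bytes in UTF-8, the default encoding used by these functions.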

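Usage in practice:

If we search for the Chinese name 刘德华 on the Baidu home page and keep only the essential part of the URL, we are left with https://www.baidu.com/s?wd=刘德华.

Does urlopen work if we pass this address-bar URL to it directly in code?

It does not. The full error output is below. It shows that the underlying HTTP client encodes the request as ASCII, which cannot represent Chinese characters, so we need to percent-encode the parameter first.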

D:\Python\crawler_1\venv\Scripts\python.exe D:/Python/crawler_1/com/crawler/learn_1/Urllib_1.py
Traceback (most recent call last):
  File "D:/Python/crawler_1/com/crawler/learn_1/Urllib_1.py", line 3, in <module>
    res = req.urlopen("https://www.baidu.com/s?wd=刘德华")
  File "D:\Python\PaythonInstall\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python\PaythonInstall\lib\urllib\request.py", line 466, in open
    response = self._open(req, data)
  File "D:\Python\PaythonInstall\lib\urllib\request.py", line 484, in _open
    '_open', req)
  File "D:\Python\PaythonInstall\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "D:\Python\PaythonInstall\lib\urllib\request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "D:\Python\PaythonInstall\lib\urllib\request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "D:\Python\PaythonInstall\lib\http\client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "D:\Python\PaythonInstall\lib\http\client.py", line 1141, in _send_request
    self.putrequest(method, url, **skips)
  File "D:\Python\PaythonInstall\lib\http\client.py", line 983, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)

Process finished with exit code 1

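Note: Baidu redirects http to https, and in my tests calling the https URL directly from code did not work either. Using the http scheme and percent-encoding the Chinese text made the request succeed.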
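The failure and the fix can be reproduced without sending anything. The sketch below checks whether each form of the URL survives ASCII encoding, which is the step that raised UnicodeEncodeError in the traceback; the variable names are illustrative, and building a Request object does not contact the server.

```python
from urllib import parse, request

raw_url = "http://www.baidu.com/s?wd=刘德华"

# The HTTP client encodes the request line as ASCII; this is the step that fails
try:
    raw_url.encode("ascii")
    raw_is_ascii = True
except UnicodeEncodeError:
    raw_is_ascii = False

# Percent-encode the query value, then rebuild the URL
safe_url = "http://www.baidu.com/s?" + parse.urlencode({"wd": "刘德华"})
safe_url.encode("ascii")  # no exception now

req_obj = request.Request(safe_url)  # constructing a Request sends nothing
print(raw_is_ascii)
print(req_obj.full_url)
```

Passing safe_url (or req_obj) to urlopen is then the request described in the text.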

Where there is encoding there is also decoding: parse_qs turns an encoded query string back into its parameters.
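A minimal sketch of the round trip, reusing the parameters from the urlencode example:

```python
from urllib import parse

encoded = parse.urlencode({"name": "福尔摩斯", "age": 14, "tianshi": "hello"})

# parse_qs returns a dict; every value is a list because a key may repeat
decoded = parse.parse_qs(encoded)
print(decoded)
# {'name': ['福尔摩斯'], 'age': ['14'], 'tianshi': ['hello']}

# parse_qsl returns a list of (key, value) pairs instead
pairs = parse.parse_qsl(encoded)
print(pairs)
```

The number 14 comes back as the string '14': a query string carries no type information.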

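5: Splitting a URL into its parts with urlparse and urlsplit

Sometimes we need to split a URL into its components. For that we can use urlparse and urlsplit, both in the parse module.

The URL is split into scheme, domain (netloc), path, parameters (params and query) and anchor (fragment), and each part can be read and used separately.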
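A short sketch of both functions follows. The URL is a made-up example chosen so that every component is present. The main difference between the two is that urlparse separates the ;params part from the path, while urlsplit leaves it in the path and has no params field.

```python
from urllib import parse

# Hypothetical URL containing every component
url = "https://www.baidu.com/s;type=a?wd=python&user=abc#1"

parsed = parse.urlparse(url)
print(parsed.scheme)    # https
print(parsed.netloc)    # www.baidu.com
print(parsed.path)      # /s
print(parsed.params)    # type=a
print(parsed.query)     # wd=python&user=abc
print(parsed.fragment)  # 1

split = parse.urlsplit(url)
print(split.path)                # /s;type=a
print(hasattr(split, "params"))  # False
```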

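6: Using the Request class to scrape job listings

Some sites with anti-crawler measures return nothing useful to a bare urlopen call, and Lagou (拉勾网) is one of them. Let's see what comes back if we request the listing page URL directly:

We get a short, hard-to-read response. The printed result starts with b, which means it is a bytes object: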

url="https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
res=req.urlopen(url)
print(res.read())

******************************************************
b'\n<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Document</title>\n\t<style>\n\t\t* {\n\t\t\tmargin: 0;\n\t\t\tpadding: 0;\n\t\t}\n\t\tbody {\n\t\t\tfont-family: "Hiragino Sans GB", "Microsoft Yahei", "SimSun", Arial, "Helvetica Neue", Helvetica;\n\t\t\tbackground: #f8f9fc;\t\t\t\n\t\t}\n\t\t.i_error {\n\t\t\tposition: relative;\n\t\t\t/*width: 654px;*/\n\t\t\twidth: 34.0625%;\n\t\t\t/*height: 467px;*/\n\t\t\tmargin: 67px auto 0;\t\n\t\t\t/*background: url(/lagouhtml/blocked_404.png) 0 0 no-repeat;*/\n\t\t}\n\t\t.i_logo {\n\t\t\tposition: absolute;\n\t\t\t/*top: 116px;*/\n\t\t\ttop: 24.8394%;\n\t\t\t/*left: 68px;*/\n\t\t\tleft: 9.785933%;\n\t\t\t/*width: 110px;*/\n\t\t\twidth: 16.819572%;\n\t\t\t/*height: 41px;*/\n\t\t\t/*background: url(/lagouhtml/lagou_logo.png) 0 0 no-repeat;*/\n\t\t}\n\t\t.tip {\n\t\t\tmargin-top: 49px;\n\t\t\tfont-size: 20px;\n\t\t\tline-height: 20px;\n\t\t\ttext-align: center;\n\t\t\tcolor: #333;\n\t\t}\n\t\t.msg {\n\t\t\tmargin-top: 15px;\n\t\t\ttext-align: center;\n\t\t\tfont-size: 16px;\n\t\t\tline-height: 16px;\n\t\t\tcolor: #777;\n\t\t}\n\t\t.qq {\n\t\t\tmargin-top: 15px;\n\t\t\tfont-size: 18px;\n\t\t\ttext-align: center;\n\t\t}\n\t\t.qq a {\n\t\t\tdisplay: inline-block;\n\t\t\twidth: 100px;\n\t\t\theight: 30px;\n\t\t\tborder-radius: 2px;\n\t\t\tline-height: 30px;\n\t\t\ttext-decoration: none;\n\t\t\tcolor: #fff;\n\t\t\tbackground: #00b38a;\n\t\t}\n\t</style>\n\t<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.0.0/jquery.min.js"></script>\n</head>\n<body>\n\t<div class="i_error">\n\t\t<img src="/lagouhtml/blocked_404.png" alt="404" width="100%">\n\t\t<div class="i_logo"><img src="/lagouhtml/lagou_logo.png" alt="logo" width="100%"></div>\n\t</div>\t\n\t<div 
class="tip">\xe5\xbd\x93\xe5\x89\x8d\xe8\xaf\xb7\xe6\xb1\x82\xe5\xad\x98\xe5\x9c\xa8\xe6\x81\xb6\xe6\x84\x8f\xe8\xa1\x8c\xe4\xb8\xba\xe5\xb7\xb2\xe8\xa2\xab\xe7\xb3\xbb\xe7\xbb\x9f\xe6\x8b\xa6\xe6\x88\xaa\xef\xbc\x8c\xe6\x82\xa8\xe7\x9a\x84\xe6\x89\x80\xe6\x9c\x89\xe6\x93\x8d\xe4\xbd\x9c\xe8\xae\xb0\xe5\xbd\x95\xe5\xb0\x86\xe8\xa2\xab\xe7\xb3\xbb\xe7\xbb\x9f\xe8\xae\xb0\xe5\xbd\x95\xef\xbc\x81</div>\n</body>\n</html>'

We can call decode on it to get text we can read.

url="https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
res=req.urlopen(url)
print(res.read().decode("utf-8"))

***************************************
*** (part of the output omitted) ***
</div>
	</div>	
	<div class="tip">当前请求存在恶意行为已被系统拦截，您的所有操作记录将被系统记录！</div>
</body>
</html>

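The message in the page says, roughly: "The current request shows malicious behavior and has been intercepted by the system; all of your operations will be logged!" So we need to disguise the request further by setting request headers. We start with the user-agent header.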

url="https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
header={
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
         }
#res=req.urlopen(url)
request= req.Request(url,headers=header)
resStr= req.urlopen(request)
print(resStr.read().decode("utf-8"))
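To see why the header matters, it helps to look at what urllib sends when no user-agent is set. The sketch below uses a hypothetical local handler (UserAgentEcho, not part of the original article) that simply echoes the User-Agent it receives; a site can tell a script from a browser by this value.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class UserAgentEcho(BaseHTTPRequestHandler):
    """Hypothetical handler that replies with the User-Agent it was sent."""

    def do_GET(self):
        payload = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the console quiet
        pass


# Ignore any system proxy settings for this local experiment
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), UserAgentEcho)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"


def fetch(target):
    """Open a URL string or a Request object and return the decoded body."""
    with request.urlopen(target, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Plain URL string: urllib identifies itself as Python-urllib/<version>
default_ua = fetch(url)

# Request object with a browser-like user-agent, as in the snippet above
browser_ua = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36")
custom_ua = fetch(request.Request(url, headers={"user-agent": browser_ua}))

server.shutdown()
server.server_close()
print(default_ua)
print(custom_ua)
```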

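This returns a normal response, but not the one we hoped for: it contains no job listings.

Looking at the page's requests in the browser's developer tools, we find that the page issues an AJAX request, and that request is what returns the job listings.

So we inspect the details of that request and imitate it.

However, if we simply urlopen that request URL, it returns no data even with the user-agent header added; the response claims that our operations are too frequent.

In fact we are not making requests too often. The request just needs more disguise: because it is an AJAX request issued from inside the page, we have to add a referer header that says which page the request came from.

Sometimes these two headers are enough. Recently, though, I found that the same error still came back with only those two set, so I added the cookie header as well, and then the request went through.

Also note that the data argument has to be encoded: urlencode it, then convert the result to bytes.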

from urllib import request as req
from urllib import parse


url="https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false"
header={
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "referer":"https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    "origin":"https://www.lagou.com",
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'cookie':'user_trace_token=20200531194606-fc1f541a-e125-4963-9a9c-2b1d957e1216; _ga=GA1.2.1544368802.1590925567; LGUID=20200531194607-c7d6efeb-2ce7-4401-94bc-3e65916eb2a0; LG_LOGIN_USER_ID=b5a507f581fa09a8dca0626c07ed6a0bd39ed30461d10c2d; LG_HAS_LOGIN=1; RECOMMEND_TIP=true; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221726a8db339436-0a16f694cce861-3a65420e-1049088-1726a8db33a31%22%2C%22%24device_id%22%3A%221726a8db339436-0a16f694cce861-3a65420e-1049088-1726a8db33a31%22%2C%22props%22%3A%7B%22%24latest_utm_source%22%3A%22m_cf_cpt_baidu_pcbt%22%2C%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1594649478; gate_login_token=bc261c1a099219b9ea5417e5f508b118fcbf1b4d5f4d1e9a; JSESSIONID=ABAAABAABAGABFACB950A00325E11FB9B186A4767973E5D; WEBTJ-ID=20200713221208-1734884c952121-086ddd59679675-3a65420e-1049088-1734884c95514b; _putrc=EFB19A27B7811D99; login=true; unick=%E5%AD%94%E7%BB%B4%E6%8C%AF; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=42; privacyPolicyPopup=false; index_location_city=%E4%B8%8A%E6%B5%B7; TG-TRACK-CODE=index_search; LGSID=20200714222259-26727ec8-1214-4b9e-9b24-3ad9f63c3e14; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist%5Fpython%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; _gat=1; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1594736579; LGRID=20200714222259-a84b3e83-ac2c-4709-91c9-9957d5f3f7fd; _gid=GA1.2.842365217.1594736579; X_HTTP_TOKEN=ef0d6e7a5aae0da90856374951662facca1a5ab6a4; SEARCH_ID=ee5057c6ac5f42189cbf599cd3be9bb0'
         }
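# Form fields sent in the POST body
data={'first':'true','pn':1,'kd':'python'}
# urlencode() returns a str; encode() converts it to the bytes that Request expects
request= req.Request(url,data=parse.urlencode(data).encode(),headers=header,method="POST")
resStr= req.urlopen(request)
# The response body is bytes, so decode it before printing
print(resStr.read().decode("utf-8"))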
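The same pattern (form data encoded to bytes, a referer header, a POST request, a JSON response) can be practiced against a local stand-in before pointing it at a real site. In the sketch below, FakeAjaxEndpoint is entirely hypothetical: it is not Lagou's API and its response format is invented. It only mimics the idea that a request without a Referer is refused.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse, request


class FakeAjaxEndpoint(BaseHTTPRequestHandler):
    """Hypothetical stand-in for an AJAX endpoint; it is NOT Lagou's API."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        # Refuse requests without a Referer, loosely mimicking an anti-crawler check
        if "Referer" not in self.headers:
            body = {"ok": False}
        else:
            body = {"ok": True, "form": form}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the console quiet
        pass


# Ignore any system proxy settings for this local experiment
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), FakeAjaxEndpoint)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/positionAjax.json"


def post_form(target, fields, headers):
    """POST url-encoded form fields and parse the JSON reply."""
    body = parse.urlencode(fields).encode("utf-8")  # str -> bytes
    req_obj = request.Request(target, data=body, headers=headers, method="POST")
    with request.urlopen(req_obj, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


fields = {"first": "true", "pn": 1, "kd": "python"}
without_referer = post_form(url, fields, {})
with_referer = post_form(url, fields, {"referer": url})

server.shutdown()
server.server_close()
print(without_referer)
print(with_referer)
```

Parsing the body with json.loads instead of printing the raw text gives a dictionary that can be navigated in code. The keys of the real response are not shown in this article, so they have to be read from the actual reply.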
