Learning Web Scraping
Preparation:
Software: PyCharm 2019.1.2 (cracked version)
Link: https://pan.baidu.com/s/1Nq0h7JmiRorMXQXdh8xYwQ
Extraction code: sbnq
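If the link has expired, message me.
Install Python first, then PyCharm. If you are not sure how to crack it, a quick web search will turn up instructions.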
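Here is another option: download PyCharm Professional from the official website.
Then open:
http://lookdiv.com/index/index/indexcodeindex.html
Enter the following to get a continuously updated activation code:
lookdiv.com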
Install the library
Open the Run dialog (Win + R), type cmd, confirm, and then enter the following in the command window:
pip install requests
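After the install finishes, it can be worth confirming that Python can actually see the package. The sketch below uses only the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Reports True once `pip install requests` has succeeded for this interpreter.
    print("requests installed:", is_installed("requests"))
```

If this reports False even though pip succeeded, pip and PyCharm are probably using different Python interpreters.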
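The difference between requests and request:
According to people online, the request package was uploaded by a certain Mr. Guo.
I only use the genuine article, requests. Internally, requests calls its own request function.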
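Let's get started:
Please welcome our first victim (Dedicator):
We'll use https://818ps.com/ (图吧) as the example.
The first function to take the stage in requests is get.
Let's look at the source code first: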
def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)


def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object: ``GET``, ``OPTIONS``, ``HEAD``, ``POST``, ``PUT``, ``PATCH``, or ``DELETE``.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
        the server's TLS certificate, or a string, in which case it must be a path
        to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      >>> req
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
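Feeling a little overwhelmed? Haha.
OK, bear with me. As you can see, get accepts the following arguments:
1. url: the address of the page to request (the full URL, not just the domain name);
2. params: a dictionary, list of tuples, or bytes to send in the query string, for example: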
import requests

params1 = {'recharge': 36, 'fee_id': 'ireader_nonrenew_vip'}
res = requests.get("https://818ps.com/", params=params1)
print(res.status_code)  # print the status code; this will be very useful later
print(res.url)  # print the URL that was actually requested
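This prints the status code, followed by the URL that was requested, with the parameters appended as a query string.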
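To see what happens to params without touching the network, here is a rough sketch of the encoding step. It uses the standard library's urllib.parse instead of requests itself, and the helper name build_url is an assumption made for this example, so treat it as an approximation of what requests does rather than its actual implementation.

```python
from urllib.parse import urlencode


def build_url(base_url, params=None):
    """Append `params` to `base_url` as a URL-encoded query string."""
    if not params:
        return base_url
    # Use "&" if the URL already has a query string, otherwise start one with "?".
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


if __name__ == "__main__":
    params1 = {"recharge": 36, "fee_id": "ireader_nonrenew_vip"}
    # Prints https://818ps.com/?recharge=36&fee_id=ireader_nonrenew_vip
    print(build_url("https://818ps.com/", params1))
```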
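3. **kwargs, which covers many optional arguments:
1) method: the HTTP method name, such as 'get' or 'post'. Strictly speaking this is the first argument of request rather than part of **kwargs; get fills it in for you.
data: a dictionary, list of tuples, bytes, or file-like object, sent as the body of the request
json: a JSON-serializable Python object, sent as the body of the request
cookies: a dictionary or CookieJar, the cookies to send with the request
headers: a dictionary of custom HTTP headers
auth: a tuple, enables HTTP authentication
files: a dictionary, used to upload files
timeout: how many seconds to wait for the server before giving up, as a float or a (connect timeout, read timeout) tuple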
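As a small illustration of how these optional arguments travel through **kwargs, the sketch below collects a few of them into a dictionary and unpacks it into requests.get. The helper names build_kwargs and fetch, the default timeout of 10 seconds, and the example User-Agent string are all assumptions for this example, not part of requests.

```python
def build_kwargs(user_agent=None, timeout=10.0, cookies=None):
    """Collect optional settings into a dict suitable for requests.get(url, **kwargs)."""
    kwargs = {"timeout": timeout}  # seconds to wait before giving up
    if user_agent is not None:
        kwargs["headers"] = {"User-Agent": user_agent}
    if cookies:
        kwargs["cookies"] = dict(cookies)
    return kwargs


def fetch(url, **options):
    """Send a GET request using the options assembled by build_kwargs."""
    import requests  # imported here so build_kwargs works without requests installed

    return requests.get(url, **build_kwargs(**options))


if __name__ == "__main__":
    # Prints {'timeout': 5, 'headers': {'User-Agent': 'my-crawler/0.1'}}
    print(build_kwargs(user_agent="my-crawler/0.1", timeout=5))
```

Keeping the options in one dictionary makes it easy to reuse the same settings for every page you request.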
(1) If you want to spell out method = "get" yourself:
response = requests.request(url="https://movie.douban.com/top250", method="get")
or
response = requests.request("get", "https://movie.douban.com/top250")
(2) If that feels like too much trouble, use the shortcut below, which sets method = get for you:
response = requests.get("https://movie.douban.com/top250")
That's all for now; I'll explain the rest as we need it.
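The equivalence between the two spellings can be sketched without any network access. In the toy version below, request is a stand-in that merely reports what would be sent; only the shape of get (fix the method, default allow_redirects to True, pass everything else through) mirrors the requests source quoted earlier.

```python
def request(method, url, **kwargs):
    """Stand-in for requests.request: report what would be sent instead of sending it."""
    return {"method": method.upper(), "url": url, **kwargs}


def get(url, params=None, **kwargs):
    """Same shape as requests.get: fix the method and default allow_redirects to True."""
    kwargs.setdefault("allow_redirects", True)
    return request("get", url, params=params, **kwargs)


if __name__ == "__main__":
    url = "https://movie.douban.com/top250"
    via_get = get(url)
    via_request = request("get", url, params=None, allow_redirects=True)
    # Both spellings describe the same request, so this prints True.
    print(via_get == via_request)
```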
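On any web page, right-click and choose Inspect from the menu, and you will see a large amount of HTML code.
The HTML holds plenty of text, text links, images, image links, video links, and so on.
So every crawler starts by fetching the HTML source behind the URL.
1. A simple, unprotected page.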
import requests


def run():
    response = requests.get("https://818ps.com/")
    print(response)       # the Response object, shown as <Response [status code]>
    print(response.text)  # the HTML source of the page


if __name__ == "__main__":
    run()
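First, a word about status codes:
When a visitor opens a web page, the browser sends a request to the server hosting that page. Before the browser receives and displays the page, the server replies with a response header (server header) that contains an HTTP status code.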
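Common HTTP status codes:
200: the request succeeded
301: the resource has moved permanently to another URL
404: the requested page does not exist
500: internal server error
More status codes are listed at the link below:
https://www.runoob.com/http/http-status-codes.html
The status code returned here is 200, which means the request succeeded.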
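If you would rather not memorize the table, the standard library can look up the official reason phrase for a code. The sketch below is one way to do that; the helper name describe_status and the category labels are choices made for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The first digit of the code determines its category.
    category = categories.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes the standard library does not know about.
        phrase = "Unknown"
    return category, phrase


if __name__ == "__main__":
    for code in (200, 301, 404, 500):
        print(code, *describe_status(code))
```

With requests, the number to pass in is response.status_code.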
2. A protected page.
import requests


def run():
    # No custom headers here, so requests sends its default User-Agent.
    response = requests.get("https://movie.douban.com/top250")
    print(response)
    print(response.text)


if __name__ == "__main__":
    run()
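Returned status code: 418
Looking it up in the status code table, 418 comes from RFC 2324, the Hyper Text Coffee Pot Control Protocol (HTCPCP), a joke RFC published by the IETF on April Fools' Day 1998. Real HTTP servers are not expected to implement it. A teapot controlled via HTCPCP should return this error when it receives a BREW or POST command asking it to brew coffee. In other words, when a client asks a teapot to brew coffee, the teapot answers with an error status meaning "I'm a teapot". Some websites use this status code as an Easter egg, and some use it as a warning to crawlers.
So how do we get around this? We have to pretend to be a human, and that is where the UA comes in. What is a UA? The User-Agent (UA) string lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, and browser plugins, so it can tell whether the user is browsing from a computer or a phone and adapt the page accordingly. Put simply, a site inspects the UA string to adjust the page layout to a suitable format and give the user a better browsing experience. By default, requests announces itself with a UA beginning with python-requests, which makes it easy for a site to recognize a script; sending a browser's UA instead avoids that.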
import requests

# A User-Agent string copied from a desktop Chrome browser.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}


def run():
    response = requests.get("https://movie.douban.com/top250", headers=headers)
    print(response)
    print(response.text)


if __name__ == "__main__":
    run()
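To watch the User-Agent check in action without depending on a real site, the sketch below starts a tiny local server that answers 418 unless the UA looks like a browser, then requests it twice. Everything here is an assumption for illustration: the blocking rule is invented and is not Douban's actual logic, and the client uses the standard library's urllib.request instead of requests so the example runs with no extra installs.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36")


class PickyHandler(BaseHTTPRequestHandler):
    """Toy server: 200 for browser-looking User-Agents, 418 for everything else."""

    def do_GET(self):
        user_agent = self.headers.get("User-Agent", "")
        status = 200 if user_agent.startswith("Mozilla/") else 418
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def status_for(url, headers=None):
    """GET `url` and return the status code, even for error responses."""
    # An empty ProxyHandler bypasses any system proxy for this local test.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def demo():
    """Return (status without a browser UA, status with a browser UA)."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        return status_for(url), status_for(url, {"User-Agent": BROWSER_UA})
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())  # prints (418, 200)
```

The first request goes out with the library's default UA and is rejected; the second carries the browser UA and gets through, which is the same effect the headers argument has in the requests version.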
All done, boss!