Python3 爬虫知识梳理(基础篇)

最新推荐文章于 2024-06-25 23:28:16 发布

程序员阿城

最新推荐文章于 2024-06-25 23:28:16 发布

阅读量532

点赞数

分类专栏： python 文章标签：互联网程序员 python

本文链接：https://blog.csdn.net/zhoulei124/article/details/89813939

版权

这篇博客详述了Python3爬虫的基础知识，包括正则表达式、requests库、Selenium、lxml、BeautifulSoup、PyQuery等。讲解了爬虫的基本原理、Urllib库的使用、Requests库的进阶应用以及如何处理JavaScript渲染的问题。还介绍了如何存储数据，包括文件、数据库和二进制文件。最后，探讨了Selenium的使用，包括模拟浏览器行为、等待策略和异常处理。

摘要由CSDN通过智能技术生成

import urllib
import urllib.request
urllib.request.urlopen("http://www.baidu.com")

2.re

3.requests

4.selenimu

这个库是配合一些驱动去爬取动态渲染网页的库

(1)chromedriver

我们使用的时候需要先下载一个 chromedriver.exe ，下载好了以后放在 chrome.exe 的相同目录下（默认安装路径），然后将这个目录放作为 PATH

import selenium
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
driver.page_source

这种方式的唯一的缺点是会出现浏览器界面，这可能是我们不需要的,所以我们可以使用 headless 的方式来隐藏 web 界面(其实就是使用 options() 对象的 add_argument 属性去设置 headless 参数 )

我自己是一名高级python开发工程师，我有整理了一套最新的python系统学习教程，想要学习Python的小伙伴可以加我的python学习交流群：835017344，找管理员即可免费领取。

import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument("--headless")

base_url = "http://www.baidu.com/"
#对应的chromedriver的放置目录
driver = webdriver.Chrome(executable_path=(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'), chrome_options=chrome_options)

driver.get(base_url + "/")

start_time=time.time()
print('this is start_time ',start_time)

driver.find_element_by_id("kw").send_keys("selenium webdriver")
driver.find_element_by_id("su").click()
driver.save_screenshot('screen.png')

driver.close()

end_time=time.time()
print('this is end_time ',end_time)

(2)phantomJS

这是另一种无界面的实现方法，虽然说不维护了，并且在使用的过程中会出现各种玄学，但是还是要介绍一下

和 Chromedriver 一样，我们首先要去下载 phantomJS,然后将其放在 PATH 中方便我们后面的调用

import selenium
from selenium import webdriver

driver = webdriver.phantomJS()
driver.get("http://www.baidu.com")
driver.page_source

5.lxml

这个是为 XPATH 的使用准备的库

6.beautifulsoup

pip 安装的时候注意一下要安装 beautifulsoup4,表示第四个版本，并且这个库是依赖于 lxml 的，所以安装之前请先安装 lxml

from bs4 import BeautifulSoup
soup = BeautifulSoup('`<html></html>','lxml')

7.pyquery

和 BeautifulSoup 一样也是一个网页解析库，但是相对来讲语法简单一些（语法是模仿 jQuery 的）

from pyquery import PyQuery as pq

page = pq('`<html>hello world</html>`')
result = page('html').text()
result

8.pymysql

这个库是 py 操纵 Mysql 的库

import pymysql

conn = pymysql.connect(host='localhost',user='root',password='root',port=3306,db='test')
cursor = conn.cursor()
result = cursor.execute('select * from user where id = 1')
print(cursor.fetchone())

9.pymango

import pymango

client = pymango.MongoClient('localhost')
db = client('newtestdb')
db['table'].insert({'name':'Bob'})
db['table'].find_one({'name':'Bob'})

10.redis

import redis

r = redis.Redis('localhost',6379)
r.set("name","Bob")
r.get('name')

11.flask

flask 在后期使用代理的时候可能会用到

from flask import Flask

app = Flask(__name__)

@app.route('/')

def hello():
    return "hello world"

if __name__ == '__main__':

    app.run(debug=True)

12.django

在分布式爬虫的维护方面可能会用到 django

13.jupyter

网页端记事本

0X02 基础部分

1.爬虫基本原理

(1)爬虫是什么

爬虫就是请求网页并且提取数据的自动化工具

(2)爬虫的基本流程

1.发起请求：

通过 HTTP 库向目标网站发起请求，即发送一个 request（可以包含额外的header信息），然后等待服务器的响应

2.获取响应内容

如果服务器正常响应，会得到一个 Response.其内容就是所要获取的页面的内容，类型可以是 HTML、JSON、二进制数据(图片视频)等

3.解析内容

对 HTML 数据可以使用正则表达式、网页解析库进行解析。如果是 Json 则可以转化成 JSON 对象解析，如果是二进制数据可以保存或者进一步处理

4.保存数据

保存的形式多样，可以是纯文本，也可以保存成数据库，或者保存为特定格式的文件

(3)请求的基本元素

1.请求方法

2.请求 URL

3.请求头

4.请求体(POST 方法独有)

(4)请响应的基本元素

1.状态码

2.响应头

3.响应体

(5)实例代码：

1.请求网页数据

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("http://www.baidu.com",headers=headers)

print(res.status_code)
print(res.headers)
print(res.text)

当然这里使用的是 res.text 这种文本格式，如果返回的是一个二进制格式的数据(比如图片)，那么我们应该使用 res.content

2.请求二进制数据

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("https://ss2.bdstatic.com/lfoZeXSm1A5BphGlnYG/icon/95486.png",headers=headers)
print(res.content)

with open(r'E:\桌面\1.png','wb') as f:
    f.write(res.content)
    f.close()

(6)解析方式

1.直接处理

2.转化成 json对象

3.正则匹配

4.BeautifulSoap

5.PyQuery

6.XPath

(7)response 的结果为什么和浏览器中的看到的不同

我们使用脚本去请求(只是一次请求)网页得到的是最原始的网页的源码，这个源码里面会有很多的远程的 js 和 css 的加载，我们的脚本是没法解析的，但是浏览器能对这些远程的链接进行再次的请求，然后利用加载到的数据对页面进行进一步的加载和渲染，于是我们在浏览器中看到的页面是很多请求渲染得到的结果，因此和我们一次请求的到的页面肯定是不一样的。

(8)如何解决 JS 渲染的问题

解决问题的方法本质上就是模拟浏览器的加载渲染，然后将渲染好的页面进行返回

1.分析Ajax 请求

2.selenium+webdriver(推荐)

3.splash

4.PyV8、Ghostpy

(9)如何存储数据

1.纯本文

2.关系型数据库

3.非关系型数据库

4.二进制文件

0X03 Urllib 库

1.什么是 Urllib 库

这个库是 python 的内置的一个请求库

urllib.request —————–>请求模块

urllib.error——————–>异常处理模块

urllib.parse——————–>url解析模块

urllib.robotparser ————>robots.txt 解析模块

2.urllib 库的基本使用

(1)函数调用原型

urllib.request.urlopen(url,data,timeout...)

(2)实例代码一：GET 请求

import urllib.request

res = urllib.request.urlopen("http://www.baidu.com")
print(res.read().decode('utf-8'))

(3)实例代码二：POST 请求

import urllib.request
import urllib.parse
from pprint import pprint

data = bytes(urllib.parse.urlencode({'world':'hello'}),encoding = 'utf8')
res = urllib.request.urlopen('https://httpbin.org/post',data = data)
pprint(res.read().decode('utf-8'))

(4)实例代码三：超时设置

import urllib.request

res = urllib.request.urlopen("http://httpbin.org.get",timeout = 1)
print(res.read().decode('utf-8'))

(5)实例代码：获取响应状态码、响应头、响应体

import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get")
print(res.status)
print(res.getheaders())
print(res.getheader('Server'))
#获取响应体的使用 read() 的结果是 Bytes 类型，我们还要用 decode('utf-8')转化成字符串
print(res.read().decode('utf-8'))

(6) request 对象

from urllib import request,parse
from pprint import pprint

url = "https://httpbin.org/post"
headers = {
    'User-Agent':'hello wolrd',
    'Host':'httpbin.org'
}
dict = {
    'name':'Tom',
}

data = bytes(parse.urlencode(dict),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
res = request.urlopen(req)
pprint(res.read().decode('utf-8'))

3.urllib 库的进阶使用

(1)代理

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http':'http://127.0.0.1:9743'
})

opener = urllib.request.build_opener(proxy_handler)
res = opener.open('https://www.taobao.com')
print(res.read())

(2)Cookie

1.获取 cookies

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)

2.将 cookie 保存成文本文件

格式一：

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

格式二：

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

3.使用文件中的 cookie

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

(3)异常处理

1.实例代码一：URLError

from urllib import request
from urllib import error

try:
    urllib.request.urlopen("http://httpbin.org/xss")
except error.URLError as e:
    print(e.reason)

2.实例代码二：HTTPError

from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/xss')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

3.实例代码三：异常类型判断

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

(4)URL 解析工具类

1.urlparse

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

2.urlunparse

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

3.urljoin

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))

4.urlencode

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

0X04 Requests 库

1.什么是 requests 库

这个库是基于 URLlib3 的，改善了 urllib api 比较繁琐的特点，使用几句简单的语句就能实现设置 cookie 和设置代理的功能，非常的方便

2.requests 库的基本使用

(1)获取响应信息

import requests

res = requests.get("http://www.baidu.com")
print(res.status_code)
print(res.text)
print(res.cookies)

(2)各种请求方法

import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.head("http://httpbin.org/get")
requests.delete("http://httpbin.org/delete")
requests.options("http://httpbin.org/get")

(3)带参数的 get 请求

import requests

params = {
    'id':1,
    'user':'Tom',
    'pass':'123456'
}

res = requests.get('http://httpbin.org/get',params = params )
print(res.text)

(4)解析 json

import requests

res = requests.get("http://httpbin.org/get")
print(res.json())

(5)获取二进制数据

import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
    f.close()

(6)添加 headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

(7)POST 请求

import requests

data = {
    'id':1,
    'user':'Tom',
    'pass':'123456',
}

res = requests.post('http://httpbin.org/post',data=data)
print(res.text)

(8) response 属性

import requests

data = {
    'id':1,
    'user':'Tom',
    'pass':'123456',
}

res = requests.post('http://httpbin.org/post',data=data)
print(res.text)
print(res.status_code)
print(res.headers)
print(res.cookies)
print(res.history)
print(res.url)

(9)响应状态码

每一个状态码都对应着一个名字，我们只要调用这个名字就可以进行判断了

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),

实例代码：

import requests

response = requests.get('http://www.jianshu.com/hello.html')
exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')

3.requests 库的进阶使用

(1)文件上传

import requests

files = {'file':open('E:\\1.png','rb')}

res= requests.post('http://httpbin.org/post',files=files)
print(res.text)

(2)获取 cookies

import requests

res = requests.get("http://www.baidu.com")
for key,value in res.cookies.items():
    print(key + "=" + value)

(3)会话维持

这个用法非常的重要，在我们的模拟登陆的过程中是必然会用到的方法，在 CTF 的写脚本的过程中也经常会用到，所以我们稍微详细解释一下

我们在使用 requests.get 的时候要明确一点就是，我们每使用一个 requests.get 就相当于重新打开了一个浏览器，因此上一个 requests.get 中设置的 cookie 在下面的第二次请求中是不能同步的，我们来看下面的例子

实例代码:

import requests

#这里我们设置了 cookie 
requests.get('http://httpbin.org/cookies/set/number/123456789')
#我们再次发起请求，查看是否能带着我们设置的 cookie 
res = requests.get('http://httpbin.org/cookies')
print(res.text)

运行结果：

{
  "cookies": {}
}

我们发现，正如我们上面分析的，我们第一次访问设置的 cookie 并没有在第二次访问中生效，那么怎么办呢，我们有一个 session() 方法能帮助我们解决这个问

最低0.47元/天解锁文章

程序员阿城

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
Python3 爬虫知识梳理(基础篇)

import urllibimport urllib.requesturllib.request.urlopen("http://www.baidu.com")2.re3.requests4.selenimu这个库是配合一些驱动去爬取动态渲染网页的库(1)chromedriver我们使用的时候需要先下载一个chromedriver.exe，下载好了以后放在 chr...
复制链接

扫一扫

专栏目录