Web Scraping Notes

Latest recommended article published 2020-03-13 02:12:24

陈伟霆

Categories: python, web scraping

Article link: https://blog.csdn.net/weixin_43183295/article/details/98633716


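1. What is a web scraper?

The analogy:

The internet is like a spider's web, the scraper is the spider, and the data is the prey.

A more precise definition:

A scraper imitates a browser: it sends requests to a backend server, receives the response, parses out the data it wants, and stores the result.

The value of scraping:

The value lies in the data itself.

Workflow: send request -> receive data -> parse data -> store data

Parsing tools: bs4, pyquery, re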
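
The four-step workflow above can be sketched roughly as follows. To stay runnable offline, the "send request" step is replaced by a hard-coded HTML string (`SAMPLE_HTML`, invented for illustration); in a real scraper that text would come from an HTTP response. The names `parse_titles` and `store` are also just illustrative, and `re` is only one of the parsing options listed above.

```python
import json
import re

# Stand-in for the "send request / receive data" steps: in a real scraper
# this text would be the body of an HTTP response.
SAMPLE_HTML = """
<ul>
  <li class="movie"><a href="/m/1">Movie One</a></li>
  <li class="movie"><a href="/m/2">Movie Two</a></li>
</ul>
"""


def parse_titles(html):
    """Parse step: pull (link, title) pairs out of the HTML with a regex."""
    pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
    return [{"link": link, "title": title} for link, title in pattern.findall(html)]


def store(records):
    """Store step: serialize to JSON (a file or database in practice)."""
    return json.dumps(records, ensure_ascii=False)


records = parse_titles(SAMPLE_HTML)
saved = store(records)
print(saved)
```

Regexes are fine for a tiny, regular snippet like this one; for real pages a proper parser such as bs4 or pyquery is usually less fragile.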

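2. What to pay attention to in the HTTP protocol

Request:

URL: says where the request is going

method:

GET:

Passing data: appended to the URL as a query string, introduced by ? and joined with &

POST:

Request body:

form data

files

json

Request headers:

Cookie: the cookies the client sends back to the server

Referer: tells the server which page you came from

User-Agent: tells the server what kind of client you are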
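
As a small illustration of the request anatomy above, the sketch below builds a GET and a POST request with the standard library only, without sending anything. The query values, form values, and header values are placeholders chosen for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: data is appended to the URL after "?" with pairs joined by "&".
query = {"q": "python", "page": 2}
url = "http://httpbin.org/get" + "?" + urlencode(query)

# Headers travel with the request; building a Request sends nothing yet.
get_req = Request(url, headers={
    "User-Agent": "Mozilla/5.0 (placeholder)",
    "Referer": "http://httpbin.org/",
    "Cookie": "sessionid=placeholder",
})

# POST: the data goes in the request body instead of the URL.
form_body = urlencode({"user": "alice"}).encode("utf-8")
post_req = Request("http://httpbin.org/post", data=form_body)

print(url)
print(get_req.get_method(), post_req.get_method())
```

Passing `data` is what switches `urllib` from GET to POST here; the URL itself carries no form fields in the second request.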

Response:

Status Code:

2xx: success

3xx: redirect

Response headers:

Location: the redirect target

Set-Cookie: sets a cookie

Response body:

1. HTML code

2. Binary data: images, video, music

3. JSON
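
One way to act on the response parts listed above is to branch on the status class and on the Content-Type header. This is a minimal sketch: `status_class` and `decode_body` are hypothetical helper names, the 1xx, 4xx and 5xx classes are added from the HTTP standard rather than from these notes, and the body is assumed to be UTF-8.

```python
import json


def status_class(code):
    """Map a status code to its class by the leading digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


def decode_body(content_type, body):
    """Pick a decoding strategy from the Content-Type header (UTF-8 assumed)."""
    if content_type.startswith("application/json"):
        return json.loads(body.decode("utf-8"))
    if content_type.startswith("text/html"):
        return body.decode("utf-8")
    return body  # images, video, audio: keep the raw bytes


print(status_class(200), status_class(302))
print(decode_body("application/json", b'{"a": 1}'))
```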

3. Using common request libraries, parsing libraries, and databases

3.1 Common request libraries (test site: http://httpbin.org/get)

requests library

Install: pip install requests

Usage:

Request:

① GET request:

				response = requests.get(......)

				**Parameters:**

					url:

					headers = {}       a Cookie header set here takes priority over cookies

					cookies = {}

					params = {}

					proxies = {'http': 'http://ip:port'}

					timeout = None

					allow_redirects = False

② POST request:

				response = requests.post(......)

				**Parameters:**

					url:

					headers = {}

					cookies = {}

					data = {}

					json = {} / '' / []

					files = {'file': open(..., 'rb')}

					timeout = 0.5

					allow_redirects = False
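
To see what the `params`, `data` and `json` arguments actually do, the sketch below prepares three requests without sending them and inspects the result. It assumes `requests` is installed; the parameter values and the `notes-demo/0.1` User-Agent are made up for the example.

```python
import json

import requests

# GET: params are encoded into the query string.
get_req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"q": "python", "page": 1},
    headers={"User-Agent": "notes-demo/0.1"},
).prepare()

# POST with data=: sent as a URL-encoded form body.
form_req = requests.Request(
    "POST", "http://httpbin.org/post", data={"user": "alice"}
).prepare()

# POST with json=: serialized to JSON, Content-Type set automatically.
json_req = requests.Request(
    "POST", "http://httpbin.org/post", json={"user": "alice"}
).prepare()

print(get_req.url)
print(form_req.body, form_req.headers["Content-Type"])
print(json.loads(json_req.body), json_req.headers["Content-Type"])
```

A prepared request can later be sent with `requests.Session().send(get_req, timeout=5)`; `requests.get(...)` and `requests.post(...)` do the same preparation internally.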

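Requests that keep cookies automatically:

			session = requests.Session()

			r = session.get(......)

			r = session.post(......)
	  Extra (saving cookies to a local file):
	  	import http.cookiejar as cookielib
	  	session.cookies = cookielib.LWPCookieJar()
	  	# ignore_discard=True also writes session cookies that have no expiry
	  	session.cookies.save(filename='1.txt', ignore_discard=True)
	  	
	  	session.cookies.load(filename='1.txt', ignore_discard=True)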
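
The save/load pattern above can be tried offline as a round trip. This is a sketch under a few assumptions: the cookie is added by hand with `requests.cookies.create_cookie` (normally the server sets it), the name `token`, its value and the domain are invented, and the file lives in a temporary directory.

```python
import http.cookiejar as cookielib
import os
import tempfile

import requests
from requests.cookies import create_cookie

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# First run: swap in a file-backed jar, then save whatever it holds.
session = requests.Session()
session.cookies = cookielib.LWPCookieJar(filename=path)
# Normally the server sets cookies; this one is added by hand for the demo.
session.cookies.set_cookie(create_cookie("token", "abc123", domain="example.com"))
# Without ignore_discard=True, cookies lacking an expiry are not written.
session.cookies.save(ignore_discard=True)

# Later run: load the same file into a fresh session.
restored = requests.Session()
restored.cookies = cookielib.LWPCookieJar(filename=path)
restored.cookies.load(ignore_discard=True)

loaded = {c.name: c.value for c in restored.cookies}
print(loaded)
```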

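Response:

			r.url            final URL, e.g. http://www.bau.com/fsdfsdf/fdsfdsfsdfsd

			r.text           body decoded to str

			r.encoding = 'gbk'      encoding used to decode r.text

			r.content        body as raw bytes

			r.json()         roughly json.loads(r.text)

			r.status_code

			r.headers

			r.cookies

			r.history        [response 1, response 2] (the redirect chain)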
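
The relationship between `r.content`, `r.encoding`, `r.text` and `r.json()` can be imitated with plain bytes. The sketch below does not use a real response object; `content` and `json_content` are stand-ins for what `r.content` would hold.

```python
import json

# Pretend these bytes are r.content from a GBK-encoded page.
content = "标题: 爬虫".encode("gbk")

# Decoding with the wrong encoding garbles the text ...
wrong = content.decode("utf-8", errors="replace")
# ... while the right one (what r.encoding = 'gbk' selects) recovers it.
text = content.decode("gbk")

# r.json() amounts to parsing the body as JSON.
json_content = b'{"status": "ok", "items": [1, 2, 3]}'
data = json.loads(json_content.decode("utf-8"))

print(text)
print(data["items"])
```

This is why setting `r.encoding` before reading `r.text` matters on pages whose declared charset is missing or wrong.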

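3.2 Common parsing syntax

CSS selectors

1. Class selector

.class-name

2. ID selector

#id-value

3. Tag selector

tag-name

4. Descendant selector

selector1 selector2

5. Child selector

selector1>selector2

6. Attribute selectors

[attr]

[attr=value]

[attr^=value]

[attr$=value]

[attr*=value]

7. Group selector

selector1, selector2, ...   (or)

8. Compound selector

selector1selector2...   (and), e.g. p[pro='xx'][class='yy']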
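
The selector types above can be tried against a small document. This sketch uses bs4 (mentioned earlier in these notes) and assumes it is installed; the HTML snippet and the helper `texts` are invented for the example.

```python
from bs4 import BeautifulSoup

HTML = """
<div id="main">
  <ul class="menu">
    <li class="item"><a href="http://a.example/x.html">A</a></li>
    <li><a href="/local.pdf">B</a></li>
  </ul>
  <p pro="xx" class="yy">target</p>
  <p pro="xx" class="zz">other</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [el.get_text() for el in soup.select(selector)]


print(texts(".item"))                    # class selector
print(texts("#main li"))                 # descendant: any depth below #main
print(texts("#main > li"))               # child: direct children only
print(texts('a[href^="http"]'))          # attribute value starts with
print(texts('a[href$=".pdf"]'))          # attribute value ends with
print(texts("li, p"))                    # group: matches either selector
print(texts('p[pro="xx"][class="yy"]'))  # compound: all conditions must hold
```

The descendant and child lines differ only in the combinator, which is enough to change the result here because the `li` elements sit two levels below `#main`.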

XPath selectors

(omitted)

3.3 The excellent requests-html

Install: pip install requests-html

Usage:

Request:

			from requests_html import HTMLSession

			session = HTMLSession()

			**Parameters:**

				browser_args = [

					'--no-sandbox',

					'--user-agent=XXXXX'

				]

			response = session.request(......)

			response = session.get(......)

			response = session.post(......)

The parameters are exactly the same as in the requests module

Response:

			r.url

			**Attributes are exactly the same as in the requests module**

Parsing:

HTML object attributes:

			from requests_html import HTML
			html = HTML(html='')
			
			r.html.absolute_links

			      .links

			      .base_url

			      .html

			      .text

			      .encoding = 'gbk'

			      .raw_html

			      .pq

HTML object methods:

			r.html.find('css selector')     [Element object, Element object, Element object]

			      .find('css selector', first=True)    Element object

			      .xpath('xpath selector')

			      .xpath('xpath selector', first=True)

			      .search('template')         Result object

			           	('xxx{}yyy{}')[0]

					   ('xxx{name}yyy{pwd}')['name']

			      .search_all('template')      [Result object, Result object, Result object]

			      .render(.....)   renders the page in a browser via pyppeteer

				   	**Parameters:**

					    	script: """ ( ) => {

										js code

										js code

									}

								  """

						    scrolldown: n

						    sleep: n

						    keep_page: True/False

() => {
    Object.defineProperties(navigator, {
        webdriver: {
            get: () => undefined
        }
    })
}

Interacting with the browser: r.html.page.XXX

				async def xxx():

					await r.html.page.XXX

				session.loop.run....(xxx())

			.screenshot({'path': path, 'clip': {'x': int, 'y': int, 'width': int, 'height': int}})

			.evaluate('''() => {js code}''')    JS injection

			.cookies()

			.type('css selector', 'text', {'delay': 100})   delay in ms

			.click('css selector')

			.focus('css selector')

			.hover('css selector')

			.waitForSelector('css selector')     wait until the element has loaded

			.waitFor(1000)

Keyboard events: r.html.page.keyboard.XXX

			.down('Shift')
			
			.up('Shift')

			.press('ArrowLeft', {'delay': 100})

			.type('喜欢你啊', {'delay': 100})

Mouse events: r.html.page.mouse.XXX

			.click(x, y, {
                'button': 'left',
                'clickCount': 1,
                'delay': 0
			})
			.down({'button': 'left'})
			.up({'button': 'left'})
			.move(x, y, {'steps': 1})

Common databases

MongoDB:

The history of the arms race between scrapers and anti-scraping


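Common anti-scraping measures

1. Checking the browser's request headers

2. IP bans

3. Image CAPTCHAs

4. Slider CAPTCHAs

5. JS tracking of mouse trajectories

6. Front-end anti-debugging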

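Small scraping exercises

1. Scrape campus-belle photos

2. Scrape Douban movies

3. Scrape campus-belle videos

4. Scrape Tmall

Anti-scraping: technical measures used to block scraper programs

False positives: the anti-scraping system mistakes ordinary users for scrapers; if the false-positive rate is too high, the system cannot be used no matter how effective it is

Cost: the staff and machine cost that anti-scraping requires

Interception: successfully blocking scrapers; in general, the higher the interception rate, the higher the false-positive rate

Purpose of anti-scraping:

5. Analyze Tencent Video URLs

6. Log in to Zhihu

Save cookies to a local file

7. Log in to Lagou

CAPTCHA handling:

1. Handle it yourself

2. Online CAPTCHA-solving services

3. Manual solving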

scrapy
