Web Crawlers

Latest recommended article published on 2024-10-18 00:00:00

weixin_30662109

Tags: crawler, json, operating system

Original link: http://www.cnblogs.com/jingandyuer/p/11317698.html

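1. What is a crawler?

The crawler analogy:

The internet is like a spider web, the crawler is the spider, and the data is the prey.

A concrete definition of a crawler:

A program that imitates a browser: it sends requests to a backend server, receives the response, parses out the data we want, and stores it.

The value of a crawler:

The value lies in the data it collects.

Workflow: send request -> get data -> parse data -> store data

Parsing libraries: bs4, pyquery, re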
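
The four-step workflow above can be sketched without touching the network. In this minimal sketch the "response" is a hard-coded HTML string standing in for what a real request would return. The function names `parse_titles` and `store` and the file name `titles.json` are illustrative assumptions, not part of the original notes.

```python
import json
import re
import tempfile
from pathlib import Path

# Stand-in for steps 1 and 2 (send request, get data): a real crawler would
# fetch this text from a server; it is hard-coded here so the sketch runs offline.
RESPONSE_TEXT = """
<ul>
  <li class="book">Python Basics</li>
  <li class="book">HTTP in Practice</li>
</ul>
"""


def parse_titles(html: str) -> list[str]:
    """Step 3 (parse): pull the text of every <li class="book"> with a regex."""
    return re.findall(r'<li class="book">(.*?)</li>', html)


def store(titles: list[str], path: Path) -> None:
    """Step 4 (store): write the parsed data to a JSON file."""
    path.write_text(json.dumps(titles, ensure_ascii=False), encoding="utf-8")


titles = parse_titles(RESPONSE_TEXT)
out_file = Path(tempfile.gettempdir()) / "titles.json"  # hypothetical output location
store(titles, out_file)
print(titles)
```

In a real crawler the regex would usually give way to bs4 or pyquery, and the JSON file to a database, but the four steps stay the same.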

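2. What to pay attention to in the HTTP protocol

Request:

URL: specifies where the request goes

method:

GET:

Passing data: appended to the URL after ?, with key=value pairs joined by &

POST:

Request body:

form data: form fields

files: uploaded files

json: data in JSON format

Request headers:

Cookie: the cookies the client sends back to the server

Referer: tells the server which page you came from

User-Agent: tells the server what client you are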
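
To make the request parts above concrete, here is a small sketch using only the standard library. It builds a GET URL, two kinds of POST body, and a header dictionary without sending anything. The parameter names, credentials, and header values are made up for illustration.

```python
import json
from urllib.parse import urlencode

base_url = "http://httpbin.org/get"

# GET: parameters are appended to the URL after "?", joined with "&".
query = urlencode({"q": "crawler", "page": 2})
get_url = f"{base_url}?{query}"

# POST: the same kind of data travels in the request body instead.
form_body = urlencode({"user": "alice", "pwd": "secret"})   # form data
json_body = json.dumps({"user": "alice", "pwd": "secret"})  # JSON

# Headers a crawler commonly sets so the request resembles a browser's.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # what client is asking
    "Referer": "http://httpbin.org/",                 # which page we came from
    "Cookie": "sessionid=abc123",                     # cookies sent back to the server
}

print(get_url)
print(form_body)
print(json_body)
```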

Response:

Status Code:

2xx: success

3xx: redirection

Response headers:

Location: the address to redirect to

Set-Cookie: sets a cookie

Response body:

1. HTML code

2. Binary data: images, video, music

3. JSON
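
The response parts above can be inspected with the standard library alone. The sketch below classifies a status code by its leading digit and parses a `Set-Cookie` header. The header values are invented, the helper name `classify` is hypothetical, and the 4xx/5xx classes are added for completeness beyond what the notes list.

```python
from http.cookies import SimpleCookie


def classify(status_code: int) -> str:
    """Map a status code to its class by the leading digit."""
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return classes.get(status_code // 100, "other")


# A made-up set of response headers for a 302 redirect.
status_code = 302
response_headers = {
    "Location": "https://example.com/login",
    "Set-Cookie": "sessionid=abc123; Path=/; HttpOnly",
}

cookie = SimpleCookie()
cookie.load(response_headers["Set-Cookie"])  # parse the Set-Cookie header

print(classify(status_code))
print(response_headers["Location"])
print(cookie["sessionid"].value)
```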

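3. Using common request libraries, parsing libraries, and databases

3.1 Common request libraries (test site: http://httpbin.org/get)

The requests library

Install: pip install requests

Usage:

Request:

(1) GET request:

                response = requests.get(......)

•    **Parameters:**

•     url: the request URL

•     headers = {}     request headers; a Cookie header set here takes precedence over cookies

•     cookies = {}    cookies to send with the request

•     params = {}   query parameters to send with the request

•     proxies = {'http': 'http://ip:port'}   # proxy server; note the order is ip:port

•     timeout = 0.5   request timeout in seconds

•     allow_redirects = False   whether to follow redirects; defaults to True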
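
To see what these parameters do without sending anything, requests can build a request and "prepare" it. The sketch below assumes that inspecting the prepared request is an acceptable stand-in for a live call. The User-Agent string and cookie value are made up.

```python
import requests

# Build the request without sending it, so the example runs offline.
req = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"q": "crawler", "page": 1},        # appended to the URL as a query string
    headers={"User-Agent": "my-crawler/0.1"},  # hypothetical User-Agent string
    cookies={"sessionid": "abc123"},           # sent as a Cookie header
)
prepared = req.prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])

# To actually send it, the equivalent call would look like:
# r = requests.get("http://httpbin.org/get", params={"q": "crawler"}, timeout=0.5)
```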

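(2) POST request:

                response = requests.post(......)

•    **Parameters:**

•     url: the request URL

•     headers = {}     request headers; a Cookie header set here takes precedence over cookies

•     cookies = {}    cookies to send with the request

•     data = {}   form data sent in the request body

•     json = {}   data sent as a JSON request body

•     files = {'file': open(..., 'rb')}   files to upload as binary data (audio, video)

•     timeout = 0.5  request timeout in seconds

•     allow_redirects = False  whether to follow redirects; defaults to True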
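
The difference between `data` and `json` is easiest to see in the prepared request body. This is a minimal offline sketch. The payload is invented, and nothing is sent to httpbin.

```python
import json

import requests

url = "http://httpbin.org/post"
payload = {"user": "alice", "pwd": "secret"}  # made-up form fields

# data=... is sent as a URL-encoded form body.
form_req = requests.Request("POST", url, data=payload).prepare()

# json=... is serialized to JSON and the Content-Type header is set automatically.
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json_req.body)
```

If a server expects JSON and receives a form body (or the reverse), it will usually reject or misread the request, so it is worth checking in the browser's network panel which of the two a site uses.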

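A session that stores cookies automatically:

•   session = requests.session()

•   r = session.get(......)

•   r = session.post(......)

   session.cookies accumulates the cookies from every page requested so far and sends them along with the next request

   r.cookies holds the cookies set by the current response


 Addendum (saving cookies to a local file):
  import http.cookiejar as cookielib
  session.cookies = cookielib.LWPCookieJar()
  session.cookies.save(filename='1.txt')  # save writes the cookies to a file

  session.cookies.load(filename='1.txt')  # load reads the cookies back from the file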
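
A slightly fuller sketch of the save/load round trip follows. It assumes a temporary file path and adds one cookie by hand, because no real response is available offline. The cookie name, value, and domain are made up.

```python
import http.cookiejar as cookielib
import os
import tempfile

import requests
from requests.cookies import create_cookie

cookie_file = os.path.join(tempfile.gettempdir(), "cookies_demo.txt")  # hypothetical path

# Swap the session's default jar for one that can be written to disk.
session = requests.Session()
session.cookies = cookielib.LWPCookieJar(filename=cookie_file)

# Normally cookies arrive via responses; one is added by hand so the sketch runs offline.
session.cookies.set_cookie(create_cookie("sessionid", "abc123", domain="example.com"))

# Cookies without an expiry are marked "discard" and are skipped on save
# unless ignore_discard=True is passed.
session.cookies.save(ignore_discard=True)

# Later, or in another process: load the file into a fresh jar.
restored = cookielib.LWPCookieJar(filename=cookie_file)
restored.load(ignore_discard=True)
print([(c.name, c.value) for c in restored])
```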

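Response:

import requests

session = requests.session()

r = session.get('https://www.baidu.com')

•   r.url  the URL of the response: https://www.baidu.com

•   r.text  the response body as text

•   r.encoding = 'gbk'  sets the encoding used to decode r.text. The server decides how the page is encoded, and decoding requires knowing that encoding; to find it, type document.charset in the browser console

•   r.content  the response body as raw bytes

•   r.json()  converts the response data to a Python dict; in essence json.loads(r.text)

•   r.status_code  the response status code

•   r.headers  the response headers

•   r.cookies  the cookies set by the current response

•   r.history  the redirect responses that led to this one: [response 1, response 2]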
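
The relationship between `r.content`, `r.encoding`, `r.text`, and `r.json()` can be imitated with plain bytes. The sketch below is a conceptual illustration only: the byte string stands in for `r.content`, and the JSON document is made up.

```python
import json

# Pretend these bytes are r.content from a server that encodes its pages as GBK.
content = '{"title": "小说网"}'.encode("gbk")

# r.text is r.content decoded with r.encoding; the right codec recovers the text.
text = content.decode("gbk")

# Decoding with the wrong codec garbles the non-ASCII characters instead.
wrong = content.decode("utf-8", errors="replace")

# r.json() is roughly json.loads applied to the decoded body.
data = json.loads(text)

print(text)
print(wrong)
print(data["title"])
```

This is why setting `r.encoding` before reading `r.text` matters on sites that do not use UTF-8.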

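3.2 Common parsing syntax

CSS selectors

1. Class selector

(".classname")

2. ID selector

("#id")

3. Tag selector

("tagname")

4. Descendant selector

div span{color:red}  # separated by a space ---> every span anywhere inside a div (span is a descendant)

5. Child selector

div>span{color:red}  # separated by > ---> only the spans that are direct children of a div

6. Attribute selectors

1. [attr]  tags that have this attribute

2. [attr=value1]  tags whose attribute equals value1

3. [attr^=value1]  tags whose attribute value starts with value1

4. [attr$=value1]  tags whose attribute value ends with value1

5. [attr*=value1]  tags whose attribute value contains value1

7. Group selector

div,span,img{color:red}  # separated by commas: an OR relationship

Meaning: div, span, and img elements all turn red

8. Compound (multi-condition) selector

        <p class="shuang">1</p>
        <p class="mc">2</p>
        <p class="mc">3</p>

        p.shuang {
            color: red
        }

  This selects element 1: first all p tags, then only those whose class is shuang. Selectors written together are in an AND relationship.

selector1selector2selector3 (no spaces or commas between the selectors): AND relationship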
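
The selectors above can be tried out from Python with bs4's `select` method. This is a minimal sketch: the HTML fragment and the helper name `texts` are made up for the demonstration, and `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
  <span class="a">child</span>
  <p><span class="b">grandchild</span></p>
  <a href="https://example.com/x.html">link</a>
</div>
<p class="shuang">1</p>
<p class="mc">2</p>
"""
soup = BeautifulSoup(html, "html.parser")  # standard-library parser


def texts(selector: str) -> list[str]:
    """Return the text of every element matched by a CSS selector."""
    return [el.get_text() for el in soup.select(selector)]


print(texts("div span"))           # descendant: every span inside the div
print(texts("div > span"))         # child: only direct children of the div
print(texts('a[href^="https"]'))   # attribute value starts with "https"
print(texts('a[href$=".html"]'))   # attribute value ends with ".html"
print(texts("p.shuang"))           # compound (AND): p tags with class shuang
print(texts("span.a, p.mc"))       # group (OR): elements matching either selector
```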

BS4:

html = '''

<html><head><title>the Dormouse's story</title></head>

<body>

<p class='title' name='dromouse'>The Dormouse's story</p>

</body></html>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# Get an element (if several elements match, the first one is returned)

print(soup.title)  # <title>the Dormouse's story</title>

print(soup.head)  # <head><title>the Dormouse's story</title></head>

# Get the tag name

print(soup.title.name)  # title

# Get an attribute
print(soup.p.attrs['name'])  # dromouse
print(soup.p['name'])  # dromouse

# Get the content

print(soup.p.string)  # The Dormouse's story

# Nested selection

print(soup.head.title.string)  # the Dormouse's story

# Child nodes and descendant nodes

print(soup.p.contents)  # all direct children of the p tag (tags and text nodes), as a list

print(soup.p.children)  # the same children as an iterator (not a list; loop over it to get the nodes)

3.3 The powerful requests-html

Install: pip install requests-html

Usage:

Request:

•   from requests_html import HTMLSession

•   session = HTMLSession()

•   r = session.get(url='https://www.baidu.com')

•   response = session.request(url='https://www.baidu.com', method='get')

•   response = session.get(url='https://www.baidu.com/get')

•   response = session.post(url='https://www.baidu.com/post')

The parameters are exactly the same as in the requests module

Response:

r = session.get(url='https://www.baidu.com/get')

•   r.url  the URL that was requested: 'https://www.baidu.com/get'

•   **The other attributes are exactly the same as in the requests module**

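Parsing:

Attributes of the html object:

r = session.get(url='https://www.baidu.com/get')

•   r.html.absolute_links   all links on the page in absolute form (an absolute link includes the scheme, such as http)

•   r.html.links   all links exactly as they appear in the page (unmodified)

r = session.get(url='https://www.baidu.com/p-1-2077.html')

•   r.html.base_url ---> the base URL of the page: https://www.baidu.com

•   r.html.html  the page's HTML content

•   r.html.text  the page's text content

•   r.html.encoding = 'gbk'  sets the encoding used for decoding

•   r.html.raw_html  the raw HTML as bytes

•   r.html.pq  a PyQuery object for the page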

Methods of the html object:

from requests_html import HTMLSession

session = HTMLSession()

url = 'https://www.183xsw.com/6_6116/'

r = session.get(url=url)

r.html.find('css selector')  # [Element, Element, Element]

r.html.find('css selector', first=True)  # a single Element: only the first match

r.html.xpath('xpath expression')

r.html.xpath('xpath expression', first=True)

r.html.search('template')  # stops at the first match and returns a Result object

For example: r.html.search('(提示：{name},最新章节可能会{pwd},登录书架即可查看)')

This looks in the page for a sentence of the form (提示：{},最新章节可能会{},登录书架即可查看); the text that fills each placeholder is then available by name, e.g. result['name'] and result['pwd']

r.html.search_all('template')  # returns every match: [Result, Result]
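
The template idea behind `search` and `search_all` can be illustrated with the standard library's `re` module and named groups. This is an analogue for explanation only, not how requests-html implements it. The sample sentence and the group names `number` and `status` are made up.

```python
import re

# Made-up page text standing in for the text of a downloaded page.
page_text = "Notice: chapter 12 may be delayed, log in to your bookshelf to view it."

# The template "Notice: chapter {number} may be {status}, log in" as a regex:
# each {placeholder} becomes a named group that matches as little as possible.
pattern = re.compile(r"Notice: chapter (?P<number>.+?) may be (?P<status>.+?), log in")

first = pattern.search(page_text)                # like search(): the first match only
all_matches = list(pattern.finditer(page_text))  # like search_all(): every match

print(first.group("number"), first.group("status"))
print(len(all_matches))
```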

  Attributes of an Element object:

            a_element = r.html.find('a', first=True)

            a_element.absolute_links  the element's links in absolute form


            <div id="footer" name="egon">呵呵</div>
            a_element = r.html.find('#footer', first=True)

            a_element.attrs  returns the tag's attributes as a dict: {'id': 'footer', 'name': 'egon'}

            <a href='https://www.183xsw.com'>183小说网</a>
            a_element.text  the text content: 183小说网

            a_element.html  the HTML content: <a href='https://www.183xsw.com'>183小说网</a>

            a_element.raw_html  the raw bytes: <a href='https://www.183xsw.com'>183\xd0\xa1\xcb\xb5\xcd\xf8</a>

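The render() method:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url='https://www.baidu.com')

r.html.render()  # drives a browser engine to render the page; the first call to render downloads a browser (Chromium)

**Parameters:**

Setting the browser launch arguments:

session = HTMLSession(
    browser_args=[
        '--no-sandbox',
        # there must be no spaces around the equals sign in --user-agent
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    ]
)

The script parameter: JS injection

script = '''
    () => {
        // JavaScript written inside the braces runs in the page; this is called JS injection
    }
'''

r.html.render(script=script)

The sleep parameter

r.html.render(script=script, sleep=10)  # after rendering the page, wait 10 seconds, then close the browser page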

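Type this in the browser console: navigator.webdriver

1. In a normal browser: undefined

2. In the browser driven by render: true

Conclusion: a result of true identifies the visitor as a crawler

3. With the browser driven by render, inject this code:

script = '''
    () => {
        Object.defineProperties(navigator, {
            webdriver: {
                get: () => undefined
            }
        })
    }
'''
r.html.render(script=script)

Type navigator.webdriver in the browser console again

It now shows undefined.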

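keep_page: the parameter for interacting with the browser

Once the browser page is closed, r.html.page is None and cannot be used to interact with the browser. With keep_page=True,
r.html.page is not None: the page object is kept


r.html.render(script=script, sleep=10, keep_page=True)

**Interacting with the browser: r.html.page is the browser page object**

* Interacting with the browser from a coroutine:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url='https://www.baidu.com')

try:
    r.html.render(script=script, sleep=10, keep_page=True)

    async def main():  # define the coroutine function
        await r.html.page.screenshot({'path': '1.png'})  # take a screenshot

    session.loop.run_until_complete(main())
finally:
    session.close()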
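
The pattern above (define a coroutine, hand it to the event loop, clean up in `finally`) can be tried without a browser. In this sketch `FakePage` is a hypothetical stand-in for `r.html.page` that only records calls, and a fresh asyncio loop plays the role of `session.loop`.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for r.html.page; it only records the calls made."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def screenshot(self, options: dict) -> None:
        await asyncio.sleep(0)  # yield control, as a real browser call would
        self.calls.append(f"screenshot:{options['path']}")


page = FakePage()


async def main() -> str:
    """Coroutine: page methods must be awaited inside an async function."""
    await page.screenshot({"path": "1.png"})
    return "done"


loop = asyncio.new_event_loop()  # plays the role of session.loop
try:
    result = loop.run_until_complete(main())  # blocks until the coroutine finishes
finally:
    loop.close()  # always release resources, mirroring session.close()

print(result, page.calls)
```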
   

  
•   await r.html.page.screenshot({'path': path})  # take a screenshot

•   await r.html.page.evaluate('''() => {js code}''')  # run JavaScript in the page

•   await r.html.page.cookies()  # get the cookies

•   await r.html.page.type('css selector', 'text', {'delay': 100})  # type into an input box, one character every 0.1 seconds

•   await r.html.page.click('css selector')  # click

•   await r.html.page.focus('css selector')  # focus

•   await r.html.page.hover('css selector')  # hover

•   await r.html.page.waitForSelector('css selector')  # wait until the element appears

•   await r.html.page.waitFor(1000)  # wait 1000 milliseconds (1 second)

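Keyboard events: r.html.page.keyboard (each call fires the event once)

•   await r.html.page.keyboard.down('Shift')  # hold down the Shift key

•   await r.html.page.keyboard.up('Shift')  # release the Shift key

   press means pressing a key down and then releasing it, i.e. down followed by up

•   await r.html.page.keyboard.press('ArrowLeft')

•   await r.html.page.keyboard.type('喜欢你啊', {'delay': 100})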

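Mouse events: r.html.page.mouse

   await r.html.page.mouse.click(x, y, {
       'button': 'left',
       'clickCount': 1,
       'delay': 0,
   })
   await r.html.page.mouse.down({'button': 'left'})  # press the left button
   await r.html.page.mouse.up({'button': 'left'})  # release the left button
   await r.html.page.mouse.down({'button': 'right'})  # press the right button
   await r.html.page.mouse.up({'button': 'right'})  # release the right button
   await r.html.page.mouse.move(x, y, {'steps': 1})  # move the mouse; x, y are coordinates, steps is the number of intermediate movement steps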