Web Scraping Basics

Latest recommended article published on 2022-01-12 09:55:30

weixin_30897233

Tags: web scraping, json, backend

Original link: http://www.cnblogs.com/yanminggang/p/11347938.html

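What a crawler is

A crawler simulates a browser: it sends requests to a backend, gets the data back, parses out the parts it wants, and then stores them:
send the request -> get the data -> parse the data -> store the data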
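The four steps can be sketched with the standard library alone. This is only an illustration under stated assumptions: `TitleParser`, `crawl`, and `fake_fetch` are made-up names, and `fake_fetch` returns a fixed page in place of a real HTTP call so that the sketch runs offline.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(url, fetch):
    """Run the four steps for one URL; `fetch` maps a URL to HTML text."""
    html = fetch(url)                       # steps 1-2: send the request, get the data
    parser = TitleParser()
    parser.feed(html)                       # step 3: parse the data
    record = {"url": url, "title": parser.title.strip()}
    return json.dumps(record)               # step 4: serialize it for storage


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    return "<html><head><title> Demo page </title></head><body></body></html>"


stored = crawl("https://example.com/", fake_fetch)
print(stored)
```

In a real crawler `fetch` would wrap an HTTP client, and the final step would write to a file or database instead of returning a string.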

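What to pay attention to in the HTTP protocol

Request:
    URL: where the request is going
    method:
        GET: data is passed in the query string, appended to the URL after ? and joined with &
        POST: data is passed in the request body (form data, files, json)
    Request headers:
        Cookie: used for identity verification
        Referer: tells the server which page the request came from
        User-Agent: tells the server what kind of client you are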
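To make the difference between GET and POST concrete, here is a small sketch that only builds the pieces of a request and sends nothing. The endpoint, parameter names, and header values are invented for the example.

```python
import json
from urllib.parse import urlencode

base_url = "https://example.com/search"  # hypothetical endpoint

# GET: parameters are appended to the URL after "?" and joined with "&".
query = urlencode({"q": "python", "page": 2})
get_url = f"{base_url}?{query}"

# POST: the data travels in the request body instead of the URL.
form_body = urlencode({"user": "alice", "pwd": "secret"})        # form data
json_body = json.dumps({"user": "alice", "tags": ["a", "b"]})    # JSON

# Headers a crawler commonly sets so the request looks like a browser visit.
headers = {
    "User-Agent": "Mozilla/5.0 (example)",
    "Referer": "https://example.com/",
    "Cookie": "sessionid=abc123",
}

print(get_url)
print(form_body)
print(json_body)
```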

Response:
    Status Code:
        2xx: success
        3xx: redirection
    Response headers:
        Location: the address to redirect to
        Set-Cookie: sets a cookie
    Response body:
        1. HTML code
        2. Binary data: images, video, music
        3. JSON
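A rough sketch of how a crawler might interpret these response fields, using only the standard library. The header values are invented, and `describe_status` is a hypothetical helper that covers only the two status families listed above.

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin


def describe_status(code):
    """Classify a status code into the families listed above."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    return "other"


request_url = "https://example.com/account/profile"   # made-up values
response_headers = {
    "Location": "/login?next=/account/profile",
    "Set-Cookie": "sessionid=abc123; Path=/; HttpOnly",
}

# Location may be relative, so resolve it against the URL that was requested.
next_url = urljoin(request_url, response_headers["Location"])

# Set-Cookie carries the cookie name and value plus attributes such as Path.
cookie = SimpleCookie()
cookie.load(response_headers["Set-Cookie"])

print(describe_status(302), next_url)
print(cookie["sessionid"].value, cookie["sessionid"]["path"])
```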

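Using common request libraries, parsing libraries, and databases

Request libraries

requests
    Install: pip install requests
    Requests:
        GET request:
            response = requests.get(...)
            Parameters:
                url: the request URL
                headers = {}  a Cookie header set here takes precedence over the cookies argument
                cookies = {}
                params = {}  query-string parameters
                proxies = {'http': 'http://ip:port'}  proxy
                timeout = 0.5  timeout in seconds
                allow_redirects = False  do not follow redirects
        POST request:
            response = requests.post(...)
            Parameters:
                url: the request URL
                headers = {}
                cookies = {}
                data = {}  form data
                json = {}  may also be a list or a string
                files = {'files': open('filename', 'rb')}
                timeout = 0.5
                allow_redirects = False
        Requests that keep cookies automatically:
            session = requests.session()
            r = session.get(...)
            r = session.post(...)
            Extra: saving cookies to a local file
            import http.cookiejar as cookielib
            session.cookies = cookielib.LWPCookieJar()
            session.cookies.save(filename='filename')  save
            session.cookies.load(filename='filename')  load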
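The cookie-saving steps can be tried without a network connection. In the sketch below the cookie is added by hand with `requests.cookies.create_cookie`, standing in for one a server would normally set; the file location and cookie values are made up. One detail that is easy to miss: cookies without an expiry time are skipped by `save()` and `load()` unless `ignore_discard=True` is passed.

```python
import http.cookiejar as cookielib
import os
import tempfile

import requests
from requests.cookies import create_cookie

cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.txt")

session = requests.Session()
session.cookies = cookielib.LWPCookieJar(filename=cookie_file)

# Normally the server sets cookies during session.get() / session.post().
# Here one is added by hand so the sketch runs offline.
session.cookies.set_cookie(
    create_cookie("sessionid", "abc123", domain="example.com")
)

# Session cookies have no expiry and are dropped unless ignore_discard=True.
session.cookies.save(ignore_discard=True)

# Later, or in another process: load the file into a fresh session.
restored = requests.Session()
restored.cookies = cookielib.LWPCookieJar(filename=cookie_file)
restored.cookies.load(ignore_discard=True)

saved = {c.name: c.value for c in restored.cookies}
print(saved)
```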
    
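    Response:
        r.url: the final URL
        r.text: the body as text
        r.encoding: the encoding used to decode r.text
        r.content: the body as bytes
        r.json(): the body parsed as JSON
        r.status_code: the status code
        r.headers: the response headers
        r.cookies: cookies set by the response
        r.history: the redirect responses that came before the final one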
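To see what the request arguments turn into, a request can be prepared without being sent. The sketch below does that for a GET and a POST against made-up endpoints, and then defines a hypothetical `fetch` helper that reads the response attributes listed above; `fetch` needs network access and is therefore defined but not called here.

```python
import requests

# Build the requests without sending them, to inspect what each argument becomes.
get_req = requests.Request(
    "GET",
    "https://example.com/search",           # hypothetical endpoint
    params={"q": "python", "page": 2},
    headers={"User-Agent": "Mozilla/5.0 (example)"},
).prepare()

post_req = requests.Request(
    "POST",
    "https://example.com/api/items",        # hypothetical endpoint
    json={"name": "demo"},
).prepare()


def fetch(url):
    """Send a GET and summarize the response (requires network access)."""
    r = requests.get(url, timeout=5, allow_redirects=True)
    return {
        "final_url": r.url,
        "status": r.status_code,
        "content_type": r.headers.get("Content-Type"),
        "redirected_from": [h.url for h in r.history],
        "text_length": len(r.text),
    }


print(get_req.url)                       # params end up in the query string
print(post_req.headers["Content-Type"])  # json= sets the content type
print(post_req.body)                     # and serializes the body
```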

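Common parsing syntax

CSS selectors

1. Class selector:
    .class {}
2. ID selector:
    #id {}
3. Tag selector:
    tag {}
4. Descendant selector:
    tag tag {}
5. Child selector:
    tag>tag
6. Attribute selectors:
    [attr] {}       elements that have the attribute
    [attr=value]    attribute value equals value
    [attr^=value]   attribute value starts with value
    [attr$=value]   attribute value ends with value
    [attr*=value]   attribute value contains value
7. Group selector:
    selector1, selector2...   or: matches elements that satisfy any of the selectors
8. Compound (multi-condition) selector:
    selector1selector2...     and: selectors written together with no space, e.g. a.item, match elements that satisfy all of them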

requests-html

Install: pip install requests-html
Requests:
    from requests_html import HTMLSession
    session = HTMLSession()
    Parameters (passed to HTMLSession):
        browser_args = [
            '--no-sandbox',
            '--user-agent=xxxxx'
        ]
    response = session.request(method='...', url=...)
    response = session.get(...)
    response = session.post(...)
The response object and the request parameters are the same as in the requests module

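Parsing

Attributes of the html object

r.html.absolute_links   links in absolute form (starting with http); relative links are made absolute and duplicates are removed
r.html.links            links as they appear in the page
r.html.base_url         base URL
r.html.html             the HTML source as text
r.html.text             all the text content on the page
r.html.encoding         encoding
r.html.raw_html         raw HTML (bytes)
r.html.pq               PyQuery object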

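Methods of the html object

r.html.find('css selector')  [Element, Element, ...]
r.html.find('css selector', first=True)  a single Element
    element.absolute_links  absolute links
    element.attrs  a dict of attribute names and values
    element.find
    element.search
    element.text
r.html.search('template')  Result object
    r.html.search('xxx{}yyy{}')[0]  the value captured by the first pair of braces in the template
    r.html.search('xxx{name}yyy{pwd}')['name']  the value captured by the named field
r.html.search_all('template')  a list of Result objects
r.html.render(...)
    Parameters:
        script = '''
            ()=>{
                js code
                js code
            }
        '''                       JavaScript to inject
        scrolldown: n             number of times to page down
        sleep: n                  seconds to sleep after the initial render
        keep_page: True/False     True keeps the page object so you can interact with the browser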
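script = '''
    ()=>{
        Object.defineProperties(navigator, {
            webdriver: {
                get: () => undefined
            }
        })
    }
'''
In the automated browser navigator.webdriver is true; this script changes it to undefined, the same as in a normal browser.

Interacting with the browser: r.html.page.XXX

import asyncio

try:
    # keep_page=True keeps r.html.page available after rendering
    r.html.render(script=script, sleep=10, keep_page=True)

    async def main():
        # page methods are coroutines, so they must be awaited
        await r.html.page.screenshot({'path': '1.png'})

    asyncio.get_event_loop().run_until_complete(main())
finally:
    session.close()
This runs a Python coroutine.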
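For readers new to coroutines, the pattern of driving one from ordinary synchronous code can be seen in isolation. In this sketch `take_screenshot` is a hypothetical stand-in for a page method such as `screenshot`: it only simulates a short wait, so no browser is needed.

```python
import asyncio


async def take_screenshot(path):
    # Stand-in for `await r.html.page.screenshot({'path': path})`;
    # it only simulates the wait so the sketch runs without a browser.
    await asyncio.sleep(0.01)
    return path


async def main():
    # Awaiting inside a coroutine is how each page.* call is driven.
    return await take_screenshot("1.png")


# Synchronous code hands the coroutine to an event loop and blocks until it finishes.
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()

print(result)
```

This sketch creates its own loop because nothing else is running. With a real page object, the coroutine presumably has to run on the loop the browser was started on, which is why the snippet above calls `asyncio.get_event_loop()` instead.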

r.html.page.screenshot({'path': 'path', 'clip': {'x': 200, 'y': 200, 'width': 400, 'height': 400}})
r.html.page.evaluate('''
    ()=>{
        // js code
        var a = document.querySelector("#list")
        return {'x': a.offsetLeft}  // get the element's position
    }
''')
r.html.page.cookies()  get the cookies
r.html.page.type('css selector', 'text', {'delay': 1000})  type text into an input box, with a 1 second (1000 ms) delay per character
r.html.page.click('css selector')   click
r.html.page.focus('css selector')   focus
r.html.page.hover('css selector')   hover
r.html.page.waitForSelector('css selector')  wait for the element to load
r.html.page.waitFor(1000)             wait 1 second

Keyboard events

r.html.page.keyboard.down('Shift')
r.html.page.keyboard.up('Shift')
r.html.page.keyboard.press('ArrowLeft', {'delay': 100})
r.html.page.keyboard.type('111', {'delay': 100})

Mouse events

r.html.page.mouse.click(x, y, {
    'button': 'left',
    'clickCount': 1,
    'delay': 0
})
r.html.page.mouse.down({'button': 'left'})
r.html.page.mouse.up({'button': 'left'})
r.html.page.mouse.move(x, y, {'steps': 100})

Reposted from: https://www.cnblogs.com/yanminggang/p/11347938.html
