Splash-based crawler (02)

function main(splash, args)
  -- Visit the JD.com home page
  splash:go("https://www.jd.com")
  -- Wait 0.5 seconds
  splash:wait(0.5)
  -- Read the page title and store it in a local variable
  local title=splash:evaljs("document.title")
  -- Return the title as a table (dictionary)
  return {
    title=title
  }
end

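Click Render me to run the Lua script and see the output.

PS: main can return data in several forms, such as a plain string or a table of key-value pairs (a dictionary).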
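
On the client side the two return forms arrive differently. As far as I understand Splash's behaviour, a returned string comes back as plain text and a returned table comes back as JSON. The sketch below assumes exactly that; decode_result is a hypothetical helper (not part of Splash), and the responses are simulated so it runs without a Splash server.

```python
import json


def decode_result(content_type: str, body: bytes):
    """Decode a script result based on its Content-Type header.

    Assumption: a Lua table is served as JSON and a plain string as text.
    """
    text = body.decode("utf-8")
    if content_type.split(";")[0].strip().lower() == "application/json":
        return json.loads(text)
    return text


# Simulated responses (hand-written, not captured from Splash).
table_result = decode_result("application/json", b'{"title": "JD.COM"}')
string_result = decode_result("text/plain; charset=utf-8", b"JD.COM")
print(table_result["title"], string_result)
```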

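Asynchronous processing

The go method loads a page asynchronously, but it does not accept a callback. After calling go, call wait to pause for a moment so that the page has time to finish loading.

Example: iterate over a Lua array of hostnames, build the full URL for each one, visit it with go, and take a screenshot of each page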

function main(splash, args)
  -- Lua array whose elements are partial URLs (hostnames)
  local urls={"www.baidu.com","www.jd.com"}
  -- Table that maps each hostname to its screenshot
  local results={}
  -- Iterate over the URL array
  for index,url in ipairs(urls) do
    -- Visit the URL; if ok is nil the request failed and reason holds the cause
    local ok,reason=splash:go("https://"..url)
    if ok then
      -- Wait 1 second
      splash:wait(1)
      -- Take a screenshot of the current page
      results[url]=splash:png()
    end
  end
  return results
end

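The result contains two screenshots.

PS:

Lua arrays and dictionaries (tables) are both written with braces {...}
go runs asynchronously, so use its return value ok to check whether the request succeeded: if ok is nil, the request failed and reason describes the error
wait is similar to sleep in Python: it pauses the script for the given number of seconds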

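Splash object properties

main takes two parameters. The first, splash, plays a role similar to the WebDriver object in Selenium: its properties and methods control how pages are loaded. This section covers the commonly used properties of the Splash object.

1. args property

This property gives access to the parameters supplied with the request, such as the URL.
For a GET request it holds the query-string parameters; for a POST request it can be used to read the data sent in the request body.

Example: read the URL through the args property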

function main(splash)
  local url=splash.args.url
end

main can also take args directly as its second parameter

function main(splash,args)
  local url=args.url
end
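
From the caller's side, the values in args come from the request that starts the script. The sketch below builds a JSON request body in which url sits next to lua_source; it only constructs the payload and sends nothing. build_execute_payload is a hypothetical helper name, and the claim that JSON POST fields end up in splash.args is my reading of the Splash documentation.

```python
import json

LUA_SOURCE = """
function main(splash, args)
  -- args.url is expected to come from the "url" field of the request
  assert(splash:go(args.url))
  return splash:evaljs("document.title")
end
"""


def build_execute_payload(lua_source: str, **script_args) -> str:
    """Return a JSON request body carrying the script and its arguments."""
    payload = {"lua_source": lua_source}
    payload.update(script_args)
    return json.dumps(payload, ensure_ascii=False)


body = build_execute_payload(LUA_SOURCE, url="https://www.jd.com")
print(body)
```

With a local Splash instance, such a body could be posted to the execute endpoint described later in these notes, using the Content-Type header application/json.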

2. js_enabled property

This property controls whether JavaScript may run; the default is true.
When it is set to false, evaljs can no longer execute JavaScript code.

function main(splash, args)
  splash:go("https://www.jd.com")
  -- Disable JavaScript execution
  splash.js_enabled=false
  -- This call raises an error
  local title=splash:evaljs("document.title")
  return {
    title=title
  }
end

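3. resource_timeout property

This property sets the timeout for loading network resources, in seconds.
A value of 0 or nil (the equivalent of None in Python) disables the timeout check.
Set it when pages load slowly so that the script does not wait too long.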

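function main(splash, args)
  -- Deliberately tiny timeout so that the request fails
  splash.resource_timeout=0.01
  -- go only returns nil on failure; assert turns that into an error
  assert(splash:go("https://www.jd.com"))
  local title=splash:evaljs("document.title")
  return {
    title=title
  }
end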

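4. images_enabled property

This property controls whether images are loaded. The default is true. Setting it to false skips images, which makes pages load faster. Be aware that disabling images can change the page layout and may affect JavaScript rendering: without the images, the height of the DOM nodes that contain them changes, which shifts the position of other nodes, so any JavaScript that works with image nodes may behave differently.

Also, Splash uses a cache. If images were loaded before image loading was disabled, they may still appear when the page is reloaded. Restarting Splash clears this up.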

function main(splash, args)
  -- Disable image loading
  splash.images_enabled=false
  splash:go("https://www.jd.com")
  -- png is a method, so it must be called with a colon
  local png=splash:png()
  return {
    png=png
  }
end

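5. plugins_enabled property

This property controls whether browser plugins (such as the Flash plugin) are enabled. The default is false, meaning plugins are disabled; set it to true to enable them.

6. scroll_position property

This property scrolls the page vertically or horizontally, and it is used fairly often. Its value is a table: the key x is the horizontal scroll position and the key y is the vertical scroll position.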

function main(splash, args)
  splash:go("https://www.jd.com")
  -- Scroll the JD.com home page down to vertical position 500
  splash.scroll_position={y=500}
  return splash:png()
end

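go method

This method requests a URL. The HTTP method can be specified; GET and POST are currently supported, and GET is the default.
go can also send request headers, form data, and other request data.
Method prototype:
ok,reason=splash:go{url,baseurl=nil,headers=nil,http_method="GET",body=nil,formdata=nil}

Parameters:

url: the URL to request;
baseurl: optional, empty by default; the base URL used to resolve relative resource paths;
headers: optional, empty by default; the request headers;
http_method: optional, GET by default; POST is also supported;
body: optional, empty by default; the data for a POST request, as a string;
formdata: optional, empty by default; the form data for a POST request. When it is sent, the content-type request header is set to application/x-www-form-urlencoded and the data is converted to urlencoded format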
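
To make "urlencoded format" concrete, here is a small Python sketch using only the standard library. It shows what a form table looks like once encoded; the field names other than name=Bill are made up for illustration, and this is not Splash's own encoder.

```python
from urllib.parse import parse_qs, urlencode

# Illustrative form data; "city" and "note" are invented fields.
form = {"name": "Bill", "city": "New York", "note": "a&b=c"}

# Spaces become "+", and reserved characters such as & and = are percent-escaped.
encoded = urlencode(form)
print(encoded)  # name=Bill&city=New+York&note=a%26b%3Dc

# Decoding recovers the original values.
decoded = {key: values[0] for key, values in parse_qs(encoded).items()}
print(decoded == form)  # True
```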

Example: use go to send a POST request to http://********/post and return the HTML code and the HAR data

function main(splash, args)
  local ok,reason=splash:go{"http://**********g/post",http_method="POST",body="name=Bill"}
  if ok then
    return {
      html=splash:html(),
      har=splash:har()
    }
  end
end

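wait method

This method controls how long to wait on the page. Method prototype:
ok,reason=splash:wait{time,cancel_on_redirect=false,cancel_on_error=true}

Parameters:

time: number of seconds to wait
cancel_on_redirect: optional, false by default; when true, waiting stops if a redirect happens and the redirect result is returned
cancel_on_error: optional, true by default; when true, waiting stops if a page load error occurs
The return value is always the pair ok and reason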

function main(splash)
  splash:go("https://www.baidu.com")
  -- Wait 5 seconds
  splash:wait(5)
  return splash:html()
end

This script waits 5 seconds and then returns the HTML of the Baidu home page

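jsfunc method

This method wraps a JavaScript function so that it can be called directly from Lua. The JavaScript source is passed as a string, usually a Lua long string enclosed in double square brackets [[ ]], so in effect it converts a JavaScript function into a Lua function

Example: use jsfunc to call a JavaScript function that counts the a nodes on the current page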

function main(splash, args)
  -- Wrap a JavaScript function as a Lua function
  local get_a_count=splash:jsfunc([[
  function(){
    var body=document.body;
    var a_list=body.getElementsByTagName("a");
    return a_list.length;
  }
  ]])
  -- Load the page; the JavaScript runs later, when get_a_count() is called
  splash:go("https://www.jd.com")
  -- Return the number of a nodes
  return ("There are %s a nodes"):format(get_a_count())
end

evaljs method

This method executes JavaScript code and returns the result of the last JavaScript statement

function main(splash, args)
  splash:go("https://www.jd.com")
  local title=splash:evaljs("document.title")
  return title
end

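Here evaljs executes document.title, a JavaScript expression that returns the page title

runjs method

This method also executes JavaScript code. It works like evaljs but is better suited to performing actions or declaring functions, since it does not return the result of the code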

function main(splash, args)
  -- Define the JavaScript function get_name
  splash:runjs("get_name=function(){return 'Mike'}")
  -- Call the previously defined JavaScript function get_name
  local name=splash:evaljs("get_name()")
  return name
end

The output is Mike

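autoload method

This method registers JavaScript code to be loaded automatically on every page visit. Method prototype:
ok,reason=splash:autoload{source_or_url,source=nil,url=nil}

Parameters:

source_or_url: JavaScript code or the URL of a JavaScript library
source: JavaScript code
url: the URL of a JavaScript library

autoload only loads the JavaScript code or library; it does not call anything. To run the loaded code, use evaljs or runjs

Example: use autoload to define two JavaScript functions and call them through evaljs, then use autoload again to load the jQuery library and count the a nodes on the page with the jQuery API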

function main(splash, args)
  -- The long string holds JavaScript, so comments inside it must use //
  splash:autoload([[
    // Return the page title
    function get_document_title(){
      return document.title;
    }
    // Return the number of div nodes on the page
    function get_div_count(){
      var body=document.body;
      var div_list=body.getElementsByTagName('div');
      return div_list.length;
    }
  ]])
  -- Load the jQuery library
  splash:autoload{url="https://code.jquery.com/jquery-3.3.0.min.js"}
  splash:go("https://www.jd.com")
  -- Get the jQuery version
  local version=splash:evaljs("$.fn.jquery")
  -- Get the number of a nodes on the page
  local a_count=splash:evaljs("$('a').length;")
  return {
    title=splash:evaljs("get_document_title()"),
    div_count=splash:evaljs("get_div_count()"),
    jquery_version=version,
    a_count=a_count
  }
end

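call_later method

This method schedules a task to run after a given delay. It returns a timer object, and the task can be cancelled with the timer's cancel method before it runs

Example: use call_later to define a delayed task that takes a screenshot 2 seconds later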

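function main(splash, args)
  local result={}
  splash:go("https://www.jd.com")
  splash:wait(0.5)
  result["png1"]=splash:png()
  -- Define the delayed task
  local timer=splash:call_later(function()
    result["png2"]=splash:png()
  end,2) -- 2 is the delay in seconds
  splash:wait(3.0)
  return result
end

The result contains two screenshots of the JD.com home page. If the delay 2 is changed to 3 or more, only one screenshot is returned: main exits after the 3-second wait, before the delayed task has had a chance to run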
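
The schedule-then-cancel idea can be tried outside Splash with Python's threading.Timer. This is only an analogy: Splash timers run inside Splash's own event loop rather than in threads, and take_second_shot is a made-up stand-in for the screenshot callback.

```python
import threading

result = {}


def take_second_shot():
    # Stand-in for: result["png2"] = splash:png()
    result["png2"] = "second screenshot"


# A delayed task that is allowed to fire.
timer = threading.Timer(0.05, take_second_shot)
timer.start()
timer.join()  # block until the callback has run
fired = dict(result)

# A delayed task that is cancelled before its delay elapses.
result.clear()
cancelled = threading.Timer(5.0, take_second_shot)
cancelled.start()
cancelled.cancel()
cancelled.join()  # returns promptly; the callback never runs
print(fired, result)
```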

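http_get method

This method sends an HTTP GET request. Prototype:

response=splash:http_get{url,headers=nil,follow_redirects=true}

Parameters:

url: the URL to request
headers: optional, empty by default; the request headers
follow_redirects: optional, true by default; whether to follow redirects automatically

Example: use http_get to send an HTTP GET request and return the response body and other information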

function main(splash, args)
  local treat=require("treat")
  local response=splash:http_get("http://httpbin.org/get")
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

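http_post method

Like http_get, but it sends an HTTP POST request and takes an additional body parameter. Prototype:

response=splash:http_post{url,headers=nil,follow_redirects=true,body=nil}

Parameters:

url: the URL to request
headers: optional, empty by default; the request headers
follow_redirects: optional, true by default; whether to follow redirects automatically
body: optional, empty by default; the request body, for example form data

Example: use http_post to send an HTTP POST request and return the response body and other information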

function main(splash, args)
  local treat=require("treat")
  local json=require("json")
  local response=splash:http_post{"http://httpbin.org/post",
  	body=json.encode({name="Mike",age=30,salary=1234.5}),
    headers={["content-type"]="application/json"}
  }
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

set_content method

Sets the content of the page

function main(splash, args)
  splash:set_content("<html><body><h1>hello world</h1></body></html>")
  return splash:png()
end

html method

Returns the source code of the page

function main(splash, args)
  splash:go("https://www.jd.com")
  return splash:html()
end

png method

Returns a screenshot of the page in PNG format

function main(splash, args)
  splash:go("https://www.jd.com")
  return splash:png()
end

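jpeg method

Like png, but returns the screenshot in JPEG format

har method

HAR stands for HTTP Archive format, a JSON-based archive file format that is mostly used to record the interaction between a web browser and a website. The file extension is usually .har. The HAR specification defines an archive format for HTTP transactions, which browsers can use to export detailed performance data about page loads. The har method returns this data for the pages loaded by the script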

function main(splash, args)
  splash:go("https://www.jd.com")
  return splash:har()
end
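
Because HAR is plain JSON, the returned data is easy to post-process on the Python side. A minimal sketch, assuming the usual HAR layout (log, entries, request, response); summarize_har is a hypothetical helper and the sample below is hand-written, not real Splash output.

```python
def summarize_har(har: dict) -> list:
    """Return (method, url, status) for every entry in a HAR document."""
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        rows.append((request.get("method"), request.get("url"), response.get("status")))
    return rows


# Hand-written sample in HAR shape.
sample_har = {
    "log": {
        "version": "1.2",
        "entries": [
            {"request": {"method": "GET", "url": "https://www.jd.com/"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "https://www.jd.com/favicon.ico"},
             "response": {"status": 404}},
        ],
    }
}

summary = summarize_har(sample_har)
for row in summary:
    print(row)
```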

Other methods

1. url method

Returns the URL currently being visited

2. get_cookies method

Returns the cookies of the current page

3. add_cookie method

Adds a cookie for the current page. Prototype:

cookies=splash:add_cookie{name,value,path=nil,domain=nil,expires=nil,httpOnly=nil,secure=nil}

The parameters correspond to the individual attributes of the cookie

function main(splash, args)
  splash:add_cookie{"name","Mike","/",domain="http://example.com"}
  splash:go("http://example.com")
  return splash:get_cookies()
end

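4. clear_cookies method

Removes all cookies

5. get_viewport_size method

Returns the current viewport size of the browser page (width and height)

6. set_viewport_size method

Sets the viewport size of the browser page (width and height). Prototype:

splash:set_viewport_size(width,height)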

function main(splash, args)
    splash:set_viewport_size(400,800)
    splash:go("https://www.jd.com")
    return splash:png()
end

7. set_viewport_full method

Resizes the browser viewport so that it fits the whole page

8. set_user_agent method

Sets the browser's User-Agent

function main(splash, args)
  splash:set_user_agent("test agent")
  splash:go("http://httpbin.org/get")
  return splash:html()
end

9. set_custom_headers method

Sets the request headers

function main(splash, args)
  splash:set_custom_headers({
    ["User-Agent"]="test agent",
    ["Custom-Header"]="Value"
  })
  splash:go("http://httpbin.org/get")
  return splash:html()
end

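CSS selectors

select method

This method finds the first node that matches a CSS selector, which is passed as its argument. If several nodes match, only the first one is returned

Example: use select to find the search box on the JD.com home page and type a search keyword into it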

function main(splash, args)
  splash:go("https://www.jd.com")
  -- Find the node whose id attribute is key
  local input=splash:select("#key")
  -- Type "python从菜鸟到高手" into the search box
  input:send_text("python从菜鸟到高手")
  splash:wait(2)
  return splash:png()
end

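select_all method

Finds all nodes that match the selector

Example: use select_all to find every a node on the JD.com home page and return the text content and href attribute of each one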

function main(splash, args)
  local treat=require("treat")
  splash:go("https://www.jd.com")
  splash:wait(0.5)
  -- Find all a nodes on the page
  local a_list=splash:select_all('a')
  local results={}
  -- Iterate over the a nodes and collect the text and href of each
  for index,a in ipairs(a_list) do
    results[index]={text=a.node.innerHTML,href=a.node.attributes.href}
  end
  return treat.as_array(results)
end

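Simulating mouse and keyboard actions

Example: use send_text to type a keyword into the JD.com search box, then either click the search button with mouse_click or press Enter with send_keys("<Return>")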

function main(splash)
    splash:go("https://www.jd.com")
    local input = splash:select("#key")
    input:send_text("Python从菜鸟到高手")
    local button = splash:select("#search > div > div.form > button")
    -- Click the search button
    button:mouse_click()
    splash:wait(1)
    return splash:png()
end

function main(splash, args)
    splash:go("https://www.jd.com")
    local input=splash:select("#key")
    input:send_text("python从菜鸟到高手")
    -- Press Enter
    input:send_keys("<Return>")
    splash:wait(1)
    return splash:png()
end

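Splash HTTP API

The Splash HTTP API is a set of URLs exposed by Splash. Passing parameters to these URLs performs the various page rendering tasks

1. render.html

This endpoint returns the HTML of a page after JavaScript rendering. Its URL is

http://localhost:8050/render.html

It accepts a parameter named url that specifies the page to render. It can be tested with curl

curl "http://localhost:8050/render.html?url=https://www.jd.com"

Running this command in a terminal returns the HTML of the home page

Calling it from Python:

With the wait parameter the response takes longer to arrive: wait=3 means Splash waits 3 seconds before returning the HTML of the JD.com home page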

import requests
url="http://localhost:8050/render.html?url=https://www.jd.com&wait=3"
response=requests.get(url)
print(response.text)
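
Writing the query string by hand works for simple targets, but it breaks as soon as the target URL itself contains ? or &. A minimal sketch that lets the standard library do the escaping; render_url is a hypothetical helper name, and the host and port are simply the ones used in these notes.

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050"  # address used throughout these notes


def render_url(endpoint: str, **params) -> str:
    """Build a Splash render URL with properly escaped query parameters."""
    return f"{SPLASH}/{endpoint}?{urlencode(params)}"


html_url = render_url("render.html", url="https://www.jd.com", wait=3)
print(html_url)
# http://localhost:8050/render.html?url=https%3A%2F%2Fwww.jd.com&wait=3
```

The same helper covers the other render endpoints by changing the endpoint name and parameters.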

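2. render.png

This endpoint returns a screenshot of the page. It accepts a few more parameters than render.html: because the result is an image, the width and height of the screenshot can be set with width and height. Test it with curl:

curl "http://localhost:8050/render.png?url=https://www.jd.com" --output jd.png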

import requests
url="http://localhost:8050/render.png?url=https://www.jd.com&wait=3&width=800&height=500"
response=requests.get(url)
with open("jd.png",'wb') as f:
    f.write(response.content)

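This returns and saves a PNG image of 800 x 500 pixels

3. render.jpeg

Used like render.png, except that it returns binary data in JPEG format. It has one more parameter than render.png, quality, which sets the image quality

4. render.har

This endpoint returns the HAR data of the page load. It can be tested with curl in the same way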

import requests
url="http://localhost:8050/render.har?url=https://www.jd.com&wait=3"
response=requests.get(url)
print(response.text)

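5. render.json

This endpoint combines the features of all the endpoints above: it can return the HTML code, the HAR data, a PNG screenshot, and so on

Without any extra parameters, render.json returns the requested URL, the page title, the page size, and similar information in JSON format

curl "http://localhost:8050/render.json?url=https://httpbin.org"

To get the HTML code, the HAR data, and a PNG screenshot at the same time, set the html, har, and png parameters to 1. The URL must be quoted, otherwise the shell treats each & as a command separator

curl "http://localhost:8050/render.json?url=https://httpbin.org&html=1&har=1&png=1"

Binary data in the result is returned Base64-encoded. A PNG screenshot, for example, is returned as the png property of the JSON object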
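
A small sketch of decoding such a png property in Python. extract_png is a hypothetical helper, and the response is simulated: the payload merely starts with the 8-byte PNG signature so the example runs without Splash.

```python
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def extract_png(json_text: str) -> bytes:
    """Decode the Base64 'png' field of a render.json-style response."""
    data = json.loads(json_text)
    png_bytes = base64.b64decode(data["png"])
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG image")
    return png_bytes


# Simulated response; not a real image and not real Splash output.
fake_png = PNG_SIGNATURE + b"not a real image"
fake_response = json.dumps({"png": base64.b64encode(fake_png).decode("ascii")})

png_bytes = extract_png(fake_response)
print(len(png_bytes))
```

Checking the signature before writing the file is a cheap way to catch an error page or an empty field early.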

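6. execute

This endpoint is very powerful. The endpoints above can only do a fixed set of things through URL parameters, whereas execute connects Python with Lua: in effect, Lua scripts are run directly from Python code

The lua_source parameter of execute carries a Lua script. Splash runs the script and returns the result to Python

Example: run two Lua scripts through execute. The first returns a simple string. The second visits https://*****.com and returns a PNG screenshot to Python, which saves it as the image file weather.png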

import requests
from urllib.parse import quote
lua="""
function main(splash)
    return "世界 你好"
end
"""
url="http://localhost:8050/execute?lua_source="+quote(lua)
response=requests.get(url)
print(response.text)

lua="""
function main(splash)
    splash:go("https://weather.com")
    splash:wait(3)
    return {
        html=splash:html(),
        png=splash:png()
    }
end
"""
url="http://localhost:8050/execute?lua_source="+quote(lua)
response=requests.get(url)

import json
import base64
# Parse the JSON text of the response into a Python object
json_obj=json.loads(response.text)
# Get the Base64-encoded image data
png_base64=json_obj['png']
png_bytes=base64.b64decode(png_base64)
# Save the screenshot as weather.png
with open('weather.png','wb') as f:
    f.write(png_bytes)

After both scripts have run, the program has printed
世界 你好 and created a file named weather.png