Web Scraping: Parsing with GraphQuery

Parsing with GraphQuery

Suppose the data structure we want to extract looks like this:

{
    title
    pictype
    number
    type
    metadata {
        size
        volume
        mode
        resolution
    }
    author
    images []
    tags []
}
 

The corresponding GraphQuery code looks like this:

{
    title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
    pictype `css(".pic-type")`
    number `css(".detailBtn-down");attr("data-id")`
    type `regex("文件格式:([a-z]+)")`
    metadata `css(".main-right p")` {
        size `regex("尺寸:(.*?)像素")`
        volume `regex("体积:(.*? MB)")`
        mode `regex("模式:([A-Z]+)")`
        resolution `regex("分辨率:(\d+dpi)")`
    }
    author `css(".user-name")`
    images `css("#show-area-height img")` [
        src `attr("src")`
    ]
    tags `css(".mainRight-tagBox .fl")` [
        tag `text()`
    ]
}
 

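Comparing the two, the query simply adds a few functions wrapped in backticks to the data structure we designed. Remarkably, it fully reproduces the parsing logic we wrote above in Python and Golang, and its syntax makes the shape of the returned data much easier to read. Running this GraphQuery expression produces the following result: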

{
    "data": {
        "author": "Ice bear",
        "images": [
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
        ],
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724×6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "pictype": "原创",
        "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
        "title": "大侠海报金庸武侠水墨中国风黑白",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}
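The result arrives wrapped in an envelope with "data", "error", and "timecost" fields. As a rough sketch of how a caller might unwrap it in Python, the snippet below parses an abbreviated copy of the response above (image URLs and most tags omitted). The helper name unwrap is hypothetical, and treating an empty "error" string as success is an assumption based only on this sample, not on documented behavior.

```python
import json

# Abbreviated copy of the response shown above (image URLs and most tags omitted).
SAMPLE_RESPONSE = """
{
    "data": {
        "author": "Ice bear",
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724×6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "tags": ["大侠", "海报"],
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}
"""


def unwrap(raw):
    """Return the "data" payload; raise if "error" is non-empty.

    Assumption: an empty "error" string means the query succeeded.
    """
    envelope = json.loads(raw)
    if envelope.get("error"):
        raise ValueError("GraphQuery reported an error: " + envelope["error"])
    return envelope["data"]


data = unwrap(SAMPLE_RESPONSE)

# Every extracted value is a string, so numeric fields need explicit conversion.
width, height = (int(n) for n in data["metadata"]["size"].split("×"))
print(data["author"], data["type"], width, height)
```

Because the fields come back as strings, any numeric interpretation (such as splitting the size into width and height) is left to the caller.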
 

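GraphQuery is a text query language. It does not depend on any particular backend language and can be called from any of them; a given GraphQuery expression yields the same parsing result in every language. It has built-in XPath selectors, CSS selectors, JSONPath selectors, and regular expressions, along with a generous set of text-processing functions. Its structure is clear and readable, and it keeps the declared data structure, the parsing code, and the returned result structurally consistent.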
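To make the regular-expression steps more concrete, here is a small plain-Python sketch of what the metadata block of the query roughly corresponds to when written by hand. The sample text is made up to stand in for whatever css(".main-right p") selects on the real page, and the assumption that regex(...) yields the first capture group is inferred from the sample output rather than taken from the GraphQuery documentation.

```python
import re

# Hypothetical text standing in for the content selected by css(".main-right p").
SAMPLE_TEXT = "尺寸:4724×6299像素 体积:196.886 MB 模式:RGB 分辨率:200dpi"

# The same patterns that appear in the regex(...) calls of the query.
PATTERNS = {
    "size": r"尺寸:(.*?)像素",
    "volume": r"体积:(.*? MB)",
    "mode": r"模式:([A-Z]+)",
    "resolution": r"分辨率:(\d+dpi)",
}


def extract(text, patterns):
    """Apply each pattern to text and keep the first capture group (or None)."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


metadata = extract(SAMPLE_TEXT, PATTERNS)
print(metadata)
```

The hand-written version needs a loop and a result dictionary, whereas in the query the field names and nesting already are the result structure.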

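Project URL: github.com/storyicon/g…

GraphQuery's syntax is concise and easy to understand. Even if this is your first encounter with it, you can pick it up quickly; one of its design principles is to be intuitive. So how do we run it?

1. Calling GraphQuery from Golang

In Golang, you only need to fetch GraphQuery with go get -u github.com/storyicon/graphquery and then call it from your code: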

package main

import (
    "log"

    "github.com/axgle/mahonia"
    "github.com/parnurzeal/gorequest"
    "github.com/storyicon/graphquery"
)

// decoderConvert converts body from the named encoding (for example "gbk") to UTF-8.
func decoderConvert(name string, body string) string {
    return mahonia.NewDecoder(name).ConvertString(body)
}

func main() {
    request := gorequest.New()
    // Errors are ignored here to keep the example short.
    _, body, _ := request.Get("http://www.58pic.com/newpic/32504070.html").End()
    body = decoderConvert("gbk", body)
    response := graphquery.ParseFromString(body, "{ title `xpath(\"/html/body/div[4]/div[1]/div/div/div[1]/text()\")` pictype `css(\".pic-type\")` number `css(\".detailBtn-down\");attr(\"data-id\")` type `regex(\"文件格式:([a-z]+)\")` metadata `css(\".main-right p\")` { size `regex(\"尺寸:(.*?)像素\")` volume `regex(\"体积:(.*? MB)\")` mode `regex(\"模式:([A-Z]+)\")` resolution `regex(\"分辨率:(\\d+dpi)\")` } author `css(\".user-name\")` images `css(\"#show-area-height img\")` [ src `attr(\"src\")` ] tags `css(\".mainRight-tagBox .fl\")` [ tag `text()` ] }")
    log.Println(response)
}
 

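Our GraphQuery expression is passed as a single line in the second argument of graphquery.ParseFromString, and the result is exactly as expected.

2. Calling GraphQuery from Python

In Python and other backend languages, calling GraphQuery requires starting its service first. The service is precompiled for Windows, macOS, and Linux and can be downloaded from GraphQuery-http.
Once the service is unpacked and running, we can use GraphQuery from any backend language to parse any document in a graph-like way. Sample Python code: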

import requests

def GraphQuery(document, expr):
    # Send the document and the query to the local GraphQuery HTTP service.
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text

response = requests.get("http://www.58pic.com/newpic/32504070.html")
# A raw string keeps the backslash in \d intact.
conseq = GraphQuery(response.text, r"""
    {
        title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
        pictype `css(".pic-type")`
        number `css(".detailBtn-down");attr("data-id")`
        type `regex("文件格式:([a-z]+)")`
        metadata `css(".main-right p")` {
            size `regex("尺寸:(.*?)像素")`
            volume `regex("体积:(.*? MB)")`
            mode `regex("模式:([A-Z]+)")`
            resolution `regex("分辨率:(\d+dpi)")`
        }
        author `css(".user-name")`
        images `css("#show-area-height img")` [
            src `attr("src")`
        ]
        tags `css(".mainRight-tagBox .fl")` [
            tag `text()`
        ]
    }
""")
print(conseq)
 

The output is:

{
    "data": {
        "author": "Ice bear",
        "images": [
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
            "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
        ],
        "metadata": {
            "mode": "RGB",
            "resolution": "200dpi",
            "size": "4724×6299",
            "volume": "196.886 MB"
        },
        "number": "32504070",
        "pictype": "原创",
        "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
        "title": "大侠海报金庸武侠水墨中国风黑白",
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}
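One difference between the two examples is worth noting: the Go version converts the page from GBK explicitly, while the Python version relies on requests to guess the encoding. The sketch below shows one way the Python side could decode the bytes explicitly and build the same form-encoded POST using only the standard library. The endpoint and the "document" and "expression" field names come from the example above; the helper names decode_page and build_request are hypothetical, and assuming that the page is GBK-encoded and that the service replies in UTF-8 is exactly that, an assumption.

```python
from urllib import parse, request

# Address of the local GraphQuery HTTP service used in the example above.
GRAPHQUERY_ENDPOINT = "http://127.0.0.1:8559"


def decode_page(raw_bytes, encoding="gbk"):
    """Decode raw page bytes explicitly, mirroring the Go example's GBK conversion."""
    return raw_bytes.decode(encoding, errors="replace")


def build_request(document, expr, endpoint=GRAPHQUERY_ENDPOINT):
    """Build (but do not send) a form-encoded POST carrying the document and query."""
    form = parse.urlencode({"document": document, "expression": expr}).encode("utf-8")
    return request.Request(endpoint, data=form, method="POST")


# A tiny stand-in for page bytes as they might arrive from the site.
page = decode_page("文件格式:psd".encode("gbk"))
req = build_request(page, '{ type `regex("文件格式:([a-z]+)")` }')

# To actually send it (requires the service to be running; UTF-8 reply is assumed):
# body = request.urlopen(req, timeout=10).read().decode("utf-8")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the payload without a running service.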
 

Further reading: https://github.com/storyicon/graphquery/wiki




  • Author contact: WeChat official account zeropython; QQ 5868037; QQ email 5868037@qq.com