Parsing with GraphQuery
Suppose the data structure we want to obtain looks like this:

    {
        title
        pictype
        number
        type
        metadata {
            size
            volume
            mode
            resolution
        }
        author
        images []
        tags []
    }

The corresponding GraphQuery code looks like this:

    {
        title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
        pictype `css(".pic-type")`
        number `css(".detailBtn-down");attr("data-id")`
        type `regex("文件格式:([a-z]+)")`
        metadata `css(".main-right p")` {
            size `regex("尺寸:(.*?)像素")`
            volume `regex("体积:(.*? MB)")`
            mode `regex("模式:([A-Z]+)")`
            resolution `regex("分辨率:(\d+dpi)")`
        }
        author `css(".user-name")`
        images `css("#show-area-height img")` [
            src `attr("src")`
        ]
        tags `css(".mainRight-tagBox .fl")` [
            tag `text()`
        ]
    }

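Comparing the two, you can see that it only adds some functions wrapped in backticks to the data structure we designed. Strikingly, it fully reproduces the parsing logic we wrote above in Python and Golang, and its syntax makes the shape of the returned data even easier to read. Running this GraphQuery produces the following result: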

    {
        "data": {
            "author": "Ice bear",
            "images": [
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
            ],
            "metadata": {
                "mode": "RGB",
                "resolution": "200dpi",
                "size": "4724×6299",
                "volume": "196.886 MB"
            },
            "number": "32504070",
            "pictype": "原创",
            "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
            "title": "大侠海报金庸武侠水墨中国风黑白",
            "type": "psd"
        },
        "error": "",
        "timecost": 10997800
    }

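The result is a small envelope: the extracted fields sit under data, next to an error string and a timecost number. As a minimal sketch of consuming it, assuming the result is available as JSON text shaped like the output above, the Python below checks error before reading data. The unwrap helper is a hypothetical name, not part of GraphQuery, and the sample is a trimmed copy of the output with the image URLs left out. The output does not state the unit of timecost, so the sketch leaves that field alone.

```python
import json

# Trimmed copy of the result shown above; the image URLs are omitted for brevity.
raw = """
{
    "data": {
        "author": "Ice bear",
        "metadata": {"mode": "RGB", "resolution": "200dpi",
                     "size": "4724×6299", "volume": "196.886 MB"},
        "number": "32504070",
        "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
        "type": "psd"
    },
    "error": "",
    "timecost": 10997800
}
"""


def unwrap(result_text):
    """Return the "data" object, raising if the "error" field is non-empty."""
    result = json.loads(result_text)
    if result.get("error"):
        raise ValueError("GraphQuery reported an error: " + result["error"])
    return result["data"]


data = unwrap(raw)
# Every extracted value is a string, so numeric fields need their own conversion.
width, height = (int(n) for n in data["metadata"]["size"].split("×"))
print(data["author"], width, height, len(data["tags"]))
```

Note that every extracted value comes back as a string, including number and size, so any numeric handling is up to the caller.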
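GraphQuery is a text query language. It does not depend on any backend language and can be called from any of them, and the same GraphQuery statement yields the same parsing result in every language. It has built-in xpath, css, and jsonpath selectors and regular expressions, along with a generous set of text-processing functions. Its structure is clear and readable, and it keeps the data structure, the parsing code, and the returned result structurally consistent.
Project: github.com/storyicon/g…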
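To make the regex step concrete, here is a rough Python sketch of what the four metadata patterns capture. It uses Python's re module rather than GraphQuery itself, and it assumes that regex(...) yields the first capture group, which is what the result above suggests. The metadata_text string is a hypothetical stand-in for the text selected by css(".main-right p"); the real page text may be laid out differently.

```python
import re

# Hypothetical text of the node selected by css(".main-right p").
metadata_text = "尺寸:4724×6299像素 体积:196.886 MB 模式:RGB 分辨率:200dpi"

# The same patterns used by the metadata sub-fields in the expression.
patterns = {
    "size": r"尺寸:(.*?)像素",
    "volume": r"体积:(.*? MB)",
    "mode": r"模式:([A-Z]+)",
    "resolution": r"分辨率:(\d+dpi)",
}


def first_group(pattern, text):
    """Return the first capture group of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


metadata = {name: first_group(p, metadata_text) for name, p in patterns.items()}
print(metadata)
```

Each child field only has to describe its own pattern, because the parent field has already narrowed the text it runs against.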
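GraphQuery's syntax is concise and easy to understand. Even if this is your first encounter with it, you can get started quickly, because one of its design principles is to be intuitive. So how do we run it?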
1. Calling GraphQuery from Golang
In Golang, you only need to fetch GraphQuery with go get -u github.com/storyicon/graphquery and then call it from your code:

    package main

    import (
        "log"

        "github.com/axgle/mahonia"
        "github.com/parnurzeal/gorequest"
        "github.com/storyicon/graphquery"
    )

    // decoderConvert converts body from the named charset (for example "gbk") to UTF-8.
    func decoderConvert(name string, body string) string {
        return mahonia.NewDecoder(name).ConvertString(body)
    }

    func main() {
        request := gorequest.New()
        _, body, errs := request.Get("http://www.58pic.com/newpic/32504070.html").End()
        if len(errs) > 0 {
            log.Fatal(errs)
        }
        // The page is served as GBK, so convert it before parsing.
        body = decoderConvert("gbk", body)
        response := graphquery.ParseFromString(body, "{ title `xpath(\"/html/body/div[4]/div[1]/div/div/div[1]/text()\")` pictype `css(\".pic-type\")` number `css(\".detailBtn-down\");attr(\"data-id\")` type `regex(\"文件格式:([a-z]+)\")` metadata `css(\".main-right p\")` { size `regex(\"尺寸:(.*?)像素\")` volume `regex(\"体积:(.*? MB)\")` mode `regex(\"模式:([A-Z]+)\")` resolution `regex(\"分辨率:(\\d+dpi)\")` } author `css(\".user-name\")` images `css(\"#show-area-height img\")` [ src `attr(\"src\")` ] tags `css(\".mainRight-tagBox .fl\")` [ tag `text()` ] }")
        log.Println(response)
    }

Our GraphQuery expression is passed as a single line, as the second argument to graphquery.ParseFromString, and the result is exactly what we expected.
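The expression has to go into a double-quoted Go string because it contains backticks, and a Go raw string literal cannot contain a backtick. That means every double quote and backslash inside the expression must be escaped. The Python sketch below shows one way to produce such a literal from the readable multi-line form. Both helper names are hypothetical, and the sketch assumes that no quoted argument inside the expression spans more than one line.

```python
def to_single_line(expr):
    """Join the non-empty lines of a GraphQuery expression with single spaces."""
    return " ".join(line.strip() for line in expr.splitlines() if line.strip())


def to_go_string_literal(expr):
    """Escape backslashes and double quotes, then wrap the text in double quotes."""
    escaped = expr.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# A shortened expression; the full one from the article works the same way.
expr = r"""
{
    metadata `css(".main-right p")` {
        resolution `regex("分辨率:(\d+dpi)")`
    }
}
"""

print(to_go_string_literal(to_single_line(expr)))
```

Backslashes are escaped first so that the backslashes added for the quotes are not doubled again.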
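2. Calling GraphQuery from Python
In Python and other backend languages, calling GraphQuery requires starting its service first. The service is already compiled for Windows, macOS, and Linux; download it from GraphQuery-http.
Once the service is unpacked and running, we can happily use GraphQuery from any backend language to parse any document in this graph-like way. Sample Python code: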

    import requests


    def GraphQuery(document, expr):
        # Send the document and the expression to the local GraphQuery service.
        response = requests.post("http://127.0.0.1:8559", data={
            "document": document,
            "expression": expr,
        })
        return response.text


    response = requests.get("http://www.58pic.com/newpic/32504070.html")
    # A raw string keeps backslashes such as \d intact inside the expression.
    conseq = GraphQuery(response.text, r"""
        {
            title `xpath("/html/body/div[4]/div[1]/div/div/div[1]/text()")`
            pictype `css(".pic-type")`
            number `css(".detailBtn-down");attr("data-id")`
            type `regex("文件格式:([a-z]+)")`
            metadata `css(".main-right p")` {
                size `regex("尺寸:(.*?)像素")`
                volume `regex("体积:(.*? MB)")`
                mode `regex("模式:([A-Z]+)")`
                resolution `regex("分辨率:(\d+dpi)")`
            }
            author `css(".user-name")`
            images `css("#show-area-height img")` [
                src `attr("src")`
            ]
            tags `css(".mainRight-tagBox .fl")` [
                tag `text()`
            ]
        }
    """)
    print(conseq)

The output is:

    {
        "data": {
            "author": "Ice bear",
            "images": [
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0",
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024",
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a2048",
                "http://pic.qiantucdn.com/58pic/32/50/40/70d58PICZfkRTfbnM2UVe_PIC2018.jpg!/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a3072"
            ],
            "metadata": {
                "mode": "RGB",
                "resolution": "200dpi",
                "size": "4724×6299",
                "volume": "196.886 MB"
            },
            "number": "32504070",
            "pictype": "原创",
            "tags": ["大侠", "海报", "黑白", "金庸", "水墨", "武侠", "中国风"],
            "title": "大侠海报金庸武侠水墨中国风黑白",
            "type": "psd"
        },
        "error": "",
        "timecost": 10997800
    }

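One difference from the Golang version is worth noting: the Go code converts the page from GBK before parsing, while the Python sample passes response.text straight through. If requests guesses a different encoding for this page, the Chinese text in the document could come out garbled and the regex patterns would not match. The sketch below shows the decoding step on its own, using a short byte string as a stand-in for response.content; decode_page is a hypothetical helper. With requests, a similar effect can be had by setting response.encoding = "gbk" before reading response.text.

```python
def decode_page(raw_bytes, encoding="gbk"):
    """Decode raw page bytes, replacing undecodable bytes instead of raising."""
    return raw_bytes.decode(encoding, errors="replace")


# Stand-in for response.content; a real page would be a full HTML document.
raw = "<title>大侠海报金庸武侠水墨中国风黑白</title>".encode("gbk")

print(decode_page(raw))

# Decoding the same bytes with the wrong codec yields mojibake, so patterns
# that contain Chinese text would no longer match anything.
print(raw.decode("latin-1"))
```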
Further reading: https://github.com/storyicon/graphquery/wiki