Please credit the source when reposting:
http://blog.csdn.net/qq_27818541/article/details/112297218
This article is from: [BigManing's blog]
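Example overview
- Set a Transport on colly so that it can access local files
- Get the directory containing the executable, then visit /html/index.html under it
- Collect the links on that page and visit each of them
- Print the content of every h1 tag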
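The first step relies on the standard library rather than on anything specific to colly, so it can be tried in isolation. Below is a minimal sketch, assuming a Unix-style filesystem; the helper names `newFileClient`, `readLocal` and `demo` are made up for illustration. It registers the `file` scheme on an `http.Transport` and reads a temporary file back through a plain `http.Client`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// newFileClient returns an http.Client whose transport also handles
// file:// URLs, served from the filesystem root.
func newFileClient() *http.Client {
	t := &http.Transport{}
	t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))
	return &http.Client{Transport: t}
}

// readLocal fetches an absolute local path through the file:// scheme.
func readLocal(client *http.Client, absPath string) (string, error) {
	resp, err := client.Get("file://" + filepath.ToSlash(absPath))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demo writes a throwaway HTML file and reads it back via file://.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "filetransport")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "hello.html")
	if err := os.WriteFile(path, []byte("<h1>Hello</h1>"), 0o644); err != nil {
		return "", err
	}
	return readLocal(newFileClient(), path)
}

func main() {
	body, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Because the file server is rooted at `/`, the path part of the URL is simply the absolute path on disk, which is why the example below builds its URLs as `"file://" + dir + ...`.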
Example code
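package main

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Directory containing the running executable. With `go run` the binary
	// is built in a temporary directory, so the html folder is not found there.
	dir, err := filepath.Abs(filepath.Dir(os.Args[0]))
	if err != nil {
		panic(err)
	}

	// Let the transport serve file:// URLs from the filesystem root.
	t := &http.Transport{}
	t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))

	c := colly.NewCollector()
	c.WithTransport(t)

	pages := []string{}

	// Collect the text of every h1 element.
	c.OnHTML("h1", func(e *colly.HTMLElement) {
		pages = append(pages, e.Text)
	})

	// Follow every link. The href is appended to the html directory,
	// so it is expected to start with "/".
	c.OnHTML("a", func(e *colly.HTMLElement) {
		c.Visit("file://" + dir + "/html" + e.Attr("href"))
	})

	fmt.Println("file://" + dir + "/html/index.html")
	c.Visit("file://" + dir + "/html/index.html")
	c.Wait()

	for i, p := range pages {
		fmt.Printf("%d : %s\n", i, p)
	}
}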
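The link handler above builds the target URL by string concatenation, which only yields a valid path when the href starts with a slash. As an alternative, here is a small sketch that resolves an href against the URL of the page it was found on, using `net/url` from the standard library. The function name `resolveLink` and the `/site/html` path are made up for illustration, and since index.html is not reproduced in this article, the shape of its hrefs is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves href relative to pageURL, following the usual
// URL reference rules (relative paths, root-absolute paths, "..").
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "file:///site/html/index.html" // hypothetical location
	hrefs := []string{
		"child_page/one.html",  // relative to the page's directory
		"/child_page/one.html", // root-absolute: relative to the filesystem root
	}
	for _, href := range hrefs {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", href, abs)
	}
}
```

Note the trade-off: with the file transport rooted at `/`, a root-absolute href such as `/child_page/one.html` resolves to the filesystem root rather than to the html folder. That is presumably why the example prepends `dir + "/html"` by hand instead of resolving the link.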
Output
file:///bigmaning/work/code/go/cdsnArticle/_examples/local_files/html/index.html
0 : Index.html
1 : Child Page One
2 : Child Page Two
3 : Child Page Three
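The first line of the output is built from the directory of the executable, so the html folder has to sit next to the compiled binary. When the program is started with `go run`, the binary lives in a temporary build directory and that lookup fails. One possible workaround, sketched below, is to try several candidate directories and use the first one that actually contains html/index.html. The helper `firstWithIndex` is hypothetical and not part of colly or the original example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// firstWithIndex returns the first directory in candidates that contains
// html/index.html, and false if none of them does.
func firstWithIndex(candidates []string) (string, bool) {
	for _, dir := range candidates {
		info, err := os.Stat(filepath.Join(dir, "html", "index.html"))
		if err == nil && !info.IsDir() {
			return dir, true
		}
	}
	return "", false
}

func main() {
	var candidates []string
	// Directory of the running executable (works for a compiled binary).
	if exe, err := os.Executable(); err == nil {
		candidates = append(candidates, filepath.Dir(exe))
	}
	// Current working directory (works for `go run` from the project folder).
	if wd, err := os.Getwd(); err == nil {
		candidates = append(candidates, wd)
	}

	dir, ok := firstWithIndex(candidates)
	if !ok {
		fmt.Println("html/index.html not found in", candidates)
		return
	}
	fmt.Println("file://" + filepath.ToSlash(dir) + "/html/index.html")
}
```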
Attachments
Directory layout
.
├── html
│   ├── child_page
│   │   ├── one.html
│   │   ├── three.html
│   │   └── two.html
│   └── index.html
└── local_files.go
one.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Child Page One</h1>
</body>
</html>
two.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Child Page Two</h1>
</body>
</html>
three.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Child Page Three</h1>
</body>
</html>