Automatically scraping blog posts with colly

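colly holds a place in Go comparable to Scrapy's in Python: both are heavyweights of the crawling world. This post uses it to scrape blog posts, covering collector configuration, DOM data extraction with goquery, automatic pagination, CSV persistence, and JSON output to the console. The whole flow is simple and direct.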

Code

The entry point for scraping is a community user's blog list page, for example https://learnku.com/blog/pardon

package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
	"strings"

	"github.com/gocolly/colly"
)

// Article holds the data scraped for one blog post.
type Article struct {
	ID       int    `json:"id,omitempty"`
	Title    string `json:"title,omitempty"`
	URL      string `json:"url,omitempty"`
	Created  string `json:"created,omitempty"`
	Reads    string `json:"reads,omitempty"`
	Comments string `json:"comments,omitempty"`
	Feeds    string `json:"feeds,omitempty"`
}

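// csvSave persists the scraped articles to a CSV file.
func csvSave(fName string, data []Article) error {
	file, err := os.Create(fName)
	if err != nil {
		// Return the error instead of exiting, so the caller decides what to do.
		return fmt.Errorf("cannot create file %q: %w", fName, err)
	}
	defer file.Close()
	writer := csv.NewWriter(file)

	// Header row first, then one row per article.
	if err := writer.Write([]string{"ID", "Title", "URL", "Created", "Reads", "Comments", "Feeds"}); err != nil {
		return err
	}
	for _, v := range data {
		row := []string{strconv.Itoa(v.ID), v.Title, v.URL, v.Created, v.Reads, v.Comments, v.Feeds}
		if err := writer.Write(row); err != nil {
			return err
		}
	}
	// Flush buffered rows and report any error hit while writing.
	writer.Flush()
	return writer.Error()
}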

func main() {
	articles := make([]Article, 0, 200)
	// 1. Prepare the collector instance
	c := colly.NewCollector(
		// Enable local debugging
		// colly.Debugger(&debug.LogDebugger{}),
		colly.AllowedDomains("learnku.com"),
		// Avoid downloading the same page twice
		// colly.CacheDir("./learnku_cache"),
	)

	// 2. Parse the page data
	idRe := regexp.MustCompile(`\d+`) // compiled once, reused for every article
	c.OnHTML("div.blog-article-list > .event", func(e *colly.HTMLElement) {
		article := Article{
			Title: e.ChildText("div.content > div.summary"),
			URL:   e.ChildAttr("div.content a.title", "href"),
			Feeds: e.ChildText("div.item-meta > a:first-child"),
		}
		// Walk the links that share one selector; the index tells them apart
		e.ForEach("div.content > div.meta > div.date>a", func(i int, el *colly.HTMLElement) {
			switch i {
			case 1:
				article.Created = el.Attr("data-tooltip")
			case 2:
				// Split on whitespace and keep the second field
				if f := strings.Fields(el.Text); len(f) > 1 {
					article.Reads = f[1]
				}
			case 3:
				if f := strings.Fields(el.Text); len(f) > 1 {
					article.Comments = f[1]
				}
			}
		})
		// Match the first number in the URL and convert it from string to int
		if m := idRe.FindString(article.URL); m != "" {
			article.ID, _ = strconv.Atoi(m)
		}
		articles = append(articles, article)
	})

	// Next page
	c.OnHTML("a[href].page-link", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

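	// Start crawling
	if err := c.Visit("https://learnku.com/blog/pardon"); err != nil {
		log.Fatal(err)
	}

	// Output
	if err := csvSave("pardon.csv", articles); err != nil {
		log.Fatal(err)
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(articles)

	// Print the collector's statistics
	log.Println(c)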
}
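
The two string-handling steps in the OnHTML callback (splitting a link's text on whitespace and pulling the numeric ID out of the article URL) can be tried on their own, without any network access. The following is a minimal sketch using only the standard library. The helper names `secondField` and `articleID` and the sample text "reads 650" are assumptions made for illustration and are not part of the original program.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// idRe matches the first run of digits in a string.
var idRe = regexp.MustCompile(`\d+`)

// secondField (hypothetical helper) splits s on whitespace and returns the
// second field, or "" when there are fewer than two fields.
func secondField(s string) string {
	f := strings.Fields(s)
	if len(f) < 2 {
		return ""
	}
	return f[1]
}

// articleID (hypothetical helper) returns the first number found in rawURL.
// The bool is false when rawURL has no digits or the number does not fit an int.
func articleID(rawURL string) (int, bool) {
	m := idRe.FindString(rawURL)
	if m == "" {
		return 0, false
	}
	n, err := strconv.Atoi(m)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	fmt.Println(secondField("reads 650")) // sample text is an assumption
	fmt.Println(articleID("https://learnku.com/articles/30604"))
	fmt.Println(articleID("https://learnku.com/blog/pardon"))
}
```

Returning a zero value instead of indexing blindly keeps one malformed list item from aborting the whole crawl with a panic.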

Output

Console output

....
    "id": 30604,
    "title": "教程: TodoMVC 与 director 路由",
    "url": "https://learnku.com/articles/30604",
    "created": "2019-07-01 12:42:01",
    "reads": "650",
    "comments": "0",
    "feeds": "0"
  },
  {
    "id": 30579,
    "title": "flaskr 进阶笔记",
    "url": "https://learnku.com/articles/30579",
    "created": "2019-06-30 19:01:04",
    "reads": "895",
    "comments": "0",
    "feeds": "0"
  },
  {
    "id": 30542,
    "title": "教程 Redis+ flask+vue 在线聊天",
    "url": "https://learnku.com/articles/30542",
    "created": "2019-06-29 12:19:45",
    "reads": "2760",
    "comments": "1",
    "feeds": "2"
  }
]
2019/12/20 15:50:14 Requests made: 5 (5 responses) | Callbacks: OnRequest: 0, OnHTML: 2, OnResponse: 0, OnError: 0
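
The request count in that log line covers the start page plus the pagination links the collector followed. By default colly refuses to revisit a URL it has already fetched, and `e.Request.Visit` resolves relative links against the current page. The idea can be sketched with the standard library alone. This is not colly's implementation: the `frontier` type and the `?page=N` hrefs are hypothetical, chosen only to show relative-link resolution combined with a visited set.

```go
package main

import (
	"fmt"
	"net/url"
)

// frontier (hypothetical type) remembers which absolute URLs were already queued.
type frontier struct {
	seen map[string]bool
}

func newFrontier() *frontier {
	return &frontier{seen: make(map[string]bool)}
}

// next resolves href against base and reports whether the resulting
// absolute URL has not been seen before. Unparsable input yields ("", false).
func (f *frontier) next(base, href string) (string, bool) {
	b, err := url.Parse(base)
	if err != nil {
		return "", false
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := b.ResolveReference(h).String()
	if f.seen[abs] {
		return abs, false
	}
	f.seen[abs] = true
	return abs, true
}

func main() {
	f := newFrontier()
	base := "https://learnku.com/blog/pardon"
	// The href values below are assumptions, not taken from the real page.
	for _, href := range []string{"?page=2", "?page=3", "?page=2"} {
		abs, fresh := f.next(base, href)
		fmt.Println(abs, fresh)
	}
}
```

The third link repeats the first, so it is reported as already seen. That is why a pager that appears on every page does not cause endless requests.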

CSV file output

ID,Title,URL,Created,Reads,Comments,Feeds
37991,ferret 爬取动态网页,https://learnku.com/articles/37991,2019-12-15 10:43:03,219,0,3
37803,匿名类 与 索引重建,https://learnku.com/articles/37803,2019-12-09 19:35:09,323,1,0
37476,大话并发,https://learnku.com/articles/37476,2019-12-08 21:17:55,612,0,4
37738,三元运算符,https://learnku.com/articles/37738,2019-12-08 09:44:36,606,0,0
37719,笔试之 模板变量替换,https://learnku.com/articles/37719,2019-12-07 18:30:42,843,0,0
37707,笔试之 连续数增维,https://learnku.com/articles/37707,2019-12-07 13:50:17,872,0,0
37616,笔试之 一行代码求重,https://learnku.com/articles/37616,2019-12-05 12:10:24,792,0,0
....

Colly

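  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Distributed scraping
  • Automatic encoding of non-Unicode responses
  • Robots.txt support
  • Google App Engine support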