Golang: A Distributed Crawler Project

Latest recommended article published on 2024-09-13 22:16:48

chao2016

Category: L_Golang | Tags: golang, spider, crawler

Article link: https://blog.csdn.net/chao2016/article/details/81697353

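This project builds a crawler in Golang that scrapes content from a matchmaking website.

Source code: https://github.com/chao2015/go-crawler

Source code walkthrough:

1. Fetching web pages
2. The crawler's execution engine
3. Selecting content
4. The parser module
5. Results of the single-machine crawler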

1. Fetching web pages

The Fetcher module takes a URL, downloads the full content of that page, and returns it as text in a []byte.

package fetcher

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "net/http"

    "golang.org/x/net/html/charset"
    "golang.org/x/text/encoding"
    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// Fetch downloads the page at url and returns its body converted to UTF-8.
func Fetch(url string) ([]byte, error) {
    // A plain http.Get(url) fails here with "Error: status code 403",
    // so build the request by hand and send a browser User-Agent.
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err // return the error instead of exiting the whole process
    }
    // Copy the User-Agent from your own browser (Inspect -> Network -> User-Agent).
    req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("wrong status code: %d", resp.StatusCode)
    }

    // Reading resp.Body directly garbles the Chinese text of GBK-encoded pages.
    // Option 1: hard-code the decoder (GBK to UTF-8):
    //utf8Reader := transform.NewReader(resp.Body, simplifiedchinese.GBK.NewDecoder())
    // Option 2 (used here): detect the page encoding, then convert to UTF-8.
    bodyReader := bufio.NewReader(resp.Body)
    e := determineEncoding(bodyReader)
    utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
    return io.ReadAll(utf8Reader)
}

// determineEncoding guesses the page encoding from its first 1024 bytes.
// Peek returns a slice that refers to the buffered data without consuming it,
// so the caller can still read the whole body from r afterwards. The slice is
// only valid until the next read. If fewer than n bytes are available, Peek
// also returns an error; if n exceeds the buffer size, that error is ErrBufferFull.
func determineEncoding(r *bufio.Reader) encoding.Encoding {
    head, err := r.Peek(1024)
    // Pages shorter than 1024 bytes yield io.EOF plus the bytes that exist; that is fine.
    if err != nil && err != io.EOF {
        log.Printf("fetcher: peek failed, assuming UTF-8: %v", err)
        return unicode.UTF8
    }
    e, _, _ := charset.DetermineEncoding(head, "")
    return e
}

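2. The crawler's execution engine

The Engine module maintains a task queue. The overall flow of the project is:

The Engine takes requests from the Seed; each request is a URL plus the Parser to call on that page.
The Engine handles the requests in order, passing request.url to the Fetcher module.
Once the Fetcher returns the page text for that URL, the Engine hands the text to the Parser module.
The Parser extracts the structured data (items) we want, along with requests holding the next-level URLs and their matching Parsers, and returns both to the Engine, which adds the new requests to its queue.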

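3. Selecting content

There are three ways to do this:

CSS selectors
XPath
Regular expressions

This project uses regular expressions to extract structured data from pages. For example: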

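package main

import (
    "fmt"
    "regexp"
)

const text = `
My email is chao@gmail.com
My email is chao1@gmail.com@163.com
My email is chao2@gmail.com.cn
`

func main() {
    // .  matches any single character except a newline
    // .+ matches one or more of those characters
    // .* matches zero or more of those characters
    // Without capture groups, FindString returns the first match
    // and FindAllString returns all of them:
    //re := regexp.MustCompile(`[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9.]+`)
    //match := re.FindString(text)
    //match := re.FindAllString(text, -1)
    //fmt.Println(match)

    // Each pair of parentheses is a capture group. For every match,
    // FindAllStringSubmatch returns the full match followed by its groups.
    re := regexp.MustCompile(`([a-zA-Z0-9]+)@([a-zA-Z0-9]+)(\.[a-zA-Z0-9.]+)`)
    match := re.FindAllStringSubmatch(text, -1)
    for _, m := range match {
        fmt.Println(m)
    }
}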

Output:

[chao@gmail.com chao gmail .com]
[chao1@gmail.com chao1 gmail .com]
[chao2@gmail.com.cn chao2 gmail .com.cn]

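See the Parser module for how this is applied in the project.

4. The parser module

The Parser module has three levels of parsers:

City list parser (extracts each city's name and the URL of its home page)
City parser (extracts the user names on each city's home page and the URLs of their profile pages)
User parser (extracts the structured data on a user's profile page)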

5. Results of the single-machine crawler

$ go run main.go 
// City list:
2018/08/15 07:25:48 Fetching http://www.zhenai.com/zhenghun
2018/08/15 07:25:50 Got item City 阿坝
2018/08/15 07:25:50 Got item City 阿克苏
2018/08/15 07:25:50 Got item City 阿拉善盟
2018/08/15 07:25:50 Got item City 阿勒泰
2018/08/15 07:25:50 Got item City 阿里
2018/08/15 07:25:50 Got item City 安徽
2018/08/15 07:25:50 Got item City 安康
2018/08/15 07:25:50 Got item City 安庆
... ...

// User names on a city's home page:
2018/08/15 07:25:50 Fetching http://www.zhenai.com/zhenghun/aba
2018/08/15 07:25:50 Got item User 小顺儿
2018/08/15 07:25:50 Got item User 风中的蒲公英
2018/08/15 07:25:50 Got item User 路漫漫
... ...

// User details:
2018/08/15 07:25:55 Fetching http://album.zhenai.com/u/1995815593
2018/08/15 07:25:55 Got item {小顺儿 女 29 169 52 3001-5000元 未婚 大学本科 会计 四川阿坝 魔羯座 和家人同住 未购车}
2018/08/15 07:25:55 Fetching http://album.zhenai.com/u/1314495053
2018/08/15 07:25:55 Got item {风中的蒲公英 女 41 158 48 3001-5000元 离异 中专 公务员 四川阿坝 处女座 已购房 未购车}
2018/08/15 07:25:55 Fetching http://album.zhenai.com/u/1626200317
2018/08/15 07:25:55 Got item {路漫漫 女 32 158 0 3000元以下 离异 大专 中学教师 四川阿坝 狮子座 和家人同住 未购车}
... ...
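
The brace-wrapped user details in the log are what Go's %v verb prints for a struct value. A sketch of a struct that would produce that shape follows. The Profile type and its field names are guesses based only on the log lines above, not taken from the repository, and describe is a hypothetical helper.

```go
package main

import "fmt"

// Profile is a hypothetical struct; field names and meanings are inferred
// from the order of values in the log output and may not match the real code.
type Profile struct {
	Name       string
	Gender     string
	Age        int
	Height     int
	Weight     int // appears as 0 in one log line, presumably when not listed
	Income     string
	Marriage   string
	Education  string
	Occupation string
	Location   string
	Zodiac     string
	House      string
	Car        string
}

// describe formats a profile the way the log lines above appear.
func describe(p Profile) string {
	return fmt.Sprintf("Got item %v", p)
}

func main() {
	p := Profile{
		Name: "小顺儿", Gender: "女", Age: 29, Height: 169, Weight: 52,
		Income: "3001-5000元", Marriage: "未婚", Education: "大学本科",
		Occupation: "会计", Location: "四川阿坝", Zodiac: "魔羯座",
		House: "和家人同住", Car: "未购车",
	}
	fmt.Println(describe(p))
}
```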