Build a Web crawler with Go to detect duplicate titles

In this article I’ll write a small web crawler. I wasn’t sure if my website had nice page titles site-wide, and if I had duplicate titles, so I wrote this small utility to find out.

I’ll start by writing a command that accepts a starting page from the command line, and follows any link that has the original url as a base.

Later I’ll add an optional flag to detect if the site has duplicate titles, something that might be useful for SEO purposes.

Introducing golang.org/x/net/html

The golang.org/x packages are packages maintained by the Go team, but they are not part of the standard library for various reasons.

Maybe they are too specific, not going to be used by the majority of Go developers. Maybe they are still under development or experimental, so they cannot be included in the stdlib, which must live up to the Go 1.0 promise of no backward incompatible changes - when something goes into the stdlib, it’s “final”.

One of these packages is golang.org/x/net/html.

To install it, execute

go get golang.org/x/net...

In this article I'll use in particular the html.Parse() function and the html.Node struct:

package html

type Node struct {
    Type                    NodeType
    Data                    string
    Attr                    []Attribute
    FirstChild, NextSibling *Node
}

type NodeType int32

const (
    ErrorNode NodeType = iota
    TextNode
    DocumentNode
    ElementNode
    CommentNode
    DoctypeNode
)

type Attribute struct {
    Key, Val string
}

func Parse(r io.Reader) (*Node, error)
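
Before using these APIs in the crawler, here is a minimal standalone sketch (not part of the original program) of how html.Parse() and html.Node fit together: it parses a small made-up HTML snippet and walks the node tree, printing the name of every element node it finds.

package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// printElements walks the node tree depth-first and prints
// the Data of every element node it encounters
func printElements(n *html.Node) {
	if n.Type == html.ElementNode {
		fmt.Println(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		printElements(c)
	}
}

func main() {
	// a tiny made-up document, just for illustration
	page := `<html><head><title>Hi</title></head><body><a href="/">home</a></body></html>`
	doc, err := html.Parse(strings.NewReader(page))
	if err != nil {
		panic(err)
	}
	printElements(doc) // prints: html, head, title, body, a
}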

The first program below accepts a URL and collects the unique links it finds, giving an output like this:

http://localhost:1313/go-filesystem-structure/ -> Filesystem Structure of a Go project
http://localhost:1313/golang-measure-time/ -> Measuring execution time in a Go program
http://localhost:1313/go-tutorial-fortune/ -> Go CLI tutorial: fortune clone
http://localhost:1313/go-tutorial-lolcat/ -> Build a Command Line app with Go: lolcat

Let's start with main(), as it shows a high-level overview of what the program does.

  1. gets the URL from the CLI args using `os.Args[1]`

  2. instantiates visited, a map with string keys and string values, where we'll store the URL and the title of the site pages

  3. calls analyze(). url is passed twice, as the function is recursive and the second parameter serves as the base URL for the recursive calls

  4. iterates over the visited map, which was passed by reference to analyze() and now has all the values filled in, so we can print them

    package main
    
    import (
    	"fmt"
    	"net/http"
    	"os"
    	"strings"
    
    	"golang.org/x/net/html"
    )
    
    func main() {
    	url := os.Args[1]
    	if url == "" {
    		fmt.Println("Usage: `webcrawler <url>`")
    		os.Exit(1)
    	}
    	visited := map[string]string{}
    	analyze(url, url, &visited)
    	for k, v := range visited {
    		fmt.Printf("%s -> %s\n", k, v)
    	}
    }

Simple enough? Let's get inside analyze(). First, it calls parse(), which, given a string pointing to a URL, will fetch and parse it, returning an html.Node pointer and an error.

func parse(url string) (*html.Node, error)

After checking for success, analyze() fetches the page title using pageTitle(), which, given a reference to an html.Node, scans it until it finds the title tag and then returns its value.

func pageTitle(n *html.Node) string

Once we have the page title, we can add it to the visited map.

Next, we get all the page links by calling pageLinks(), which, given the starting page node, recursively scans all the page nodes and returns a list of the unique links found (no duplicates).

func pageLinks(links []string, n *html.Node) []string

Once we have the links slice, we iterate over it and do a little check: if visited does not yet contain the page, it means we haven't visited it yet, and the link must have baseurl as a prefix. If both conditions hold, we can call analyze() with the link URL.

// analyze given a url and a baseurl, recursively scans the page
// following all the links and fills the `visited` map
func analyze(url, baseurl string, visited *map[string]string) {
	page, err := parse(url)
	if err != nil {
		fmt.Printf("Error getting page %s %s\n", url, err)
		return
	}
	title := pageTitle(page)
	(*visited)[url] = title

	//recursively find links
	links := pageLinks(nil, page)
	for _, link := range links {
		if (*visited)[link] == "" && strings.HasPrefix(link, baseurl) {
			analyze(link, baseurl, visited)
		}
	}
}

pageTitle() uses the golang.org/x/net/html APIs we introduced above. At the first iteration, n is the <html> node. We’re looking for the title tag. The first iteration never satisfies this, so we go and loop over the first child of <html> first, and its siblings later, and we call pageTitle() recursively passing the new node.

Eventually we'll get to the <title> tag: an html.Node instance with Type equal to html.ElementNode (see above) and Data equal to title, and we return its content by accessing its FirstChild.Data property.

// pageTitle given a reference to a html.Node, scans it until it
// finds the title tag, and returns its value
func pageTitle(n *html.Node) string {
	var title string
	if n.Type == html.ElementNode && n.Data == "title" {
		return n.FirstChild.Data
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		title = pageTitle(c)
		if title != "" {
			break
		}
	}
	return title
}
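
One thing to keep in mind with this approach: it assumes the <title> element always has a text child. On a page with an empty <title></title>, FirstChild would be nil and accessing FirstChild.Data would panic, so a more defensive version might check for that before dereferencing.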

pageLinks() is not much different from pageTitle(), except that it does not stop when it finds the first item, but looks up every link, so we must pass the links slice as a parameter to this recursive function. Links are discovered by checking that the html.Node has Type html.ElementNode, that Data is a, and that it has an Attr with Key href, as otherwise it could be an anchor.

// pageLinks will recursively scan a `html.Node` and will return
// a list of links found, with no duplicates
func pageLinks(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				if !sliceContains(links, a.Val) {
					links = append(links, a.Val)
				}
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = pageLinks(links, c)
	}
	return links
}

sliceContains() is a utility function called by pageLinks() to check uniqueness in the slice.

// sliceContains returns true if `slice` contains `value`
func sliceContains(slice []string, value string) bool {
	for _, v := range slice {
		if v == value {
			return true
		}
	}
	return false
}
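
As an aside, on Go 1.21 or later the standard library's slices package offers slices.Contains, which could replace this small helper.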

The last function is parse(). It uses the http stdlib functionality to fetch the contents of a URL with http.Get(), and then uses the html.Parse() API from golang.org/x/net/html to parse the response body of the HTTP request, returning an html.Node reference.

// parse given a string pointing to a URL will fetch and parse it
// returning an html.Node pointer
func parse(url string) (*html.Node, error) {
	r, err := http.Get(url)
	if err != nil {
		return nil, fmt.Errorf("Cannot get page")
	}
	// close the response body once we're done parsing it
	defer r.Body.Close()
	b, err := html.Parse(r.Body)
	if err != nil {
		return nil, fmt.Errorf("Cannot parse page")
	}
	return b, err
}

Detect duplicate titles

Since I want to use a command line flag to check for duplicates, I’m going to change slightly how the URL is passed to the program: instead of using os.Args, I’ll pass the URL using a flag too.

This is the modified main() function, which parses the flags before doing the usual work of running analyze() and printing the values. In addition, at the end there's a check for the dup boolean flag, and if it's true the program runs checkDuplicates().

import (
	"flag"
//...
)


func main() {
	var url string
	var dup bool
	flag.StringVar(&url, "url", "", "the url to parse")
	flag.BoolVar(&dup, "dup", false, "if set, check for duplicates")
	flag.Parse()

	if url == "" {
		flag.PrintDefaults()
		os.Exit(1)
	}

	visited := map[string]string{}
	analyze(url, url, &visited)
	for link, title := range visited {
		fmt.Printf("%s -> %s\n", link, title)
	}

	if dup {
		checkDuplicates(&visited)
	}
}
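
With this change, assuming the binary is still called webcrawler, the program is invoked as webcrawler -url=http://localhost:1313/ -dup rather than by passing the URL as a bare argument.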

checkDuplicates() takes the map of url -> title and iterates over it to build its own uniques map, which this time has the page title as the key, so we can check uniques[title] == "" to determine whether a title is already there, and we can access the first page that was entered with that title by printing uniques[title].

// checkDuplicates scans the visited map for pages with duplicate titles
// and writes a report
func checkDuplicates(visited *map[string]string) {
	found := false
	uniques := map[string]string{}
	fmt.Printf("\nChecking duplicates..\n")
	for link, title := range *visited {
		if uniques[title] == "" {
			uniques[title] = link
		} else {
			found = true
			fmt.Printf("Duplicate title \"%s\" in %s but already found in %s\n", title, link, uniques[title])
		}
	}

	if !found {
		fmt.Println("No duplicates were found 😇")
	}
}

Credits

The Go Programming Language book by Donovan and Kernighan uses a web crawler as an example throughout the book, changing it in different chapters to introduce new concepts. The code provided in this article takes inspiration from the book.

Translated from: https://flaviocopes.com/golang-web-crawler/
