The overall design of our crawler framework is as follows:

- Obtain a request and send it to the engine for processing
- The engine sends the url to the Fetcher, and the Fetcher downloads the page
- Once the Fetcher has the page contents, it sends them back to the engine
- The engine passes the page contents to the Parser for processing

The main goal of this section is to do some simple refactoring of the code we wrote earlier, and to implement part of the engine and parser code. The overall structure is shown in the diagram below.
First we move the business logic from our previous main into a Fetch function, inside a new package called fetcher:
```go
// fetcher/fetcher.go
func Fetch(url string) ([]byte, error) {
	...
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("wrong status code: %d", res.StatusCode)
	}

	b, err := ioutil.ReadAll(res.Body)
	if err != nil {
		return nil, fmt.Errorf("error reading body: %v", err)
	}
	...
}
```
```go
func determineEncoding(r io.Reader) encoding.Encoding {
	...
	if err != nil {
		log.Printf("Fetcher error: %v", err)
		return unicode.UTF8
	}
	...
}
```
Next we start writing the engine. First create an engine package, then add a types file:
```go
// engine/types.go
type Request struct {
	Url        string
	ParserFunc func([]byte) ParseResult
}

type ParseResult struct {
	Requests []Request
	Items    []interface{}
}
```
Next we write the parser. Since we are building a parser for 58同城 (58.com), we create a new 58 package, add the parser under it, and create a citylist.go file for handling the city-list page.

For testing purposes, we also define an empty url-handler function:
```go
// engine/types.go
func NilParser([]byte) ParseResult {
	return ParseResult{}
}
```
Now we move the code from the earlier printCityList function into a ParseCityList function:
```go
// 58/parser/citylist.go
func ParseCityList(contents []byte) engine.ParseResult {
	result := engine.ParseResult{}
	re := regexp.MustCompile(`independentCityList = {([^}]*)`)
	matches := re.FindAllSubmatch(contents, -1)
	for _, m := range matches {
		for _, subMatch := range m[1:] {
			str := strings.Replace(string(subMatch), " ", "", -1)
			str = strings.Replace(str, "\n", "", -1)
			for _, sub := range strings.FieldsFunc(str, splitByComma) {
				...
				independentCityAbUrl := "https://" + independentCityAb + ".58.com"
				result.Items = append(result.Items, independentCity)
				result.Requests = append(result.Requests, engine.Request{
					Url:        independentCityAbUrl,
					ParserFunc: engine.NilParser,
				})
			}
		}
	}
	...
	return result
}
```
Next, create an engine.go file in the engine package. We start with a Run function that processes incoming Requests:
```go
// engine/engine.go
func Run(seeds ...Request) {
	var requests []Request
	for _, r := range seeds {
		requests = append(requests, r)
	}

	for len(requests) > 0 {
		r := requests[0]
		requests = requests[1:]

		log.Printf("Fetching %s", r.Url)
		body, err := fetcher.Fetch(r.Url)
		if err != nil {
			log.Printf("Fetcher: error fetching url %s: %v", r.Url, err)
			continue
		}

		parseResult := r.ParserFunc(body)
		requests = append(requests, parseResult.Requests...)

		for _, item := range parseResult.Items {
			log.Printf("Got item %v", item)
		}
	}
}
```
Then we modify our main.go accordingly:
```go
func main() {
	engine.Run(engine.Request{
		Url:        "https://www.58.com/changecity.html",
		ParserFunc: parser.ParseCityList,
	})
}
```
Finally, we run the code.

Next we need to write a test file for citylist.go, named citylist_test.go. First, save the contents of the page we visit, https://www.58.com/changecity.html, to an html file to make testing easier. Then write the test function TestParseCityList:
```go
func TestParseCityList(t *testing.T) {
	contents, err := fetcher.Fetch("https://www.58.com/changecity.html")
	if err != nil {
		panic(err)
	}

	result := ParseCityList(contents)

	const resultSize = 689
	expectedUrls := []string{
		"https://bj.58.com", "https://sh.58.com", "https://tj.58.com",
	}
	expectedCities := []string{
		"北京", "上海", "天津",
	}

	if len(result.Requests) != resultSize {
		t.Errorf("result should have %d requests; but had %d", resultSize, len(result.Requests))
	}
	for i, url := range expectedUrls {
		if result.Requests[i].Url != url {
			t.Errorf("expected url #%d: %s; but was %s", i, url, result.Requests[i].Url)
		}
	}

	if len(result.Items) != resultSize {
		t.Errorf("result should have %d items; but had %d", resultSize, len(result.Items))
	}
	for i, city := range expectedCities {
		if result.Items[i].(string) != city {
			t.Errorf("expected city #%d: %s; but was %s", i, city, result.Items[i].(string))
		}
	}
}
```
The test results are as follows.

With that, this section is complete, and we commit the code to github. If you found any of the above a bit difficult, don't worry: you can look at my Tiny-Go-Crawler lesson2 code. If you spot any problems, please point them out!