1. Architecture diagrams
Scheduler implementation 1, architecture diagram:
Scheduler implementation 2, architecture diagram:
Worker: the Fetcher and the Parser combined into one unit, so that this work can run concurrently.
func createWorker(in chan Request, out chan ParseResult) {
	go func() {
		for {
			request := <-in // the incoming request
			result, err := workerc(request)
			if err != nil {
				continue // skip requests that failed to fetch
			}
			out <- result // the result of fetching the URL and parsing out the useful content
		}
	}()
}
func workerc(r Request) (ParseResult, error) {
	log.Printf("Fetching %s", r.Url)
	body, err := fetcher.Fetcher(r.Url) // fetch the data over the network; the request's own parser then parses it
	if err != nil {
		log.Printf("Fetcher: error fetching url %s, %v", r.Url, err)
		return ParseResult{}, err
	}
	return r.ParserFunc(body), nil
}
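The worker code above relies on types that are not shown here. The following is a minimal sketch of what they might look like, inferred only from how `r.Url`, `r.ParserFunc`, `result.Items` and `result.Requests` are used. The field types, `fakeFetch` (a stand-in for `fetcher.Fetcher` so the example runs offline), `echoParser`, `process`, `startWorker` and `runOnce` are all assumptions made for illustration.

```go
package main

import "fmt"

// Request and ParseResult are assumed shapes, inferred from their usage.
type Request struct {
	Url        string
	ParserFunc func([]byte) ParseResult
}

type ParseResult struct {
	Requests []Request
	Items    []interface{}
}

// fakeFetch is a hypothetical stand-in for fetcher.Fetcher (no network).
func fakeFetch(url string) ([]byte, error) {
	if url == "" {
		return nil, fmt.Errorf("empty url")
	}
	return []byte("body of " + url), nil
}

// process mirrors workerc: fetch, then hand the body to the parser.
func process(r Request) (ParseResult, error) {
	body, err := fakeFetch(r.Url)
	if err != nil {
		return ParseResult{}, err
	}
	return r.ParserFunc(body), nil
}

// startWorker mirrors createWorker: loop forever, skipping failed requests.
func startWorker(in chan Request, out chan ParseResult) {
	go func() {
		for {
			r := <-in
			result, err := process(r)
			if err != nil {
				continue
			}
			out <- result
		}
	}()
}

// echoParser is a hypothetical parser that returns the body as one item.
func echoParser(body []byte) ParseResult {
	return ParseResult{Items: []interface{}{string(body)}}
}

// runOnce pushes a single request through a single worker.
func runOnce(url string) ParseResult {
	in := make(chan Request)
	out := make(chan ParseResult)
	startWorker(in, out)
	in <- Request{Url: url, ParserFunc: echoParser}
	return <-out
}

func main() {
	fmt.Println(runOnce("http://example.com").Items[0]) // prints "body of http://example.com"
}
```

Carrying the parser inside the request is what lets one generic worker handle every page type.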
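Scheduler: once the workers run concurrently, we face a many-to-many task distribution problem. There are many requests and many workers waiting to process them, and the Scheduler assigns the requests to the workers.
Scheduler implementation 1: the Scheduler receives requests one at a time, and all workers share a single input channel, competing on it for the next request. Whichever worker receives a request processes it. This version does not work as written: Submit sends on the channel directly, so it blocks once every worker is busy. Scheduler implementation 2 fixes this.
The code is as follows; note Submit():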
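type SimpleScheduler struct { // owns the input channel
	workerChan chan engine.Request
}
// set the channel that requests are sent on
func (s *SimpleScheduler) ConfigureMasterWorkerChan(c chan engine.Request) {
	s.workerChan = c
}
// submit a request to the workers
func (s *SimpleScheduler) Submit(r engine.Request) {
	// Without a go func around this send, the engine hangs: only 10 goroutines
	// receive from the channel, so once more than 10 requests are submitted
	// while the workers are busy, this send blocks forever.
	s.workerChan <- r
}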
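To see the stall without hanging the program, the situation can be reproduced in miniature. This is only a sketch under simplifying assumptions: requests are plain ints, the workers just echo them, and `trySubmit` and `acceptedSubmits` are hypothetical helpers that give up after a timeout instead of blocking forever. The caller submits several requests in a row without reading `out`, just as the engine does while it loops over `result.Requests`.

```go
package main

import (
	"fmt"
	"time"
)

// trySubmit performs the same send as Submit, but gives up after the
// timeout so the demo can report the stall instead of hanging.
func trySubmit(in chan int, v int, timeout time.Duration) bool {
	select {
	case in <- v:
		return true
	case <-time.After(timeout):
		return false
	}
}

// acceptedSubmits starts `workers` workers on unbuffered channels and
// submits up to `requests` values without ever reading from out.
// It returns how many sends were accepted before one stalled.
func acceptedSubmits(workers, requests int) int {
	in := make(chan int)
	out := make(chan int)
	for i := 0; i < workers; i++ {
		go func() {
			for {
				v := <-in
				out <- v // blocks: nobody reads out while the caller is submitting
			}
		}()
	}
	accepted := 0
	for r := 0; r < requests; r++ {
		if !trySubmit(in, r, 100*time.Millisecond) {
			break
		}
		accepted++
	}
	return accepted
}

func main() {
	fmt.Printf("accepted %d of 5 submits\n", acceptedSubmits(3, 5))
}
```

With 3 workers, the first 3 sends should be accepted and the 4th should time out, because every worker is by then stuck on its own send to `out`. In the real engine there is no timeout, so the same situation becomes a permanent hang.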
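Scheduler implementation 2:
Start a goroutine for every request, as in the architecture diagram for implementation 2. The separate goroutines all send their requests into the one shared channel. The number of goroutines is unbounded, and each one exits as soon as its send completes.
The key is this snippet, which starts one goroutine per request:
go func() {
	s.workerChan <- r
}()
Full code: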
type SimpleScheduler struct { // owns the input channel
	workerChan chan engine.Request
}
// set the channel that requests are sent on
func (s *SimpleScheduler) ConfigureMasterWorkerChan(c chan engine.Request) {
	s.workerChan = c
}
// submit a request to the workers
func (s *SimpleScheduler) Submit(r engine.Request) {
	// send from a new goroutine so that Submit itself never blocks
	go func() {
		s.workerChan <- r
	}()
}
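The effect of wrapping the send in a goroutine can be checked with a small self-contained sketch. It makes the same simplifying assumptions as before: requests are ints, the workers only double them, and `scheduler` and `runAll` are hypothetical names. The caller submits more requests than there are workers before reading any result, which is exactly the pattern that stalls a blocking Submit.

```go
package main

import (
	"fmt"
	"sort"
)

// scheduler is a cut-down stand-in for SimpleScheduler, with int requests.
type scheduler struct {
	workerChan chan int
}

// Submit returns immediately; the send waits in its own goroutine.
func (s *scheduler) Submit(r int) {
	go func() {
		s.workerChan <- r
	}()
}

// runAll submits every request first, then drains the results.
func runAll(workers, requests int) []int {
	in := make(chan int)
	out := make(chan int)
	s := &scheduler{workerChan: in}
	for i := 0; i < workers; i++ {
		go func() {
			for {
				v := <-in
				out <- v * 2
			}
		}()
	}
	for r := 1; r <= requests; r++ {
		s.Submit(r) // never blocks, even with more requests than workers
	}
	results := make([]int, 0, requests)
	for len(results) < requests {
		results = append(results, <-out)
	}
	sort.Ints(results) // arrival order is not deterministic
	return results
}

func main() {
	fmt.Println(runAll(3, 5)) // prints [2 4 6 8 10]
}
```

The pending sends simply wait in their goroutines until a worker is free, so the goroutines themselves act as an unbounded queue.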
Engine: handles requests by handing each one to the component responsible for it.
func (e *ConcurrentEngine) Run(seeds ...Request) {
	in := make(chan Request)
	out := make(chan ParseResult)
	e.Scheduler.ConfigureMasterWorkerChan(in) // hand the input channel to the scheduler
	for i := 0; i < e.WorkerCount; i++ {
		createWorker(in, out) // start WorkerCount (10 in main.go) goroutines that fetch and parse
	}
	// pass every request given to the engine on to the scheduler
	for _, r := range seeds { // the channel takes the place of a queue; the workers compete for it
		e.Scheduler.Submit(r)
	}
	// item counter
	itemCount := 0
	for {
		result := <-out
		for _, item := range result.Items {
			fmt.Printf("Got item #%d: %v\n", itemCount, item)
			itemCount++
		}
		for _, request := range result.Requests {
			e.Scheduler.Submit(request)
		}
	}
}
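Run depends on a scheduler abstraction and an engine struct whose definitions are not shown. A plausible sketch follows, inferred from the two methods Run calls on `e.Scheduler` and from the two fields that main.go sets. The exact declarations are assumptions, `Request` is reduced to its URL for brevity, and `submitAndReceive` is a hypothetical helper added only to exercise the types.

```go
package main

import "fmt"

// Request is reduced to a URL here; the real type also carries a parser.
type Request struct {
	Url string
}

// Scheduler lists the two methods that Run calls on e.Scheduler (assumed).
type Scheduler interface {
	Submit(Request)
	ConfigureMasterWorkerChan(chan Request)
}

// ConcurrentEngine holds the two fields that main.go sets (assumed).
type ConcurrentEngine struct {
	Scheduler   Scheduler
	WorkerCount int
}

// SimpleScheduler follows implementation 2 from the text.
type SimpleScheduler struct {
	workerChan chan Request
}

func (s *SimpleScheduler) ConfigureMasterWorkerChan(c chan Request) {
	s.workerChan = c
}

func (s *SimpleScheduler) Submit(r Request) {
	go func() {
		s.workerChan <- r
	}()
}

// Compile-time check that *SimpleScheduler satisfies Scheduler.
var _ Scheduler = (*SimpleScheduler)(nil)

// submitAndReceive wires the scheduler to a channel the way Run does,
// submits one request, and plays the role of a worker receiving it.
func submitAndReceive(url string) string {
	e := ConcurrentEngine{Scheduler: &SimpleScheduler{}, WorkerCount: 1}
	in := make(chan Request)
	e.Scheduler.ConfigureMasterWorkerChan(in)
	e.Scheduler.Submit(Request{Url: url})
	return (<-in).Url
}

func main() {
	fmt.Println(submitAndReceive("http://example.com")) // prints http://example.com
}
```

Depending on an interface rather than on SimpleScheduler directly means a different scheduler can be swapped in without touching Run.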
main.go
func main() {
	e := engine.ConcurrentEngine{
		Scheduler:   &schedular.SimpleScheduler{},
		WorkerCount: 10,
	}
	e.Run(engine.Request{
		Url:        "http://www.zhenai.com/zhenghun",
		ParserFunc: parser.ParseCityList,
	})
}
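Drawbacks of this structure:
There is very little control. Every request gets its own goroutine, and once such a goroutine has been launched it cannot be taken back, nor can we tell what state it is in. All workers compete for requests on the same channel, so there is no way to decide which request goes to which worker.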
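The last point can be observed directly. The sketch below is an illustration under simplifying assumptions (int requests, workers identified by a numeric id, and the hypothetical names `handled` and `dispatch`). It records which worker ended up with each request when all workers receive from one shared channel.

```go
package main

import (
	"fmt"
	"sync"
)

// handled records which worker processed which request.
type handled struct {
	request int
	worker  int
}

// dispatch submits each request from its own goroutine and lets the
// workers compete for them on one shared channel.
func dispatch(workers, requests int) []handled {
	in := make(chan int)
	out := make(chan handled)
	for id := 0; id < workers; id++ {
		go func(id int) {
			for r := range in {
				out <- handled{request: r, worker: id}
			}
		}(id)
	}
	var wg sync.WaitGroup
	for r := 0; r < requests; r++ {
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			in <- r // whichever worker is receiving takes it
		}(r)
	}
	go func() {
		wg.Wait()
		close(in) // lets the workers exit once every send has completed
	}()
	results := make([]handled, 0, requests)
	for len(results) < requests {
		results = append(results, <-out)
	}
	return results
}

func main() {
	for _, h := range dispatch(3, 6) {
		fmt.Printf("request %d -> worker %d\n", h.request, h.worker)
	}
}
```

Every request is handled exactly once, but the pairing of requests to workers is decided by the Go runtime and can change from run to run. Nothing in the submitting code can influence it.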