Gecco Source Code Analysis

helloworddm

Article link: https://blog.csdn.net/helloworlddm/article/details/84800293

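(1) GeccoEngine->run()
1. By default, the proxy pool is loaded from the proxys file.
2. Set up the scheduler: in loop mode, scheduler = new StartScheduler();
otherwise, scheduler = new NoLoopStartScheduler();
3. If spiderBeanFactory is null, initialize it.
4. Initialize cdl (the countdown latch) with a count equal to the number of threads.
5. Parse the starts.json file into a List and add its entries to startRequests.
6. Iterate over startRequests and call scheduler.into(startRequest) for each one.
7. Initialize spiders = new ArrayList(threadCount).
8. Instantiate each Spider and start its thread.
9. Record the start time.
10. Monitor basic crawler information and export it through JMX.
11. In non-loop mode, wait for all threads to finish, then shut down.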
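Steps 4, 8 and 11 follow a common latch pattern, which can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not Gecco's code: `RunSketch`, `runWorkers`, and the queue standing in for the scheduler are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the start-up pattern; these are not Gecco's real classes.
class RunSketch {

    // Starts threadCount workers that drain the queue, then blocks until all have finished.
    static int runWorkers(List<String> startRequests, int threadCount) {
        // A plain queue stands in for the scheduler (step 6: scheduler.into(...)).
        ConcurrentLinkedQueue<String> scheduler = new ConcurrentLinkedQueue<>(startRequests);
        // Step 4: one latch count per worker thread.
        CountDownLatch cdl = new CountDownLatch(threadCount);
        AtomicInteger processed = new AtomicInteger();
        // Step 7: keep a handle on every worker.
        List<Thread> spiders = new ArrayList<>(threadCount);
        for (int i = 0; i < threadCount; i++) {
            // Step 8: create each worker and start its thread.
            Thread t = new Thread(() -> {
                try {
                    while (scheduler.poll() != null) {
                        processed.incrementAndGet(); // a real spider would download and render here
                    }
                } finally {
                    cdl.countDown(); // signal that this worker is done
                }
            }, "spider-" + i);
            spiders.add(t);
            t.start();
        }
        try {
            cdl.await(); // step 11: non-loop mode waits for every worker
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for spiders", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        int n = runWorkers(Arrays.asList("http://a", "http://b", "http://c"), 2);
        System.out.println("processed " + n + " requests");
    }
}
```

Counting down in a `finally` block means a worker that fails still releases the latch, so the waiting thread cannot hang on a crashed spider.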
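(2) currSpiderBeanClass = engine.getSpiderBeanFactory().matchSpider(request); analysis
1. String url = request.getUrl();
2. Iterate over spiderBeans. spiderBeans is populated by the factory created in step 3 of GeccoEngine's run method: spiderBeanFactory = new SpiderBeanFactory(classpath, pipelineFactory);
3. Match url against each key in spiderBeans. On a match, call request.setParam(param), where param is the result of the match. Each key in spiderBeans is the value of the matchUrl attribute of a @Gecco annotation.
4. Return the spider that corresponds to the matching key.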
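The matching in steps 2 to 4 can be sketched as follows. This is a minimal sketch, assuming matchUrl patterns use `{name}` placeholders; `MatchUrlSketch`, `match`, and `matchSpider` are hypothetical names, and Gecco's actual matcher may differ in detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of URL-pattern matching; not Gecco's real implementation.
class MatchUrlSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Returns the placeholder values if url matches matchUrl, or null if it does not.
    static Map<String, String> match(String matchUrl, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(matchUrl);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so it matches verbatim.
            regex.append(Pattern.quote(matchUrl.substring(last, m.start())));
            regex.append("([^/?&#]+)"); // assumption: one placeholder spans one URL segment
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(matchUrl.substring(last)));
        Matcher um = Pattern.compile(regex.toString()).matcher(url);
        if (!um.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), um.group(i + 1));
        }
        return params;
    }

    // Walks the pattern -> bean map and returns the first bean whose pattern matches.
    static String matchSpider(Map<String, String> spiderBeans, String url) {
        for (Map.Entry<String, String> e : spiderBeans.entrySet()) {
            if (match(e.getKey(), url) != null) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The pattern and URL below are illustrative values only.
        System.out.println(match("https://github.com/{user}/{project}",
                "https://github.com/xtuhcy/gecco"));
    }
}
```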
(3) HttpResponse response = currDownloader.download(request, timeout); analysis
1. At startup, MonitorDownloaderFactory uses cglib to generate a proxy class for HttpClientDownloader. After HttpClientDownloader's download method is called, the proxy executes DownloadMonitor.incrSuccess(request.getUrl());
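The same idea, counting successes around a download call, can be sketched with the JDK's own dynamic proxy. Note the swap: Gecco uses cglib, which can subclass a concrete class, while `java.lang.reflect.Proxy` only works on interfaces. `MonitorProxySketch`, `Downloader`, and `SUCCESS` are hypothetical names for illustration.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch using a JDK dynamic proxy instead of cglib.
class MonitorProxySketch {

    // Simplified stand-in for a downloader; not Gecco's interface.
    interface Downloader {
        String download(String url);
    }

    // Per-URL success counters, playing the role of the download monitor.
    static final Map<String, AtomicLong> SUCCESS = new ConcurrentHashMap<>();

    // Wraps target so that every download that returns normally is counted.
    static Downloader monitored(Downloader target) {
        return (Downloader) Proxy.newProxyInstance(
                Downloader.class.getClassLoader(),
                new Class<?>[] { Downloader.class },
                (proxy, method, args) -> {
                    Object result;
                    try {
                        result = method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the downloader's own exception
                    }
                    if ("download".equals(method.getName())) {
                        SUCCESS.computeIfAbsent((String) args[0], k -> new AtomicLong())
                                .incrementAndGet();
                    }
                    return result;
                });
    }

    public static void main(String[] args) {
        Downloader d = monitored(url -> "<html>" + url + "</html>");
        d.download("http://example.com");
        System.out.println(SUCCESS);
    }
}
```

Because the counter is updated only after `method.invoke` returns, a download that throws is not counted as a success.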
(4) spiderBean = render.inject(currSpiderBeanClass, request, response); analysis
Core method:
private Object injectHtmlField(HttpRequest request, HttpResponse response, Field field, Class<? extends SpiderBean> clazz) {
    HtmlField htmlField = field.getAnnotation(HtmlField.class);
    String content = response.getContent();
    HtmlParser parser = new HtmlParser(request.getUrl(), content);
    // parser.setLogClass(clazz);
    String cssPath = htmlField.cssPath();
    Class<?> type = field.getType(); // declared type of the field
    boolean isArray = type.isArray(); // whether the field is an array
    boolean isList = ReflectUtils.haveSuperType(type, List.class); // whether the field is a List
    if (isList) {
        Type genericType = field.getGenericType(); // type including generic information
        Class genericClass = ReflectUtils.getGenericClass(genericType, 0); // element class of the List
        if (ReflectUtils.haveSuperType(genericClass, SpiderBean.class)) {
            // List<SpiderBean>
            return parser.$beanList(cssPath, request, genericClass);
        } else {
            // List of basic (non-SpiderBean) values
            try {
                return parser.$basicList(cssPath, field);
            } catch (Exception ex) {
                //throw new FieldRenderException(field, content, ex);
                FieldRenderException.log(field, content, ex);
            }
        }
    } else if (isArray) {
        Class genericClass = type.getComponentType();
        if (ReflectUtils.haveSuperType(genericClass, SpiderBean.class)) {
            // SpiderBean[]
            List<SpiderBean> list = parser.$beanList(cssPath, request, genericClass);
            SpiderBean[] a = (SpiderBean[]) Array.newInstance(genericClass, list.size());
            return list.toArray(a);
        } else {
            // array of basic (non-SpiderBean) values
            try {
                return parser.$basicList(cssPath, field).toArray();
            } catch (Exception ex) {
                //throw new FieldRenderException(field, content, ex);
                FieldRenderException.log(field, content, ex);
            }
        }
    } else {
        if (ReflectUtils.haveSuperType(type, SpiderBean.class)) {
            // SpiderBean
            return parser.$bean(cssPath, request, (Class<? extends SpiderBean>) type);
        } else {
            // single basic (non-SpiderBean) value
            try {
                return parser.$basic(cssPath, field);
            } catch (Exception ex) {
                //throw new FieldRenderException(field, content, ex);
                FieldRenderException.log(field, content, ex);
            }
        }
    }
    return null;
}
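The method above branches on three cases: List, array, or single value. That type dispatch can be tried in isolation with plain reflection. This is a minimal sketch, assuming only JDK classes; `FieldTypeSketch`, `Repo`, and `classify` are hypothetical names, and the parsing itself is left out.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical sketch of the List / array / single-value dispatch; not Gecco's code.
class FieldTypeSketch {

    // Example bean with one field per branch.
    static class Repo {
        List<String> tags;
        String[] owners;
        int stars;
    }

    // Describes which of the three branches a field would take.
    static String classify(Field field) {
        Class<?> type = field.getType();
        if (List.class.isAssignableFrom(type)) {
            // The element class is only available through the generic type.
            Type generic = field.getGenericType();
            if (generic instanceof ParameterizedType) {
                Type arg = ((ParameterizedType) generic).getActualTypeArguments()[0];
                if (arg instanceof Class) {
                    return "list of " + ((Class<?>) arg).getSimpleName();
                }
            }
            return "list of unknown"; // raw List or wildcard argument
        }
        if (type.isArray()) {
            return "array of " + type.getComponentType().getSimpleName();
        }
        return "single " + type.getSimpleName();
    }

    // Convenience overload that looks the field up by name.
    static String classify(Class<?> owner, String fieldName) {
        try {
            return classify(owner.getDeclaredField(fieldName));
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        for (Field f : Repo.class.getDeclaredFields()) {
            System.out.println(f.getName() + " -> " + classify(f));
        }
    }
}
```

In the original method, each branch then checks whether the element type is a SpiderBean, which decides between recursive bean rendering and basic value conversion.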
To tell whether a request is an AJAX request, check whether its headers include X-Requested-With: XMLHttpRequest.
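A minimal sketch of that check, assuming the headers are available as a simple name-to-value map; `AjaxCheckSketch` and `isAjax` are hypothetical names. The header is a convention set by libraries such as jQuery, so its absence does not prove that a request is not AJAX.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the header map is an assumed representation of the request.
class AjaxCheckSketch {

    // Returns true if the headers carry X-Requested-With: XMLHttpRequest.
    static boolean isAjax(Map<String, String> headers) {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            // HTTP header names are case-insensitive, so compare accordingly.
            if ("X-Requested-With".equalsIgnoreCase(e.getKey())
                    && e.getValue() != null
                    && "XMLHttpRequest".equalsIgnoreCase(e.getValue().trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-Requested-With", "XMLHttpRequest");
        System.out.println(isAjax(headers));
    }
}
```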