A small crawler written with HttpClient 4.5+, Disruptor 3.3.2, MongoDB 3.2, and Jsoup

I recently took over a Storm-based system at work. It receives data over a Dubbo interface and buffers it in a queue, and that queue was the JDK's built-in blocking queue, ArrayBlockingQueue. After several rounds of load testing, ArrayBlockingQueue turned out to be a performance bottleneck, so my team lead suggested trying Disruptor.

Disruptor is an open-source lock-free queue with very strong performance; just how strong is something you can easily benchmark yourself.

In the end, though, Disruptor didn't fit our scenario. In our Storm architecture the consumer actively pulls messages from the queue, whereas Disruptor works the other way around: you register a handler for the queue, and every message placed in it is pushed to that handler, i.e. consumption is passive. So we never ended up using it there.
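
For comparison, here is a minimal self-contained sketch of the two consumption styles; the class ConsumeStyleSketch and the MsgEvent type are made up for illustration and are not part of the actual system:

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: contrasts pull-style and push-style consumption. */
public class ConsumeStyleSketch {

    /** A trivial mutable event; the ring buffer pre-allocates instances of it. */
    static class MsgEvent {
        String payload;
    }

    public static void main(String[] args) throws InterruptedException {
        // Pull style: the consumer actively takes messages from a blocking queue.
        BlockingQueue<String> queue = new ArrayBlockingQueue<String>(1024);
        queue.put("hello");
        System.out.println("pulled: " + queue.take()); // take() blocks until a message arrives

        // Push style (Disruptor): a handler is registered up front and the framework
        // calls it back for every event published to the ring buffer.
        ExecutorService pool = Executors.newCachedThreadPool();
        Disruptor<MsgEvent> disruptor = new Disruptor<MsgEvent>(new EventFactory<MsgEvent>() {
            public MsgEvent newInstance() {
                return new MsgEvent();
            }
        }, 1024, pool);
        disruptor.handleEventsWith(new EventHandler<MsgEvent>() {
            public void onEvent(MsgEvent event, long sequence, boolean endOfBatch) {
                System.out.println("pushed: " + event.payload);
            }
        });
        disruptor.start();

        // Publish one event: claim a slot, fill it, publish the sequence.
        RingBuffer<MsgEvent> ringBuffer = disruptor.getRingBuffer();
        long seq = ringBuffer.next();
        try {
            ringBuffer.get(seq).payload = "hello";
        } finally {
            ringBuffer.publish(seq);
        }

        disruptor.shutdown(); // waits until the handler has drained the ring buffer
        pool.shutdown();
    }
}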

Back to the point: I later wrote this small crawler on my own time to study this high-performance lock-free queue.

After a little thought, the pieces were obvious: HttpClient for fetching, Jsoup for HTML parsing, and MongoDB for storage.

Only then did I notice that both HttpClient and MongoDB had moved on. HttpClient needs no introduction; the one worth mentioning is MongoDB: the last time I used Mongo was in the 2.0 era, and it is now at 3.2.
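
For anyone else coming from the 2.x driver, the gist of the change is that DB / DBCollection / DBObject give way to MongoDatabase / MongoCollection<Document> / org.bson.Document. Below is a minimal sketch of the same insert written both ways; the database and collection names are placeholders, and the old calls still compile against the 3.2 driver but are deprecated:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class DriverApiSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("127.0.0.1", 27017);

        // 2.x-era API: DB / DBCollection / DBObject (deprecated in the 3.x driver).
        DB oldDb = client.getDB("crawler");
        DBCollection oldColl = oldDb.getCollection("pages");
        oldColl.insert(new BasicDBObject("url", "http://example.com"));

        // 3.x API: MongoDatabase / MongoCollection<Document> / org.bson.Document.
        MongoDatabase db = client.getDatabase("crawler");
        MongoCollection<Document> coll = db.getCollection("pages");
        coll.insertOne(new Document("url", "http://example.com"));

        client.close();
    }
}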

Straight to the code.

Here is the MongoDB 3.2 code, which mainly wraps a basic utility class:

package com.jiangjun.crawler.mongodb;

import com.jiangjun.crawler.Constant;
import com.mongodb.*;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.List;
import java.util.Map;

/**
 * Created by 15061760 on 2015/12/28 0028.
 */
public abstract class AbstractMongoDBCache {

    public static MongoClient mongo = null;
    public static MongoDatabase db = null;

    static {
        MongoClientOptions.Builder builder = new MongoClientOptions.Builder();
        builder.connectionsPerHost(50); // maximum number of connections per host
        builder.threadsAllowedToBlockForConnectionMultiplier(50); // max threads allowed to wait for a connection (multiplier of connectionsPerHost)
        builder.maxWaitTime(1000 * 60 * 2); // wait at most 2 minutes for a free connection
        builder.connectTimeout(1000 * 60 * 1); // connect timeout: 1 minute
        MongoClientOptions options = builder.build();
        ServerAddress serverAddress = new ServerAddress("127.0.0.1", 27017);
        mongo = new MongoClient(serverAddress, options);
        db = mongo.getDatabase(Constant.MONGO_DB);
    }

    /** Closes the Mongo client and releases its connection pool. */
    public static void destroy() {
        if (mongo != null) {
            mongo.close();
            db = null;
        }
    }

    /**
     * Returns the collection with the given name.
     *
     * @param name collection name
     * @return the collection
     */
    public abstract MongoCollection getCollectionByName(String name);

    /**
     * Queries the named collection for records matching the given criteria.
     *
     * @param paramMap field/value pairs that must all match
     * @param name     collection name
     * @return a cursor over the matching documents
     */
    public abstract MongoCursor findOneByParam(Map<String,Object> paramMap,String name);

    /**
     * Returns all records of the named collection.
     *
     * @param name collection name
     * @return a cursor over every document in the collection
     */
    public abstract MongoCursor queryCollectionByName(String name);

    /**
     * Saves a single document.
     *
     * @param document       the document to insert
     * @param collectionName target collection name
     */
    public abstract void save(Document document, String collectionName);

    /**
     * Saves documents in batch.
     *
     * @param objectList     the documents to insert
     * @param collectionName target collection name
     */
    public abstract void saveBatch(List<Document> objectList, String collectionName);

    /**
     * Removes one record matching the given criteria from the collection.
     *
     * @param paramMap       field/value pairs that must all match
     * @param collectionName target collection name
     */
    public abstract void removeByParam(Map<String,Object> paramMap, String collectionName);

    /**
     * Drops an entire collection.
     *
     * @param collectionName the collection to drop
     */
    public abstract void dropCollection(String collectionName);

}

package com.jiangjun.crawler.mongodb;

import com.jiangjun.crawler.Constant;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.util.*;

/**
 * Created by 15061760 on 2015/12/28 0028.
 */
public class MongoDBSupport extends AbstractMongoDBCache {

    private static MongoDBSupport mongoDBSupport = new MongoDBSupport();

    private MongoDBSupport() {
    }

    public static MongoDBSupport getInstance() {
        return mongoDBSupport;
    }

    @Override
    public MongoCollection getCollectionByName(String name) {
        MongoCollection collection = db.getCollection(name);
        return collection;
    }

    @Override
    public MongoCursor queryCollectionByName(String name) {
        MongoCollection collection = db.getCollection(name);
        return collection.find().iterator();
    }

    @Override
    public void save(Document document, String collectionName) {
        MongoCollection collection = db.getCollection(collectionName);
        collection.insertOne(document);
    }

    @Override
    public void saveBatch(List<Document> objectList, String collectionName) {
        MongoCollection collection = db.getCollection(collectionName);
        collection.insertMany(objectList);
    }

    @Override
    public void dropCollection(String collectionName) {
        MongoCollection collection = db.getCollection(collectionName);
        collection.drop();
    }

    @Override
    public void removeByParam(Map<String, Object> paramMap, String collectionName) {
        MongoCollection collection = db.getCollection(collectionName);
        // Build an equality filter for every entry and AND them together.
        List<Bson> filters = new ArrayList<Bson>();
        for (Map.Entry<String, Object> m : paramMap.entrySet()) {
            filters.add(Filters.eq(m.getKey(), m.getValue()));
        }
        collection.deleteOne(Filters.and(filters));
    }

    @Override
    public MongoCursor findOneByParam(Map<String, Object> paramMap, String name) {
        MongoCollection collection = db.getCollection(name);
        // Build an equality filter for every entry and AND them together.
        List<Bson> filters = new ArrayList<Bson>();
        for (Map.Entry<String, Object> m : paramMap.entrySet()) {
            filters.add(Filters.eq(m.getKey(), m.getValue()));
        }
        return collection.find(Filters.and(filters)).iterator();
    }

    // Quick smoke test: look up a previously saved document by its link and source url.
    public static void main(String[] args) {
        MongoDBSupport mongoDBSupport = MongoDBSupport.getInstance();
        MongoCursor cursor = mongoDBSupport.findOneByParam(new HashMap<String, Object>() {
            {
                put("link", "http://d.news.163.com/article/BD9L573200014TUH");
                put("url", "http://d.news.163.com/articlesPage/new");
            }
        }, Constant.MONGO_NETEASE);
        if(cursor.hasNext()) {
            System.out.println(cursor.next().toString());
        }
    }
}

And here is the Disruptor code:
package com.jiangjun.crawler.disruptor;

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.jiangjun.crawler.bean.UrlEvent;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Created by 15061760 on 2015/12/26 0026.
 */
public class Disrupter4crawler {
    static EventFactory<UrlEvent> urlEventFactory = null;
    static ExecutorService executorService = null;
    final static int ringBufferSize = 2 * 1024;
    static Disruptor<UrlEvent> disruptor4Wait = null;
    static Disruptor<UrlEvent> disruptor4Complete = null;
    static EventHandler<UrlEvent> eventHandler = null;

    // Wire up two Disruptor instances (one for URLs waiting to be crawled, one for
    // completed URLs), both driven by the same handler and a shared thread pool.
    static {
        urlEventFactory = new UrlEventFactory();
        executorService = Executors.newCachedThreadPool();
        disruptor4Wait = new Disruptor<UrlEvent>(urlEventFactory, ringBufferSize, executorService);
        disruptor4Complete = new Disruptor<UrlEvent>(urlEventFactory, ringBufferSize, executorService);
        eventHandler = new UrlEventHandler();
        disruptor4Wait.handleEventsWith(eventHandler);
        disruptor4Complete.handleEventsWith(eventHandler);
        disruptor4Wait.start();
        disruptor4Complete.start();
    }

    /**
     * Publishes the event into one of the two ring buffers, chosen by its flag
     * (0 = waiting to be crawled, 1 = crawl complete). The claim/copy/publish
     * sequence reuses the ring buffer's pre-allocated event objects.
     */
    public static void offer(UrlEvent urlEvent) {
        int flag = urlEvent.getFlag();
        switch (flag) {
            case 0:
                RingBuffer<UrlEvent> ringBuffer_wait = disruptor4Wait.getRingBuffer();
                long sequenceWait = ringBuffer_wait.next();
                try {
                    UrlEvent ueWait = ringBuffer_wait.get(sequenceWait);
                    ueWait.setUrl(urlEvent.getUrl());
                    ueWait.setDes(urlEvent.getDes());
                    ueWait.setFlag(urlEvent.getFlag());
                }finally {
                    ringBuffer_wait.publish(sequenceWait);
                }
                break;
            case 1:
                RingBuffer<UrlEvent> ringBuffer_complete = disruptor4Complete.getRingBuffer();
                long sequenceComplete = ringBuffer_complete.next();
                try {
                    UrlEvent ueComplete = ringBuffer_complete.get(sequenceComplete);
                    ueComplete.setUrl(urlEvent.getUrl());
                    ueComplete.setDes(urlEvent.getDes());
                    ueComplete.setFlag(urlEvent.getFlag());
                }finally {
                    ringBuffer_complete.publish(sequenceComplete);
                }
                break;
            default:
                break;
        }
    }
}
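
The offer method above uses the low-level claim/publish pattern (next, get, publish, with publish in a finally block so a claimed slot is never left unpublished). Disruptor 3.x also provides a translator-based API that wraps the same bookkeeping; the small sketch below shows the equivalent publish written that way. It is illustrative only and not part of the actual crawler:

import com.jiangjun.crawler.bean.UrlEvent;
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.RingBuffer;

/** Illustrative alternative to the manual next/get/publish sequence in offer(). */
public class TranslatorPublishSketch {

    // Copies the caller's UrlEvent into the pre-allocated slot event.
    private static final EventTranslatorOneArg<UrlEvent, UrlEvent> TRANSLATOR =
            new EventTranslatorOneArg<UrlEvent, UrlEvent>() {
                public void translateTo(UrlEvent slot, long sequence, UrlEvent source) {
                    slot.setUrl(source.getUrl());
                    slot.setDes(source.getDes());
                    slot.setFlag(source.getFlag());
                }
            };

    public static void publish(RingBuffer<UrlEvent> ringBuffer, UrlEvent source) {
        // publishEvent claims a sequence, runs the translator, and publishes,
        // handling the claim/publish bookkeeping internally.
        ringBuffer.publishEvent(TRANSLATOR, source);
    }
}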

package com.jiangjun.crawler.disruptor;

import com.lmax.disruptor.EventFactory;
import com.jiangjun.crawler.bean.UrlEvent;

/**
 * Created by 15061760 on 2015/12/26 0026.
 */
public class UrlEventFactory implements EventFactory<UrlEvent>{
    public UrlEvent newInstance() {
        return new UrlEvent();
    }
}

package com.jiangjun.crawler.disruptor;

import com.jiangjun.crawler.filter.SaveFilter;
import com.lmax.disruptor.EventHandler;
import com.jiangjun.crawler.bean.UrlEvent;
import com.jiangjun.crawler.filter.DocumentParseFilter;
import com.jiangjun.crawler.filter.HttpClientFilter;
import org.bson.Document;

/**
 * Created by 15061760 on 2015/12/26 0026.
 */
public class UrlEventHandler implements EventHandler<UrlEvent> {
    public void onEvent(UrlEvent urlEvent, long sequence, boolean endOfBatch) throws Exception {
        if (urlEvent.getFlag() == 0) {
            // Run the crawl pipeline: fetch -> parse -> save.
            HttpClientFilter.getInstance().setFilter(DocumentParseFilter.getInstance());
            DocumentParseFilter.getInstance().setFilter(SaveFilter.getInstance());
            HttpClientFilter.getInstance().doProcess(urlEvent);
        }
    }
}
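
The filter classes referenced above (HttpClientFilter, DocumentParseFilter, SaveFilter) are not listed in this post. As a rough idea of what the fetch-and-parse step inside that chain might look like with HttpClient 4.5 and Jsoup, here is a minimal self-contained sketch; it is not the actual filter implementation, and the CSS selector and document fields are placeholders:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.bson.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class FetchParseSketch {

    public static void main(String[] args) throws Exception {
        String url = "http://d.news.163.com/articlesPage/new";

        // Fetch the page with HttpClient 4.5.
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = httpClient.execute(new HttpGet(url));
        String html;
        try {
            html = EntityUtils.toString(response.getEntity(), "UTF-8");
        } finally {
            response.close();
            httpClient.close();
        }

        // Parse the HTML with Jsoup and turn each link into a MongoDB Document.
        org.jsoup.nodes.Document page = Jsoup.parse(html, url);
        for (Element a : page.select("a[href]")) { // placeholder selector
            Document doc = new Document("link", a.absUrl("href"))
                    .append("title", a.text())
                    .append("url", url);
            // In the real pipeline this would be handed on to the save step, e.g.:
            // MongoDBSupport.getInstance().save(doc, Constant.MONGO_NETEASE);
            System.out.println(doc.toJson());
        }
    }
}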


package com.jiangjun.crawler;

import com.jiangjun.crawler.bean.UrlEvent;
import com.jiangjun.crawler.disruptor.Disrupter4crawler;

import java.util.ArrayList;
import java.util.List;

/**
 * Created by 15061760 on 2015/12/29 0029.
 */
public class Main {

    static List<UrlEvent> urlEvents = new ArrayList<UrlEvent>();

    static {
        UrlEvent urlEvent = new UrlEvent();
        urlEvent.setUrl("http://baike.baidu.com/cms/home/eventsOnHistory/12.json?_=1451271757920");
        urlEvent.setDes("百度百科");
        urlEvent.setFlag(0);

        UrlEvent urlEvent2 = new UrlEvent();
        urlEvent2.setUrl("http://d.news.163.com/articlesPage/new");
        urlEvent2.setDes("网易哒哒-");
        urlEvent2.setFlag(0);

        urlEvents.add(urlEvent);
        urlEvents.add(urlEvent2);
    }

    public static void main(String[] args) {
        System.out.println("crawler");
//        UrlEvent urlEvent = new UrlEvent();
        urlEvent.setUrl("http://baike.baidu.com/cms/home/eventsOnHistory/12.json?_=1451271757920");
        urlEvent.setDes("百度百科");
//        urlEvent.setUrl("http://d.news.163.com/articlesPage/new");
//        urlEvent.setDes("网易哒哒-");
//        urlEvent.setFlag(0);
        for (UrlEvent u : urlEvents) {
            Disrupter4crawler.offer(u);
        }
    }
}
 

I didn't polish this code much. The main point is that MongoDB's API has deprecated the 2.0-era methods, so I read the 3.2 API and wrote this small utility class; it still needs to be refined and extended.

