A Simple Web Crawler Example

智公博客

Category: Java    Tags: web crawler

Article link: https://blog.csdn.net/wenhuayuzhihui/article/details/50634263

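Summary (auto-generated by CSDN): This article introduces the basic design of a web crawler: start from a given URL, fetch the page content over HTTP, parse out the information you need, and recursively process newly discovered URL links. The sample code is commented in detail and is suitable for beginners who want to understand how a crawler works.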

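When you are a student, the term "web crawler" may sound impressive, but a simple implementation is something most students can follow without much trouble.
A web crawler treats the whole Internet as a literal web, like a spider's web: the application is the bug, crawling across the web according to a set of rules.
HTTP(S) is the most widely used protocol on the Internet today, so the example in this article is built on HTTP(S). It is only a demonstration and does not cover the more complex algorithms (which are in fact the most important part).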

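Design approach:
The program starts from one or more URLs and fetches each URL's content over HTTP(S). It then processes that content, extracting both the information to be collected and any URL links the content contains, and repeats the same steps for the newly found links.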
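The loop described above can be sketched in a few lines. This is only a minimal, single-threaded illustration and not the article's implementation: the class name `CrawlSketch`, the in-memory `FAKE_WEB` map that stands in for real HTTP(S) requests, the naive `href` regex, and the `maxDepth` limit are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the fetch / extract / repeat loop. */
class CrawlSketch {

    /** Stand-in for the web (URL -> page body); a real crawler would issue HTTP(S) requests. */
    static final Map<String, String> FAKE_WEB = Map.of(
            "http://a.example/", "<a href=\"http://b.example/\">b</a> <a href=\"http://c.example/\">c</a>",
            "http://b.example/", "<a href=\"http://a.example/\">a</a> <a href=\"http://d.example/\">d</a>",
            "http://c.example/", "no links here",
            "http://d.example/", "<a href=\"http://a.example/\">a</a>");

    /** Naive pattern for absolute links (an assumption; real HTML calls for a proper parser). */
    static final Pattern HREF = Pattern.compile("href=\"(https?://[^\"]+)\"");

    /** Step 1: get the content behind a URL (here a map lookup, not a network call). */
    static String fetch(String url) {
        return FAKE_WEB.getOrDefault(url, "");
    }

    /** Step 2: extract the URL links contained in the fetched content. */
    static List<String> extractLinks(String body) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(body);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Step 3: repeat for each newly found URL, level by level, down to maxDepth. */
    static List<String> crawl(String startUrl, int maxDepth) {
        List<String> visitedInOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();      // prevents fetching the same URL twice
        List<String> frontier = List.of(startUrl);
        seen.add(startUrl);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                visitedInOrder.add(url);
                for (String link : extractLinks(fetch(url))) {
                    if (seen.add(link)) {        // add() returns true only for unseen URLs
                        next.add(link);
                    }
                }
            }
            frontier = next;
        }
        return visitedInOrder;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://a.example/", 2));
    }
}
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, and the depth limit bounds how far it wanders from the start URL. A multi-threaded crawler would replace the `frontier` list with a shared work queue.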
Without further ado, the details are in the commented code:

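import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Summary: main program
 *
 * @author hwz
 */
public class MainApp {
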
    private Integer corePoolSize = 10;

    private Integer maxPoolSize = 20;

    private ThreadPoolExecutor executor;

    /** Work queue */
    private SpiderQueue workQueue;

    public void start(String url) throws Exception {
        // Initialize the thread pool
        LinkedBlockingDeque<Runnable> executorQueue = new LinkedBlockingDeque<Runnable>(maxPoolSize);
        executor = new ThreadPoolExecutor(corePoolSize, maxPoolSize, 60L, TimeUnit.SECONDS, 
                executorQueue);

        workQueue = new SpiderQueue(1024);
        SpiderUrl spiderUrl = new SpiderUrl(url, 0);
        try {
            workQueue.add(spiderUrl);
        }
        catch (Exception e) {
            System.out.println("insert url into workQueue error,url=" + url);
            e.printStackTrace();
        }

        // Submit the first task
        executor.submit(new SimpleSpider(workQueue, "thread-" + "main"));
        int i = 0;
        int idle = 0;
        while (true) {
            // Decide whether to add more threads to run tasks
            if (workQueue.size() > 20 && executor.getActiveCount() < maxPoolSize) {
                idle = 0;
                System.out.println("submit new thread,workQueue.size=" + workQueue.size() +
                        ",executorQueue.activeCount=" + executor.getActiveCount() + ",i=" + i);
                executor.submit(new SimpleSpider(workQueue, "thread-" + i++));
                Thread.sleep(500);
            }
            else if (workQueue.size() == 0) {
                idle++;
                System.out.println("main method, idle times=" + idle);

                // After the main thread has been idle 20 times, stop running
                if (idle > 2