Project Training Progress Log [5.31-6.3]

Benzenene!

Category: Project Training. Tags: crawler, java, programming language

Article link: https://blog.csdn.net/benzenene/article/details/125241038

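Summary

Converted Wenyan's single-threaded crawler into a multi-threaded one and embedded it in the resource library's extended-reading feature.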

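Crawler

Input: an array of strings, each a keyword for a field of interest
Output: an array of Link objects, each holding a page URL and the page title

Implementation: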

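Single-threaded approach: seed URLs are preloaded into oldmap. In each round, one URL is taken from oldmap; if its title matches one of the keywords, it is added to result, the URLs linked from its page are all added to newmap, and finally the URL's used flag is set to true. If result already holds enough URLs, the loop exits; otherwise another URL is taken from oldmap. Once oldmap is empty, the URLs in newmap are poured into oldmap and the crawl continues from oldmap.
Problems with this approach:
1⃣️ Among the seed URLs, Baidu Baike and Baidu Wenku yield nothing, and baidu.com yields mostly news, which is far from what we want, so useful content is hard to reach;
2⃣️ Pouring URLs back and forth between oldmap and newmap serves little purpose (its main purpose is deduplication, which can be done another way) and is a poor fit for multi-threading;
3⃣️ With multiple keywords, results skew toward the first keyword and the later keywords get no chance;
4⃣️ Every request triggers a fresh crawl;
5⃣️ If a keyword is obscure, the crawl never ends;
6⃣️ Crawling is too slow: about one page per second, and most pages do not match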
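
The round-based oldmap/newmap scheme described above can be sketched without any network access. This is only an illustration under stated assumptions: the `Page` class, the in-memory `web` map, and the explicit `seen` set are hypothetical stand-ins and are not taken from the original single-threaded code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TwoMapCrawlSketch {
    /** Hypothetical stand-in for a fetched page: its title and outgoing links. */
    static final class Page {
        final String title;
        final List<String> links;
        Page(String title, List<String> links) { this.title = title; this.links = links; }
    }

    /** Round-based crawl over an in-memory "web"; returns matching URLs in discovery order. */
    static List<String> crawl(Map<String, Page> web, List<String> seeds, String[] keywords, int wanted) {
        Map<String, Boolean> oldmap = new LinkedHashMap<>();   // URL -> "used" flag
        Set<String> seen = new HashSet<>(seeds);               // assumption: explicit dedup set
        for (String s : seeds) oldmap.put(s, false);
        List<String> result = new ArrayList<>();

        while (!oldmap.isEmpty()) {
            Map<String, Boolean> newmap = new LinkedHashMap<>();
            for (Map.Entry<String, Boolean> entry : oldmap.entrySet()) {
                String url = entry.getKey();
                Page page = web.get(url);
                entry.setValue(true);                          // mark the URL as used
                if (page == null) continue;                    // page could not be fetched
                String title = page.title.toLowerCase(Locale.ROOT);
                for (String k : keywords) {
                    if (title.contains(k.toLowerCase(Locale.ROOT))) {
                        result.add(url);
                        break;
                    }
                }
                if (result.size() >= wanted) return result;    // enough results, stop early
                for (String link : page.links) {
                    if (seen.add(link)) newmap.put(link, false);
                }
            }
            oldmap = newmap;                                   // pour newmap into oldmap
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Page> web = new LinkedHashMap<>();
        web.put("a", new Page("Java intro", List.of("b", "c")));
        web.put("b", new Page("Cooking", List.of("c")));
        web.put("c", new Page("Advanced JAVA", List.of("a")));
        System.out.println(crawl(web, List.of("a"), new String[] {"java"}, 2)); // prints [a, c]
    }
}
```

Each round has to finish before the next one starts, which is one reason this shape is awkward to split across threads.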

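Multi-threaded approach:

Fixes for the problems above:
1⃣️ Drop Baidu Baike and Baidu Wenku. Baidu's anti-crawling measures are strict and hard to bypass, so baidu.com is replaced with bing.com, and "/search?q=keyword (UTF-8 encoded)" is used to build search URLs that return more targeted results;
2⃣️ Keep candidate links in a single container (oldmap) used in FIFO order, with each thread taking a lock whenever it operates on it;
3⃣️ Cap the number of results per keyword at total results / number of keywords;
4⃣️ Keep matching pages that were dropped only because the result set is limited in candidates, and on the next crawl for the same keywords take from candidates first;
5⃣️ Set a maximum number of loop iterations; once it is exceeded, exit whether or not the result set is full;
6⃣️ Switch to multiple threads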
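
Fixes 1, 3 and 4 above can be illustrated in isolation. The sketch below is a simplified assumption, not the crawler's actual class: `KeywordQuotaSketch`, `bingSearchUrl` and `offer` are hypothetical names, URLs are stored as plain strings instead of Link objects, and the quota check uses a strict `<` so that a keyword never exceeds total / number of keywords.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.Set;

final class KeywordQuotaSketch {
    final String[] keywords;
    final int quota;                                            // resultNum / number of keywords
    final int[] hitTimes;                                       // accepted results per keyword
    final List<String> result = new ArrayList<>();
    final List<Queue<String>> candidates = new ArrayList<>();   // overflow, one queue per keyword
    final Set<String> names = new HashSet<>();                  // titles already handled

    KeywordQuotaSketch(String[] keywords, int resultNum) {
        this.keywords = keywords;
        this.quota = resultNum / keywords.length;
        this.hitTimes = new int[keywords.length];
        for (int i = 0; i < keywords.length; i++) candidates.add(new ArrayDeque<>());
    }

    /** Builds the Bing search seed URL for one keyword, UTF-8 percent-encoded. */
    static String bingSearchUrl(String keyword) {
        return "https://www.bing.com/search?q=" + URLEncoder.encode(keyword, StandardCharsets.UTF_8);
    }

    /** Returns true if the page entered the result set; overflow goes to candidates. */
    synchronized boolean offer(String title, String url) {
        String t = title.toLowerCase(Locale.ROOT);
        if (names.contains(t)) return false;                    // same title seen before
        for (int i = 0; i < keywords.length; i++) {
            if (!t.contains(keywords[i].toLowerCase(Locale.ROOT))) continue;
            names.add(t);
            if (hitTimes[i] < quota) {                          // keyword still has quota
                result.add(url);
                hitTimes[i]++;
                return true;
            }
            candidates.get(i).add(url);                         // keep it for the next crawl
            return false;
        }
        return false;                                           // no keyword matched
    }
}
```

For example, with keywords `{"java", "c++"}` and `resultNum = 2`, each keyword gets a quota of 1: the first Java page is accepted, a second Java page with a different title lands in `candidates.get(0)`, and a C++ page is still accepted afterwards.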

Data structures:

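boolean stop = false; // once one thread sees that the result set is full, all threads stop
int threadNum;  // number of threads
int resultNum;  // size of the result set
public ArrayList<Link> result;  // result set
Queue<String> oldmap; // queue of URLs waiting to be crawled
HashSet<String> history; // URLs already seen, to avoid crawling them again
HashSet<String> names;  // page titles already handled, for the case where section anchors of one site produce different URLs
List<Queue<Link>> candidates;  // matching pages that could not enter the result set because the keyword's quota was used up; to avoid wasting them they go into a candidate queue, which the next crawl for the same keywords draws from first, crawling anew only if that is not enough
Lock lock;  // lock shared by the threads to keep compound operations atomic
ExecutorService executor;  // executor
int[] hitTimes;  // number of URLs collected for each keyword
Queue<String> seeds;  // seed URLs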
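
The Link entity itself is not shown in this post. Judging only from how it is used in the code (a two-argument constructor taking title and URL, plus getName() and getUrl()), a minimal version might look like the sketch below. The field names and the toString() format are assumptions.

```java
/** Hypothetical minimal Link entity: a page title paired with its URL. */
final class Link {
    private final String name;  // page title
    private final String url;   // page address

    Link(String name, String url) {
        this.name = name;
        this.url = url;
    }

    String getName() { return name; }

    String getUrl() { return url; }

    /** Same "title(url)" shape that the crawler's main method prints. */
    @Override
    public String toString() { return name + "(" + url + ")"; }

    public static void main(String[] args) {
        Link l = new Link("java tutorial", "https://example.com/java");
        System.out.println(l); // prints java tutorial(https://example.com/java)
    }
}
```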

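Execution flow
1⃣️ Add CSDN to the seeds; build a Bing search URL for each keyword and add it to the seeds.
2⃣️ Initialize the executor. If candidates already holds entries, first move URLs from candidates into the result set; if it does not, initialize candidates.
3⃣️ The executor runs each thread:
（1）First take URLs from the seed queue. For each one obtained, collect at most 50 links from that page. Some of them are in-site links that lack the https:/ prefix, so the link format is checked and the prefix is added when it is missing. Then check whether the link already appears in the history; if not, add it to the to-crawl queue.
（2）Check whether stop is already set; check whether the result set is full, and if so tell all threads to stop. If the to-crawl queue is empty, count the empty check and sleep 0.5 s before looking again (a momentarily empty queue may only mean that the thread that took the last URL has not produced its links yet); after 5 empty checks in a row, treat the crawl as finished and tell all threads to stop. If the queue is not empty, take one URL from it and test its title against each keyword in turn (English characters are all lowercased before comparing). If a keyword hits, still has quota, and the title has not appeared in the result set, add the page to the result set and to names and increment that keyword's hit count; if it hits but the quota is used up, add the page to candidates and to names.
（3）Collect at most 50 links from this page; each one that is not in history is added to the to-crawl queue and to history.
4⃣️ When everything is finished, shut down the executor and let the main thread exit.
Result: the required content is found in about 6 s on average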
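
The empty-queue handling in step 3(2) can be tried on its own, without fetching pages. The following is a minimal sketch under assumptions: `BackoffWorkersSketch` and `drain` are hypothetical names, "processing" a URL just records it, and the retry count and sleep time are parameters rather than the fixed 5 and 0.5 s used in the crawler.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class BackoffWorkersSketch {
    /** Workers share one FIFO queue; each gives up after maxEmptyPolls empty polls in a row. */
    static Set<String> drain(Collection<String> urls, int threads, int maxEmptyPolls, long sleepMillis) {
        Queue<String> queue = new ArrayDeque<>(urls);
        Set<String> processed = new HashSet<>();
        ReentrantLock lock = new ReentrantLock();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                int emptyPolls = 0;
                while (emptyPolls < maxEmptyPolls) {
                    String url;
                    lock.lock();                       // queue and set are not thread-safe
                    try {
                        url = queue.poll();
                        if (url != null) processed.add(url);
                    } finally {
                        lock.unlock();
                    }
                    if (url != null) {
                        emptyPolls = 0;                // got work, reset the counter
                        continue;
                    }
                    emptyPolls++;                      // queue looked empty, wait and retry
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for the workers to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }
}
```

Calling `BackoffWorkersSketch.drain(List.of("a", "b", "c"), 4, 2, 10)` returns a set containing all three entries, each taken by exactly one worker.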
Code

package project.ourcourseassistant.util;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import project.ourcourseassistant.entity.Link;

import java.io.UnsupportedEncodingException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;


public class MultiThreadSpider {
    boolean stop = false;
    int threadNum = 20;
    int resultNum = 5;
    public ArrayList<Link> result;
    //    Map<String, Boolean> oldmap;
//    Map<String, Boolean> newmap;
    Queue<String> oldmap;
    //    Queue<String> newmap;
    HashSet<String> history;
    HashSet<String> names;
    List<Queue<Link>> candidates;

    Lock lock;
    ExecutorService executor;
    int[] hitTimes;
    Queue<String> seeds;


    public MultiThreadSpider(int threadNum, int resultNum) {
//        result = new ArrayList<String>();
//        oldmap=  new HashMap<String, Boolean>();
//        newmap = new HashMap<String, Boolean>();
        result = new ArrayList<>();
        oldmap = new LinkedList<>();
//        newmap = new LinkedList<>();
        history = new HashSet<>();
        this.threadNum = threadNum;
        this.resultNum = resultNum;
        candidates = new LinkedList<>();
        names = new HashSet<>();
        hitTimes = new int[100];

        seeds = new LinkedList<>();


        lock = new ReentrantLock();         // explicit lock guarding the shared collections
    }

    public void crawl(String[] args) {
        seeds.add("https://www.csdn.net");

        hitTimes = new int[args.length];
        result.clear();
        history.clear();
        names.clear();
        oldmap.clear();
        stop = false;

        for (int i = 0; i < args.length; i++) {
            try {
                seeds.add("https://www.bing.com/search?q=" + java.net.URLEncoder.encode(args[i], "utf-8"));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }


        executor = Executors.newCachedThreadPool();
        if (candidates.size() == 0) {
            for (int i = 0; i < args.length; i++) {
                candidates.add(new LinkedList<>());
            }
        } else {
            for (int i = 0; i < args.length; i++) {
                // Fix the count up front: poll() shrinks the queue while the loop runs
                int take = Math.min(resultNum / args.length, candidates.get(i).size());
                for (int j = 0; j < take; j++) {
                    Link l = candidates.get(i).poll();
                    result.add(l);
                    hitTimes[i]++;
                }
            }
        }


        for (int i = 0; i < threadNum; i++) {
            int finalI = i;
            executor.execute(new java.lang.Runnable() {
                @Override
                public void run() {
//                    System.out.println(seeds.size());
                    // Drain the seed queue. seeds is a plain LinkedList, so poll it under the lock.
                    while (true) {
                        lock.lock();
                        String url = seeds.poll();
                        lock.unlock();
                        if (url == null) break;
                        try {
                            System.out.println(url);
                            Document doc = Jsoup.connect(url).get();
                            Elements links = doc.getElementsByTag("a");
                            if (links.size() == 0) System.out.println("nothing found in " + url);
                            for (int i = 0; i < Math.min(links.size(), 50); i++) { // if this cap is set too low, threads may just keep sleeping
                                // Resolve in-site (relative) links to absolute URLs; skip non-http links
                                String linkHref = links.get(i).absUrl("href");
                                if (!linkHref.startsWith("http")) continue;
                                // Deduplicate: enqueue only links that are not yet in history
                                lock.lock();
                                if (!history.contains(linkHref)) {
                                    oldmap.add(linkHref);
                                    history.add(linkHref);
                                }
                                lock.unlock();
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
//                    System.out.println("oldmap.size="+oldmap.size());

                    int emptyTimes = 0;
//                    System.out.println("thread "+ finalI +" in");
                    for (int u = 0; u < 20; u++) {
                        lock.lock();
                        if (stop) {
                            lock.unlock();
                            break;
                        }
                        if (result.size() >= resultNum) { // the result set is full
                            stop = true;
                            lock.unlock();
                            break;
                        }
//                        lock.unlock();
//                        lock.lock();
                        if (oldmap.isEmpty()) {
                            if (emptyTimes < 5) {
                                emptyTimes++;
                                lock.unlock();
                                try {
                                    Thread.sleep(500);
                                } catch (InterruptedException e) {
                                    e.printStackTrace();
                                }
                                continue;
                            }
//                            System.out.println("old map is empty");
                            else {
                                stop = true;
                                lock.unlock();
                                break;
                            }
                        }
                        emptyTimes = 0;
                        String url = oldmap.poll();
                        assert url != null; // oldmap was checked to be non-empty above
                        lock.unlock();
                        // Crawl starting from this URL
                        try {
                            Connection connection = Jsoup.connect(url);
                            Document doc = connection.get();
                            String title = doc.title().toLowerCase(Locale.ROOT);
//                            System.out.println(title+":"+url);
                            // Match the page title against each keyword
                            for (int i = 0; i < args.length; i++) {
                                // On a match, add the page to the result set (or to candidates)
                                lock.lock();
                                if (title.contains(args[i].toLowerCase(Locale.ROOT)) && !names.contains(title)) {
                                    Link l = new Link(title, url);

                                    if (hitTimes[i] <= resultNum / args.length) {
                                        result.add(l);
                                        hitTimes[i]++;
                                        names.add(title);
                                    } else {
                                        candidates.get(i).add(l);
                                        names.add(title);
                                    }
                                    lock.unlock();
                                    break;
                                }
                                lock.unlock();
                            }
                            // Collect the links on the current page
                            Elements links = doc.getElementsByTag("a");
                            for (int i = 0; i < Math.min(links.size(), 50); i++) { // if this cap is set too low, threads may just keep sleeping
                                // Resolve in-site (relative) links to absolute URLs; skip non-http links
                                String linkHref = links.get(i).absUrl("href");
                                if (!linkHref.startsWith("http")) continue;
                                // Deduplicate: enqueue only links that are not yet in history
                                lock.lock();
                                if (!history.contains(linkHref)) {
                                    oldmap.add(linkHref);
                                    history.add(linkHref);
                                }
                                lock.unlock();
                            }

                        } catch (Exception e) {
                            // Invalid URLs, unknown hosts, read timeouts and anti-crawler rejections all end up here
                            System.out.println("failed to crawl " + url + ": " + e);
                        }

                    }
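                    // while(stop): if the stop flag is true, stop and exit the thread
                    // if result.size()>20: lock stop, set it to true, unlock, exit the thread
                    // if oldmap is empty: lock oldmap and newmap; if newmap has entries, pour them into oldmap and unlock;
                    //                      if newmap is empty too, lock stop, set it to true, unlock, unlock the maps, exit the thread
                    // lock oldmap, take one URL from it, unlock, crawl, lock newmap, add the results to newmap, unlock
//                    System.out.println("thread "+ finalI +" out");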

                }
            });        // submit the task to the thread pool
        }
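        executor.shutdown();
        try {
            // Block until all workers finish instead of busy-waiting on isTerminated()
            executor.awaitTermination(Long.MAX_VALUE, java.util.concurrent.TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("finish");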
    }

    public static void main(String[] args) {
        // The keywords are search terms and stay as written: linear algebra, violin, modern literature
        String[] topics = {"Java", "C++", "线性代数", "小提琴", "现代文学"};
        MultiThreadSpider s = new MultiThreadSpider(30, 5);
        List<Link> res = new LinkedList<>();
        for (int j = 0; j < 4; j++) {
            s.crawl(topics);
            List<Link> r = s.result;
            res.addAll(r);
        }
        for (int i = 0; i < res.size(); i++) {
            System.out.println(res.get(i).getName() + "(" + res.get(i).getUrl() + ")");
        }
        System.out.println(Arrays.toString(s.hitTimes));

    }
}