A Minimal Web Crawler Implementation

nhwcrival

Article link: https://blog.csdn.net/nhwcrival/article/details/40400599

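A web crawler sounds complicated, but the basic principle is simple: given a URL, download that page's content, extract the other URLs that appear on it, and store them in a queue. Then take one URL from the queue, download it, extract its links, and repeat until some stopping condition is met.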
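The loop described above can be sketched without any network access. This is only an illustration under stated assumptions: `CrawlLoopSketch`, `fakeWeb`, and the `*.example` URLs are hypothetical names that do not appear in the article, and a hard-coded map stands in for "downloading" a page.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {

    // Hypothetical in-memory "web": each page maps to the URLs it links to.
    static Map<String, List<String>> fakeWeb() {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://a.example", Arrays.asList("http://b.example", "http://c.example"));
        web.put("http://b.example", Arrays.asList("http://a.example", "http://d.example"));
        web.put("http://c.example", Arrays.asList("http://d.example"));
        web.put("http://d.example", Collections.<String>emptyList());
        return web;
    }

    // Visits pages starting at seed until the queue is empty or limit pages were visited.
    static Set<String> crawl(Map<String, List<String>> web, String seed, int limit) {
        Set<String> visited = new LinkedHashSet<>(); // remembers the visiting order
        Queue<String> pending = new ArrayDeque<>();
        pending.add(seed);
        while (!pending.isEmpty() && visited.size() < limit) {
            String url = pending.remove();
            if (!visited.add(url)) {
                continue; // this URL was already visited
            }
            // "Download" the page, then queue every link that has not been seen yet
            for (String link : web.getOrDefault(url, Collections.<String>emptyList())) {
                if (!visited.contains(link) && !pending.contains(link)) {
                    pending.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl(fakeWeb(), "http://a.example", 10));
    }
}
```

The `limit` parameter plays the role of the stopping condition; the real program later in the article uses a count of collected URLs instead.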

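Of course, real crawlers also involve traversal strategies such as breadth-first and depth-first search, but since this is the simplest possible crawler, we won't discuss them here.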

So let's start from the simplest principle. First, we need a data structure to hold the URLs.

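import java.util.LinkedList;

public class queue {

    private LinkedList<Object> queue;

    // Constructor
    public queue()
    {
        queue = new LinkedList<Object>();
    }
    // Enqueue: append an element at the tail
    public void enQueue(Object elem)
    {
        queue.addLast(elem);
    }
    // Dequeue: remove and return the head element
    public Object deQueue()
    {
        return queue.removeFirst();
    }
    // Check whether the queue is empty
    public boolean isEmpty()
    {
        return queue.isEmpty();
    }
    // Check whether the queue contains a given element
    public boolean contains(Object elem)
    {
        return queue.contains(elem);
    }
}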
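To see the first-in, first-out behavior in isolation, here is a small sketch that uses the standard library's `java.util.ArrayDeque` in place of the hand-written wrapper. `FifoSketch`, `drainOrder`, and the `*.example` URLs are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FifoSketch {

    // Adds three URLs, removes them again, and returns the removal order.
    static String drainOrder() {
        Deque<String> urls = new ArrayDeque<>();
        urls.addLast("http://a.example"); // same role as enQueue
        urls.addLast("http://b.example");
        urls.addLast("http://c.example");

        StringBuilder order = new StringBuilder();
        while (!urls.isEmpty()) {
            order.append(urls.removeFirst()).append(' '); // same role as deQueue
        }
        return order.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(drainOrder());
    }
}
```

The URLs come out in the order they went in. Note that `removeFirst()` throws `NoSuchElementException` on an empty collection, for `LinkedList` as well as `ArrayDeque`, so callers of `deQueue` should check `isEmpty()` first.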

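Next, we write another class to manage our URLs: a set of URLs that have already been visited and a queue of URLs still waiting to be visited.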

import java.util.HashSet;
import java.util.Set;

public class MyQueue {

	// Set of URLs that have already been visited
	private Set<String> visitedQueue;
	// Queue of URLs that have not been visited yet
	private queue unVisitedQueue;

	// Constructor
	public MyQueue() {
		visitedQueue = new HashSet<String>();
		unVisitedQueue = new queue();
	}

	// Record a URL as visited
	public void addURL(String url) {
		visitedQueue.add(url);
	}

	// Return the set of visited URLs
	public Set<String> getVisited() {
		return this.visitedQueue;
	}

	// Remove a URL from the visited set
	public void removeUrl(String url) {
		visitedQueue.remove(url);
	}

	// Dequeue the next unvisited URL
	public String getUnVURL() {
		return (String) unVisitedQueue.deQueue();
	}

	// Note: despite its name, this returns true when the URL is NOT yet in
	// the unvisited queue; the main program relies on that behavior
	public boolean contains(String url) {
		return !unVisitedQueue.contains(url);
	}

	// Enqueue a URL if it is non-blank, contains "http", and is neither
	// visited nor already queued
	public void addUnVURL(String url) {
		if (url != null && !url.trim().equals("")
				&& url.contains("http")
				&& !visitedQueue.contains(url)
				&& !unVisitedQueue.contains(url)) {
			unVisitedQueue.enQueue(url);
		}
	}

	// Return the number of visited URLs
	public int getVisitedNum() {
		return visitedQueue.size();
	}

	// Check whether the unvisited queue is empty
	public boolean isEmpty() {
		return unVisitedQueue.isEmpty();
	}
}
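The visited-set plus pending-queue idea can also be tried on its own. The sketch below is a variant, not the article's class: it assumes a second `HashSet` that mirrors the queue so that the "already queued?" check does not have to scan a linked list, and it accepts only URLs that start with `http`. `FrontierSketch`, `offer`, and `next` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class FrontierSketch {
    private final Set<String> visited = new HashSet<>();
    private final Set<String> queued = new HashSet<>(); // mirrors pending for fast lookups
    private final Deque<String> pending = new ArrayDeque<>();

    // Queues url unless it is null, not an http(s) URL, already visited, or already queued.
    boolean offer(String url) {
        if (url == null || !url.trim().startsWith("http")) {
            return false;
        }
        String u = url.trim();
        if (visited.contains(u) || !queued.add(u)) {
            return false;
        }
        pending.addLast(u);
        return true;
    }

    // Takes the next URL and records it as visited; returns null when nothing is pending.
    String next() {
        String u = pending.pollFirst();
        if (u != null) {
            queued.remove(u);
            visited.add(u);
        }
        return u;
    }

    public static void main(String[] args) {
        FrontierSketch f = new FrontierSketch();
        System.out.println(f.offer("http://a.example")); // accepted
        System.out.println(f.offer("http://a.example")); // rejected: already queued
        System.out.println(f.next());
        System.out.println(f.offer("http://a.example")); // rejected: already visited
    }
}
```

Whether the extra set is worth it depends on scale; for a crawl of about a thousand URLs the linear `contains` on the linked list is unlikely to matter.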

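Finally, we write the main program. You enter a URL; we use HttpClient (Apache Commons HttpClient 3.x, whose JAR you need to download yourself) to fetch its content, use a regular expression to extract the URLs it contains, and print each crawled URL to the console. The program stops once 1000 URLs have been collected or the queue of unvisited URLs is empty.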

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class HttpDownLoader {
	static int count = 0;

	// Read the whole response body of an executed request into a string
	private static String readBody(GetMethod method) throws IOException {
		StringBuilder sb = new StringBuilder();
		InputStream in = method.getResponseBodyAsStream();
		if (in == null) {
			return "";
		}
		BufferedReader br = new BufferedReader(new InputStreamReader(in));
		try {
			String line;
			while ((line = br.readLine()) != null) {
				sb.append(line);
			}
		} finally {
			br.close();
		}
		return sb.toString();
	}

	public static void main(String[] args) {
		HttpClient httpClient = new HttpClient();

		// Set the HTTP connection timeout to 5 s
		httpClient.getHttpConnectionManager().getParams()
				.setConnectionTimeout(5000);

		Scanner sc = new Scanner(System.in);
		MyQueue mq = new MyQueue();
		mq.addUnVURL(sc.next());
		while (count < 1000 && !mq.isEmpty()) {
			String sh = mq.getUnVURL();
			GetMethod getMethod = new GetMethod(sh);
			// Set the GET request (socket) timeout to 5 s
			getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT,
					5000);
			// Set the request retry handler
			getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
					new DefaultHttpMethodRetryHandler());
			mq.addURL(sh);
			try {
				String html = "";
				int status = httpClient.executeMethod(getMethod);
				if (status == HttpStatus.SC_OK) {
					html = readBody(getMethod);
				} else if ((status == HttpStatus.SC_MOVED_PERMANENTLY)
						|| (status == HttpStatus.SC_MOVED_TEMPORARILY)
						|| (status == HttpStatus.SC_SEE_OTHER)
						|| (status == HttpStatus.SC_TEMPORARY_REDIRECT)) {
					// Follow the redirect target given in the Location header
					Header head = getMethod.getResponseHeader("location");
					String newURL = (head == null) ? null : head.getValue();
					if (newURL != null && newURL.startsWith("http")) {
						GetMethod getMethod1 = new GetMethod(newURL);
						try {
							if (httpClient.executeMethod(getMethod1) == HttpStatus.SC_OK) {
								html = readBody(getMethod1);
							}
						} finally {
							getMethod1.releaseConnection();
						}
					}
				}

				// Match the text between href=" and the next double quote
				String mode = "(?<=(href=\")).*?(?=\")";
				// Debug output: dump the downloaded page
				System.out.println(html);
				Pattern p = Pattern.compile(mode);
				Matcher m = p.matcher(html);
				while (m.find()) {
					String url = m.group();
					// MyQueue.contains returns true when the URL is not yet queued
					if (url.contains("http") && mq.contains(url)) {
						System.out.println(url);
						mq.addUnVURL(url);
						count++;
					}
				}
			} catch (IOException e) {
				// Report the failed download and continue with the next URL
				System.err.println("Failed to fetch " + sh + ": " + e.getMessage());
			} finally {
				getMethod.releaseConnection();
			}
		}
		System.out.println("Visited " + mq.getVisitedNum() + " pages");

	}
}
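The link-extraction step of the main program can be tried in isolation, without any HTTP library. This sketch reuses the same regular expression; `HrefExtractSketch`, `extractLinks`, and the sample HTML are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractSketch {

    // Same pattern as in the main program: the text between href=" and the next "
    static final Pattern HREF = Pattern.compile("(?<=(href=\")).*?(?=\")");

    // Returns every double-quoted href value that contains "http".
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group();
            if (url.contains("http")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.example/x\">x</a>"
                + "<a href=\"/relative\">r</a>"
                + "<a href='http://single.example'>s</a>";
        System.out.println(extractLinks(html)); // prints [http://a.example/x]
    }
}
```

Only the first link is returned: the relative link fails the `http` check, and the single-quoted attribute is never matched by the pattern. That limitation is acceptable for a minimal crawler, but a proper HTML parser would likely be more robust.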

And with that, our simple web crawler is complete.