Web Crawlers: Scraping News from a Website

Latest recommended article published 2024-08-11 14:00:23

le4

Article link: https://blog.csdn.net/openlms/article/details/8847176

Our project needed a crawler, so I worked through the topic on my own and picked up a few patterns, which I summarize here:

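1. What is a crawler?

A web crawler is a program that automatically fetches web page content. It is an important component of a search engine.

A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them. As it fetches pages, it keeps extracting new URLs from the current page and adding them to a queue, until some stop condition of the system is met. For vertical search, a focused crawler, that is, one that deliberately crawls only pages on a specific topic, is a better fit.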
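The queue-driven loop described above can be sketched roughly as follows. This is a minimal illustration, not a real crawler: the class name `CrawlFrontierSketch`, the `crawl` method, the in-memory `links` map (a stand-in for "fetch the page and extract its URLs"), the `onTopic` filter, and the `site/...` URLs are all assumptions made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: names and the in-memory link graph are illustrative only.
class CrawlFrontierSketch {

    // Breadth-first crawl: start from the seed URLs, keep moving newly found URLs
    // into a queue, and stop when the queue is empty or maxPages pages were visited.
    // `links` stands in for fetching a page and extracting the URLs on it.
    // `onTopic` is the focused-crawler filter: only matching URLs are followed.
    static List<String> crawl(List<String> seeds,
                              Map<String, List<String>> links,
                              Predicate<String> onTopic,
                              int maxPages) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();      // URLs already queued, to avoid repeats
        List<String> visited = new ArrayList<>();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (onTopic.test(next) && seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "site/index", List.of("site/news/1", "site/ads/1", "site/news/2"),
            "site/news/1", List.of("site/news/2", "site/news/3"),
            "site/news/2", List.of("site/index"));
        // Follow only URLs that look like news pages; visit at most 10 pages.
        List<String> order = crawl(List.of("site/index"), links,
                                   url -> url.contains("/news/"), 10);
        System.out.println(order);
    }
}
```

Passing `url -> true` as the filter gives the traditional "follow everything" behavior, while a narrower predicate approximates a focused crawler. The `maxPages` limit is just one possible stop condition.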

2. Implementing the crawler

package com.demo;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {
	
	// Given the URL of a site's news index page, collect every news item linked from it.
	public static List<News> findList(String url) throws IOException{
		
		Connection conn = Jsoup.connect(url); // open a Jsoup connection to the URL
		Document doc = conn.get(); 	// fetch the page with GET and parse it into a Document
		//System.out.println(doc.html());
		Elements e=doc.select("a[class=newsgray a_space2]");     // the <a> links whose class attribute is "newsgray a_space2"
		List<News> list=new ArrayList<News>();
		for(Element element:e){
			News news=new News();
			news.setTitle(element.text()); // news title: the link text, instead of fixed substring offsets into the tag's HTML
			String path=element.absUrl("href"); // absolute URL of the article
			String content=urlToHtml(path);
			news.setContent(content);
			news.setUrl(path);
			list.add(news);
		}
		return list;
	}
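	// Fetch an article page and return the HTML of all its <p> elements, concatenated.
	public static String urlToHtml(String url) throws IOException{
		Connection conn = Jsoup.connect(url); // open a Jsoup connection to the URL
		Document doc = conn.get(); 	// fetch the page with GET and parse it into a Document
		StringBuilder sb=new StringBuilder();
		Elements e=doc.select("p");
		for(Element element:e){
			String content=element.toString(); // outer HTML of the paragraph, tags included
			sb.append(content);
		}
		
		return sb.toString();
	}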
	public static void main(String[] args) throws IOException {
		List<News> list=findList("http://news.aweb.com.cn/china/hyxw/");
		for(News news:list){
		  System.out.println(news.getContent());
		}
	}

}

News.java

package com.demo;

public class News {
	
	private String title;
	private String content;
	private String url;
	public String getTitle() {
		return title;
	}
	public void setTitle(String title) {
		this.title = title;
	}
	public String getContent() {
		return content;
	}
	public void setContent(String content) {
		this.content = content;
	}
	public String getUrl() {
		return url;
	}
	public void setUrl(String url) {
		this.url = url;
	}
	

}

For example, the code above scrapes the latest agricultural news from the agricultural information site at http://news.aweb.com.cn/china/hyxw/.
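The `absUrl("href")` call in `findList` matters because links on an index page are often relative. The sketch below illustrates the idea of resolving an href against the page URL using only `java.net.URI` from the standard library. It is not how Jsoup does it internally, and the class name `AbsUrlSketch` and the article paths are made up for the example.

```java
import java.net.URI;

// Hypothetical sketch: the class name and the href values are illustrative,
// not taken from the real site.
class AbsUrlSketch {

    // Resolve an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://news.aweb.com.cn/china/hyxw/";
        // Relative to the current directory of the page.
        System.out.println(resolve(page, "20130425/1.html"));
        // Relative to the site root.
        System.out.println(resolve(page, "/china/index.html"));
        // Already absolute: returned unchanged.
        System.out.println(resolve(page, "http://example.com/a.html"));
    }
}
```

In the crawler itself, `element.absUrl("href")` remains the simpler choice, since Jsoup already knows the base URL of the document it fetched.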