Static Page Generation, Part 1: Scraping Data for 一念生活网 (1nsh.com)

leadergg

Category: Java tips and resources. Tags: life, scraping, generation, 一念生活

Article link: https://blog.csdn.net/leadergg/article/details/51643882

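Recently I wanted to build a website to try out static file generation with FreeMarker in Java. I couldn't decide what the site should be about, so I browsed around, and with a domain lookup tool I found a decent domain: www.1nsh.com. I named the site 一念生活网 and then bought hosting space on Alibaba Cloud (阿里云).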

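With a domain and hosting in place, what about content? Writing original content myself would be ideal, but it takes too much effort. We're programmers, so the natural thought is to let a program handle it, which means scraping. I searched online and found many scraping tools, but none fully met my requirements, so I decided to write my own.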

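As mentioned, the site is called 一念生活网; the name was more or less forced on me by the domain 1nsh.com. The content had to be about everyday life, so I settled on relationships and emotional life. The planned topics: gender and intimacy; relationship stories; married life; first-person accounts; relations between mothers-in-law and daughters-in-law; dating tips; celebrity gossip; and single life. Quite a lot, heh. It is the kind of content men like to read, as you can guess without my saying so.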

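With the content settled, I turned to Baidu and searched for relationship and intimacy topics. There were plenty of results, which shows both that the topic is popular and that there is a lot of content to scrape. I picked one site and analyzed how to scrape it. Using mimito.com.cn as the example, here is how the scraping works: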

Preparation: this uses the jsoup library.

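	// Imports needed by the enclosing class (called GenericGrap elsewhere in this post)
	import java.io.IOException;
	import java.io.InputStream;
	import java.net.URL;
	import org.jsoup.Jsoup;
	import org.jsoup.nodes.Document;

	public static Document getDoc(String url) {
		return getDoc(url, "utf-8");
	}

	// Fetches the page at url and parses it with the given charset; returns null on I/O failure.
	public static Document getDoc(String url, String code) {
//		String ip = IpUtil.getIp();
		Document doc = null;
		// try-with-resources closes the stream once parsing is done
		try (InputStream in = new URL(url).openStream()) {
			doc = Jsoup.parse(in, code, url);
			// Alternative (disabled): fetch through Jsoup.connect with custom request settings
//			doc = Jsoup.connect(url)
//					.data("query", "Java")   // request parameter
//					.header("X-Real-IP", ip)
//					.header("x-forwarded-for", ip)
//					.header("WL-Proxy-Client-IP", ip)
//					.userAgent("Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2") // set the User-Agent
//					.cookie("auth", "token") // set a cookie
//					.timeout(5000)           // connection timeout in milliseconds
//					.get(); // issue a GET request
		} catch (IOException e) {
			e.printStackTrace();
		}
		return doc;
	}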

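The code above fetches a page's HTML. The commented-out part sets request headers and similar options. It calls IpUtil.getIp(), which returns a random IP address to reduce the chance of being blocked by the site. You will need to supply that yourself; I am not publishing it here.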
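As a rough illustration of the request settings mentioned above, here is a minimal sketch that sets a User-Agent and timeouts using only java.net instead of jsoup's connection API. The class name PageFetcher, the User-Agent string, and the 5-second timeout are assumptions made for this example; none of them come from the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original post.
class PageFetcher {
    // Assumed values; adjust them for the target site.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)";
    static final int TIMEOUT_MILLIS = 5000;

    /** Opens the URL with a User-Agent header and timeouts, then decodes the body. */
    static String fetch(String url, String charset) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(TIMEOUT_MILLIS);
        conn.setReadTimeout(TIMEOUT_MILLIS);
        return readAll(conn.getInputStream(), charset);
    }

    /** Reads the stream to the end, closes it, and decodes the bytes with the given charset. */
    static String readAll(InputStream in, String charset) {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = stream.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), Charset.forName(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string could then be passed to Jsoup.parse(html, baseUri) to obtain a Document, much as getDoc does.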

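Once the Document is available, the real scraping starts. Only two channels are scraped here: relationships (情感) and sex (written as X爱 to get past content filters; you know what I mean):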

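	public void getNav() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
		// Each key is the URL of a channel's LAST list page; the value is the channel name.
		Map<String, String> pages = new HashMap<String, String>();
		pages.put("http://www.mimito.com.cn/qinggan/page_015.html", "情感");
		pages.put("http://www.mimito.com.cn/sex-love/page_100.html", "X爱");

		for(String key : pages.keySet()) {
			// The number between "page_" and ".html" is the total page count.
			String pageTotal = key.substring(key.indexOf("page_") + 5, key.lastIndexOf("."));
			int total = Integer.parseInt(pageTotal);
			// The original loop used i < total, so the last page was never visited.
			for(int i = 1; i <= total; i++) {
				// Page 1 is index.html, the last page is the key itself, the rest are page_<i>.html.
				String url = i == 1 ? key.substring(0, key.indexOf("/page")) + "/index.html"
						: i == total ? key
						: key.replace("page_" + pageTotal, "page_" + i);
				boolean rs = true;
				String tn = pages.get(key);
				if(tn.equals("X爱")) {
					rs = getSexPage(url, tn);
				} else if(tn.equals("情感")) {
					rs = getQgPage(url, tn);
				}
				// false means the page starts with an already-scraped article: stop this channel
				if(!rs) break;
			}
		}
		System.out.println("Articles scraped in this run: " + count);
		// Finally, write the failed links to the database
		GenericArti.insertError2Db(soruce);
	}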
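The page-URL derivation in getNav can be isolated into a small pure function, which makes it easy to check without touching the network. The sketch below is a hypothetical helper (PageUrls is not part of the original code). The post does not say whether the site expects page_2.html or zero-padded page_002.html for intermediate pages, so that choice is exposed as a flag rather than guessed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original post.
class PageUrls {
    /**
     * Builds every list-page URL of a channel from the URL of its last page.
     * keepPadding = true keeps the digit width of the last page number
     * (page_002); false writes plain numbers (page_2), as getNav does.
     */
    static List<String> fromLastPage(String lastPageUrl, boolean keepPadding) {
        int numStart = lastPageUrl.indexOf("page_") + "page_".length();
        int numEnd = lastPageUrl.lastIndexOf('.');
        String totalText = lastPageUrl.substring(numStart, numEnd); // e.g. "015"
        int total = Integer.parseInt(totalText);
        String channelRoot = lastPageUrl.substring(0, lastPageUrl.indexOf("/page_"));

        List<String> urls = new ArrayList<>();
        for (int i = 1; i <= total; i++) {
            if (i == 1) {
                // The first page of a channel is its index page.
                urls.add(channelRoot + "/index.html");
            } else if (i == total) {
                // The last page is the URL we were given, so reuse it verbatim.
                urls.add(lastPageUrl);
            } else {
                String number = keepPadding
                        ? String.format(Locale.ROOT, "%0" + totalText.length() + "d", i)
                        : String.valueOf(i);
                urls.add(lastPageUrl.substring(0, numStart) + number + lastPageUrl.substring(numEnd));
            }
        }
        return urls;
    }
}
```

For example, PageUrls.fromLastPage("http://www.mimito.com.cn/qinggan/page_015.html", false) yields 15 URLs, starting with .../qinggan/index.html and ending with the given last-page URL.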

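The getNav method scrapes the channel data. It does not walk the home page or the top menu to discover channels. Instead it reads the page count directly from the URLs in the map (each one is the link to a channel's last page), loops over the pages, and scrapes each list page: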

	
	/**
	 * Scrapes one list page of the X爱 channel.
	 * @param url list page URL
	 * @param typeName channel name stored on each article
	 * @return false if the first entry on the page was already scraped, so the caller can stop paging
	 * @throws FailingHttpStatusCodeException
	 * @throws MalformedURLException
	 * @throws IOException
	 */
	public boolean getSexPage(String url, String typeName) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
		System.out.println("start get page data!" + typeName);
		Document doc = GenericGrap.getDoc(url);
		// Each article entry on the list page is an element with class "travel"
		Elements els = doc.getElementsByAttributeValue("class", "travel");
		List<ArticlesTo> artis = new ArrayList<ArticlesTo>();
		for(Element el : els) {
			try {
				Element hrefEl = el.getElementsByTag("h2").get(0).getElementsByTag("a").get(0);
				String title = hrefEl.text();
				if(GenericArti.titles.contains(title)) {
					System.out.println("Already exists, skipping: " + title);
					if(artis.size() == 0) {
						// On later runs, stop as soon as the first entry is already known
						return false;
					}
					// Otherwise skip this entry so it is not inserted twice
					continue;
				}
				Elements thumbEls = el.getElementsByTag("img");
				ArticlesTo arti = new ArticlesTo();
				if(thumbEls != null && thumbEls.size() > 0) {
					// Thumbnail src is relative, so prefix it with the site base URL
					Element thumbEl = thumbEls.get(0);
					arti.setThumb(baseUrl + thumbEl.attr("src"));
				}
				arti.setTitle(title);
				arti.setSource(soruce);
				arti.setSourceUrl(hrefEl.attr("href"));
				arti.setTypeName(typeName);
				arti.setState(8);
				artis.add(arti);
			} catch (Exception e) {
				e.printStackTrace();
			}
		}

		for(ArticlesTo arti : artis) {
			try {
				// Random pause between detail requests
				Thread.sleep(IpUtil.genRandomNum() * 500);
			} catch (InterruptedException e) {
				e.printStackTrace();
			}
			getDetail(arti);
		}
		return true;
	}
	

	/**
	 * Scrapes one list page of the 情感 channel.
	 * @param url list page URL
	 * @param typeName channel name stored on each article
	 * @return false if the first entry on the page was already scraped, so the caller can stop paging
	 * @throws FailingHttpStatusCodeException
	 * @throws MalformedURLException
	 * @throws IOException
	 */
	public boolean getQgPage(String url, String typeName) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
		System.out.println("start get page data!" + typeName);
		Document doc = GenericGrap.getDoc(url);
		// Article entries are the <li> items inside the element with id "n1"
		Elements els = doc.getElementById("n1").getElementsByTag("li");
		List<ArticlesTo> artis = new ArrayList<ArticlesTo>();
		for(Element el : els) {
			try {
				Element hrefEl = el.getElementsByTag("a").get(0);
				String title = hrefEl.text();
				if(GenericArti.titles.contains(title)) {
					System.out.println("Already exists, skipping: " + title);
					if(artis.size() == 0) {
						// On later runs, stop as soon as the first entry is already known
						return false;
					} else {
						// Everything after this entry was scraped in an earlier run
						break;
					}
				}
				Element timeEl = el.getElementsByAttributeValue("class", "time").get(0);
				ArticlesTo arti = new ArticlesTo();
				arti.setTitle(title);
				arti.setSource(soruce);
				arti.setSourceUrl(hrefEl.attr("href"));
				arti.setTypeName(typeName);
				arti.setPublishTime(timeEl.text());
				arti.setState(8);
				artis.add(arti);
			} catch (Exception e) {
				e.printStackTrace();
			}
		}

		for(ArticlesTo arti : artis) {
			try {
				// Random pause between detail requests
				Thread.sleep(IpUtil.genRandomNum() * 500);
			} catch (InterruptedException e) {
				e.printStackTrace();
			}
			getDetail(arti);
		}
		return true;
	}

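There are two methods because the 情感 and X爱 channels lay out their list pages slightly differently, so each gets its own method. (I did not factor out the shared code; this is for my own use, so I was a bit casual about it.) One small detail: the code sleeps for a random interval between requests so that the site is less likely to treat it as a bot and block it.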
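IpUtil.genRandomNum() is not shown in the post, so the sketch below is only one possible way to get a randomized pause between requests. PoliteDelay and its millisecond bounds are hypothetical names and values chosen for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of the original post.
class PoliteDelay {
    /** Returns a random delay in [minMillis, maxMillis], both ends inclusive. */
    static long randomMillis(long minMillis, long maxMillis) {
        // nextLong's upper bound is exclusive, hence the + 1
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Sleeps for a random interval; restores the interrupt flag if interrupted. */
    static void pause(long minMillis, long maxMillis) {
        try {
            Thread.sleep(randomMillis(minMillis, maxMillis));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A call such as PoliteDelay.pause(500, 4500) before each detail request would play the same role as the Thread.sleep lines in the two list-page methods.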

With the list data in hand, the next step is to scrape the detail pages:

	public void getDetail(ArticlesTo arti) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
		System.out.println("start get detail data!  " + arti.getSourceUrl());
		try {
			Document doc = GenericGrap.getDoc(arti.getSourceUrl());
			// Pagination links of the article
			Elements els = doc.getElementsByAttributeValue("class", "page2").get(0).getElementsByTag("a");
			// Remove the ad div
			doc.getElementsByAttributeValue("class", "t5").remove();
			String content = doc.getElementsByAttributeValue("class", "content_01").html();
			if(null != els && els.size()>2) {
				// Drop the last link (next page) and the first link (current page)
				els.remove(els.size() - 1);
				els.remove(0);
				// Fetch the remaining pages and append their content
				for(Element el : els) {
					String href = el.attr("href");
					Document pageDoc = GenericGrap.getDoc(href);
					// Remove the ad div
					pageDoc.getElementsByAttributeValue("class", "t5").remove();
					content += "\n"+pageDoc.getElementsByAttributeValue("class", "content_01").html();
				}
			}
			// Strip the HTML tags and normalize whitespace
			content = GenericArti.handleSpace(HtmlUtil.delAllTag(content));

			arti.setContent("<div class=\"article\">" + content + "</div>");
			// Publish time: the 10 characters starting at the first "20" in the artinfo block
			String info = doc.getElementById("artinfo").html();
			String publishTime = info.substring(info.indexOf("20"), info.indexOf("20") + 10);
			arti.setPublishTime(publishTime);
			GenericArti.insert2Db(arti);
			GenericArti.titles.add(arti.getTitle());
			count++;
		} catch (Exception e) {
			// Record the failed URL so it can be written to the database later
			GenericArti.errorUrls.put(arti.getSourceUrl(), arti.getTypeName());
			e.printStackTrace();
			return;
		}
	}
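The publish-time step in getDetail takes the 10 characters starting at the first "20" in the artinfo block, which returns the wrong text if "20" appears earlier, for example inside a class name or a counter. A slightly more defensive sketch uses a regular expression. It assumes the date is written as yyyy-MM-dd, which the post does not confirm, and PublishDate is a hypothetical helper name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class PublishDate {
    // A 20xx-MM-dd date that is not embedded in a longer run of digits.
    private static final Pattern DATE =
            Pattern.compile("(?<!\\d)20\\d{2}-\\d{2}-\\d{2}(?!\\d)");

    /** Returns the first yyyy-MM-dd date found in the text, if any. */
    static Optional<String> find(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

Returning an Optional lets the caller decide what to do when no date is found, instead of failing with a StringIndexOutOfBoundsException from substring.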

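The getDetail method scrapes the detail page. Some shared code is not shown here; later I will extract a clean version of the code, package it, and upload it. With that, most of the content for 一念生活网 was in place. In total I scraped data from 3 sites, a little over 20,000 articles. This post only covers scraping articles; later posts will cover scraping and processing images and generating the static pages.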
