Extracting Data with Jsoup

iteye_14216

Category: Search Engine. Tags: jQuery, HTML5, CSS, ViewUI

Article link: https://blog.csdn.net/iteye_14216/article/details/82035747

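Jsoup is an HTML parser for Java. It provides convenient methods for extracting data from HTML documents and manipulating them, and it can locate nodes and read their contents through DOM traversal, CSS selectors, and jQuery-like methods.
Its select and pipeline (chained-call) APIs are as powerful as jQuery's.
As an example of how to use it, we extract rental listings from 58.com (58同城). The code is written in Scala: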


package test

import java.util.HashMap

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

/**
 * Author: fuliang
 * http://fuliang.iteye.com
 */
// One rental listing taken from a row of the result table.
class HouseEntry(var title: String, var link: String, var price: Integer, var houseType: String, var date: String) {
	override def toString(): String =
		String.format("title: %s\tlink:%s\tprice:%d\thouseType:%s\tdate:%s\n", title, link, price, houseType, date)
}

class HouseRentCrawler {
	// Fetches the search result page, then extracts one HouseEntry per listing row.
	def crawl(url: String, keyword: String, lowRange: Int, highRange: Int): List[HouseEntry] = {
		val doc = fetch(url, keyword, lowRange, highRange)
		extract(doc)
	}

	private def fetch(url: String, keyword: String, lowRange: Int, highRange: Int): Document = {
		// Query parameters sent with the GET request.
		val params = new HashMap[String, String]()
		params.put("final", "1")
		params.put("jump", "2")
		params.put("searchtype", "3")
		params.put("key", keyword)
		// The price range is sent as "low_high".
		params.put("MinPrice", lowRange + "_" + highRange)

		Jsoup.connect(url)
			.data(params)
			.userAgent("Mozilla")
			.timeout(10000) // milliseconds
			.get()
	}

	private def extract(doc: Document): List[HouseEntry] = {
		// Direct tr children of #infolist, skipping rows with class "dev".
		val elements = doc.select("#infolist > tr:not(.dev)")
		var houseEntries = List[HouseEntry]()
		for (i <- 0 until elements.size()) {
			val entry = elements.get(i)
			val fields = entry.select("td")
			val title = fields.get(0).text()
			val link = fields.get(0).select("a[class=t]").attr("href")
			// toInt throws NumberFormatException if the cell is not a plain integer.
			val price = fields.get(1).text().toInt
			val houseType = fields.get(2).text()
			val date = fields.get(3).text()
			val houseEntry = new HouseEntry(title, link, price, houseType, date)
			// ::= prepends, so the final list is in reverse page order.
			houseEntries ::= houseEntry
		}
		houseEntries
	}
}

object HouseRentCrawler {
	def main(args: Array[String]): Unit = {
		val url = "http://bj.58.com/zufang"
		val crawler = new HouseRentCrawler()
		val houseEntries = crawler.crawl(url, "智学苑", 2000, 3500)
		for (entry <- houseEntries) {
			println(entry)
		}
	}
}
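
The extraction step can be tried without network access by parsing an inline HTML string. The sketch below is only an illustration: the markup, the `ExtractOffline` object, and the `rows` helper are hypothetical, and the real 58.com page may be structured differently. It assumes jsoup is on the classpath. Note that `#infolist` is placed on the `tbody` here, because jsoup's parser inserts a `tbody` into tables that lack one, and `#infolist > tr` only matches direct children.

```scala
import org.jsoup.Jsoup

object ExtractOffline {
  // Hypothetical markup that mimics the structure the selectors expect.
  val html: String =
    """<table><tbody id="infolist">
      |<tr class="dev"><td colspan="4">ad row</td></tr>
      |<tr><td><a class="t" href="http://example.com/1">Room A</a></td><td>2500</td><td>2BR</td><td>03-25</td></tr>
      |<tr><td><a class="t" href="http://example.com/2">Room B</a></td><td>3000</td><td>1BR</td><td>03-24</td></tr>
      |</tbody></table>""".stripMargin

  // Returns (title, link, price) for each row without class "dev", in page order.
  def rows(): List[(String, String, Int)] = {
    val trs = Jsoup.parse(html).select("#infolist > tr:not(.dev)")
    (0 until trs.size()).toList.map { i =>
      val tds = trs.get(i).select("td")
      val first = tds.get(0)
      (first.text(), first.select("a[class=t]").attr("href"), tds.get(1).text().trim.toInt)
    }
  }

  def main(args: Array[String]): Unit =
    rows().foreach(println)
}
```

Mapping over the row indices keeps the entries in page order, unlike prepending to a list inside a loop.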

Selector overview

* tagname: find elements by tag, e.g. a
* ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
* #id: find elements by ID, e.g. #logo
* .class: find elements by class name, e.g. .masthead
* [attribute]: elements with attribute, e.g. [href]
* [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
* [attr=value]: elements with attribute value, e.g. [width=500]
* [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
* [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
* *: all elements, e.g. *
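
As a rough illustration of the basic selectors listed above, the following sketch runs a few of them against a small inline document. The HTML, the `BasicSelectors` object, and the `count` helper are made up for this example; only `Jsoup.parse`, `select`, `attr`, and `size` are jsoup calls.

```scala
import org.jsoup.Jsoup

object BasicSelectors {
  // Hypothetical page used only for this illustration.
  val html: String =
    """<div id="logo" class="masthead"><a href="/path/home">Home</a></div>
      |<img src="photo.JPG" width="500">
      |<img src="notes.txt">
      |<p data-id="7">tagged</p>""".stripMargin

  val doc = Jsoup.parse(html)

  // Number of elements matched by a CSS query.
  def count(css: String): Int = doc.select(css).size()

  def main(args: Array[String]): Unit = {
    println(count("a"))                                 // by tag
    println(doc.select("#logo").attr("class"))          // by ID, prints "masthead"
    println(count(".masthead"))                         // by class
    println(count("[href]"))                            // has the attribute
    println(count("[^data-]"))                          // attribute name prefix
    println(count("[width=500]"))                       // exact attribute value
    println(count("[href*=/path/]"))                    // attribute value contains
    println(count("img[src~=(?i)\\.(png|jpe?g)]"))      // attribute value matches regex
  }
}
```

Each `count` call above matches exactly one element in this sample; the regex selector matches `photo.JPG` because `(?i)` makes it case-insensitive, and skips `notes.txt`.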

Selector combinations

* el#id: elements with ID, e.g. div#logo
* el.class: elements with class, e.g. div.masthead
* el[attr]: elements with attribute, e.g. a[href]
* Any combination, e.g. a[href].highlight
* ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
* parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
* siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
* siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
* el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
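
The combinators above can be tried on a small inline document. This is a sketch under assumptions: the markup and the names `SelectorCombinations`, `count`, and `text` are invented for illustration and do not come from the jsoup documentation.

```scala
import org.jsoup.Jsoup

object SelectorCombinations {
  // Hypothetical markup used only for this illustration.
  val html: String =
    """<div id="logo" class="masthead"><a href="/" class="highlight">Home</a><a>plain</a></div>
      |<div class="content body"><p>first</p><div><p>nested</p></div></div>
      |<div class="head">Head</div><div>after head</div>
      |<h1>Title</h1><span>x</span><p>para one</p><p>para two</p>""".stripMargin

  val doc = Jsoup.parse(html)

  def count(css: String): Int = doc.select(css).size()
  def text(css: String): String = doc.select(css).text()

  def main(args: Array[String]): Unit = {
    println(text("a[href].highlight"))          // element + attribute + class: "Home"
    println(count(".body p"))                   // any depth under .body: 2
    println(text("div.content > p"))            // direct children only: "first"
    println(text("div.head + div"))             // immediately following sibling: "after head"
    println(text("h1 ~ p"))                     // any following sibling: "para one para two"
    println(count("div.masthead, div.head"))    // union of two selectors: 2
  }
}
```

The difference between `.body p` and `div.content > p` is the point of the sample: the nested paragraph is a descendant of the outer div but not a direct child.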

Pseudo selectors

* :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
* :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
* :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
* :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
* :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
* :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
* :containsOwn(text): find elements that directly contain the given text
* :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
* :matchesOwn(regex): find elements whose own text matches the specified regular expression
* Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
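
The pseudo selectors can be checked the same way. The sketch below uses made-up markup and hypothetical helper names (`PseudoSelectors`, `count`, `text`); it is meant to show the 0-based indexing and the difference between the plain and `Own` variants, not to be a complete reference.

```scala
import org.jsoup.Jsoup

object PseudoSelectors {
  // Hypothetical markup used only for this illustration.
  val html: String =
    """<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>
      |<div class="logo"><p>jsoup rocks</p></div>
      |<div><span>Login here</span></div>
      |<div>plain</div>""".stripMargin

  val doc = Jsoup.parse(html)

  def count(css: String): Int = doc.select(css).size()
  def text(css: String): String = doc.select(css).text()

  def main(args: Array[String]): Unit = {
    println(text("td:lt(3)"))                   // sibling index 0, 1, 2: "a b c"
    println(text("td:gt(2)"))                   // sibling index 3: "d"
    println(text("td:eq(1)"))                   // second cell: "b"
    println(count("div:has(p)"))                // divs that contain a p: 1
    println(count("div:not(.logo)"))            // divs without class logo: 2
    println(count("p:contains(JSOUP)"))         // case-insensitive text match: 1
    println(count("div:contains(login)"))       // text anywhere inside the div: 1
    println(count("div:containsOwn(login)"))    // text directly in the div: 0
    println(count("div:matches((?i)login)"))    // regex on the div's full text: 1
    println(count("span:matchesOwn(^Login)"))   // regex on the span's own text: 1
  }
}
```

`div:contains(login)` matches because the text sits in a child `span`, while `div:containsOwn(login)` does not, since none of the divs holds that text directly.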

For more information, see http://jsoup.org/
