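I was recently assigned a data-statistics homework project and decided to scrape housing prices in Chongqing from Anjuke (安居客). Since I also wanted to practice multithreading, I ended up scraping the data for every city. The complete code is on GitHub.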
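Examining the page structure
First, look at how the site is organized. Anjuke groups listings by city, and each city is divided into districts, so I planned to scrape the data one district at a time. Start with the structure of the page that lists all cities: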
<!-- The data we need is in the <a> tags below -->
<html>
  <body>
    <div class="content">
      <div class="city-itm">
        <div class="letter_city">
          <ul>
            <li>
              <div class="city_list">
                <a href="" class="hot"></a>
                <a href="" class="hot"></a>
              </div>
            </li>
            <li>
              <div class="city_list">
                <a href="" class="hot"></a>
                <a href="" class="hot"></a>
              </div>
            </li>
          </ul>
        </div>
      </div>
    </div>
  </body>
</html>
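Open any city and find the list of its districts. What we want are the "find by district" (按地区找房) links in the new-homes section of the page. Note that some cities have no new homes; their page shows only popular areas (热门板块). Here is the structure for a city that does have districts: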
<!-- The data we need is in the <a> tags under "by district" -->
<!-- For cities without new-home districts, the content under the class="clearfix" element is different -->
<html>
  <body>
    <div id="content">
      <div class="left-cont fl">
        <div class="buy-house tab-contents" id="content_Rd1">
          <div class="clearfix">
            <!-- Second-hand home list -->
            <div class="details float_l"></div>
            <!-- New-home list (the second "details float_l" block) -->
            <div class="details float_l">
              <!-- By district -->
              <div class="areas">
                <a> </a>
              </div>
              <!-- By price -->
              <div class="prices"></div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </body>
</html>
Next, open a district and look at its structure:
<!-- If this is the first page and there are more pages, we also need to collect the links to those pages -->
<html>
  <body>
    <div id="container">
      <div class="list-contents">
        <div class="list-results">
          <div class="key-list">
            <!-- The list of listings -->
            <div class="item-mod"></div>
          </div>
          <!-- The pagination element -->
          <div class="list-page">
            <div class="pagination">
              <a></a>
            </div>
          </div>
        </div>
      </div>
    </div>
  </body>
</html>
Finally, the structure of a single listing. This is the same page as before; we just look inside the div with class="item-mod".
<div class="item-mod" data-link="" data-soj="" rel="">
  <a class="pic"><img></a>
  <div class="infos">
    <a class="lp-name">
      <h3><span class="items-name">融创白象街</span></h3>
    </a>
    <a class="address">
      <span>[ 渝中 解放碑 ] 凯旋路16号</span>
    </a>
    <a class="tags-wrap">
      <div class="tag-panel">
        <!-- Both tags may not be present, but class="status-icon wuyetp" always is -->
        <i class="status-icon onsale">在售</i>
        <i class="status-icon wuyetp">住宅</i>
      </div>
    </a>
  </div>
  <a class="favor-pos">
    <p class="price">最低<span>1000</span>万元/套起</p>
  </a>
</div>
Writing the code
First, a method that issues the HTTP request:
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpUtils {
    /** Fetches the page at the given URL and returns its body as a string, or null on failure. */
    public static String CreatHttpGet(String url) {
        HttpClient httpclient = HttpClients.createDefault();
        HttpGet httpget = new HttpGet(url);
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.2)");
        httpget.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpget.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        HttpResponse response;
        String result = null;
        try {
            response = httpclient.execute(httpget);
            int statusCode = response.getStatusLine().getStatusCode();
            if ((statusCode == HttpStatus.SC_MOVED_PERMANENTLY) || (statusCode == HttpStatus.SC_MOVED_TEMPORARILY)
                    || (statusCode == HttpStatus.SC_SEE_OTHER) || (statusCode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
                // Redirected: request the address given in the Location header
                String newUri = response.getLastHeader("Location").getValue();
                httpclient = HttpClients.createDefault();
                httpget = new HttpGet(newUri);
                response = httpclient.execute(httpget);
            }
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // Save the response stream in a byte array, because the stream may be needed twice
                byte[] bytes = EntityUtils.toByteArray(entity);
                String charSet = EntityUtils.getContentCharSet(entity);
                if (charSet == null) {
                    // The server declared no charset; fall back to UTF-8 instead of failing
                    charSet = "UTF-8";
                }
                result = new String(bytes, charSet);
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        httpclient.getConnectionManager().shutdown();
        return result;
    }
}
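One fragile spot in a helper like this is the charset lookup, because servers do not always declare one. As a minimal sketch that uses only the JDK, the charset can be read from a Content-Type value with a UTF-8 fallback. The class and method names (`CharsetGuess`, `fromContentType`) are hypothetical and not part of the original project.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class CharsetGuess {
    /** Returns the charset named in a Content-Type value, or UTF-8 if none is usable. */
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(fromContentType("text/html"));
    }
}
```

The resulting `Charset` can be passed directly to `new String(bytes, charset)`, which also avoids the checked `UnsupportedEncodingException` of the string-based constructor.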
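Writing the page-parsing methods
You can extract data from a page with Pattern (regular expressions), which is what I used at first. It worked on the short strings I used for testing but broke down on real pages. After some searching I switched to Jsoup, which makes parsing HTML very convenient.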
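One common reason a regex that works on a short test string fails on a real page is that `.` does not match line terminators unless `Pattern.DOTALL` is set. The sketch below illustrates that pitfall with a made-up HTML fragment. It is only a guess at the kind of problem described above, not a reproduction of the original code, and `RegexPitfall` is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexPitfall {
    /** Returns the trimmed text of the first <span>, or null if the pattern does not match. */
    static String firstSpan(String html, int flags) {
        Matcher m = Pattern.compile("<span[^>]*>(.*?)</span>", flags).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String oneLine = "<span class=\"items-name\">Sample Estate</span>";
        String multiLine = "<span class=\"items-name\">\n  Sample Estate\n</span>";
        // Short test string: matches
        System.out.println(firstSpan(oneLine, 0));
        // Real pages contain line breaks: the same pattern finds nothing
        System.out.println(firstSpan(multiLine, 0));
        // With DOTALL, '.' also matches line terminators
        System.out.println(firstSpan(multiLine, Pattern.DOTALL));
    }
}
```

Even with the flag fixed, regular expressions remain brittle against nested or reordered markup, which is why an HTML parser is the more comfortable choice here.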
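With the parsing approach settled, I decided on the threading. One thread fetches all the cities. A thread pool fetches the districts: it takes links from the city list, downloads each page, and parses out the district links. Another thread pool fetches the listings: it takes links from the district list and parses out the listings for that district. Database inserts also go through a thread pool: once the listings of a district have been collected, insert tasks for them are submitted to that pool.
Initially I simply started 5 threads, each of which kept pulling entries from the linked list until it was empty and then exited. That was not the behavior I wanted, so I changed it to submit a task to the thread pool each time a link is obtained. Only the main code is shown here; the complete code is on GitHub.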
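The hand-off described above (a producer filling a shared list, and a consumer that submits one pool task per link) can be sketched with JDK classes only. This is a simplified illustration under stated assumptions: `PipelineSketch` and `process` are hypothetical names, and the task merely records a string where the real code would download and parse a page.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PipelineSketch {
    // Shared hand-off list, guarded by its own monitor
    private final LinkedList<String> queue = new LinkedList<>();
    // Set to false once the producer has added its last link
    private volatile boolean producing = true;

    /** Feeds the links through producer -> list -> pool and returns what the tasks recorded. */
    List<String> process(List<String> links) {
        List<String> handled = new CopyOnWriteArrayList<>();
        Thread producer = new Thread(() -> {
            try {
                for (String link : links) {
                    synchronized (queue) {
                        queue.add(link);
                        queue.notifyAll(); // wake a consumer that may be waiting
                    }
                }
            } finally {
                producing = false;
            }
        });
        ExecutorService pool = Executors.newFixedThreadPool(5);
        producer.start();
        try {
            while (true) {
                String link;
                synchronized (queue) {
                    link = queue.isEmpty() ? null : queue.removeFirst();
                    if (link == null) {
                        if (!producing) {
                            break; // list drained and producer finished
                        }
                        queue.wait(1000); // timeout guards against a missed notify
                        continue;
                    }
                }
                final String task = link;
                pool.execute(() -> handled.add("parsed " + task));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> out = new PipelineSketch().process(Arrays.asList("a", "b", "c"));
        System.out.println(out.size() + " links handled");
    }
}
```

A `java.util.concurrent.BlockingQueue` would remove the manual `wait`/`notifyAll` bookkeeping; the explicit version is shown because it mirrors the design used in this post.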
Searching for cities
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CitySearchThread extends Thread {
    @Override
    public void run() {
        String result = HttpUtils.CreatHttpGet("https://www.anjuke.com/sy-city.html");
        if (result != null) {
            try {
                Document doc = Jsoup.parse(result);
                Element cityItmElement = doc.getElementsByClass("city-itm").first();
                Element lettercityElement = cityItmElement.getElementsByClass("letter_city").first();
                Element ulElement = lettercityElement.getElementsByTag("ul").first();
                Elements liElements = ulElement.getElementsByTag("li");
                // Note: the loop bound skips the last <li>
                for (int i = 0; i < liElements.size() - 1; i++) {
                    Element cityListElement = liElements.get(i).getElementsByClass("city_list").first();
                    Elements cityElements = cityListElement.getElementsByTag("a");
                    for (int j = 0; j < cityElements.size(); j++) {
                        Element cityElement = cityElements.get(j);
                        String url = cityElement.absUrl("href");
                        synchronized (DataList.cityList) {
                            DataList.cityList.add(url);
                            // The list was empty before this add, so wake any waiting consumer
                            if (DataList.cityList.size() == 1) {
                                DataList.cityList.notifyAll();
                            }
                        }
                    }
                }
                System.out.println("City search finished");
            } catch (Exception e) {
                System.out.println("The page was blocked");
            } finally {
                // Signal that the search for all cities is complete
                DataList.cityFlag = false;
            }
        } else {
            // Signal that the search for all cities is complete
            DataList.cityFlag = false;
            System.out.println("City search finished");
        }
    }
}
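`DataList` is not shown in this post. A minimal sketch of what it might look like, assuming only what the surrounding code uses (two linked lists of links and two "still producing" flags), is below. The field types and the `volatile` modifier are assumptions made for this sketch, so check the repository for the real class.

```java
import java.util.LinkedList;

final class DataList {
    // City links, filled by the city thread and drained by the district thread.
    // Access is expected to happen inside synchronized (DataList.cityList) blocks.
    public static final LinkedList<String> cityList = new LinkedList<>();
    // District links, filled by the district tasks and drained by the listing thread
    public static final LinkedList<String> districtList = new LinkedList<>();
    // true while cities are still being collected; volatile so other threads see the change
    public static volatile boolean cityFlag = true;
    // true while districts are still being collected
    public static volatile boolean districtFlag = true;

    private DataList() {
        // static holder, not meant to be instantiated
    }
}
```

Declaring the flags `volatile` matters when they are read or written outside a `synchronized` block, as the finishing code of the producer does.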
Searching for districts
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DistrictSearchThread extends Thread {
    @Override
    public void run() {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        boolean flag = true;
        while (flag) {
            String url = null;
            synchronized (DataList.cityList) {
                if (!DataList.cityList.isEmpty()) {
                    url = DataList.cityList.removeFirst();
                }
                if (url == null) {
                    if (DataList.cityFlag == false) {
                        // The list is empty and the city search is done: stop
                        flag = false;
                    } else {
                        try {
                            // Wait with a timeout so we never wait forever if nobody wakes us
                            DataList.cityList.wait(1000);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
            // Only submit a task when a link was actually taken from the list
            if (url != null) {
                pool.execute(new SearchUtil(url));
            }
        }
        pool.shutdown();
        while (true) {
            if (pool.isTerminated()) {
                DataList.districtFlag = false;
                System.out.println("District search finished");
                break;
            }
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    /** Task that downloads one city page and collects its district links. */
    class SearchUtil implements Runnable {
        private String url;

        public SearchUtil(String url) {
            this.url = url;
        }

        @Override
        public void run() {
            // Defensive check: nothing to do without a link
            if (url == null) {
                return;
            }
            String result = HttpUtils.CreatHttpGet(url);
            if (result != null) {
                Document doc = Jsoup.parse(result);
                try {
                    Element contentElement = doc.getElementById("content_Rd1");
                    Element detailsfloat_lElement = contentElement.getElementsByClass("details float_l").get(1);
                    Element areasElement = detailsfloat_lElement.getElementsByClass("areas").first();
                    Elements aElements = areasElement.getElementsByTag("a");
                    for (int i = 0; i < aElements.size(); i++) {
                        Element cityElement = aElements.get(i);
                        String districtUrl = cityElement.absUrl("href");
                        synchronized (DataList.districtList) {
                            DataList.districtList.add(districtUrl);
                            if (DataList.districtList.size() == 1) {
                                DataList.districtList.notifyAll();
                            }
                        }
                    }
                } catch (Exception e) {
                    // Cities without new-home districts have a different structure and end up here
                    System.out.println(Thread.currentThread().getName() + "===== Region: " + url + " has no matching data");
                }
            }
        }
    }
}
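The polling loop on `isTerminated()` works, but `ExecutorService.awaitTermination` expresses the same wait more directly and returns as soon as the last task finishes. A minimal sketch follows; `PoolShutdown` and `drain` are hypothetical names, not part of the original project.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolShutdown {
    /**
     * Stops accepting new tasks and blocks until all submitted tasks have finished.
     * Returns false if the waiting thread was interrupted.
     */
    static boolean drain(ExecutorService pool) {
        pool.shutdown();
        try {
            while (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                // Tasks are still running; keep waiting
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 20; i++) {
            pool.execute(done::incrementAndGet);
        }
        System.out.println("drained=" + drain(pool) + ", tasks run=" + done.get());
    }
}
```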
Fetching and saving listings
import java.util.LinkedList;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.ibatis.session.SqlSession;
import org.jsoup.select.Elements;

public class HouseSearchThread extends Thread {
    // Pool that downloads and parses district pages
    ExecutorService pool = Executors.newFixedThreadPool(10);
    // Pool that writes parsed listings to the database
    ExecutorService pool1 = Executors.newFixedThreadPool(5);

    @Override
    public void run() {
        boolean flag = true;
        while (flag) {
            String url = null;
            synchronized (DataList.districtList) {
                if (!DataList.districtList.isEmpty()) {
                    url = DataList.districtList.removeFirst();
                }
                if (url == null) {
                    if (DataList.districtFlag == false) {
                        flag = false;
                    } else {
                        try {
                            DataList.districtList.wait(1000);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
            // Only submit a task when a link was actually taken from the list
            if (url != null) {
                pool.execute(new SearchUtil(url));
            }
        }
        pool.shutdown();
        while (true) {
            if (pool.isTerminated()) {
                System.out.println("Listing search finished");
                break;
            }
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        // pool1 is shut down only after pool has terminated, because search tasks submit to pool1
        pool1.shutdown();
        while (true) {
            if (pool1.isTerminated()) {
                System.out.println("Listing insert finished");
                break;
            }
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        System.out.println("Done");
    }

    /** Task that downloads every page of one district and hands the listings to pool1. */
    class SearchUtil implements Runnable {
        private String url;

        public SearchUtil(String url) {
            this.url = url;
        }

        @Override
        public void run() {
            String baseUrl = url;
            if (url == null) {
                return;
            }
            String result = HttpUtils.CreatHttpGet(url);
            if (result != null) {
                LinkedList<Elements> resultList = new LinkedList<Elements>();
                // Parse the district's first page for its listings and the page count
                Map<String, Object> map = ExcisionUtil.roughExcisionAPage(result);
                Integer page = (Integer) map.get("page");
                resultList.add((Elements) map.get("list"));
                // Fetch the district's remaining pages
                if (page != null) {
                    for (; page > 1; page--) {
                        String nextUrl = baseUrl + "p" + page + "/";
                        result = HttpUtils.CreatHttpGet(nextUrl);
                        resultList.add(ExcisionUtil.roughExcisionNPage(result));
                    }
                }
                // Convert each page to House objects and queue them for insertion
                for (Elements ele : resultList) {
                    if (ele != null) {
                        LinkedList<House> houseList = ExcisionUtil.exactExcision(ele);
                        pool1.execute(new AddUtil(houseList));
                    }
                }
            }
        }
    }

    /** Task that inserts one page of listings into the database. */
    class AddUtil implements Runnable {
        private LinkedList<House> houseList;

        public AddUtil(LinkedList<House> houseList) {
            this.houseList = houseList;
        }

        @Override
        public void run() {
            SqlSession sqlSession = MyBatisUtil.getSqlSession(true);
            try {
                HouseMapper mapper = sqlSession.getMapper(HouseMapper.class);
                for (House house : houseList) {
                    mapper.addHouse(house);
                }
            } finally {
                // Always release the session, even if an insert fails
                sqlSession.close();
            }
        }
    }
}
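The paging logic above appends `p<N>/` to the district URL for every page after the first. As a small, self-contained illustration of that URL scheme, the helper below builds the list of remaining page addresses. `PageUrls` is a hypothetical name and the example.com address is a placeholder, not one of the site's real URLs; unlike the loop above, it lists the pages in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

final class PageUrls {
    /** Builds the URLs of pages 2..pageCount for a district whose first page is baseUrl. */
    static List<String> remainingPages(String baseUrl, int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int page = 2; page <= pageCount; page++) {
            urls.add(baseUrl + "p" + page + "/");
        }
        return urls;
    }

    public static void main(String[] args) {
        // Placeholder base URL; it must end with a slash for the scheme to work
        for (String u : remainingPages("https://example.com/loupan/district/", 3)) {
            System.out.println(u);
        }
    }
}
```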
Parsing pages to extract listings
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExcisionUtil {
    /**
     * Extracts the elements that contain the listings, together with the page count.
     * @param result the HTML of a district's first page
     * @return a map with "list" (the listing elements) and "page" (the page count); empty if parsing fails
     */
    public static Map<String, Object> roughExcisionAPage(String result) {
        HashMap<String, Object> map = new HashMap<String, Object>();
        if (result != null) {
            Document doc = Jsoup.parse(result);
            try {
                Element elementRoot = doc.getElementById("container");
                Element listElement = elementRoot.getElementsByClass("list-contents").first();
                Element keyListElement = listElement.getElementsByClass("key-list").first();
                Elements houseElements = keyListElement.getElementsByClass("item-mod");
                Element listPageElement = listElement.getElementsByClass("list-page").first();
                Elements pageElement = listPageElement.getElementsByTag("a");
                map.put("list", houseElements);
                // The page count is taken as the number of <a> links in the pager plus one
                map.put("page", pageElement.size() + 1);
            } catch (Exception e) {
                // Expected structure not found: return the map as it is
            }
        }
        return map;
    }

    /**
     * Rough extraction, without the page count.
     * @param result the HTML of a later page of a district
     * @return the listing elements, or null if parsing fails
     */
    public static Elements roughExcisionNPage(String result) {
        Elements houseElements = null;
        if (result != null) {
            Document doc = Jsoup.parse(result);
            try {
                Element elementRoot = doc.getElementById("container");
                Element listElement = elementRoot.getElementsByClass("list-contents").first();
                Element keyListElement = listElement.getElementsByClass("key-list").first();
                houseElements = keyListElement.getElementsByClass("item-mod");
            } catch (Exception e) {
                // Expected structure not found: return null
            }
        }
        return houseElements;
    }

    /**
     * Extracts the details of each listing.
     * @param elements the "item-mod" elements of one page
     * @return the listings that have at least a name
     */
    public static LinkedList<House> exactExcision(Elements elements) {
        LinkedList<House> resultList = new LinkedList<House>();
        for (Element element : elements) {
            String name = null;
            String address = null;
            String state = null;
            String describe = null;
            String price = null;
            Element infoElement = element.getElementsByClass("infos").first();
            try {
                Element nameElement = infoElement.getElementsByClass("lp-name").first();
                Element h3Element = nameElement.getElementsByTag("h3").first().getAllElements().first();
                name = h3Element.text();
            } catch (Exception e) {
                // Name not found: leave it null
            }
            try {
                Element addressElement = infoElement.getElementsByClass("address").first();
                Element spanElement = addressElement.getElementsByTag("span").first();
                address = spanElement.text();
            } catch (Exception e) {
                // Address not found: leave it null
            }
            Element stateElement = null;
            Element describeElement = null;
            try {
                Element tagswrapElement = infoElement.getElementsByClass("tags-wrap").first();
                Element tagpanelElement = tagswrapElement.getElementsByClass("tag-panel").first();
                Elements sdElements = tagpanelElement.getElementsByTag("i");
                // One tag: only the property type. Two tags: sale status followed by property type
                if (sdElements.size() == 1) {
                    describeElement = sdElements.first();
                }
                if (sdElements.size() == 2) {
                    stateElement = sdElements.first();
                    describeElement = sdElements.get(1);
                }
                if (stateElement != null) {
                    state = stateElement.text();
                }
                if (describeElement != null) {
                    describe = describeElement.text();
                }
            } catch (Exception e) {
                // Tags not found: leave them null
            }
            try {
                Element favorposElement = element.getElementsByClass("favor-pos").first();
                Element pElement = favorposElement.getElementsByTag("p").first();
                price = pElement.text();
            } catch (Exception e) {
                // Price not found: leave it null
            }
            // A listing without a name is treated as invalid and skipped
            if (name != null) {
                House house = new House(name, address, state, describe, price);
                resultList.add(house);
            }
        }
        return resultList;
    }
}
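`exactExcision` stores the price as raw text such as `最低1000万元/套起`. For the statistics step it may help to pull the number out of that text. A minimal sketch with a regular expression follows; `PriceText` and `firstNumber` are hypothetical helpers that are not part of the original project, and the unit is left for the caller to interpret because the sketch only looks at the digits.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceText {
    // An integer or decimal number made of ASCII digits
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    /** Returns the first number found in the price text, or empty if there is none. */
    static Optional<BigDecimal> firstNumber(String priceText) {
        if (priceText == null) {
            return Optional.empty();
        }
        Matcher m = NUMBER.matcher(priceText);
        return m.find() ? Optional.of(new BigDecimal(m.group())) : Optional.empty();
    }

    public static void main(String[] args) {
        // The sample price text from the listing structure, written with Unicode escapes
        String sample = "\u6700\u4f4e1000\u4e07\u5143/\u5957\u8d77";
        System.out.println(firstNumber(sample).isPresent());
        System.out.println(firstNumber("no price yet").isPresent());
    }
}
```

Returning an `Optional` keeps listings without a published price distinguishable from listings priced at zero.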
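I ran into a few problems while writing this. Many things about multithreading used to be fuzzy to me, and finishing this project improved both my skills and my understanding. I will write up the details later.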