Web Crawler

贤云

Tags: web crawler

Article link: https://blog.csdn.net/weixin_43880379/article/details/104607641

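I. Technologies Used

This crawler is a small example I wrote about half a month ago while learning crawling techniques. It is fairly simple, and I am summarizing it here before I forget the details. The main external JARs are HttpClient 4.3.4 and HtmlParser 2.1. The IDE is IntelliJ IDEA 13.1 and the JARs are managed with Maven. If you are not used to IntelliJ, you can create the project in Eclipse instead.

II. Crawler Basics

1. What is a web crawler? (Basic principle)

Taking the term apart: "web" refers to the Internet, which resembles a spider's web, and the "crawler" is the spider that moves across it, collecting data that is then processed further.

The encyclopedia definition: a web crawler (also called a web spider or web robot, and more often a Web scutter in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, bot, and worm.

Basic principle: a traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them. While fetching pages, it keeps extracting new URLs from the current page and adding them to a queue until a stop condition is met, as shown in the flowchart. A focused crawler has a more complex workflow: it uses a page-analysis algorithm to filter out links unrelated to the topic and puts the useful ones into the queue of URLs waiting to be fetched. It then picks the next URL from the queue according to a search strategy and repeats the process until a stop condition is reached.

2. What are the common crawling strategies?

Crawling strategies fall into three groups: depth-first, breadth-first, and best-first. Depth-first often leaves the crawler trapped deep inside a site, so breadth-first and best-first are the common choices today.

2.1 Breadth-First

Breadth-first traversal is a traversal strategy for connected graphs. It gets its name because it starts from a vertex V0 and radiates outward, covering the surrounding area first.

The basic idea:
1) Start from some vertex V0 of the graph and visit it.
2) From V0, visit each of its unvisited neighbors W1, W2, ..., Wk. Then, starting from W1, W2, ..., Wk in turn, visit each of their unvisited neighbors.
3) Repeat step 2 until every vertex has been visited.
(Figure: breadth-first traversal of an example graph.)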
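
The three steps above can be sketched in a few lines of Java (Java 8 or later). This is only an illustration and not part of the crawler in this post: the class name `BfsSketch`, the adjacency-map representation, and the vertex names are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class BfsSketch {
    /** Returns the vertices reachable from start, in breadth-first order. */
    static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String v = queue.remove();           // take the oldest vertex (FIFO)
            order.add(v);
            for (String w : graph.getOrDefault(v, Collections.<String>emptyList())) {
                if (visited.add(w)) {            // add() is false if w was already seen
                    queue.add(w);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical graph: V0 -> W1, W2; W1 -> X1; W2 -> X1, X2
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("V0", Arrays.asList("W1", "W2"));
        graph.put("W1", Arrays.asList("X1"));
        graph.put("W2", Arrays.asList("X1", "X2"));
        System.out.println(bfs(graph, "V0")); // prints [V0, W1, W2, X1, X2]
    }
}
```

The FIFO queue is what produces the level-by-level order: every neighbor of V0 is dequeued before any vertex two steps away.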

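2.2 Depth-First

Assuming that no vertex of the graph has been visited yet, depth-first search proceeds as follows:
1) Pick some vertex Vi as the starting point, visit it, and mark it.
2) With Vi as the current vertex, examine each neighbor Vj of Vi in turn. If Vj has not been visited, visit and mark it. If it has, move on to the next neighbor of Vi.
3) With Vj as the current vertex, repeat step 2 until every vertex reachable from Vi has been visited.
4) If unvisited vertices remain (the graph is not connected), pick any one of them as a new starting point and repeat the process until every vertex in the graph has been visited.
(Figure: depth-first traversal of a directed graph and of an undirected graph.)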
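
The four steps can be sketched recursively as below. Again this is only an illustration under assumptions: `DfsSketch`, the adjacency map, and the vertex names are made up for the example, and the outer loop plays the role of step 4 for graphs that are not connected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DfsSketch {
    /** Visits every vertex that appears as a key or as a neighbor, depth-first. */
    static List<String> dfsAll(Map<String, List<String>> graph) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String v : graph.keySet()) {        // step 4: restart from any unvisited vertex
            if (!visited.contains(v)) {
                visit(graph, v, visited, order);
            }
        }
        return order;
    }

    private static void visit(Map<String, List<String>> graph, String v,
                              Set<String> visited, List<String> order) {
        visited.add(v);                          // step 1: visit and mark
        order.add(v);
        for (String w : graph.getOrDefault(v, Collections.<String>emptyList())) {
            if (!visited.contains(w)) {          // step 2: skip neighbors already visited
                visit(graph, w, visited, order); // step 3: go deeper before trying siblings
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical graph with two components: {A, B, C, D} and {E, F}
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Arrays.asList("D"));
        graph.put("E", Arrays.asList("F"));
        System.out.println(dfsAll(graph)); // prints [A, B, D, C, E, F]
    }
}
```

Note that D is visited before C: the search follows the branch through B to its end before returning to A's next neighbor.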

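Difference between breadth-first and depth-first:

Breadth-first traversal works level by level: it searches every node on one level before moving down to the next. Depth-first traversal searches every node along one branch before turning to another branch.

2.3 Best-First Search

A best-first strategy uses a page-analysis algorithm to predict how similar each candidate URL is to the target page, or how relevant it is to the topic, and fetches the one or few URLs with the best score. It visits only pages that the analysis algorithm predicts to be "useful". This kind of search suits crawling deep-web (hidden-web) data, where only content that meets the requirements is wanted.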
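
A best-first frontier can be sketched with a priority queue, as below. This is a simplified illustration, not a real focused crawler: the link map stands in for the web, and the scoring function (1 for URLs that start with `news`, 0 otherwise) is a hypothetical placeholder for a real page-analysis algorithm. All names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.ToIntFunction;

final class BestFirstSketch {
    /** Fetches up to maxPages URLs, always taking the highest-scoring candidate next. */
    static List<String> crawl(Map<String, List<String>> links, String seed,
                              ToIntFunction<String> score, int maxPages) {
        // Higher score first; ties are broken alphabetically so the order is deterministic.
        Comparator<String> bestFirst = (a, b) -> {
            int byScore = Integer.compare(score.applyAsInt(b), score.applyAsInt(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        };
        PriorityQueue<String> frontier = new PriorityQueue<>(bestFirst);
        Set<String> seen = new HashSet<>();
        List<String> fetched = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && fetched.size() < maxPages) {
            String url = frontier.poll();        // best-scoring candidate
            fetched.add(url);
            for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("seed", Arrays.asList("sports/b", "news/a"));
        links.put("news/a", Arrays.asList("news/c"));
        ToIntFunction<String> score = url -> url.startsWith("news") ? 1 : 0;
        System.out.println(crawl(links, "seed", score, 3)); // prints [seed, news/a, news/c]
    }
}
```

With a limit of three pages, `sports/b` is never fetched even though it was discovered before `news/c`, which is the behavior that separates best-first from breadth-first.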

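3. Structure of the example crawler

The example in this post fetches news. On a news site, the important and recent items are usually on the front page, and the deeper a page sits in the link hierarchy, the less important it tends to be, so a breadth-first algorithm is the better fit. (Figure: structure of the pages crawled in this post.)

III. Breadth-First Crawler Example

1. Requirement: fetch Fudan University news (only 100 pages)

Only 100 pages are fetched, and every URL must start with http://news.fudan.edu.cn.

2. Implementation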

Pull in the external JARs with Maven:

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.3.4</version>
    </dependency>
    <dependency>
        <groupId>org.htmlparser</groupId>
        <artifactId>htmlparser</artifactId>
        <version>2.1</version>
    </dependency>

Program entry point: MyCrawler.java

    package com.amos.crawl;

    import java.util.Set;

    /**
     * Created by amosli on 14-7-10.
     */
    public class MyCrawler {
        /**
         * Initializes the URL queue with the seed URLs.
         *
         * @param seeds the seed URLs
         */
        private void initCrawlerWithSeeds(String[] seeds) {
            for (String seed : seeds) {
                LinkQueue.addUnvisitedUrl(seed);
            }
        }

        public void crawling(String[] seeds) {
            // Filter that accepts only links starting with http://news.fudan.edu.cn
            LinkFilter filter = new LinkFilter() {
                @Override
                public boolean accept(String url) {
                    return url.startsWith("http://news.fudan.edu.cn");
                }
            };
            // Initialize the URL queue
            initCrawlerWithSeeds(seeds);

            int count = 0;
            // Loop while unvisited links remain and fewer than 100 pages have been fetched
            while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() < 100) {
                System.out.println("count:" + (++count));

                // Dequeue the URL at the head of the queue
                String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();
                DownLoadFile downloader = new DownLoadFile();
                // Download the page
                downloader.downloadFile(visitURL);
                // Mark the URL as visited
                LinkQueue.addVisitedUrl(visitURL);
                // Extract the URLs found in the downloaded page
                Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);

                // Enqueue the new, unvisited URLs
                for (String link : links) {
                    System.out.println("link:" + link);
                    LinkQueue.addUnvisitedUrl(link);
                }
            }
        }

        public static void main(String[] args) {
            // Program entry point
            MyCrawler myCrawler = new MyCrawler();
            myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
        }
    }
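
The filter in `crawling` is a plain prefix test, so a URL such as `http://news.fudan.edu.cn.example.com/` would also pass it. One possible alternative, shown here only as a sketch and not as the author's approach, is to parse the URL and compare the host. The class `HostFilterSketch` and its `accept(url, allowedHost)` signature are assumptions for this example, and unlike the original filter it also accepts `https` links.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HostFilterSketch {
    /** Accepts http/https URLs whose host is exactly allowedHost. */
    static boolean accept(String url, String allowedHost) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();         // null for mailto:, javascript:, etc.
            if (scheme == null || host == null) {
                return false;
            }
            boolean web = scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
            return web && host.equalsIgnoreCase(allowedHost);
        } catch (URISyntaxException e) {
            return false;                        // malformed links are simply rejected
        }
    }

    public static void main(String[] args) {
        String host = "news.fudan.edu.cn";
        System.out.println(accept("http://news.fudan.edu.cn/news/", host));        // prints true
        System.out.println(accept("http://news.fudan.edu.cn.example.com/", host)); // prints false
    }
}
```

A filter like this could be wrapped in an anonymous `LinkFilter` in the same way as the prefix test above.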

Utility class: Tools.java

    package com.amos.tool;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InterruptedIOException;
    import java.net.UnknownHostException;
    import java.security.GeneralSecurityException;
    import java.security.cert.CertificateException;
    import java.security.cert.X509Certificate;

    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLException;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpEntityEnclosingRequest;
    import org.apache.http.HttpRequest;
    import org.apache.http.client.CookieStore;
    import org.apache.http.client.HttpRequestRetryHandler;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.protocol.HttpClientContext;
    import org.apache.http.conn.ConnectTimeoutException;
    import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
    import org.apache.http.conn.ssl.SSLContextBuilder;
    import org.apache.http.conn.ssl.TrustStrategy;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.protocol.HttpContext;

    /**
     * Created by amosli on 14-6-25.
     */
    public class Tools {

        /**
         * Writes an HTTP entity to a local file.
         *
         * @param httpEntity the response entity to save
         * @param filename   file name inside Configuration.FILEDIR
         */
        public static void saveToLocal(HttpEntity httpEntity, String filename) {
            try {
                // Configuration is a project class that is not listed in this post
                File dir = new File(Configuration.FILEDIR);
                if (!dir.isDirectory()) {
                    dir.mkdir();
                }

                File file = new File(dir.getAbsolutePath() + "/" + filename);
                // try-with-resources closes both streams even if the copy fails
                try (InputStream inputStream = httpEntity.getContent();
                     FileOutputStream fileOutputStream = new FileOutputStream(file)) {
                    byte[] bytes = new byte[1024];
                    int length;
                    // read() returns -1 at the end of the stream
                    while ((length = inputStream.read(bytes)) != -1) {
                        fileOutputStream.write(bytes, 0, length);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Writes a byte array to a local file.
         *
         * @param bytes    the content to save
         * @param filename file name inside Configuration.FILEDIR
         */
        public static void saveToLocalByBytes(byte[] bytes, String filename) {
            try {
                File dir = new File(Configuration.FILEDIR);
                if (!dir.isDirectory()) {
                    dir.mkdir();
                }

                File file = new File(dir.getAbsolutePath() + "/" + filename);
                try (FileOutputStream fileOutputStream = new FileOutputStream(file)) {
                    fileOutputStream.write(bytes);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Prints to standard output.
         *
         * @param string the text to print
         */
        public static void println(String string) {
            System.out.println("string:" + string);
        }

        /**
         * Prints to standard error.
         *
         * @param string the text to print
         */
        public static void printlnerr(String string) {
            System.err.println("string:" + string);
        }

        /**
         * Builds the configuration shared by both client factories:
         * an SSL channel that trusts every certificate, plus a retry handler.
         */
        private static HttpClientBuilder newClientBuilder() throws GeneralSecurityException {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust all certificates
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();

            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: gives up once the request has been executed 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connect timeout
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    // Retry only if the request is considered idempotent (it carries no body)
                    return !(request instanceof HttpEntityEnclosingRequest);
                }
            };

            // Request settings: 20-second connection-request timeout, 20-second connect timeout,
            // circular redirects not allowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            // SelfRedirectStrategy is a project class that is not listed in this post
            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig);
        }

        /**
         * Creates a client that uses the SSL channel and the retry handler.
         *
         * @return the configured client, or a default client if SSL setup fails
         */
        public static CloseableHttpClient createSSLClientDefault() {
            try {
                return newClientBuilder().build();
            } catch (GeneralSecurityException e) {
                e.printStackTrace();
            }
            return HttpClients.createDefault();
        }

        /**
         * Same as createSSLClientDefault, but with a cookie store.
         *
         * @param cookieStore the cookie store to use
         * @return the configured client, or a default client if SSL setup fails
         */
        public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
            try {
                return newClientBuilder().setDefaultCookieStore(cookieStore).build();
            } catch (GeneralSecurityException e) {
                e.printStackTrace();
            }
            return HttpClients.createDefault();
        }
    }
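
The copy loop in `saveToLocal` can be tried out without any network access. The sketch below isolates it using in-memory streams. `CopySketch` and its `copy` helper are names invented for this example, and wrapping `IOException` in `UncheckedIOException` is a choice made only to keep the demo short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class CopySketch {
    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[1024];
        long total = 0;
        int length;
        try {
            // read() returns -1 only at end of stream, so that is the condition to test
            while ((length = in.read(buffer)) != -1) {
                out.write(buffer, 0, length);    // write only the bytes actually read
                total += length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] page = new byte[3000];            // stands in for a downloaded page
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(copy(new ByteArrayInputStream(page), out)); // prints 3000
    }
}
```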

Downloader class that writes pages to local disk: DownLoadFile.java

    package com.amos.crawl;

    import com.amos.tool.Configuration;
    import com.amos.tool.Tools;
    import org.apache.http.Header;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.HttpStatus;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    /**
     * Created by amosli on 14-7-9.
     */
    public class DownLoadFile {

        public String getFileNameByUrl(String url, String contentType) {
            // Strip the leading "http://" (7 characters) or "https://" (8 characters)
            url = url.startsWith("http://") ? url.substring(7) : url.substring(8);

            if (url.contains(".html")) {
                // The URL already names an .html page
                url = url.replaceAll("[\\?/:*|<>\"]", "_");
            } else if (contentType.indexOf("html") != -1) {
                // text/html content type
                url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
            } else {
                // Other types: use the part after "/" in the content type as the extension
                url = url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
            }
            return url;
        }

        /**
         * Writes a page to local disk.
         *
         * @param data     the page content
         * @param filePath the target file path
         */
        private void saveToLocal(byte[] data, String filePath) {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)))) {
                out.write(data);
                out.flush();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Writes an HTTP entity to a local file.
         *
         * @param httpEntity the response entity to save
         * @param filename   file name inside Configuration.FILEDIR
         */
        public static void saveToLocal(HttpEntity httpEntity, String filename) {
            try {
                File dir = new File(Configuration.FILEDIR);
                if (!dir.isDirectory()) {
                    dir.mkdir();
                }

                File file = new File(dir.getAbsolutePath() + "/" + filename);
                // FileOutputStream creates the file if it does not exist yet
                try (InputStream inputStream = httpEntity.getContent();
                     FileOutputStream fileOutputStream = new FileOutputStream(file)) {
                    byte[] bytes = new byte[1024];
                    int length;
                    // read() returns -1 at the end of the stream
                    while ((length = inputStream.read(bytes)) != -1) {
                        fileOutputStream.write(bytes, 0, length);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public String downloadFile(String url) {
            // Local file name; stays null if the download fails
            String filePath = null;

            // 1. Create the HttpClient
            HttpClient httpClient = Tools.createSSLClientDefault();

            // 2. Create the HttpGet request
            HttpGet httpGet = new HttpGet(url);

            // Set a 5-second connect timeout for the GET request.
            // Method 1 (deprecated): httpGet.getParams().setParameter("connectTimeout", 5000);
            // Method 2:
            RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(5000).build();
            httpGet.setConfig(requestConfig);

            try {
                HttpResponse httpResponse = httpClient.execute(httpGet);
                int statusCode = httpResponse.getStatusLine().getStatusCode();
                if (statusCode != HttpStatus.SC_OK) {
                    System.err.println("Method failed:" + httpResponse.getStatusLine());
                    // Discard the body and skip saving when the request did not succeed
                    EntityUtils.consumeQuietly(httpResponse.getEntity());
                    return null;
                }

                filePath = getFileNameByUrl(url, httpResponse.getEntity().getContentType().getValue());
                saveToLocal(httpResponse.getEntity(), filePath);
            } catch (Exception e) {
                e.printStackTrace();
            }

            return filePath;
        }

        public static void main(String[] args) throws IOException {
            String url = "http://websearch.fudan.edu.cn/search_dep.html";
            HttpClient httpClient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(url);
            HttpResponse httpResponse = httpClient.execute(httpGet);
            Header contentType = httpResponse.getEntity().getContentType();

            System.out.println("name:" + contentType.getName() + "value:" + contentType.getValue());
            System.out.println(new DownLoadFile().getFileNameByUrl(url, contentType.getValue()));
        }
    }
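
To see what file names the URL-to-name mapping produces, the logic can be exercised on its own. The sketch below is a standalone variant rather than the method above: `FileNameSketch` is a made-up name, and as an added assumption it drops content-type parameters such as `; charset=utf-8` and tolerates a missing content type.

```java
final class FileNameSketch {
    /** Maps a URL and a Content-Type value to a flat, file-system-safe name. */
    static String fileNameFor(String url, String contentType) {
        // Drop the scheme prefix if present
        String name = url.replaceFirst("^https?://", "");
        // Replace characters that are not allowed in file names on common systems
        name = name.replaceAll("[?/:*|<>\"]", "_");
        if (name.contains(".html")) {
            return name;
        }
        // Keep only the media type, e.g. "text/html; charset=utf-8" -> "text/html"
        String type = contentType == null ? "" : contentType;
        int semicolon = type.indexOf(';');
        String mediaType = (semicolon >= 0 ? type.substring(0, semicolon) : type).trim();
        if (mediaType.contains("html")) {
            return name + ".html";
        }
        int slash = mediaType.lastIndexOf('/');
        return slash >= 0 ? name + "." + mediaType.substring(slash + 1) : name;
    }

    public static void main(String[] args) {
        // prints news.fudan.edu.cn_news_.html
        System.out.println(fileNameFor("http://news.fudan.edu.cn/news/", "text/html; charset=utf-8"));
        // prints websearch.fudan.edu.cn_search_dep.html
        System.out.println(fileNameFor("http://websearch.fudan.edu.cn/search_dep.html", "text/html"));
    }
}
```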

Link filter interface: LinkFilter.java

    package com.amos.crawl;

    /**
     * Created by amosli on 14-7-10.
     */
    public interface LinkFilter {

        /** Returns true if the URL should be crawled. */
        boolean accept(String url);
    }

Extracting and filtering URLs with HtmlParser: HtmlParserTool.java

    package com.amos.crawl;

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.filters.OrFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    import java.util.HashSet;
    import java.util.Set;

    /**
     * Created by amosli on 14-7-10.
     */
    public class HtmlParserTool {
        public static Set<String> extractLinks(String url, LinkFilter filter) {
            Set<String> links = new HashSet<String>();

            try {
                Parser parser = new Parser(url);
                parser.setEncoding("GBK");
                // Filter that matches <frame> tags, used to pick up their src attribute
                NodeFilter frameFilter = new NodeFilter() {
                    @Override
                    public boolean accept(Node node) {
                        return node.getText().contains("frame src=");
                    }
                };

                // OrFilter that accepts both <a> tags and <frame> tags
                OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
                // All nodes that pass the filter
                NodeList list = parser.extractAllNodesThatMatch(linkFilter);
                for (int i = 0; i < list.size(); i++) {
                    Node tag = list.elementAt(i);
                    if (tag instanceof LinkTag) {
                        String linkURL = ((LinkTag) tag).getLink();

                        // Keep the URL only if it passes the caller's filter
                        if (filter.accept(linkURL)) {
                            links.add(linkURL);
                        }
                    } else { // <frame> tag
                        // Link in the src attribute, e.g. <frame src="test.html" />
                        String frame = tag.getText();
                        int start = frame.indexOf("src=");
                        frame = frame.substring(start);

                        int end = frame.indexOf(" ");
                        if (end == -1) {
                            end = frame.indexOf(">");
                        }
                        if (end == -1) {
                            // src is the last attribute, so the value runs to the end of the text
                            end = frame.length();
                        }
                        // Skip the leading src=" (5 characters) and the closing quote
                        String frameUrl = frame.substring(5, end - 1);
                        if (filter.accept(frameUrl)) {
                            links.add(frameUrl);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            return links;
        }
    }
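
The `indexOf`/`substring` handling of `<frame>` tags assumes a double-quoted `src` value. A regular expression is one way to make that step more tolerant. The sketch below is an illustration that uses only the JDK, not HtmlParser. The class name, the pattern, and the sample tag texts are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FrameSrcSketch {
    // Matches src="...", src='...' or an unquoted src=value
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the value of the src attribute in the tag text, or null if there is none. */
    static String extractSrc(String tagText) {
        Matcher matcher = SRC.matcher(tagText);
        if (!matcher.find()) {
            return null;
        }
        // Exactly one of the three capture groups is set, depending on the quoting style
        for (int group = 1; group <= 3; group++) {
            if (matcher.group(group) != null) {
                return matcher.group(group);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("frame src=\"test.html\" /"));       // prints test.html
        System.out.println(extractSrc("frame name=main src='a/b.html'")); // prints a/b.html
    }
}
```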

Queue implementation that holds the page URLs: Queue.java

    package com.amos.crawl;

    import java.util.LinkedList;

    /**
     * Created by amosli on 14-7-9.
     */
    public class Queue {

        // The queue is backed by a linked list
        private LinkedList<Object> queueList = new LinkedList<Object>();

        // Enqueue
        public void enQueue(Object object) {
            queueList.addLast(object);
        }

        // Dequeue
        public Object deQueue() {
            return queueList.removeFirst();
        }

        // Whether the queue is empty
        public boolean isQueueEmpty() {
            return queueList.isEmpty();
        }

        // Whether the queue contains the given object
        public boolean contains(Object object) {
            return queueList.contains(object);
        }

        // Whether the queue is empty (same as isQueueEmpty)
        public boolean empty() {
            return queueList.isEmpty();
        }
    }

Managing which links enter and leave the queue: LinkQueue.java

    package com.amos.crawl;

    import java.util.HashSet;
    import java.util.Set;

    /**
     * Created by amosli on 14-7-9.
     */
    public class LinkQueue {
        // URLs that have already been visited
        private static Set<String> visitedUrl = new HashSet<String>();
        // URLs that have not been visited yet
        private static Queue unVisitedUrl = new Queue();

        // Returns the queue of unvisited URLs
        public static Queue getUnVisitedUrl() {
            return unVisitedUrl;
        }

        // Returns the set of visited URLs
        public static Set<String> getVisitedUrl() {
            return visitedUrl;
        }

        // Adds a URL to the visited set
        public static void addVisitedUrl(String url) {
            visitedUrl.add(url);
        }

        // Removes a URL from the visited set
        public static void removeVisitedUrl(String url) {
            visitedUrl.remove(url);
        }

        // Dequeues the next unvisited URL
        public static Object unVisitedUrlDeQueue() {
            return unVisitedUrl.deQueue();
        }

        // Ensures each URL is visited only once: the URL must not be empty,
        // must not be in the visited set, and must not already be waiting in the unvisited queue
        public static void addUnvisitedUrl(String url) {
            if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url)) {
                unVisitedUrl.enQueue(url);
            }
        }

        // Returns the number of visited URLs
        public static int getVisitedUrlNum() {
            return visitedUrl.size();
        }

        // Whether the queue of unvisited URLs is empty
        public static boolean isUnvisitedUrlsEmpty() {
            return unVisitedUrl.empty();
        }
    }
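
`LinkQueue` keeps its state in static fields, and `Queue.contains` scans a linked list on every insert. As a possible variation, sketched here under assumptions and not taken from the original project, the same "visit each URL once" rule can be kept in an instance-based class backed by a `LinkedHashSet`, which preserves insertion order and offers constant-time membership tests. `FrontierSketch` and its method names are invented for this example.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

final class FrontierSketch {
    private final Set<String> visited = new HashSet<>();
    // LinkedHashSet keeps FIFO order and rejects duplicates without a linear scan
    private final LinkedHashSet<String> unvisited = new LinkedHashSet<>();

    /** Queues the URL unless it is blank, already visited, or already queued. */
    boolean add(String url) {
        if (url == null || url.trim().isEmpty() || visited.contains(url)) {
            return false;
        }
        return unvisited.add(url);
    }

    /** Removes the oldest queued URL, marks it visited, and returns it. */
    String next() {
        Iterator<String> it = unvisited.iterator();
        String url = it.next();                  // throws NoSuchElementException when empty
        it.remove();
        visited.add(url);
        return url;
    }

    boolean isEmpty() {
        return unvisited.isEmpty();
    }

    int visitedCount() {
        return visited.size();
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        frontier.add("http://news.fudan.edu.cn/news/");
        System.out.println(frontier.add("http://news.fudan.edu.cn/news/")); // prints false
        System.out.println(frontier.next()); // prints http://news.fudan.edu.cn/news/
    }
}
```

Because the state lives in an instance, several independent crawls can run in the same JVM, which the static `LinkQueue` does not allow.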

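The crawl works as follows: start from the given seed URL ==> find the URLs that pass the filter and add them to the queue ==> take the URLs out of the queue in order, visit each one, and extract the matching URLs it contains ==> download the pages for the queued URLs. In other words, the site is explored level by level, with a cap of 100 pages.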

3. Screenshot
