Java web crawling: how to fetch content in a web crawler


一本黑


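Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_32821477/article/details/114171597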


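Hi! I'm trying to implement this pseudocode for a spider algorithm that explores the web.

I need some ideas for the next step of the pseudocode: "use SpiderLeg to fetch content".

I have another class, SpiderLeg, with a method that gets all the URLs on a web page, but I'd like to know how to use it from this class.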

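    // Crawl the web and return every URL the spider visited.
    // Assumes `unvisited` is a List<String> field of this class
    // (needs java.util.List and java.util.ArrayList).
    public List<String> crawl(String url, String keyword) throws IOException {
        List<String> visited = new ArrayList<>();
        unvisited.add(url); // seed the list with the start URL

        // while the list of unvisited URLs is not empty
        // (testing `unvisited != null` would never become false)
        while (!unvisited.isEmpty()) {
            // take the URL from the list; get(0) alone would never shrink it
            String currentUrl = unvisited.remove(0);
            visited.add(currentUrl);

            // using SpiderLeg to fetch content (the step this question is about)
            SpiderLeg leg = new SpiderLeg();
        }
        return visited;
    }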
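One way to fill in the "use SpiderLeg to fetch content" step is sketched below. The SpiderLeg class is not shown in the post, so the sketch assumes a hypothetical `PageFetcher` interface with `fetch(url)` and `getLinks()`; the real method names may well differ. `StubLeg` is an in-memory stand-in, also an assumption, so the example runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the question's SpiderLeg; the real class is not shown.
interface PageFetcher {
    void fetch(String url);      // load one page
    List<String> getLinks();     // links found on the most recently loaded page
}

// In-memory fetcher so the sketch runs offline (an assumption, not the asker's code).
class StubLeg implements PageFetcher {
    private final Map<String, List<String>> web;
    private List<String> links = List.of();

    StubLeg(Map<String, List<String>> web) { this.web = web; }

    @Override
    public void fetch(String url) {
        links = web.getOrDefault(url, List.of());
    }

    @Override
    public List<String> getLinks() { return links; }
}

class SimpleSpider {
    private final Deque<String> unvisited = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final PageFetcher leg;

    SimpleSpider(PageFetcher leg) { this.leg = leg; }

    // Returns the URLs in the order they were visited.
    List<String> crawl(String startUrl) {
        List<String> visited = new ArrayList<>();
        unvisited.add(startUrl);
        seen.add(startUrl);
        while (!unvisited.isEmpty()) {
            String currentUrl = unvisited.poll();   // take URL from the list
            visited.add(currentUrl);
            leg.fetch(currentUrl);                  // use the leg to fetch content
            for (String link : leg.getLinks()) {
                // Set.add returns false for duplicates, so each URL is queued once
                if (seen.add(link)) {
                    unvisited.add(link);
                }
            }
        }
        return visited;
    }
}
```

The key point is that the spider only calls the leg and feeds the returned links back into its own list; whether a fresh leg is created per page, as in the question, or one leg is reused, as here, is a design choice.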

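Cheers! I'll try that. However, I also tried the version below, which does not use a queue data structure. It almost works, but the program does not stop when searching for certain words.

When it does find the word, it only shows the link of that one web page rather than every specific URL where the word was found.

Is it possible to do it this way?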

    private static final int MAX_PAGES_TO_SEARCH = 10;
    private Set<String> pagesVisited = new HashSet<>();
    private List<String> pagesToVisit = new LinkedList<>();

    public void crawl(String url, String searchWord)
    {
        while(this.pagesVisited.size() < MAX_PAGES_TO_SEARCH)
        {
            String currentUrl;
            SpiderLeg leg = new SpiderLeg();
            if(this.pagesToVisit.isEmpty())
            {
                // nothing queued: start from the seed URL
                currentUrl = url;
                this.pagesVisited.add(url);
            }
            else
            {
                // nextUrl() belongs to this class but is not shown in the post
                currentUrl = this.nextUrl();
            }
            // presumably loads the page and collects its links inside the leg
            leg.getHyperlink(currentUrl);
            boolean success = leg.searchForWord(searchWord);
            if(success)
            {
                System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
                // leaves the loop at the first match, so later matches are never reported
                break;
            }
            this.pagesToVisit.addAll(leg.getLinks());
        }
        System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
    }
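Both symptoms can plausibly be traced to the loop itself. If `pagesToVisit` ever runs empty again, the seed URL is re-added to a set that already contains it, so `pagesVisited.size()` never grows and the loop never ends. The `break` also stops at the first match, so only one URL is ever printed. A minimal sketch of one possible fix follows. It assumes a hypothetical in-memory `Page` map in place of SpiderLeg, which is not shown in the post, and it collects matches instead of breaking.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical page record: text plus outgoing links (not from the original post).
class Page {
    final String text;
    final List<String> links;

    Page(String text, List<String> links) {
        this.text = text;
        this.links = links;
    }
}

class WordSpider {
    private static final int MAX_PAGES_TO_SEARCH = 10;
    private final Map<String, Page> web;   // in-memory stand-in for SpiderLeg fetching

    WordSpider(Map<String, Page> web) { this.web = web; }

    // Returns every visited URL whose text contains searchWord (case-insensitive).
    List<String> crawl(String url, String searchWord) {
        Set<String> pagesVisited = new HashSet<>();
        Deque<String> pagesToVisit = new ArrayDeque<>();
        List<String> matches = new ArrayList<>();
        pagesToVisit.add(url);

        // Stop when the page budget is spent OR nothing is left to visit.
        while (pagesVisited.size() < MAX_PAGES_TO_SEARCH && !pagesToVisit.isEmpty()) {
            String currentUrl = pagesToVisit.poll();
            if (!pagesVisited.add(currentUrl)) {
                continue;                  // already visited, skip it
            }
            Page page = web.get(currentUrl);
            if (page == null) {
                continue;                  // unknown page; it still counts as visited
            }
            if (page.text.toLowerCase().contains(searchWord.toLowerCase())) {
                matches.add(currentUrl);   // record the match and keep crawling
            }
            pagesToVisit.addAll(page.links);
        }
        return matches;
    }
}
```

Because every polled URL either enters `pagesVisited` or is skipped, each iteration shrinks the queue or grows the visited set, so the loop ends even when the word is never found.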
