Scraping jokes from pengfu.com with Java

Latest recommended article published 2021-07-06 19:36:57

好摩

Tags: java, jokes

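Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, include a link to the original source and this notice.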

Article link: https://blog.csdn.net/weixin_42134144/article/details/114451094

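This post shows how to use Java's HttpURLConnection and the Jsoup library to fetch and parse jokes from pengfu.com, extracting the author, title, and body of each one. It opens an HTTP connection, parses the DOM, pulls out the required content, and writes it to a file on disk. The sample code walks through the whole scraper.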

First, a screenshot of the result:

Preparation:

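import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Opens an HTTP connection and returns the response body as a string.
 * Returns an empty string if the request fails.
 * Lives in the ConnectionUtil class used by main below.
 */
public static String Connect(String address) {
    HttpURLConnection conn = null;
    StringBuilder body = new StringBuilder();
    try {
        URL url = new URL(address);
        conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        conn.setDoInput(true);
        conn.connect();
        // try-with-resources closes the reader and the underlying stream.
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep the line break so adjacent lines do not run together
                body.append(line).append('\n');
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // conn is still null if new URL(...) or openConnection() failed
        if (conn != null) {
            conn.disconnect();
        }
    }
    return body.toString();
}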
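The read loop in the connection method can be tried without touching the network. The following is a minimal sketch, not the author's code: `StreamTextSketch` and `readAll` are hypothetical names, and UTF-8 is an assumption about the page encoding. It decodes an in-memory byte array the same way the method decodes the HTTP response.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamTextSketch {

    /** Reads the whole stream as text, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                body.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Stand-in for an HTTP response body (hypothetical content)
        byte[] fakeResponse = "<h1>title</h1>\n<p>body</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(fakeResponse), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly matters for a Chinese-language site: `new InputStreamReader(in)` without a charset falls back to the platform default, which can garble the text on machines whose default is not the page's encoding.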

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * Appends the given text to a file on disk.
 *
 * @param allText the text to write
 */
private static void writeToFile(String allText) {
    System.out.println("Writing...");
    File targetFile = new File("/Users/shibo/tmp/pengfu.txt");
    File fileDir = targetFile.getParentFile();
    if (!fileDir.exists()) {
        fileDir.mkdirs();
    }
    // FileOutputStream creates the file if it is missing; 'true' opens it in append mode.
    // try-with-resources flushes and closes the stream.
    try (BufferedOutputStream bos =
            new BufferedOutputStream(new FileOutputStream(targetFile, true))) {
        // Encode explicitly rather than relying on the platform default charset
        bos.write(allText.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println("Finished writing.");
}
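The same append-to-file step can also be written with `java.nio.file`. This is only a sketch of an alternative, not the method the post uses: `AppendFileSketch`, `appendText`, and `readText` are hypothetical names, and the example writes under the system temp directory instead of the path hard-coded above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class AppendFileSketch {

    /** Appends text to the file, creating the file and its parent directories if needed. */
    static void appendText(Path target, String text) {
        try {
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(target, text.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back as UTF-8 text. */
    static String readText(Path target) {
        try {
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical location under the temp directory, unique per run
        Path target = Paths.get(System.getProperty("java.io.tmpdir"),
                "pengfu-sketch-" + System.nanoTime(), "pengfu.txt");
        appendText(target, "first page\r\n");
        appendText(target, "second page\r\n");
        System.out.print(readText(target));
    }
}
```

Because the file is opened in append mode, calling `appendText` once per page would let the scraper write results as it goes instead of holding all 50 pages in memory.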

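Add the jsoup jar (used to parse the DOM) as a dependency:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>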

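Now analyze the site:

pengfu.com jokes section

First, locate the content we need (author, title, and body).

Inspect the elements in the browser; here I am looking at the title tag:

Once the structure is known, we can extract what we want: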

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    StringBuilder allText = new StringBuilder();
    for (int i = 1; i <= 50; i++) {
        System.out.println("Scraping page " + i + "...");
        // Open a connection and fetch the page HTML
        String html = ConnectionUtil.Connect("https://www.pengfu.com/xiaohua_" + i + ".html");
        // Parse the HTML into a DOM so it is easy to query
        Document doc = Jsoup.parse(html);
        // Select every title node on the page
        Elements titles = doc.select("h1.dp-b");
        for (Element titleEle : titles) {
            // The author and body are looked up relative to the title's parent element
            Element parent = titleEle.parent();
            // Title text
            String title = titleEle.getElementsByTag("a").text();
            // Author of this entry
            String author = parent.select("p.user_name_list > a").text();
            // Body of this entry
            String content = parent.select("div.content-img").text();
            // Format the entry
            allText.append(title)
                    .append("\r\nAuthor: ").append(author)
                    .append("\r\n").append(content)
                    .append("\r\n").append("\r\n");
        }
        allText.append("-------------Page ").append(i).append("-------------").append("\r\n");
        System.out.println("Finished scraping page " + i + ".");
    }
    // Write everything to disk
    Test.writeToFile(allText.toString());
}
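The page URL pattern and the per-entry layout used in the loop can be pulled into small helpers, which makes them easy to check without hitting the site. This is a sketch under assumptions: `JokeFormatSketch`, `pageUrl`, and `formatEntry` are hypothetical names, and the "Author: " label is an English stand-in for the label written to the output file.

```java
final class JokeFormatSketch {

    /** Builds the listing URL for a page number, following the pattern used in main. */
    static String pageUrl(int page) {
        return "https://www.pengfu.com/xiaohua_" + page + ".html";
    }

    /** Formats one entry as: title, author line, body, then a blank line. */
    static String formatEntry(String title, String author, String content) {
        return new StringBuilder()
                .append(title)
                .append("\r\nAuthor: ").append(author)
                .append("\r\n").append(content)
                .append("\r\n").append("\r\n")
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(pageUrl(1));
        // Placeholder values; real ones would come from the jsoup selectors
        System.out.print(formatEntry("Sample title", "Sample author", "Sample body"));
    }
}
```

With the formatting isolated like this, the loop body reduces to three selector calls plus one `formatEntry` call per title.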
