HTML Page Scraping

Latest recommended article published 2021-05-31 10:31:59

ysong109

Tags: HTML

Article link: https://blog.csdn.net/ysong109/article/details/83676254

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class Test001 {
    // Link texts extracted from the page, in document order.
    private final List<String> rsList = new ArrayList<String>();

    private Test001() {
        try {
            loadHtml();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void loadHtml() throws IOException {
        // Create a URL instance for the page to fetch.
        URL url = new URL("http://top.baidu.com/buzz/top10.html");
        // Read the stream using the page's encoding; try-with-resources
        // closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream(), "gb2312"))) {
            String s;
            boolean beginFind = false;
            while (null != (s = br.readLine())) {
                String line = s.trim();
                if ("<tbody id=\"listdata\">".equals(line)) {
                    beginFind = true;
                } else if (beginFind && "</tbody>".equals(line)) {
                    // Stop only at the end of the target table body, not at
                    // an unrelated </tbody> that appears before it.
                    break;
                }

                if (beginFind && line.startsWith("<td><a")) {
                    rsList.add(findContent(line));
                }
            }
        }

        for (String title : rsList) {
            System.out.println(title);
        }
    }

    public String findContent(String html) {
        // Remove every HTML tag, leaving only the text content.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        new Test001();
    }
}
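
The row filtering and tag stripping in the listing above can be tried without a network connection. The following is a minimal sketch, not the author's code: the class name `TagStripDemo`, the helper names `stripTags` and `extractLinkTexts`, and the sample HTML lines are all hypothetical, and it assumes, as the listing does, that each tag of interest sits on its own line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical offline demo of the line-based extraction used above.
class TagStripDemo {
    // Matches any single HTML tag such as <td> or <a href="...">.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes all tags from one line of HTML and trims the remaining text.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("").trim();
    }

    // Collects the text of every line starting with "<td><a" that lies
    // between <tbody id="listdata"> and the next </tbody>.
    static List<String> extractLinkTexts(List<String> lines) {
        List<String> results = new ArrayList<String>();
        boolean inTable = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("<tbody id=\"listdata\">")) {
                inTable = true;
            } else if (inTable && line.equals("</tbody>")) {
                break;
            } else if (inTable && line.startsWith("<td><a")) {
                results.add(stripTags(line));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page.
        List<String> sample = Arrays.asList(
                "<td><a href=\"/x\">Before</a></td>",
                "<tbody id=\"listdata\">",
                "  <td><a href=\"/a\">First</a></td>",
                "<td>3</td>",
                "<td><a href=\"/b\">Second</a></td>",
                "</tbody>",
                "<td><a href=\"/y\">After</a></td>");
        System.out.println(extractLinkTexts(sample)); // prints [First, Second]
    }
}
```

Line-by-line matching of this kind breaks as soon as the page changes its formatting, so for anything beyond a quick experiment a real HTML parser is usually the safer choice.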
