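import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WebReptils {
    public static void main(String[] args) {
        URL url = null;
        URLConnection urlconn = null;
        BufferedReader br = null;
        PrintWriter pw = null;
        // URL matching rule. Inside [...] the characters + . ? / are literals, not quantifiers.
        String regex = "https://[\\w+\\.?/?]+\\.[A-Za-z]+";
        Pattern p = Pattern.compile(regex);
        try {
            url = new URL("https://www.baidu.com/");
            urlconn = url.openConnection();
            // Write the crawled links to the SiteURL file on drive D (autoflush on println)
            pw = new PrintWriter(new FileWriter("D:/SiteURL.txt"), true);
            br = new BufferedReader(new InputStreamReader(urlconn.getInputStream()));
            String buf = null;
            while ((buf = br.readLine()) != null) {
                Matcher buf_m = p.matcher(buf);
                while (buf_m.find()) {
                    pw.println(buf_m.group());
                }
            }
            System.out.println("Crawl succeeded ^_^");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // br and pw are still null if opening the connection or the file failed
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            if (pw != null) {
                pw.close();
            }
        }
    }
}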
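The read-match-write loop and the manual cleanup in the listing above could also be sketched with try-with-resources, which closes both streams automatically. This is only a sketch under assumptions: the class name `LinkCopier` and the method `copyMatches` are hypothetical and not part of the original code, and the demo reads from an in-memory string instead of a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original crawler.
public class LinkCopier {

    // Reads the source line by line, writes every regex match to the sink
    // (one match per line) and returns the number of matches written.
    // Note: both source and sink are closed when this method returns.
    static int copyMatches(Reader source, Writer sink, Pattern pattern) throws IOException {
        int count = 0;
        // Resources declared here are closed in reverse order, even if an exception is thrown.
        try (BufferedReader br = new BufferedReader(source);
             PrintWriter pw = new PrintWriter(sink, true)) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = pattern.matcher(line);
                while (m.find()) {
                    pw.println(m.group());
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Same regular expression as in the crawler.
        Pattern p = Pattern.compile("https://[\\w+\\.?/?]+\\.[A-Za-z]+");
        StringWriter out = new StringWriter();
        int n = copyMatches(new StringReader("<a href=\"https://www.baidu.com/\">home</a>"), out, p);
        System.out.println(n + " link(s) found: " + out.toString().trim());
    }
}
```

In the crawler itself, the `Reader` would be the `InputStreamReader` over `urlconn.getInputStream()` and the `Writer` would be the `FileWriter`. Taking them as parameters keeps the loop testable without network or disk access.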
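By default the links are written through an IO stream to D:/SiteURL.txt. The code also involves a regular expression that matches URLs, along with Pattern, the API designed for regular expressions. The JDK API documentation explains in detail how to write regular expressions. I don't write IO stream code very often, so I wrapped the raw IO stream in a buffered reader.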
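To see what the regular expression actually captures, a minimal sketch can run `Pattern` and `Matcher` on a single string without any network access. The class name `UrlRegexDemo` and the method `extractUrls` are made up for illustration. The pattern is the one from the crawler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original crawler.
public class UrlRegexDemo {

    // Inside [...] the characters + . ? / are literals, so the class matches
    // word characters, '+', '.', '?' and '/'. The match must end in ".letters".
    private static final Pattern URL_PATTERN =
            Pattern.compile("https://[\\w+\\.?/?]+\\.[A-Za-z]+");

    // Returns every substring of the given text that matches URL_PATTERN, in order.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://www.baidu.com/\">home</a> "
                + "<a href=\"https://news.baidu.com/a/b.html\">news</a>";
        for (String u : extractUrls(html)) {
            System.out.println(u);
        }
    }
}
```

Because the pattern has to end with a dot followed by letters, a trailing slash or a final path segment without a dot is cut off: `https://www.baidu.com/` is captured as `https://www.baidu.com`. Hyphens are not matched by `\w`, so a URL containing one is cut off at the hyphen or missed entirely, and `http://` links are not matched at all.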