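Today I was learning how to scrape data from a web page with Java and found that everything I fetched came out as garbled text. After reading up on some encoding issues online, I got it fixed.
My first attempt: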
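import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FetchPage {
    public static void main(String[] args) throws IOException {
        String web = "https://blog.csdn.net/dnuiking?type=blog";
        String getdata = getdata(web);
        System.out.println(getdata);
    }

    public static String getdata(String web) throws IOException {
        StringBuilder sb = new StringBuilder();
        URL url = new URL(web);
        URLConnection conn = url.openConnection();
        // No charset is given here, so the reader decodes the bytes
        // with the platform default charset. This is the cause of the garbling.
        InputStreamReader isr = new InputStreamReader(conn.getInputStream());
        int ch;
        // Read one character at a time until the end of the stream
        while ((ch = isr.read()) != -1) {
            sb.append((char) ch);
        }
        isr.close();
        return sb.toString();
    }
}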
Symptom: the Chinese text in the output was garbled.
After the fix:
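The garbling can be reproduced without touching the network. The following is a minimal sketch, not the code from this post: the class name CharsetMismatchDemo and the helper decode are made up for illustration, and ISO-8859-1 merely stands in for whatever non-UTF-8 default charset a machine might have. It encodes two Chinese characters as UTF-8 bytes and then decodes them through an InputStreamReader twice, once with the wrong charset and once with the right one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Hypothetical helper: decodes bytes through an InputStreamReader
    // using the given charset, the same way the page body is read.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the
        // source file's own encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: each of the 6 UTF-8 bytes becomes its own character.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Matching charset: the original text comes back intact.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original)); // false
        System.out.println(right.equals(original)); // true
    }
}
```

The bytes are identical in both calls; only the charset used for decoding differs, which is exactly the situation when a UTF-8 page is read with a different default charset.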
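import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FetchPage {
    public static void main(String[] args) throws IOException {
        String web = "https://blog.csdn.net/dnuiking?type=blog";
        String getdata = getdata(web);
        System.out.println(getdata);
    }

    public static String getdata(String web) throws IOException {
        StringBuilder sb = new StringBuilder();
        URL url = new URL(web);
        URLConnection conn = url.openConnection();
        // The second argument tells the reader to decode the bytes as UTF-8
        InputStreamReader isr = new InputStreamReader(conn.getInputStream(), "UTF-8");
        int ch;
        // Read one character at a time until the end of the stream
        while ((ch = isr.read()) != -1) {
            sb.append((char) ch);
        }
        isr.close();
        return sb.toString();
    }
}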
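Problem solved.
The second argument of the InputStreamReader constructor specifies the character encoding. Without it, the reader falls back to the platform default charset, and if that is not the encoding the page was sent in, the bytes are decoded into the wrong characters. Choosing UTF-8 here makes the Chinese text display correctly.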
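Hard-coding UTF-8 works for this page, but other pages may use a different encoding. One possible refinement, sketched here under assumptions, is to read the charset from the Content-Type response header and fall back to UTF-8 when none is given. The class ContentTypeCharset and the method charsetOf are hypothetical names introduced for this example; the parsing is deliberately simple and does not cover every header form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Hypothetical helper: picks the charset out of a Content-Type value
    // such as "text/html; charset=UTF-8". Falls back to UTF-8 when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for the "charset=" prefix
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
        System.out.println(charsetOf(null));
    }
}
```

In getdata, the result could then be passed to the reader, for example new InputStreamReader(conn.getInputStream(), ContentTypeCharset.charsetOf(conn.getContentType())). URLConnection.getContentType() returns null when the header is absent, which the helper treats as UTF-8.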