Web Spider
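SEO: study the crawler rules of the major search engines to improve a site's ranking.
Crawler steps:
1. Start with a URL
2. Download the resource
3. Analyze the data (regular expressions, etc.)
4. Extract, clean, and store the data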
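Steps 3 and 4 can be sketched roughly as below. This is a minimal sketch, not a definitive implementation: the class name LinkExtractor, the HREF pattern, and the sample paths under www.jd.com are all assumptions made up for illustration, and "storage" here just means collecting results in an in-memory list. A regular expression is enough for a sketch like this, though a real HTML parser would likely be more robust.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original notes.
class LinkExtractor {
    // Assumed pattern: captures the quoted href value of an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Step 3: analyze the page text with a regular expression.
    // Step 4: extract each link, clean it (trim, drop empty values,
    // in-page anchors and duplicates), and store it in a list.
    static List<String> extractLinks(String html) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-seen order
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1).trim();
            if (!link.isEmpty() && !link.startsWith("#")) {
                seen.add(link);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a downloaded page.
        String html = "<a href=\"https://www.jd.com/a\">A</a>"
                + "<a class='x' href=' https://www.jd.com/b '>B</a>"
                + "<a href=\"#top\">Top</a>"
                + "<a href=\"https://www.jd.com/a\">A again</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

In a real crawler the string passed to extractLinks would be the text downloaded in step 2, and the returned links could be written to a file or database instead of printed.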
Example: downloading data from the JD.com website
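import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class JdSpider { // class name is illustrative
    public static void main(String[] args) throws Exception {
        // Step 1: build the URL
        URL url = new URL("https://www.jd.com");
        // Step 2: download the resource over the network;
        // try-with-resources closes the stream even if reading fails
        try (InputStream is = url.openStream();
             BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String msg;
            while ((msg = br.readLine()) != null) {
                System.out.println(msg);
            }
        }
    }
}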
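Example: downloading data from Dianping
Fetching the page the same way as for JD.com fails with a 403 error; the request has to imitate a browser.
Changing the code as follows lets the download succeed: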
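import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DianpingSpider { // class name is illustrative
    public static void main(String[] args) throws Exception {
        // Step 1: build the URL
        URL url = new URL("https://www.dianping.com");
        // Step 2: download the resource, sending a browser-style User-Agent header
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36");
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String msg;
            while ((msg = br.readLine()) != null) {
                System.out.println(msg);
            }
        } finally {
            conn.disconnect();
        }
    }
}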
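The effect of the User-Agent header can be reproduced offline. The sketch below starts a local test server using the JDK's built-in com.sun.net.httpserver and sends it two requests. The rule that only agents starting with "Mozilla/" are accepted is an assumption for illustration; how Dianping actually filters requests is not known here. The class name UserAgentDemo and its helper methods are hypothetical as well.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original notes.
class UserAgentDemo {
    // Starts a local server on a free port. Assumed rule for illustration:
    // answer 403 unless the User-Agent header starts with "Mozilla/".
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean browserLike = ua != null && ua.startsWith("Mozilla/");
            byte[] body = (browserLike ? "ok" : "forbidden").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(browserLike ? 200 : 403, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Sends a GET request and returns the HTTP status code.
    // A null userAgent leaves Java's default header ("Java/<version>") in place.
    static int statusFor(URL url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            System.out.println(statusFor(url, null));          // default Java agent, rejected by this test server
            System.out.println(statusFor(url, "Mozilla/5.0")); // browser-style agent, accepted
        } finally {
            server.stop(0);
        }
    }
}
```

Checking getResponseCode() before calling getInputStream() is also useful in a real crawler, since getInputStream() throws an IOException on a 403 response rather than returning the error page.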