Introduction to Web Crawlers
- A crawler system generally consists of three modules: fetching data, parsing data, and storing data.
- By crawl scope, crawlers divide into vertical crawlers and general-purpose crawlers: a vertical crawler targets one specific category of websites, while a general-purpose crawler crawls data from across the entire web.
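The three-module split above can be sketched with a few small Java interfaces. This is a minimal illustration only; the names `Fetcher`, `Parser`, and `Store` and the dummy implementations are hypothetical, not from any real framework:

```java
import java.util.List;

// Hypothetical interfaces for the three crawler modules: fetch, parse, store.
interface Fetcher { String fetch(String url); }       // fetch data over the network
interface Parser  { List<String> parse(String html); } // extract records from a page
interface Store   { void save(List<String> records); } // persist the extracted records

public class CrawlerPipeline {
    public static void main(String[] args) {
        // Dummy implementations showing how the three modules chain together.
        Fetcher fetcher = url -> "<html><a>demo</a></html>";          // pretend download
        Parser parser = html -> List.of(html.replaceAll("<[^>]*>", "")); // strip tags
        Store store = records -> records.forEach(System.out::println);   // print as "storage"

        store.save(parser.parse(fetcher.fetch("http://www.kanxue.ren")));
    }
}
```

In a real crawler each interface would be backed by a network client, an HTML parser, and a database respectively; the point here is only the flow of data between the three modules.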
How Crawlers Work
Fetching Data
Fetching data means issuing a network request, which uses the HTTP protocol.
Fetching data with the JDK API
public void jdkHttpGetData() throws Exception {
    // Open an HTTP connection and issue a GET request
    URL url = new URL("http://www.kanxue.ren");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    // Read the response body line by line as UTF-8; try-with-resources closes the stream
    try (BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            System.out.println(line);
        }
    }
}
Fetching data with the HttpClient library
* Network requests are issued through a CloseableHttpClient
* GET requests use an HttpGet object; POST requests use an HttpPost object
* EntityUtils.toString() converts a response's input stream into a String
* UrlEncodedFormEntity wraps the form data to be submitted
Add the jar in Maven:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.5</version>
</dependency>
public static void main(String[] args) throws Exception {
    // Create a default client and issue a GET request
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet request = new HttpGet("http://www.kanxue.ren");
    try (CloseableHttpResponse response = client.execute(request)) {
        // Only read the body on a 200 OK response
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            // EntityUtils.toString drains the entity's stream into a UTF-8 String
            System.out.println(EntityUtils.toString(entity, StandardCharsets.UTF_8));
        }
    }
}
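The bullets above also mention HttpPost and UrlEncodedFormEntity for submitting form data. A minimal sketch of building such a POST request follows; the target URL and the form field name/value are placeholders, and the request would be executed with the same CloseableHttpClient as in the GET example:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpPostDemo {
    public static void main(String[] args) throws Exception {
        // Form fields to submit; the field name and value here are placeholders
        List<NameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("keyword", "crawler"));

        HttpPost request = new HttpPost("http://www.kanxue.ren");
        // UrlEncodedFormEntity encodes the fields as application/x-www-form-urlencoded
        request.setEntity(new UrlEncodedFormEntity(params, StandardCharsets.UTF_8));

        // Inspect the body that would be sent with the request
        System.out.println(EntityUtils.toString(request.getEntity())); // keyword=crawler
    }
}
```

To actually send it, pass the request to client.execute(request) exactly as in the GET example and read the response entity the same way.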