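Crawling data from the web is one of the most popular activities of the big-data era. There are many methods, languages, and frameworks for the job. Python and R, the current mainstream crawler languages, make crawling convenient and fast. For the many Java users out there, though, Java works for crawling as well: the WebMagic framework, or the HttpClient + HtmlCleaner + XPath combination, does the job quite well.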
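Crawling data mainly involves the following steps:
1. Download the page at a given URL
2. Parse the downloaded page
3. Manage the related URLs
4. Store the downloaded data
and so on.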
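The steps above can be sketched as a single loop. This is only a minimal sketch under assumptions: the class name CrawlLoopSketch, the downloader function standing in for a real HTTP download, the in-memory map used as storage, and the regex-based link extraction are all hypothetical and are not taken from the code analyzed in this post.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the four crawl steps; not the author's implementation.
final class CrawlLoopSketch {
    // Naive link extraction for illustration only; a real crawler would use an HTML parser.
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    // downloader maps a URL to its page content, or to null when the download failed.
    static Map<String, String> crawl(String seed, Function<String, String> downloader, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>();        // step 3: URLs waiting to be fetched
        Set<String> seen = new HashSet<>();                 // step 3: URLs already scheduled
        Map<String, String> store = new LinkedHashMap<>();  // step 4: storage (in memory here)
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && store.size() < maxPages) {
            String url = frontier.poll();
            String html = downloader.apply(url);            // step 1: download
            if (html == null) {
                continue;                                   // skip pages that could not be fetched
            }
            store.put(url, html);                           // step 4: store
            Matcher m = LINK.matcher(html);                 // step 2: parse
            while (m.find()) {
                String next = m.group(1);
                if (seen.add(next)) {                       // add() returns false for duplicates
                    frontier.add(next);
                }
            }
        }
        return store;
    }
}
```

Keeping the downloader as a plain function means the loop can be tried against a map of fake pages first and wired to a real HTTP download later.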
Today we focus on analyzing the download code.
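    // Imports needed by the enclosing class
    import java.io.IOException;
    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public Page startDown(String url) {
        Page page = new Page();
        HttpClientBuilder custom = HttpClients.custom();
        HttpUriRequest request = new HttpGet(url);
        // try-with-resources closes the response and the client, releasing the connection
        try (CloseableHttpClient build = custom.build();
             CloseableHttpResponse execute = build.execute(request)) {
            HttpEntity entity = execute.getEntity();
            // Read the whole response body into a string
            String string = EntityUtils.toString(entity);
            page.setUrl(url);
            page.setPageString(string);
        } catch (IOException e) {
            // Also covers ClientProtocolException, which is a subclass of IOException
            e.printStackTrace();
        }
        return page;
    }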
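The above is the main code for downloading a page; let's analyze it. Before looking at how the download works, imagine what downloading a page generally requires:
1. A network connection. Without one you cannot even open the page, let alone download it. So the first step is to check the network-related settings.
2. With the network available, connect to the server identified by the URL.
3. Then, of course, download the content of the page.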
HttpClientBuilder custom = HttpClients.custom();
What does this line really do? It creates an HttpClientBuilder instance:
    public static HttpClientBuilder custom() {
        return HttpClientBuilder.create();
    }
Looking one level further down:
    public static HttpClientBuilder create() {
        return new HttpClientBuilder();
    }
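The shape of these two factory methods (a static entry point that hands back a fresh builder) can be tried out in isolation. The toy classes below are hypothetical and only mirror that shape; ToyClient, ToyClientBuilder, and their default values are assumptions made for illustration and are not part of HttpClient.

```java
// Hypothetical toy classes that mirror the static factory -> builder -> build() shape.
final class ToyClient {
    final String userAgent;
    final int maxConnections;

    ToyClient(String userAgent, int maxConnections) {
        this.userAgent = userAgent;
        this.maxConnections = maxConnections;
    }

    // Entry point in the spirit of HttpClients.custom(): it only delegates to the builder.
    static ToyClientBuilder custom() {
        return ToyClientBuilder.create();
    }
}

final class ToyClientBuilder {
    private String userAgent;    // null means "use the default"
    private int maxConnections;  // 0 means "use the default"

    private ToyClientBuilder() {
    }

    static ToyClientBuilder create() {
        return new ToyClientBuilder();
    }

    ToyClientBuilder setUserAgent(String userAgent) {
        this.userAgent = userAgent;
        return this; // returning this allows chained calls
    }

    ToyClientBuilder setMaxConnections(int maxConnections) {
        this.maxConnections = maxConnections;
        return this;
    }

    ToyClient build() {
        // Copy each field and fall back to a default when it was never set.
        String userAgentCopy = userAgent != null ? userAgent : "toy-client";
        int maxCopy = maxConnections > 0 ? maxConnections : 5;
        return new ToyClient(userAgentCopy, maxCopy);
    }
}
```

A call such as ToyClient.custom().setUserAgent("my-crawler").build() then reads much like the HttpClients.custom() chain: nothing is decided until build() runs, and every unset option silently receives a default.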
The builder now exists, so what is it for? It builds the client that will execute our requests:
    CloseableHttpClient build = custom.build();
Looking further down:
    public CloseableHttpClient build() {
        // Create main request executor
        // We copy the instance fields to avoid changing them, and rename to avoid accidental use of the wrong version
        PublicSuffixMatcher publicSuffixMatcherCopy = this.publicSuffixMatcher;
        if (publicSuffixMatcherCopy == null) {
            publicSuffixMatcherCopy = PublicSuffixMatcherLoader.getDefault();
        }
        HttpRequestExecutor requestExecCopy = this.requestExec;
        if (requestExecCopy == null) {
            requestExecCopy = new HttpRequestExecutor();
        }
        HttpClientConnectionManager connManagerCopy = this.connManager;
        if (connManagerCopy == null) {
            LayeredConnectionSocketFactory sslSocketFactoryCopy = this.sslSocketFactory;
            if (sslSocketFactoryCopy == null) {
                final String[] supportedProtocols = systemProperties ? split(
                        System.getProperty("https.protocols")) : null;
                final String[] supportedCipherSuites = systemProperties ? split(
                        System.getProperty("https.cipherSuites")) : null;
                HostnameVerifier hostnameVerifierCopy = this.hostnameVerifier;
                if (hostnameVerifierCopy == null) {
                    hostnameVerifierCopy = new DefaultHostnameVerifier(publicSuffixMatcherCopy);
                }
                if (sslcontext != null) {
                    sslSocketFactoryCopy = new SSLConnectionSocketFactory(
                            sslcontext, supportedProtocols, supportedCipherSuites, hostnameVerifierCopy);
                } else {
                    if (systemProperties) {
                        sslSocketFactoryCopy = new SSLConnectionSocketFactory(
                                (SSLSocketFactory) SSLSocketFactory.getDefault(),
                                supportedProtocols, supportedCipherSuites, hostnameVerifierCopy);
                    } else {
                        sslSocketFactoryCopy = new SSLConnectionSocketFactory(
                                SSLContexts.createDefault(),
                                hostnameVerifierCopy);
                    }
                }
            }
            @SuppressWarnings("resource")
            final PoolingHttpClientConnectionManager poolingmgr = new PoolingHttpClientConnectionManager(
                    RegistryBuilder.<ConnectionSocketFactory>create()
                        .register("http", PlainConnectionSocketFactory.getSocketFactory())
                        .register("https", sslSocketFactoryCopy)
                        .build(),
                    null,
                    null,
                    null,
                    connTimeToLive,
                    connTimeToLiveTimeUnit != null ? connTimeToLiveTimeUnit : TimeUnit.MILLISECONDS);
            if (defaultSocketConfig != null) {
                poolingmgr.setDefaultSocketConfig(defaultSocketConfig);
            }
            if (defaultConnectionConfig != null) {
                poolingmgr.setDefaultConnectionConfig(defaultConnectionConfig);
            }
            if (systemProperties) {
                String s = System.getProperty("http.keepAlive", "true");
                if ("true".equalsIgnoreCase(s)) {
                    s = System.getProperty("http.maxConnections", "5");
                    final int max = Integer.parseInt(s);
                    poolingmgr.setDefaultMaxPerRoute(max);
                    poolingmgr.setMaxTotal(2 * max);
                }
            }
            if (maxConnTotal > 0) {
                poolingmgr.setMaxTotal(maxConnTotal);
            }
            if (maxConnPerRoute > 0) {
                poolingmgr.setDefaultMaxPerRoute(maxConnPerRoute);
            }
            connManagerCopy = poolingmgr;
        }
        ConnectionReuseStrategy reuseStrategyCopy = this.reuseStrategy;
        if (reuseStrategyCopy == null) {
            if (systemProperties) {
                final String s = System.getProperty("http.keepAlive", "true");
                if ("true".equalsIgnoreCase(s)) {
                    reuseStrategyCopy = DefaultConnectionReuseStrategy.INSTANCE;
                } else {
                    reuseStrategyCopy = NoConnectionReuseStrategy.INSTANCE;
                }
            } else {
                reuseStrategyCopy = DefaultConnectionReuseStrategy.INSTANCE;
            }
        }
        ConnectionKeepAliveStrategy keepAliveStrategyCopy = this.keepAliveStrategy;
        if (keepAliveStrategyCopy == null) {
            keepAliveStrategyCopy = DefaultConnectionKeepAliveStrategy.INSTANCE;
        }
        AuthenticationStrategy targetAuthStrategyCopy = this.targetAuthStrategy;
        if (targetAuthStrategyCopy == null) {
            targetAuthStrategyCopy = TargetAuthenticationStrategy.INSTANCE;
        }
        AuthenticationStrategy proxyAuthStrategyCopy = this.proxyAuthStrategy;
        if (proxyAuthStrategyCopy == null) {
            proxyAuthStrategyCopy = ProxyAuthenticationStrategy.INSTANCE;
        }
        UserTokenHandler userTokenHandlerCopy = this.userTokenHandler;
        if (userTokenHandlerCopy == null) {
            if (!connectionStateDisabled) {
                userTokenHandlerCopy = DefaultUserTokenHandler.INSTANCE;
            } else {
                userTokenHandlerCopy = NoopUserTokenHandler.INSTANCE;
            }
        }
        String userAgentCopy = this.userAgent;
        if (userAgentCopy == null) {
            if (systemProperties) {
                userAgentCopy = System.getProperty("http.agent");
            }
            if (userAgentCopy == null) {
                userAgentCopy = VersionInfo.getUserAgent("Apache-HttpClient",
                        "org.apache.http.client", getClass());
            }
        }
        ClientExecChain execChain = createMainExec(
                requestExecCopy,
                connManagerCopy,
                reuseStrategyCopy,
                keepAliveStrategyCopy,
                new ImmutableHttpProcessor(new RequestTargetHost(), new RequestUserAgent(userAgentCopy)),
                targetAuthStrategyCopy,
                proxyAuthStrategyCopy,
                userTokenHandlerCopy);
        execChain = decorateMainExec(execChain);
        HttpProcessor httpprocessorCopy = this.httpprocessor;
        if (httpprocessorCopy == null) {
            final HttpProcessorBuilder b = HttpProcessorBuilder.create();
            if (requestFirst != null) {
                for (final HttpRequestInterceptor i: requestFirst) {
                    b.addFirst(i);
                }
            }
            if (responseFirst != null) {
                for (final HttpResponseInterceptor i: responseFirst) {
                    b.addFirst(i);
                }
            }
            b.addAll(
                    new RequestDefaultHeaders(defaultHeaders),
                    new RequestContent(),
                    new RequestTargetHost(),
                    new RequestClientConnControl(),
                    new RequestUserAgent(userAgentCopy),
                    new RequestExpectContinue());
            if (!cookieManagementDisabled) {
                b.add(new RequestAddCookies());
            }
            if (!contentCompressionDisabled) {
                if (contentDecoderMap != null) {
                    final List<String> encodings = new ArrayList<String>(contentDecoderMap.keySet());
                    Collections.sort(encodings);
                    b.add(new RequestAcceptEncoding(encodings));
                } else {
                    b.add(new RequestAcceptEncoding());
                }
            }
            if (!authCachingDisabled) {
                b.add(new RequestAuthCache());
            }
            if (!cookieManagementDisabled) {
                b.add(new ResponseProcessCookies());
            }
            if (!contentCompressionDisabled) {
                if (contentDecoderMap != null) {
                    final RegistryBuilder<InputStreamFactory> b2 = RegistryBuilder.create();
                    for (Map.Entry<String, InputStreamFactory> entry: contentDecoderMap.entrySet()) {
                        b2.register(entry.getKey(), entry.getValue());
                    }
                    b.add(new ResponseContentEncoding(b2.build()));
                } else {
                    b.add(new ResponseContentEncoding());
                }
            }
            if (requestLast != null) {
                for (final HttpRequestInterceptor i: requestLast) {
                    b.addLast(i);
                }
            }
            if (responseLast != null) {
                for (final HttpResponseInterceptor i: responseLast) {
                    b.addLast(i);
                }
            }
            httpprocessorCopy = b.build();
        }
        execChain = new ProtocolExec(execChain, httpprocessorCopy);
        execChain = decorateProtocolExec(execChain);
        // Add request retry executor, if not disabled
        if (!automaticRetriesDisabled) {
            HttpRequestRetryHandler retryHandlerCopy = this.retryHandler;
            if (retryHandlerCopy == null) {
                retryHandlerCopy = DefaultHttpRequestRetryHandler.INSTANCE;
            }
            execChain = new RetryExec(execChain, retryHandlerCopy);
        }
        HttpRoutePlanner routePlannerCopy = this.routePlanner;
        if (routePlannerCopy == null) {
            SchemePortResolver schemePortResolverCopy = this.schemePortResolver;
            if (schemePortResolverCopy == null) {
                schemePortResolverCopy = DefaultSchemePortResolver.INSTANCE;
            }
            if (proxy != null) {
                routePlannerCopy = new DefaultProxyRoutePlanner(proxy, schemePortResolverCopy);
            } else if (systemProperties) {
                routePlannerCopy = new SystemDefaultRoutePlanner(
                        schemePortResolverCopy, ProxySelector.getDefault());
            } else {
                routePlannerCopy = new DefaultRoutePlanner(schemePortResolverCopy);
            }
        }
        // Add redirect executor, if not disabled
        if (!redirectHandlingDisabled) {
            RedirectStrategy redirectStrategyCopy = this.redirectStrategy;
            if (redirectStrategyCopy == null) {
                redirectStrategyCopy = DefaultRedirectStrategy.INSTANCE;
            }
            execChain = new RedirectExec(execChain, routePlannerCopy, redirectStrategyCopy);
        }
        // Optionally, add service unavailable retry executor
        final ServiceUnavailableRetryStrategy serviceUnavailStrategyCopy = this.serviceUnavailStrategy;
        if (serviceUnavailStrategyCopy != null) {
            execChain = new ServiceUnavailableRetryExec(execChain, serviceUnavailStrategyCopy);
        }
        // Optionally, add connection back-off executor
        if (this.backoffManager != null && this.connectionBackoffStrategy != null) {
            execChain = new BackoffStrategyExec(execChain, this.connectionBackoffStrategy, this.backoffManager);
        }
        Lookup<AuthSchemeProvider> authSchemeRegistryCopy = this.authSchemeRegistry;
        if (authSchemeRegistryCopy == null) {
            authSchemeRegistryCopy = RegistryBuilder.<AuthSchemeProvider>create()
                .register(AuthSchemes.BASIC, new BasicSchemeFactory())
                .register(AuthSchemes.DIGEST, new DigestSchemeFactory())
                .register(AuthSchemes.NTLM, new NTLMSchemeFactory())
                .register(AuthSchemes.SPNEGO, new SPNegoSchemeFactory())
                .register(AuthSchemes.KERBEROS, new KerberosSchemeFactory())
                .build();
        }
        Lookup<CookieSpecProvider> cookieSpecRegistryCopy = this.cookieSpecRegistry;
        if (cookieSpecRegistryCopy == null) {
            final CookieSpecProvider defaultProvider = new DefaultCookieSpecProvider(publicSuffixMatcherCopy);
            final CookieSpecProvider laxStandardProvider = new RFC6265CookieSpecProvider(
                    RFC6265CookieSpecProvider.CompatibilityLevel.RELAXED, publicSuffixMatcherCopy);
            final CookieSpecProvider strictStandardProvider = new RFC6265CookieSpecProvider(
                    RFC6265CookieSpecProvider.CompatibilityLevel.STRICT, publicSuffixMatcherCopy);
            cookieSpecRegistryCopy = RegistryBuilder.<CookieSpecProvider>create()
                .register(CookieSpecs.DEFAULT, defaultProvider)
                .register("best-match", defaultProvider)
                .register("compatibility", defaultProvider)
                .register(CookieSpecs.STANDARD, laxStandardProvider)
                .register(CookieSpecs.STANDARD_STRICT, strictStandardProvider)
                .register(CookieSpecs.NETSCAPE, new NetscapeDraftSpecProvider())
                .register(CookieSpecs.IGNORE_COOKIES, new IgnoreSpecProvider())
                .build();
        }
        CookieStore defaultCookieStore = this.cookieStore;
        if (defaultCookieStore == null) {
            defaultCookieStore = new BasicCookieStore();
        }
        CredentialsProvider defaultCredentialsProvider = this.credentialsProvider;
        if (defaultCredentialsProvider == null) {
            if (systemProperties) {
                defaultCredentialsProvider = new SystemDefaultCredentialsProvider();
            } else {
                defaultCredentialsProvider = new BasicCredentialsProvider();
            }
        }
        List<Closeable> closeablesCopy = closeables != null ? new ArrayList<Closeable>(closeables) : null;
        if (!this.connManagerShared) {
            if (closeablesCopy == null) {
                closeablesCopy = new ArrayList<Closeable>(1);
            }
            final HttpClientConnectionManager cm = connManagerCopy;
            if (evictExpiredConnections || evictIdleConnections) {
                final IdleConnectionEvictor connectionEvictor = new IdleConnectionEvictor(cm,
                        maxIdleTime > 0 ? maxIdleTime : 10, maxIdleTimeUnit != null ? maxIdleTimeUnit : TimeUnit.SECONDS);
                closeablesCopy.add(new Closeable() {
                    @Override
                    public void close() throws IOException {
                        connectionEvictor.shutdown();
                    }
                });
                connectionEvictor.start();
            }
            closeablesCopy.add(new Closeable() {
                @Override
                public void close() throws IOException {
                    cm.shutdown();
                }
            });
        }
        return new InternalHttpClient(
                execChain,
                connManagerCopy,
                routePlannerCopy,
                cookieSpecRegistryCopy,
                authSchemeRegistryCopy,
                defaultCookieStore,
                defaultCredentialsProvider,
                defaultRequestConfig != null ? defaultRequestConfig : RequestConfig.DEFAULT,
                closeablesCopy);
    }
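That much code probably makes your head spin. In essence, build() only assembles what the client needs for networking: a pooled connection manager with HTTP and HTTPS socket factories, connection reuse and keep-alive strategies, retry and redirect handling, plus cookie and authentication support. No connection is opened at this point; that only happens when a request is executed.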
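One pattern worth noticing in build() is that execChain is reassigned again and again, each time wrapped by another executor (ProtocolExec, RetryExec, RedirectExec and so on). The miniature below is a hypothetical sketch of that wrapping idea using plain strings instead of HTTP messages; ToyExec, ExecChainSketch, and all method names are assumptions for illustration, not HttpClient classes.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical miniature of the decorator chain assembled in build(); names are illustrative.
interface ToyExec {
    String execute(String request) throws IOException;
}

final class ExecChainSketch {
    // Innermost executor: pretends to talk to the network and fails a fixed number of times.
    static ToyExec flakyMain(int failures, List<String> log) {
        int[] remaining = {failures};
        return request -> {
            log.add("main");
            if (remaining[0]-- > 0) {
                throw new IOException("simulated I/O failure");
            }
            return "response for " + request;
        };
    }

    // Wrapper in the spirit of ProtocolExec: adjusts the request before passing it on.
    static ToyExec withUserAgent(ToyExec inner, String userAgent) {
        return request -> inner.execute(request + " [User-Agent: " + userAgent + "]");
    }

    // Wrapper in the spirit of RetryExec: runs the inner executor again after an IOException.
    static ToyExec retrying(ToyExec inner, int maxRetries, List<String> log) {
        return request -> {
            for (int attempt = 0; ; attempt++) {
                try {
                    return inner.execute(request);
                } catch (IOException e) {
                    if (attempt >= maxRetries) {
                        throw e; // give up and let the caller see the failure
                    }
                    log.add("retry");
                }
            }
        };
    }
}
```

Wrapping in the same order as build() does (main executor first, then the protocol wrapper, then the retry wrapper) means the outermost layer sees a request first, and a retry re-runs every layer beneath it.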
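Next comes the request. Seeing the line below, you should think of a GET request, and you would be right:
    HttpUriRequest request = new HttpGet(url);
The constructor only turns the string into a URI and stores it; nothing has been sent yet:
    public HttpGet(final String uri) {
        super();
        setURI(URI.create(uri));
    }
With the client built and the request prepared, what remains is to execute the request and download the response.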
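Because the HttpGet(String) constructor calls URI.create, a malformed URL string fails right there with an IllegalArgumentException, before any network activity. The helper below is a small JDK-only sketch of that behavior; the class UriCreateSketch and the method isParsable are hypothetical names introduced for illustration.

```java
import java.net.URI;

// Hypothetical helper showing how URI.create reacts to good and bad input.
final class UriCreateSketch {
    // Returns true when the string can be parsed by URI.create.
    static boolean isParsable(String url) {
        try {
            URI.create(url);
            return true;
        } catch (IllegalArgumentException e) {
            // URI.create wraps the underlying URISyntaxException in IllegalArgumentException
            return false;
        }
    }
}
```

For a crawler this matters because links scraped from pages often contain spaces or other illegal characters; checking or encoding them before constructing HttpGet avoids an unchecked exception in the middle of a crawl.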
    CloseableHttpResponse execute = build.execute(request);
Let's dig all the way to the bottom of this call to see how the page gets downloaded over the network.
    @Override
    public CloseableHttpResponse execute(
            final HttpUriRequest request) throws IOException, ClientProtocolException {
        return execute(request, (HttpContext) null);
    }
Further down:
    public CloseableHttpResponse execute(
            final HttpUriRequest request,
            final HttpContext context) throws IOException, ClientProtocolException {
        Args.notNull(request, "HTTP request");
        return doExecute(determineTarget(request), request, context);
    }
And one more level down:
    private static HttpHost determineTarget(final HttpUriRequest request) throws ClientProtocolException {
        // A null target may be acceptable if there is a default target.
        // Otherwise, the null target is detected in the director.
        HttpHost target = null;
        final URI requestURI = request.getURI();
        if (requestURI.isAbsolute()) {
            target = URIUtils.extractHost(requestURI);
            if (target == null) {
                throw new ClientProtocolException("URI does not specify a valid host name: "
                        + requestURI);
            }
        }
        return target;
    }
determineTarget only extracts the target host from the request URI. The actual network I/O happens inside doExecute, which runs the request through the executor chain assembled in build().
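The logic of determineTarget can be approximated with JDK classes alone, which makes it easy to experiment with. This is only a sketch under assumptions: TargetSketch and targetOf are hypothetical names, the target is returned as a plain string rather than an HttpHost, and an IllegalArgumentException stands in for ClientProtocolException.

```java
import java.net.URI;

// Hypothetical JDK-only approximation of determineTarget; not the HttpClient implementation.
final class TargetSketch {
    // An absolute URI yields "scheme://host[:port]"; a relative URI yields null.
    static String targetOf(String url) {
        URI uri = URI.create(url);
        if (!uri.isAbsolute()) {
            return null; // no scheme, so no target can be derived from the URI itself
        }
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URI does not specify a valid host name: " + uri);
        }
        int port = uri.getPort(); // -1 when the URI has no explicit port
        return uri.getScheme() + "://" + host + (port >= 0 ? ":" + port : "");
    }
}
```

The path and query string play no part in choosing the target; only the scheme, host, and port do.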
Finally, we get the message entity:
    HttpEntity entity = execute.getEntity();
    /**
     * Obtains the message entity of this response, if any.
     * The entity is provided by calling {@link #setEntity setEntity}.
     *
     * @return  the response entity, or
     *          {@code null} if there is none
     */
    HttpEntity getEntity();
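The Javadoc above notes that getEntity() may return null, and turning the entity into a string also involves picking a charset. As far as I know, in HttpClient 4.x EntityUtils.toString(entity) falls back to ISO-8859-1 when the response declares no charset, which can garble UTF-8 pages. The JDK-only sketch below illustrates both points; BodyDecodeSketch and readBody are hypothetical names, and a plain InputStream stands in for the entity content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical sketch: null-safe body reading with an explicit charset.
final class BodyDecodeSketch {
    // Returns null when there is no body, mirroring a null entity.
    static String readBody(InputStream body, Charset charset) throws IOException {
        if (body == null) {
            return null;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = body.read(chunk)) != -1) { // read() returns -1 at end of stream
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```

Decoding the same bytes with the wrong charset does not fail; it silently produces different characters, so it is worth deciding on the charset explicitly when crawling non-English pages.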
That completes the walkthrough of the download flow and its code. I will put the implementation on my Gitee (码云) account.