A Simple Web Crawler in Scala with HttpClient

biww620

Category: Scala. Tags: scala, crawler, functional programming

Article link: https://blog.csdn.net/biww620/article/details/77938575

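Scala is an "object-oriented functional" language. For anyone unfamiliar with the functional programming style, getting used to Scala does take some time, and the only real remedy is to read and write a lot of code. Below is a small, simple crawler program written in Scala with HttpClient.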

package com.eric.crawler

import java.io.InputStream

import com.eric.Response
import org.apache.http.HttpEntity
import org.apache.http.client.methods.{CloseableHttpResponse, HttpGet}
import org.apache.http.impl.client.{CloseableHttpClient, HttpClients}
import org.apache.http.util.EntityUtils

import scala.io.Source

object Crawler {

  val httpClient : CloseableHttpClient = HttpClients.createDefault()

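  /**
    * Fetches a web page with an HTTP GET request.
    * @param url the URL of the page to fetch
    * @return a Response holding the status code, the URL and the page content
    */
  def doGet(url : String) : Response = {
    val httpGet : HttpGet = new HttpGet(url)     // build the GET request
    val httpResponse : CloseableHttpResponse = httpClient.execute(httpGet)
    try {
      val status : Int = httpResponse.getStatusLine.getStatusCode
      val httpEntity : HttpEntity = httpResponse.getEntity
      val inputStream : InputStream = httpEntity.getContent
      // read the InputStream into a String (decoded with the platform default charset)
      val pageContent : String = Source.fromInputStream(inputStream).mkString
      EntityUtils.consume(httpEntity)   // make sure the entity's InputStream is fully consumed and closed
      new Response(status, url, pageContent)
    } finally {
      httpResponse.close()   // release the connection even if reading the body fails
    }
  }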


  def main(args : Array[String]) : Unit = {
    val seed : String = "put the URL here..."
    val resp : Response = Crawler.doGet(seed)
    println(resp.content)
  }

}
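
The listing imports com.eric.Response but never shows it. Below is a minimal sketch of what that class might look like, inferred only from how Crawler uses it: the constructor takes (status, url, content) in that order, and main reads resp.content. The field names status and url, and the isSuccess helper, are assumptions added for illustration and are not the author's actual code.

```scala
// Hypothetical sketch of the Response class used by Crawler.
// In the original project it would live in package com.eric.
// Only the `content` field and the (status, url, content) argument order
// are visible in the article; `status` and `url` are assumed names.
class Response(val status: Int, val url: String, val content: String) {

  // Convenience check (not in the original): true for 2xx status codes.
  def isSuccess: Boolean = status >= 200 && status < 300
}
```

With a class along these lines, main could check resp.isSuccess before printing resp.content, so that error pages are not mistaken for real page content.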