Scraping web page content with Node and cheerio

白嫖leader

Category: Notes. Tags: web scraping, front end, javascript

Article link: https://blog.csdn.net/ksjdbdh/article/details/122363089

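cheerio is a near-indispensable package for writing a scraper in Node: it parses the fetched HTML and lets you query it with jQuery-style selectors.
The page to scrape today is:
http://blog.sina.com.cn/s/blog_4d30d65b01009rn5.html
The content to scrape is the list items under .info_list2 on that page, shown in the screenshot below:
[Image: the part of the page to be scraped]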

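// Two ways to fetch the HTML of a page served over http
const cheerio = require("cheerio")
const http = require("http") // only needed by the commented-out variant below
const fs = require("fs")
const path = require("path")
const axios = require("axios")

const url = "http://blog.sina.com.cn/s/blog_4d30d65b01009rn5.html"

// Option 1: the built-in http module. More verbose and not recommended;
// http.get cannot request https URLs (the built-in https module handles those).
// http
//   .get(url, (res) => {
//     res.setEncoding("utf8") // assumes a UTF-8 page; avoids splitting multi-byte characters
//     let rawData = ""
//     res.on("data", (chunk) => {
//       rawData += chunk
//     })
//     res.on("end", () => {
//       try {
//         getData(rawData)
//       } catch (e) {
//         console.error(e.message)
//       }
//     })
//   })
//   .on("error", (e) => {
//     console.error(`Request failed: ${e.message}`)
//   })

// Option 2: axios. Much more convenient, and it handles https URLs as well.
axios
  .get(url)
  .then((response) => {
    try {
      console.log(response.status) // log the status rather than the whole response object
      getData(response.data)
    } catch (e) {
      console.error(e.message)
    }
  })
  .catch((error) => {
    // Network errors and non-2xx responses end up here
    console.error(error.message)
  })

function getData(data) {
  // Load the fetched HTML into cheerio; $ then works like jQuery
  const $ = cheerio.load(data)
  const items = $(".info_list2 li")
  console.log(`matched ${items.length} items`)
  const arr = []
  items.each((index, item) => {
    // The key comes from the <span>, the value from the <strong>
    const key = $(item).find("span").text()
    const val = $(item).find("strong").text()
    arr.push({ key, val })
  })
  // Save the result as JSON next to this script
  fs.writeFile(path.join(__dirname, "content.txt"), JSON.stringify(arr), (err) => {
    if (err) console.error(err) // err is null on success, so only log real failures
  })
  console.log(arr)
}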
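
The commented-out variant in the script above is limited to http URLs because http.get only speaks plain HTTP; Node ships a separate https module with the same get interface. Below is a minimal sketch of choosing between the two by URL scheme, using only built-in modules. The helper names pickClient and fetchText are illustrative and not part of the original script, and the sketch assumes a UTF-8 page and does not follow redirects.

```javascript
const http = require("http")
const https = require("https")

// Hypothetical helper: choose the built-in client that matches the URL scheme.
function pickClient(targetUrl) {
  const { protocol } = new URL(targetUrl)
  if (protocol === "https:") return https
  if (protocol === "http:") return http
  throw new Error(`Unsupported protocol: ${protocol}`)
}

// Hypothetical helper: resolve with the response body as a string.
function fetchText(targetUrl) {
  return new Promise((resolve, reject) => {
    pickClient(targetUrl)
      .get(targetUrl, (res) => {
        if (res.statusCode !== 200) {
          res.resume() // discard the body so the socket is released
          reject(new Error(`Unexpected status code: ${res.statusCode}`))
          return
        }
        res.setEncoding("utf8") // assumption: the page is UTF-8
        let body = ""
        res.on("data", (chunk) => {
          body += chunk
        })
        res.on("end", () => resolve(body))
      })
      .on("error", reject)
  })
}
```

With this in place, fetchText(url).then(getData) could stand in for the axios call, at the cost of handling redirects and other encodings yourself.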

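After running the scraper script, you will find a new file, content.txt, in the same directory; it contains the scraped content:
[Image: contents of content.txt]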
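
Since content.txt is just the JSON.stringify output, it can be read back and parsed to check what was saved. Here is a small sketch of that round trip using the synchronous fs API, which is usually fine for a one-off script. The helper names saveRecords and loadRecords, the sample record, and the temporary directory are all illustrative assumptions, not part of the original script.

```javascript
const fs = require("fs")
const os = require("os")
const path = require("path")

// Hypothetical helper: write the scraped records to a file as JSON.
function saveRecords(file, records) {
  fs.writeFileSync(file, JSON.stringify(records), "utf8")
}

// Hypothetical helper: read the file back and parse it into an array of { key, val } objects.
function loadRecords(file) {
  return JSON.parse(fs.readFileSync(file, "utf8"))
}

// Round trip with made-up sample data in a temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "scrape-"))
const file = path.join(dir, "content.txt")
saveRecords(file, [{ key: "sample label", val: "sample value" }])
const records = loadRecords(file)
console.log(records.length, records[0].key)
```

To inspect the real output, loadRecords can be pointed at the content.txt written by the scraper instead of the temporary file.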
