Writing a Web Crawler in JavaScript


别清兵你会死

Category: nodejs. Tags: nodejs

Article link: https://blog.csdn.net/weixin_44796147/article/details/104141620

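Fetching content

Fetching API endpoints with http / https / axios
Server-side rendering (ejs, jsp, thymeleaf…): the server renders the complete page and then hands it to the front end for display.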

const axios = require('axios')
// call this inside an async function; data is the response body (JSON is parsed automatically)
const { headers, data } = await axios.get(url)

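Fetching HTML with request / crawl / superagent
Client-side rendering (vue, react…): the server returns an empty page, the page data, and JS files, and the front end does the rendering. Such pages cannot be scraped this way, because the content is not in the returned HTML.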

const request = require('request')
request(url, (err, response, body) => {
    if (err) return console.error(err)
    // body is the HTML string
    const regex = /class="title" data-v-\w+>(.+?)<\/a>/g
    // collect capture group 1 (the title text) from every match
    const titles = Array.from(body.matchAll(regex), match => match[1])
    console.log(titles)
})
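
To make the regular expression above concrete, the sketch below runs the same pattern against a small, made-up HTML fragment. The `extractTitles` helper and the sample markup are illustrative assumptions, not part of the original post; real pages will differ, and regex-based HTML parsing is fragile in general.

```javascript
// Hypothetical helper: collects the text of every <a class="title" data-v-...> link.
function extractTitles(html) {
    const regex = /class="title" data-v-\w+>(.+?)<\/a>/g
    return Array.from(html.matchAll(regex), match => match[1])
}

// Made-up fragment that imitates Vue's scoped-style attributes (data-v-xxxx).
const sampleHtml = [
    '<a href="/post/1" class="title" data-v-1a2b3c>First post</a>',
    '<a href="/post/2" class="title" data-v-1a2b3c>Second post</a>',
    '<a href="/about" class="nav">About</a>'
].join('\n')

console.log(extractTitles(sampleHtml))
```

Only the two links carrying both `class="title"` and a `data-v-` attribute are picked up; the third link is ignored.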

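Controlling Chromium with puppeteer (do whatever you want)
The browser's behavior can be driven through an API (crawlers, automatic check-ins, page screenshots, PDF generation, automated testing, and so on).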

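const fs = require('fs')
const puppeteer = require('puppeteer')

async function main() {
    // headless: false opens a visible browser window
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()
    await page.goto('https://juejin.im/welcome/frontend')
    // $$eval runs the callback in the page with an array of all elements matching the selector
    const titles = await page.$$eval('a.title', as => as.map(a => a.innerText).join('\r\n'))
    fs.writeFileSync('titles.txt', titles, 'utf8')
    await browser.close()
}

main()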

Data formatting, persistence, and indexing

Data transfer formats: json, form, formdata
Content-Type: application/json, application/x-www-form-urlencoded, multipart/form-data
Persist the data and build indexes
Connect to a database
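
As a rough illustration of the first two Content-Type values listed above, the sketch below serializes one payload both ways using only built-in JavaScript APIs. The payload and its field names are hypothetical.

```javascript
// Hypothetical payload; the field names are for illustration only.
const payload = { keyword: 'nodejs', page: 2 }

// application/json: the body is a JSON string.
const jsonBody = JSON.stringify(payload)

// application/x-www-form-urlencoded: the body is key=value pairs joined by "&".
const formBody = new URLSearchParams({ keyword: 'nodejs', page: '2' }).toString()

console.log(jsonBody)
console.log(formBody)
```

A multipart/form-data body is usually not built by hand; it is typically produced from a FormData object, which also generates the boundary string that goes into the header.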

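Character encoding problems (iconv-lite)
request converts the response buffer to a string internally and assumes UTF-8 by default, so pages in other encodings come out garbled unless the raw buffer is decoded explicitly. (For background: GBK extends GB2312, and GB18030 extends GBK.)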

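const request = require('request')
const iconv = require('iconv-lite')
// cheerio selects HTML elements with a jQuery-like API
const cheerio = require('cheerio')
// encoding: null makes request return the body as a raw Buffer instead of a UTF-8 string
request({ url, encoding: null }, (err, response, body) => {
    if (err) return console.error(err)
    const html = iconv.decode(body, 'gbk')
    const $ = cheerio.load(html)
    const titles = []
    $('a.list-title').each((index, item) => {
        titles.push($(item).text())
    })
    console.log(titles)
})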
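
The effect of decoding with the wrong charset can be reproduced without any network request. The sketch below swaps in Node's built-in `TextDecoder` for iconv-lite purely so that it runs with no dependencies; it assumes a Node build with full ICU data, otherwise the 'gbk' label is rejected. The byte values are the GBK encoding of the two characters 中文.

```javascript
// GBK bytes for the two characters "中文" (0xD6D0 and 0xCEC4).
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4])

// Decoding as UTF-8 (the default described above) does not recover the text.
const wrong = gbkBytes.toString('utf8')

// Decoding with the matching charset does.
// Assumption: this Node build ships full ICU, so the 'gbk' label is supported.
const right = new TextDecoder('gbk').decode(gbkBytes)

console.log(wrong === '中文', right === '中文')
```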

Data subscription and publishing

Data subscription
The cron package plays a role similar to Spring's org.springframework.scheduling.annotation.Scheduled.
cron runs a job periodically.

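const { CronJob } = require('cron')
// six fields: second minute hour day-of-month month day-of-week; '* * * * * *' fires every second
new CronJob('* * * * * *', () => { console.log('tick') }).start()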

For example, to run once at 10 p.m. on the 8th of every month: '0 0 22 8 * *'
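
The patterns used here have six space-separated fields, with seconds first. The hypothetical `describeCron` helper below only labels the fields of a pattern to make the example easier to read; it does not validate ranges or schedule anything, and it is not part of the cron package.

```javascript
// Field names in the order used by six-field patterns (seconds first).
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek']

// Hypothetical helper: maps each field of a six-field pattern to its name.
function describeCron(pattern) {
    const parts = pattern.trim().split(/\s+/)
    if (parts.length !== FIELD_NAMES.length) {
        throw new Error(`expected ${FIELD_NAMES.length} fields, got ${parts.length}`)
    }
    return Object.fromEntries(FIELD_NAMES.map((name, i) => [name, parts[i]]))
}

console.log(describeCron('0 0 22 8 * *'))
```

Read this way, the pattern says: second 0, minute 0, hour 22, day 8, any month, any day of the week.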

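Preventing the process from exiting
In Node.js, errors raised in asynchronous I/O callbacks usually cannot be caught by a surrounding try/catch. They are handed to the 'uncaughtException' handler instead, and if none is registered the process exits (Node runs JavaScript on a single thread).
Listening for unknown errors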

process.on('uncaughtException', error => {
    console.log('Caught an unknown error', error)
})
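
One way to keep such errors from reaching 'uncaughtException' at all is to catch them inside the callback itself, where try/catch does apply. The hypothetical `guard` wrapper below is a sketch of that idea under that assumption, not something taken from the original post.

```javascript
// Hypothetical helper: wraps a callback so that anything it throws
// is passed to onError instead of escaping as an uncaught exception.
function guard(callback, onError) {
    return (...args) => {
        try {
            return callback(...args)
        } catch (error) {
            onError(error)
        }
    }
}

const errors = []
const risky = guard(
    () => { throw new Error('parse failed') },
    error => errors.push(error.message)
)

// The throw happens later, inside the timer callback, but guard still catches it.
setTimeout(risky, 0)
```

The global handler remains useful as a last line of defense for callbacks that were not wrapped.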

Process management (pm2)
