A Complete Walkthrough of Writing a Web Scraper in TypeScript

Latest recommended article published on 2023-04-22 15:36:15

_阿锋丶


Category column: stumbling in typescript

Article link: https://blog.csdn.net/weixin_43342105/article/details/110441368


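Preface

I have been studying TypeScript intensively for the past few days and have come to appreciate how much it has to offer. Along the way I discovered that TypeScript can also be used to write web scrapers, which delighted me, so this post records my first TypeScript scraper.

Code repository on GitHub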

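Initializing the project

First create a folder named tsScrapy, then open a command-line window (cmd) inside it.

Run the following commands in order: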

npm init -y 


tsc --init

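These two commands generate the package.json and tsconfig.json files, respectively.

Next, a number of dependencies need to be installed with npm.

superagent is a lightweight Ajax (HTTP request) library; in this project it is used to fetch the page's HTML.

For convenience, you can copy the contents of my package.json over your own package.json, save it, and then run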

npm install

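Once this finishes, all the dependencies are installed.

Scraping movie titles, scores, and other data

I quickly picked a scraping practice site found online: https://ssr1.scrape.center/

I plan to scrape each movie's title, score, and release date from the site.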

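Code

Directory structure: crowller.ts and spiderAnalyzer.ts sit side by side in the source directory, and the scraped data is written to data/movie.json one level above it (as the paths in the code show).

crowller.ts is as follows: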

import superagent from 'superagent';
import path from 'path';
import fs from 'fs';

import SpiderAnalyzer from './spiderAnalyzer';

export interface Analyzer {
  analysis: (html: string, filePath: string) => string;
}

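class Crowller {
  // Results accumulate in data/movie.json, one level above this file's directory.
  private filePath = path.resolve(__dirname, '../data/movie.json');

  // Fetch the raw HTML of the target page.
  private async getRawHtml() {
    const res = await superagent.get(this.url);
    return res.text;
  }

  // Fetch the page, let the analyzer build the new file content, then save it.
  private async initSpider() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analysis(html, this.filePath);
    this.writeFile(fileContent);
  }

  private writeFile(content: string) {
    // Create the data directory on first run so that the write cannot fail with ENOENT.
    fs.mkdirSync(path.dirname(this.filePath), { recursive: true });
    fs.writeFileSync(this.filePath, content);
  }

  constructor(private url: string, private analyzer: Analyzer) {
    // initSpider is async: log failures instead of leaving an unhandled rejection.
    this.initSpider().catch((err) => console.error(err));
  }
}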
const url = `https://ssr1.scrape.center/`;

// SpiderAnalyzer is a singleton, so obtain it through getInstance() rather than `new`.
const analyzer = SpiderAnalyzer.getInstance();
new Crowller(url, analyzer);
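
Because Crowller depends only on the Analyzer interface, a different page could in principle be handled by passing in another analyzer. The following is a minimal sketch of that idea, not code from the original project: TitleAnalyzer is a hypothetical class that records only the page's <title>, and the interface is repeated here so that the snippet stands alone.

```typescript
// Same shape as the Analyzer interface exported from crowller.ts.
interface Analyzer {
  analysis: (html: string, filePath: string) => string;
}

// Hypothetical analyzer (not in the original project): keeps only the <title> text.
class TitleAnalyzer implements Analyzer {
  analysis(html: string, _filePath: string): string {
    const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
    const title = (match?.[1] ?? '').trim();
    return JSON.stringify({ title });
  }
}

// Any object with an `analysis` method can be handed to the crawler.
const titleAnalyzer: Analyzer = new TitleAnalyzer();
const sample = '<html><head><title> Scrape | Movie </title></head><body></body></html>';
console.log(titleAnalyzer.analysis(sample, 'unused.json')); // {"title":"Scrape | Movie"}
```

With such a class in place, the crawler would be started with new Crowller(url, new TitleAnalyzer()) and nothing in Crowller itself would need to change.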

spiderAnalyzer.ts

import cheerio from 'cheerio';
import fs from 'fs';
import { Analyzer } from './crowller';

interface Movie {
  title: string;
  score: string;
  publishTime: string;
}

interface MovieResult {
  time: number;
  data: Movie[];
}

interface Content {
  [propName: number]: Movie[];
}

export default class SpiderAnalyzer implements Analyzer {
  // Singleton pattern: only one analyzer instance is ever needed for this site.
  private static instance: SpiderAnalyzer;
  static getInstance() {
    if (!SpiderAnalyzer.instance) {
      SpiderAnalyzer.instance = new SpiderAnalyzer();
    }
    return SpiderAnalyzer.instance;
  }

  // Extract the title, score, and release date from every movie card on the page.
  private getVedioInfo(html: string): MovieResult {
    const $ = cheerio.load(html);
    const vedioItem = $('.el-card');
    const movieArr: Movie[] = [];
    // each() rather than map(): we iterate only for the side effect of filling movieArr.
    vedioItem.each((index, ele) => {
      const title = $(ele).find('.m-b-sm').text().trim();
      const score = $(ele).find('.score').text().trim();
      // The second .m-v-sm element holds the release date; it can be empty for some movies.
      const publishTime = $(ele).find('.m-v-sm').eq(1).text().trim();
      movieArr.push({ title, score, publishTime });
    });

    return {
      time: new Date().getTime(),
      data: movieArr,
    };
  }

  // Merge this run's result into whatever movie.json already contains, keyed by timestamp.
  private generateJson(movieRes: MovieResult, filePath: string): Content {
    let fileContent: Content = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
    }
    fileContent[movieRes.time] = movieRes.data;
    return fileContent;
  }

  // Entry point required by the Analyzer interface: returns the JSON text to be written to filePath.
  public analysis(html: string, filePath: string) {
    const movieRes = this.getVedioInfo(html);
    const fileContent = this.generateJson(movieRes, filePath);
    return JSON.stringify(fileContent);
  }

  private constructor() {}
}
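
The singleton mechanics used by SpiderAnalyzer can be seen in isolation in the sketch below. OnlyOne is a hypothetical stand-in class, reduced to the private constructor and the static getInstance() method; it is meant only to illustrate the pattern.

```typescript
// Hypothetical stand-in for SpiderAnalyzer, reduced to the singleton mechanics.
class OnlyOne {
  private static instance: OnlyOne | undefined;

  // A private constructor prevents `new OnlyOne()` outside the class.
  private constructor() {}

  static getInstance(): OnlyOne {
    // Create the instance lazily on the first call and reuse it afterwards.
    if (!OnlyOne.instance) {
      OnlyOne.instance = new OnlyOne();
    }
    return OnlyOne.instance;
  }
}

const first = OnlyOne.getInstance();
const second = OnlyOne.getInstance();
console.log(first === second); // true
// new OnlyOne(); // compile-time error: the constructor is private
```

This is also why crowller.ts calls SpiderAnalyzer.getInstance() instead of constructing the analyzer directly.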

Finally, run npm run dev in the command-line window.
The generated movie.json file looks like this:

{
    "1606821748200": [
        {
            "title": "霸王别姬 - Farewell My Concubine",
            "score": "9.5",
            "publishTime": "1993-07-26 上映"
        },
        {
            "title": "这个杀手不太冷 - Léon",
            "score": "9.5",
            "publishTime": "1994-09-14 上映"
        },
        {
            "title": "肖申克的救赎 - The Shawshank Redemption",
            "score": "9.5",
            "publishTime": "1994-09-10 上映"
        },
        {
            "title": "泰坦尼克号 - Titanic",
            "score": "9.5",
            "publishTime": "1998-04-03 上映"
        },
        {
            "title": "罗马假日 - Roman Holiday",
            "score": "9.5",
            "publishTime": "1953-08-20 上映"
        },
        {
            "title": "唐伯虎点秋香 - Flirting Scholar",
            "score": "9.5",
            "publishTime": "1993-07-01 上映"
        },
        {
            "title": "乱世佳人 - Gone with the Wind",
            "score": "9.5",
            "publishTime": "1939-12-15 上映"
        },
        {
            "title": "喜剧之王 - The King of Comedy",
            "score": "9.5",
            "publishTime": "1999-02-13 上映"
        },
        {
            "title": "楚门的世界 - The Truman Show",
            "score": "9.0",
            "publishTime": ""
        },
        {
            "title": "狮子王 - The Lion King",
            "score": "9.0",
            "publishTime": "1995-07-15 上映"
        }
    ]
}
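
Each run adds one more timestamp key to this file, and some entries (such as The Truman Show above) have an empty publishTime. The sketch below shows one way the data might be handled afterwards. The helpers addSnapshot, latestSnapshot, and releaseDate are hypothetical names introduced for illustration and are not part of the original project.

```typescript
interface Movie {
  title: string;
  score: string;
  publishTime: string;
}

// File layout: one array of movies per run, keyed by the run's timestamp.
type Snapshots = Record<string, Movie[]>;

// Return a copy of `existing` with one more run added.
function addSnapshot(existing: Snapshots, time: number, data: Movie[]): Snapshots {
  return { ...existing, [String(time)]: data };
}

// Pick the movies from the most recent run, or an empty list if there are none.
function latestSnapshot(snapshots: Snapshots): Movie[] {
  const keys = Object.keys(snapshots).map(Number);
  if (keys.length === 0) return [];
  return snapshots[String(Math.max(...keys))] ?? [];
}

// Extract the YYYY-MM-DD part of strings such as "1993-07-26 上映"; null if absent.
function releaseDate(publishTime: string): string | null {
  const match = /\d{4}-\d{2}-\d{2}/.exec(publishTime);
  return match ? match[0] : null;
}

const run1: Movie[] = [
  { title: '霸王别姬 - Farewell My Concubine', score: '9.5', publishTime: '1993-07-26 上映' },
];
const run2: Movie[] = [
  { title: '楚门的世界 - The Truman Show', score: '9.0', publishTime: '' },
];

let file: Snapshots = {};
file = addSnapshot(file, 1606821748200, run1);
file = addSnapshot(file, 1606821749200, run2); // second timestamp is made up for the demo

console.log(Object.keys(file).length); // 2
console.log(releaseDate(run1[0].publishTime)); // 1993-07-26
console.log(releaseDate(latestSnapshot(file)[0].publishTime)); // null
```

Returning null for a missing date keeps the empty-string case explicit, so later code has to decide how to treat movies without a release date.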
