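Preface
I have spent the past few days intensively studying TypeScript and have come to appreciate how much it has to offer. I happened to learn that TypeScript can also be used to write web scrapers, which delighted me, so this post records my first TypeScript scraper.
Initializing the project
First create a folder named tsScrapy, then open a command-line window (cmd) inside it.
Run the following commands in order: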
npm init -y
tsc --init
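After these two commands have generated package.json and tsconfig.json respectively,
several dependencies need to be installed with npm:
- superagent: a lightweight Ajax API, used here to fetch the page's HTML
For convenience, you can paste my package.json below over your own package.json, save it, and then run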
npm install
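After that, all the dependencies are installed.
Scraping movie titles, scores, and other data
I quickly picked a scraping practice site from the web:
https://ssr1.scrape.center/
I plan to scrape the movie titles, scores, and release dates from this site.
Code
The directory structure is as follows
crowller.ts is as follows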
import fs from 'fs';
import path from 'path';
import superagent from 'superagent';
import SpiderAnalyzer from './spiderAnalyzer';

// An analyzer turns raw HTML into the string that gets written to disk.
export interface Analyzer {
  analysis: (html: string, filePath: string) => string;
}

class Crowller {
  private filePath = path.resolve(__dirname, '../data/movie.json');

  constructor(private url: string, private analyzer: Analyzer) {
    // Report failures instead of leaving an unhandled promise rejection.
    this.initSpider().catch((err) => console.error(err));
  }

  // Download the page and return its HTML text.
  private async getRawHtml() {
    const res = await superagent.get(this.url);
    return res.text;
  }

  // Fetch, analyze, and persist in one pass.
  private async initSpider() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analysis(html, this.filePath);
    this.writeFile(fileContent);
  }

  private writeFile(content: string) {
    // writeFileSync does not create missing directories, so create data/ first.
    fs.mkdirSync(path.dirname(this.filePath), { recursive: true });
    fs.writeFileSync(this.filePath, content);
  }
}

const url = 'https://ssr1.scrape.center/';
const analyzer = SpiderAnalyzer.getInstance();
new Crowller(url, analyzer);
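Because Crowller depends only on the Analyzer interface, a different analyzer can be plugged in without changing the crawler class. The sketch below illustrates the idea in isolation. TitleAnalyzer and runAnalyzer are hypothetical names that are not part of the project: the first extracts only the page's title text with a regular expression, and the second stands in for Crowller by taking the HTML directly, so no network access or file I/O is involved.

```typescript
// Same shape as the Analyzer interface exported by crowller.ts.
interface Analyzer {
  analysis: (html: string, filePath: string) => string;
}

// Hypothetical analyzer: extracts only the <title> text of the page.
class TitleAnalyzer implements Analyzer {
  analysis(html: string, _filePath: string): string {
    const match = /<title>([\s\S]*?)<\/title>/i.exec(html);
    const title = match?.[1]?.trim() ?? '';
    return JSON.stringify({ title });
  }
}

// Hypothetical stand-in for Crowller: receives HTML instead of fetching it.
function runAnalyzer(html: string, analyzer: Analyzer): string {
  return analyzer.analysis(html, 'unused.json');
}

const sampleHtml = '<html><head><title> Scrape | Movie </title></head></html>';
const titleResult = runAnalyzer(sampleHtml, new TitleAnalyzer());
console.log(titleResult); // {"title":"Scrape | Movie"}
```

In the real project the same swap would amount to passing another Analyzer implementation as the second constructor argument of Crowller.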
spiderAnalyzer.ts
import cheerio from 'cheerio';
import fs from 'fs';
import { Analyzer } from './crowller';

interface Movie {
  title: string;
  score: string;
  publishTime: string;
}

interface MovieResult {
  time: number;
  data: Movie[];
}

// Maps a scrape timestamp (in ms) to the movies found in that run.
interface Content {
  [propName: number]: Movie[];
}

export default class SpiderAnalyzer implements Analyzer {
  // Singleton: only one analyzer instance is needed for this site.
  private static instance: SpiderAnalyzer;

  static getInstance() {
    if (!SpiderAnalyzer.instance) {
      SpiderAnalyzer.instance = new SpiderAnalyzer();
    }
    return SpiderAnalyzer.instance;
  }

  // Private constructor: instances can only be obtained via getInstance().
  private constructor() {}

  // Extract the title, score, and release date from every movie card.
  private getVedioInfo(html: string): MovieResult {
    const $ = cheerio.load(html);
    const vedioItem = $('.el-card');
    const movieArr: Movie[] = [];
    // each() rather than map(): the callback is used only for its side effect.
    vedioItem.each((_index, ele) => {
      const title = $(ele).find('.m-b-sm').text().trim();
      const score = $(ele).find('.score').text().trim();
      const publishTime = $(ele).find('.m-v-sm').eq(1).text().trim();
      movieArr.push({ title, score, publishTime });
    });
    return {
      time: new Date().getTime(),
      data: movieArr,
    };
  }

  // Merge this run into whatever movie.json already contains.
  private generateJson(movieRes: MovieResult, filePath: string): Content {
    let fileContent: Content = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
    }
    fileContent[movieRes.time] = movieRes.data;
    return fileContent;
  }

  public analysis(html: string, filePath: string) {
    const movieRes = this.getVedioInfo(html);
    const fileContent = this.generateJson(movieRes, filePath);
    return JSON.stringify(fileContent);
  }
}
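The private constructor together with the static getInstance method is what makes SpiderAnalyzer a singleton. As a minimal sketch of that mechanism on its own, the hypothetical Counter class below (an illustration only, not part of the project) shows that repeated calls to getInstance return the same object and therefore share state.

```typescript
class Counter {
  private static instance: Counter | undefined;
  private count = 0;

  // Private constructor: `new Counter()` is rejected outside this class.
  private constructor() {}

  static getInstance(): Counter {
    if (!Counter.instance) {
      Counter.instance = new Counter();
    }
    return Counter.instance;
  }

  increment(): number {
    this.count += 1;
    return this.count;
  }
}

const first = Counter.getInstance();
const second = Counter.getInstance();
first.increment();
const sharedCount = second.increment();
console.log(first === second, sharedCount); // true 2
```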
Finally, run npm run dev in the command-line window.
The generated movie.json file looks like this:
{
  "1606821748200": [
    {
      "title": "霸王别姬 - Farewell My Concubine",
      "score": "9.5",
      "publishTime": "1993-07-26 上映"
    },
    {
      "title": "这个杀手不太冷 - Léon",
      "score": "9.5",
      "publishTime": "1994-09-14 上映"
    },
    {
      "title": "肖申克的救赎 - The Shawshank Redemption",
      "score": "9.5",
      "publishTime": "1994-09-10 上映"
    },
    {
      "title": "泰坦尼克号 - Titanic",
      "score": "9.5",
      "publishTime": "1998-04-03 上映"
    },
    {
      "title": "罗马假日 - Roman Holiday",
      "score": "9.5",
      "publishTime": "1953-08-20 上映"
    },
    {
      "title": "唐伯虎点秋香 - Flirting Scholar",
      "score": "9.5",
      "publishTime": "1993-07-01 上映"
    },
    {
      "title": "乱世佳人 - Gone with the Wind",
      "score": "9.5",
      "publishTime": "1939-12-15 上映"
    },
    {
      "title": "喜剧之王 - The King of Comedy",
      "score": "9.5",
      "publishTime": "1999-02-13 上映"
    },
    {
      "title": "楚门的世界 - The Truman Show",
      "score": "9.0",
      "publishTime": ""
    },
    {
      "title": "狮子王 - The Lion King",
      "score": "9.0",
      "publishTime": "1995-07-15 上映"
    }
  ]
}
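Since every run adds a new timestamp key to movie.json, code that consumes the file will usually want only the most recent snapshot. The sketch below shows one possible way to do that. The latestSnapshot helper, the CleanMovie shape, and the inline sample string are illustrative assumptions and not part of the project. The helper also converts the score to a number and strips the trailing "上映" ("released") label from the date.

```typescript
interface Movie {
  title: string;
  score: string;
  publishTime: string;
}

// Timestamp (as a string key) -> movies scraped in that run.
type Content = Record<string, Movie[]>;

// Hypothetical cleaned-up shape for downstream use.
interface CleanMovie {
  title: string;
  score: number;
  releaseDate: string | null;
}

// Hypothetical helper: returns the newest snapshot in cleaned-up form.
function latestSnapshot(json: string): CleanMovie[] {
  const content = JSON.parse(json) as Content;
  const keys = Object.keys(content).sort((a, b) => Number(a) - Number(b));
  const newest = keys[keys.length - 1];
  if (newest === undefined) return [];
  return (content[newest] ?? []).map((m) => ({
    title: m.title,
    score: Number(m.score),
    // An empty publishTime becomes null instead of an empty string.
    releaseDate: m.publishTime.replace(/\s*上映$/, '') || null,
  }));
}

// Two entries copied from the output above, used here as sample input.
const sample = JSON.stringify({
  '1606821748200': [
    { title: '霸王别姬 - Farewell My Concubine', score: '9.5', publishTime: '1993-07-26 上映' },
    { title: '楚门的世界 - The Truman Show', score: '9.0', publishTime: '' },
  ],
});
const cleaned = latestSnapshot(sample);
console.log(cleaned);
```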