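This article shows how to crawl Juejin's site data, parse it, and turn it into sorted rankings.
The project started because I suddenly wanted to see which high-quality authors are on Juejin. So as not to miss a single expert, I chose to crawl the metadata of every article on the site, extract the authors, and rank them. Follow them and read their articles, all in one go!
Project repository: juejin-spider[1]. Stars and issues are welcome.
The Juejin spider and data analysis focus on the rankings and statistics below; click an entry to view it.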
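Total number of tags on Juejin[2]
Articles under each Juejin tag[3]
Juejin user ranking (top 5000)[4]
Articles ranked by comment count[5]
Articles ranked by like count[6]
Articles ranked by view count[7]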
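The three article rankings above come down to deduplicating the crawled articles and sorting them by one numeric field. Here is a minimal sketch, assuming the field names from the article metadata shown later in this post (objectId, viewsCount, commentsCount, collectionCount). The function name rankArticles and the sample data are hypothetical, and the real project may do this differently.

```javascript
// Sketch: rank articles by one numeric field, deduplicating by objectId.
// Assumption: an article carrying several tags shows up once per tag file,
// so the same objectId can occur more than once in the combined input.
function rankArticles(articles, field, limit = 100) {
  const unique = new Map();
  for (const article of articles) {
    if (!unique.has(article.objectId)) unique.set(article.objectId, article);
  }
  return [...unique.values()]
    .sort((a, b) => (b[field] || 0) - (a[field] || 0)) // descending
    .slice(0, limit);
}

// Hypothetical sample data, not taken from the site.
const sample = [
  { objectId: 'a', title: 'A', viewsCount: 267, commentsCount: 2 },
  { objectId: 'b', title: 'B', viewsCount: 900, commentsCount: 0 },
  { objectId: 'a', title: 'A', viewsCount: 267, commentsCount: 2 }, // duplicate from another tag
];

console.log(rankArticles(sample, 'viewsCount').map((x) => x.title)); // [ 'B', 'A' ]
```

Passing 'commentsCount' or 'collectionCount' as the field would give the comment and like rankings in the same way.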
First, the top 50 Juejin users; give them a follow. The top 5000 ranking is here[8].
Each entry reads: (rank) [level] [follower count] [company] username
(1) [4] [67909] [掘金] 阴明[9]
(2) [5] [47061] [稀土] 稀土君[10]
(3) [5] [45676] [Alibaba] HollisChuang[11]
(4) [5] [44229] [] 腾讯云加社区[12]
(5) [3] [37565] [前端外刊评论网] 前端外刊评论[13]
(6) [0] [37062] [SN] 丁一[14]
(7) [3] [34825] [腾讯alloyteam -> 腾讯云 -> Shopee] 李CHENGXI[15]
(8) [3] [34588] [] liutao[16]
(9) [3] [33436] [易快报] 水墨寒[17]
(10) [1] [30516] [前掘金] NeXT[18]
(11) [4] [28101] [公众号【远洋号】] 超人汪小建[19]
(12) [4] [27221] [] stormzhangV[20]
(13) [5] [25833] [] Java3y[21]
(14) [2] [25707] [吆喝科技] 吆喝科技_Zoran[22]
(15) [5] [25237] [美团] 美团技术团队[23]
(16) [0] [23913] [] 刘欣[24]
(17) [6] [23829] [宋小菜] yck[25]
(18) [5] [22345] [公众号『crossoverJie』] crossoverJie[26]
(19) [6] [21367] [] 技术胖[27]
(20) [5] [21170] [] 石杉的架构笔记[28]
(21) [3] [21100] [阿里巴巴集团] 闲鱼技术[29]
(22) [1] [20815] [滴滴] 孙福生[30]
(23) [5] [20785] [前网易,现哈啰] 木易杨说[31]
(24) [2] [20642] [弋云科技] AleCC[32]
(25) [0] [20562] [滴滴出行] five_years_struggle[33]
(26) [5] [20196] [ThoughtWorks准入职] SnailClimb[34]
(27) [2] [20065] [ofo] 猴子搬来的救兵[35]
(28) [3] [20058] [HUAWEI] 雨神姥爷[36]
(29) [2] [19307] [金融科技] taotao.li[37]
(30) [4] [19068] [公众号【码洞】] 老錢[38]
(31) [2] [18847] [] 凤尾[39]
(32) [5] [18465] [] 冴羽[40]
(33) [5] [18390] [腾讯 微信] Carson_Ho[41]
(34) [2] [18318] [zhisheng] zhisheng[42]
(35) [0] [17887] [自由职业] IT程序狮[43]
(36) [3] [17741] [Goertek] 泱泱[44]
(37) [4] [17633] [纯源码解析,目前源码解析500+篇] 芋道源码_以德服人_不服就干[45]
(38) [3] [17588] [胖橘网络] KyXu[46]
(39) [5] [17535] [Fundebug] Fundebug[47]
(40) [0] [16984] [腾讯] flike[48]
(41) [3] [16962] [百度] 胡子大哈[49]
(42) [4] [16827] [] 老司机iOS周报[50]
(43) [4] [16364] [] 机器之心[51]
(44) [1] [15699] [AXE] 果只[52]
(45) [3] [15466] [] Mockplus[53]
(46) [5] [15448] [腾讯科技(深圳)有限公司] 腾讯IVWEB团队[54]
(47) [6] [15421] [上海] OBKoro1[55]
(48) [5] [15362] [ELEME] sunshine小小倩[56]
(49) [2] [15164] [ucashin.com] MrMuscles[57]
(50) [3] [15077] [] 已禁用[58]
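A list like the one above can be derived from the crawled article metadata: each article embeds its author as a user object, so collecting the users by id and sorting by follower count yields the ranking. The sketch below assumes the user fields shown later in this post (objectId, username, level, followersCount, company). The helper names rankAuthors and formatRow and the sample data are hypothetical, not the project's actual code.

```javascript
// Sketch: build an author ranking from article metadata.
// Each article embeds its author under "user"; authors with many articles
// appear many times, so they are collected into a Map keyed by objectId.
function rankAuthors(articles, limit = 50) {
  const users = new Map();
  for (const { user } of articles) {
    if (user && user.objectId) users.set(user.objectId, user);
  }
  return [...users.values()]
    .sort((a, b) => b.followersCount - a.followersCount) // most followers first
    .slice(0, limit);
}

// Format one row as: (rank) [level] [follower count] [company] username
function formatRow(user, index) {
  return `(${index + 1}) [${user.level}] [${user.followersCount}] [${user.company || ''}] ${user.username}`;
}

// Hypothetical sample data.
const articles = [
  { user: { objectId: 'u1', username: 'alice', level: 2, followersCount: 35, company: 'xxx' } },
  { user: { objectId: 'u2', username: 'bob', level: 5, followersCount: 120, company: '' } },
  { user: { objectId: 'u1', username: 'alice', level: 2, followersCount: 35, company: 'xxx' } },
];

rankAuthors(articles).map(formatRow).forEach((row) => console.log(row));
// (1) [5] [120] [] bob
// (2) [2] [35] [xxx] alice
```

Note that this approach only ranks users who have at least one article in the crawled data.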
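Scripts
Crawling all site tags
Fetch the information for every tag on Juejin:
npm run tagList
This writes the tag information to src/assets/tagList/tagList.json. Each tag carries the fields below; the most important ones are title and id.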
{
"id": "5597a063e4b08a686ce57030",
"title": "后端",
"createdAt": "2015-07-04T00:59:16Z",
"updatedAt": "2017-06-18T23:34:00Z",
"color": "#C679FF",
"icon": "https://lc-gold-cdn.xitu.io/d83da9d012ddb7ae85f4.png",
"background": "",
"showOnNav": true,
"relationTagId": "",
"alias": "backend houduan",
"isCategory": true,
"entryCount": 19840,
"subscribersCount": 295562,
"isSubscribe": false
},
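Since title and id are the fields that matter, a small lookup over the tag file is often all that later steps need. Here is a minimal sketch, assuming tagList.json holds a JSON array of tag objects shaped like the one above. The helper names loadTags, indexTagsByTitle and largestTags are hypothetical, and the second demo tag is made up.

```javascript
const fs = require('fs');

// Assumption: the file is a JSON array of tag objects like the sample above.
function loadTags(file = 'src/assets/tagList/tagList.json') {
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Map each tag title to its id, the two fields later steps rely on.
function indexTagsByTitle(tags) {
  return new Map(tags.map((tag) => [tag.title, tag.id]));
}

// Return the n tags with the most articles (entryCount), largest first.
function largestTags(tags, n = 10) {
  return [...tags].sort((a, b) => b.entryCount - a.entryCount).slice(0, n);
}

// In-memory demo; with real data, use loadTags() instead.
const demoTags = [
  { id: '5597a063e4b08a686ce57030', title: '后端', entryCount: 19840 },
  { id: 'hypothetical-id', title: 'demo', entryCount: 42 },
];

console.log(indexTagsByTitle(demoTags).get('后端')); // 5597a063e4b08a686ce57030
console.log(largestTags(demoTags, 1)[0].title); // 后端
```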
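Crawling all site articles
This step collects every article under every tag on the site. How long it takes depends on your network speed and machine performance, so please wait patiently for it to finish.
The data collected here is essential: every later analysis builds on it.
The collected files are stored under src/assets/articleData, which contains many JSON files; each file holds the metadata of all column articles under one tag.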
npm run allTagData
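Conceptually, a crawl like this walks the tag list and pages through each tag until no more articles come back. The sketch below shows only that control flow and is not the project's actual implementation. fetchPage and saveTag are hypothetical functions supplied by the caller, and nothing here assumes a particular Juejin endpoint or file naming scheme.

```javascript
// Sketch: collect every page of one tag's articles.
// Assumption: fetchPage(tagId, page) resolves to an array of article objects
// and to an empty array once the tag is exhausted.
async function crawlTag(tagId, fetchPage, maxPages = 1000) {
  const articles = [];
  for (let page = 0; page < maxPages; page += 1) {
    const batch = await fetchPage(tagId, page);
    if (!batch || batch.length === 0) break; // no more pages
    articles.push(...batch);
  }
  return articles;
}

// Crawl tags one at a time and hand each result to saveTag(tag, articles),
// which in a real run might write one JSON file per tag.
async function crawlAllTags(tags, fetchPage, saveTag) {
  for (const tag of tags) {
    const articles = await crawlTag(tag.id, fetchPage);
    await saveTag(tag, articles);
  }
}

// Demo with a fake fetcher instead of real network requests.
const fakePages = { t1: [[{ objectId: 'a' }, { objectId: 'b' }], [{ objectId: 'c' }]] };
const fakeFetch = async (tagId, page) => (fakePages[tagId] || [])[page] || [];
const saved = {};
const done = crawlAllTags([{ id: 't1', title: 'demo' }], fakeFetch, (tag, articles) => {
  saved[tag.title] = articles.length;
});
done.then(() => console.log(saved)); // { demo: 3 }
```

Crawling tags sequentially keeps the request rate low; a real crawler would probably also add a delay between requests and retry failed pages.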
Each object in a file's array looks like this:
{
"collectionCount": 5, // like count
"userRankIndex": 5.4006856695164,
"buildTime": 1565582852.8327,
"commentsCount": 2, // comment count
"gfw": false,
"objectId": "5d40d29d518825221b4cbb40",
"checkStatus": true,
"isEvent": false,
"entryView": "",
"subscribersCount": 0, // unused
"ngxCachedTime": 1565627197,
"verifyStatus": true,
"tags": [
{
"ngxCachedTime": 1565627193,
"ngxCached": true,
"title": "React.js",
"id": "555e99ffe4b00c57d99556aa"
}
],
"updatedAt": "2019-08-12T04:07:32.818Z",
"rankIndex": 0.005346156248974,
"hot": false,
"autoPass": false,
"originalUrl": "https://juejin.im/post/5d3ef3646fb9a06b1b1999fd", // article URL
"verifyCreatedAt": "2019-07-31T01:36:14.238Z",
"createdAt": "2019-07-31T01:36:14.238Z",
"user": {
"community": {
"weibo": { "uid": "5345591282", "nickname": "岁月痕迹A88" },
"wechat": {
"avatarLarge": "http://thirdwx.qlogo.cn/mmopen/vi_32/cabLXAUXiavVhiaDh2050AOOEToUvnZTWsSNqqKZC4hzPzHABC7fxwv6VxwebIxfKdaRkYDZoic8UXfonLDyiafuiaw/132"
},
"github": {
"username": "lxfriday",
"avatarLarge": "https://avatars0.githubusercontent.com/u/20264467?v=4",
"uid": "20264467"
}
},
"collectedEntriesCount": 154, // like count
"company": "xxx", // company
"followersCount": 35, // number of followers
"followeesCount": 70, // number of users followed
"role": "guest", // user role
"postedPostsCount": 19, // number of column posts published
"level": 2, // user level
"isAuthor": false,
"postedEntriesCount": 2, // number of shares (?)
"totalCommentsCount": 16, // total comment count
"ngxCachedTime": 1565627197,
"viewedEntriesCount": 1347, // number of articles viewed
"jobTitle": "前端", // job title (here: front end)
"subscribedTagsCount": 166, // number of tags followed
"totalCollectionsCount": 120, // total number of collections
"username": "云影sky", // username
"avatarLarge": "https://user-gold-cdn.xitu.io/2019/7/14/16bf1155693d96c2?w=570&h=488&f=png&s=312610",
"objectId": "57a0c28979bc440054958498" // user id
},
"author": "",
"screenshot": "https://user-gold-cdn.xitu.io/2019/7/29/16c3e3d979a96831?w=1097&h=573&f=png&s=58239",
"original": true,
"hotIndex": 21.2095,
"content": "给 PureComponent 重新指向构造函数之后,_assign 复制对象属性时, Component 构造函数不会覆盖 PureComponent 构造函数,看下面的例子就明白了。把 PureComponent 变成 Component,userInfo 可正常变化。",
"title": "React 源码系列-Component、PureComponent、function Component 分析",
"lastCommentTime": "2019-08-03T16:53:20.577Z",
"type": "post",
"english": false,
"category": {
"ngxCached": true,
"title": "frontend",
"id": "5562b415e4b00c57d9b94ac8",
"name": "前端",
"ngxCachedTime": 1565627098
},
"viewsCount": 267, // view count
"summaryInfo": "经过