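Can't write code? Learn a genuinely handy trick for scraping Zhihu, Bilibili, Douban, Douyin, WeChat Official Accounts, Weibo, and other platforms...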

An original article by 苏生不惑. Join my Knowledge Planet (知识星球).

A few days ago, a member of my Knowledge Planet asked how to download a Zhihu user's answers.

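If you're interested, you're welcome to join my Knowledge Planet. It is updated almost every day, mainly with interesting websites and software I come across on the Chinese and international internet, plus tips from work and daily life. It covers all sorts of topics and is a real treasure trove of the internet, hence the name 互联网达人 (Internet Expert). Every post is tagged, so you can browse by tag: https://t.zsxq.com/13bqoLXHJ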


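A long time ago I wrote an article about scraping data with Web Scraper. Today I'm revisiting the topic: with this tool you can scrape data freely without writing any code.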

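Take the Zhihu account 渤海小吏 as an example: https://www.zhihu.com/people/dai-zong-66 . First, install the Web Scraper browser extension (see my earlier article "Still can't install and use userscript tools in 2024? A step-by-step guide"; for the download link, reply "scraper" in the Official Account's chat box). After installing it, open the browser's developer tools and click Import Sitemap.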

Copy the following code:

{"_id":"zhihu_answer","startUrl":["https://www.zhihu.com/people/dai-zong-66/answers?page=[1-5]"],"selectors":[{"id":"row","parentSelectors":["_root"],"type":"SelectorElement","selector":"div.List-item","multiple":true},{"id":"知乎问题标题","parentSelectors":["row"],"type":"SelectorText","selector":"div[itemprop='zhihu:question'] a","multiple":false,"regex":""},{"id":"知乎问题链接","parentSelectors":["row"],"type":"SelectorElementAttribute","selector":"[itemprop='zhihu:question'] a","multiple":false,"extractAttribute":"href"}]}
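
The `[1-5]` in `startUrl` is Web Scraper's range notation: the extension visits pages 1 through 5 in turn. To see which URLs a range covers before starting a crawl, the expansion can be sketched in Python. This is only an illustration of the notation as I understand it (including the optional `:step` suffix, as in the `[1-42:1]` used in the Bilibili sitemap), not the extension's own code, and `expand_start_url` is a made-up helper name.

```python
import re

# Matches Web Scraper style ranges such as [1-5] or [1-42:1] (start-end[:step]).
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_start_url(url):
    """Expand the first [start-end:step] range in a URL into concrete URLs."""
    match = RANGE_PATTERN.search(url)
    if match is None:
        return [url]  # no range in the URL, nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    step = int(match.group(3) or 1)  # step defaults to 1 when omitted
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1, step)]


if __name__ == "__main__":
    for page_url in expand_start_url(
        "https://www.zhihu.com/people/dai-zong-66/answers?page=[1-5]"
    ):
        print(page_url)
```

If the account has more than five pages of answers, raise the upper bound of the range in `startUrl` accordingly.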

Click Data Preview to check that the data looks right.

Then click Scrape to start scraping.

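The browser then scrapes the data automatically, so you can leave it alone. When it finishes, the scraping window closes by itself and you can see that all the data has been collected.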

Finally, export the data to Excel. The file contains the question title and link for every Zhihu answer.

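If you want to download the full text of every answer, you can run a second extraction over the scraped answer links; I'll leave that for you to explore. Scraping Zhihu articles works the same way.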
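
One way to start on that follow-up step is to read the exported file and turn each scraped link into a full URL. The sketch below assumes the data was exported as CSV with the column names from the `zhihu_answer` sitemap (`知乎问题标题`, `知乎问题链接`) and that some `href` values may be relative or protocol-relative. `load_links` is a hypothetical helper, and the sample rows are invented for illustration.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://www.zhihu.com/"

# Invented sample rows standing in for a real export.
SAMPLE = """知乎问题标题,知乎问题链接
示例问题一,//www.zhihu.com/question/1/answer/2
示例问题二,/question/3/answer/4
"""


def load_links(csv_file, title_col="知乎问题标题", link_col="知乎问题链接"):
    """Return (title, absolute_url) pairs from an exported CSV file object."""
    pairs = []
    for row in csv.DictReader(csv_file):
        link = (row.get(link_col) or "").strip()
        if not link:
            continue  # skip rows where no link was scraped
        title = (row.get(title_col) or "").strip()
        # urljoin resolves relative and protocol-relative links against BASE_URL
        # and leaves links that are already absolute unchanged.
        pairs.append((title, urljoin(BASE_URL, link)))
    return pairs


if __name__ == "__main__":
    for title, url in load_links(io.StringIO(SAMPLE)):
        print(title, url)
```

For a real export, pass an opened file instead of the sample, for example `open("zhihu_answer.csv", encoding="utf-8-sig", newline="")`, where the file name is an assumption.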

The exported Excel data includes each Zhihu article's title, link, comment count, and upvote count.

If you also want to batch-download the articles in a Zhihu column, you can use the tool I developed (see "2023 update: the original tools and scripts developed by 苏生不惑"). Articles and answers are saved to the html directory, with file names made up of the time plus the title. All articles are merged into a single PDF file. Videos are saved to the video directory.

Zhihu topics can be scraped too. Import the following code:

{"_id":"zhihu_topic","startUrl":["https://www.zhihu.com/topic/19559424/top-answers"],"selectors":[{"id":"row","parentSelectors":["_root"],"type":"SelectorElementScroll","selector":"div.List-item:nth-of-type(-n+10)","multiple":true,"delay":2000,"elementLimit":500},{"id":"知乎标题","parentSelectors":["row"],"type":"SelectorText","selector":"h2 a","multiple":false,"regex":""},{"id":"知乎链接","parentSelectors":["row"],"type":"SelectorLink","selector":"[itemprop='zhihu:question'] a[data-za-detail-view-element_name]","multiple":false,"linkType":"linkFromHref"}]}

To scrape Bilibili videos, for example all the videos by 木鱼水心 on Bilibili (https://space.bilibili.com/927587/video), import the following code:

{"_id":"bilibili_videos","startUrl":["https://space.bilibili.com/927587/video?tid=0&pn=[1-42:1]&keyword=&order=pubdate"],"selectors":[{"id":"row","parentSelectors":["_root"],"type":"SelectorElement","selector":"li.small-item","multiple":true},{"id":"视频标题","parentSelectors":["row"],"type":"SelectorText","selector":"a.title","multiple":false,"regex":""},{"id":"视频链接","parentSelectors":["row"],"type":"SelectorElementAttribute","selector":"a.cover","multiple":false,"extractAttribute":"href"},{"id":"视频封面","parentSelectors":["row"],"type":"SelectorElementAttribute","selector":"a.cover div.b-img picture img","multiple":false,"extractAttribute":"src"},{"id":"视频播放量","parentSelectors":["row"],"type":"SelectorText","selector":".play span","multiple":false,"regex":""},{"id":"视频长度","parentSelectors":["row"],"type":"SelectorText","selector":" a.cover  span.length","multiple":false,"regex":""},{"id":"发布时间","parentSelectors":["row"],"type":"SelectorText","selector":"span.time","multiple":false,"regex":""}]}

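The exported Excel data includes the video title, link, cover, play count, duration, publish time, and so on. From 2013 to 2023 the account published more than 1,200 videos.

To scrape Bilibili's trending chart, import the following code: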

{"_id":"bilibili","startUrl":["https://www.bilibili.com/v/popular/rank/all"],"selectors":[{"id":"row","multiple":true,"parentSelectors":["_root"],"selector":"li.rank-item","type":"SelectorElement"},{"id":"视频排名","multiple":false,"parentSelectors":["row"],"regex":"","selector":"i.num","type":"SelectorText"},{"id":"视频标题","multiple":false,"parentSelectors":["row"],"regex":"","selector":"a.title","type":"SelectorText"},{"id":"播放量","multiple":false,"parentSelectors":["row"],"regex":"","selector":".detail-state > span:nth-of-type(1)","type":"SelectorText"},{"id":"弹幕数","multiple":false,"parentSelectors":["row"],"regex":"","selector":"span:nth-of-type(2)","type":"SelectorText"},{"id":"up主","multiple":false,"parentSelectors":["row"],"regex":"","selector":"a span","type":"SelectorText"},{"id":"视频链接","multiple":false,"parentSelectors":["row"],"selector":"a.title","type":"SelectorLink"},{"id":"点赞数","multiple":false,"parentSelectors":["视频链接"],"regex":"","selector":"span.like","type":"SelectorText"},{"id":"投币数","multiple":false,"parentSelectors":["视频链接"],"regex":"","selector":"span.coin","type":"SelectorText"},{"id":"收藏数","multiple":false,"parentSelectors":["视频链接"],"regex":"","selector":"span.collect","type":"SelectorText"}]}
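
Counts scraped as text, such as the `播放量` (play count) column, usually need converting to numbers before they can be sorted or summed. The sketch below assumes large values are abbreviated with 万 (10,000) or 亿 (100,000,000), as in `12.3万`, and that a trailing `+` (as in `10万+`) can simply be dropped. `parse_count` is a hypothetical helper, and the sample values are made up.

```python
from decimal import Decimal

# Chinese magnitude suffixes and their multipliers.
UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_count(text):
    """Convert strings like '12.3万' or '10万+' to an int; None if unparseable."""
    cleaned = text.strip().rstrip("+").replace(",", "")
    if not cleaned:
        return None
    multiplier = 1
    if cleaned[-1] in UNITS:
        multiplier = UNITS[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        # Decimal avoids binary floating-point rounding on values like 12.3.
        return int(Decimal(cleaned) * multiplier)
    except (ArithmeticError, ValueError):
        return None  # not a number, e.g. a placeholder such as "--"


if __name__ == "__main__":
    for sample in ["12.3万", "10万+", "1,234", "1.5亿", "--"]:
        print(sample, parse_count(sample))
```

Returning `None` for unparseable text keeps a bad cell from being silently counted as zero.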

To scrape the Douban Movie Top 250 chart, import the following code:

{"_id":"douban_movie_top_250","startUrl":["https://movie.douban.com/top250?start=0&filter="],"selectors":[{"id":"next_page","type":"SelectorLink","parentSelectors":["_root","next_page"],"selector":".next a","multiple":true,"delay":0},{"id":"container","type":"SelectorElement","parentSelectors":["_root","next_page"],"selector":".grid_view li","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["container"],"selector":"span.title:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"number","type":"SelectorText","parentSelectors":["container"],"selector":"em","multiple":false,"regex":"","delay":0}]}
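
Each selector in a sitemap names its parents in `parentSelectors`. Listing `next_page` as its own parent, as this sitemap does, is what lets Web Scraper keep following the "next" link from one page to the following one. Before importing a hand-edited sitemap, it can help to check that every parent reference points at `_root` or at a selector that actually exists. A minimal sketch of such a check follows; `find_unknown_parents` and `children_of` are hypothetical helpers, not part of the extension.

```python
import json

# The Douban sitemap from this article, reflowed for readability.
DOUBAN_SITEMAP = """
{"_id":"douban_movie_top_250",
 "startUrl":["https://movie.douban.com/top250?start=0&filter="],
 "selectors":[
  {"id":"next_page","type":"SelectorLink","parentSelectors":["_root","next_page"],
   "selector":".next a","multiple":true,"delay":0},
  {"id":"container","type":"SelectorElement","parentSelectors":["_root","next_page"],
   "selector":".grid_view li","multiple":true,"delay":0},
  {"id":"title","type":"SelectorText","parentSelectors":["container"],
   "selector":"span.title:nth-of-type(1)","multiple":false,"regex":"","delay":0},
  {"id":"number","type":"SelectorText","parentSelectors":["container"],
   "selector":"em","multiple":false,"regex":"","delay":0}]}
"""


def find_unknown_parents(sitemap):
    """Return (selector_id, parent_id) pairs whose parent is not defined."""
    known = {"_root"} | {sel["id"] for sel in sitemap["selectors"]}
    return [
        (sel["id"], parent)
        for sel in sitemap["selectors"]
        for parent in sel.get("parentSelectors", [])
        if parent not in known
    ]


def children_of(sitemap, parent_id):
    """Return the ids of the selectors that run under the given parent."""
    return [
        sel["id"]
        for sel in sitemap["selectors"]
        if parent_id in sel.get("parentSelectors", [])
    ]


if __name__ == "__main__":
    sitemap = json.loads(DOUBAN_SITEMAP)
    print("unknown parents:", find_unknown_parents(sitemap))
    print("under next_page:", children_of(sitemap, "next_page"))
```

Because `container` also lists `next_page` as a parent, the movie rows are extracted on every page reached through the pagination link, not only on the start URL.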

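You can also get the data for every video of a Douyin account, including the video date, title, link, like count, comment count, favorite count, share count, and so on.

The same goes for all the data of a Weibo account, including the post link, post content, publish time, like count, repost count, comment count, topics, and so on. See "2024: batch-download Weibo posts, images, videos, comments, and repost data, and export to Excel and PDF".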

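And finally, all the article data of a WeChat Official Account, including the article date, title, link, summary, author, cover image, whether it is original, IP location, read count, "Wow" (在看) count, like count, comment count, share count, follower count, number of tips, number of videos, number of audio clips, and so on. For example, the articles that 深圳卫健委 published in 2022 all have 100,000+ reads. For an analysis of that data, see the article "2022 is over: scraping an Official Account's read, like, Wow, and comment counts for data analysis, using 深圳卫健委 as an example".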

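Latest original articles:

A proper introduction to my Knowledge Planet

2023 update: the original tools and scripts developed by 苏生不惑

Updated again: 2023 batch download of Official Account article content, topics, images, covers, videos, and audio, with PDF export and article data including read, like, Wow, and comment counts

If this article helped you, please support it with a like, a Wow, and a share. Thank you all!

Official Account: 苏生不惑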
