Resource Recommendation
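A while ago I needed to scrape Facebook for a project, but because of the pandemic the official Graph API had stopped accepting permission requests from individual developers. Stuck, I turned to the ever-reliable GitHub for resources. I tried quite a few packages and list the ones I found most useful below. They are for personal study and exchange only, not for commercial use.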
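- An online tool for scraping basic information from Facebook pages (publicly listed address, phone number, email, opening hours, and so on). It is quick and convenient, and a free trial is available. https://phantombuster.com/automations/facebook/8369/facebook-profile-scraper
- From GitHub. I tried it on personal profiles, and it is quite powerful at scraping a profile's posts, videos, and so on. It requires valid credentials (the email and password you registered with). https://github.com/harismuneer/Ultimate-Facebook-Scraper
- From GitHub. It can scrape all posts from a public page, along with their timestamps, the counts of shares, likes, and comments, post IDs, and more, and it needs no credentials. It is one of the few working tools I found that can scrape public pages; unfortunately, it cannot retrieve the text of the comments. https://github.com/kevinzg/facebook-scraper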
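To make the third option more concrete, here is a minimal sketch of how the per-post dictionaries it yields might be flattened into spreadsheet rows. The key names (`post_id`, `time`, `text`, `likes`, `comments`, `shares`) follow the facebook-scraper README as I understand it, so treat them as an assumption and check them against your installed version. `posts_to_rows` and `sample_posts` are hypothetical names, and the sample data is made up so the block runs without network access.

```python
from datetime import datetime

# Fields to keep from each post. These key names are assumed from the
# facebook-scraper README; verify them against the version you use.
FIELDS = ("post_id", "time", "text", "likes", "comments", "shares")


def posts_to_rows(posts):
    """Flatten an iterable of post dicts into rows ready for a spreadsheet."""
    rows = []
    for post in posts:
        # Missing keys become None instead of raising KeyError.
        row = {field: post.get(field) for field in FIELDS}
        # Keep only the calendar date, as a string, when a timestamp is present.
        if isinstance(row["time"], datetime):
            row["time"] = row["time"].strftime("%Y-%m-%d")
        rows.append(row)
    return rows


# Made-up sample standing in for the scraper's output; not real data.
sample_posts = [
    {"post_id": "1", "time": datetime(2020, 5, 1, 9, 30), "text": "Hello",
     "likes": 3, "comments": 1, "shares": 0, "image": None},
    {"post_id": "2", "time": None, "text": "No timestamp", "likes": 0},
]

rows = posts_to_rows(sample_posts)
print(rows[0]["time"])
```

With the real library, the same helper could presumably be fed `get_posts("some_page", pages=5)` directly, where `some_page` is a placeholder account name, and the resulting rows passed to `pandas.DataFrame(rows).to_excel(...)` to produce the xlsx file.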
Practical Usage
In the end I chose the third option above to scrape all posts from the target companies' Facebook public pages and export the data to xlsx:
import re
import time
import datetime
import pandas as pd
import numpy as np
from Facebook_Scraper.facebook_scraper import get_posts
from Facebook_Scraper.facebook_scraper import fetch_share_and_reactions
def facebook_scrap():
    # The incorporation and dissolution dates are stored as timestamps; we
    # convert them into strings containing only the date.
    data = pd.read_excel('../data/dataset.xlsx', converters={
        'Date of Establishment_legal': str, 'Dissolved_legal': str})
    # Column 'Date of Establishment_legal' holds the company's incorporation date,
    # 'Dissolved_legal' holds its dissolution date, and 'Facebook' holds the link
    # to the company's Facebook public page, if any.
    # Keep only the companies that have a Facebook link.
    data = data[data['Facebook'].notna()]
    data['Date of Establish