Scraping Twitter Tweets with Python and Selenium

江先森

Last modified: 2024-11-11 09:29:47

Tags: selenium, testing tools, python, web scraping, twitter

First published: 2024-10-17 19:44:06

Article link: https://blog.csdn.net/joy357692577/article/details/143026715

How do you scrape tweets from Twitter?

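Efficient data scraping
Unlock the potential of Twitter data with our Python and Selenium integration. It automates the collection process, monitors browser activity, and captures the requests and responses the page makes, so the scraping job needs less manual work.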

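Advanced tweet filtering
Use Twitter's advanced search to target only the tweets that meet your criteria. You can filter by keyword, date, and hashtag, which keeps the collected data relevant and precise.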
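
The filtering described above can be sketched as a small query builder. This is a minimal illustration, not part of the original project: the function names `build_search_query` and `build_search_url` are hypothetical, and it assumes the `since:` / `until:` search operators and the `https://x.com/search` URL with the `f=live` parameter that shows up in step 6.

```python
from urllib.parse import urlencode


def build_search_query(keywords, hashtags=(), since=None, until=None):
    """Combine keywords, hashtags and a date range into one search string."""
    parts = list(keywords)
    # Accept hashtags with or without the leading '#'
    parts += [f"#{tag.lstrip('#')}" for tag in hashtags]
    if since:
        parts.append(f"since:{since}")  # date as YYYY-MM-DD
    if until:
        parts.append(f"until:{until}")  # date as YYYY-MM-DD
    return " ".join(parts)


def build_search_url(query, live=True):
    """Build a search URL; f=live is the flag the step 6 filter looks for."""
    params = {"q": query}
    if live:
        params["f"] = "live"
    return "https://x.com/search?" + urlencode(params)


query = build_search_query(
    ["python", "selenium"],
    hashtags=["scraping"],
    since="2024-10-01",
    until="2024-10-17",
)
print(query)
print(build_search_url(query))
```

The resulting string can be typed into the search box as `search_query` in step 5, or the URL can be opened directly with `browser.get(...)`.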

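Step 1: Set up the environment

First, install the project dependencies listed in requirements.txt. They include Selenium, which automates the browser: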

pip install -r requirements.txt

Step 2: Download ChromeDriver

Download the ChromeDriver build that matches your Chrome version from the ChromeDriver download page.

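Step 3: Run Chrome for testing

This step is for debugging: it opens a visible Chrome window with remote debugging enabled so you can watch what the scraper does. Skip it if you do not need to see the browser.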

@echo off
start C:\software\chrome-win64\chrome.exe --remote-debugging-port=9223
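
Before attaching Selenium, it can help to confirm that Chrome is actually listening on the debugging port. The check below is a minimal sketch using only the standard library; the helper name `is_debug_port_open` is hypothetical, and the default port 9223 is taken from the batch file above.

```python
import socket


def is_debug_port_open(host="localhost", port=9223, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    if is_debug_port_open():
        print("Chrome debugging port is reachable")
    else:
        print("Nothing is listening on port 9223; start Chrome first")
```

This only shows that the port is open, not that the listener is Chrome, so treat a positive result as a hint rather than a guarantee.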

Step 4: Configure Chrome options

from selenium import webdriver

self.options = webdriver.ChromeOptions()
# Present a regular desktop Chrome user agent
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
self.options.add_argument(f'user-agent={user_agent}')
self.options.add_argument('--disable-gpu')
self.options.add_argument('--no-sandbox')
self.options.add_argument('--disable-dev-shm-usage')

# Attach to the Chrome instance started in step 3 (same port); drop this line if you skipped step 3
self.options.add_experimental_option("debuggerAddress", "localhost:9223")

# modify_random_canvas_js() and get_browser() are helpers from this project and are not shown here
js_script_name = modify_random_canvas_js()
self.browser = self.get_browser(script_files=[js_script_name], record_network_log=True, headless=True)
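
The project's `get_browser` helper is not shown, so the following is only a guess at the part that matters for step 6: turning on Chrome's performance log. It is a minimal sketch, assuming the `goog:loggingPrefs` capability is what `record_network_log=True` enables; `configure_options` is a hypothetical name, and it works on any options object that exposes `add_argument`, `set_capability` and `add_experimental_option`, as `webdriver.ChromeOptions` does.

```python
# Capability that asks ChromeDriver to record DevTools events in the "performance" log
PERFORMANCE_LOGGING = {"performance": "ALL"}

COMMON_FLAGS = ("--disable-gpu", "--no-sandbox", "--disable-dev-shm-usage")


def configure_options(options, user_agent, debugger_address=None):
    """Apply the step 4 settings plus performance logging to an options object."""
    options.add_argument(f"user-agent={user_agent}")
    for flag in COMMON_FLAGS:
        options.add_argument(flag)
    # Without this, browser.get_log("performance") has nothing to return
    options.set_capability("goog:loggingPrefs", PERFORMANCE_LOGGING)
    if debugger_address:
        # Attach to an already running Chrome, e.g. "localhost:9223"
        options.add_experimental_option("debuggerAddress", debugger_address)
    return options


# Usage with Selenium (requires the selenium package and a matching ChromeDriver):
#   from selenium import webdriver
#   options = configure_options(webdriver.ChromeOptions(), "Mozilla/5.0 ...")
#   browser = webdriver.Chrome(options=options)
```

Whether performance logging behaves the same when attaching through `debuggerAddress` may depend on the ChromeDriver version, so it is worth verifying on your setup.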

Step 5: Search for tweets with Selenium

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# implicitly_wait takes seconds and sets how long find_element keeps polling; set it once, up front
self.browser.implicitly_wait(20)
self.browser.switch_to.new_window('tab')
url = "https://x.com/explore"
self.browser.get(url=url)
search_box = self.browser.find_element(By.CSS_SELECTOR, '[data-testid="SearchBox_Search_Input"]')
search_box.send_keys(Keys.CONTROL + "a")  # Select any existing text
search_box.send_keys(Keys.DELETE)  # Clear it
search_box.send_keys(search_query)

# Press Enter to submit the search
search_box.send_keys(Keys.RETURN)
# Second item in the tab bar of the results page
second_div = self.browser.find_element(By.CSS_SELECTOR,'[data-testid="ScrollSnap-List"] [role="presentation"]:nth-of-type(2)')

Step 6: Monitor the browser's network responses

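import json

# Requires performance logging to be enabled on the driver
performance_log = self.browser.get_log("performance")
for packet in performance_log:

    msg = packet.get("message")  # Raw JSON string of the DevTools event
    message = json.loads(msg).get("message")
    packet_method = message.get("method")

    # Keep only Network events that mention the SearchTimeline API
    if "Network" in packet_method and 'SearchTimeline' in msg:
        document_url = message['params'].get('documentURL')
        # Skip events that did not come from a search page with f=live in its URL
        if (not document_url) or ('&f=live' not in document_url):
            continue
        # ID used to look up the response body for this request
        request_id = message.get("params").get("requestId")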
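
The original snippet stops at `request_id` and step 7 starts from a variable called `body`, so the step in between is not shown. One plausible way to bridge the two is the DevTools command `Network.getResponseBody`, which Selenium's Chromium drivers expose through `execute_cdp_cmd`. The sketch below is an assumption about how that could look, not the author's code; `iter_search_request_ids` and `fetch_response_body` are hypothetical names.

```python
import base64
import json


def iter_search_request_ids(performance_log, marker="SearchTimeline", url_flag="&f=live"):
    """Yield each matching requestId once, using the same filter as step 6."""
    seen = set()
    for packet in performance_log:
        raw = packet.get("message", "")
        if marker not in raw:
            continue  # Cheap string check before parsing JSON
        message = json.loads(raw).get("message", {})
        if not message.get("method", "").startswith("Network."):
            continue
        params = message.get("params", {})
        if url_flag not in (params.get("documentURL") or ""):
            continue
        request_id = params.get("requestId")
        if request_id and request_id not in seen:
            seen.add(request_id)
            yield request_id


def fetch_response_body(browser, request_id):
    """Ask Chrome for the response body of one request and return it as text."""
    result = browser.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
    body = result.get("body", "")
    if result.get("base64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    return body


# Usage with a live driver:
#   for request_id in iter_search_request_ids(browser.get_log("performance")):
#       body = fetch_response_body(browser, request_id)
```

Several Network events share one `requestId`, which is why the generator deduplicates. Chrome may no longer hold the body for old requests, in which case the CDP call raises an exception that the caller should handle.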

Step 7: Extract data from the response

# Runs inside the loop from step 6; `body` is the response text for request_id (retrieving it is not shown here)
entries = json.loads(body)['data']['search_by_raw_query']['search_timeline']['timeline']['instructions'][0].get('entries', None)
if not entries:
    continue
for entry in entries:
    item_content = entry['content'].get('itemContent', None)
    if not item_content:
        continue  # Not a tweet entry
    tweet_result = item_content['tweet_results']['result']
    entry_id = entry['entryId']
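
The extraction above can also be wrapped in a standalone function that is easy to test against saved responses. This is a minimal sketch under the JSON layout shown in the snippet; `extract_tweets` is a hypothetical name, and unlike the original it scans every instruction rather than only the first, on the assumption that entries are not always at index 0.

```python
import json


def extract_tweets(body):
    """Return a list of (entry_id, tweet_result) pairs from a SearchTimeline response."""
    data = json.loads(body)
    timeline = data["data"]["search_by_raw_query"]["search_timeline"]["timeline"]
    tweets = []
    for instruction in timeline.get("instructions", []):
        for entry in instruction.get("entries") or []:
            item_content = entry.get("content", {}).get("itemContent")
            if not item_content:
                continue  # Entries without itemContent are not tweets
            result = item_content.get("tweet_results", {}).get("result")
            if result is None:
                continue
            tweets.append((entry["entryId"], result))
    return tweets


# Usage with a response body captured in step 6:
#   for entry_id, tweet_result in extract_tweets(body):
#       print(entry_id)
```

Using `.get` with defaults means a response with an unexpected entry is skipped instead of raising `KeyError` halfway through a batch.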