Scraping Google Search Results with Python Using Random Delay Intervals

qq^^614136809

Published 2024-10-11 16:09:00

Tags: python, programming languages

Article link: https://blog.csdn.net/D0126_/article/details/142856709

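When scraping Google search results with Selenium, it helps to pause between requests so that Google is less likely to recognise a regular pattern in the HTTP traffic. The original code paused for a fixed interval (3 seconds), but a fixed interval is itself a regular pattern that Google may detect. The code therefore needs to be changed so that it waits for a random interval instead.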

2. Solution

To get a random interval, generate a random number with Python's random module and use it as the delay. Two ways of changing the code are given here:

Approach 1: generate the random interval with random.randrange()

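import random
import time

# Pick a random delay with random.randrange():
# an even number of seconds from 10 to 28 (the stop value 30 is excluded)
random_delay = random.randrange(10, 30, 2)

# Print the chosen delay
print('Sleeping for {} seconds'.format(random_delay))

# Pause for the chosen delay
time.sleep(random_delay)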
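
To make it clearer which delays this approach can actually produce, the short sketch below lists them. The variable name possible_delays is only illustrative and does not appear in the original code.

```python
import random

# randrange(10, 30, 2) draws from the same values as range(10, 30, 2):
# the even numbers from 10 up to 28. The stop value 30 is never returned.
possible_delays = list(range(10, 30, 2))
print(possible_delays)  # [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]

# Every value returned by randrange() is one of those ten candidates
random_delay = random.randrange(10, 30, 2)
print(random_delay in possible_delays)  # True
```

With a step of 2 there are only ten possible delays, so a smaller step or a wider range gives more variety.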

Approach 2: generate the random interval with random.randint()

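import random
import time

# Pick a random delay with random.randint():
# a whole number of seconds from 10 to 30, both endpoints included
random_delay = random.randint(10, 30)

# Print the chosen delay
print('Sleeping for {} seconds'.format(random_delay))

# Pause for the chosen delay
time.sleep(random_delay)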
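
Both functions used so far return whole seconds. As a possible variation that is not part of the original post, random.uniform() returns a float, and time.sleep() accepts fractional seconds, so the delay does not have to fall on a whole second. The sketch below only draws and prints the values; it does not sleep.

```python
import random

# randint(10, 30) includes both endpoints: 21 possible whole-second delays
int_delay = random.randint(10, 30)

# Assumption (not in the original post): a fractional delay from uniform()
float_delay = random.uniform(10, 30)

print('Integer delay: {} s'.format(int_delay))
print('Fractional delay: {:.2f} s'.format(float_delay))
```

Either value can be passed straight to time.sleep().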

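Either approach produces a random interval, which makes the timing of the requests less regular and so harder for Google to identify as a pattern.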
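
Since the same three steps (pick, print, sleep) are repeated wherever a delay is needed, they can be wrapped in a small helper. This is a minimal sketch: the function name random_sleep and its sleep parameter are hypothetical and are not taken from the original code. The sleep parameter exists only so the helper can be tried out without actually waiting.

```python
import random
import time


def random_sleep(min_seconds=10, max_seconds=30, sleep=time.sleep):
    """Sleep for a random whole number of seconds and return the delay used."""
    if min_seconds > max_seconds:
        raise ValueError('min_seconds must not exceed max_seconds')
    # randint() includes both endpoints
    delay = random.randint(min_seconds, max_seconds)
    print('Sleeping for {} seconds'.format(delay))
    sleep(delay)
    return delay


# Try it without waiting by swapping in a do-nothing sleep function
used = random_sleep(10, 30, sleep=lambda seconds: None)
```

In a scraping loop, a plain random_sleep() call would go immediately before each page request.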

Code example

import csv
import random
import time

from scrapy import Selector
from selenium import webdriver

# Search URLs to visit
lister = ['https://www.google.co.uk/search?q=MOT+in+Godmanchester&num=10',
          'https://www.google.co.uk/search?q=MOT+in+Godmanchester&num=10&start=10',
          'https://www.google.co.uk/search?q=MOT+in+Hanley+Grange&num=10',
          'https://www.google.co.uk/search?q=MOT+in+Hanley+Grange&num=10&start=10',
          'https://www.google.co.uk/search?q=MOT+in+Huntingdon&num=10',
          'https://www.google.co.uk/search?q=MOT+in+Huntingdon&num=10&start=10',
          'https://www.google.co.uk/search?q=MOT+in+March&num=10']

# Open a Firefox browser
driver = webdriver.Firefox()

# Open the CSV file for appending (Python 3: text mode with newline="").
# The raw string keeps the backslashes in the Windows path literal.
with open(r"C:\Drive F data\Google\output.csv", "a", newline="", encoding="utf-8") as export:
    # Column names
    fieldnames = ['link', 'text1', 'text2', 'text3']

    # Create the CSV writer
    writer = csv.DictWriter(export, fieldnames=fieldnames)

    # Write the header row only if the file is still empty
    if export.tell() == 0:
        writer.writeheader()

    # Loop over the search URLs
    for serial, link in enumerate(lister, start=1):
        # Pick a random delay of 10 to 30 seconds
        random_delay = random.randint(10, 30)

        # Print the chosen delay
        print('Sleeping for {} seconds'.format(random_delay))

        # Pause for the chosen delay before the next request
        time.sleep(random_delay)

        # Load the search page
        driver.get(link)

        # Give the page time to load
        time.sleep(3)

        # Print the serial number and the URL
        print(serial, '.', link)

        # Get the page source
        source = driver.page_source

        # Parse the page source with scrapy
        source1 = Selector(text=source, type="html")

        # Extract the results. These XPath expressions depend on Google's
        # result markup and may need adjusting if that markup changes.
        text1 = source1.xpath('//h3[(contains(@class, "r")) and not(contains(@style, "line-height:normal"))]//text()').extract()
        text2 = source1.xpath('//h3[(contains(@class, "r")) and not(contains(@style, "line-height:normal"))]//@href').extract()
        text3 = source1.xpath('//span[@class="st"]').extract()

        # Write one CSV row per result (zip stops at the shortest list)
        for each, each1, each2 in zip(text1, text2, text3):
            writer.writerow({'link': link, 'text1': each, 'text2': each1, 'text3': each2})

# Close the browser and end the WebDriver session
driver.quit()
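
The search list above is written out by hand, one URL per results page. As a rough sketch of how the same list could be generated in Python 3, the helper below builds the URLs with urllib.parse.urlencode. The names BASE_URL and build_search_urls and the pages and per_page parameters are hypothetical and only serve this illustration.

```python
from urllib.parse import urlencode

# Assumption: the same search endpoint as in the hand-written list
BASE_URL = 'https://www.google.co.uk/search'


def build_search_urls(queries, pages=2, per_page=10):
    """Return one search URL per query and results page."""
    urls = []
    for query in queries:
        for page in range(pages):
            params = {'q': query, 'num': per_page}
            # The first page has no start offset; later pages do
            if page > 0:
                params['start'] = page * per_page
            # urlencode() encodes spaces as '+', matching the list above
            urls.append(BASE_URL + '?' + urlencode(params))
    return urls


lister = build_search_urls(['MOT in Godmanchester', 'MOT in Hanley Grange'])
for url in lister:
    print(url)
```

The first two URLs produced this way match the first two entries of the hand-written list.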