Scraping Google with Python: a summary

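1. The problem
Google is the dominant mainstream search engine, but it is also exceptionally strict about blocking illegitimate requests (abnormal traffic, crawlers).

A while ago I had a script that relied on Google search (see my post on inferring a company's official website from its name using cosine similarity). At the time I simply used an open-source Python project.

In practice, though, crawling from a single IP was painfully slow, and the IP got banned at the slightest carelessness, so the crawler for Google search results needed improving.

The problems with scraping Google search results:

1. When you open Google search normally and inspect the page to find the target content, all you see is a big blob of JavaScript.
2. Requests that come too fast trigger the "unusual traffic" response.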
2. How to solve it
For the first problem:

As noted, inspecting the page shows nothing but JavaScript, and the search URL looks like this:

https://www.google.com.hk/search?q=hello&oq=hello&aqs=chrome…69i57j69i60l2j69i65j69i60j0.876j0j7&sourceid=chrome&ie=UTF-8&google_abuse=GOOGLE_ABUSE_EXEMPTION%3DID%3Daa946d8c657cf359:TM%3D1484917472:C%3Dr:IP%3D118.193.241.44-:S%3DAPGng0tGiKFaIr7YCaivUEmmEHOYJhG4jg%3B+path%3D/%3B+domain%3Dgoogle.com%3B+expires%3DFri,+20-Jan-2017+16:04:32+GMT

The fix is crude: just disable JavaScript, and Google serves the results as plain HTML.

Now look at the URL: https://www.google.com.hk/search?q=hello&btnG=Search&safe=active&gbv=1

Given this URL, you can probably guess where this is going.

From here a simple script will do, for example one that fetches the HTML of the first page of Google results:

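import logging
import random
import time
from urllib.parse import quote_plus

import chardet
import requests
import urllib3

URL_SEARCH = "https://{domain}/search?hl={language}&q={query}&btnG=Search&gbv=1"
URL_NUM = "https://{domain}/search?hl={language}&q={query}&btnG=Search&gbv=1&num={num}"

# Assumption: a few sample hosts. In the original class this was self.get_random_domain().
DOMAINS = ["www.google.com", "www.google.ae", "www.google.at"]


def search_page(query, language='en', num=None, start=0, pause=2):
    """
    Google search
    :param query: Keyword
    :param language: Language
    :param num: Number of results to request (optional)
    :param start: Result offset (not used in this snippet)
    :param pause: Seconds to sleep before the request
    :return: result HTML, or None on failure
    """
    time.sleep(pause)
    domain = random.choice(DOMAINS)
    if num is None:
        url = URL_SEARCH.format(
            domain=domain, language=language, query=quote_plus(query))
    else:
        url = URL_NUM.format(
            domain=domain, language=language, query=quote_plus(query), num=num)
    try:
        # verify=False skips TLS verification, so silence the resulting warning.
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        r = requests.get(url=url,
                         allow_redirects=False,
                         verify=False,
                         timeout=30)
        # Detect the encoding; fall back to UTF-8 if detection fails.
        charset = chardet.detect(r.content)
        content = r.content.decode(charset['encoding'] or 'utf-8')
        return content
    except Exception as e:
        logging.error(e)
        return None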
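
Once the HTML is in hand, the result links still have to be pulled out of it. Below is a minimal sketch using only the standard library. It assumes the no-JavaScript page wraps each result in an anchor of the form /url?q=<target>&..., which is how that page has commonly been served; the markup can change at any time, and the names ResultLinkParser and extract_result_urls are mine, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class ResultLinkParser(HTMLParser):
    """Collect the targets of anchors shaped like /url?q=<target>&..."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: result links are redirect links starting with /url?
        if not href.startswith("/url?"):
            return
        # parse_qs also percent-decodes the target URL.
        target = parse_qs(urlparse(href).query).get("q", [""])[0]
        if target.startswith("http"):
            self.urls.append(target)


def extract_result_urls(html):
    """Return the result URLs found in a no-JavaScript results page."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.urls
```

Calling extract_result_urls(content) on the string returned by search_page would give a plain list of URLs; navigation links such as /search?... are skipped because they do not start with /url?.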

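This is where the real trouble starts. Try an experiment: assuming you search Google through a proxy, run twenty or thirty back-to-back searches for the same keyword with no delay, and you will almost certainly be asked to pass a verification (CAPTCHA) check.

This is thoroughly annoying, and I have no good solution. All you can do is try to avoid it:

1. IP rotation 2. Sleep between result fetches 3. A random user_agent is a must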
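
Points 1 and 3 can be sketched as a small generator that hands out per-request options. This is only an illustration: make_request_options is a hypothetical helper, the proxies are rotated round-robin, and the dictionaries it yields are shaped to be passed to requests.get as keyword arguments.

```python
import itertools
import random


def make_request_options(user_agents, proxies=None, rng=random):
    """Yield requests-style keyword arguments, one dict per request.

    user_agents: list of User-Agent strings to pick from at random.
    proxies: optional list of proxy dicts, used in round-robin order.
    """
    proxy_cycle = itertools.cycle(proxies) if proxies else None
    while True:
        options = {"headers": {"User-Agent": rng.choice(user_agents)}}
        if proxy_cycle is not None:
            options["proxies"] = next(proxy_cycle)
        yield options
```

A typical call would look like requests.get(url, timeout=30, **next(options)), where options is the generator created once at start-up.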

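The first and third points need no explanation. The second, adding sleep time, needs careful testing. Assume a single IP with a random UA:

1. Without sleeping, two or three requests are enough to get banned outright (the ban is lifted the next day).
2. I do not consider sleeping alone a real solution: misjudge the interval and the IP gets banned, and in my tests avoiding a ban took a sleep of around 60 s, which makes the whole exercise pointless.
3. Worse, in this situation the IP is banned outright, which is very unfriendly to developers.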
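
Since a ban is this costly, it helps if the script notices the unusual-traffic response and backs off right away. The check below is a heuristic sketch, not a documented contract: the status codes (429, 503), the /sorry/ redirect path and the "unusual traffic" phrase are assumptions based on commonly observed behaviour, and looks_blocked is a hypothetical helper name.

```python
def looks_blocked(status_code, location="", body=""):
    """Heuristically decide whether a response is an unusual-traffic block.

    status_code: HTTP status of the response.
    location: value of the Location header, if any.
    body: decoded response body, if any.
    """
    # Assumption: rate limiting is signalled with 429 or 503.
    if status_code in (429, 503):
        return True
    # Assumption: blocked requests are redirected to a /sorry/ page.
    if status_code in (301, 302) and "/sorry/" in location:
        return True
    # Assumption: the block page mentions "unusual traffic".
    return "unusual traffic" in body.lower()
```

With requests and allow_redirects=False, the inputs would be r.status_code, r.headers.get("Location", "") and r.text.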
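An offhand remark from a colleague pointed me to a temporary workaround.

A single IP hammering the same Google domain is naturally easy to spot. But Google has 190+ domains worldwide; is it really tracking IPs in real time across all of them? Possibly, but surely not as strictly as on a single domain. Let's test it.

I collected the 190+ Google domains, rotated through them at random just like the UAs, and ran the test from a single IP. The results were decent:

1. First, the IP was never banned; only the unusual-traffic notice appeared.
2. Sleeping is still required. I slept 5~15 s and was never banned; adjust to your own situation, and use 5~30 s if you want to play it safe.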
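
The rotation plus randomized sleep can be sketched in a few lines. This is an illustration under my own naming, not the code of the project: next_request_plan is hypothetical, and the pause bounds are left as parameters so you can tune them to what your own tests tolerate.

```python
import random


def next_request_plan(domains, min_pause, max_pause, rng=random):
    """Pick a random Google host and a random pause for the next request.

    domains: list of Google hosts to rotate through.
    min_pause, max_pause: bounds of the sleep interval, in seconds.
    Returns a (host, pause_seconds) tuple.
    """
    if min_pause > max_pause:
        raise ValueError("min_pause must not exceed max_pause")
    return rng.choice(domains), rng.uniform(min_pause, max_pause)
```

Before each request, call host, pause = next_request_plan(domains, low, high), sleep for pause seconds with time.sleep, then build the search URL on host.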
I packaged all of this into a project, magic_google-python; if you write PHP, take a look at my PHP version, php-google. The full code is in the project, and usage is simple:

from magic_google import MagicGoogle
import pprint

# Or PROXIES = None
PROXIES = [{
    'http': 'http://192.168.2.207:1080',
    'https': 'http://192.168.2.207:1080'
}]

# Or MagicGoogle()
mg = MagicGoogle(PROXIES)

# Crawling the whole page
result = mg.search_page(query='python')

# Crawling url
for url in mg.search_url(query='python'):
    pprint.pprint(url)

# Output
# 'https://www.python.org/'
# 'https://www.python.org/downloads/'
# 'https://www.python.org/about/gettingstarted/'
# 'https://docs.python.org/2/tutorial/'
# 'https://docs.python.org/'
# 'https://en.wikipedia.org/wiki/Python_(programming_language)'
# 'https://www.codecademy.com/courses/introduction-to-python-6WeG3/0?curriculum_id=4f89dab3d788890003000096'
# 'https://www.codecademy.com/learn/python'
# 'https://developers.google.com/edu/python/'
# 'https://learnpythonthehardway.org/book/'
# 'https://www.continuum.io/downloads'

# Get {'title','url','text'}
for i in mg.search(query='python', num=1):
    pprint.pprint(i)

# Output
# {'text': 'The official home of the Python Programming Language.',
# 'title': 'Welcome to Python .org',
# 'url': 'https://www.python.org/'}

3. Summary
For scraping Google search results, I recommend the following:

1. IP rotation
2. Random UA
3. Random domain
4. Sleep

user_agent list:
self._user_agent = ['Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10',
    'Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13',
    'Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+',
    'Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0',
    'Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)',
    'UCWEB7.0.2.37/28/999',
    'NOKIA5700/ UCWEB7.0.2.37/28/999',
    'Openwave/ UCWEB7.0.2.37/28/999',
    'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; TencentTraveler 4.0; .NET CLR 2.0.50727)',
    'MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Androdi; Linux armv7l; rv:5.0) Gecko/ Firefox/5.0 fennec/5.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Opera/9.80 (Android 2.3.4; Linux; Opera mobi/adr-1107051709; U; zh-cn) Presto/2.8.149 Version/11.10',
    'UCWEB7.0.2.37/28/999',
    'NOKIA5700/ UCWEB7.0.2.37/28/999',
    'Openwave/ UCWEB7.0.2.37/28/999',
    'Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999']

Google domain list:
self.domain = [
    'https://www.google.com/',
    'https://www.google.ad/',
    'https://www.google.ae/',
    'https://www.google.com.af/',
    'https://www.google.com.ag/',
    'https://www.google.com.ai/',
    'https://www.google.al/',
    'https://www.google.am/',
    'https://www.google.co.ao',
    'https://www.google.com.ar/',
    'https://www.google.as/',
    'https://www.google.at/',
    'https://www.google.com.au/',
    'https://www.google.az/',
    'https://www.google.ba/',
    'https://www.google.com.bd/',
    'https://www.google.be/',
    'https://www.google.bf/',
    'https://www.google.bg/',
    'https://www.google.com.bh/',
    'https://www.google.bj/',
    'https://www.google.com.bn/',
    'https://www.google.com.bo',
    'https://www.google.com.br/',
    'https://www.google.bs/',
    'https://www.google.at/',
    'https://www.google.bt/',
    'https://www.google.co.bw/',
    # 'https://www.google.by/',
    'https://www.google.com.bz/',
]
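
Note that the entries above are full URLs, while the URL_SEARCH template in section 2 expects a bare host in its {domain} placeholder. One way to bridge the two, as a small sketch with a hypothetical helper name to_host:

```python
from urllib.parse import urlparse


def to_host(entry):
    """Return the bare host for an entry such as 'https://www.google.ad/'."""
    # Entries without a scheme are prefixed with // so urlparse sees a netloc.
    parsed = urlparse(entry if "://" in entry else "//" + entry)
    return parsed.netloc
```

Applying to_host to every entry once at start-up yields hosts such as www.google.ad, which can then be picked at random for each request.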
