scrapy+pymongo+selenium
scrapy
Running
1. Create a run.py and launch the crawl from it:
from scrapy import cmdline
cmdline.execute('scrapy crawl google'.split())
2. Or run it from the command line:
scrapy crawl google
yield scrapy.Request() produces no response
1. allowed_domains = ["xxxxx"] is written wrong; it must contain bare domains, not URLs:
allowed_domains = ['play.google.com']  # correct: domain only
allowed_domains = ['https://play.google.com/store/apps']  # wrong: no scheme or path allowed
2. Add dont_filter=True (the URL may be getting dropped by scrapy's duplicate filter; dont_filter=True disables that filtering for this request):
yield scrapy.Request(
    url=item['GURL'],
    callback=self.parse_addr_list,
    meta={"item": item},
    dont_filter=True
)
Reading the settings
1. No longer works (removed in modern Scrapy):
from scrapy.conf import settings
2. Current approach:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
# usage
host = settings['MONGODB_HOST']
3. crawler.settings.get
user_agent = crawler.settings.get('USER_AGENTS')  # read the UA list from settings
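Approach 3 is the usual pattern inside a downloader middleware: expose a from_crawler classmethod and read the settings there. A minimal sketch; the USER_AGENTS setting name follows this document, everything else (class and method names aside from the Scrapy hooks) is illustrative:

```python
import random

class RandomUserAgentMiddleware:
    """Downloader middleware that picks a random UA from settings."""

    def __init__(self, user_agents):
        self.user_agents = user_agents or []

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler; settings.get
        # reads USER_AGENTS from the project's settings.py
        return cls(crawler.settings.get('USER_AGENTS'))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers['User-Agent'] = random.choice(self.user_agents)
```

Enable it under DOWNLOADER_MIDDLEWARES in settings.py as with any other middleware.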
pymongo
Handling pymongo.errors.CursorNotFound: Cursor not found
Why the cursor times out:
- The data set is too large for mongo to serve quickly enough
- Processing each document takes so long that the server-side cursor sits idle past its timeout and gets killed
Fix:
Pass no_cursor_timeout=True to find(), so the server never closes the cursor on its own (you must close it manually):
items = myset.find(no_cursor_timeout=True)
for item in items:
    ...  # process each document
items.close()
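The manual close() above is easy to forget if processing raises, and an unclosed no-timeout cursor lingers on the server. A try/finally wrapper guarantees cleanup; a sketch, where `collection` is assumed to be a pymongo Collection and `handle` is your per-document callback (both names are illustrative):

```python
def process_all(collection, handle):
    """Iterate every document without hitting CursorNotFound."""
    # no_cursor_timeout=True keeps the server-side cursor alive;
    # the try/finally guarantees it is closed even if handle() raises.
    cursor = collection.find(no_cursor_timeout=True)
    try:
        for doc in cursor:
            handle(doc)
    finally:
        cursor.close()
```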
selenium
Detecting that the page has been scrolled to the bottom
from selenium.common.exceptions import NoSuchElementException

def is_element_present(self, spider, by, value):
    try:
        spider.chrome.find_element(by=by, value=value)
    except NoSuchElementException:
        return False
    return True
temp_height = 0
if flag != 0:
    while True:
        # Scroll the window down
        spider.chrome.execute_script("window.scrollBy(0,8000)")
        # sleep so the scrollbar has time to react
        time.sleep(1)
        spider.chrome.implicitly_wait(5)
        # Current distance of the scrollbar from the top of the page
        check_height = spider.chrome.execute_script(
            "return document.documentElement.scrollTop || window.pageYOffset || document.body.scrollTop;")
        # If it equals the previous reading, we have hit the bottom
        if check_height == temp_height:
            # (a commented-out variant here first used is_element_present to
            # find and click the page's "show more" button before breaking)
            break
        temp_height = check_height
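An equivalent and slightly simpler check compares document.body.scrollHeight before and after each scroll instead of scrollTop: when the page height stops growing, nothing more is being lazy-loaded. A sketch assuming `driver` is a Selenium WebDriver; scroll_to_bottom, pause and max_rounds are illustrative names:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing (or max_rounds is hit)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # let lazy-loaded content arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: nothing more was loaded
        last_height = new_height
```

The max_rounds cap guards against pages with infinite feeds that never stop growing.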
Setting cookies
import json

print('Setting cookies')
cookies = json.load(open('cookie.json', 'r'))
for cookie in cookies:
    if 'sameSite' in cookie:
        cookie['sameSite'] = 'Strict'
    browser.add_cookie(cookie)
Selenium add_cookie failure: assert cookie_dict['sameSite'] in ['Strict', 'Lax'] AssertionError()
Cause: selenium's Chrome driver rejects any 'sameSite' value other than 'Strict' or 'Lax'.
Fix: rewrite the value before calling add_cookie (see the code above).
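Rather than overwriting sameSite unconditionally, you can coerce only the values Chrome exports that selenium rejects (e.g. 'no_restriction' or 'unspecified') and leave valid ones alone. A sketch; normalize_cookies is an illustrative helper, not a selenium API:

```python
def normalize_cookies(cookies):
    """Coerce sameSite into a value selenium's add_cookie will accept."""
    fixed = []
    for cookie in cookies:
        cookie = dict(cookie)  # don't mutate the caller's dicts
        if 'sameSite' in cookie and cookie['sameSite'] not in ('Strict', 'Lax'):
            cookie['sameSite'] = 'Strict'
        fixed.append(cookie)
    return fixed
```

Feed the result to browser.add_cookie one cookie at a time, as in the loop above.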
selenium.common.exceptions.TimeoutException: Message: timeout
2022-04-11 22:12:35 [scrapy.core.scraper] ERROR: Error downloading <GET https://play.google.com/store/apps/details?id=com.vpnbottle.melon.free.unblock.fast.vpn&hl=en&showAllReviews=true>
Traceback (most recent call last):
File "D:\python\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "D:\python\lib\site-packages\scrapy\core\downloader\middleware.py", line 36, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "E:\程序\python\test\home\gsReview\middlewares.py", line 37, in process_request
spider.chrome.get(url)
File "D:\python\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 436, in get
self.execute(Command.GET, {'url': url})
File "D:\python\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 424, in execute
self.error_handler.check_response(response)
File "D:\python\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout: Timed out receiving message from renderer: 300.000
(Session info: headless chrome=100.0.4896.75)
Root cause:
The page loads too much content, so the renderer times out.
Fixes:
1. Temporary: remove the chrome_options.add_argument('--headless') setting, so the browser runs with a visible window
2. Disable image loading to speed things up: chrome_options.add_argument('blink-settings=imagesEnabled=false')
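Putting both fixes together, the options setup might look like this (selenium-3 style, matching the traceback above; a sketch, and the 300-second page-load timeout is an illustrative value):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument('--headless')  # temporary fix: leave headless off
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # skip images
driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(300)  # raise TimeoutException instead of hanging
```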
html
lang
<html lang='en'></html>    // English
<html lang='zh'></html>    // Chinese
<html lang='ja'></html>    // Japanese
<html lang='en-us'></html> // American English
<html lang='fr'></html>    // French
<html lang='de'></html>    // German
<html lang='it'></html>    // Italian
<html lang='ko'></html>    // Korean
<html lang='pl'></html>    // Polish
<html lang='ru'></html>    // Russian
<html lang='es'></html>    // Spanish
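In a scrapy callback the attribute can be read with response.xpath('//html/@lang').get(); a stdlib-only sketch of the same idea (LangParser and page_lang are illustrative names):

```python
from html.parser import HTMLParser

class LangParser(HTMLParser):
    """Grab the lang attribute of the first <html> tag."""
    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == 'html' and self.lang is None:
            self.lang = dict(attrs).get('lang')

def page_lang(html):
    parser = LangParser()
    parser.feed(html)
    return parser.lang
```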