Python3 + Scrapy + Selenium + Chrome Headless + MySQL

Latest recommended article published on 2024-05-06 21:21:57

熊猫也是猫啊


Category: Web crawlers

Article link: https://blog.csdn.net/weixin_43189735/article/details/84317447


Python 3 MySQL database driver: PyMySQL
Configuring Scrapy
Configuring Selenium WebDriver

Python 3 MySQL database driver: PyMySQL

Project page: PyMySQL on GitHub (https://github.com/PyMySQL/PyMySQL)

pip3 install PyMySQL
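Once PyMySQL is installed, a quick way to check the driver is to insert one row. The following is a minimal sketch, not part of the original post: the host, user, password, database name, and the `posts` table are all placeholder assumptions, and `build_insert` and `save_item` are hypothetical helper names.

```python
def build_insert(table, item):
    """Build a parameterized INSERT statement for a dict of column values.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input. Values go through
    placeholders and are escaped by the driver.
    """
    columns = sorted(item)
    column_list = ", ".join(f"`{column}`" for column in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"
    return sql, tuple(item[column] for column in columns)


def save_item(item, table="posts"):
    """Insert one item into MySQL using PyMySQL."""
    # Imported here so that build_insert can be used without the driver.
    import pymysql

    # Assumed connection settings; replace them with your own.
    connection = pymysql.connect(
        host="localhost",
        user="scrapy",
        password="secret",
        database="scrapy_demo",
        charset="utf8mb4",
    )
    try:
        sql, params = build_insert(table, item)
        with connection.cursor() as cursor:
            cursor.execute(sql, params)
        # PyMySQL does not autocommit by default.
        connection.commit()
    finally:
        connection.close()
```

For example, `build_insert("posts", {"title": "Looking Back at 2017"})` returns the statement ``INSERT INTO `posts` (`title`) VALUES (%s)`` together with the one-element parameter tuple.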

Configuring Scrapy

Official Scrapy website: https://scrapy.org

pip install scrapy

If you use Anaconda, first check which Python interpreter is active, then install Scrapy from conda-forge:

which python
conda install -c conda-forge scrapy

The official test spider

# myspider.py
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title found on the current page.
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the previous posts and parse that page the same way.
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

Running the spider produces output like this:

> scrapy runspider myspider.py
......
2018-12-26 23:58:23 [scrapy.core.engine] INFO: Spider opened
2018-12-26 23:58:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-26 23:58:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-26 23:58:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.scrapinghub.com> (referer: None)
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Do What is Right Not What is Easy!'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Shubber GetTogether 2018'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Data Quality Assurance for Enterprise Web Scraping'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'What I Learned as a Google Summer of Code student at Scrapinghub'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'GDPR Compliance For Web Scrapers: The Step-By-Step Guide'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'For E-Commerce Data Scientists: Lessons Learned Scraping 100 Billion Products Pages'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Want to Predict Fitbit’s Quarterly Revenue? Eagle Alpha Did It Using Web Scraped Product Data'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'How Data Compliance Companies Are Turning To Web Crawlers To Take Advantage of the GDPR Business Opportunity'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Looking Back at 2017'}
2018-12-26 23:58:27 [scrapy.core.engine] INFO: Closing spider (finished)
......
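To store the scraped titles in MySQL rather than only logging them, one option is a Scrapy item pipeline backed by PyMySQL. The sketch below is an assumption about how the pieces could fit together, not the author's code: the class name `MySQLPipeline`, the `posts` table with a `title` column, and the connection settings are all hypothetical. The optional `connect` argument exists only so the class can be exercised without a running database.

```python
class MySQLPipeline:
    """Scrapy item pipeline that writes each item's title to MySQL."""

    INSERT_SQL = "INSERT INTO posts (title) VALUES (%s)"

    def __init__(self, connect=None):
        # `connect` is a zero-argument callable returning a DB connection.
        # Scrapy creates the pipeline without arguments, so it stays None
        # and the PyMySQL connection below is used.
        self._connect = connect
        self.connection = None

    def open_spider(self, spider):
        if self._connect is None:
            import pymysql  # imported lazily, only when really connecting

            # Assumed connection settings; replace them with your own.
            self._connect = lambda: pymysql.connect(
                host="localhost",
                user="scrapy",
                password="secret",
                database="scrapy_demo",
                charset="utf8mb4",
            )
        self.connection = self._connect()

    def close_spider(self, spider):
        if self.connection is not None:
            self.connection.close()

    def process_item(self, item, spider):
        with self.connection.cursor() as cursor:
            cursor.execute(self.INSERT_SQL, (item["title"],))
        self.connection.commit()
        # Return the item so later pipelines still receive it.
        return item
```

In a Scrapy project, a pipeline like this would be enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.MySQLPipeline": 300}`, where the module path is a placeholder for wherever the class lives.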

Configuring Selenium WebDriver

Selenium WebDriver configuration
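The post does not show the WebDriver setup itself, so the following is only a minimal sketch of driving headless Chrome from Selenium. It assumes the `selenium` package is installed (`pip3 install selenium`) and that Chrome and a matching ChromeDriver are available on the machine. The helper name `fetch_rendered_html` and the argument list are illustrative choices, not taken from the original.

```python
# Command-line switches passed to Chrome; "--headless" runs it without a window.
HEADLESS_ARGUMENTS = ["--headless", "--disable-gpu", "--window-size=1280,800"]


def fetch_rendered_html(url, arguments=HEADLESS_ARGUMENTS):
    """Load `url` in headless Chrome and return the rendered page source."""
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if the page load fails.
        driver.quit()
```

Calling `fetch_rendered_html("https://blog.scrapinghub.com")` would return the HTML after JavaScript has run, which could then be parsed with the same CSS selectors used in the spider.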
