Python3 + Scrapy + Selenium + Chrome Headless + MySQL
Python 3 MySQL database driver: PyMySQL
pip3 install PyMySQL
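Once the driver is installed, a scraped row can be written with a parameterized INSERT. The following is only a minimal sketch: the helper names build_insert and save_row, the table name, and the connection settings are hypothetical and do not come from the original notes. The import of pymysql is placed inside save_row so that the SQL-building helper can be tried without a running database.

```python
def build_insert(table, row):
    """Return a parameterized INSERT statement and its values for one row.

    `table` and the keys of `row` are assumed to be trusted identifiers;
    only the values are passed to the driver as parameters.
    """
    columns = list(row)
    column_sql = ", ".join(f"`{name}`" for name in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO `{table}` ({column_sql}) VALUES ({placeholders})"
    return sql, tuple(row[name] for name in columns)


def save_row(table, row, **connect_kwargs):
    """Open a connection, insert one row, commit, and close."""
    import pymysql  # imported here so build_insert works without the driver

    conn = pymysql.connect(**connect_kwargs)
    try:
        with conn.cursor() as cursor:
            sql, values = build_insert(table, row)
            cursor.execute(sql, values)
        conn.commit()
    finally:
        conn.close()
```

A call might look like save_row("posts", {"title": "Looking Back at 2017"}, host="localhost", user="scrapy", password="secret", database="scraping", charset="utf8mb4"), where every credential and name is a placeholder to replace with your own.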
Setting up Scrapy
pip install scrapy
If you use Anaconda, check which Python is active, then install from conda-forge:
which python
conda install -c conda-forge scrapy
The official test spider
# myspider.py
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title on the current page.
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the previous posts and parse that page too.
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
The output looks like this:
> scrapy runspider myspider.py
......
2018-12-26 23:58:23 [scrapy.core.engine] INFO: Spider opened
2018-12-26 23:58:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-26 23:58:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-26 23:58:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.scrapinghub.com> (referer: None)
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Do What is Right Not What is Easy!'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Shubber GetTogether 2018'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Data Quality Assurance for Enterprise Web Scraping'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'What I Learned as a Google Summer of Code student at Scrapinghub'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'GDPR Compliance For Web Scrapers: The Step-By-Step Guide'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'For E-Commerce Data Scientists: Lessons Learned Scraping 100 Billion Products Pages'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'A Sneak Peek Inside What Hedge Funds Think of Alternative Financial Data'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Want to Predict Fitbit’s Quarterly Revenue? Eagle Alpha Did It Using Web Scraped Product Data'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'How Data Compliance Companies Are Turning To Web Crawlers To Take Advantage of the GDPR Business Opportunity'}
2018-12-26 23:58:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://blog.scrapinghub.com>
{'title': 'Looking Back at 2017'}
2018-12-26 23:58:27 [scrapy.core.engine] INFO: Closing spider (finished)
......
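To send the scraped titles shown in the log to MySQL rather than only printing them, one option is a Scrapy item pipeline. The sketch below is an assumption about how that could be wired up, not part of the original notes: the class name MySQLTitlePipeline, the posts table with a title column, and the connection settings are all hypothetical.

```python
class MySQLTitlePipeline:
    """Store each scraped {'title': ...} item in a MySQL table."""

    def open_spider(self, spider):
        import pymysql  # imported lazily; only needed when the spider runs

        # Hypothetical connection settings; replace with your own.
        self.conn = pymysql.connect(
            host="localhost",
            user="scrapy",
            password="secret",
            database="scraping",
            charset="utf8mb4",
        )

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes the title value.
        with self.conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO posts (title) VALUES (%s)",
                (item["title"],),
            )
        self.conn.commit()
        # Return the item so any later pipeline still receives it.
        return item
```

Scrapy only calls a pipeline that is listed in the ITEM_PIPELINES setting, so the class has to be registered there before the inserts happen. Committing after every item keeps the sketch simple; batching commits would likely be faster on large crawls.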