Target site: http://www.shushu8.com/huanhaichenfu/
Step 1: Create the project
KeysdeMacBook:Desktop keys$ scrapy startproject MyCrawl
New Scrapy project 'MyCrawl', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
/Users/keys/Desktop/MyCrawl
You can start your first spider with:
cd MyCrawl
scrapy genspider example example.com
Step 2: Generate the spider
KeysdeMacBook:Desktop keys$ cd MyCrawl/
KeysdeMacBook:MyCrawl keys$ scrapy genspider FirstSpider www.shushu8.com/huanhaichenfu
Step 3: Define items.py
import scrapy


class MycrawlItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()
Step 4: Write the spider
# -*- coding: utf-8 -*-
import scrapy

from MyCrawl.items import MycrawlItem


class FirstspiderSpider(scrapy.Spider):
    name = 'FirstSpider'
    # allowed_domains should contain bare domains, not URLs with paths
    allowed_domains = ['www.shushu8.com']
    # The chapter pages are numbered 1 through 502
    start_urls = ['http://www.shushu8.com/huanhaichenfu/' + str(i + 1) for i in range(502)]

    def parse(self, response):
        url = response.url
        title = response.xpath('//*[@id="main"]/div[2]/div/div[1]/h1/text()').extract_first('')
        text = response.css('#content::text').extract()
        myitem = MycrawlItem()
        myitem['url'] = url
        myitem['title'] = title
        myitem['text'] = ','.join(text)
        yield myitem
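As a quick sanity check, the start_urls list comprehension from the spider expands to the 502 numbered chapter URLs:

```python
# Reproduces the start_urls comprehension used in the spider above
start_urls = ['http://www.shushu8.com/huanhaichenfu/' + str(i + 1) for i in range(502)]

print(len(start_urls))   # 502
print(start_urls[0])     # http://www.shushu8.com/huanhaichenfu/1
print(start_urls[-1])    # http://www.shushu8.com/huanhaichenfu/502
```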
Step 5: Write pipelines.py
# -*- coding: utf-8 -*-
import pymysql


class MysqlPipeline(object):
    # Write items to MySQL synchronously (each insert blocks until committed)
    def __init__(self):
        self.conn = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password='rootkeys',
            database='Article',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into huanhaichenfu(url, title, text)
            VALUES (%s, %s, %s)
        """
        # Pass values through parameterized placeholders rather than string formatting
        self.cursor.execute(
            insert_sql,
            (item["url"], item["title"], item["text"]))
        self.conn.commit()
        return item
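The pipeline's INSERT presumes a huanhaichenfu(url, title, text) table already exists in the Article database; the source doesn't show the DDL, so the schema below is an assumption. The same parameterized-insert pattern can be exercised with the stdlib sqlite3 module (SQLite uses ? placeholders where pymysql uses %s):

```python
import sqlite3

# Assumed schema (hypothetical -- the original article never shows the DDL),
# demonstrated with in-memory SQLite instead of MySQL.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE huanhaichenfu (url TEXT, title TEXT, text TEXT)')

# Same parameterized-insert pattern as the pipeline; SQLite's placeholder
# is ? where pymysql's is %s.
insert_sql = 'INSERT INTO huanhaichenfu (url, title, text) VALUES (?, ?, ?)'
item = {'url': 'http://www.shushu8.com/huanhaichenfu/1',
        'title': 'chapter 1', 'text': 'chapter text'}
cursor.execute(insert_sql, (item['url'], item['title'], item['text']))
conn.commit()

print(cursor.execute('SELECT count(*) FROM huanhaichenfu').fetchone()[0])  # 1
```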
Step 6: Configure settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'MyCrawl'
SPIDER_MODULES = ['MyCrawl.spiders']
NEWSPIDER_MODULE = 'MyCrawl.spiders'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'MyCrawl.pipelines.MysqlPipeline': 1,
}
Step 7: Run the spider
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
run_spider = 'FirstSpider'

if __name__ == '__main__':
    print('Running Spider of ' + run_spider)
    execute(['scrapy', 'crawl', run_spider])