I. Preparation before development
1. Setting up the development environment
- Basic setup (Windows 10)
- Reference: installing Python 2 and Python 3 side by side
- Reference: installing PyCharm and activating it permanently
- Reference: installing and using MySQL and Navicat, including the Navicat crack
- Reference: installing cmder as a replacement for cmd (recommended)
- Setting up a virtual environment
- Go to the directory where you want to keep virtual environments and install virtualenvwrapper:
pip install virtualenvwrapper-win
- Create a virtual environment:
mkvirtualenv -p C:\Python35\python.exe myenv # myenv is the environment name; the environment is created under the configured virtualenv directory
- Common virtual environment commands: workon, deactivate, rmvirtualenv (examples below)
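For example, assuming the environment created above is named myenv:
workon myenv # activate (or switch to) the environment
deactivate # leave the currently active environment
rmvirtualenv myenv # delete the environment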
- Packages you will typically need
- pip install scrapy # the Scrapy package; you may need to install Twisted and then Scrapy from wheels downloaded at https://www.lfd.uci.edu/~gohlke/pythonlibs/, e.g.:
pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip install Scrapy-1.5.0-py2.py3-none-any.whl
- pip install pypiwin32
# if you get a win32api error, try: python pywin32_postinstall.py -install
- pip install Pillow # image handling
- pip install mysqlclient # the MySQLdb package
- pip install fake-useragent # random User-Agent strings
- pip install requests # the requests HTTP library
- pip install selenium # browser automation
- pip install pyvirtualdisplay # run Chrome without a visible window (works on Linux)
- If something goes wrong, look for prebuilt packages and fixes at https://www.lfd.uci.edu/~gohlke/pythonlibs/
- Installation can be sped up with the Douban mirror:
pip install -i https://pypi.douban.com/simple mywrap # mywrap is the package name
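As a quick sanity check that the packages installed correctly, a minimal sketch using fake-useragent:
from fake_useragent import UserAgent
ua = UserAgent() # fetches and caches a list of real browser User-Agent strings
print(ua.random) # returns a different random User-Agent on each access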
2. Creating a new spider
Create the Scrapy project:
open cmd and cd into the directory where the project should live;
activate the virtual environment created for the project;
run: scrapy startproject <project_name> [project_dir]
Generate a spider from a template:
cd project_name
# project_name is the project name; you must be inside the project directory
scrapy genspider [options] <name> <domain>
# generates a spider from a template
Common scrapy commands: run scrapy with no arguments to see the help text.
In PyCharm, set the project interpreter to the project's virtual environment. A concrete run of the commands above is sketched below.
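For example, a hypothetical run assuming a project named ltfspider (the name used in the pipeline settings later) that crawls blog.jobbole.com:
scrapy startproject ltfspider
cd ltfspider
scrapy genspider jobbole blog.jobbole.com # creates ltfspider/spiders/jobbole.py from the basic template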
3. Writing main.py for easier debugging
To debug directly in PyCharm, set up main.py as follows; you can then run main.py and stop at breakpoints.
# main.py # create this file in the project root (next to scrapy.cfg) and copy the content below into it
from scrapy.cmdline import execute
import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__))) # make sure the project root is on the import path
execute(["scrapy", "crawl", "jobbole"]) # 根据实际爬虫进行修改
# execute(["scrapy", "crawl", "zhihu"])
# execute(["scrapy", "crawl", "lagou"])
4. A helper function that generates boilerplate code
- Analyse the pages and your requirements, and plan which fields you need and what their attributes are.
- Write a function that auto-generates the code for table creation, data insertion/updating, and item assignment.
# gen_item_sql.py # auto-generates boilerplate code to cut down manual typing; adjust the field rules to your own case
def gen_item_sql(item_list, table_name, class_item):
    '''Helper that prints auto-generated code.'''
    # generate the MySQL table-creation statement
    print('-' * 50, 'sql table creation', '-' * 30)
    item_p = []
    for item in item_list:
        if item[-4:] == 'nums':
            item_p.append("%s int(11) DEFAULT 0 NOT NULL" % (item,))
        elif item[-4:] == "date":
            item_p.append("%s date DEFAULT NULL" % (item,))
        elif item[-8:] == "datetime":
            item_p.append("%s datetime DEFAULT NULL" % (item,))
        elif item[-7:] == "content":
            item_p.append("%s longtext DEFAULT NULL" % (item,))
        elif item[-2:] == "id":
            item_p.append("%s varchar(50) NOT NULL" % (item,))
        else:
            item_p.append("%s varchar(300) DEFAULT NULL" % (item,))
    print("CREATE TABLE %s (" % (table_name,))
    print(",".join(item_p))
    print(") ENGINE=InnoDB DEFAULT CHARSET=utf8;")
    # in the spider file, define the item-parsing method
    print('-' * 50, 'in the spider, define def parse_item(self, response):', '-' * 30)
    print('def parse_item(self, response):')
    print("\titem_loader = MyItemLoader(item={0}(), response=response)".format(class_item))
    for item in item_list:
        print('\titem_loader.add_xpath(\'%s\',\'\')' % (item,))
    print("\titem = item_loader.load_item()")
    print("\tyield item")
    # in items.py, define the Item class
    print('-' * 50, 'in items.py, define the Item class:', '-' * 30)
    print("class {0}(scrapy.Item):".format(class_item))
    for item in item_list:
        print("\t%s = scrapy.Field()" % (item,))
    print()
    print("\tdef get_insert_sql(self):")
    print("\t# build the insert sql statement and its parameters")
    sql_str = '\t\tinsert_sql = \'\'\''
    sql_str += '\ninsert into %s ( %s )' % (table_name, ','.join(item_list))
    sql_str += '\nVALUES ( %s )' % (','.join(['%s' for _ in item_list]),)
    sql_str += '\nON DUPLICATE KEY UPDATE %s' % (','.join(['%s= VALUES(%s)' % (x, x) for x in item_list]),)
    sql_str += '\n\'\'\''
    print(sql_str)
    print('\t\tparams = ( %s )' % (','.join(['self["%s"]' % (x,) for x in item_list]),))
    print("\t\treturn insert_sql, params")
    # assigning values to the item fields
    print('-' * 50, 'assigning values to the item', '-' * 30)
    for item in item_list:
        print('\titem["%s"] =' % (item,))

if __name__ == "__main__":
    jobbole_fields = ['title', 'create_datetime', 'url', 'url_object_id', 'tags', 'content',
                      'front_image_url', 'front_image_path', 'praise_nums', 'comment_nums', 'fav_nums']
    gen_item_sql(jobbole_fields, 'jobbole_article', class_item="JobboleArticleItem")
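Running python gen_item_sql.py prints, among the other blocks, the table-creation SQL. Abridged (the full column list is printed on a single line), the output should look roughly like:
CREATE TABLE jobbole_article (
title varchar(300) DEFAULT NULL,create_datetime datetime DEFAULT NULL,url varchar(300) DEFAULT NULL,url_object_id varchar(50) NOT NULL,...,praise_nums int(11) DEFAULT 0 NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Note that the generated DDL declares no primary key, so for the ON DUPLICATE KEY UPDATE clause to take effect you would add a unique key by hand (e.g. PRIMARY KEY (url_object_id)).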
5. Creating the database (e.g. MySQL)
- Copy the auto-generated table-creation SQL from above into Navicat to create the corresponding table;
- If field values may contain emoji, the corresponding columns in the table should use the utf8mb4 charset, e.g. as sketched below.
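A minimal sketch of that charset change, assuming the jobbole_article table generated above:
ALTER TABLE jobbole_article CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;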
II. Know the Scrapy architecture by heart
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
- Once the page finishes downloading, the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
- The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
- The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
- The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
- The process repeats (from step 1) until there are no more requests from the Scheduler.
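To make steps 4 and 5 concrete, here is a minimal downloader-middleware sketch (the class name is my own; it reuses the fake-useragent package listed earlier):
from fake_useragent import UserAgent

class RandomUserAgentMiddleware(object):
    def __init__(self):
        self.ua = UserAgent()

    # step 4: called for every Request on its way to the Downloader
    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random
        return None  # None lets the request continue down the chain

    # step 5: called for every Response on its way back to the Engine
    def process_response(self, request, response, spider):
        return response  # must return a Response (or a replacement Request)
Enable it by registering the class under DOWNLOADER_MIDDLEWARES in settings.py.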
III. Crawling the data
1. Spiders generated from the basic template
i.e. created with scrapy genspider -t basic <name> <domain>
# the generated spider template
The key is to override two methods:
- Define def parse(self, response): to
- start from the start pages and collect the URLs on each page that match your criteria (e.g. a regex);
- yield scrapy.Request() for each matching URL, with parse_item as the callback to perform the field parsing;
- if there is a next-page URL, extract it and yield scrapy.Request() with parse as the callback to keep traversing;
- Define def parse_item(self, response): to
- parse the page and extract the required information into an item;
- if parsing turns up new crawlable URLs, yield scrapy.Request() with parse as the callback to keep traversing.
A minimal sketch of this parse/parse_item pattern follows.
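The sketch below assumes a jobbole-style listing page; the XPath expressions and class names are placeholders to adapt to the real page:
import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        # collect the article links on the listing page
        for url in response.xpath('//a[@class="archive-title"]/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_item)
        # follow the next-page link, re-entering parse
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_item(self, response):
        # extract the needed fields into an item (a plain dict here)
        item = {}
        item['title'] = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first()
        item['url'] = response.url
        yield item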
For a source-code walkthrough of Spider, see: https://blog.csdn.net/qd_ltf/article/details/79792957
2. Spiders generated from the crawl template
i.e. created with scrapy genspider -t crawl <name> <domain>
# the generated spider template
The key is to override def parse_item(self, response): to implement the page parsing; a minimal example follows the snippet below.
You may also override the following methods:
def parse_start_url(self, response):
return []
def process_results(self, response, results):
return results
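A minimal CrawlSpider sketch; the allow patterns are illustrative guesses at lagou-style URLs and must be adapted:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    rules = (
        # follow listing pages without parsing them
        Rule(LinkExtractor(allow=r'zhaopin/'), follow=True),
        # hand job-detail pages to parse_item
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # implement the page parsing here
        item = {'url': response.url}
        yield item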
For a source-code walkthrough of CrawlSpider, see: https://blog.csdn.net/qd_ltf/article/details/79782005
3. Shell debugging while working out field parsing
Field parsing is usually done with CSS or XPath selectors, and both can be tried out interactively in the shell, e.g.:
scrapy shell http://blog.jobbole.com/112744/ # start an interactive shell against the page
re2 = response.xpath('//*[@id="post-112744"]/div[1]/h1/text()').extract()
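The CSS form works the same way in the shell; the selector below is a guess at the same article heading and may need adjusting:
response.css('div.entry-header h1::text').extract()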
IV. Saving the data (pipelines)
Depending on your needs, copy the relevant code below directly into the corresponding files.
1. Saving to MySQL
# settings.py # add the following to settings.py
MYSQL_HOST = '127.0.0.1'
MYSQL_DB_NAME = 'spider' # database name
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S" # datetime format
SQL_DATE_FORMAT = "%Y-%m-%d" # date format
ITEM_PIPELINES = {
'ltfspider.pipelines.MysqlTwistedPipline': 300, # adjust to your own project
}
# ------------------------------------
# pipelines.py # add the following to pipelines.py
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
class MysqlTwistedPipline(object):
'''
Make MySQL inserts asynchronous using twisted's adbapi.
'''
def __init__(self, db_pool):
self.db_pool = db_pool
@classmethod
def from_settings(cls, settings):
# read the database settings from settings and return a pipeline instance
return cls(adbapi.ConnectionPool(
"MySQLdb",
host=settings['MYSQL_HOST'],
db=settings['MYSQL_DB_NAME'],
user=settings['MYSQL_USER']