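1. Create a Scrapy project with the command: scrapy startproject scrapyspider (project name)
2. In the root directory of the new project, create a spider with the command: scrapy genspider myspider (spider name) www.baidu.com (domain to crawl)
3. Open the project in PyCharm. The generated spider template looks like this: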
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        pass
4. In the code above, parse is the function that handles the responses for the URLs in start_urls. A filled-in version:
# Requires at module level: from urllib import parse; from scrapy.http import Request
def parse(self, response):
    # 1. Extract each article URL from the list page, let Scrapy download it,
    #    and hand the response to parse_detail for field extraction
    post_nodes = response.xpath("//div[@id='archive']/div[contains(@class,'floated-thumb')]/div[@class='post-thumb']/a")
    for post_node in post_nodes:
        image_url = post_node.xpath("img/@src").extract_first()
        url = post_node.xpath("@href").extract_first()
        yield Request(url=parse.urljoin(response.url, url),
                      meta={"front_image_url": parse.urljoin(response.url, image_url)},
                      callback=self.parse_detail)

    # 2. Extract the next-page URL and let Scrapy download it
    next_url = response.xpath("//a[@class='next page-numbers']/@href").extract_first("")
    if next_url:
        yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
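(1) Scrapy's Request hands the detail page to the callback parse_detail, and the cover image URL is passed along with the request
The added parameter: meta={"front_image_url":parse.urljoin(response.url, image_url)}
(2) The next list page is requested with parse itself as the callback, since parse is the function that handles list pages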
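The calls to parse.urljoin above are there because an href on the page may be relative. A minimal sketch of that behavior using only the standard library; the page URL and article path below are made up for illustration:

```python
from urllib import parse

# Hypothetical list-page URL, chosen only for illustration.
page_url = "http://blog.jobbole.com/all-posts/page/2/"

# A relative href is resolved against the URL of the page it came from.
relative = parse.urljoin(page_url, "/114610/")

# An absolute href is returned unchanged.
absolute = parse.urljoin(page_url, "http://blog.jobbole.com/114610/")

print(relative)
print(absolute)
```

Both calls produce the same absolute URL, so the spider can pass either form of href to Request without checking which one it got.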
5. The detail-page parsing function parse_detail
# Requires at module level: import datetime, import re, and the project's own get_md5 helper
def parse_detail(self, response):
    article_item = JobBoleArticleItem()

    # Cover image URL passed in from parse() through Request.meta
    front_image_url = response.meta.get("front_image_url", "")
    title = response.xpath("//div[@class='entry-header']/h1/text()").extract_first("")
    create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract_first("").replace('·', '').strip()

    # Upvote count; the node may be missing or empty, so convert only when it is numeric
    praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first("")
    praise_nums = int(praise_nums) if praise_nums.strip().isdigit() else 0

    # Bookmark count: take the first number in the text, default to 0
    fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first("")
    match_re = re.match(r".*?(\d+).*", fav_nums)
    fav_nums = int(match_re.group(1)) if match_re else 0

    # Comment count: same approach
    comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract_first("")
    match_re = re.match(r".*?(\d+).*", comment_nums)
    comment_nums = int(match_re.group(1)) if match_re else 0

    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    # Drop the comment-count entry (text ending in "评论") from the tags.
    # An earlier version failed here because endswith was misspelled.
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tags = ",".join(tag_list)

    article_item["url_object_id"] = get_md5(response.url)
    article_item["title"] = title
    article_item["url"] = response.url
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except ValueError:
        # Fall back to today's date when the text is not in the expected format
        create_date = datetime.datetime.now().date()
    article_item["create_date"] = create_date
    # Scrapy's image pipeline expects a list of URLs, so wrap the value in a list
    article_item["front_image_url"] = [front_image_url]
    article_item["praise_nums"] = praise_nums
    article_item["comment_nums"] = comment_nums
    article_item["fav_nums"] = fav_nums
    article_item["tags"] = tags

    yield article_item
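(1) Extract the individual fields from the response and validate each one
(2) The manually passed parameter front_image_url is read back from the response like this: front_image_url = response.meta.get("front_image_url", "")
(3) Put the processed values into the item and yield the item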
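The field handling in parse_detail repeats three small patterns: hashing the URL, pulling the first number out of a text node, and parsing a date with a fallback. The sketch below shows one way they could be factored into helpers using only the standard library. The source does not show get_md5, so its body here is an assumption about what that helper does; extract_int and parse_date are hypothetical names introduced for illustration.

```python
import datetime
import hashlib
import re


def get_md5(url):
    """Return the hex MD5 digest of a URL (assumed behavior of the project's helper)."""
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


def extract_int(text, default=0):
    """Return the first run of digits in text as an int, or default if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


def parse_date(text, fmt="%Y/%m/%d"):
    """Parse a date string; fall back to today's date if it is missing or malformed."""
    try:
        return datetime.datetime.strptime(text.strip(), fmt).date()
    except (AttributeError, ValueError):
        return datetime.datetime.now().date()


print(extract_int(" 2 收藏"))     # text with a number
print(extract_int(" 收藏"))       # text without a number
print(parse_date("2018/05/01"))
print(get_md5("http://blog.jobbole.com/all-posts/"))
```

With helpers like these, a missing node (extract_first returning None) or an unexpected date format yields a default value instead of raising inside the callback.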
6. The item has to be defined by hand in items.py, as follows:
import scrapy


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    tags = scrapy.Field()
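7. Exporting the data (to a database or a local file) and downloading the images requires defining the download and storage pipelines in pipelines.py and registering them in settings.py
(1) Store the data in MySQL, using the Twisted framework that Scrapy is built on to write asynchronously (reason: when the crawl is fast, synchronous database writes become the bottleneck and slow the crawl down)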
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class JobBoleMysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    # Class method: Scrapy calls it with the project settings to build the pipeline
    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # The first argument is the DB-API module name and is case-sensitive
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Let Twisted run the MySQL insert asynchronously
        self.dbpool.runInteraction(self.do_insert, item)
        # Return the item so that any later pipeline still receives it
        return item

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into jobbole(title, create_date, url, url_object_id, front_image_url,
                                comment_nums, fav_nums, praise_nums, tags)
            values (%s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        # front_image_url is stored on the item as a list; insert its first element
        cursor.execute(insert_sql, (item["title"], item["create_date"], item["url"],
                                    item["url_object_id"], item["front_image_url"][0],
                                    item["comment_nums"], item["fav_nums"],
                                    item["praise_nums"], item["tags"]))
(2) Image download
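from scrapy.pipelines.images import ImagesPipeline


# Subclass Scrapy's ImagesPipeline and override item_completed to record the
# saved file path; the download itself is done by Scrapy
class JobBoleImagePipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # results is a list of (success, value) pairs; value["path"] exists only on success
        for ok, value in results:
            if ok:
                item["front_image_path"] = value["path"]
        return item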
(3) Register the new pipelines under ITEM_PIPELINES in settings.py; the smaller the number, the earlier the pipeline runs
ITEM_PIPELINES = {
    # 'webspider.pipelines.JobBoleImagePipeline': 1,
    'webspider.pipelines.JobBoleMysqlTwistedPipline': 1,
}
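If the image pipeline is enabled as well, it needs to know which item field holds the image URLs and where to save the files. The fragment below is a sketch of how settings.py might look in that case, not the author's actual configuration: the priority 2 for the MySQL pipeline and the "images" directory are assumptions, while IMAGES_URLS_FIELD and IMAGES_STORE are the standard Scrapy settings for ImagesPipeline (which also requires Pillow to be installed).

```python
# settings.py (fragment); values marked "assumed" are not from the original notes.
ITEM_PIPELINES = {
    "webspider.pipelines.JobBoleImagePipeline": 1,        # runs first: downloads the image
    "webspider.pipelines.JobBoleMysqlTwistedPipline": 2,  # assumed priority: runs second
}

# Item field that holds the list of image URLs to download.
IMAGES_URLS_FIELD = "front_image_url"

# Directory where downloaded images are saved (assumed location).
IMAGES_STORE = "images"

# Illustration only, not needed in settings.py: pipelines sorted by priority,
# which is the order in which each item passes through them.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Giving the image pipeline the smaller number means front_image_path is already set on the item by the time the storage pipeline sees it.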