1. Pick any innovation & entrepreneurship page, e.g.: https://www.sohu.com/a/257107420_100000347
Then crawl the data from that page.
Reference: https://blog.csdn.net/jessica__lu/article/details/85797342
A web page is made up of three parts:
HTML: HyperText Markup Language, a markup language. It consists of a series of tags; through these tags, documents across the network get a uniform format, linking scattered Internet resources into one logical whole. An HTML document is descriptive text built from HTML commands, which can describe text, graphics, animation, sound, tables, links, and so on. The parts wrapped in < > are tags.
CSS: Cascading Style Sheets, a computer language for describing the presentation of documents such as HTML (an application of SGML) or XML (a subset of SGML). CSS can not only style a page statically, but also cooperate with scripting languages to restyle page elements dynamically. Its rules are written as selector { property: value; } declarations, typically inside <style> tags or style attributes.
JavaScript: a lightweight, interpreted or just-in-time-compiled programming language with first-class functions.
3.19
Basic commands and their meanings
The following is based on https://blog.csdn.net/jessica__lu/article/details/85868403
3.20
S2: describe where the content to crawl lives
Core: soup.select()
First open the page, right-click the part you want to scrape and choose Inspect, then use Copy selector on the highlighted code.
Small goal: strip away the useless structure and put the data into a container in a fixed format so it is easy to query.
Write a for loop to pull out every element and apply methods to it; use get_text() to release the text wrapped inside a tag.
At the same time, consolidate the information and put it all into a dict for easy lookup.
Use zip() to loop over all the element lists in one pass and build a dict from each group.
Things to note:
(1) An image link is an attribute of its tag, so fetch it with the get() method.
(2) 'cate': list(cate.stripped_strings) — written this way, the categories under cate map to a list.
Here the find_all() method counts how many places share a style, then len() computes how many entries are in the list.
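The steps above can be sketched with BeautifulSoup on a small inline HTML fragment. The tag and class names below are invented for illustration, not taken from the actual Sohu page:

```python
# A minimal sketch of the S2 workflow with bs4 (pip install bs4);
# the HTML structure here is made up for illustration.
from bs4 import BeautifulSoup

html = '''
<div class="item"><h3 class="title">Project A</h3>
  <img class="pic" src="/a.jpg">
  <p class="cates"><span>AI</span> <span>Finance</span></p></div>
<div class="item"><h3 class="title">Project B</h3>
  <img class="pic" src="/b.jpg">
  <p class="cates"><span>Hardware</span></p></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# soup.select() with selectors like the ones "Copy selector" produces
titles = soup.select('div.item > h3.title')
imgs = soup.select('div.item > img.pic')
cates = soup.select('div.item > p.cates')

# zip() loops over all the element lists in one pass; build a dict per item
data = []
for title, img, cate in zip(titles, imgs, cates):
    data.append({
        'title': title.get_text(),           # get_text() releases the tag's text
        'pic': img.get('src'),               # an image link is an attribute -> get()
        'cate': list(cate.stripped_strings)  # the categories map to a list
    })

print(data)
```

Each dict in `data` then holds one item's title, image link, and category list, ready to query.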
3.21
Whenever a problem appears at runtime, add the corresponding package where it is missing.
I tried many times before this; adding the bs4 package kept failing. After a lot of searching and many retries, the timeout problem was finally solved (reportedly it is caused by slow network speed within China).
(The URL at the red arrow in the figure above is the package's download address; if problems occur, you can open that URL, download the package yourself, and unzip it into the Lib folder.)
Solution
Open a command prompt and manually run: pip install --default-timeout=100 bs4
Then go back to Available Packages, search for bs4, and click Install; finally it shows as follows:
The bs4 package installed successfully.
Now try running the code: https://blog.csdn.net/jessica__lu/article/details/86421580 (source code link)
Install the missing pymysql package the same way. But then there were problems again!
https://www.cnblogs.com/ly-520/p/11015685.html
https://blog.csdn.net/qq_43517653/article/details/99701246
https://blog.csdn.net/u010151698/article/details/79371234
Baidu Baike entry for Scrapy:
https://baike.baidu.com/item/scrapy/7914913?fr=aladdin
To use the crawler API in a way that won't go stale, follow the official documentation for the detailed workflow: https://docs.scrapy.org/en/latest/intro/tutorial.html. Here we keep using http://quotes.toscrape.com/ as the example site (its structure is simple and easy to work with, and its content does not change, so the crawler stays valid longer). We crawl the site's list of quotes and list of authors and save them into the corresponding tables of a MySQL database; the QuotesSpider and AuthorSpider classes crawl them respectively.
First create the database tables.
The database is named scrapy:
use scrapy;
# create the data tables
CREATE TABLE quotes (
    text VARCHAR(5000) NOT NULL,
    author VARCHAR(100) NOT NULL,
    tags VARCHAR(1000)
);
CREATE TABLE author (
    name VARCHAR(100) NOT NULL,
    birthdate VARCHAR(100),
    bio VARCHAR(5000)
);
# some debugging statements
# INSERT INTO quotes(text,author,tags) VALUES('aaa','aaa','aaa');
# SELECT * FROM quotes;
# DELETE FROM quotes;
After creation succeeds, it looks like the figure above.
Install the Scrapy framework under Anaconda.
Open the Anaconda Prompt and enter the command above to install Scrapy.
Verify the installation: type scrapy
https://baijiahao.baidu.com/s?id=1621695863688073415&wfr=spider&for=pc
Continuing in the Anaconda Prompt, create a new Scrapy project (here it was created on the desktop).
Enter: scrapy startproject tutorial (the last argument is the project name)
These are the files inside:
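From what I recall of Scrapy's project template, the generated layout looks roughly like this (a sketch, not a capture of the actual run):

```text
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory for spider classes
            __init__.py
```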
Then open PyCharm and open the newly created project.
Next, paste in the code from the link above (https://blog.csdn.net/u010151698/article/details/79371234) and try running it.
quotes_spider.py
# quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # def start_requests(self):
    #     urls = [
    #         'http://quotes.toscrape.com/page/1/',
    #         'http://quotes.toscrape.com/page/2/',
    #     ]
    #     for url in urls:
    #         yield scrapy.Request(url=url, callback=self.parse)

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        # 'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # print(response.body)
        # page = response.url.split("/")[-2]
        # filename = 'quotes-%s.html' % page
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        # self.log('Saved file %s' % filename)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': ''.join(quote.css('div.tags a.tag::text').extract()),
            }
        # next_page = response.css('li.next a::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)
author_spider.py
# author_spider.py
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)
        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
Writing the crawled items into MySQL mainly means implementing the TutorialPipeline class.
Pipelines: this is where the scraped and parsed content is processed; through pipelines it can be written to local files or a database. (Reference: https://segmentfault.com/a/1190000008135000)
pipelines.py
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class TutorialPipeline(object):
    quotes_name = 'quotes'
    author_name = 'author'
    quotesInsert = '''insert into quotes(text,author,tags)
                      values('{text}','{author}','{tags}')'''
    authorInsert = '''insert into author(name,birthdate,bio)
                      values('{name}','{birthdate}','{bio}')'''

    def __init__(self, settings):
        self.settings = settings

    def process_item(self, item, spider):
        print(item)
        if spider.name == "quotes":
            sqltext = self.quotesInsert.format(
                text=pymysql.escape_string(item['text']),
                author=pymysql.escape_string(item['author']),
                tags=pymysql.escape_string(item['tags']))
            # spider.log(sqltext)
            self.cursor.execute(sqltext)
        elif spider.name == "author":
            sqltext = self.authorInsert.format(
                name=pymysql.escape_string(item['name']),
                birthdate=pymysql.escape_string(item['birthdate']),
                bio=pymysql.escape_string(item['bio']))
            # spider.log(sqltext)
            self.cursor.execute(sqltext)
        else:
            spider.log('Undefined name: %s' % spider.name)
        return item

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def open_spider(self, spider):
        # connect to the database
        self.connect = pymysql.connect(
            host=self.settings.get('MYSQL_HOST'),
            port=self.settings.get('MYSQL_PORT'),
            db=self.settings.get('MYSQL_DBNAME'),
            user=self.settings.get('MYSQL_USER'),
            passwd=self.settings.get('MYSQL_PASSWD'),
            charset='utf8',
            use_unicode=True)
        # use a cursor to execute inserts, deletes, queries, and updates
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
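A side note on the pipeline above: it builds SQL by string formatting plus pymysql.escape_string(), which works here but is fragile (escape_string was removed from newer pymysql releases, and string-built SQL invites injection). The safer pattern is a parameterized query, where the driver escapes values itself. A minimal sketch, using the stdlib sqlite3 so it runs anywhere; with pymysql the placeholder would be %s instead of ?:

```python
# Sketch: the same quotes insert done with a parameterized query instead
# of string formatting. sqlite3 (stdlib) stands in for pymysql here.
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE quotes (text TEXT, author TEXT, tags TEXT)')

item = {'text': "It's a test", 'author': 'Someone', 'tags': 'a,b'}
# placeholders: the driver escapes the values -- no escape_string() needed,
# and the embedded single quote in item['text'] is handled correctly
cur.execute('INSERT INTO quotes(text, author, tags) VALUES (?, ?, ?)',
            (item['text'], item['author'], item['tags']))
conn.commit()
```

In the pipeline this would replace the quotesInsert.format(...) + execute pair with a single self.cursor.execute(sql, params) call.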
# items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TutorialItem(scrapy.Item):
    # for the quotes table
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    # for the author table
    name = scrapy.Field()
    birthdate = scrapy.Field()
    bio = scrapy.Field()
Finally, add the following MySQL configuration at the end of settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for tutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tutorial.middlewares.TutorialDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'tutorial.pipelines.TutorialPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'scrapy'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'
MYSQL_PORT = 3306
Then add the Python interpreter: File → Settings → Add.
https://www.cnblogs.com/airnew/p/10152438.html
Install the libraries in the following order: lxml → zope.interface → pyopenssl → twisted → scrapy. (At runtime it reported that the pymysql library was also missing; find and install it the same way.)
An error occurred while installing scrapy here; the system suggested entering pip install scrapy in the terminal, as shown in the figure below:
After the download finishes, type scrapy to check,
which means the installation succeeded; do a trial run.
Run the spider
Also run it in the terminal, with the command scrapy crawl quotes.
Then check the database to see whether the crawl succeeded.
The crawl succeeded.
Same for the other table: enter scrapy crawl author on the command line.
The crawl succeeded.
Scrapy basics: some thoughts on how the parse() method works:
1. Because it uses yield rather than return, the parse function is treated as a generator. Scrapy fetches the results produced in parse() one by one and checks what type each result is;
2. if it is a Request, it joins the crawl queue; if it is an item, it is handled by the pipeline; any other type returns an error.
3. When Scrapy gets a Request from the first part, it does not send it immediately; it just puts the Request into the queue and keeps pulling from the generator;
4. once the first part's Requests are exhausted, it fetches the second part's items, and each item obtained is passed to the corresponding pipeline for processing;
5. the parse() method is assigned to the Request as a callback, designating parse() to handle those requests: scrapy.Request(url, callback=self.parse)
6. the Request object is scheduled and executed to produce a scrapy.http.Response object, which is sent back to parse(), until there are no Requests left in the scheduler (a recursive idea);
7. once everything is exhausted, parse() finishes, and the engine performs the corresponding operations according to the queue and the pipelines;
8. before extracting items from each page, the program first finishes all the requests already in the request queue, and only then extracts items.
9. The Scrapy engine and scheduler take responsibility for all of this.
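The mechanism in points 1–4 can be sketched in plain Python: a generator yields a mix of requests and items, and a loop dispatches each result by type. The Request class and the dispatch loop here are stand-ins for illustration, not real Scrapy internals:

```python
# Toy model of how the engine consumes parse() as a generator and
# dispatches each result by type; Request is a stand-in, not scrapy.Request.
class Request:
    def __init__(self, url):
        self.url = url

def parse():
    # yields items (dicts) and follow-up requests, like a spider's parse()
    yield {'text': 'quote 1'}
    yield {'text': 'quote 2'}
    yield Request('http://quotes.toscrape.com/page/2/')

queue, items = [], []
for result in parse():          # pull results from the generator one by one
    if isinstance(result, Request):
        queue.append(result)    # requests are queued, not sent immediately
    elif isinstance(result, dict):
        items.append(result)    # items go to the pipeline
    else:
        raise TypeError('unsupported result type')

print(len(queue), len(items))   # prints: 1 2
```

The real engine then pops queued requests, downloads them, and feeds each response back into the callback until the scheduler is empty.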
items.py: from the file's comments we learn that it defines the attributes of the information we want to crawl. An Item object is a container used to hold the scraped data.
middlewares.py: spider middleware; in this file we can define methods to process the spider's response input and request output.
pipelines.py: after an item has been collected by the spider, it is passed into the item pipelines. Each component here is a standalone class that receives the item, performs some action on it, and also decides whether the item stays in the pipeline or is dropped.
settings.py: provides the configuration for Scrapy components; the settings in this file control the core, extensions, pipelines, and spider components.