Scraping course information (programming-language courses) from CSDN Academy
Task: scrape the programming-language course listings from CSDN Academy.
URL: https://edu.csdn.net/courses/o280/p1 (page 1)
https://edu.csdn.net/courses/o280/p2 (page 2)
① Create the project. In a terminal, run the following command to create a project named educsdn: scrapy startproject educsdn. The project directory structure: educsdn
├── educsdn
│ ├── __init__.py
│ ├── __pycache__
│ ├── items.py # defines the Items, i.e. the structure of the scraped data
│ ├── middlewares.py # defines the Spider and Downloader middleware implementations
│ ├── pipelines.py # defines the Item Pipeline implementations, i.e. the data pipelines
│ ├── settings.py # global configuration for the project
│ └── spiders # holds the Spider implementations, one file per Spider
│ ├── __init__.py
│ └── __pycache__
└── scrapy.cfg # Scrapy deployment configuration: config file path, deploy info, etc.
② Enter the educsdn project directory and create the spider class file (courses) with the genspider command. The first argument is the Spider's name, the second is the website's domain: scrapy genspider courses edu.csdn.net
$ tree
├── educsdn
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-36.pyc
│ │ └── settings.cpython-36.pyc
│ ├── items.py
│ ├── middlewares.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ ├── __pycache__
│ │ └── __init__.cpython-36.pyc
│ └── courses.py # the new spider class file under the spiders directory
└── scrapy.cfg
# The generated courses.py contains the following code:
# -*- coding: utf-8 -*-
import scrapy

class CoursesSpider(scrapy.Spider):
    name = 'courses'
    allowed_domains = ['edu.csdn.net']
    start_urls = ['http://edu.csdn.net/']

    def parse(self, response):
        pass
③ Create the Item. An Item is the container for the scraped data; it is used much like a dict, but adds some protection against mistakes.
To create an Item, subclass scrapy.Item and declare fields of type scrapy.Field (course title, course URL, image, teacher, video duration, price).
The code is as follows (rename the class to CoursesItem):

import scrapy
class CoursesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    pic = scrapy.Field()
    teacher = scrapy.Field()
    time = scrapy.Field()
    price = scrapy.Field()
④ Parse the Response. In the courses.py file, the response argument of the parse() method is the downloaded result of the links in start_urls.
Data can be extracted with CSS selectors, XPath selectors, or re regular expressions:

# -*- coding: utf-8 -*-
import scrapy
from educsdn.items import CoursesItem

class CoursesSpider(scrapy.Spider):
    name = 'courses'
    allowed_domains = ['edu.csdn.net']
    start_urls = ['https://edu.csdn.net/courses/o280/p1']
    p = 1

    def parse(self, response):
        # parse and print the course titles
        #print(response.selector.css("div.course_dl_list span.title::text").extract())

        # get all courses on the page
        dlist = response.selector.css("div.course_dl_list")
        # iterate over the courses, parse the info and pack it into the item container
        for dd in dlist:
            item = CoursesItem()
            item['title'] = dd.css("span.title::text").extract_first()
            item['url'] = dd.css("a::attr(href)").extract_first()
            item['pic'] = dd.css("img::attr(src)").extract_first()
            item['teacher'] = dd.re_first("讲师:(.*?)</p>")
            item['time'] = dd.re_first("([0-9]+)课时")
            item['price'] = dd.re_first(r"¥([0-9.]+)")
            #print(item)
            #print("="*70)
            yield item

        # fetch the course info of the first 10 pages
        self.p += 1
        if self.p <= 10:
            next_url = 'https://edu.csdn.net/courses/o280/p' + str(self.p)
            url = response.urljoin(next_url)  # build an absolute URL (optional here, next_url is already absolute)
            yield scrapy.Request(url=url, callback=self.parse)
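The three re_first() patterns can be checked in isolation with the stdlib re module. The HTML fragment below is a made-up stand-in for one course block, not the real CSDN markup:

```python
import re

# hypothetical course block, mimicking the fields the spider extracts
sample = ('<dd><span class="title">Python零基础入门</span>'
          '<p>讲师:李老师</p><em>30课时</em><i>¥29.0</i></dd>')

teacher = re.search("讲师:(.*?)</p>", sample).group(1)  # non-greedy: stop at the closing tag
time = re.search("([0-9]+)课时", sample).group(1)        # digits before the "课时" (lessons) suffix
price = re.search(r"¥([0-9.]+)", sample).group(1)        # digits and a literal dot after the ¥ sign
print(teacher, time, price)  # 李老师 30 29.0
```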
⑤ Create the database and table. In MySQL, create the database csdndb and the data table courses:

CREATE TABLE `courses` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`pic` varchar(255) DEFAULT NULL,
`teacher` varchar(32) DEFAULT NULL,
`time` varchar(16) DEFAULT NULL,
`price` varchar(16) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
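To try the schema and the insert flow locally without a MySQL server, an in-memory SQLite stand-in can be used. Note this is only a sketch: AUTO_INCREMENT and ENGINE are MySQL-specific, so the DDL is adapted, and the sample row is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite equivalent of the MySQL table above
conn.execute("""
    CREATE TABLE courses (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        title   TEXT,
        url     TEXT,
        pic     TEXT,
        teacher TEXT,
        time    TEXT,
        price   TEXT
    )
""")
conn.execute(
    "INSERT INTO courses(title,url,pic,teacher,time,price) VALUES (?,?,?,?,?,?)",
    ("Python入门", "https://edu.csdn.net/course/detail/1", "cover.jpg", "李老师", "30", "29.0"),
)
row = conn.execute("SELECT title, teacher, price FROM courses").fetchone()
print(row)
```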
⑥ Use an Item Pipeline. The Item Pipeline is the project's data pipeline: once an Item is generated, it is automatically sent through the Item Pipelines for processing.
Item Pipelines are commonly used to: clean HTML data;
validate the scraped data and check the scraped fields;
filter out and drop duplicate content;
and save the results to the database.

import pymysql
from scrapy.exceptions import DropItem

class EducsdnPipeline(object):
    def process_item(self, item, spider):
        if item['price'] is None:
            raise DropItem("Drop item found: %s" % item)
        else:
            return item

class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port
        self.db = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host = crawler.settings.get("MYSQL_HOST"),
            database = crawler.settings.get("MYSQL_DATABASE"),
            user = crawler.settings.get("MYSQL_USER"),
            password = crawler.settings.get("MYSQL_PASS"),
            port = crawler.settings.get("MYSQL_PORT")
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # use a parameterized query so quoting and escaping are handled by the driver
        sql = "insert into courses(title,url,pic,teacher,time,price) values(%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, (item['title'], item['url'], item['pic'],
                                  item['teacher'], str(item['time']), str(item['price'])))
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()
⑦ Edit the configuration file. Open settings.py, enable and configure ITEM_PIPELINES, and add the database connection settings:

ITEM_PIPELINES = {
'educsdn.pipelines.EducsdnPipeline': 300,
'educsdn.pipelines.MysqlPipeline': 301,
}
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'csdndb'
MYSQL_USER = 'root'
MYSQL_PASS = ''
MYSQL_PORT = 3306
⑧ Run the crawl. Execute the following command to start scraping:
scrapy crawl courses
END