1. Scrapy Crawlers
1. Installing the Scrapy framework:
(1) What is Scrapy: Scrapy is a Python web-crawling framework
(2) Low-pitfall installation: pip install scrapy (on Windows, if the build step fails, install a prebuilt Twisted wheel first and rerun pip)
2. Common Scrapy commands in practice:
Global commands (scrapy -h): fetch (download a URL with the Scrapy downloader); runspider (run a self-contained spider); also shell, startproject, genspider, version, view...
Project commands (only available inside a project): crawl, check, list, edit, parse...
3. Scrapy spiders:
First Scrapy spider: crawling Qiushibaike as an example
scrapy startproject name (create a new project)
scrapy crawl name (run a spider)
4. Automatic crawling with CrawlSpider in practice:
(1) Qiushibaike automatic crawler (CrawlSpider):
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from qsauto.items import QsautoItem

class Qiushi1Spider(CrawlSpider):
    name = 'qiushi1'
    allowed_domains = ['qiushibaike.com']
    '''
    start_urls = ['http://qiushibaike.com/']
    '''
    rules = (
        # 'acticle' was a typo -- the article links contain 'article'
        Rule(LinkExtractor(allow=r'article'), callback='parse_item', follow=True),
    )

    # Must be start_requests (plural); a method named start_request is never called by Scrapy
    def start_requests(self):
        ua = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 SE 2.X MetaSr 1.0'}
        yield Request('http://www.qiushibaike.com/', headers=ua)

    def parse_item(self, response):
        i = QsautoItem()  # instantiate the item class, don't assign the class itself
        i["content"] = response.xpath("//div[@class='content']/span/text()").extract()
        i["link"] = response.xpath("//a[@class='contentHerf']/@href").extract()  # '/herf' was a typo for '/@href'
        print(i["content"])
        print(i["link"])
        return i
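The two XPath expressions in parse_item can be tried outside Scrapy. A minimal sketch using the standard library's xml.etree.ElementTree (which only supports a small XPath subset; Scrapy itself uses the richer parsel/lxml engine) on a made-up fragment shaped like the markup the spider expects:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the structure the spider's selectors target
html = """
<root>
  <div class="content"><span>First joke text</span></div>
  <a class="contentHerf" href="/article/123">link</a>
</root>
"""

tree = ET.fromstring(html)

# Rough equivalent of //div[@class='content']/span/text()
contents = [span.text for span in tree.findall(".//div[@class='content']/span")]

# Rough equivalent of //a[@class='contentHerf']/@href
links = [a.get("href") for a in tree.findall(".//a[@class='contentHerf']")]

print(contents)  # ['First joke text']
print(links)     # ['/article/123']
```

Note that 'contentHerf' really is the class name used on the site (the misspelling is theirs), so it must be matched verbatim.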
2. Simulated-login crawler in practice
(1) Simulated-login crawler (Douban):
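Inside Scrapy this is usually done by posting the login form with scrapy.FormRequest. As a framework-free sketch of the same idea using only the standard library -- the endpoint URL and the field names form_email / form_password are placeholders, the real Douban form must be inspected in the browser:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Placeholder form fields -- read the real names from the login page's HTML
form = {
    "form_email": "user@example.com",
    "form_password": "secret",
}

# A cookie-aware opener so the session cookie survives across requests
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

data = urllib.parse.urlencode(form).encode("utf-8")
req = urllib.request.Request(
    "https://www.douban.com/login",         # assumed endpoint
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},  # many sites reject the default UA
)
# opener.open(req) would submit the form and store the session cookie in jar;
# it is not called here so the sketch stays network-free.
print(req.get_method())  # POST, because data is set
```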
3. Dangdang crawler in practice
(1) Dangdang store crawler (how to write the scraped content into a database):
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request

class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        item = DangdangItem()
        # 'tital' / '@tital' were typos for 'title' / '@title'
        item["title"] = response.xpath("//a[@class='pic']/@title").extract()
        item["link"] = response.xpath("//a[@class='pic']/@href").extract()
        item["comment"] = response.xpath("//a[@name='_1_p']/text()").extract()
        yield item
        # The original range() was left empty; pages 1-80 here, adjust as needed
        for i in range(1, 81):
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.54.06.00.00.00.html"
            yield Request(url, callback=self.parse)
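The pagination loop just substitutes the page number into Dangdang's category-URL pattern; the construction can be checked on its own, without running the spider:

```python
# Build the first few category-page URLs the spider would enqueue
base = "http://category.dangdang.com/pg{}-cp01.54.06.00.00.00.html"
urls = [base.format(i) for i in range(1, 4)]
for u in urls:
    print(u)
# first line: http://category.dangdang.com/pg1-cp01.54.06.00.00.00.html
```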
pipelines:
class DangdangPipeline:
    def process_item(self, item, spider):
        # The item fields are parallel lists; index them together
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            print(title)
            print(link)
            print(comment)
        return item
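The heading promises writing the scraped content into a database, but the pipeline above only prints it. A minimal sketch of a SQLite-backed variant using only the standard library -- the database file, table, and column names are made up; Scrapy calls open_spider/close_spider automatically once the pipeline is enabled:

```python
import sqlite3

class DangdangSqlitePipeline:
    """Stores each item's parallel title/link/comment lists as table rows."""

    def open_spider(self, spider):
        # Assumed file name; any path works
        self.conn = sqlite3.connect("dangdang.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS goods "
            "(title TEXT, link TEXT, comment TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # zip lines up the three parallel lists row by row
        rows = zip(item["title"], item["link"], item["comment"])
        self.conn.executemany(
            "INSERT INTO goods (title, link, comment) VALUES (?, ?, ?)", rows
        )
        return item
```

To activate it, register the class in the project's settings.py, e.g. ITEM_PIPELINES = {'dangdang.pipelines.DangdangSqlitePipeline': 300}.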
items:
import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()