Install the required packages from the command prompt (the first command upgrades pip itself):
python -m pip install --upgrade pip
pip install wheel
pip install lxml
pip install twisted
pip install pywin32
pip install scrapy
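Next, create the project. The first command switches the working directory to the desktop so the project is saved there, and the last one generates a spider named txms for v.qq.com:
cd desktop
scrapy startproject TXmovies
cd TXmovies
scrapy genspider txms v.qq.com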
Next, open the project and edit settings.py:
ROBOTSTXT_OBEY = False  # do not obey robots.txt
DOWNLOAD_DELAY = 1  # delay between downloads, in seconds
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
}
ITEM_PIPELINES = {
    'TXmovies.pipelines.TxmoviesPipeline': 300,
}
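A rough illustration of what DOWNLOAD_DELAY = 1 means in practice. This sketch assumes Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting is left at its default (enabled), in which case the wait between requests to the same site is drawn from 0.5 to 1.5 times DOWNLOAD_DELAY. The function name sample_delays is hypothetical, and the code only imitates that range; it is not Scrapy's own scheduling logic.

```python
import random


def sample_delays(download_delay, n, seed=0):
    """Draw n illustrative waits in the range 0.5x to 1.5x download_delay.

    Hypothetical helper: it mimics the documented range only and does
    not call into Scrapy.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same values
    low, high = 0.5 * download_delay, 1.5 * download_delay
    return [rng.uniform(low, high) for _ in range(n)]


if __name__ == '__main__':
    for wait in sample_delays(download_delay=1, n=5):
        print(f'{wait:.2f} s')
```

With a delay of 1, every sampled wait falls between 0.5 and 1.5 seconds, so the five pages requested by this project should take a few seconds in total rather than arriving all at once.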
Next, define the fields to extract in items.py:
import scrapy

class TxmoviesItem(scrapy.Item):
    name = scrapy.Field()  # movie title
    description = scrapy.Field()  # movie description
Next, write the spider in spiders/txms.py:
import scrapy
from ..items import TxmoviesItem

class TxmsSpider(scrapy.Spider):
    name = 'txms'
    allowed_domains = ['v.qq.com']
    start_urls = ['https://v.qq.com/x/bu/pagesheet/list?append=1&channel=cartoon&iarea=1&listpage=2&offset=0&pagesize=30']
    offset = 0  # offset of the page requested most recently

    def parse(self, response):
        lists = response.xpath('//div[@class="list_item"]')
        for i in lists:
            # Create a fresh item per entry so yielded items do not share state
            items = TxmoviesItem()
            items['name'] = i.xpath('./a/@title').get()
            items['description'] = i.xpath('./div/div/@title').get()
            yield items

        # Request the next page of 30 entries until offset 120 has been fetched
        if self.offset < 120:
            self.offset += 30
            url = 'https://v.qq.com/x/bu/pagesheet/list?append=1&channel=cartoon&iarea=1&listpage=2&offset={}&pagesize=30'.format(self.offset)
            yield scrapy.Request(url=url, callback=self.parse)
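The spider does two things: it extracts a title and a description from each list_item block, and it pages through offsets 0 to 120 in steps of 30. Both can be tried offline with the standard library, as in the sketch below. The SAMPLE markup is hypothetical, guessed from the spider's XPath expressions rather than copied from a real v.qq.com response, and the helper names extract and page_urls are made up for illustration. xml.etree.ElementTree supports only a subset of XPath, so attributes are read with .get() instead of the @title syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a guess at the shape implied by the spider's XPath
# expressions, not a copy of the real response.
SAMPLE = """
<root>
  <div class="list_item">
    <a title="Movie A" href="#"></a>
    <div><div title="Description A"></div></div>
  </div>
  <div class="list_item">
    <a title="Movie B" href="#"></a>
    <div><div title="Description B"></div></div>
  </div>
</root>
"""

BASE_URL = ('https://v.qq.com/x/bu/pagesheet/list?append=1&channel=cartoon'
            '&iarea=1&listpage=2&offset={}&pagesize=30')


def extract(fragment):
    """Return one dict per list_item block, mirroring the spider's two fields."""
    root = ET.fromstring(fragment)
    rows = []
    for node in root.findall(".//div[@class='list_item']"):
        link = node.find('./a')
        desc = node.find('./div/div')
        rows.append({
            'name': link.get('title') if link is not None else None,
            'description': desc.get('title') if desc is not None else None,
        })
    return rows


def page_urls(step=30, last_offset=120):
    """List every page URL the spider ends up requesting, in order."""
    return [BASE_URL.format(offset) for offset in range(0, last_offset + 1, step)]


if __name__ == '__main__':
    for row in extract(SAMPLE):
        print(row)
    for url in page_urls():
        print(url)
```

Listing the URLs up front makes the stopping condition easy to check: the offsets are 0, 30, 60, 90 and 120, which is five requests in total.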
Next, hand the items to the pipeline for output in pipelines.py:
class TxmoviesPipeline(object):
    def process_item(self, item, spider):
        print(item)  # print each scraped item to the console
        return item
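Printing is enough to confirm the spider works. If the items should be kept, a pipeline can write them to a file instead. The following is a minimal sketch, not part of the original project: the class name JsonLinesPipeline and the file name movies.jl are assumptions, while open_spider, close_spider and process_item are the standard pipeline hooks that Scrapy calls.

```python
import json


class JsonLinesPipeline:
    """Hypothetical alternative pipeline: writes one JSON object per line."""

    def __init__(self, path='movies.jl'):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

To try it, the class would go in pipelines.py and be registered in ITEM_PIPELINES as 'TXmovies.pipelines.JsonLinesPipeline', either in place of or alongside TxmoviesPipeline.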
Finally, create run.py in the project directory to start the crawl:
from scrapy import cmdline

cmdline.execute('scrapy crawl txms'.split())