1: Using Scrapy
Install the required packages from the cmd console:
python -m pip install --upgrade pip
pip install wheel
pip install lxml
pip install twisted
pip install pywin32
pip install scrapy
2: Create the project
scrapy startproject TXmovies
cd TXmovies
scrapy genspider txms v.qq.com
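After these two commands, startproject and genspider leave roughly the following layout (this is Scrapy's standard scaffolding; txms.py is the generated spider stub):

```
TXmovies/
    scrapy.cfg
    TXmovies/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            txms.py
```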
3: Modify settings.py
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
}
ITEM_PIPELINES = {
    'TXmovies.pipelines.TxmoviesPipeline': 300,
}
4: Define the data to extract (items.py)
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class TxmoviesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
5: The spider (omitted)
First, the Tencent Video list URL is:
https://v.qq.com/x/bu/pagesheet/list?append=1&channel=cartoon&iarea=1&listpage=2&offset=0&pagesize=30
Note the offset parameter: it is 0 on the first page, 30 on the second, and so on. In the spider this parameter drives the pagination, but it must be given an upper bound rather than growing without limit, otherwise requests will eventually fail. You can either check how many pages of videos Tencent actually has, or add exception handling that exits once a request errors out.
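The pagination described above can be sketched as a small helper that builds the list-page URLs. This is only an illustration, not the omitted spider itself: the step size of 30 matches the pagesize parameter, and the 120 bound is an arbitrary example value that should come from the real page count.

```python
# A minimal sketch of the offset pagination described above.
# The max_offset default of 120 is an arbitrary example bound.
BASE_URL = ('https://v.qq.com/x/bu/pagesheet/list?append=1&channel=cartoon'
            '&iarea=1&listpage=2&offset={offset}&pagesize=30')


def page_urls(max_offset=120, step=30):
    """Yield list-page URLs with offset 0, 30, 60, ... up to max_offset."""
    for offset in range(0, max_offset + 1, step):
        yield BASE_URL.format(offset=offset)
```

In a real spider, each of these URLs would be wrapped in a scrapy.Request with the parse method as its callback.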
6: Pipeline output
class TxmoviesPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
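As a variant of the print pipeline above, here is a sketch of a pipeline that writes each item to a JSON-lines file instead. The JsonWriterPipeline name and the items.jl filename are examples, not part of the original project; to use it you would register it in ITEM_PIPELINES just like TxmoviesPipeline.

```python
import json


class JsonWriterPipeline(object):
    # Example pipeline: appends each item to a JSON-lines file.
    # Scrapy calls open_spider/close_spider around the crawl automatically.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy Items and plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item
```

Returning the item keeps it flowing to any lower-priority pipelines configured in ITEM_PIPELINES.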
7: run.py to execute the crawl
from scrapy import cmdline
cmdline.execute('scrapy crawl txms'.split())