(Class notes)
1. First, install Scrapy and its dependencies from cmd:
python -m pip install --upgrade pip
pip install wheel
pip install lxml
pip install twisted
pip install pywin32
pip install scrapy
Open a terminal and enter the commands below (it is best to cd into a suitable directory first; the default is the C: drive).
2. Create the project
scrapy startproject TXmovies
cd TXmovies
scrapy genspider txms v.qq.com
This creates a TXmovies project folder in the directory where you ran the command (the desktop, in this case).
3. Edit settings.py

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
}
ITEM_PIPELINES = {
    'TXmovies.pipelines.TxmoviesPipeline': 300,
}
4. Define the data to extract (items.py)

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TxmoviesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
5. The spider (omitted)
First, the Tencent Video listing URL is:
https://v.qq.com/x/bu/pagesheet/list?append=1&channel=cartoon&iarea=1&listpage=2&offset=0&pagesize=30
Note the offset parameter: it is 0 for the first page, 30 for the second, and so on. In the spider this parameter controls pagination, but it needs an upper bound; it cannot grow without limit, or the requests will eventually fail. You can check how many pages of videos Tencent actually has, or add exception handling that stops crawling once a request errors out.
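The offset arithmetic described above can be sketched in plain Python, independent of Scrapy. This is a minimal sketch; the function name `page_urls` and the default bound `max_pages=5` are illustrative assumptions, not part of the notes, and in a real spider the bound would come from the site's actual page count or from error handling.

```python
# Sketch of the offset pagination: page 0 -> offset=0, page 1 -> offset=30, ...
# max_pages is an assumed upper bound, not a value from the notes.
BASE_URL = ("https://v.qq.com/x/bu/pagesheet/list"
            "?append=1&channel=cartoon&iarea=1&listpage=2"
            "&pagesize=30&offset={offset}")

def page_urls(max_pages=5, page_size=30):
    """Yield one listing URL per page; the offset steps by page_size."""
    for page in range(max_pages):
        yield BASE_URL.format(offset=page * page_size)

for url in page_urls(max_pages=3):
    print(url)
```

In a Scrapy spider, each of these URLs would be turned into a `scrapy.Request` whose callback parses the page and yields items.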
6. Pipeline output
class TxmoviesPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item
7. Run script (run.py)
from scrapy import cmdline
cmdline.execute('scrapy crawl txms'.split())