2: Crawling multiple pages with a regular URL pattern

Example site: http://www.luoo.net/music/<issue number>

e.g. http://www.luoo.net/music/760

The goal is to scrape each issue's title (for this issue, "Hello World"), its pic (cover image), and its desc (the text beginning "The music of this issue is ...《8-bit Love》").

Steps:

1) Create the project

  In a shell, under the directory of your choice: scrapy startproject luoo

  Open the luoo folder in PyCharm
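  For orientation, scrapy startproject typically generates a layout roughly like the one below (exact files vary a little between Scrapy versions):

luoo/
    scrapy.cfg          # deployment configuration
    luoo/
        __init__.py
        items.py        # item definitions (step 2)
        pipelines.py    # item pipelines (left untouched here)
        settings.py     # project settings
        spiders/        # spider code goes here (step 3)
            __init__.py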

2) Write items.py

import scrapy

class LuooItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    pic = scrapy.Field()
    desc = scrapy.Field()
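As a quick sanity check (a minimal sketch, not part of the project files): once the fields are declared, a scrapy.Item behaves much like a dict, which is how the spider in step 3 fills it in.

>>> from luoo.items import LuooItem
>>> item = LuooItem()
>>> item['title'] = 'Hello World'
>>> dict(item)
{'title': 'Hello World'}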

 



3) Write the spider
  Create luoospider.py under the spiders folder

  
import scrapy
from luoo.items import LuooItem

class LuooSpider(scrapy.Spider):
    name = "luoo"
    allowed_domains = ["luoo.net"]
    # the issue URLs follow a regular pattern, so build the start list directly
    start_urls = []
    for i in range(750, 763):
        url = 'http://www.luoo.net/music/%s' % str(i)
        start_urls.append(url)

    def parse(self, response):
        item = LuooItem()
        item['url'] = response.url
        item['title'] = response.xpath('//span[@class="vol-title"]/text()').extract()
        item['pic'] = response.xpath('//img[@class="vol-cover"]/@src').extract()
        item['desc'] = response.xpath('//div[@class="vol-desc"]/text()').extract()
        return item
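Before running the spider, the XPath expressions can be checked interactively with scrapy shell (shown here against issue 760; the selectors are assumed to match the page structure at the time of writing):

scrapy shell "http://www.luoo.net/music/760"
>>> response.xpath('//span[@class="vol-title"]/text()').extract_first()
>>> response.xpath('//img[@class="vol-cover"]/@src').extract_first()
>>> response.xpath('//div[@class="vol-desc"]/text()').extract()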

 

4) Leave pipelines.py unchanged
5) In the command line, change into the luoo directory
  scrapy list lists the available spiders (here: luoo)
  scrapy crawl luoo -o result.csv (runs the spider and saves the output as result.csv in the current directory)
6) Open result.csv with Notepad++, change the encoding to ANSI and save; Excel will then open it without garbled characters
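As an alternative to the Notepad++ step, newer Scrapy versions let you set the feed export encoding in settings.py; utf-8-sig writes a BOM that Excel recognizes. This is a suggested tweak, not part of the original workflow:

# settings.py
FEED_EXPORT_ENCODING = 'utf-8-sig'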

*Remaining to-dos:
1) Consider migrating the data into a MySQL database later (see the sketch below)
2) Save the cover images separately into an images folder
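A minimal sketch of what such a MySQL pipeline might look like, assuming a local database named luoo with a volumes table (the table, columns, and connection parameters are made up for illustration) and the pymysql driver; it would also have to be enabled via ITEM_PIPELINES in settings.py. For the images, Scrapy's built-in ImagesPipeline is one option worth looking at.

# pipelines.py (sketch)
import pymysql

class MysqlPipeline(object):
    def open_spider(self, spider):
        # assumed connection parameters; adjust to your environment
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='', db='luoo', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # hypothetical "volumes" table with url/title/pic/desc columns
        self.cursor.execute(
            "INSERT INTO volumes (url, title, pic, `desc`) VALUES (%s, %s, %s, %s)",
            (item['url'],
             ''.join(item['title']),
             ''.join(item['pic']),
             ''.join(item['desc'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()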


memory: For the record, here is the same functionality implemented two months ago with the urllib library (Python 3.4).
     Looking at it now, Scrapy is so much more convenient, to say nothing of its impressive extensibility:
import urllib.request
import re
import time

def openurl(urls):
    htmls = []
    for url in urls:
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36')
        # Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0
        response = urllib.request.urlopen(req)  # pass the Request so the header is actually sent
        htmls.append(response.read())
        time.sleep(5)  # be polite: pause between requests
    return htmls

def jiexi(htmls):
    pics = []
    titles = []
    contents = []
    for html in htmls:
        html = html.decode('utf-8')
        pics.append(re.findall('<div class="player-wrapper".*?>.*?<img.*?src="(.*?).jp.*?".*?alt=".*"', html, re.S))
        titles.append(re.findall('class="vol-title">(.*?)</span>', html, re.S))
        contents.append(re.findall('<div.*?class="vol-desc">.*?(.*?)</div>', html, re.S))

    i = len(titles)
    with open('C:\\Users\\Administrator\\Desktop\\test.txt', 'w') as f:
        for x in range(i):
            print("Writing issue %d of %d" % (x + 1, i))
            f.write("Title: " + str(titles[x])[2:-2] + "\n")
            f.write("Image link: " + str(pics[x])[2:-2] + ".jpg\n")
            content = str(contents[x])[4:-2]
            content = content.strip()              # str methods return new strings, so reassign
            print(content.count("<br>\n"))
            content = content.replace("<br>\n", "#")
            f.write("Accompanying text: " + content + "\n\n\n")


yur = 'http://www.luoo.net/music/'
urls = []
for i in range(657, 659):
    urls.append(yur + str(i))

htmls = openurl(urls)
jiexi(htmls)  # jiexi() writes the results to disk and returns nothing

 

Reposted from: https://www.cnblogs.com/pengsixiong/p/4908595.html
