Scraping Douban Music with Python and Storing It in MongoDB (Python Crawler in Practice, Part 1)
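1. Techniques used in crawler design
1) Data acquisition: fetch the site's data over HTTP, using modules such as urllib, urllib2, or requests;
2) Data extraction: process the data fetched from the website and pull out the required fields; commonly used tools are regular expressions (re), BeautifulSoup, and XPath;
3) Data storage: persist the extracted data; common targets include plain files, CSV files, Excel, MongoDB, and MySQL.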
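The extraction and storage stages can be sketched with the standard library alone. The snippet below is a minimal illustration, not the crawler used later in this post: the HTML sample, the `extract_titles` and `store_csv` names, and the CSV layout are assumptions made up for the example, and it targets Python 3. The fetch stage is left out so the example runs offline.

```python
import csv
import io
import re

# Hypothetical page fragment standing in for HTML returned by the fetch stage.
SAMPLE_HTML = """
<div class="pl2"><a href="https://example.com/subject/1/">Album One</a></div>
<div class="pl2"><a href="https://example.com/subject/2/">Album Two</a></div>
"""


def extract_titles(html):
    """Extraction stage: pull (url, title) pairs out of the HTML with a regex."""
    pattern = r'<div class="pl2"><a href="(.*?)">(.*?)</a></div>'
    return re.findall(pattern, html)


def store_csv(rows, fileobj):
    """Storage stage: write the extracted rows to any file-like object as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(["url", "title"])
    writer.writerows(rows)


rows = extract_titles(SAMPLE_HTML)
buffer = io.StringIO()  # in-memory stand-in for a file opened for writing
store_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping each stage behind its own function means the storage target can be swapped (a CSV file here, a database elsewhere) without touching the extraction logic.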
2. Environment
1) Python 2.7
2) MongoDB 2.6
3) Modules used: re, requests, lxml, pymongo
3. Code
#!/usr/bin/python
# -*- coding:utf8 -*-
# author: HappyLau, blog: http://www.cnblogs.com/cloudlab/
# Purpose: scrape the Douban Top 250 music list and store the results in MongoDB

import re
import sys
import requests
import pymongo
from time import sleep
from lxml import etree

# Python 2 only: make UTF-8 the default for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf8')


def get_web_html(url):
    '''
    @params: url. Fetch the page's HTML source with requests and return it
    as UTF-8 bytes; return '' on any failure.
    '''
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
    }
    response = ''  # defined up front so the return below always has a value
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print(e)
    return response


def get_music_url(url):
    '''
    @params: url of a list page. Return the detail-page URLs found on it via XPath.
    '''
    response = get_web_html(url)
    if not response:
        return []
    selector = etree.HTML(response)
    return selector.xpath('//div[@class="pl2"]/a/@href')


def get_music_info(url):
    '''
    @params: url of a music detail page. Scrape selected fields and store them.
    '''
    print("Fetching music details from %s ..." % (url))
    response = get_web_html(url)
    if not response:
        return
    selector = etree.HTML(response)
    music_name = selector.xpath('//div[@id="wrapper"]/h1/span/text()')[0].strip()
    author = selector.xpath('//div[@id="info"]/span/span/a/text()')[0].strip()
    # The genre line is optional, so check the match list before indexing
    styles = re.findall(r'<span class="pl">流派:</span> (.*?)<br />', response, re.S|re.M)
    if len(styles) == 0:
        style = 'unknown'
    else:
        style = styles[0].strip()
    publish_time = re.findall(r'<span class="pl">发行时间:</span> (.*?)<br />', response, re.S|re.M)[0].strip()
    # The publisher line is optional too; keep the full list until after the check
    publish_users = re.findall(r'<span class="pl">出版者:</span> (.*?)<br />', response, re.S|re.M)
    if len(publish_users) == 0:
        publish_user = 'unknown'
    else:
        publish_user = publish_users[0].strip()
    scores = selector.xpath('//strong[@class="ll rating_num"]/text()')[0].strip()
    music_info_data = {
        "music_name": music_name,
        "author": author,
        "style": style,
        "publish_time": publish_time,
        "publish_user": publish_user,
        "scores": scores
    }
    write_into_mongo(music_info_data)


def write_into_mongo(data):
    '''
    @params: data, a dict holding one record, which is written to MongoDB
    '''
    print("Inserting record: %s" % (data,))
    try:
        client = pymongo.MongoClient('localhost', 27017)
        db = client.db
        table = db['douban_book']
        table.insert_one(data)
    except Exception as e:
        print(e)


def main():
    '''Entry point: walk the 10 list pages, then every detail page linked from them.'''
    urls = ['https://music.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
    for url in urls:
        for u in get_music_url(url):
            get_music_info(u)
            sleep(1)  # pause between detail-page requests


if __name__ == "__main__":
    main()
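Several fields on a detail page are optional, so each regex lookup needs a fallback. One way to avoid repeating the `findall`/`len` check is a small helper. The sketch below assumes Python 3; `extract_field`, its `default` parameter, and the sample markup are hypothetical names and data introduced for illustration, not part of the script above.

```python
import re

# Made-up fragment in the shape the script's patterns expect; real pages may differ.
SAMPLE_INFO = (
    '<span class="pl">流派:</span> 摇滚<br />\n'
    '<span class="pl">发行时间:</span> 2005-06-06<br />\n'
)


def extract_field(label, html, default="unknown"):
    """Return the text that follows a <span class="pl">label:</span> marker.

    Falls back to `default` when the label is absent instead of raising IndexError.
    """
    pattern = r'<span class="pl">' + re.escape(label) + r':</span>(.*?)<br />'
    matches = re.findall(pattern, html, re.S)
    return matches[0].strip() if matches else default


style = extract_field("流派", SAMPLE_INFO)
publish_time = extract_field("发行时间", SAMPLE_INFO)
publish_user = extract_field("出版者", SAMPLE_INFO)  # absent in the sample
print(style, publish_time, publish_user)
```

With a helper like this, a missing release date or publisher yields the default value rather than aborting the whole crawl.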
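4. Summary
When using a regular expression (re) to extract the music genre, the markup shown by the browser's "inspect element" view differs from what requests actually returns; write the pattern against the output of requests.get(). In addition, re initially failed to capture any valid data during extraction, and the cause turned out to be an encoding problem: the Chinese pattern literals in the script are UTF-8 byte strings, so the page data has to be UTF-8 bytes as well. Converting the page data to UTF-8 with req.text.encode('utf8') fixes this, and using req.content instead achieves the same effect.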
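The encoding issue comes down to matching like with like: a regex pattern and the text it searches must both be byte strings or both be Unicode strings. The sketch below illustrates this with an in-memory string standing in for `req.text`, so no request is made. It assumes Python 3, where mixing the two types raises a `TypeError`; the variable names and sample markup are illustrative only.

```python
import re

# Stand-ins for req.text (decoded Unicode) and req.content (raw bytes) of a UTF-8 page.
text = '<span class="pl">流派:</span> 摇滚<br />'
raw = text.encode("utf8")

text_pattern = r'<span class="pl">流派:</span> (.*?)<br />'
bytes_pattern = text_pattern.encode("utf8")

# Same type on both sides: each search finds the genre.
from_text = re.findall(text_pattern, text)
from_raw = re.findall(bytes_pattern, raw)

# Mixed types: Python 3 refuses to apply a str pattern to bytes.
try:
    re.findall(text_pattern, raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(from_text, from_raw[0].decode("utf8"), mixed_ok)
```

Either pairing works as long as it is consistent: decode once and use text patterns throughout, or keep raw bytes and encode the patterns.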