Scraping Douban Music with Python and Storing the Results in MongoDB

### 1. Techniques Used in the Crawler

1) Data fetching: retrieve the site's pages over HTTP, using modules such as urllib, urllib2, or requests;

2) Data extraction: parse the fetched pages and pull out the fields you need; common techniques include regular expressions (re), BeautifulSoup, and XPath;

3) Data storage: persist the extracted data; common destinations include plain files, CSV files, Excel, MongoDB, and MySQL.
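Of the three steps, extraction is usually the fiddliest. As a minimal sketch of the regular-expression approach (using a made-up HTML fragment shaped like a Douban album detail page, not a live response):

```python
import re

# Hypothetical fragment in the same shape as a Douban detail page
html = u'<span class="pl">流派:</span>&nbsp;摇滚<br />'

# Non-greedy capture of everything between the label and the <br /> tag
styles = re.findall(u'<span class="pl">流派:</span>&nbsp;(.*?)<br />', html, re.S)
style = styles[0].strip() if styles else u'未知'
print(style)  # 摇滚
```

Falling back to a default when the list is empty avoids an `IndexError` on pages that omit the field.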

### 2. Environment

1) Python 2.7

2) MongoDB 2.6

3) Modules used: re, requests, lxml, pymongo

### 3. The Code

import re
import sys
import requests
import pymongo
from time import sleep
from lxml import etree
# Python 2 only: make UTF-8 the default encoding for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf8')


def get_web_html(url):
    '''
    @params: url - fetch the page's HTML source with requests and return it
    '''
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
    }
    response = ''  # default, so the function never returns an unbound name on failure
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def get_music_url(url):
    '''
    @params: url - a Top-250 list page; extract each album's detail-page URL via XPath
    '''
    music_url_list = []
    selector = etree.HTML(get_web_html(url))
    music_urls = selector.xpath('//div[@class="pl2"]/a/@href')
    for music_url in music_urls:
        music_url_list.append(music_url)
    return music_url_list

def get_music_info(url):
    '''
    @params: url - an album detail page; scrape its fields and hand them to MongoDB
    '''
    print "Fetching album info from %s ..." % (url)
    sleep(1)  # throttle requests so we do not hammer the site
    response = get_web_html(url)
    selector = etree.HTML(response)
    music_name = selector.xpath('//div[@id="wrapper"]/h1/span/text()')[0].strip()
    author = selector.xpath('//div[@id="info"]/span/span/a/text()')[0].strip()
    styles = re.findall(r'<span class="pl">流派:</span>&nbsp;(.*?)<br />', response, re.S | re.M)
    style = styles[0].strip() if styles else '未知'
    publish_times = re.findall(r'<span class="pl">发行时间:</span>&nbsp;(.*?)<br />', response, re.S | re.M)
    publish_time = publish_times[0].strip() if publish_times else '未知'
    # findall returns a list; check it is non-empty before indexing into it
    publish_users = re.findall(r'<span class="pl">出版者:</span>&nbsp;(.*?)<br />', response, re.S | re.M)
    publish_user = publish_users[0].strip() if publish_users else '未知'
    score = selector.xpath('//strong[@class="ll rating_num"]/text()')[0].strip()
    music_info_data = {
        "music_name": music_name,
        "author": author,
        "style": style,
        "publish_time": publish_time,
        "publish_user": publish_user,
        "score": score
    }
    write_into_mongo(music_info_data)

def write_into_mongo(data):
    '''
    @params: data - a dict of album fields to insert into MongoDB
    '''
    print "Inserting %s" % (data)
    try:
        client = pymongo.MongoClient('localhost', 27017)
        db = client['douban']
        table = db['douban_music']  # this script scrapes music, not books
        table.insert_one(data)
    except Exception as e:
        print e

def main():
    '''Entry point: walk all 10 pages of the Top 250 list.'''
    urls = ['https://music.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
    for url in urls:
        for u in get_music_url(url):
            get_music_info(u)
            

if __name__ == "__main__":
    main()
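Douban rate-limits aggressive clients, and as written a single failed request silently skips that album. One way to harden the script is a generic retry decorator around `get_web_html`. The sketch below is an illustrative addition, not part of the original script; the names `retry`, `times`, and `delay` are my own:

```python
import time

def retry(times=3, delay=1.0):
    """Retry a function up to `times` attempts, sleeping `delay` seconds between tries."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    time.sleep(delay)
            raise last_exc  # all attempts failed; surface the last error
        return wrapper
    return decorator
```

Placing `@retry(times=3, delay=2.0)` above `def get_web_html(url)` would then re-attempt transient network failures before giving up on a page.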
### 4. Alternative: BeautifulSoup and the Douban Books Top 250

First install pymongo:

```
pip install pymongo
```

Then scraping into MongoDB takes four steps:

1. Import the required libraries

```python
import requests
from bs4 import BeautifulSoup
import pymongo
```

2. Connect to the MongoDB database

```python
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["douban_book"]
collection = db["books"]
```

3. Fetch and parse the page

```python
url = "https://book.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
```

4. Extract the fields and store them in MongoDB

```python
for book in soup.find_all("tr", class_="item"):
    title = book.find("div", class_="pl2").a["title"]
    link = book.find("div", class_="pl2").a["href"]
    rating = book.find("span", class_="rating_nums").get_text()
    author = book.find("p", class_="pl").get_text()
    collection.insert_one({"title": title, "link": link, "rating": rating, "author": author})
```

The complete script:

```python
import requests
from bs4 import BeautifulSoup
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["douban_book"]
collection = db["books"]

url = "https://book.douban.com/top250"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")

for book in soup.find_all("tr", class_="item"):
    title = book.find("div", class_="pl2").a["title"]
    link = book.find("div", class_="pl2").a["href"]
    rating = book.find("span", class_="rating_nums").get_text()
    author = book.find("p", class_="pl").get_text()
    collection.insert_one({"title": title, "link": link, "rating": rating, "author": author})
```

Note that if this is your first time using MongoDB, you need to start the server before running the script:

```
mongod --dbpath D:\mongodb\data
```

where `D:\mongodb\data` is the data directory and can be changed as needed. Once the server is running, run the Python script above to store the scraped data in MongoDB.