Scraping upcoming movie data from Maoyan

Latest recommended article published on 2021-03-14 12:23:04

国民小师弟

Category: Python

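This post is mainly meant to spark interest in learning web scraping. Most of the code was adapted from templates found online; if anything here infringes on someone's work, I will remove it immediately.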

Getting started: first, the modules we need.

from lxml import etree
import requests
from bs4 import BeautifulSoup
import pymysql  # MySQL driver, used to store the scraped data

Define a function that fetches a web page:

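def getHTMLText(url):
    '''
    Fetch the HTML document at the given URL.
    '''
    try:
        # Request the page, waiting at most 6 seconds for a response
        res = requests.get(url, timeout=6)
        # Raise an exception if the status code signals an error (4xx or 5xx)
        res.raise_for_status()
        # Use the encoding guessed from the response body
        res.encoding = res.apparent_encoding
        # Return the page's HTML source
        return res.text
    except requests.RequestException:
        # Return an empty string so callers simply find nothing to parse
        return ''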

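Once we have the page content, we need to find the hyperlinks of the upcoming movies on the listing page; here we pick out the <a> tags. Before extracting the links, we need to create the tables that will store the scraped data.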

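def createtb():
    # Connection settings are the author's local values; adjust them to your setup
    db = pymysql.connect(host="localhost", user="root", password="root", database="mysql")
    cursor = db.cursor()
    # Table for the links to the movie detail pages
    cursor.execute("DROP TABLE IF EXISTS get_html")
    sql_get_html = """CREATE TABLE get_html (
            surl CHAR(200) )"""

    # Table for the title, release date, and description of each movie
    cursor.execute("DROP TABLE IF EXISTS newmovies")
    sql_newmovies = """CREATE TABLE newmovies (
        titles varchar(200) ,
        times varchar(200),
        notes varchar(800) )"""
    cursor.execute(sql_get_html)
    cursor.execute(sql_newmovies)
    cursor.close()
    db.close()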
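
For readers without a MySQL server at hand, the same table layout can be tried with Python's built-in sqlite3 module. This is only a minimal sketch under that assumption, not the author's setup: sqlite3 stands in for MySQL, `create_tables` is a hypothetical helper name, the inserted row is made-up sample data, and sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


def create_tables(db):
    """Create the two tables, dropping any earlier copies first."""
    db.execute("DROP TABLE IF EXISTS get_html")
    db.execute("CREATE TABLE get_html (surl CHAR(200))")
    db.execute("DROP TABLE IF EXISTS newmovies")
    db.execute(
        "CREATE TABLE newmovies ("
        "titles VARCHAR(200), times VARCHAR(200), notes VARCHAR(800))"
    )


# An in-memory database: nothing is written to disk
db = sqlite3.connect(":memory:")
create_tables(db)
create_tables(db)  # running it twice is safe because of DROP TABLE IF EXISTS

# Insert one made-up row using placeholders instead of string formatting
db.execute(
    "INSERT INTO newmovies (titles, times, notes) VALUES (?, ?, ?)",
    ("Sample title", "2021-01-01", "Sample description"),
)
rows = db.execute("SELECT titles, times, notes FROM newmovies").fetchall()
table_names = sorted(
    r[0] for r in db.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```

Dropping the tables before creating them means every run starts from empty tables, which suits a script that is rerun from scratch each time.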

Next, fetch the hyperlinks. The page contains many links, so we select the ones we want by their distinguishing feature.

def getsurl():
    '''
    Collect the hyperlinks found on each listing page.
    '''
    db = pymysql.connect(host="localhost", user="root", password="root", database="mysql")
    cursor = db.cursor()
    # Target site; swap in any site you like
    #url = 'https://maoyan.com/films?showType=2'
    insert_colmn = "INSERT INTO get_html(surl) VALUES (%s)"
    for tb in range(4):  # number of listing pages to fetch
        #url='https://maoyan.com/board/4?offset={}'.format(tb*10)
        url = 'https://maoyan.com/films?showType=2&offset={}'.format(tb*30)
        demo = getHTMLText(url)
        # Parse the HTML
        soup = BeautifulSoup(demo, 'html.parser')
        # Find every <a> tag that has an href attribute
        a_labels = soup.find_all('a', attrs={'href': True})
        # Store the href value (the hyperlink) of each <a> tag
        for a in a_labels:
            #print(a.get('href'))
            cursor.execute(insert_colmn, (a.get('href'),))
    # Keep only the links that point to movie detail pages
    cursor.execute("delete from get_html where surl not like '/films/%'")
    # Turn the relative paths into full URLs
    insert_colmn1 = ("update  get_html set surl=REPLACE(TRIM(surl),'/films/','https://maoyan.com/films/')")
    cursor.execute(insert_colmn1)
    cursor.close()
    db.commit()  # the changes are lost unless they are committed
    db.close()
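
The filtering and normalizing that the function above performs in SQL can also be done in Python before anything is stored. Here is a minimal sketch using only the standard library (`html.parser` in place of BeautifulSoup). `FilmLinkParser`, `extract_film_links`, and the sample HTML with its film IDs are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://maoyan.com"


class FilmLinkParser(HTMLParser):
    """Collect unique absolute URLs of <a> tags whose href starts with /films/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links to movie detail pages
        if href and href.startswith("/films/"):
            url = urljoin(BASE_URL, href)  # relative path -> full URL
            if url not in self.links:  # skip duplicates, keep first-seen order
                self.links.append(url)


def extract_film_links(html):
    parser = FilmLinkParser()
    parser.feed(html)
    return parser.links


# Made-up sample markup, not taken from the real site
SAMPLE_HTML = """
<a href="/films/1001">Movie A</a>
<a href="/news/42">News</a>
<a href="/films/1001">Movie A again</a>
<a href="/films/1002">Movie B</a>
<a>no href</a>
"""

film_links = extract_film_links(SAMPLE_HTML)
```

Filtering before the insert keeps unrelated links out of the database entirely, at the cost of moving that logic out of SQL.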

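The links are now filtered, normalized into full URLs, and stored in the get_html table. Next we read them back one at a time and scrape each page for the movie title, release date, and description.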

def getnewmovies():
    db = pymysql.connect(host="localhost", user="root", password="root", database="mysql")
    cursor = db.cursor()
    select_sql = "SELECT surl FROM get_html"
    insert_colmn = "INSERT INTO newmovies(titles,times,notes) VALUES (%s,%s,%s)"
    try:
        cursor.execute(select_sql)
        # Read every URL up front; mixing fetchone() with iteration over
        # the cursor skips rows and ends with an error on the last one
        for (url,) in cursor.fetchall():
            try:
                page = requests.get(url, timeout=6).text
                s = etree.HTML(page)
                titles = s.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()')[0].strip()
                times = s.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()')[0].strip()
                notes = s.xpath('//*[@id="app"]/div/div[1]/div/div[2]/div[1]/div[1]/div[2]/span/text()')[0].strip()
            except (requests.RequestException, etree.LxmlError, IndexError, AttributeError):
                # Skip pages that fail to load or do not match the XPath expressions
                print("Skipped {}".format(url))
                continue
            #print("{} {} {}".format(titles,times,notes))
            cursor.execute(insert_colmn, (titles, times, notes))
            db.commit()
    finally:
        cursor.close()
        db.close()
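
One detail worth checking in loops like this is how rows are consumed: calling fetchone() while also iterating over the same cursor advances it twice per pass. The sketch below shows the effect using the standard-library sqlite3 module in place of MySQL, an assumption made so the example runs without a server. The helper names and the four placeholder URLs are made up.

```python
import sqlite3


def make_cursor():
    """Return a cursor positioned on a SELECT over four placeholder URLs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE get_html (surl TEXT)")
    db.executemany(
        "INSERT INTO get_html VALUES (?)",
        [("url1",), ("url2",), ("url3",), ("url4",)],
    )
    return db.execute("SELECT surl FROM get_html ORDER BY rowid")


def read_mixed():
    """Mix fetchone() with cursor iteration: every other row is skipped."""
    cur = make_cursor()
    seen = []
    row = cur.fetchone()
    for _ in cur:  # the for loop itself consumes a row on each pass
        seen.append(row[0])
        row = cur.fetchone()
        if row is None:  # cursor ran out in the middle of the loop
            break
    return seen


def read_all():
    """Fetch all rows first, then loop over the resulting list."""
    cur = make_cursor()
    return [surl for (surl,) in cur.fetchall()]


mixed = read_mixed()
everything = read_all()
```

In this sketch `read_mixed` only ever sees half of the rows, while `read_all` sees each one exactly once. Without the `row is None` check, indexing the result of the final fetchone() would raise a TypeError.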

The main parts are done; all that is left is to run them.

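def main():
    createtb()      # create the tables
    getsurl()       # collect the links to the upcoming movies from the listing pages
    getnewmovies()  # scrape each movie page and store the results


if __name__ == '__main__':
    main()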

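This scraper drew on scripts written by many more experienced developers; thanks to all of them. Next, I plan to build a scraper for keywords in Douban movie reviews.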
