Crawler Practice: Movie Rankings and Batch Image Download (following an expert's tutorial)

Latest recommended article published 2024-05-02 22:07:39

酒中醉去梦中来

Category: Python crawlers


# Target URL: http://dianying.2345.com/top/
# Scraped fields: movie title, lead actors, synopsis, and poster image
'''
Scrape the latest movie ranking list.
url: http://dianying.2345.com/top/
Toolchain: requests + bs4
Python version: 3.7
'''
import requests as rs
import bs4

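def get_html(url):
    try:
        r = rs.get(url, timeout=30)  # give up after 30 seconds
        r.raise_for_status()  # raise on 4xx/5xx responses; see https://www.jianshu.com/p/159bea26f7b5
        r.encoding = 'gbk'  # decode the page as GBK
        return r.text
    except rs.RequestException as e:
        # Catch only request-related errors instead of using a bare except
        print("Request failed:", e)
        return ""  # an empty string signals failure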

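def get_content(url):
    html = get_html(url)
    soup = bs4.BeautifulSoup(html, 'lxml')  # lxml parser; see https://blog.csdn.net/zhangzejia/article/details/79658221
    # Find the <ul> that holds the movie ranking list
    movies_list = soup.find('ul', class_='picList clearfix')
    if movies_list is None:  # the request failed or the page layout changed
        print("Movie list not found")
        return
    movies = movies_list.find_all('li', limit=36)  # take at most 36 entries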

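    for top in movies:
        # Image link (protocol-relative, e.g. //imgwx3.2345.com/...)
        img_url = top.find('img')['src']

        name = top.find('span', class_='sTit').a.text
        # ':' is not allowed in file names, so replace it before the name
        # is used for either output file
        file_name = name.replace(':', '-')
        # Some entries have no release date; fall back to a placeholder
        try:
            time = top.find('span', class_='sIntro').text
        except AttributeError:
            time = "Release date unavailable"

        # Iterate over the child nodes of the 'pActor' paragraph so that
        # each actor's name is picked up separately
        actors = top.find('p', class_='pActor')
        actor = ''
        for act in actors.contents:
            if act.string:  # skip nodes that do not hold a single string
                actor = actor + act.string + '  '
        # Synopsis
        intro = top.find('p', class_='pTxt pIntroShow').text
        # Note: the image/ directory must already exist
        with open('image/' + file_name + '.txt', 'wb') as f:
            txt = "Title: {}\t{}\n{}\n{} \n \n ".format(name, time, actor, intro)
            print(txt)
            f.write(txt.encode())

        # Download the poster image.
        # No splitting is needed; the image loads with the URL as it is.
        # The variant with splitting was:
        #   f.write(rs.get("http:" + img_url.split("jpg")[-2] + "jpg").content)
        # The scraped image links come in two forms:
        #   //imgwx4.2345.com/dypcimg/img/c/63/sup190594_223x310.jpg?1492397216
        #   //imgwx3.2345.com/dypcimg/img/e/63/sup189627_223x310.jpg
        # The first form was said not to download directly, so split() cut the
        # URL at "jpg", dropped what followed, and "http:" was prepended.
        img_url2 = 'http:' + img_url
        with open('image/' + file_name + '.jpg', 'wb') as f:
            f.write(rs.get(img_url2, timeout=30).content)
        print(img_url2)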

def main():
    url = 'http://dianying.2345.com/top/'
    get_content(url)

if __name__ == "__main__":
    main()
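
The script handles two small details by hand: it replaces ':' in the movie title before using it as a file name, and its comments describe cutting the query string off an image link with split("jpg"). As a rough sketch of how both steps could be factored out using only the standard library, the two helpers below are hypothetical (the names safe_filename and normalize_img_url are not part of the original script), and the set of forbidden characters is an assumption based on common Windows file name rules. The original note says the images download fine without stripping the query string, so treat the URL helper as an optional alternative rather than a required fix.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: characters commonly rejected in Windows file names.
# The original script only replaces ':'.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_filename(name, replacement='-'):
    """Replace characters that are not allowed in file names."""
    return _FORBIDDEN.sub(replacement, name).strip()


def normalize_img_url(src, scheme='http'):
    """Make a protocol-relative src absolute and drop any query string."""
    parts = urlsplit(src, scheme=scheme)  # scheme is used only if src has none
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


if __name__ == "__main__":
    print(safe_filename('A:B'))
    print(normalize_img_url(
        '//imgwx4.2345.com/dypcimg/img/c/63/sup190594_223x310.jpg?1492397216'))
```

Inside the loop, these could stand in for name.replace(':', '-') and for 'http:' + img_url respectively. Parsing the URL instead of splitting on "jpg" avoids surprises when "jpg" appears elsewhere in the path.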
