Scraping the English Titles of the Douban Top 250 Movies

The code is as follows:

import requests
from bs4 import BeautifulSoup


def get_movies():
    # Send browser-like headers with every request
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    movie_list = []
    # The list spans 10 pages of 25 movies each (start=0, 25, ..., 225)
    for i in range(0, 10):
        link = 'https://movie.douban.com/top250?start=' + str(i * 25)
        r = requests.get(link, headers=headers, timeout=10)
        print(str(i + 1), 'page response status code:', r.status_code)
        soup = BeautifulSoup(r.text, 'lxml')
        # The titles of each movie sit inside a <div class="hd">
        div_list = soup.find_all('div', class_='hd')
        for each in div_list:
            # movie = each.a.span.text.strip()  # first span only
            movie = each.a.contents[3].text.strip()
            # Drop the leading '/' and '\xa0'
            movie = movie[2:]
            movie_list.append(movie)
            # print(each.a.contents[3].text.strip())
    return movie_list


movies = get_movies()
print(movies)

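In the code:

each.a.span only locates the first span tag under the a tag.

each.a.contents, by contrast, returns all children of the a tag, including the newline strings '\n'. For example, adding print(each.a.contents) inside the for loop prints the following (shown for the entry 肖申克的救赎, The Shawshank Redemption):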

['\n', <span class="title">肖申克的救赎</span>, '\n', <span class="title"> / The Shawshank Redemption</span>, '\n', <span class="other"> / 月黑高飞()  /  刺激1995()</span>, '\n'] 

The newline strings "\n" are part of the list, so each.a.contents[0] is the leading newline rather than anything useful.
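The difference between the two accessors can be reproduced offline. The following is a minimal sketch, assuming Beautiful Soup is installed; the markup is rebuilt by hand from the printed list above (the href value is a placeholder, and the built-in 'html.parser' is used here instead of 'lxml' only to avoid an extra dependency).

```python
from bs4 import BeautifulSoup

# Markup rebuilt from the printed contents above. '\xa0' stands for the
# non-breaking spaces around the '/' separator; href="#" is a placeholder.
HTML = (
    '<div class="hd"><a href="#">\n'
    '<span class="title">肖申克的救赎</span>\n'
    '<span class="title">\xa0/\xa0The Shawshank Redemption</span>\n'
    '<span class="other">\xa0/\xa0月黑高飞()  /  刺激1995()</span>\n'
    '</a></div>'
)

soup = BeautifulSoup(HTML, 'html.parser')
each = soup.find('div', class_='hd')

# .span follows only the first <span> under <a>
first_span = each.a.span.text

# .contents lists every child of <a>, newline strings included
children = each.a.contents

print(repr(first_span))
print(len(children))
print(repr(children[0]))
print(children[3])
```

The tags sit at the odd indices 1, 3 and 5, while the even indices hold the newline strings.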

The part we need, the English title, is therefore at index 3. Printing the result of movie=each.a.contents[3].text.strip() directly gives the following (again for the entry 肖申克的救赎):

['/\xa0The Shawshank Redemption',

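The English title is preceded by a "/" (which is part of the page text itself) and by "\xa0", the non-breaking space character.
movie=movie[2:] is therefore needed to slice these two characters off.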
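The slice movie[2:] works because every string here starts with exactly those two characters. As a defensive variant, here is a minimal sketch of a helper that removes the separator only when it is actually present; the name clean_title is hypothetical and not part of the original code.

```python
def clean_title(raw):
    """Strip surrounding whitespace and one leading '/' separator, if any."""
    # str.strip() treats the non-breaking space '\xa0' as whitespace
    text = raw.strip()
    if text.startswith('/'):
        # Remove the separator, then the whitespace that followed it
        text = text[1:].strip()
    return text


raw = '\xa0/\xa0The Shawshank Redemption'
print(repr(raw.strip()))      # '/\xa0The Shawshank Redemption'
print(repr(raw.strip()[2:]))  # 'The Shawshank Redemption'
print(clean_title(raw))       # The Shawshank Redemption
print(clean_title('Léon'))    # Léon
```

Unlike a fixed slice, the helper leaves a string without the "/" prefix untouched, whereas 'Léon'[2:] would cut it down to 'on'.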
The output is as follows:

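1 page response status code: 200
2 page response status code: 200
3 page response status code: 200
4 page response status code: 200
5 page response status code: 200
6 page response status code: 200
7 page response status code: 200
8 page response status code: 200
9 page response status code: 200
10 page response status code: 200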
['The Shawshank Redemption', '再见,我的妾  /  Farewell My Concubine', 'Léon', 'Forrest Gump', 'La vita è bella', 'Titanic', '千と千尋の神隠し', "Schindler's List", 'Inception', "Hachi: A Dog's Tale", 'WALL·E', '3 Idiots', "La leggenda del pianista sull'oceano", 'Les choristes', 'The Truman Show', '西遊記大結局之仙履奇緣', 'Interstellar', 'となりのトトロ', 'The Godfather', '도가니', '無間道', 'Zootopia', 'The Pursuit of Happyness', 'Flipped', 'Intouchables', 'Gone with the Wind', 'The Dark Knight', '人生  /  Lifetimes', 'Life of Pi', 'Witness for the Prosecution', 'Nuovo Cinema Paradiso', 'Devils on the Doorstep', 'The Lord of the Rings: The Return of the King', '12 Angry Men', '天空の城ラピュタ', 'Dangal', 'Up', '西遊記第壹佰零壹回之月光寶盒', 'Fight Club', 'Roman Holiday', 'ハウルの動く城', 'Scent of a Woman', '변호인', 'Das Leben der Anderen', 'Lock, Stock and Two Smoking Barrels', 'The Last Emperor', "One Flew Over the Cuckoo's Nest", 'Dead Poets Society', 'The Lord of the Rings: The Two Towers', '소원', 'V for Vendetta', 'The Godfather: Part Ⅱ', 'Coco', 'The Lord of the Rings: The Fellowship of the Ring', 'The Cove', '飲食男女', 'A Beautiful Mind', 'The Lion King', 'Love Letter', 'The Pianist', 'The Curious Case of Benjamin Button', 'Once Upon a Time in America', 'Contratiempo', 'The Matrix', 'بچههای آسمان', 'Malèna', '大闹天宫 上下集  /  The Monkey King', '让子弹飞一会儿  /  火烧云', 'Saving Private Ryan', "Harry Potter and the Sorcerer's Stone", 'The Prestige', 'Se7en', '嫌われ松子の一生', 'The Sound of Music', 'Pulp Fiction', "Le fabuleux destin d'Amélie Poulain", 'The Silence of the Lambs', 'Braveheart', 'Catch Me If You Can', 'The Butterfly Effect', 'Edward Scissorhands', '春光乍洩', 'Good Will Hunting', 'Shutter Island', 'The Grand Budapest Hotel', 'The Boy in the Striped Pajamas', 'おくりびと', 'Avatar', 'もののけ姫', 'In the Heat of the Sun', 'Identity', 'The Sixth Sense', 'Pirates of the Caribbean: The Curse of the Black Pearl', 'Jagten', 'Mary and Max', 'Brokeback Mountain', '重慶森林', 'Modern Times', '喜劇之王', '自白  /  母亲', 'Big Fish', 'Gone Girl', 'Yi yi  /  Yi yi: A 
One and a Two', '射鵰英雄傳之東成西就', '써니', 'Comrades: Almost a Love Story', 'Before Sunrise', 'リトル・フォレスト 夏・秋', 'How to Train Your Dragon', '耳をすませば', 'パプリカ', 'Call Me by Your Name', '倩女幽魂(87版)  /  倩女幽魂:妖魔道', '더 테러 라이브', '風の谷のナウシカ', 'Cidade de Deus', 'Detachment', 'Before Sunset', '菊次郎の夏', 'The Terminal', 'Harry Potter and the Deathly Hallows: Part 2', 'リトル・フォレスト 冬・春', '살인의 추억', '7번방의 선물', 'Despicable Me', '借りぐらしのアリエッティ', '蛍火の杜へ', '唐伯虎點秋香', 'Big Hero 6', 'The Dark Knight Rises', 'Monsters, Inc.', '歲月神偷', 'Saw', '七人の侍', 'The Bourne Ultimatum', 'Love Actually', 'The Croods', '誰も知らない', '囍宴', '火垂るの墓', '東邪西毒', 'A Better Tomorrow  /  Gangland Boss', 'Slumdog Millionaire', 'Black Swan', 'Memento', 'Hacksaw Ridge', '殡棺  /  The Coffin in the Mountain', 'Pride & Prejudice', 'About Time', 'Relatos salvajes', 'Rain Man', '緃横四海', 'The Godfather: Part III', 'Dallas Buyers Club', 'Toy Story 3', 'Hotel Rwanda', 'A Perfect World', '花樣年華', 'Manchester by the Sea', 'Océans', 'The Notebook', 'La grande vadrouille', 'おまえうまそうだな', 'Twenty Two  /  22', 'Django Unchained', 'Inside Out', 'Wreck-It Ralph', 'Ice Age', 'Legends of the Fall', '君の名は。', "Singin' in the Rain", 'I Am Sam', 'Three Billboards Outside Ebbing, Missouri', 'Whiplash', 'Artificial Intelligence: AI', 'Perfect Blue', '時をかける少女', 'Waterloo Bridge', 'Trainspotting', 'The Imitation Game', 'En man som heter Ove', 'Room', 'ハチ公物語', 'Perfetti sconosciuti', '羅生門', 'Triangle', '魔女の宅急便', '阿飛正傳', 'Perfume: The Story of a Murderer', "Prince Nezha's Triumph Against Dragon King  /  Nezha nao hai", 'Die Welle', 'The Reader', 'The Matrix Revolutions', '海街diary', 'Kekexili: Mountain Patrol', 'The Bourne Supremacy', 'The Bourne Identity', 'Lord of War', '牯嶺街少年殺人事件', 'Taare Zameen Par', 'Green Snake', 'جدایی نادر از سیمین', 'Psycho', 'Crazy Stone', 'Following', 'Terminator 2: Judgment Day', 'Source Code', 'Bajrangi Bhaijaan', '歩いても 歩いても', 'สิ่งเล็กเล็กที่เรียกว่า...รัก', 'Begin Again', '新龍門客棧', 'Crash', 'The Legend of Sealed Book  /  Secrets of the 
Heavenly Book', 'Requiem for a Dream', 'Before Midnight', 'Heidi', 'Inglourious Basterds', '東京物語', 'City Lights', 'The Green Mile', 'Coherence', 'Blood Diamond', 'The Man from Earth', 'E.T.: The Extra-Terrestrial', 'Thelma & Louise', '2001: A Space Odyssey', 'Spotlight', 'The Rock', 'Face/Off', 'A Clockwork Orange', '秒速5センチメートル', 'Il buono, il brutto, il cattivo.', 'Black Hawk Down', '功夫3D  /  Kung Fu Hustle', 'The Usual Suspects', 'Casablanca', '그대를 사랑합니다', "The King's Speech", 'Gattaca', 'American Beauty', 'Mad Max: Fury Road', 'The Bucket List', 'Wonder', 'Le grand bleu', 'Cast Away', 'Mr. Donkey', '鎗火', 'The English Patient', 'Into the Wild']

Process finished with exit code 0
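A few entries in the list above, such as '再见,我的妾  /  Farewell My Concubine', look like alternate-title text rather than a single title. One possible explanation, inferred from the output and not stated in the original, is that some movies have only one span with class title, so index 3 lands on the span with class other. The sketch below selects spans by class instead of by position. The second block of markup and the function name second_title are hypothetical, written only to illustrate that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure printed earlier.
# The second movie has a single span.title (an assumption).
HTML = (
    '<div class="hd"><a href="#">\n'
    '<span class="title">肖申克的救赎</span>\n'
    '<span class="title">\xa0/\xa0The Shawshank Redemption</span>\n'
    '<span class="other">\xa0/\xa0月黑高飞()  /  刺激1995()</span>\n'
    '</a></div>\n'
    '<div class="hd"><a href="#">\n'
    '<span class="title">霸王别姬</span>\n'
    '<span class="other">\xa0/\xa0再见,我的妾  /  Farewell My Concubine</span>\n'
    '</a></div>'
)


def second_title(div):
    """Return the cleaned text of the second span.title, or None."""
    titles = div.a.find_all('span', class_='title')
    if len(titles) < 2:
        return None  # no second title span for this movie
    # Strip whitespace, the leading '/' separator, then whitespace again
    return titles[1].get_text().strip().lstrip('/').strip()


soup = BeautifulSoup(HTML, 'html.parser')
divs = soup.find_all('div', class_='hd')

by_class = [second_title(div) for div in divs]
by_index = [div.a.contents[3].text.strip()[2:] for div in divs]

print(by_class)
print(by_index)
```

With this markup, the index-based version returns the alternate-title text for the second movie, while the class-based version returns None, which the caller can then handle explicitly.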

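Reference: Python Crawler Development, Static Web Page Scraping: Scraping the "Douban Movie Top 250" Movie Data (original title: Python开发爬虫之静态网页抓取篇:爬取“豆瓣电影 Top 250”电影数据)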
