A Few Pitfalls from My First Time Crawling Data with an API
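This was my first time using a website's own API to crawl data. Following an online tutorial, I fetched the top 20 movies from Douban, and the first step went smoothly.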
import urllib.request as request
import json

url = 'https://api.douban.com/v2/movie/top250'
crawl_content = request.urlopen(url).read()
# 'subjects' holds the list of movies in the response
top20 = json.loads(crawl_content.decode('utf8'))['subjects']
for movie in top20:
    # Request the details of each movie by its id
    url = 'https://api.douban.com/v2/movie/' + movie['id']
    movieContent = request.urlopen(url).read()
    # Parse the response once instead of once per field
    detail = json.loads(movieContent.decode('utf8'))
    print('{}: {}'.format(detail['title'], detail['rating']['average']))
Then the trouble started. I wanted to save the data to a file, so I added
with open("douban_name_movie.txt", "w") as outputfile:
    outputfile.write('{}, {}\n'.format(name, rank))
So the new code became
import urllib.request as request
import json

with open("douban_top20_movie", "w") as outputfile:
    url = 'https://api.douban.com/v2/movie/top250'
    crawl_content = request.urlopen(url).read()
    top20 = json.loads(crawl_content.decode('utf8'))['subjects']
    for movie in top20:
        url = 'https://api.douban.com/v2/movie/' + movie['id']
        movieContent = request.urlopen(url).read()
        # Parse the response once and reuse the result
        detail = json.loads(movieContent.decode('utf8'))
        title = detail['title']
        rank = detail['rating']['average']
        print('{}: {}'.format(title, rank))
        # Write the title, not the whole movie dict
        outputfile.write("{}, {}\n".format(title, rank))
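Problem 1: writing to the file failed with the error: 'gbk' codec can't encode character
When no encoding is given, open() falls back to the system's locale encoding, which here was GBK, as the error message shows. I found the fix in a blog post on resolving character conversion failures:
change the line to with open("douban_top20_movie", "w", encoding="utf-8") as outputfile: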
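The encoding error can be reproduced without calling the API. The sketch below is only an illustration, not the code from the blog post mentioned above: save_line is a hypothetical helper, and the sample string is made up. It contains a non-breaking space (\xa0), a character GBK cannot represent. Passing "gbk" explicitly simulates the locale default on any machine.

```python
import os
import tempfile


def save_line(path, text, encoding):
    """Write one line of text to path with the given encoding (hypothetical helper)."""
    with open(path, "w", encoding=encoding) as outputfile:
        outputfile.write(text + "\n")


# Made-up sample: a Chinese title, a non-breaking space, then a rating.
sample = "肖申克的救赎\xa09.6"

tmp_dir = tempfile.mkdtemp()
gbk_path = os.path.join(tmp_dir, "gbk.txt")
utf8_path = os.path.join(tmp_dir, "utf8.txt")

# GBK has no byte sequence for \xa0, so the write raises UnicodeEncodeError.
try:
    save_line(gbk_path, sample, "gbk")
    gbk_failed = False
except UnicodeEncodeError as err:
    gbk_failed = True
    print("gbk failed:", err)

# UTF-8 can encode every Unicode character, so the same write succeeds.
save_line(utf8_path, sample, "utf-8")
with open(utf8_path, encoding="utf-8") as f:
    round_trip = f.read()
print("utf-8 round trip ok:", round_trip == sample + "\n")
```

Reading the file back later needs the same encoding="utf-8" argument, otherwise the locale default is used again for decoding.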
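Problem 2: Then I found that the file "douban_top20_movie" contained only one movie, the last one. Why? I remembered a line from an earlier tutorial about open modes: r is read-only, w writes and overwrites the existing file, and a appends to the end of the file if it already exists. So changing "w" to "a" fixes it. Strictly speaking, "w" empties the file once, at the moment it is opened, not on every write. The symptom therefore means that the open call was running once per movie, or that the write had slipped outside the loop. With the file opened once before the loop and the write inside it, "w" also keeps every line. The downside of "a" is that each rerun appends another copy of the list.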
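The difference between the two modes is easy to check on a temporary file. Here is a minimal sketch with made-up data (the file names and the read_lines helper are hypothetical): "w" truncates the file at the moment open() is called, while "a" never truncates.

```python
import os
import tempfile

lines = ["movie A, 9.6", "movie B, 9.5", "movie C, 9.4"]  # made-up sample data
tmp_dir = tempfile.mkdtemp()


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


# Case 1: "w" opened once before the loop, so every line is kept.
once_path = os.path.join(tmp_dir, "once.txt")
with open(once_path, "w", encoding="utf-8") as outputfile:
    for line in lines:
        outputfile.write(line + "\n")

# Case 2: "w" reopened for each item, so each open() wipes the previous line.
reopen_path = os.path.join(tmp_dir, "reopen.txt")
for line in lines:
    with open(reopen_path, "w", encoding="utf-8") as outputfile:
        outputfile.write(line + "\n")

# Case 3: "a" never truncates, so running the same loop twice duplicates the data.
append_path = os.path.join(tmp_dir, "append.txt")
for _ in range(2):
    with open(append_path, "a", encoding="utf-8") as outputfile:
        for line in lines:
            outputfile.write(line + "\n")

print(read_lines(once_path))
print(read_lines(reopen_path))
print(len(read_lines(append_path)))
```

If the script is meant to produce a fresh list on every run, opening the file once with "w" before the loop is the safer choice. "a" suits logs that should grow across runs.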
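Problem 3: After running the script several times, I got an HTTP 400 error and the request failed. The site limits how many times the API can be called, and each run makes one request for the list plus one per movie, so the limit is reached quickly. The only fix I know is to wait a long time and try again (I don't know exactly how long).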
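Instead of rerunning the script by hand, the waiting can be automated. The following is a rough sketch, assuming the 400 response really is caused by the call limit. The function name fetch_with_retry, its parameters, and the default delays are all made up for illustration, since the actual limit is unknown.

```python
import time
import urllib.error


def fetch_with_retry(fetch, retries=3, base_delay=60, sleep=time.sleep):
    """Call fetch(); on HTTPError wait, then retry with a doubled delay.

    Every name and default value here is an assumption for illustration.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            print("HTTP {}; waiting {} s before retrying".format(err.code, delay))
            sleep(delay)
            delay *= 2  # back off further on each failure
```

In the script, a call such as request.urlopen(url).read() would become fetch_with_retry(lambda: request.urlopen(url).read()). Adding a short time.sleep between the per-movie requests may also keep the script under the limit in the first place.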
The final code is
import urllib.request as request
import json

# "a" appends to the file; utf-8 avoids the gbk encoding error
with open("douban_top20_movie", "a", encoding="utf-8") as outputfile:
    url = 'https://api.douban.com/v2/movie/top250'
    crawl_content = request.urlopen(url).read()
    top20 = json.loads(crawl_content.decode('utf8'))['subjects']
    for movie in top20:
        url = 'https://api.douban.com/v2/movie/' + movie['id']
        movieContent = request.urlopen(url).read()
        # Parse the response once and reuse the result
        detail = json.loads(movieContent.decode('utf8'))
        title = detail['title']
        rank = detail['rating']['average']
        print('{}: {}'.format(title, rank))
        # Write the title, not the whole movie dict
        outputfile.write("{}, {}\n".format(title, rank))