I recently practiced writing a crawler in Python to scrape movie ratings from Douban. I ran into quite a few problems along the way, so I'm recording them here.
For network requests I used urllib3.
For parsing I tried both BeautifulSoup and lxml. In my experience BeautifulSoup is much slower than lxml. BeautifulSoup is very easy to pick up, while lxml gave me some trouble at first; once you're familiar with its syntax, though, it turns out to be much more convenient than BeautifulSoup. Since BeautifulSoup is written in pure Python and lxml is written in C, BeautifulSoup's performance is much weaker. For the same 300 records, BeautifulSoup took about 17s and lxml about 9s, although this also depends on machine performance and how the data is processed.
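To illustrate how differently the two libraries feel, here is a minimal side-by-side sketch parsing the same snippet; the sample HTML is made up for demonstration and is much simpler than the real Douban page:

```python
from bs4 import BeautifulSoup
from lxml import etree

html = '<div class="bd doulist-subject"><div class="title"><a href="/m/1">肖申克的救赎</a></div></div>'

# BeautifulSoup: attribute-style navigation, very forgiving to learn
soup = BeautifulSoup(html, "html.parser")
bs_title = soup.find("div", "title").a.string

# lxml: XPath expressions, terser once you know the syntax
tree = etree.HTML(html)
lx_title = tree.xpath('//div[@class="title"]/a/text()')[0]
```

Both extract the same title; the lxml version is a single XPath expression, which is part of why it ends up more convenient once the syntax is familiar.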
First I used BeautifulSoup and saved the results to an Excel file.
1. Downloading the URL:
import ssl
import urllib3
from urllib3.exceptions import InsecureRequestWarning

def download(url):
    # Skip certificate verification and silence the resulting warnings
    ssl._create_default_https_context = ssl._create_unverified_context
    urllib3.disable_warnings(InsecureRequestWarning)
    http = urllib3.PoolManager()
    ip = ['121.34.156.197', '175.31.128.78', '124.219.217.120']
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'https://www.douban.com/doulist/240962/?start=50&sort=seq&sub_type=',
        'Connection': 'keep-alive',
        'X-Forwarded-For': ip[2]  # pretend the request comes from this address
    }
    r = http.request("GET", url, headers=headers)
    return r.data.decode('utf-8')
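The list pages are paginated through the `start` query parameter (the Referer above shows `start=50`), so the per-page URLs can be generated up front and fed to download() one at a time. A small sketch, assuming 25 entries per page (the helper name and the page size are my own, not from the original script):

```python
def page_urls(base, pages, per_page=25):
    # Build one doulist URL per page by stepping the `start` parameter
    return [f"{base}?start={n * per_page}&sort=seq&sub_type=" for n in range(pages)]

urls = page_urls("https://www.douban.com/doulist/240962/", 3)
# urls[0] starts at 0, urls[1] at 25, urls[2] at 50
```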
After downloading, parse the page with BeautifulSoup:
def BeautifulSouplist(html):
    soup = BeautifulSoup(html, "lxml")
    s = soup.find_all("div", "bd doulist-subject")  # all movie entry divs
    if not s:
        return None  # empty page, nothing to parse
    for item in s:
        sp = BeautifulSoup(str(item), "html.parser")
        movie = []
        imageurl = sp.img['src']  # movie poster
        bookUrl = sp.a['href']    # link to the movie's detail page
        bookName = ""             # movie title
        pingfen = ""              # rating
        movieYear = ""            # release year
        actors = ""
        books = sp.find_all("div", "title")  # tags whose div class is "title"; returns a list
        for book in books:
            # strip() removes surrounding whitespace ('\n', '\r', '\t', ' ') by default
            bookName = book.a.string.strip()
        infos = sp.div.find_all("span")
        pingfen = infos[1].string
        tag_soup = sp.find(class_="abstract")
        arr = tag_soup.contents
        k = 0
        for string in arr:
            ss = str(string).strip()
            if ss != "<br/>":  # skip the line-break tags between fields
                if k == 1:
                    actors = ss
                if k == 4:
                    movieYear = ss
                k = k + 1
        movie.append(bookName)
        movie.append(actors)
        movie.append(movieYear)
        movie.append(pingfen)
        movie.append(bookUrl)
        movie.append(imageurl)
        result.append(movie)  # result is a module-level list collecting all movies
    return 1
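Since the speed comparison earlier was against lxml, here is a rough sketch of what the same extraction looks like with lxml's XPath. The sample markup below is a simplified stand-in for the real Douban page, so the class names (`rating_nums` etc.) are assumptions for this demo rather than verified against the live site:

```python
from lxml import etree

sample = '''
<div class="bd doulist-subject">
  <a href="https://movie.douban.com/subject/1/"><img src="https://img.example/p1.jpg"/></a>
  <div class="title"><a href="https://movie.douban.com/subject/1/"> 肖申克的救赎 </a></div>
  <div class="rating"><span class="rating_nums">9.6</span><span>(1000人评价)</span></div>
</div>
'''

tree = etree.HTML(sample)
movies = []
for node in tree.xpath('//div[@class="bd doulist-subject"]'):
    # Each field is one relative XPath expression on the entry node
    title = node.xpath('.//div[@class="title"]/a/text()')[0].strip()
    rating = node.xpath('.//span[@class="rating_nums"]/text()')[0]
    link = node.xpath('.//div[@class="title"]/a/@href')[0]
    cover = node.xpath('.//img/@src')[0]
    movies.append([title, rating, link, cover])
```

There is no re-parsing of each entry into a second soup object here, which is one reason the lxml version comes out faster.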
Save the results to Excel.
For writing Excel files I used the xlsxwriter library.
def saveExecl(resultData, titleList, bookName):
    workbook = xlsxwriter.Workbook(bookName + '.xlsx')  # create an Excel file
    worksheet = workbook.add_worksheet(bookName)        # add a sheet named after bookName
    resultData.insert(0, titleList)  # first row holds the column titles
    # Write the data, one cell at a time
    for i, rowData in enumerate(resultData):
        for col, data in enumerate(rowData):
            worksheet.write(i, col, str(data))
    workbook.close()