Everything here works the same as in the previous article; the difference is that we no longer need to write long regular expressions. Previous article: https://blog.csdn.net/u010376229/article/details/114042780
This time we use BeautifulSoup. With just a little learning you can skip regular expressions entirely, and the resulting code is much clearer.
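Before the real scraper below, here is a minimal, self-contained illustration of the two BeautifulSoup calls it relies on, `find` and `find_all`. The HTML snippet is invented for this example; it only mimics the structure of the Dangdang list page.

```python
from bs4 import BeautifulSoup

# A toy fragment shaped like the bestseller list markup (invented for illustration).
html = """
<ul class="bang_list">
  <li><div class="name"><a title="Book A">Book A</a></div></li>
  <li><div class="name"><a title="Book B">Book B</a></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")  # 'lxml' also works if installed
# find() returns the first matching tag; find_all() returns every match as a list.
lis = soup.find("ul", class_="bang_list").find_all("li")
titles = [li.find("div", class_="name").a.get("title") for li in lis]
print(titles)  # ['Book A', 'Book B']
```

Note `class_` with a trailing underscore: `class` is a Python keyword, so BeautifulSoup uses this spelling for the CSS-class filter.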
def get_books_info_of_current_page(page):
    html = get_html("http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-" + str(page))
    soup = BeautifulSoup(html, 'lxml')
    # Find all <li> elements under <ul class="bang_list">
    lis = soup.find("ul", class_="bang_list").find_all("li")
    get_book_info_and_write_to_txt(lis)
def get_book_info_and_write_to_txt(lis):
    for li in lis:
        book_info = {
            "range": li.find('div', class_="list_num").string,
            "img": li.find("div", class_="pic").a.img.get("src"),
            "title": li.find("div", class_="name").a.get("title"),
            "recommend": li.find("div", class_="star").find("span", class_="tuijian").string,
            "author": li.find("div", class_="publisher_info").a.get("title") if li.find("div", class_="publisher_info").a else "无",
            "price": li.find("div", class_="price").span.string
        }
        write_item_to_file(book_info)
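The code above calls two helpers, `get_html` and `write_item_to_file`, which come from the previous article and are not shown here. For readers jumping in mid-series, here is a hedged sketch of what they might look like; the bodies are my assumptions, only the names and call signatures come from the code above.

```python
import json

def get_html(url):
    # Assumption: fetch the page with requests (a common choice for this task).
    import requests
    response = requests.get(url)
    return response.text

def write_item_to_file(item, path="books.txt"):
    # Assumption: append one book record per line as JSON.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Appending one JSON object per line keeps the output easy to reload later (read the file line by line and `json.loads` each record).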
This approach is slower, though: scraping 500 records takes about 14 s with BeautifulSoup, versus about 10 s with regular expressions.
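If you want to reproduce that timing comparison yourself, a simple wrapper around `time.perf_counter` is enough. The function timed below is a stand-in; substitute your own scraping call.

```python
import time

def timed(fn, *args):
    # Measure wall-clock time of a single call with a high-resolution timer.
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; replace with e.g. the full 500-record scrape.
result, elapsed = timed(sum, range(1000))
print(f"took {elapsed:.3f}s")
```

`time.perf_counter` is preferable to `time.time` for benchmarking because it is monotonic and has the highest available resolution.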