Author: IT小样
The previous post implemented only a bare-bones crawler, so there is plenty left to improve. This post builds on it: it crawls the title and author of every book under tag/漫画, wraps the code into functions, and adds a step that saves the data.
The target URL crawled last time was https://book.douban.com/tag/漫画 . Open that page and scroll to the bottom: it only lists the first page of books, as shown in the figure:
How do we crawl the book titles on every page? Clicking through a few of the pagination links shows that the URL carries two query parameters; the first page is actually https://book.douban.com/tag/漫画?start=0&type=T . The two parameters are start and type. type never changes from page to page, while start = (page number - 1) * 20. Here is the un-refactored code first:
import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"}

# get page count
# text = requests.get("https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start=0&type=T", headers=header, verify=False).text
# soup_count = BeautifulSoup(text, 'lxml')
# span_soup = soup_count.find("span", attrs={"class": "break"})
# a_soup = span_soup.next_sibling.next_sibling.next_sibling.next_sibling
# page_count = int(a_soup.string)

# get page text
url_temp = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T"
offset = 20  # books per page
result = []
for page in range(1):  # only the first page here; raise the range to crawl more
    start = offset * page
    url = url_temp.format(start)
    text = requests.get(url, headers=header, verify=False).text
    soup = BeautifulSoup(text, 'lxml')
    ul_soup = soup.find(attrs={"class": "subject-list"})
    li_soup = ul_soup.find_all("li", attrs={"class": "subject-item"})
    for li in li_soup:
        result_list = []
        title = li.h2.get_text().replace(' ', '').replace('\n', '')
        author = li.find("div", attrs={"class": "pub"}).get_text().replace(' ', '').replace('\n', '')
        # comment_count = li.find("div", attrs={"class": "star"}).find("span", attrs={"class": "pl"}).get_text()
        result_list.append(title)
        result_list.append(author)
        # result_list.append(comment_count)
        result.append(result_list)
print(result)

with open('a.txt', 'w', encoding='utf-8') as f:
    for result_list in result:
        f.write(' '.join(result_list))
        f.write('\n')
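The start parameter rule described above (start = (page number - 1) * 20) can be checked with a short, offline sketch that just builds the page URLs; the page count of 3 here is an arbitrary illustrative value:

```python
# Build the paginated tag URLs: each result page shows 20 books,
# so page n (1-based) begins at offset (n - 1) * 20.
url_temp = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T"

def page_urls(page_count, page_size=20):
    """Return the listing URL for each of the first page_count pages."""
    return [url_temp.format(page * page_size) for page in range(page_count)]

urls = page_urls(3)
print(urls[0])  # ends with start=0&type=T  (page 1)
print(urls[2])  # ends with start=40&type=T (page 3)
```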
The code above is written straight through. Wrapping it into functions makes it reusable; here is the refactored version:
import requests
from bs4 import BeautifulSoup

def get_page_count(url):
    # get the total number of result pages from the pagination bar
    text = requests.get(url, verify=False).text
    soup_count = BeautifulSoup(text, 'lxml')
    span_soup = soup_count.find("span", attrs={"class": "break"})
    a_soup = span_soup.next_sibling.next_sibling.next_sibling.next_sibling
    page_count = int(a_soup.string)
    return page_count

def get_page_text(page_count):
    # note: use a separate name for the response body so it does not
    # shadow the result list we are accumulating
    result = []
    offset = 20
    url_temp = "https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start={}&type=T"
    for page in range(0, page_count):
        start = offset * page
        text = requests.get(url_temp.format(start), verify=False).text
        soup = BeautifulSoup(text, 'lxml')
        ul_soup = soup.find(attrs={"class": "subject-list"})
        li_soup = ul_soup.find_all("li", attrs={"class": "subject-item"})
        for li in li_soup:
            result_list = []
            title = li.h2.get_text().replace(' ', '').replace('\n', '')
            author = li.find("div", attrs={"class": "pub"}).get_text().replace(' ', '').replace('\n', '')
            result_list.append(title)
            result_list.append(author)
            result.append(result_list)
    return result

def save_data(book_text):
    with open('a.txt', 'w', encoding='utf-8') as f:
        for book in book_text:
            f.write(' '.join(book))
            f.write('\n')

if __name__ == "__main__":
    page_count = get_page_count("https://book.douban.com/tag/%E6%BC%AB%E7%94%BB?start=0&type=T")
    book_text = get_page_text(page_count)
    save_data(book_text)
This version first gets the total number of result pages, then collects the title and author of every book on every page, and finally saves the data. You can also choose to fetch only as many pages of book entries as you need.
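Limiting the crawl to the first few pages only takes a small cap on the count before it is passed to get_page_text; the default of 5 pages here is an arbitrary choice:

```python
# Clamp the detected page count to a user-chosen upper bound, so a
# tag with hundreds of pages can be sampled instead of crawled fully.
def pages_to_fetch(page_count, max_pages=5):
    """Return how many pages to actually crawl."""
    return min(page_count, max_pages)

# e.g. the tag has 100 result pages, but we only want the first 5:
print(pages_to_fetch(100))  # 5
print(pages_to_fetch(3))    # 3 (fewer pages than the cap)
```

The result would then be fetched with get_page_text(pages_to_fetch(page_count)).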
Finally, the program runs quite slowly as written, since it fetches the pages one by one; that leaves room for further improvement.
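One common way to speed up this kind of crawler is to fetch the pages concurrently with a thread pool. The sketch below is not from the original post: fetch_all is a hypothetical helper, and fake_fetch stands in for requests.get(url).text so the example runs offline; the real crawler would pass a function that downloads and returns each page's HTML.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=5):
    """Fetch every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# Stand-in for a real downloader such as requests.get(url).text,
# so the sketch can run without network access.
def fake_fetch(url):
    return "<html>page for %s</html>" % url

pages = fetch_all(["u0", "u1", "u2"], fake_fetch)
print(len(pages))  # 3
```

Each returned page would then go through the same BeautifulSoup parsing as before.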