1. Using scrapy.Selector or BeautifulSoup, implement the following requirements (30 points)
(1) Read the contents of the given dangdang.html page; note: the encoding is gbk (5 points)
(2) Extract the name, price, author, publisher, and cover-image URL of every book on the page (20 points)
(3) Save the extracted information to a file (excel, csv, json, or txt are all acceptable) (5 points)
Download link for dangdang.html: https://pan.baidu.com/s/1awbG5zqOMdnWzXee7TZm6A password: 3urs
1.1 Solution using BeautifulSoup
from bs4 import BeautifulSoup as bs
import pandas as pd

def cssFind(book, cssSelector, nth=1):
    """Return the stripped text of the nth element matching cssSelector, or '' if absent."""
    if len(book.select(cssSelector)) >= nth:
        return book.select(cssSelector)[nth - 1].text.strip()
    else:
        return ''

if __name__ == "__main__":
    # The page is gbk-encoded, so it must be opened with encoding='gbk'
    with open("dangdang.html", encoding='gbk') as file:
        html = file.read()
    soup = bs(html, 'lxml')
    book_list = soup.select("div ul.bigimg li")
    result_list = []
    for book in book_list:
        item = {}
        item['name'] = book.select("a.pic")[0]['title']
        item['now_price'] = cssFind(book, "span.search_now_price")
        item['pre_price'] = cssFind(book, "span.search_pre_price")
        item['author'] = book.select("p.search_book_author a")[0]['title']
        item['publisher'] = book.select("p.search_book_author span a")[-1].text
        item['detailUrl'] = book.select("p.name a")[0]['href']
        # Dangdang lazy-loads cover images: the real URL usually sits in the
        # data-original attribute, with src as a fallback for the first covers
        img = book.select("a.pic img")[0]
        item['imgUrl'] = img.get('data-original', img.get('src', ''))
        result_list.append(item)
    # Save the results as csv (excel, json or txt would also satisfy the task)
    df = pd.DataFrame(result_list)
    df.to_csv("dangdang.csv", index=False, encoding='utf-8-sig')