1. Using scrapy.Selector or BeautifulSoup, implement the following requirements (30 points)
(1) Read the contents of the given dangdang.html page; note: the encoding is gbk (5 points)
(2) Extract the name, price, author, publisher, and cover-image URL of every book on the page (20 points)
(3) Save the extracted information to a file (excel, csv, json, or txt are all acceptable) (5 points)
Download link for dangdang.html: https://pan.baidu.com/s/1awbG5zqOMdnWzXee7TZm6A password: 3urs
1.1 Solution using BeautifulSoup
from bs4 import BeautifulSoup as bs
import pandas as pd

def cssFind(book, cssSelector, nth=1):
    """Return the stripped text of the nth element matching cssSelector, or '' if absent."""
    if len(book.select(cssSelector)) >= nth:
        return book.select(cssSelector)[nth - 1].text.strip()
    else:
        return ''

if __name__ == "__main__":
    # The page is gbk-encoded, so it must be opened with encoding='gbk'
    with open("dangdang.html", encoding='gbk') as file:
        html = file.read()
    soup = bs(html, 'lxml')
    book_list = soup.select("div ul.bigimg li")
    result_list = []
    for book in book_list:
        item = {}
        item['name'] = book.select("a.pic")[0]['title']
        item['now_price'] = cssFind(book, "span.search_now_price")
        item['pre_price'] = cssFind(book, "span.search_pre_price")
        item['author'] = book.select("p.search_book_author a")[0]['title']
        item['publisher'] = book.select("p.search_book_author span a")[-1].text
        item['detailUrl'] = book.select("p.name a")[0]['href']
        # Dangdang lazy-loads cover images: the real URL usually sits in the
        # data-original attribute, with src as a fallback for the first covers
        img = book.select("a.pic img")[0]
        item['imgUrl'] = img.get('data-original', img.get('src', ''))
        result_list.append(item)
    # Save the results as csv (excel, json or txt would also satisfy the task)
    df = pd.DataFrame(result_list)
    df.to_csv("dangdang.csv", index=False, encoding='utf-8-sig')