Python Web-Scraping Exercise 1: Basic Data Extraction with the bs4 Library

Latest recommended article published on 2023-12-30 15:32:36

StarLord007

Category: Crawlers. Article tags: python, crawler

Article link: https://blog.csdn.net/q1694222672/article/details/79343788

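This article walks through a web-scraping exercise with Python's bs4 library. The task is to extract each product's name, image URL, price, view count, and star rating from a web page. It analyzes the HTML structure and uses CSS selectors to pick out the elements that hold the data.

Summary generated by CSDN using AI.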

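Task
Scrape the name, image URL, price, view count, and star rating of every product on the page.
(Screenshot of the target page)

This exercise uses the bs4 library with CSS selectors; XPath will come up in a later exercise.
To get a selector: press F12, locate the tag, right-click it, and copy its selector.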

name:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
image:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
money:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right
…….
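
The copied selectors above pin one specific product through their positional parts. As a minimal sketch of why that matters, the snippet below compares a positional selector with a generalized one. The markup is a hypothetical, trimmed-down stand-in for the real page, `html.parser` is used only to avoid an extra dependency, and `:nth-of-type` stands in for the positional part.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup; the real page has more nesting.
html = """
<div class="col-md-9">
  <div class="row">
    <div><div class="caption">
      <h4 class="pull-right">$24.99</h4><h4><a href="#">First</a></h4>
    </div></div>
    <div><div class="caption">
      <h4 class="pull-right">$64.99</h4><h4><a href="#">Second</a></h4>
    </div></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Positional selector, like the ones copied from the browser:
# it matches the first product only.
one = soup.select(
    "div.col-md-9 > div > div:nth-of-type(1) > div.caption > h4 > a"
)

# Generalized selector with the position dropped:
# it matches every product that follows the same pattern.
every = soup.select("div.col-md-9 > div > div > div.caption > h4 > a")

print([a.get_text() for a in one])
print([a.get_text() for a in every])
```

The price `h4` carries no link, so `h4 > a` picks up only the product names.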

Code

from bs4 import BeautifulSoup  # import the library

# Open the local file for reading; the page declares charset utf-8
with open('./index.html', 'r', encoding='utf-8') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')  # parse the content
    # Collect every matching tag. These paths differ from the copied selectors above:
    # dropping the positional parts turns them into loose paths that match
    # every element on the page that follows the same pattern
    images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
    names = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
    moneys = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
    reads = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
    fives = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')

my_list = []
for image, name, money, read, five in zip(images, names, moneys, reads, fives):
    data = {
        'name': name.get_text(),
        'image': image.get('src'),
        'money': money.get_text(),
        'read': read.get_text(),
        # Each star appears as one <span class="glyphicon glyphicon-star"></span>,
        # so counting those spans gives the number of stars.
        # find_all takes the tag name as its first argument and the CSS class as its second; see the
        # BeautifulSoup docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all
        # find_all() returns a list-like result, so len() gives the number of matches, i.e. the star count
        'fives': len(five.find_all("span", class_="glyphicon glyphicon-star")),
    }
    my_list.append(data)

for i in my_list:
    print(i['name'], i['image'], i['money'], i['read'], i['fives'], sep='\n')
    print('\n')
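
The star count relies on how `find_all` treats a `class_` value with more than one word: it is compared against the whole `class` attribute string, so spans with a different class string are not counted. The sketch below checks this on a hypothetical ratings block. The `glyphicon-star-empty` class and the review text are assumptions made for illustration, and `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: three filled stars and two empty ones.
html = """
<div class="ratings">
  <p class="pull-right">15 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
stars = soup.select("div.ratings > p:nth-of-type(2)")[0]

# A multi-word class_ string is matched against the full class attribute,
# so only the filled stars are counted.
filled_exact = len(stars.find_all("span", class_="glyphicon glyphicon-star"))

# Alternative: a CSS class selector matches a single class token.
filled_css = len(stars.select("span.glyphicon-star"))

print(filled_exact, filled_css)
```

The exact-string form breaks if the class order in the markup changes, while the CSS form does not, which may make `select` the safer choice on pages you do not control.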

Output
(Screenshot of the program output)

Finally, here is the page's source code for reference. The images are not included.


<!DOCTYPE html>
<html lang="en">
    <head>

        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <meta name="description" content="">
        <meta name="author" content="">

        <title>Shop Homepage - Start Bootstrap Template</title>

        <!-- Bootstrap Core CSS -->
        <link href="css/bootstrap.min.css" rel="stylesheet">

        <!-- Custom CSS -->
        <link href="css/shop-homepage.css" rel="stylesheet">

        <!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
        <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
        <!--[if lt IE 9]>
        <script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
        <script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
        <![endif]-->

    </head>

    <body>

        <!-- Navigation -->
        <nav class="navbar navbar-inverse navbar-fixed-top" role="navigation">
            <div class="container">
                <!-- Brand and toggle get grouped for better mobile display -->
                <div class="navbar-header">
                    <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
                        <span class="sr-only">Toggle navigation</span>
                        <span class="icon-bar"></span>
                        <span class="icon-bar"></span>
                        <span class="icon-bar"></span>
                    </button>
                    <a class="navbar-brand" href="#">Web Parse</a>
                </div>
                <!-- Collect the nav links, forms, and other content for toggling -->
                <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
                    <ul class="nav navbar-nav">
                        <li>
                            <a href="#">Home