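Task
Scrape the name, image URL, price, read count, and star rating of every product on the page.
Use the bs4 library with CSS selectors; XPath will come up later.
Getting a selector path: press F12, locate the tag, then right-click it and copy its selector.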
name:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
image:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
money:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right
…….
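To see why the copied paths need editing before they are useful, here is a minimal sketch on a hypothetical two-product snippet (the `SAMPLE_HTML` markup and its `row`/`card` class names are made up for illustration, not taken from the real page). It assumes a Beautiful Soup version that supports `:nth-child` in `select()` (4.7 or later, as far as I know), and uses the built-in `html.parser` so that only bs4 itself is required.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two product cards in one row.
SAMPLE_HTML = """
<body>
  <div class="row">
    <div class="card"><div class="caption">
      <h4 class="pull-right">$24.99</h4><h4><a href="#">First</a></h4>
    </div></div>
    <div class="card"><div class="caption">
      <h4 class="pull-right">$64.99</h4><h4><a href="#">Second</a></h4>
    </div></div>
  </div>
</body>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A path as copied from the browser: the positional parts pin it to one card.
exact = soup.select("div.row > div:nth-child(1) > div.caption > h4:nth-child(2) > a")

# The same path with the positional parts removed matches every card.
general = soup.select("div.row > div > div.caption > h4 > a")

exact_names = [a.get_text() for a in exact]
general_names = [a.get_text() for a in general]
print(exact_names)
print(general_names)
```

The copied selector returns a single link, while the generalized one returns one link per product, which is what a scraper for the whole page needs.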
Code
from bs4 import BeautifulSoup  # import the library

with open('./index.html', 'r') as wb_data:  # open the local file for reading
    soup = BeautifulSoup(wb_data, 'lxml')  # parse the content

# Collect every matching tag. Compare these with the copied selector paths above: they differ.
# Dropping the positional parts makes the path generic, so it matches
# every element on the page that follows the same pattern.
images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
names = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
moneys = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
reads = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
fives = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')

my_list = []
for image, name, money, read, five in zip(images, names, moneys, reads, fives):
    data = {
        'name': name.get_text(),
        'image': image.get('src'),
        'money': money.get_text(),
        'read': read.get_text(),
        # Each star appears as one <span class="glyphicon glyphicon-star"></span>,
        # so counting those spans gives the number of stars.
        # find_all takes the tag name as its first argument and the CSS class as its second;
        # see the BeautifulSoup docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all
        # find_all() returns a list, so len() gives the number of elements, i.e. the star count.
        'fives': len(five.find_all("span", class_="glyphicon glyphicon-star")),
    }
    my_list.append(data)

for i in my_list:
    print(i['name'], i['image'], i['money'], i['read'], i['fives'], sep='\n')
    print('\n')
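The star count relies on how `find_all` treats a multi-word `class_` string: it is compared against the whole class attribute, so a span with a different second class is not counted. A small sketch on a hypothetical ratings block (the `RATINGS_HTML` markup, including the `glyphicon-star-empty` class for unfilled stars, is an assumption about what the page looks like) shows this, with a CSS-selector count as a cross-check.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: three filled stars and two empty ones.
RATINGS_HTML = """
<div class="ratings">
  <p class="pull-right">18 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(RATINGS_HTML, "html.parser")

# The second <p> inside div.ratings holds the star icons.
stars_p = soup.select("div.ratings > p:nth-of-type(2)")[0]

# Exact match on the full class string: only the filled stars qualify.
star_count = len(stars_p.find_all("span", class_="glyphicon glyphicon-star"))

# Alternative: match the single class token with a CSS selector.
css_count = len(stars_p.select("span.glyphicon-star"))

print(star_count, css_count)
```

Both counts agree here. The selector form may be the safer choice if the order of class names in the attribute is not guaranteed, since the exact-string match depends on that order.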
Output
Finally, the page's HTML source is attached below for reference only; the images are not included.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Shop Homepage - Start Bootstrap Template</title>
<!-- Bootstrap Core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="css/shop-homepage.css" rel="stylesheet">
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-inverse navbar-fixed-top" role="navigation">
<div class="container">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="#">Web Parse</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li>
<a href="#">Home