Requirements: query with XPath, using the third-party lxml library:
1. Read the default web page file with XPath tooling and build a document object.
2. Use XPath to find the paragraphs whose class is p2, then iterate over the result list and print each one's text.
3. Use XPath to find all p elements inside the block whose id is divdemo1 and print their content.
4. Use XPath to find all p elements with class "blue" inside the block whose id is divdemo1 and print their text.
5. Use XPath to find the link inside the block whose id is divdemo and print its text and URL.
default.html:
<html>
<head>
<meta charset="UTF-8">
<title>Definition of ID and Class</title>
<style type="text/css">
<!--
#divdemo{background-color:#90EE90 ;border:0.2cm groove orange;}
#divdemo1{background-color:#66EE90 ;border:0.6cm groove green;}
.m1 {font-size:20px; color:#FF0000;}
p.p2{font-size:26px; color:#FF0066;}
.red {font-size:20px; color:red;}
.green {font-size:20px; color:green;}
.blue {font-size:20px; color:blue;}
-->
</style>
</head>
<body>
<div id="divdemo">
<p>This text is displayed in the default style</p>
<p class="m1">This text is displayed at 20 pixels in red</p>
<p class="p2">This text is displayed at 26 pixels in rose red</p>
<p class="p2">This text is displayed at 26 pixels in rose red 1</p>
<p class="p2">This text is displayed at 26 pixels in rose red 2</p>
<a href="http://xxx" class="m1" title="xxx">xxx</a>
</div>
<div id="divdemo1">
<p class="red">Data analysis</p>
<h2 class="green">Data visualization</h2>
<h3 class="blue">Machine learning</h3>
<h3 class="red">Machine non-learning</h3>
<p class="blue">Python programming</p>
</div>
</body></html>
from lxml import etree

# Read the default web page file with an HTML parser and build a document tree
tree = etree.parse('default.html', parser=etree.HTMLParser())

# Find the paragraphs whose class is p2, then iterate and print each one's text
p2 = tree.xpath('//p[@class="p2"]')
for i in p2:
    print(i.text)

# Find all p elements inside the div whose id is divdemo1 and print their content
pD = tree.xpath('//div[@id="divdemo1"]//p')
for i in pD:
    print(i.text)

# Find the p elements with class "blue" inside the div whose id is divdemo1 and print their text
pDB = tree.xpath('//div[@id="divdemo1"]//p[@class="blue"]')
for i in pDB:
    print(i.text)

# Find the links inside the div whose id is divdemo and print their text and href
aD = tree.xpath('//div[@id="divdemo"]//a')
for i in aD:
    print(i.text)
    print(i.get('href'))
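A variation worth knowing: XPath can return the text nodes and attribute values directly with `text()` and `@href`, so no Python-side attribute access is needed. A minimal sketch of this idea, run against an inline copy of the relevant fragment so it works standalone (the sample strings here are placeholders, not the real file contents):

```python
from lxml import etree

# Inline copy of the relevant fragment from default.html, so the sketch runs standalone
html = '''
<div id="divdemo">
  <p class="p2">first</p>
  <p class="p2">second</p>
  <a href="http://xxx" class="m1">xxx</a>
</div>
'''
tree = etree.HTML(html)

# text() selects the text nodes themselves, so xpath() returns plain strings
texts = tree.xpath('//p[@class="p2"]/text()')
print(texts)  # ['first', 'second']

# @href selects the attribute value directly, again as a string
links = tree.xpath('//div[@id="divdemo"]//a/@href')
print(links)  # ['http://xxx']
```

Both queries return lists of strings rather than element objects, which is convenient when only the text or URL is needed.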
Using BeautifulSoup, scrape the blog title and URL from the cnblogs homepage (https://www.cnblogs.com) and save them to a text file.
import requests
from bs4 import BeautifulSoup

url = 'https://www.cnblogs.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# headers must be passed as a keyword argument; passed positionally it becomes params
result = requests.get(url, headers=headers)
html = result.text
# print(html)
soup = BeautifulSoup(html, 'html.parser')
# title = soup.find_all('title')
# print(title[0].text)
title = soup.title
t = title.text
print(t)
# Iterate only over the <a> tags inside the "about" div, so .get('href') is always valid
a = soup.find('div', class_='about')
for i in a.find_all('a'):
    if i.text == '博客园':
        u = i.get('href')
        print(u)
# Save the results to a text file
with open('博客园bs.txt', 'w', encoding='utf-8') as f:
    f.write(t + '\n')
    f.write(u)
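The text-match-and-extract pattern above can be isolated and tested offline. A small self-contained sketch, using an assumed miniature of the footer markup (the live cnblogs page may differ), that shows why restricting the loop to `find_all('a')` makes the attribute lookup safe:

```python
from bs4 import BeautifulSoup

# Assumed miniature of the cnblogs "about" block; the live markup may differ
html = '''
<div class="about">
  some loose text
  <a href="https://www.cnblogs.com">博客园</a>
  <a href="https://www.cnblogs.com/about">关于</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') yields only Tag objects, never bare text nodes,
# so indexing a['href'] cannot hit a NavigableString
about = soup.find('div', class_='about')
url = None
for a in about.find_all('a'):
    if a.get_text(strip=True) == '博客园':
        url = a['href']
print(url)  # https://www.cnblogs.com
```

Iterating a Tag directly (as in `for i in a:`) also yields its loose text nodes, which have no `get` method; filtering to anchors first avoids that class of error entirely.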
Using XPath, scrape the blog title and URL from the cnblogs homepage (https://www.cnblogs.com) and save the results to a file.
import requests
from lxml import etree

url = 'https://www.cnblogs.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# headers must be passed as a keyword argument; passed positionally it becomes params
result = requests.get(url, headers=headers)
html = result.content.decode('utf-8')
# print(html)
tree = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
titles = tree.xpath('/html/head/title')
for i in titles:
    t = i.text
    print(t)
# Positional path copied from the browser; it breaks if the footer layout changes
urls = tree.xpath('//*[@id="footer_bottom"]/div[2]/a[5]')
for i in urls:
    u = i.get('href')
    print(u)
# Save the results to a text file
with open('博客园xpath.txt', 'w', encoding='utf-8') as f:
    f.write(t + '\n')
    f.write(u)
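The positional path `//*[@id="footer_bottom"]/div[2]/a[5]` is the fragile part of this script: any layout change shifts the indices. A sturdier habit is to match by class with `contains()`. A minimal offline sketch of that technique, where both the markup and the class name "post-item-title" are assumptions standing in for whatever the live page actually uses:

```python
from lxml import etree

# Assumed miniature of a post list; "post-item-title" is a hypothetical
# class name, not confirmed against the live cnblogs markup
html = '''
<section>
  <article><a class="post-item-title" href="https://www.cnblogs.com/post/1">First post</a></article>
  <article><a class="post-item-title" href="https://www.cnblogs.com/post/2">Second post</a></article>
</section>
'''
tree = etree.HTML(html)

# Match by class instead of position, so the query survives layout reshuffles
items = tree.xpath('//a[contains(@class, "post-item-title")]')
pairs = [(a.text, a.get('href')) for a in items]
for title, href in pairs:
    print(title, href)
```

Collecting (title, href) pairs in one pass also keeps each title aligned with its URL, which is handy when writing both to a file.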