Requirements: query with XPath, using the third-party lxml library:
1. Read the default web page file with XPath tooling and build a document object.
2. Use XPath to find the paragraphs whose class is p2, then iterate over the result list and print each one's text.
3. Use XPath to find all p elements inside the block whose id is divdemo1 and print their content.
4. Use XPath to find all p elements with class "blue" inside the block whose id is divdemo1 and print their text.
5. Use XPath to find the link inside the block whose id is divdemo and print its text and URL.
default.html:
<html>
<head>
<meta charset="UTF-8">
<title>Definition of ID and Class</title>
<style type="text/css">
<!--
#divdemo{background-color:#90EE90 ;border:0.2cm groove orange;}
#divdemo1{background-color:#66EE90 ;border:0.6cm groove green;}
.m1 {font-size:20px; color:#FF0000;}
p.p2{font-size:26px; color:#FF0066;}
.red {font-size:20px; color:red;}
.green {font-size:20px; color:green;}
.blue {font-size:20px; color:blue;}
-->
</style>
</head>
<body>
<div id="divdemo">
<p>This text is displayed in the default style</p>
<p class="m1">This text is displayed at 20 pixels in red</p>
<p class="p2">This text is displayed at 26 pixels in rose red</p>
<p class="p2">This text is displayed at 26 pixels in rose red 1</p>
<p class="p2">This text is displayed at 26 pixels in rose red 2</p>
<a href="http://xxx" class="m1" title="xxx">xxx</a>
</div>
<div id="divdemo1">
<p class="red">Data analysis</p>
<h2 class="green">Data visualization</h2>
<h3 class="blue">Machine learning</h3>
<h3 class="red">Machine non-learning</h3>
<p class="blue">Python programming</p>
</div>
</body></html>
from lxml import etree

# Read the default web page file with an HTML parser and build a document tree
tree = etree.parse('default.html', parser=etree.HTMLParser())

# Find the paragraphs whose class is p2, then iterate and print each one's text
p2 = tree.xpath('//p[@class="p2"]')
for i in p2:
    print(i.text)

# Find all p elements inside the div whose id is divdemo1 and print their content
pD = tree.xpath('//div[@id="divdemo1"]//p')
for i in pD:
    print(i.text)

# Find the p elements with class "blue" inside the div whose id is divdemo1 and print their text
pDB = tree.xpath('//div[@id="divdemo1"]//p[@class="blue"]')
for i in pDB:
    print(i.text)

# Find the links inside the div whose id is divdemo and print their text and href
aD = tree.xpath('//div[@id="divdemo"]//a')
for i in aD:
    print(i.text)
    print(i.get('href'))
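A variation worth knowing: XPath can return the text nodes and attribute values directly with `text()` and `@href`, so no Python-side attribute access is needed. A minimal sketch of this idea, run against an inline copy of the relevant fragment so it works standalone (the sample strings here are placeholders, not the real file contents):

```python
from lxml import etree

# Inline copy of the relevant fragment from default.html, so the sketch runs standalone
html = '''
<div id="divdemo">
  <p class="p2">first</p>
  <p class="p2">second</p>
  <a href="http://xxx" class="m1">xxx</a>
</div>
'''
tree = etree.HTML(html)

# text() selects the text nodes themselves, so xpath() returns plain strings
texts = tree.xpath('//p[@class="p2"]/text()')
print(texts)  # ['first', 'second']

# @href selects the attribute value directly, again as a string
links = tree.xpath('//div[@id="divdemo"]//a/@href')
print(links)  # ['http://xxx']
```

Both queries return lists of strings rather than element objects, which is convenient when only the text or URL is needed.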
Using BeautifulSoup, scrape the blog title and URL from the cnblogs homepage (https://www.cnblogs.com) and save them to a text file.
import requests
from bs4 import BeautifulSoup

url = 'https://www.cnblogs.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# headers must be passed as a keyword argument; passed positionally it becomes params
result = requests.get(url, headers=headers)
html = result.text
# print(html)
soup = BeautifulSoup(html, 'html.parser')
# title = soup.find_all('title')
# print(title[0].text)
title = soup.title
t = title.text
print(t)
# Iterate only over the <a> tags inside the "about" div, so .get('href') is always valid
a = soup.find('div', class_='about')
for i in a.find_all('a'):
    if i.text == '博客园':
        u = i.get('href')
        print(u)
# Save the results to a text file
with open('博客园bs.txt', 'w', encoding='utf-8') as f:
    f.write(t + '\n')
    f.write(u)
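The text-match-and-extract pattern above can be isolated and tested offline. A small self-contained sketch, using an assumed miniature of the footer markup (the live cnblogs page may differ), that shows why restricting the loop to `find_all('a')` makes the attribute lookup safe:

```python
from bs4 import BeautifulSoup

# Assumed miniature of the cnblogs "about" block; the live markup may differ
html = '''
<div class="about">
  some loose text
  <a href="https://www.cnblogs.com">博客园</a>
  <a href="https://www.cnblogs.com/about">关于</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') yields only Tag objects, never bare text nodes,
# so indexing a['href'] cannot hit a NavigableString
about = soup.find('div', class_='about')
url = None
for a in about.find_all('a'):
    if a.get_text(strip=True) == '博客园':
        url = a['href']
print(url)  # https://www.cnblogs.com
```

Iterating a Tag directly (as in `for i in a:`) also yields its loose text nodes, which have no `get` method; filtering to anchors first avoids that class of error entirely.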
Using XPath, scrape the blog title and URL from the cnblogs homepage (https://www.cnblogs.com) and save the results to a file.
import requests
from lxml import etree

url = 'https://www.cnblogs.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
# headers must be passed as a keyword argument; passed positionally it becomes params
result = requests.get(url, headers=headers)
html = result.content.decode('utf-8')
# print(html)
tree = etree.HTML(html, parser=etree.HTMLParser(encoding='utf-8'))
titles = tree.xpath('/html/head/title')
for i in titles:
    t = i.text
    print(t)
# Positional path copied from the browser; it breaks if the footer layout changes
urls = tree.xpath('//*[@id="footer_bottom"]/div[2]/a[5]')
for i in urls:
    u = i.get('href')
    print(u)
# Save the results to a text file
with open('博客园xpath.txt', 'w', encoding='utf-8') as f:
    f.write(t + '\n')
    f.write(u)
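The positional path `//*[@id="footer_bottom"]/div[2]/a[5]` is the fragile part of this script: any layout change shifts the indices. A sturdier habit is to match by class with `contains()`. A minimal offline sketch of that technique, where both the markup and the class name "post-item-title" are assumptions standing in for whatever the live page actually uses:

```python
from lxml import etree

# Assumed miniature of a post list; "post-item-title" is a hypothetical
# class name, not confirmed against the live cnblogs markup
html = '''
<section>
  <article><a class="post-item-title" href="https://www.cnblogs.com/post/1">First post</a></article>
  <article><a class="post-item-title" href="https://www.cnblogs.com/post/2">Second post</a></article>
</section>
'''
tree = etree.HTML(html)

# Match by class instead of position, so the query survives layout reshuffles
items = tree.xpath('//a[contains(@class, "post-item-title")]')
pairs = [(a.text, a.get('href')) for a in items]
for title, href in pairs:
    print(title, href)
```

Collecting (title, href) pairs in one pass also keeps each title aligned with its URL, which is handy when writing both to a file.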