Python Web Scraping Exercise 2

Latest recommended article published on 2024-07-12 16:16:27

卡朗

Article link: https://blog.csdn.net/qq_69866029/article/details/136538621

url = 'http://www.tipdm.com/'
ua = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) Chrome/65.0.3325.181'}
1. Parse the web page into a variable
2. Scrape the page title
3. Scrape the text of the first 5 paragraphs under the first div

import requests
from lxml import etree

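# Fetch the page and parse it into a variable
url = 'http://www.tipdm.com/'
ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) Chrome/65.0.3325.181'}
response = requests.get(url, headers=ua, timeout=10)
html_code = response.content.decode('utf-8')
tree = etree.HTML(html_code, parser=etree.HTMLParser())
# Scrape the page title (xpath() returns a list of matching text nodes)
title = tree.xpath('/html/head/title/text()')
print(title)
# Scrape the text of the first 5 paragraphs under the first div.
# (//div)[1] is the first div in document order; //div[1] would instead match
# every div that is the first div child of its parent.
p = tree.xpath('((//div)[1]//p)[position()<6]')
# print(p)
for i in p:
    # .text holds only the text before the first child element (it may be None)
    print(i.text)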
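
Positional predicates are a common source of surprises in XPath, so a small offline sketch may help here. It parses a made-up HTML string (the markup and the variable names are assumptions for illustration, not the real tipdm.com page) and compares `//div[1]` with `(//div)[1]`.

```python
from lxml import etree

# Hypothetical markup, used only to illustrate the two expressions.
html_code = """
<html><body>
  <div id="a"><p>a1</p><p>a2</p>
    <div id="a-inner"><p>inner1</p></div>
  </div>
  <div id="b"><p>b1</p></div>
</body></html>
"""
tree = etree.HTML(html_code)

# //div[1]: every div that is the first div child of its own parent.
first_of_each_parent = [d.get('id') for d in tree.xpath('//div[1]')]

# (//div)[1]: only the first div of the whole document.
first_in_document = [d.get('id') for d in tree.xpath('(//div)[1]')]

# First two paragraphs anywhere under the first div of the document.
first_two = [p.text for p in tree.xpath('((//div)[1]//p)[position()<3]')]

print(first_of_each_parent)
print(first_in_document)
print(first_two)
```

Wrapping the path in parentheses before applying the predicate makes the position count over the whole result set rather than per parent, which is usually what "the first div" is meant to say.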

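Use Selenium's own node lookup methods, find_element or its plural form find_elements:
url="xxx"
1. Create a Selenium browser object and load the page
2. Print the page title
3. Use By.XPATH to find the section heading "xxx" and display its text
4. Use By.CLASS_NAME to find all section headings and print them
5. Use By.LINK_TEXT to find the link whose text is "xxxxx", display its address and text, then click it automatically
6. Use By.NAME to find the <meta> tag named "description" in the head and print its content attribute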

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
url="xxx"
browse=webdriver.Chrome()
browse.get(url)
time.sleep(3)
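# Print the page title
t = browse.title  # the title is available directly on the driver
print(t)
# Alternative via XPath (<title> is not rendered, so .text is usually empty):
# t = browse.find_element(by=By.XPATH, value="/html/head/title")
# print(t.text)
# Use By.XPATH to find the section heading "xxx" and display its text
t1 = browse.find_element(by=By.XPATH, value="/html/body/div[3]/div/div[3]/div[1]/div[1]/h3/span")
print(t1.text)
# Use By.CLASS_NAME to find all section headings and print them
tAll = browse.find_elements(By.CLASS_NAME, 'title')
for i in tAll:
    print(i.text)
# Use By.LINK_TEXT to find the link "xxxxx", show its address and text, then click it.
# find_element (singular) returns one element; find_elements returns a list,
# which has no get_attribute() or click().
a = browse.find_element(By.LINK_TEXT, 'xxxxx')
print(a.get_attribute('href'))
print(a.text)
a.click()
time.sleep(3)
# Use By.NAME to find the <meta name="description"> tag in the head and print its content
description = browse.find_element(By.NAME, 'description')
print(description.get_attribute('content'))

# quit() closes every window and ends the driver session
browse.quit()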

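Use XPath path expressions with the third-party lxml library on the given page source:
1. Read the index HTML file and build a parsed tree object
2. Use a path expression to find the page title and display its text
3. Use a path expression to find all p elements and display their type and text
4. Use a path expression to find the text of the second p paragraph
5. Use a path expression to find all a links and display their text and href
6. Display the link address and text of "xxx"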

<html>
<head>
    <meta charset="UTF-8">
    <title>My first page</title>
</head>
<body>
    <h1>This is the article title.</h1>
    <p>This is paragraph 1 of the article.</p>
    <p>This is paragraph 2 of the article.</p>
    <p>This is paragraph 3 of the article.</p>
<a href=http://xxx  title=“xxx”>xxx</a>
<a href=http://xxx  title=“xxx”>xxx</a>
</body>
</html>

from lxml import etree
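# Build the parsed tree from the local file
tree = etree.parse('index.html', parser=etree.HTMLParser())
# Find the page title with a path expression and display its text
title = tree.xpath('/html/head/title')
print(title[0].text)
# Find all p elements and display their type and text
p = tree.xpath('//p')
print(type(p))  # xpath() returns a list of elements
# print(p)
for i in p:
    # print(type(i))
    print(i.text)
# Find the text of the second p paragraph
p2 = tree.xpath('/html/body/p[2]/text()')
print(p2[0])
# Find all a links and display their text and href
a = tree.xpath('//a')
for i in a:
    print(i.text)
    print(i.get('href'))
# Display the link address and text of "xxx".
# The page writes title=“xxx” with curly quotes, which the HTML parser keeps
# as part of the attribute value, so the XPath has to include them.
t = tree.xpath('//a[@title="“xxx”"]')
for i in t:
    print(i.get('href'))
    print(i.text)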
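
In the sample page the title attributes are written with curly quotes (“xxx”) rather than straight ASCII quotes. To the HTML parser such a value is unquoted, so the curly quotes become part of the value, which is presumably why the last XPath above includes them. A minimal sketch, using an inline copy of one link from the sample page (the variable names are assumptions for illustration):

```python
from lxml import etree

# One link copied from the sample page; note the curly quotes around xxx.
snippet = '<html><body><a href=http://xxx  title=“xxx”>xxx</a></body></html>'
tree = etree.HTML(snippet)
link = tree.xpath('//a')[0]

# The curly quotes are kept as ordinary characters of the attribute value.
title_value = link.get('title')
print(repr(title_value))

# Matching therefore needs the curly quotes inside the XPath string literal.
with_curly = tree.xpath('//a[@title="“xxx”"]')
without_curly = tree.xpath('//a[@title="xxx"]')
print(len(with_curly), len(without_curly))
```

If the page is under your control, writing the attribute with straight quotes (title="xxx") would let the simpler `//a[@title="xxx"]` match instead.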