Python Basics - 1

0xsu

Category: web crawlers. Tags: python, regular expressions

Article link: https://blog.csdn.net/weixin_45752509/article/details/110005079

The requests library

Purpose of requests: send HTTP requests and get the response data.

Installation

Run in a terminal:

pip install requests

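Three steps to use it

# 1. Import the module
import requests

# 2. Send a GET request and get the response object
response1 = requests.get('https://www.baidu.com/')
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')

# 3. Get the data from the response object
# response.content is bytes; decode() converts it to a str (UTF-8 by default)
print(response.content.decode())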
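
Step 3 relies on `bytes.decode()`, which uses UTF-8 when no encoding is given. The sketch below shows that step in isolation. The `raw` value is a hand-made stand-in for `response.content` (an assumption, not real response data), so it runs without a network connection.

```python
# `raw` stands in for response.content (hypothetical data, no request is sent)
raw = "<title>疫情地图</title>".encode("utf-8")

# decode() with no argument is the same as decode("utf-8")
default_text = raw.decode()
explicit_text = raw.decode("utf-8")
print(default_text)

# Decoding with the wrong codec garbles the text or raises an error,
# so pass the page's real encoding when it is not UTF-8
try:
    wrong_text = raw.decode("ascii")
except UnicodeDecodeError:
    wrong_text = None
print(wrong_text)
```

If a page is not UTF-8, passing its actual encoding to `decode()` is the usual fix.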

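The Beautiful Soup library

Beautiful Soup is a Python library for extracting data from HTML or XML files.

Installation

Run in a terminal:

# Install Beautiful Soup 4 (the PyPI package is beautifulsoup4; it is imported as bs4)
pip install beautifulsoup4
# Install lxml
pip install lxml

Creating a BeautifulSoup object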

# 1. Import the module
from bs4 import BeautifulSoup

# 2. Create a BeautifulSoup object, using lxml as the parser
soup = BeautifulSoup("<html>data</html>", "lxml")
print(soup)

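The find() method of BeautifulSoup objects, and Tag objects

Common attributes of a Tag object
name: the tag's name
attrs: a dict of all the tag's attributes (keys and values)
text: the tag's text as a string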

# 1. Import the module
from bs4 import BeautifulSoup

# 2. Prepare the document string
# html = '''this'''
html = '''<html>
<head>
    <title>The Document's story</title>
</head>
<body>
    <p class="title">
        <b>The Document's story</b>
    </p>
    <p class="p2">
        <a href="###1" class="sister" id="link1">vir</a>
        <a href="###2" class="sister" id="link2">kali</a>
        <a href="###3" class="sister" id="link3">root</a>
    </p>
    <P class="story">……*&</P>
</body>
</html>'''

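# 3. Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# 4. Find the title tag in the document
title = soup.find('title')
print(title)

# Find by text content (the string argument was named text in older versions)
text = soup.find(string='kali')
print(text)

# Find an a tag in the document
# Option 1: by tag name (returns the first match)
a = soup.find('a')
print(a)

# Option 2: by passing the id attribute as a keyword argument
a = soup.find(id='link1')
print(a)

# Option 3: by passing a dict of attributes through attrs
a = soup.find(attrs={'id': 'link1'})
print(a)

# Tag object
print(type(a))  # <class 'bs4.element.Tag'>

print('Tag name:', a.name)
print('All attributes of the tag:', a.attrs)
print('Text content of the tag:', a.text)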
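
`find()` returns only the first match. As a hedged sketch of the natural next step, `find_all()` returns every matching tag. The snippet below assumes a shortened copy of the sample document (the `sample_html` string and the variable names are illustrative) and uses the built-in `html.parser`, so it also runs where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (illustrative)
sample_html = '''<p class="p2">
    <a href="###1" class="sister" id="link1">vir</a>
    <a href="###2" class="sister" id="link2">kali</a>
    <a href="###3" class="sister" id="link3">root</a>
</p>'''

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a list-like ResultSet of every matching Tag
links = sample_soup.find_all("a")

# Collect (text, href) pairs; Tag.get() returns None if the attribute is missing
pairs = [(link.text, link.get("href")) for link in links]
print(pairs)
# [('vir', '###1'), ('kali', '###2'), ('root', '###3')]
```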

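Basic use of regular expressions

Concept

A regular expression is a pattern used to match strings.

Uses

Check whether a string contains text that matches a pattern
Replace the matched text
Extract the matching substrings from a string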
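
The three uses listed above map onto three functions in the standard `re` module. A minimal sketch, using a made-up sample string:

```python
import re

sample = "ki13ki24"  # made-up sample string

# 1. Check: re.search() returns a match object, or None if nothing matches
has_digit = re.search(r"\d", sample) is not None

# 2. Replace: re.sub() replaces every match with the given text
masked = re.sub(r"\d+", "#", sample)

# 3. Extract: re.findall() returns all matching substrings as a list
numbers = re.findall(r"\d+", sample)

print(has_digit, masked, numbers)
# True ki#ki# ['13', '24']
```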

re.findall() finds all substrings of a string that match the regular expression

Returns a list
Returns an empty list if nothing matches

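import re

# 1. findall returns a list of the matches
rs = re.findall(r'\d', 'ki13ki24')  # ['1', '3', '2', '4']
rs = re.findall(r'\d+', 'ki13ki24')  # ['13', '24']
# + means the preceding pattern occurs one or more times
print(rs)

# 2. The flags argument of findall
rs = re.findall('a.+bc', 'a\nbc')
print(rs)  # [] because . does not match a newline by default
rs = re.findall('a.+bc', 'a\nbc', re.DOTALL)
print(rs)  # ['a\nbc'] because re.DOTALL lets . match a newline too
rs = re.findall('a(.+)bc', 'a\nbc', re.DOTALL)
print(rs)  # ['\n']
# With a capture group (), findall returns only the group's content;
# the pattern on either side of the parentheses just locates the data to extract

# 3. What the r (raw string) prefix does
# 3.1 Backslashes in a raw string are not treated as Python escape characters
rs = re.findall(r'a\\nbc', 'a\\nbc')
print(rs)  # ['a\\nbc']
# 3.2 Avoids the invalid-escape-sequence warning that '\d' in a normal string triggers
rs = re.findall(r'\d', 'a123')
print(rs)  # ['1', '2', '3']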
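
To see what the raw-string prefix changes, it can help to compare the two spellings of the same pattern. The sketch below uses only built-in string behaviour and `re.findall`; the variable names are illustrative.

```python
import re

# A raw string keeps backslashes as written, so these two literals are equal
raw_pattern = r'a\\nbc'        # 6 characters: a \ \ n b c
plain_pattern = 'a\\\\nbc'     # the same pattern without the r prefix
same_pattern = raw_pattern == plain_pattern

# The target text contains one real backslash followed by "n", not a newline
target = 'a\\nbc'              # 5 characters: a \ n b c

raw_result = re.findall(raw_pattern, target)
plain_result = re.findall(plain_pattern, target)
print(same_pattern, len(raw_pattern), len(target), raw_result == plain_result)
# True 6 5 True
```

Without the `r` prefix every backslash in the pattern has to be doubled, which is why raw strings are the usual choice for regular expressions.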