Python Notes

Latest recommended article published on 2024-10-08 20:27:53

龙潜月七

Tags: python, programming language, backend

Article link: https://blog.csdn.net/ariadna/article/details/118876944

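This article introduces regular expressions in Python, the concepts of greedy and non-greedy matching, and how to fetch a web page's source with urllib. It uses the BeautifulSoup library to show how to find and filter HTML elements, covering the find and find_all methods and CSS selectors. It also shares a tip for optimizing string operations and a way to show progress while downloading a file, and looks at matching Chinese characters with a regular expression.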

Summary automatically generated by CSDN.

1. Regular expressions (the re module)

re.sub("a", "b", s): returns a copy of the string s in which every match of the pattern "a" is replaced by "b"
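As a quick illustration of re.sub, here is a minimal sketch. The sample strings and variable names are made up for this example and are not from the original notes.

```python
import re

text = "banana"  # sample string, chosen only for illustration

# Replace every match of the pattern "a" with "b"
replaced = re.sub("a", "b", text)
print(replaced)  # bbnbnb

# The optional count argument limits how many replacements are made
first_only = re.sub("a", "b", text, count=1)
print(first_only)  # bbnana

# The first argument is a regex, so character classes work too
no_digits = re.sub(r"\d+", "#", "room 101, floor 7")
print(no_digits)  # room #, floor #
```

Note that re.sub returns a new string; the original string is left unchanged.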

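2. Greedy vs. non-greedy matching

.*: greedy (matches as much as possible)
.*?: non-greedy (matches as little as possible)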
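The difference is easiest to see on a string with repeated delimiters. The sketch below uses a made-up HTML fragment (html_snippet is a hypothetical name) and compares the two quantifiers.

```python
import re

html_snippet = "<b>one</b><b>two</b>"  # made-up sample input

# Greedy: .* runs to the LAST "</b>", so there is a single long match
greedy = re.findall(r"<b>(.*)</b>", html_snippet)
print(greedy)  # ['one</b><b>two']

# Non-greedy: .*? stops at the FIRST "</b>", giving one match per tag
lazy = re.findall(r"<b>(.*?)</b>", html_snippet)
print(lazy)  # ['one', 'two']
```

When scraping, the non-greedy form is usually the one you want for pulling out individual fields.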

3. Fetching a web page's source

urllib.request.urlretrieve(url, filename): downloads the resource at url and saves it to a local file
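A minimal sketch of both ways to get a page with urllib.request. To keep it runnable offline, a temporary local file served through a file:// URL stands in for a real web page; that stand-in and all variable names are assumptions of this example. With a real site, url would simply be an http(s) address.

```python
import tempfile
import urllib.request
from pathlib import Path

# A local file stands in for a remote page so the sketch runs offline;
# in practice `url` would be an http(s) address.
workdir = Path(tempfile.mkdtemp())
source = workdir / "page.html"
source.write_text("<html><title>demo</title></html>", encoding="utf-8")
url = source.as_uri()  # file:// URL for the temporary file

# urlretrieve copies the resource to a local file and returns (path, headers)
target = workdir / "copy.html"
saved_path, headers = urllib.request.urlretrieve(url, str(target))

# urlopen returns a response object whose bytes can be read into memory
with urllib.request.urlopen(url) as response:
    page_source = response.read().decode("utf-8")

print(page_source)
```

urlretrieve is convenient when you want the file on disk; urlopen is the better fit when the source goes straight into a parser.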

4. BeautifulSoup

import re
from bs4 import BeautifulSoup

html = """
         <html><head><title>The Dormouse's story</title></head>
         <body>
         <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
         <p class="story">Once upon a time there were three little sisters; and their names were
         <a href=" " class="sister" id="link1"><!-- Elsie --></a>,
         <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
         <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
         and they lived at the bottom of a well.</p>
         <p class="story">...</p>
         """

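# Parse the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# find --> returns only the first element that matches
p = soup.find("p")                       # by tag name
p = soup.find(attrs={"class": "title"})  # by attribute
p = soup.find(text="...")                # by text content
p = soup.find(re.compile("^b"))          # by a regex on the tag name

# find_all --> returns a list; searches the whole document
p = soup.find_all("p")
# print(len(p))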

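# select --> returns a list; searches the whole document using CSS selectors
# ID selector
# tag selector
# class selector
# descendant selector
# union selector (comma-separated list)
# attribute selector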

a = soup.select("#link2")
a = soup.select("a")
a = soup.select(".sister")
a = soup.select("p #link2")
a = soup.select("title,a")
p = soup.select('p[class="story"]')[1]

# Get the text wrapped by the tag
p_content = p.get_text()
# Get an attribute: class is multi-valued, so it comes back as a list
p_class = p.get("class")
print(p_class[0])

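String operations

## Not recommended
colors = ['red', 'blue', 'green', 'yellow']
result = ''
for s in colors:
    result += s  # each assignment discards the previous string object and creates a new one
## Recommended
colors = ['red', 'blue', 'green', 'yellow']
result = ''.join(colors)  # builds the result once, with no intermediate strings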
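A short sketch of a few more things join can do, reusing the colors list from above. The other names (looped, joined, csv_line, numbers) are made up for this example.

```python
colors = ['red', 'blue', 'green', 'yellow']

# Loop concatenation and join produce the same string
looped = ''
for s in colors:
    looped += s
joined = ''.join(colors)
print(looped == joined)  # True

# The string that join is called on becomes the separator
csv_line = ', '.join(colors)
print(csv_line)  # red, blue, green, yellow

# join only accepts strings, so convert other types first
numbers = '-'.join(str(n) for n in [1, 2, 3])
print(numbers)  # 1-2-3
```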

Showing download progress as a percentage (learned from the Duitang project)

import urllib.request


def callbackfunc(blocknum, blocksize, totalsize):
    '''Callback function
    @blocknum: number of data blocks downloaded so far
    @blocksize: size of each data block
    @totalsize: size of the remote file
    '''
    percent = 100.0 * blocknum * blocksize / totalsize
    if percent > 100:
        percent = 100
    print("%.2f%%" % percent)


url = 'http://www.sina.com.cn'
local = r'd:\sina.html'  # raw string so the backslash is kept literally
urllib.request.urlretrieve(url, local, callbackfunc)
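One caveat with the callback above: urlretrieve passes -1 as the total size when the server does not report one, which would give a negative percentage. The sketch below is one possible way to guard against that, not the original project's code. The helper names (percent_done, report, progress_log) are hypothetical, and a temporary local file served through a file:// URL stands in for the remote file so the example runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def percent_done(blocknum, blocksize, totalsize):
    """Return progress in percent, or None when the total size is unknown."""
    if totalsize <= 0:  # urlretrieve passes -1 when no size is reported
        return None
    return min(100.0, 100.0 * blocknum * blocksize / totalsize)


progress_log = []  # collected instead of printed so it can be inspected


def report(blocknum, blocksize, totalsize):
    """Report hook in the form urlretrieve expects."""
    progress_log.append(percent_done(blocknum, blocksize, totalsize))


# Local stand-in for a remote file (assumption: keeps the sketch offline)
workdir = Path(tempfile.mkdtemp())
source = workdir / "data.bin"
source.write_bytes(b"x" * 20000)

urllib.request.urlretrieve(source.as_uri(), str(workdir / "copy.bin"), report)
print(progress_log)
```

Keeping the arithmetic in a separate function makes it easy to check without downloading anything.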

chinese = re.findall('[\u4e00-\u9fa5]', i)  # Chinese characters fall in the range \u4e00-\u9fa5
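A small runnable sketch of the pattern above. The sample string assigned to i is made up for this example, and the second pattern with + is an addition to show how to get whole runs instead of single characters.

```python
import re

i = "Python笔记 2021: 正则表达式"  # sample input; the name i follows the line above

# Each match is a single character in the range U+4E00..U+9FA5
chinese = re.findall('[\u4e00-\u9fa5]', i)
print(chinese)  # ['笔', '记', '正', '则', '表', '达', '式']

# Adding + groups consecutive characters into runs
runs = re.findall('[\u4e00-\u9fa5]+', i)
print(runs)  # ['笔记', '正则表达式']
```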