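A review of regular expressions:
# A ? placed after a quantifier (as in .*?) makes it match as few characters as possible. In the example below, the match ends at the first </div>. In other words, ? turns off greedy matching.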
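To make the effect of ? concrete, here is a minimal sketch that runs a greedy and a non-greedy pattern on the same text. The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample: an outer div that wraps an inner div.
sample = '<div><div class="email">a@b.c</div></div>'

# Greedy: .* takes as much as it can, so the match runs to the LAST </div>.
greedy = re.findall(r'<div class="email">(.*)</div>', sample)

# Non-greedy: .*? takes as little as it can, so the match stops at the FIRST </div>.
lazy = re.findall(r'<div class="email">(.*?)</div>', sample)

print(greedy)  # ['a@b.c</div>']
print(lazy)    # ['a@b.c']
```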
Suppose we have the following HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<footer>
<div>
<div class="email">
Email:kefu@CSDN.net
</div>
<div class="tel">
Phone: 400-660-0108
</div>
</div>
</footer>
</body>
</html>
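I want to extract the email address. Create a Python file and import re:
import re

with open('index.html', 'r', encoding='utf-8') as f:
    html = f.read()
# print(html)

# Goal: extract the line "Email:kefu@CSDN.net"
# Remove newlines, because . does not match '\n' by default
html = re.sub('\n', '', html)
# Define the pattern; .*? stops at the first </div>
pattern_1 = '<div class="email">(.*?)</div>'
# Run the extraction
re_1 = re.findall(pattern_1, html)
# strip() removes whitespace from both ends
print(re_1[0].strip())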
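Deleting the newlines with re.sub is one way to let . reach across lines. As an alternative sketch (not the approach used above), the re.S flag makes . match newlines too, so the text can be left as it is. The HTML is inlined here as a trimmed copy so the snippet runs without index.html; the variable names are illustrative.

```python
import re

# Trimmed inline copy of the footer (assumption: same markup as index.html).
html = """<footer>
<div>
<div class="email">
Email:kefu@CSDN.net
</div>
<div class="tel">
Phone: 400-660-0108
</div>
</div>
</footer>"""

# re.S (DOTALL) lets . match '\n', so no newline removal is needed.
match = re.search(r'<div class="email">(.*?)</div>', html, re.S)

# re.search returns None when nothing matches, so guard before reading the group.
email_line = match.group(1).strip() if match else None
print(email_line)  # Email:kefu@CSDN.net
```

Using re.search with a guard also avoids the IndexError that re_1[0] would raise if the div were missing.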
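Match a password that starts with a letter, followed by 5 to 15 letters, digits, or underscores (6 to 16 characters in total):
# Define the pattern; the r prefix marks a raw string, so backslashes are not treated as escapes
password_pattern = r'^[a-zA-Z][a-zA-Z0-9_]{5,15}$'
password1 = '1234567'  # starts with a digit
password2 = 'a123456'  # valid
password3 = 'a123'     # too short
print(re.match(password_pattern, password1))  # None
print(re.match(password_pattern, password2))  # a match object
print(re.match(password_pattern, password3))  # None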
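re.match returns a match object on success and None on failure, so the result is easy to turn into a boolean. The sketch below wraps the check in a helper; the name is_valid_password is hypothetical. It uses re.fullmatch instead of ^ and $, which is slightly stricter: $ also tolerates a single trailing newline, while fullmatch does not.

```python
import re

# Same character rule as above: one letter, then 5-15 letters, digits, or underscores.
PASSWORD_RE = re.compile(r'[a-zA-Z][a-zA-Z0-9_]{5,15}')


def is_valid_password(candidate):
    """Return True if the whole string satisfies the password rule."""
    # fullmatch requires the pattern to cover the entire string.
    return PASSWORD_RE.fullmatch(candidate) is not None


for candidate in ['1234567', 'a123456', 'a123']:
    print(candidate, is_valid_password(candidate))
```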
Let's extend this a little and extract more data: the category structure of the course store.
import re

with open('static/html/index.html', 'r', encoding='utf-8') as f:
    html = re.sub('\n', '', f.read())

# Each main_section holds one category and its courses
section_pattern = '<section class="main_section">(.*?)</section>'
section_s = re.findall(section_pattern, html)
print(section_s)
print(len(section_s))

category_pattern = '<h1>(.*?)</h1>'
# category_s = re.findall(category_pattern, section_s[0])
# print(category_s)
course_pattern = '<span class="course_name">(.*?)</span>'

data_s = []
for section in section_s:
    # The first <h1> in a section is the category title
    category = re.findall(category_pattern, section)[0]
    course = re.findall(course_pattern, section)
    print(category)
    data_s.append(
        {
            'category': category,
            'course': course
        }
    )
print(data_s)

# Print each category followed by its indented course names
for data in data_s:
    print(data.get('category'))
    for d in data['course']:
        print(" " + d)
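The same two-level extraction can be tried without the HTML file. The sketch below inlines a shortened, made-up fragment that uses the same class names. As an extension that is not in the original code, it also captures each price with a second group, in which case re.findall returns tuples. The names title_pattern and catalog are illustrative.

```python
import re

# Shortened, hypothetical fragment that mimics the structure of the test page.
html = (
    '<section class="main_section"><h1>Chapter 1</h1>'
    '<figcaption><span class="course_name">Course A</span><span class="price">¥75</span></figcaption>'
    '<figcaption><span class="course_name">Course B</span><span class="price">¥153</span></figcaption>'
    '</section>'
    '<section class="main_section"><h1>Chapter 2</h1>'
    '<figcaption><span class="course_name">Course C</span><span class="price">¥143</span></figcaption>'
    '</section>'
)

section_pattern = r'<section class="main_section">(.*?)</section>'
title_pattern = r'<h1>(.*?)</h1>'
# Two groups: findall returns a list of (name, price) tuples.
course_pattern = r'<span class="course_name">(.*?)</span><span class="price">¥(\d+)</span>'

catalog = []
for section in re.findall(section_pattern, html):
    titles = re.findall(title_pattern, section)
    if not titles:
        continue  # skip sections without a heading instead of raising IndexError
    courses = [(name, int(price)) for name, price in re.findall(course_pattern, section)]
    catalog.append({'category': titles[0], 'courses': courses})

print(catalog)
```

For anything beyond a fixed, well-known page layout, an HTML parser is usually more robust than regular expressions, but the regex version keeps the focus on non-greedy groups.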
The HTML used for testing:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>CSDN Micro-Course Store</title>
<link rel="stylesheet" href="../css/main.css">
<script type="text/javascript" src="../js/main.js"></script>
</head>
<body>
<div id="register" hidden="hidden">
<h2 class="form_p">Register</h2>
<p id="register_message">
<!--Invalid information-->
</p>
<form action="#" method="post" id="register_form">
<input id="register_account" type="text" name="account" placeholder="Account (digits, letters, underscores; 8-16 characters)"><br/>
<input id="register_password" type="password" name="password" placeholder="Password (digits, letters, underscores; 6-16 characters)"><br/>
<!--<input type="password" name="repassword" placeholder="Confirm password"><br/>-->
<input id="register_submit" type="submit" value="Register">
</form>
</div>
<div id="login" hidden="hidden">
<h2 class="form_p">Log in</h2>
<p id="login_message">
<!--Invalid information-->
</p>
<form action="#" method="post" id="login_form">
<input id="login_account" type="text" name="account" placeholder="Account"><br>
<input id="login_password" type="password" name="password" placeholder="Password"><br>
<input id="login_submit" type="submit" value="Log in">
</form>
</div>
<header>
<span class="title"> <a href="index.html">CSDN Micro-Course Store</a> </span>
<span>
<form action="#" class="search_form">
<input type="text" name="course" placeholder="Search by course name">
<input type="submit" value="Search">
</form>
</span>
<span class="user">
<a href="javascript:show('login')">Log in</a>/
<a href="javascript:show('register')">Register</a>
<!-- Content shown when the user is logged in -->
Hello,
<a href="user.html">User 1</a>
<a href="#">Log out</a>
</span>
</header>
<article>
<section class="nav_section"><img src="../img/csdn_static/2.png" alt="" width="100%"></section>
<section class="main_section"><h1>Chapter 1: Routing and Templates</h1>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Web Principles and Framework Overview</span><span class="price">¥75</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Django Environment Setup and a First Example</span><span class="price">¥153</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Basic URL Routing and Namespaces</span><span class="price">¥154</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Regex Routing: Passing and Receiving Parameters</span><span class="price">¥177</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Reverse Resolution Handlers</span><span class="price">¥161</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Request and Response Objects</span><span class="price">¥44</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Context and Template Invocation</span><span class="price">¥97</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Basic Template Syntax</span><span class="price">¥105</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Template Filters</span><span class="price">¥133</span></figcaption>
</a></figure>
</section>
<section class="main_section"><h1>Chapter 2: Implementing Model Classes</h1>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">ORM Principles and Database Configuration</span><span class="price">¥143</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Defining Tables and Fields, and Common Field Constraints</span><span class="price">¥118</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Data Migration and Maintenance</span><span class="price">¥57</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Creating, Updating, and Deleting with Model Classes</span><span class="price">¥45</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">Query Methods of Model Classes</span><span class="price">¥187</span></figcaption>
</a></figure>
<figure><a href="course.html"> <img src="../img/course/course.png">
<figcaption><span class="course_name">QuerySet in Depth</span><span class="price">¥197</span></figcaption>
</a></figure>
</section>
</article>
<footer>
<div id="footer_div1">
<p><a href="#">About Us</a>| <a href="#">Careers</a>| <a href="#">Advertising</a>| <a href="#">Site Map</a></p>
<p><a href="#">QQ Support</a>| <a href="#">kefu@csdn.ent</a>| <a href="#">Support Forum</a>| <a href="#">400-660-0108</a>| <a
href="#">Business hours: 8:30-22:00</a></p>
<p> Site search provided by Baidu 北ICP备19004658 </p>
<p> ©1999-2019 北京创新乐知网络技术有限公司 </p>
<p> Copyright Complaints Parental Controls Commercial Website Registration Info Cyber 110 Alarm Service China Internet Reporting Center Beijing Internet Illegal and Harmful Information Reporting Center </p>
</div>
<div id="footer_div2">
<figure><img src="../img/csdn_static/二维码1.png">
<figcaption>CSDN Consulting</figcaption>
</figure>
<figure><img src="../img/csdn_static/二维码1.png">
<figcaption>CSDN Academy</figcaption>
</figure>
<figure><img src="../img/csdn_static/二维码1.png">
<figcaption>CSDN Enterprise Recruitment</figcaption>
</figure>
</div>
</footer>
</body>
</html>