Python Applications: Front End and Web Scraping

Latest recommended article published 2024-08-08 14:28:23

酸奶加香蕉


Category: Post-class summaries. Tags: python, html, web

Article link: https://blog.csdn.net/m0_54139855/article/details/119762194


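Applications of Python

Front End

HTML stands for HyperText Markup Language.
HTML file extensions: .html / .htm
HTML tags come in two kinds:

Paired tags: a start tag and an end tag, with other tags or content allowed between them, for example: <h1>yyds</h1>

Self-closing tags: a single start tag, optionally written with a slash before the closing angle bracket, that cannot contain any content, for example: <br />

By convention, attribute values in HTML are written in double quotes.

HTML tag names and attribute names are case-insensitive.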
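The three points above (paired tags, self-closing tags, case-insensitive names) can be observed with Python's standard-library HTML parser. This is a minimal sketch; the class name `TagLogger` and the helper `tag_events` are made up for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records every tag and text node the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for tags written in the <br /> style
        self.events.append(("self-closing", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data))


def tag_events(markup):
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Upper-case tag names are reported in lower case by the parser
print(tag_events("<H1>yyds</H1><BR />"))
```

The parser reports `H1` and `BR` as `h1` and `br`, which is the case-insensitivity rule in action.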

Format code in HBuilderX: Ctrl + K

The code below was run in HBuilderX (the comments in the code explain each part).


<!-- Declares this document as an HTML5 document -->
<!DOCTYPE html>
<!-- Root element of the HTML document -->
<html lang="en">
	<!-- Holds metadata about the document -->
	<head>
		<!-- Metadata -->
		<meta charset="UTF-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<meta http-equiv="X-UA-Compatible" content="ie=edge">
		<!-- Page title -->
		<title>HTML Basics</title>
		<link rel="stylesheet" type="text/css" href="./css/style.css"/>
	</head>
	<!-- Visible page elements go inside the body tag -->
	<body>
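		<!-- div: a generic block container used to group related elements -->
		<div class="a">
			<!-- Heading tags: h1, h2 ... h6 -->
			<h1 id="a">I am an h1 heading</h1>
			<h2 id="b">I am an h2 heading</h2>
			<h6 id="c">I am an h6 heading</h6>
			
			<!-- Paragraph tag: <p> -->
			<p>YYDS</p>
			<p>Big brother, Master has been carried off by a monster</p>
			<!-- Hyperlink tag: <a> links to another page -->
			<a href="https://www.baidu.com" target="_blank">Baidu</a>
			
			<!-- Image tag: <img /> (width and height take plain pixel numbers, without a unit) -->
			<img src="https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png" alt="Baidu logo" width="100" height="50"/>
			<!-- href: points to a linked resource (for example an HTML page or a CSS file) -->
			<!-- src: loads a resource (image, audio, video) and embeds it in the page -->
			
			<img src="./img/4.jpg" alt="Local image" width="100" height="80"/>
			<!-- Absolute path: the full path starting from the root of the drive or site -->
			<!-- Relative path: resolved from the directory that contains the current file
				 ./: the current directory
				 ../: the parent directory
			 -->
			 
			 <!-- Audio: <audio> -->
			 <audio controls src="./music/1.m4a" ></audio>
			 
			 <!-- Video: <video> -->
			 <video controls src="./video/puppy_2.mp4"></video>
		</div>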
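		<div class="b">
			<!-- Lists: ordered and unordered -->
			<!-- Note: bare text and <p> placed directly inside <ol> or <ul> are not valid HTML.
				 Browsers tolerate them, and the <p> is kept because the :nth-child rule in the style sheet counts it. -->
			<ol>
				Ordered list
				<p>The Outcast</p>
				<li>One Piece</li>
				<li>Naruto</li>
				<li>Soul Land</li>
				<li>Detective Conan</li>
			</ol>
			<ul>
				Unordered list
				<li>Honor of Kings</li>
				<li>Mole Manor</li>
				<li>Peacekeeper Elite</li>
				<li>LOL</li>
				<li>CSGO</li>
			</ul>
			
			<!-- iframe: embeds another page inside the current page -->
			<iframe src="https://www.chinanews.com/" width="500" height="300"></iframe>
		</div>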
		<div class="c">
			<!-- Table: <table> wraps the whole table; <tbody> groups the body rows -->
			<!-- tr: a table row -->
			<!-- th: a header cell -->
			<!-- td: a data cell -->
			<table>
				<tbody>
					<tr>
						<th>Team</th>
						<th>Q1</th>
						<th>Q2</th>
						<th>Q3</th>
						<th>Q4</th>
						<th>Total</th>
					</tr>
					<tr>
						<td>Suns</td>
						<td>&nbsp;16</td>
						<td>31</td>
						<td>30</td>
						<td>21</td>
						<td>98</td>
					</tr>
					<tr>
						<td>Bucks</td>
						<td>29</td>
						<td>13</td>
						<td>35</td>
						<td>28</td>
						<td>105</td>
					</tr>
				</tbody>
			</table>
			<!-- Line break tag: <br /> -->
			<!-- Horizontal rule tag: <hr /> -->
			<hr />
			<!-- Bold tags: b, strong -->
			<b>YYDS</b>
			<strong>YYDS</strong>
			<br />
			<!-- Italic tags: i, em -->
			<i>Italic text</i>
			<em>Italic text</em>
			<!-- Form tag: <form> -->
			<form>
				Account: <input type="tel" /><br />
				Password: <input type="password" /><br />
				<input type="submit" value="Log in">
				<input type="reset" value="Reset">
			</form>
		</div>
	</body>
</html>
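The relative-path rules noted in the page's comments (`./` for the current directory, `../` for the parent) can be checked from Python with `urllib.parse.urljoin`, which resolves a reference against a base URL in the way a browser does. This is a small sketch; the base URL `http://example.com/site/pages/index.html` is a made-up location for the page.

```python
from urllib.parse import urljoin

# Hypothetical address of the HTML page; only used to illustrate resolution
PAGE_URL = "http://example.com/site/pages/index.html"


def resolve(reference, base=PAGE_URL):
    """Return the absolute URL that `reference` points to when used on the page at `base`."""
    return urljoin(base, reference)


for ref in ("./img/4.jpg", "../css/style.css", "/img/4.jpg", "https://www.baidu.com"):
    print(ref, "->", resolve(ref))
```

An absolute URL such as `https://www.baidu.com` is returned unchanged, while `../css/style.css` climbs one directory above `pages/`.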

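/* Style sheet linked from the page as ./css/style.css */
/* CSS is the style sheet language used to describe how HTML is presented */
/* CSS can be attached in three ways: inline, internal (embedded), and external (linked) */

/* Universal selector: * */
* {
	/* Outer spacing (margin); none is not a valid value here, so use 0 */
	margin: 0;
	/* Inner spacing (padding) */
	padding: 0;
}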

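/* Class selector: .name */
.a {
	border: 1px solid red;
}

.b {
	border: 1px double blue;
}

.c {
	/* border: 1px dotted green; */
	border: 1px dashed black;
}

/* ID selector: #name */
#a {
	color: yellowgreen;
	font-family: "楷体";
	font-size: 40px;
}

#b {
	color: skyblue;
	font-size: 35px;
}

#c {
	color: lawngreen;
	font-family: "楷体";
	font-size: 30px;
}

/* Type (tag) selector */
/* Applies the style to every p element on the page */
p {
	font-size: 22px;
}

/* Child combinator */
/* Selects h1 elements that are direct children of .a */
.a>h1 {
	text-align: center;
}

/* Descendant combinator */
.b li {
	color: aquamarine;
}

/* General sibling combinator */
/* h1~h2 selects every h2 that follows an h1 under the same parent;
   the comma adds all h6 elements as a second, separate selector */
h1~h2,
h6 {
	text-align: center;
}

/* Adjacent sibling combinator */
/* Selects only the h2 that immediately follows an h1 under the same parent */
h1+h2 {
	font-family: "仿宋";
}

/* :nth-child selector */
/* :nth-child counts all sibling elements regardless of tag, so the <p> inside
   the <ol> is child 1 and this rule matches the second <li> */
.b>ol>li:nth-child(3) {
	color: red;
}

/* Attribute selector */
input[type=submit] {
	background-color: green;
	color: white;
	border: none;
}

form {
	border: 2px solid black;
	width: 350px;
	margin: 20px 200px 20px 420px; /* top right bottom left */
	padding-left: 150px;
}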
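The `margin` shorthand on the form takes its four values in the order top, right, bottom, left. When fewer values are given, CSS fills in the missing sides from the opposite ones. The sketch below spells out that rule in Python; the function name `expand_box_shorthand` is made up for illustration, and the same rule applies to `padding`.

```python
def expand_box_shorthand(value):
    """Expand a CSS margin/padding shorthand into its four sides."""
    parts = value.split()
    if len(parts) == 1:
        # One value: all four sides
        top = right = bottom = left = parts[0]
    elif len(parts) == 2:
        # Two values: top/bottom, then left/right
        top = bottom = parts[0]
        right = left = parts[1]
    elif len(parts) == 3:
        # Three values: top, left/right, bottom
        top, right, bottom = parts
        left = right
    elif len(parts) == 4:
        # Four values: top, right, bottom, left (clockwise)
        top, right, bottom, left = parts
    else:
        raise ValueError("expected 1 to 4 values")
    return {"top": top, "right": right, "bottom": bottom, "left": left}


print(expand_box_shorthand("20px 200px 20px 420px"))
print(expand_box_shorthand("10px 5px"))
```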

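Web Scraping

Requesting a site too many times can get your IP blocked. To avoid this, save the data to a file after fetching it once, and read it back from the file whenever you need it. (Proxy IPs, covered later, are another way around the problem.)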
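One way to apply the save-first idea is a small helper that only goes to the network when no saved copy exists. This is a sketch under assumptions: the name `load_or_fetch` is made up, and `fetch` stands for any zero-argument function that returns the page text.

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the text saved at `path`; call `fetch()` and save its result only if the file is missing."""
    cache = Path(path)
    if cache.exists():
        # A saved copy exists, so no request is made
        return cache.read_text(encoding="utf-8")
    text = fetch()
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(text, encoding="utf-8")
    return text
```

With requests, a call might look like `load_or_fetch('resources/豆瓣电影.html', lambda: requests.get(URL, headers=headers).text)`, so repeated runs of the script hit the site only once.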

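Steps to find the values for the headers parameter:
Press F12 (or Fn+F12), or open Developer Tools from the browser menu. In the Network tab, reload the page, select a request, and copy the User-Agent value from its Request Headers.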
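Headers copied from the developer tools come as `Name: value` lines, while requests expects a dictionary of names to values. A small conversion helper, sketched here with the made-up name `headers_from_text`, avoids pasting the name into the value by mistake.

```python
def headers_from_text(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(":")
        if sep and name.strip():
            headers[name.strip()] = value.strip()
    return headers


copied = """
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Referer: https://movie.douban.com/
"""
print(headers_from_text(copied))
```

Lines that start with a colon or contain no colon at all are skipped.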

"""
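example01 - fetch a page and save it to a file

Author: Asus
Date: 2021/8/16
"""
import os

import requests

URL = 'https://movie.douban.com/top250'
# A browser-like User-Agent; without it the site may reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
resp = requests.get(url=URL, headers=headers)
print(resp.text)
# Make sure the output directory exists, then save the page for later parsing
os.makedirs('resources', exist_ok=True)
with open('resources/豆瓣电影.html', 'w', encoding='utf-8') as file:
    file.write(resp.text)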

Scraping with regular expressions

"""
example01 - scraping with regular expressions
Author: Asus
Date: 2021/8/16
"""
import re

with open('resources/豆瓣电影.html', 'r', encoding='utf-8') as file:
    content = file.read()
    # print(content)

# re_str = r'<img width="100" alt="(.*?)" src="(.+)" class="">'
# Raw string, so that \d and \. reach the regex engine unchanged
re_str = r'<img width="100" alt="(.*?)" src="([a-z]{5}.{3}[a-z\d]{4}\.[a-z]{8}\.[a-z]{3}/view/photo/s_ratio_poster/public/p\d{9}\.jpg)" class="">'
# search() returns the first match, or None when nothing matches
result = re.search(re_str, content)
print(result)
if result:
    # span(): start and end positions of the matched text
    print(result.span())
    # group(n): the text captured by group n
    # group(0) returns the whole match, not the captured groups
    print(result.group(1))
    print(result.group(2))
    # groups(): all captured groups as one tuple
    print(result.groups())
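`re.search` stops at the first match. To collect every poster on a page, `re.findall` returns one tuple of groups per match. The sketch below uses a made-up two-line sample and a simplified pattern rather than the saved Douban page, so the `img.example.com` URLs are placeholders.

```python
import re

# Made-up markup in the same shape as the <img> tags matched above
SAMPLE = '''
<img width="100" alt="Movie A" src="https://img.example.com/a.jpg" class="">
<img width="100" alt="Movie B" src="https://img.example.com/b.jpg" class="">
'''

# Two non-greedy groups: the alt text and the image URL
IMG_PATTERN = re.compile(r'<img width="100" alt="(.*?)" src="(.*?)" class="">')


def extract_posters(html):
    """Return a list of (title, image_url) tuples, one per matching <img> tag."""
    return IMG_PATTERN.findall(html)


for title, url in extract_posters(SAMPLE):
    print(title, url)
```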

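Requesting data with requests

"""
example01 - requesting data with requests

Author: Asus
Date: 2021/8/17
"""
import requests

resp = requests.get(
    url='http://www.baidu.com',
    # User-Agent: makes the scraper look like a browser
    # Cookie: carries the user's login session information
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    }
)
# Status codes: 200, request succeeded; 403, the site refused the request (the scraper may be blocked);
# 404, page not found; 500, server error
print(resp.status_code)
# The URL that was requested
print(resp.url)
# Response headers: 'Content-Type' is the one worth remembering
print(resp.headers)
# Encoding taken from the response headers; for text responses without a charset
# it falls back to ISO-8859-1, which garbles Chinese
print(resp.encoding)
# Encoding guessed from the response body itself
print(resp.apparent_encoding)
# Fix garbled text by switching to the detected encoding
resp.encoding = resp.apparent_encoding
# Page source as text (str)
print(resp.text)
# Page source as raw bytes
# print(resp.content)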
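The garbling that the encoding fix addresses can be reproduced without any network access: UTF-8 bytes read as ISO-8859-1 turn into unrelated Latin characters, and reversing the wrong step recovers the text. The function names below are made up for the illustration.

```python
def misdecode(text):
    """Encode as UTF-8 (what the server sends) but decode as ISO-8859-1 (the wrong guess)."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled):
    """Undo the wrong decoding: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


original = "豆瓣电影"
garbled = misdecode(original)
print(repr(garbled))     # unreadable Latin-1 characters
print(repair(garbled))   # the original Chinese text again
```

Setting `resp.encoding` before reading `resp.text` has the same effect as decoding the bytes correctly the first time.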

Parsing HTML with bs4

"""
example02 - bs4

Author: Asus
Date: 2021/8/17
"""
import bs4

# bs4: Beautiful Soup 4 ---> extracts data from HTML or XML.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# The 'lxml' parser requires the lxml package to be installed
soup = bs4.BeautifulSoup(html, 'lxml')
# print(soup, type(soup), sep='\n')
# prettify(): pretty-prints the HTML source
print(soup.prettify())

# Accessing a tag by name returns only the first matching tag
print(soup.head.title)

# Four ways to read a tag's content: string, get_text(), text, contents
# string, get_text() and text give the text; contents gives a list of child nodes
print(soup.head.title.string)
print(soup.head.title.get_text())
print(soup.head.title.text)
print(soup.head.title.contents)

# Selecting tags:
#     select: takes a CSS selector (id, class, tag, attribute, child, descendant, general sibling, adjacent sibling) ---> list of matches
#     select_one: takes the same selectors ---> the first match only, or None if nothing matches
p_list = soup.select('body > p')
print(p_list)
p_list1 = soup.select('body > .title')
print(p_list1)

p = soup.select_one('body > p')
print(p)

Proxy IPs

"""
example03 - proxy IPs

Author: Asus
Date: 2021/8/17
"""
import requests

from check_proxies import check_ip

flag = True
while flag:
    # Replace with the API URL from the proxy service you registered with
    URL = 'your proxy API URL'
    ip_list = check_ip(URL)
    print(ip_list)
    for ip in ip_list:
        douban_url = 'https://movie.douban.com/top250'
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
        }
        # Proxy settings: the keys are the scheme of the target URL.
        # Assumption: the proxies are plain HTTP proxies, so both values use http://
        proxy = {
            'http': 'http://' + ip,
            'https': 'http://' + ip
        }

        try:
            resp = requests.get(url=douban_url, headers=headers, proxies=proxy, timeout=1)
            if resp.status_code == 200:
                print(resp.text)
                flag = False
                break
        except requests.RequestException:
            # This proxy failed or timed out; try the next one
            print('Error')
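The loop above keeps requesting new proxy lists until one request succeeds. A variant that tries each address once and then gives up can be sketched as a function. The names `build_proxies` and `first_success` are made up, and `fetch` stands for any function that takes a proxies dictionary and returns the page text, for example a wrapper around `requests.get`. As in the sketch's comment, it assumes plain HTTP proxies.

```python
def build_proxies(address):
    """Build a proxies dict for an 'ip:port' address (assumes a plain HTTP proxy)."""
    return {"http": f"http://{address}", "https": f"http://{address}"}


def first_success(addresses, fetch):
    """Try each proxy once; return (address, result) for the first that works, else (None, None)."""
    for address in addresses:
        try:
            return address, fetch(build_proxies(address))
        except OSError as exc:
            # requests.RequestException is a subclass of OSError, so it is caught here too
            print(f"{address} failed: {exc}")
    return None, None
```

Returning `(None, None)` lets the caller decide whether to ask the proxy service for a fresh list or stop.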

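This module (saved as check_proxies.py) checks whether proxy IPs are still alive and keeps only those that still accept connections.
Notes:
1. It only works with API endpoints that return plain text (TXT format).
2. An IP that is still alive is not necessarily usable for scraping.
3. To use it, call check_ip(url), where url is the address of the TXT-format API.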

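import re
import socket
from concurrent.futures import ThreadPoolExecutor

import requests


# Fetch the API response, extract the proxy addresses, and check them in parallel
def check_ip(url):
    real_ip = []

    # A proxy counts as alive if a TCP connection to ip:port succeeds within 1 second.
    # socket is used instead of the original telnetlib call, because telnetlib
    # was removed from the standard library in Python 3.13.
    def telnet_ip(ip, port):
        try:
            with socket.create_connection((ip, int(port)), timeout=1):
                real_ip.append(f'{ip}:{port}')
        except OSError:
            pass

    resp = requests.get(url)
    # Raw string, so that \d and \. reach the regex engine unchanged
    ip_data = re.findall(r'(\d+\.\d+\.\d+\.\d+):(\d+)', resp.text)
    # Leaving the with block waits for all submitted checks to finish
    with ThreadPoolExecutor(max_workers=16) as pool:
        for ip, port in ip_data:
            pool.submit(telnet_ip, ip, port)
    return real_ip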
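The two steps inside `check_ip`, extracting `ip:port` pairs from the TXT response and checking them in a thread pool, can be tried without a proxy account by separating them and passing the check in as a function. This is a sketch: `parse_proxies`, `filter_alive` and the sample addresses are made up, and `is_alive` stands for whatever connection test is used.

```python
import re
from concurrent.futures import ThreadPoolExecutor

PROXY_PATTERN = re.compile(r'(\d+\.\d+\.\d+\.\d+):(\d+)')


def parse_proxies(text):
    """Extract (ip, port) pairs from a plain-text proxy list."""
    return [(ip, int(port)) for ip, port in PROXY_PATTERN.findall(text)]


def filter_alive(proxies, is_alive, max_workers=16):
    """Run is_alive(ip, port) for every proxy in parallel and keep the ones that pass."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so they line up with `proxies`
        flags = list(pool.map(lambda pair: is_alive(*pair), proxies))
    return [f'{ip}:{port}' for (ip, port), ok in zip(proxies, flags) if ok]


sample = "1.2.3.4:8080\r\n5.6.7.8:3128\nnot a proxy"
pairs = parse_proxies(sample)
print(pairs)
# Stand-in check for the demo: pretend only port 8080 answers
print(filter_alive(pairs, lambda ip, port: port == 8080))
```

Collecting results from `map()` keeps the output in the same order as the input list, which appending from several threads does not guarantee.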
