Learning Makes Me Happy: Day 14

Latest recommended article published on 2024-09-20 00:07:10

Mr_suyi

Tags: front-end learning, javascript

Article link: https://blog.csdn.net/mr_suyi/article/details/125725133

Day 014

I. Recreating a Baidu-style home page with the HTML learned so far

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		<title>某度一下，你就知道</title>
		<style>
			* {
				margin: 0;
				padding: 0;
			}
			
			.top {
				margin-top: 10px;
				width: 100%;
				height: 30px;
			}
			
			.top_left {
				float: left;
			}
			
			.top_right {
				float: right;
			}
			
			a {
				/* Change the text color */
				color: black;
				/* Remove the underline from hyperlinks */
				text-decoration: none;
				font-size: 13px;
				margin-left: 20px;
			}
			
			.top_right>a,
			.top_right>input {
				margin-right: 10px;
			}
			
			/* Style the login button */
			.top_right>input {
				width: 44px;
				height: 20px;
				background-color: rgba(61, 83, 239, 0.9);
				color: white;
				border: 0px;
				border-radius: 6px;
				font-size: 12px;
			}
			
			
			
			.center {
				width: 100%;
				text-align: center;
			}
			
			.center>div:nth-child(2)>.input_text {
				border: 2px solid lightgray;
				border-right: 0px;
				border-radius: 10px 0 0 10px;
				width: 450px;
				height: 30px;
			
			}
			
			.center>div:nth-child(2)>.button {
				border: 0px;
				border-radius: 0 10px 10px 0;
				background-color: rgba(61, 83, 239, 0.9);
				color: white;
				width: 80px;
				height: 34px;
				margin-left: -6px;
			}
			
			
			
			table {
				width: 530px;
				text-align: left;
				margin-top: 30px;
				margin-left: 28%;
				font-size: 13px;
			}
			
			td {
				height: 30px;
			}
			
			
			
			
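			.bottom {
				/* A fixed element shrinks to fit its content, so it needs full width for text-align to center the text */
				width: 100%;
				text-align: center;
				font-size: 12px;
				position: fixed;
				bottom: 2px;
			}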
			
		</style>
	</head>
	<body>
		<!-- top -->
		<div class="top">
			<!-- top-left -->
			<div class="top_left">
				<a href="">新闻</a>
				<a href="">hao123</a>
				<a href="">地图</a>
				<a href="">贴吧</a>
				<a href="">视频</a>
				<a href="">图片</a>
				<a href="">网盘</a>
				<a href="">更多</a>
			</div>
			<!-- top-right -->
			<div class="top_right">
				<a href="" class="set">设置</a>
				<input type="submit" value="登录" class="login">
			</div>
		</div>


		<!-- center -->
		<div class="center">
			<!-- logo -->
			<div>
				<img src="https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png" alt="Baidu logo" width="200"
					height="100">
			</div>

			<!-- input -->
			<div>
				<input type="text" class="input_text">
				<input type="submit" value="百度一下" class="button">
			</div>

			<!-- table -->
			<div>
				<table>
					<tr>
						<td><b>百度热搜</b></td>
						<td style="text-align:right;">换一换</td>
					</tr>

					<tr>
						<td>
							<span>0.</span>
							<span>多措并举助企稳就业</span>
						</td>
						<td>
							<span>3.</span>
							<span>山东舰霸气壁纸</span>
						</td>
					</tr>

					<tr>
						<td>
							<span>1.</span>
							<span>多措并举助企稳就业</span>
						</td>
						<td>
							<span>4.</span>
							<span>多措并举助企稳就业</span>
						</td>
					</tr>

					<tr>
						<td>
							<span>2.</span>
							<span>多措并举助企稳就业</span>
						</td>
						<td>
							<span>5.</span>
							<span>多措并举助企稳就业</span>
						</td>
					</tr>
				</table>
			</div>

		</div>


		<!-- bottom -->
		<div class="bottom">
			<span>关于百度</span>
			<span>About Baidu</span>
			<span>京公网安备11000002000001号</span>
			<span>京ICP证030173号</span>
			<span>药品医疗器械网络信息服务备案（京）网药械信息备字（2021）第00159号</span>
			<span>医疗器械网络交易服务第三方平台备案凭证（京）网械平台备字（2020）第00002号</span>
			<span></span>
		</div>
	</body>
</html>

Rendered result:

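II. Scraping news with what we have learned

1. Parsing pages with BeautifulSoup4

Crawler workflow

requests - request the page and get the response
BeautifulSoup4 - parse the page from the response and extract the data
Write the data to a file or a database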
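
The script in this post covers the first two steps and only prints the results. As a minimal sketch of the third step, assuming each scraped record is a (type, title, link, time) tuple, the standard `csv` module could store the records as shown here. The function name `save_news`, the column names, and the sample rows are illustrative assumptions, not part of the original script.

```python
import csv
import io

# Hypothetical column names for the four fields the crawler extracts.
FIELDS = ['type', 'title', 'link', 'time']


def save_news(rows, file_obj):
    """Write a list of (type, title, link, time) tuples to an open text file as CSV."""
    writer = csv.writer(file_obj)
    writer.writerow(FIELDS)   # header row
    writer.writerows(rows)    # one line per news item
    return len(rows)


# Made-up sample records standing in for scraped data.
sample_rows = [
    ('Finance', 'Sample headline A', 'https://www.chinanews.com.cn/a.shtml', '7-11 10:00'),
    ('Sports', 'Sample headline B', 'https://www.chinanews.com.cn/b.shtml', '7-11 10:05'),
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
count = save_news(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file, the buffer could be swapped for `open('news.csv', 'w', newline='', encoding='utf-8-sig')`; the `utf-8-sig` encoding is a common choice so that spreadsheet programs display Chinese text correctly.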

import requests
from bs4 import BeautifulSoup

# The bs4 module extracts data from HTML or XML.
for page in range(1,11):
    print(f'Page {page}')
    URL = f'https://www.chinanews.com.cn/scroll-news/news{page}.html'
    # URL = 'https://101.qq.com/#/hero'

    # headers = {}  --->  headers is a dictionary: {key: value}
    # headers disguise the crawler
    # User-Agent ----> makes the crawler look like a browser

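    # The key must be exactly 'User-Agent' (no trailing space), otherwise the default requests User-Agent is still sent
    Headers ={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
                     ' Chrome/92.0.4515.131 Safari/537.36 SLBrowser/8.0.0.7062 SLBChan/123'
    }
    # Sending a browser User-Agent makes it harder for the site to identify the request as a crawler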

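    # timeout keeps the loop from hanging forever on an unresponsive server
    response = requests.get(url=URL, headers=Headers, timeout=10)
    # A status code of 200 means the request succeeded
    if response.status_code == 200:
        response.encoding = 'utf-8'
        # response.encoding = 'gbk'
        # Print the page source (a string)
        # print(response.text)
        # Why compare the printed result with what the browser shows?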
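        """
        Web pages are either static or dynamic.
        Static page: the content is hard-coded; it stays the same unless someone edits it.
        Dynamic page: the content is not hard-coded; a technique such as JavaScript loads the data into the page.
    
        requests returns only the HTML the server sends, i.e. the static content; data filled in later by JavaScript is missing.
        """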
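        # BeautifulSoup(page source, parser) --> converts the source string into a BeautifulSoup object
        # bs4 provides a set of data-extraction methods that operate on BeautifulSoup and Tag objects.
        soup = BeautifulSoup(response.text, 'html.parser')
        # print(soup, type(soup))
        # select: locates data by CSS selector (tag, class, id, ...) and returns every match (a list whose elements are Tag objects)
        # select_one: locates data by CSS selector and returns the first match (a Tag object, or None if nothing matches)
        # text: extracts the text inside a tag; the result is a string.
        # attrs: the tag's attributes as a dictionary; attrs['href'] returns one attribute value as a string.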
        li_list = soup.select('body > div.w1280.mt20 > div.content-left > div.content_list > ul > li')
        # print(li_list)
        for i in li_list:
            # print(i)
            # Some <li> items carry no news entry; select_one returns None for them, so skip those
            type_tag = i.select_one('li > div.dd_lm > a')
            if type_tag is not None:
                news_type = type_tag.text
                # print(news_type,type(news_type))
                title_tag = i.select_one('li > div.dd_bt >a')
                news_title = title_tag.text
                # print(news_title)
                news_href = 'https://www.chinanews.com.cn' + title_tag.attrs['href']
                # print(news_href)
                news_time = i.select_one('li > div.dd_time').text
                print(news_type, news_title, news_href, news_time)
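
The script builds `news_href` by string concatenation, which assumes every `href` is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves site-relative, protocol-relative, and already absolute links against a base URL. The helper name `absolute_link` and the sample paths are invented for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# The page the links were found on; used as the base for resolution.
BASE_URL = 'https://www.chinanews.com.cn/scroll-news/news1.html'


def absolute_link(href, base=BASE_URL):
    """Resolve an href found on the page against the page's own URL."""
    return urljoin(base, href)


# Hypothetical href values of the three kinds a page may contain.
site_relative = absolute_link('/gn/example.shtml')
protocol_relative = absolute_link('//www.chinanews.com.cn/cj/example.shtml')
already_absolute = absolute_link('https://www.chinanews.com.cn/ty/example.shtml')

print(site_relative)
print(protocol_relative)
print(already_absolute)
```

With plain concatenation the second and third cases would produce a malformed URL, while `urljoin` leaves an absolute link unchanged and fills in only the missing parts of the others.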