Web Scraping

voicialex

Categories: Web Scraping, Python. Tags: web scraping

Article link: https://blog.csdn.net/weixin_44280688/article/details/98474330

Keywords: requests, urllib, BeautifulSoup, Scrapy

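Web page structure

HTML, CSS, JavaScript

In HTML, nearly all content is wrapped in a tag, and the tag determines how the wrapped content is rendered.

The two main tags are head and body:

head holds the page's metadata, such as the title. This information is not displayed in the page content itself; most of the time it is there for search-engine crawlers to read.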

<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>

body holds the content that is displayed on the page:

<body>
    <h1>爬虫测试1</h1>
    <p>
        这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
        <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.
    </p>
</body>

The <h1> </h1> tag holds a heading.

The <p> </p> tag holds a paragraph.

The <a> </a> tag holds a link.
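To see how tags frame content, the following is a minimal sketch that walks a small HTML string with the standard library's html.parser and prints which tag each piece of text sits in. The sample markup and the TagCollector class are assumptions made for illustration; they are not part of the original tutorial.

```python
from html.parser import HTMLParser

# Sample markup modeled on the body snippet above (an assumption, not the live page).
SAMPLE_HTML = """
<body>
    <h1>Scraping test 1</h1>
    <p>
        A simple test from <a href="https://morvanzhou.github.io/">莫烦Python</a>.
    </p>
</body>
"""


class TagCollector(HTMLParser):
    """Hypothetical helper: records each piece of text with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.found = []  # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost tag if it matches the end tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.append((self.stack[-1], text))


collector = TagCollector()
collector.feed(SAMPLE_HTML)
for tag, text in collector.found:
    print(tag, "->", text)
```

The heading text is reported under h1, the link text under a, and the surrounding sentence under p, which mirrors the roles of the three tags described above.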

Scraping

from urllib.request import urlopen

# read() returns bytes; decode them as UTF-8 so that Chinese text is readable
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)
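The fetch step can also be wrapped in a small function. The sketch below is one possible shape, not the tutorial's own code: fetch_html, its timeout, and its fallback_encoding parameter are hypothetical names. It uses the charset declared in the response headers when there is one and falls back to UTF-8 otherwise.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10, fallback_encoding="utf-8"):
    """Hypothetical helper: download a page and return its content as str."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # raw bytes of the response body
        # Prefer the charset from the Content-Type header, if the server sent one.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset)
```

Calling fetch_html("https://morvanzhou.github.io/static/scraping/basic-structure.html") would then return the same text as the snippet above, assuming the page is still reachable.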

Matching with regular expressions

import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])

res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets "." match newlines, needed when the paragraph spans several lines
print("\nPage paragraph is: ", res[0])
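The same approach works for attributes. As a minimal sketch, the pattern below pulls every href value out of a page. The sample_html string is an assumption modeled on the tutorial page so that the example runs without a network connection.

```python
import re

# Sample markup modeled on the tutorial page (an assumption, so the sketch runs offline).
sample_html = """
<p>
    这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
    <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.
</p>
"""

# The non-greedy group stops at the first closing quote after each href=".
links = re.findall(r'href="(.*?)"', sample_html)
print("All links:", links)
```

Regular expressions like this are fine for quick checks on simple pages, but they become fragile once attributes use single quotes or tags are nested, which is one reason to move on to a parser.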

BeautifulSoup

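Workflow:

1. Choose a URL.

2. Open the page with urlopen.

3. Read the page content with read(); call decode('utf-8') when the page contains Chinese text.

4. Pass the content to BeautifulSoup.

5. Use BeautifulSoup to select the content of tags (in place of regular expressions).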
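The steps above can be sketched as follows. This is a minimal illustration under two assumptions: the beautifulsoup4 package is installed, and steps 1 to 3 are replaced by an inline sample_html string (modeled on the tutorial page) so that the example runs offline. The built-in "html.parser" backend is used here; other parsers such as lxml are a matter of preference.

```python
from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4

# Stand-in for the text returned by urlopen(...).read().decode('utf-8') (an assumption).
sample_html = """
<html>
<head>
    <title>Scraping tutorial 1</title>
</head>
<body>
    <h1>Scraping test 1</h1>
    <p>
        A simple test from <a href="https://morvanzhou.github.io/">莫烦Python</a>
        <a href="https://morvanzhou.github.io/tutorials/scraping">scraping tutorial</a>.
    </p>
</body>
</html>
"""

# Step 4: hand the page text to BeautifulSoup.
soup = BeautifulSoup(sample_html, "html.parser")

# Step 5: select tags by name instead of writing regular expressions.
title = soup.title.get_text()
heading = soup.h1.get_text()
links = [a["href"] for a in soup.find_all("a")]

print("Title:", title)
print("Heading:", heading)
print("Links:", links)
```

soup.h1 and soup.title return the first matching tag, while find_all("a") returns every match, so it suits tags that appear more than once.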
