Python Scripts: Scraping Web Information

shenmingik

Article link: https://blog.csdn.net/shenmingxueIT/article/details/117120746

Contents

Using the webbrowser module
References

Using the webbrowser module

Opening a given web URL

Given a URL, the open() function of the webbrowser module opens that URL in a browser.

import webbrowser
webbrowser.open("https://mp.csdn.net/console/home?spm=1000.2115.3001.4503")

Result: a prompt appears, and clicking OK opens the website.

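Using this module, we can write a script that searches the author's earlier blog posts for a given topic:

Start the script from the command line with one argument that specifies what to search for
The script then opens the search results page in the browser

To implement this script, we need to sort out the following:

Understanding the structure of the URL

Example:
https://so.csdn.net/so/search?q=%E8%BF%9B%E7%A8%8B&t=blog&u=shenmingxueIT
In this URL, the value after q= is the search text (here the percent-encoded form of 进程), which is the argument we pass in; the value after u= is the blogger's id.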
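The URL structure described above can be reproduced with the standard library. This is a minimal sketch, assuming the q, t and u parameters behave as in the example URL; build_search_url is a hypothetical helper name, not something from the original script.

```python
from urllib.parse import quote, urlencode

# Percent-encode the search text on its own (UTF-8 bytes become %XX escapes).
encoded = quote("进程")
print(encoded)

def build_search_url(text, user="shenmingxueIT"):
    """Build a CSDN blog-search URL for `text` (hypothetical helper)."""
    # urlencode() percent-encodes every value and joins the pairs with "&".
    params = {"q": text, "t": "blog", "u": user}
    return "https://so.csdn.net/so/search?" + urlencode(params)

print(build_search_url("进程"))
```

Note that urlencode() writes a space as "+", which is the usual form for query strings, while quote() writes it as "%20".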

Handling command-line arguments

The code is as follows:

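import sys
import webbrowser
from urllib.parse import quote

# If arguments were given, join them all into the search text
if len(sys.argv) > 1:
    content = " ".join(sys.argv[1:])
    # Percent-encode the text so spaces and non-ASCII characters are safe in the URL
    webbrowser.open("https://so.csdn.net/so/search?q=" + quote(content) + "&t=blog&u=shenmingxueIT")
else:
    print("Usage: python3 web.py <search text>")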

Result:

ubuntu@VM-0-2-ubuntu:~/python_file/Python$ python3 web.py 进程

Downloading files from the web

The requests module makes it easy to download files from the web. To use it, first install it with the following commands:

ubuntu@VM-0-2-ubuntu:~/python_file/Python$ sudo apt install python3-pip

ubuntu@VM-0-2-ubuntu:~/python_file/Python$ pip3 install requests

Downloading a web page with requests.get()

The requests.get() function takes the URL to download as a string. Its return value is a Response object, which contains the web server's response to your request.

import requests
# This URL holds the full text of Romeo and Juliet
res = requests.get("http://www.gutenberg.org/cache/epub/1112/pg1112.txt")
if res.status_code == requests.codes.ok:
    print(len(res.text))
    # Print the first 250 characters of the play
    print(res.text[0:250])

If the request succeeds, the downloaded page is stored as a string in the text attribute of the Response object.
Output:

ubuntu@VM-0-2-ubuntu:~/python_file$ /usr/bin/python3 /home/ubuntu/python_file/Python/web.py
179380
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare


*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A
TIME WHEN PROOFING METHODS AND TOO

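Checking for errors

The Response object has a status_code attribute; comparing it with requests.codes.ok tells you whether the download succeeded.

A simpler way to check is to call the raise_for_status() method on the Response object. If the server answered with an error status (4xx or 5xx), this raises an exception.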

import requests
# A URL that does not exist
res = requests.get("http://inventwithpython.com/123456")
try:
    res.raise_for_status()
except Exception as exc:
    print("There was a problem: %s" %(exc))

Output:

ubuntu@VM-0-2-ubuntu:~/python_file$ /usr/bin/python3 /home/ubuntu/python_file/Python/web.py
There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/123456
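The two checks can be compared side by side without sending a request. The following is a sketch under stated assumptions: the Response object is built by hand purely for illustration (real code gets it from requests.get()), and describe and the example.com URL are hypothetical.

```python
import requests

def describe(status_code, url="http://example.com/missing"):
    """Report how both checks treat a response with the given status code."""
    res = requests.Response()      # built by hand; no request is sent
    res.status_code = status_code
    res.url = url
    # Check 1: compare the status code with requests.codes.ok (200).
    ok_by_code = res.status_code == requests.codes.ok
    # Check 2: raise_for_status() raises HTTPError for 4xx and 5xx codes.
    try:
        res.raise_for_status()
        raised = False
    except requests.exceptions.HTTPError:
        raised = True
    return ok_by_code, raised

print(describe(200))  # (True, False)
print(describe(204))  # (False, False)
print(describe(404))  # (False, True)
```

The 204 case shows that the two checks are not equivalent: the comparison only accepts 200, while raise_for_status() accepts any status that is not a client or server error.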

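Saving the downloaded file to disk

Now that we have the page content, we can save it to a file on disk with open() and write(). To write the page to the file, use a for loop over the Response object's iter_content() method.

Note:
Because iter_content() yields bytes, the file must be opened in binary write mode ("wb"), even when the page is plain text. This also preserves the Unicode encoding of the text.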

import requests
res = requests.get("http://www.gutenberg.org/cache/epub/1112/pg1112.txt")
res.raise_for_status()
# iter_content() returns one chunk of the content per loop iteration.
# Each chunk is a bytes object; the argument sets how many bytes a chunk may hold.
# The with statement closes the file even if writing fails.
with open("Romo&Juliet.txt", "wb") as down_file:
    for chunk in res.iter_content(50000):
        down_file.write(chunk)

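The chunked-write pattern can be tried offline. In this sketch an io.BytesIO stream stands in for the network response (an assumption made so the example runs without a connection), and save_in_chunks is a hypothetical helper; with requests, the loop would iterate over res.iter_content(chunk_size) instead of calling read().

```python
import io
import os
import tempfile

def save_in_chunks(source, path, chunk_size=50000):
    """Copy a binary stream to `path` chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as out:          # "wb" because each chunk is bytes
        while True:
            chunk = source.read(chunk_size)
            if not chunk:                  # empty bytes means end of stream
                break
            written += out.write(chunk)
    return written

# Sample data containing non-ASCII text, encoded to bytes like a real download.
data = "进程 Romeo and Juliet\n".encode("utf-8") * 10000
path = os.path.join(tempfile.mkdtemp(), "copy.txt")
total = save_in_chunks(io.BytesIO(data), path)
print(total == len(data))
```

Writing in chunks keeps memory use bounded by the chunk size, which matters for files much larger than this one.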

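Scraping web page information

The requests module can fetch HTML pages as well, and the BeautifulSoup module makes it quick to extract information from them. BeautifulSoup has to be installed before use: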

ubuntu@VM-0-2-ubuntu:~/python_file$ pip3 install beautifulsoup4

After that, BeautifulSoup4 can be imported with:

import bs4

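Next, to find the elements we want, call the select() method with a string that is a CSS selector; it returns the matching web elements.
[Figure: examples of how to write the selector string]
Selectors can also be combined. For example, soup.select("p#author") matches the element whose id is author, provided that element is a p tag.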
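A few common selector forms can be tried on a small document. This is a sketch only: the HTML below is made up for illustration, and the selectors shown are standard CSS forms rather than a reproduction of the original figure.

```python
import bs4

# Hypothetical HTML used only for illustration.
html = """
<html><body>
  <p id="author">shenmingik</p>
  <p class="notice">first notice</p>
  <div><span>inside div</span><p class="notice">second notice</p></div>
</body></html>
"""
soup = bs4.BeautifulSoup(html, features="html.parser")

by_id = [e.get_text() for e in soup.select("#author")]        # element with id="author"
by_class = [e.get_text() for e in soup.select(".notice")]     # elements with class="notice"
by_child = [e.get_text() for e in soup.select("div > span")]  # span directly inside a div
combined = [e.get_text() for e in soup.select("p#author")]    # p tag with id="author"
no_match = soup.select("div#author")                          # id exists, but not on a div

print(by_id, by_class, by_child, combined, no_match)
```

select() always returns a list, so a selector that matches nothing gives an empty list rather than an error.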

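Building our own project: an automatic translation script

Since we can now scrape web pages, let's build our own translation script on top of Youdao's online dictionary: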

import requests
import bs4
import sys
import pyperclip

# Take the text to translate from the command-line arguments
if len(sys.argv) > 1:
    content = " ".join(sys.argv[1:])
else:
    # Otherwise read it from the clipboard
    content = pyperclip.paste()
url = "https://dict.youdao.com/w/eng/" + content + "/#keyfrom=dict2.index"
res = requests.get(url)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, features="html.parser")
# Select each ul that sits directly inside a div with class trans-container
elems = soup.select("div.trans-container > ul")

# With Youdao's page layout, elems[0] holds the definitions of the word
# Print the text of each li entry
for item in elems[0].select("li"):
    print(item.get_text())

Output:

ubuntu@VM-0-2-ubuntu:~/python_file/Python$ python3 web.py web
n. 网；卷筒纸；蹼；织物；圈套
vt. 用网缠住；使中圈套
vi. 形成网
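The script above indexes elems[0] directly, which fails with an IndexError when the selector matches nothing, for instance for an unknown word or if the page layout changes. The sketch below isolates the extraction step and guards that case. It is an illustration under assumptions: extract_definitions is a hypothetical helper, and the sample fragment is hand-written to fit the selector, not copied from Youdao.

```python
import bs4

def extract_definitions(html):
    """Return the text of each li under the first div.trans-container > ul."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    elems = soup.select("div.trans-container > ul")
    if not elems:          # nothing matched: unknown word or changed layout
        return []
    return [li.get_text(strip=True) for li in elems[0].select("li")]

# Hypothetical page fragment, shaped the way the selector expects.
sample = """
<div class="trans-container"><ul>
  <li>n. 网；卷筒纸；蹼；织物；圈套</li>
  <li>vi. 形成网</li>
</ul></div>
"""
found = extract_definitions(sample)
missing = extract_definitions("<p>no results</p>")
print(found)
print(missing)
```

Keeping the parsing in a function that takes an HTML string also makes it testable without a network connection.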

References

[1] Al Sweigart. Python编程快速上手——让繁琐工作自动化 (Automate the Boring Stuff with Python). Posts & Telecom Press, 2016.07.
