Python Practice: Data Collection and Document Reading

Author: 梦里逆天

Last modified: 2022-08-14 19:59:21

Category: Python. Tags: python, urllib, BeautifulSoup, pdfminer3k

First published: 2021-07-09 22:59:14

Article link: https://blog.csdn.net/username666/article/details/118585123

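This article shows how to make URL requests with Python's urllib library and how to parse the returned HTML with BeautifulSoup to collect entries from Baidu Baike. It covers imitating a browser, sending POST requests, installing and using BeautifulSoup, and storing the collected data in MySQL.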


1. urllib and BeautifulSoup

1.1 Basic usage of urllib

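urllib is a package in the Python 3.x standard library that bundles several modules for working with URLs. With it, a script can request web pages much as a user's browser would.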

Steps:

Import the request module from urllib: from urllib import request
Request a URL, for example: resp = request.urlopen('http://www.baidu.com')
Read and print the response data, for example: print(resp.read().decode("utf-8"))

Example:

from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode("utf-8"))
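
A slightly more defensive variant of the example above, as a sketch: it closes the response with a with block and takes the charset from the Content-Type header instead of always assuming UTF-8. The helper name fetch_text, the timeout value, and the fallback charset are illustrative choices, not part of the original example.

```python
from urllib import request


def fetch_text(url, timeout=10, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str."""
    # The with block closes the response even if read() fails
    with request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in Content-Type, if one was sent
        charset = resp.headers.get_content_charset() or default_charset
    # Replace undecodable bytes instead of raising UnicodeDecodeError
    return raw.decode(charset, errors="replace")

# Usage: print(fetch_text("http://www.baidu.com")[:200])
```

Because urlopen also understands data: URLs, the helper can be tried without network access, for example fetch_text("data:text/plain;charset=utf-8,hello").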

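1.1.1 Simulating a real browser

Send a User-Agent header so that the request looks like it comes from a browser.

# Create a request object with Request(url)
req = request.Request(url)
# Add a request header with add_header(key, value)
req.add_header(key, value)
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response body with decode
print(resp.read().decode("utf-8"))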

from urllib import request
# Create a request object with Request(url)
req = request.Request("http://www.baidu.com")
# Add the User-Agent header (the header name is hyphenated)
req.add_header("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response body with decode
print(resp.read().decode("utf-8"))
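
As a small sketch of an alternative to add_header, headers can also be passed as a dict when the Request object is built. The snippet only constructs the request and inspects it; nothing is sent. Note that urllib normalizes header names with str.capitalize(), so the header is looked up as "User-agent".

```python
from urllib import request

UA = ("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/91.0.4472.124 Mobile Safari/537.36")

# Pass the headers as a dict instead of calling add_header afterwards
req = request.Request("http://www.baidu.com", headers={"User-Agent": UA})

# urllib stores the header name capitalized as "User-agent"
print(req.get_header("User-agent"))
print(req.header_items())
```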

1.1.2 Sending a POST request

Import the parse module from urllib: from urllib import parse
Build the POST data with urlencode:

postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])

Send the POST request with postData:

request.urlopen(req, data=postData.encode('utf-8'))

Get the HTTP status code: resp.status
Get the reason phrase that accompanies the status code (for example OK): resp.reason
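
To see what urlencode produces and why urlopen sends a POST, the following sketch builds a request without sending it. The endpoint http://example.com/register and the form fields are hypothetical and used only for illustration.

```python
from urllib import parse, request

# Hypothetical form fields, for illustration only
form = {"username": "alice", "note": "hello world"}

# urlencode returns a str; urlopen expects bytes, so encode it
post_data = parse.urlencode(form).encode("utf-8")

# Hypothetical endpoint; the request is built but never sent here
req = request.Request("http://example.com/register", data=post_data)

print(post_data)          # the URL-encoded body
print(req.get_method())   # the method urllib would use for this request
```

A Request that carries data is sent as POST; without data it defaults to GET.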

Example:

from urllib.request import urlopen
from urllib.request import Request
from urllib import parse

req = Request("https://m.xbiquge.la/register.php")
# Build the POST data with parse.urlencode()
postData = parse.urlencode([
    ("SignupForm[username]", "admin"),
    ("SignupForm[password]", "123456"),
    ("SignupForm[email]", ""),
    ("register", "确认注册")
])
req.add_header("Origin", "https://m.xbiquge.la")
req.add_header("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
resp = urlopen(req, data=postData.encode('utf-8'))
print(resp.read().decode("utf-8"))

Output:

<!doctype html>
<html>
<head>
<title>出现错误！</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="MobileOptimized" content="240"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0,  minimum-scale=1.0, maximum-scale=1.0" />
<style>
    .content{margin:50px 7px 10px 7px;}
    .content a{color:#0080C0;}
</style>
</head>
<body>
<div class="content">
    <div class="c1">
        <h1>出现错误！</h1>
        <strong>错误原因：</strong>
        <ul>
<li>用户名已存在.<li/><li>email不能为空！<li/>        <br /><br /><br />
        请 <a href="javascript:history.back(1)">返 回</a> 并修正<br /><br />
</div>
</body>
</html>

1.2 BeautifulSoup

1.2.1 Installing BeautifulSoup4

Linux

sudo apt-get install python3-bs4

sudo apt-get install python3-pip
pip3 install beautifulsoup4

Windows

pip install beautifulsoup4
pip3 install beautifulsoup4


Download: https://www.crummy.com/software/BeautifulSoup/#Download
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id4

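1.2.2 Using BeautifulSoup

Parse HTML with BeautifulSoup(html, 'html.parser')
Find a single node: soup.find(id='imooc')
Find multiple nodes: soup.findAll('a') (find_all is the current name; findAll is kept as an older alias)
Match with a regular expression: soup.findAll('a', href=reObj)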

from bs4 import BeautifulSoup as bs


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

from bs4 import BeautifulSoup as bs
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
#print(soup.prettify())

# Get the title text
print(soup.title.string) # The Dormouse's story
# Get the first a tag
print(soup.a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Get the tag whose id is link2
print(soup.find(id="link2")) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Get the text content of the tag whose id is link2
print(soup.find(id="link2").string) # Lacie
print(soup.find(id="link2").get_text()) # Lacie
# Get all a tags
print(soup.findAll("a"))
'''
[
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
]
'''
# Print the text of every a tag
for link in soup.findAll("a"):
    print(link.string)
'''
Elsie
Lacie
Tillie
'''
print(soup.find('p', {"class":"story"}))
print(soup.find('p', {"class":"story"}).get_text())

# Search with regular expressions
# Find every tag whose name starts with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

data = soup.findAll("a", href=re.compile(r"^http://example\.com/"))
print(data)
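
For comparison, here is roughly what soup.findAll("a", href=re.compile(...)) does, sketched with only the standard library's html.parser instead of BeautifulSoup. The LinkCollector class is a hypothetical helper written for this illustration, and the third link in the sample HTML is made up to show a non-matching href. BeautifulSoup remains the more convenient tool.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for a tags whose href matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []
        self._href = None   # href of the matching a tag currently open
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self.pattern.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://other.example.org/x" id="link3">Other</a>
</p>
"""

collector = LinkCollector(r"^http://example\.com/")
collector.feed(html_doc)
print(collector.links)
```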

1.3 Fetching Baidu Baike entry information

# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


# Collapse whitespace and blank lines before parsing
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html

# Request the URL and decode the response as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in document order
for tag in soup.find_all():
    # Is this an em or img tag?
    if tag.name in ["em", "img"]:
        # If so, remove it
        tag.decompose()
    if tag.name in ['span']:
        # Remove span tags whose parent is not a div
        if tag.parent.name != "div":
            tag.decompose()
# Remove div tags whose class name starts with category_ or content_cnt
for div in soup.findAll("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()

# Find all a tags whose href starts with https://baike.baidu.com/item/
listUrls = soup.findAll("a", href=re.compile("^https://baike.baidu.com/item/"))
#print(listUrls)
# Print the name and URL of every entry
for url in listUrls:
    # Skip links whose text or href is empty
    if url.get_text().strip() != "" and url['href'].strip() != "":
        # Print the link text and the corresponding URL
        # string returns a single text node only; get_text() returns all text under the tag
        print(url.get_text().strip(), "<--->", url['href'].strip())
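
The script above matches only absolute hrefs and may print the same entry more than once. Assuming a page might also use relative links such as /item/... (an assumption, not something the original shows), the sketch below resolves each href against the site root, drops fragments, and removes duplicates. The helper normalize_links and the sample pairs are hypothetical.

```python
from urllib.parse import urljoin, urldefrag

BASE = "https://baike.baidu.com/"
PREFIX = "https://baike.baidu.com/item/"


def normalize_links(pairs, base=BASE, prefix=PREFIX):
    """Return unique (name, absolute_url) pairs that point at entry pages."""
    seen = set()
    result = []
    for name, href in pairs:
        name = name.strip()
        # Resolve relative hrefs against the base and drop any #fragment
        url, _fragment = urldefrag(urljoin(base, href.strip()))
        if not name or not url.startswith(prefix) or url in seen:
            continue
        seen.add(url)
        result.append((name, url))
    return result


# Made-up (text, href) pairs shaped like the values read from the a tags
sample = [
    (" Python ", "/item/Python"),
    ("Python", "https://baike.baidu.com/item/Python#history"),
    ("", "/item/Empty"),
    ("Home", "/"),
]
print(normalize_links(sample))
```

With BeautifulSoup, the input pairs could be built as (a.get_text(), a["href"]) for each a tag that has an href.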


1.4 Storing data in MySQL

1.4.1 Installing and uninstalling

Install pymysql with pip

pip install pymysql


Install from an unpacked source distribution

python setup.py install

Uninstall

pip uninstall pymysql

1.4.2 Using pymysql

# Import the package
import pymysql.cursors

# Open a database connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')

# Get a cursor
cursor = connection.cursor()

# Execute an SQL statement with parameters
cursor.execute(sql, (param1, paramN))

# Commit the transaction
connection.commit()

# Close the connection
connection.close()

Create a new database named baikeurl, then create the urls table with the following SQL:

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for urls
-- ----------------------------
DROP TABLE IF EXISTS `urls`;
CREATE TABLE `urls`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `urlname` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  `urlhref` varchar(1000) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;

1.4.3 Saving the scraped data to MySQL

Get a database connection:

connection = pymysql.connect(host='localhost',
                            user='root',
                            password='123456',
                            db='db',
                            charset='utf8mb4')

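Get a cursor with connection.cursor()
Execute SQL with cursor.execute(sql, (param1, paramN))
Commit with connection.commit()
Close the connection with connection.close()
cursor.execute() returns the number of rows a query produced
cursor.fetchone() fetches the next row
cursor.fetchmany(size=10) fetches the given number of rows
cursor.fetchall() fetches all rows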
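
The cursor workflow listed above can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module, which follows the same DB-API pattern. It is a stand-in, not pymysql: sqlite3 uses ? placeholders where pymysql uses %s, and the table definition and sample rows are simplified for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL baikeurl database
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table urls (id integer primary key autoincrement, "
    "urlname text not null, urlhref text not null)"
)

# Sample rows, made up for illustration
rows = [
    ("Python", "https://baike.baidu.com/item/Python"),
    ("MySQL", "https://baike.baidu.com/item/MySQL"),
]

try:
    cursor = connection.cursor()
    # Parameterized insert: values are passed separately from the SQL text
    cursor.executemany("insert into urls (urlname, urlhref) values (?, ?)", rows)
    connection.commit()
except sqlite3.Error:
    # Undo the partial insert if anything failed
    connection.rollback()
    raise

count = connection.execute("select count(*) from urls").fetchone()[0]
print(count)
```

Passing the values as parameters rather than formatting them into the SQL string lets the driver handle quoting, which matters for scraped text.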

# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql.cursors


# Collapse whitespace and blank lines before parsing
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html

# Request the URL and decode the response as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in document order
for tag in soup.find_all():
    # Is this an em or img tag?
    if tag.name in ["em", "img"]:
        # If so, remove it
        tag.decompose()
    if tag.name in ['span']:
        # Remove span tags whose parent is not a div
        if tag.parent.name != "div":
            tag.decompose()
# Remove div tags whose class name starts with category_ or content_cnt
for div in soup.findAll("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()

# Find all a tags whose href starts with https://baike.baidu.com/item/
listUrls = soup.findAll("a", href=re.compile("^https://baike.baidu.com/item/"))
#print(listUrls)

# Open one database connection for the whole run instead of one per link
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Parameterized insert statement
        sql = "insert into `urls` (`urlname`, `urlhref`) values (%s, %s)"
        for url in listUrls:
            # get_text() returns all the text under the tag
            name = url.get_text().strip()
            href = url['href'].strip()
            # Skip links whose text or href is empty
            if name != "" and href != "":
                # Print the link text and the corresponding URL
                print(name, "<--->", href)
                # Store the same stripped values that were printed
                cursor.execute(sql, (name, href))
    # Commit once after all rows have been inserted
    connection.commit()
finally:
    connection.close()


1.4.4 Reading (querying) data from MySQL

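# Run the query; execute() returns the number of rows
cursor.execute(sql)

# Fetch the next row
cursor.fetchone()

# Fetch up to size rows
cursor.fetchmany(size=None)

# Fetch all remaining rows
cursor.fetchall()

# Close the connection
connection.close()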
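
How the three fetch methods share one result set is easiest to see by running them in sequence. The sketch below again uses sqlite3 as a stand-in for pymysql, with made-up rows; the fetch calls behave the same way on a pymysql cursor. One difference to keep in mind: pymysql's execute() returns the row count, while sqlite3's execute() returns the cursor itself.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("create table urls (urlname text, urlhref text)")
# Five made-up rows: entry0 .. entry4
connection.executemany(
    "insert into urls values (?, ?)",
    [("entry%d" % i, "https://baike.baidu.com/item/entry%d" % i) for i in range(5)],
)

cursor = connection.cursor()
cursor.execute("select urlname from urls order by urlname")

first = cursor.fetchone()    # the next row, or None when exhausted
batch = cursor.fetchmany(2)  # up to 2 further rows
rest = cursor.fetchall()     # everything that is left

print(first)
print(batch)
print(rest)
connection.close()
```

Each call continues where the previous one stopped, so a row is never returned twice.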

# Import the package
import pymysql.cursors

# Open a database connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')

try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Query statement
        sql = "select `urlname`, `urlhref` from `urls` where `id` is not null"
        # execute() returns the number of rows found
        count = cursor.execute(sql)
        print(count)

        # Fetch all rows
        #result = cursor.fetchall()
        #print(result)

        # Fetch only the first three rows
        result2 = cursor.fetchmany(size=3)
        print(result2)
finally:
    connection.close()


2. Reading common document formats (TXT, PDF)

2.1 Reading TXT documents with Python

Read TXT documents: urlopen()
Read PDF documents: pdfminer3k

from urllib.request import urlopen

html = urlopen("https://www.csdn.net/robots.txt")
print(html.read().decode("utf-8"))

Output:

User-agent: * 
Disallow: /scripts 
Disallow: /public 
Disallow: /css/ 
Disallow: /images/ 
Disallow: /content/ 
Disallow: /ui/ 
Disallow: /js/ 
Disallow: /scripts/ 
Disallow: /article_preview.html* 
Disallow: /tag/
Disallow: /*?*
Disallow: /link/

Sitemap: https://www.csdn.net/sitemap-aggpage-index.xml
Sitemap: https://www.csdn.net/article/sitemap.txt
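
Once a robots.txt file has been read as text, the standard library's urllib.robotparser can answer whether a given path may be crawled. The sketch below feeds it a few of the rules from the output above. The two URLs being checked are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt output above
robots_txt = """\
User-agent: *
Disallow: /css/
Disallow: /images/
Disallow: /tag/
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed
parser.parse(robots_txt.splitlines())

# Hypothetical URLs, used only to exercise the rules
blocked = parser.can_fetch("*", "https://www.csdn.net/css/site.css")
allowed = parser.can_fetch("*", "https://www.csdn.net/nav/python")
print(blocked, allowed)
```

Note that urllib.robotparser matches rules by path prefix, so wildcard rules such as /*?* are not interpreted as patterns.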

2.2 Installing pdfminer3k

Download: https://pypi.org/project/pdfminer3k/

pip install pdfminer3k


python setup.py install

2.3 Reading PDF documents with Python

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

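'''
File modes for open():
r: open for reading (the default);
w: open for writing, truncating the file first;
a: open for appending (writes go to the end; the file is created if needed);
r+: open for reading and writing;
w+: open for reading and writing (see w);
a+: open for reading and writing (see a);
rb: open for reading in binary mode;
wb: open for writing in binary mode (see w);
ab: open for appending in binary mode (see a);
rb+: open for reading and writing in binary mode (see r+);
wb+: open for reading and writing in binary mode (see w+);
ab+: open for reading and writing in binary mode (see a+)
'''
# Open the PDF file
# in binary read mode
fp = open("test.pdf", "rb")

# Create a parser bound to the file
parser = PDFParser(fp)

# Create the PDF document object
doc = PDFDocument()

# Link the parser and the document object to each other
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document
doc.initialize("") # empty password

# Create the PDF resource manager
resource = PDFResourceManager()

# Layout analysis parameters
laparam = LAParams()

# Create an aggregator device
device = PDFPageAggregator(resource, laparams=laparam)

# Create the PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)

# Iterate over the pages of the document
for page in doc.get_pages():
    # Process the page with the interpreter
    interpreter.process_page(page)

    # Get the layout result from the aggregator
    layout = device.get_result()

    for out in layout:
        # Only layout objects that carry text have get_text()
        if hasattr(out, "get_text"):
            print(out.get_text())

# Close the file once all pages have been read
fp.close()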

Output for the sample file test.pdf:

古之学者必有师。师者，所以传道受业解惑也。人非生而知之者，孰能无惑？
惑而不从师，其为惑也，终不解矣。生乎吾前，其闻道也固先乎吾，吾从而师之；
生乎吾后，其闻道也亦先乎吾，吾从而师之。吾师道也，夫庸知其年之先后生于吾
乎？是故无贵无贱，无长无少，道之所存，师之所存也。

月份
一月份
二月份
三月份

预期销售额

700
500
800

实际销售额

650
600
600

开始

主页

指南查询

输入检索关
键字

否

是

是否检索到相
关记录

是

结果分页显
示

是否继续

否

结束
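
As the output shows, text from table cells and flowchart boxes comes back in small fragments, some of them wrapped across lines. A minimal post-processing sketch, assuming each get_text() call returns a string like the fragments above, could rejoin the wrapped lines and drop empty blocks. The helper clean_blocks is hypothetical. Joining with an empty string suits Chinese text; for space-separated languages a space would be the better separator.

```python
def clean_blocks(blocks, joiner=""):
    """Strip each extracted text block, rejoin wrapped lines, drop empty blocks."""
    cleaned = []
    for block in blocks:
        # A block may contain line breaks introduced by the PDF layout
        parts = [line.strip() for line in block.splitlines()]
        text = joiner.join(part for part in parts if part)
        if text:
            cleaned.append(text)
    return cleaned


# Sample strings shaped like the fragments in the output above
sample = ["输入检索关\n键字\n", "否\n", "  \n", "是否检索到相\n关记录\n"]
print(clean_blocks(sample))
```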
