5.1. JSON and CSV

sty3318

Article link: https://blog.csdn.net/sty3318/article/details/136261094

5.1.1. JSON

5.1.1.1. First, fetch the HTML

import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
from bs4 import BeautifulSoup

url = 'https://www.baidu.com'

# Suppress the InsecureRequestWarning that verify=False would otherwise trigger
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

# Send a GET request; verify=False skips TLS certificate verification (demo use only)
response = requests.get(url, verify=False)

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Pretty-print the parsed document with prettify()
print(soup.prettify())

5.1.1.2. Analyze the structure of the HTML

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div class="s_form">
      <div class="s_form_wrapper">
       <div id="lg">
        <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
       </div>
       <form action="//www.baidu.com/s" class="fm" id="form" name="f">
        <input name="bdorz_come" type="hidden" value="1"/>
        <input name="ie" type="hidden" value="utf-8"/>
        <input name="f" type="hidden" value="8"/>
        <input name="rsv_bp" type="hidden" value="1"/>
        <input name="rsv_idx" type="hidden" value="1"/>
        <input name="tn" type="hidden" value="baidu"/>
        <span class="bg s_ipt_wr">
         <input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
        </span>
        <span class="bg s_btn_wr">
         <input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/>
        </span>
       </form>
      </div>
     </div>
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新闻
      </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">
       hao123
      </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地图
      </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       视频
      </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       贴吧
      </a>
      <noscript>
       <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">
        登录
       </a>
      </noscript>
      <script>
       document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
      </script>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多产品
      </a>
     </div>
    </div>
   </div>
   <div id="ftCon">
    <div id="ftConw">
     <p id="lh">
      <a href="http://home.baidu.com">
       关于百度
      </a>
      <a href="http://ir.baidu.com">
       About Baidu
      </a>
     </p>
     <p id="cp">
      ©2017 Baidu
      <a href="http://www.baidu.com/duty/">
       使用百度前必读
      </a>
      <a class="cp-feedback" href="http://jianyi.baidu.com/">
       意见反馈
      </a>
      京ICP证030173号
      <img src="//www.baidu.com/img/gs.gif"/>
     </p>
    </div>
   </div>
  </div>
 </body>
</html>

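From this analysis, the data worth extracting is the set of category links in the navigation bar, such as "新闻" (News), together with the URL each one points to.

Code: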

from bs4 import BeautifulSoup

# Assume the HTML content is stored in the variable html
html = """
<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div class="s_form">
      <div class="s_form_wrapper">
       <div id="lg">
        <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
       </div>
       <form action="//www.baidu.com/s" class="fm" id="form" name="f">
        <input name="bdorz_come" type="hidden" value="1"/>
        <input name="ie" type="hidden" value="utf-8"/>
        <input name="f" type="hidden" value="8"/>
        <input name="rsv_bp" type="hidden" value="1"/>
        <input name="rsv_idx" type="hidden" value="1"/>
        <input name="tn" type="hidden" value="baidu"/>
        <span class="bg s_ipt_wr">
         <input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
        </span>
        <span class="bg s_btn_wr">
         <input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/>
        </span>
       </form>
      </div>
     </div>
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新闻
      </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">
       hao123
      </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地图
      </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       视频
      </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       贴吧
      </a>
      <noscript>
       <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">
        登录
       </a>
      </noscript>
      <script>
       document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
      </script>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多产品
      </a>
     </div>
    </div>
   </div>
   <div id="ftCon">
    <div id="ftConw">
     <p id="lh">
      <a href="http://home.baidu.com">
       关于百度
      </a>
      <a href="http://ir.baidu.com">
       About Baidu
      </a>
     </p>
     <p id="cp">
      ©2017 Baidu
      <a href="http://www.baidu.com/duty/">
       使用百度前必读
      </a>
      <a class="cp-feedback" href="http://jianyi.baidu.com/">
       意见反馈
      </a>
      京ICP证030173号
      <img src="//www.baidu.com/img/gs.gif"/>
     </p>
    </div>
   </div>
  </div>
 </body>
</html>
"""

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Find every <a> tag whose class is "mnav"
a_tags = soup.find_all('a', {'class': 'mnav'})

# Loop over the <a> tags and print each link text and its href
for a in a_tags:
    print(a.text.strip())
    print(a['href'])

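5.1.1.3. Working with JSON

In Python, the built-in json module handles both encoding (serializing) and decoding (deserializing) JSON data. Both operations are explained below.

5.1.1.3.1. Encoding (serialization)

JSON encoding converts a Python object into JSON text. The json.dumps() function encodes a Python object (such as a dict or a list) as a JSON string.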

import json

# A Python object
data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

# Encode the Python object as a JSON string
json_str = json.dumps(data)
print(json_str)

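Besides dumps(), which encodes a Python object as a JSON string, the json module also provides dump(), which encodes a Python object as JSON and writes it straight to a file. The difference between the two is explained below.

5.1.1.3.1.1. dumps

dumps() encodes a Python object as a JSON string. It takes the Python object as its first argument and returns a str containing the JSON representation of that object.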

import json

# A Python object
data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

# Encode the Python object as a JSON string, indented by 4 spaces per level
json_str = json.dumps(data, indent=4)
print(json_str)

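This example is almost identical to the previous one; the only difference is indent=4, which pretty-prints the result across several lines.

When encoding with json.dumps(), you can pass additional parameters to control the output. The most commonly used ones are described below.

skipkeys

When skipkeys is True, dictionary keys that are not of a basic type (str, int, float, bool, or None) are skipped instead of raising TypeError. Keys of a basic non-string type such as int are not skipped; they are converted to strings.
The default is False.

ensure_ascii

When ensure_ascii is True, every non-ASCII character in the output is escaped as a \uXXXX sequence.
When ensure_ascii is False, non-ASCII characters are written as-is.
The default is True.

indent

indent turns on pretty-printing so that the generated JSON is easier to read.
It can be a non-negative integer (the number of spaces per nesting level) or a string (used as the indentation for each level, for example '\t').
The default is None, which puts everything on a single line with no indentation.

separators

separators is an (item_separator, key_separator) tuple: the first element separates items, the second separates a key from its value.
The default is (', ', ': ') when indent is None and (',', ': ') otherwise. Pass (',', ':') to get the most compact output, with no spaces at all.

encoding

encoding was a parameter of json.dumps() in Python 2, where it defaulted to 'utf-8'.
It does not exist in Python 3: json.dumps() always returns a str, and passing encoding raises TypeError. To control the byte encoding, choose it when opening the output file.

sort_keys

When sort_keys is True, the keys of every dictionary are written in sorted order.
The default is False.

Adjust these parameters as needed to get the format and behavior you want from the encoded JSON string.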
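The effect of these parameters is easiest to see side by side. The following is a minimal sketch, not part of the original example; the sample dictionaries are made up for illustration.

```python
import json

# Illustrative data only; the names and values are made up for this sketch.
person = {"name": "张三", "age": 30}

# ensure_ascii=True (the default) escapes non-ASCII characters.
escaped = json.dumps(person)
# ensure_ascii=False keeps them readable.
readable = json.dumps(person, ensure_ascii=False)

# The int key is converted to a string; the tuple key is dropped by skipkeys=True.
mixed_keys = {1: "int key", (1, 2): "tuple key"}
skipped = json.dumps(mixed_keys, skipkeys=True)

# sort_keys plus compact separators gives a stable, minimal encoding.
compact = json.dumps({"b": 1, "a": 2}, sort_keys=True, separators=(",", ":"))

print(escaped)   # {"name": "\u5f20\u4e09", "age": 30}
print(readable)  # {"name": "张三", "age": 30}
print(skipped)   # {"1": "int key"}
print(compact)   # {"a":2,"b":1}
```

Without skipkeys=True, the tuple key in mixed_keys would make json.dumps() raise TypeError.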

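5.1.1.3.1.2. dump

dump() works like dumps(), but instead of returning a string it writes the JSON directly to a file. Its first two arguments are the Python object and a writable file object (anything with a write() method). It does not accept a file name, so you have to open the file yourself first.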

import json

# A Python object
data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

# Encode the Python object as JSON and write it to a file
with open(r'xxx\xxx\data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file)
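To check that the file really contains what was written, it can be read back with json.load(), the file-based counterpart of json.loads(). The sketch below assumes a temporary directory as the output location, since the path in the example above is only a placeholder.

```python
import json
import os
import tempfile

data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

# Assumption: write into a throwaway temporary directory instead of a real path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.json')

    # json.dump() needs an open, writable file object, not a file name.
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=4)

    # json.load() reads the JSON back from an open file object.
    with open(path, 'r', encoding='utf-8') as file:
        restored = json.load(file)

print(restored == data)  # True
```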

5.1.1.3.2. Decoding JSON (deserialization)

JSON decoding converts JSON text back into a Python object. The json.loads() function decodes a JSON string into a Python object (such as a dict or a list).

import json

# A JSON string
json_str = '{"name": "Alice", "age": 30, "city": "New York"}'

# Decode the JSON string into a Python object
data = json.loads(json_str)
print(data)

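※※※ Important notes

When encoding, some Python types are not supported out of the box. Objects such as datetime and Decimal raise TypeError unless you convert them yourself, for example through the default parameter of json.dumps().
When decoding, the JSON string must be valid; otherwise json.loads() raises json.JSONDecodeError (a subclass of ValueError).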
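Both notes can be illustrated in a few lines. This is only a sketch under stated assumptions: to_jsonable and safe_loads are hypothetical helper names, and turning a Decimal into a string is one possible convention, not the only one.

```python
import json
from datetime import datetime
from decimal import Decimal


def to_jsonable(obj):
    """Hypothetical fallback passed to json.dumps(default=...)."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)  # assumption: keep full precision by using a string
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


record = {"created": datetime(2024, 2, 23, 10, 30), "price": Decimal("19.99")}
encoded = json.dumps(record, default=to_jsonable)
print(encoded)  # {"created": "2024-02-23T10:30:00", "price": "19.99"}


def safe_loads(text):
    """Hypothetical wrapper: return None instead of raising on invalid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        print(f"Invalid JSON: {exc}")
        return None


valid = safe_loads('{"name": "Alice"}')
invalid = safe_loads("{'name': 'Alice'}")  # single quotes are not valid JSON
```

json.dumps() calls the default function only for objects it cannot encode by itself, so ordinary strings, numbers, lists, and dicts are unaffected.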

Example: modify the earlier example so that it writes its output to a JSON file.

import json
from bs4 import BeautifulSoup

# Assume the HTML content is stored in the variable html
html = """
<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div class="s_form">
      <div class="s_form_wrapper">
       <div id="lg">
        <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/>
       </div>
       <form action="//www.baidu.com/s" class="fm" id="form" name="f">
        <input name="bdorz_come" type="hidden" value="1"/>
        <input name="ie" type="hidden" value="utf-8"/>
        <input name="f" type="hidden" value="8"/>
        <input name="rsv_bp" type="hidden" value="1"/>
        <input name="rsv_idx" type="hidden" value="1"/>
        <input name="tn" type="hidden" value="baidu"/>
        <span class="bg s_ipt_wr">
         <input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/>
        </span>
        <span class="bg s_btn_wr">
         <input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/>
        </span>
       </form>
      </div>
     </div>
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新闻
      </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">
       hao123
      </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地图
      </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       视频
      </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       贴吧
      </a>
      <noscript>
       <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">
        登录
       </a>
      </noscript>
      <script>
       document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
      </script>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多产品
      </a>
     </div>
    </div>
   </div>
   <div id="ftCon">
    <div id="ftConw">
     <p id="lh">
      <a href="http://home.baidu.com">
       关于百度
      </a>
      <a href="http://ir.baidu.com">
       About Baidu
      </a>
     </p>
     <p id="cp">
      ©2017 Baidu
      <a href="http://www.baidu.com/duty/">
       使用百度前必读
      </a>
      <a class="cp-feedback" href="http://jianyi.baidu.com/">
       意见反馈
      </a>
      京ICP证030173号
      <img src="//www.baidu.com/img/gs.gif"/>
     </p>
    </div>
   </div>
  </div>
 </body>
</html>
"""

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Find every <a> tag whose class is "mnav"
a_tags = soup.find_all('a', {'class': 'mnav'})

contentList = []
# Loop over the <a> tags and collect each link text with its href
for a in a_tags:
    contentList.append({a.text.strip():a['href']})

# Write UTF-8 explicitly so the unescaped Chinese text does not depend on the platform default
with open(r'D:\sty\13-workspace\01-py\test\bd.json', 'w', encoding='utf-8') as f:
    json.dump(contentList, f, indent = 4, ensure_ascii = False)

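5.1.2. CSV

5.1.2.1. Writing a list to CSV

CSV (Comma-Separated Values) is a common file format for tabular data. It stores the data as plain text, with values separated by commas. The basic characteristics of CSV files and the usual ways of handling them are described below.

Characteristics:

Plain text: a CSV file is stored as plain text and can be viewed and edited in any text editor.
Comma-separated: the values of the fields are separated by a delimiter. The delimiter can be any character (a tab or a semicolon, for example), but it is usually a comma.
Line-separated: each line is one data record, and records are separated by line breaks.

Handling:

Reading a CSV file: use Python's built-in csv module or a third-party library such as pandas. The data is usually converted into lists, dicts, or a similar structure for processing.
Writing a CSV file: again use the csv module or a third-party library. Convert the data into a suitable shape (lists or dicts) and specify options such as the delimiter and the line terminator.
Processing the data: once the data is in memory, you can filter, sort, and aggregate it. More advanced processing and analysis usually calls for another library such as pandas.

Example: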

import csv

# Read a CSV file (the csv module recommends opening files with newline='')
with open(r'D:\sty\13-workspace\01-py\test\data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

# Write a CSV file
data = [
    ['Name', 'Age', 'City'],
    ['Alice', '25', 'New York'],
    ['Bob', '30', 'London'],
    ['Charlie', '35', 'Paris']
]

with open(r'D:\sty\13-workspace\01-py\test\output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

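In the code above, data.csv is the file being read and output.csv is the file being written. csv.reader wraps an open file and returns an iterator that yields one row per iteration, as a list of strings, so the data can be processed row by row in a loop. csv.writer wraps a writable file object; its writerows() method takes an iterable of rows (such as a list of lists) and writes each row as one line.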
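The delimiter and quoting behavior mentioned earlier can be tried without touching the disk. The sketch below uses io.StringIO as an in-memory stand-in for a file; the sample rows are made up for illustration.

```python
import csv
import io

# Illustrative rows; one value contains a comma on purpose.
rows = [
    ['Name', 'City'],
    ['Alice', 'New York, NY'],
]

# Default delimiter: the value containing a comma is quoted automatically.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
comma_text = buffer.getvalue()

# A different delimiter: no quoting is needed for the embedded comma.
buffer = io.StringIO(newline='')
csv.writer(buffer, delimiter=';').writerows(rows)
semicolon_text = buffer.getvalue()

# Reading back requires the same delimiter that was used for writing.
parsed = list(csv.reader(io.StringIO(semicolon_text, newline=''), delimiter=';'))

print(repr(comma_text))      # 'Name,City\r\nAlice,"New York, NY"\r\n'
print(repr(semicolon_text))  # 'Name;City\r\nAlice;New York, NY\r\n'
print(parsed == rows)        # True
```

The '\r\n' line endings come from the writer's default line terminator, which is why files should be opened with newline='' to avoid doubled line breaks on Windows.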

5.1.2.2. Writing a dict to CSV

csv.DictWriter is the standard-library class for writing dictionaries to a CSV file. It works like csv.writer, but it is more convenient when each record is a set of key-value pairs. The basics are shown below.

Example:

import csv

fieldnames = ['Name', 'Age', 'City']

with open(r'xxx\xxx\output.csv', 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames)

    # Write the header row
    writer.writeheader()

    # Write the data rows
    writer.writerow({'Name': 'Alice', 'Age': 25, 'City': 'New York'})
    writer.writerow({'Name': 'Bob', 'Age': 30, 'City': 'London'})

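Use writeheader() to write the header row.
Use writerow() to write one dictionary per row; the keys correspond to the field names and the values are the data to write.

If a dictionary is missing some of the fields, DictWriter fills them with the value of restval, which defaults to the empty string, so every row still has a full set of columns. (In the CSV file, the field simply appears empty.)

Notes:

Every key in the dictionary must appear in fieldnames. With the default extrasaction='raise', an unknown key raises ValueError; pass extrasaction='ignore' to drop unknown keys silently.
When creating a DictWriter, you must pass both the file object and the list of field names.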
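The handling of missing and unexpected keys can be seen in a short experiment. This is a sketch using an in-memory buffer; the 'N/A' fill value and the 'Country' key are made up for illustration.

```python
import csv
import io

fieldnames = ['Name', 'Age', 'City']
buffer = io.StringIO(newline='')

# restval is the value written for any field missing from a row.
writer = csv.DictWriter(buffer, fieldnames, restval='N/A')
writer.writeheader()
writer.writerow({'Name': 'Alice', 'Age': 25})  # 'City' is missing

# A key that is not in fieldnames raises ValueError by default.
error_message = None
try:
    writer.writerow({'Name': 'Bob', 'Country': 'UK'})
except ValueError as exc:
    error_message = str(exc)

# extrasaction='ignore' drops unknown keys instead of raising.
lenient = csv.DictWriter(buffer, fieldnames, restval='N/A', extrasaction='ignore')
lenient.writerow({'Name': 'Bob', 'Country': 'UK'})

output = buffer.getvalue()
print(output)
print(error_message)
```

The failed writerow() call writes nothing, so the buffer ends up with the header, the Alice row, and the lenient Bob row.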

5.1.2.3. namedtuple

from collections import namedtuple
import csv

with open(r'xxx\xxx\output.csv', 'r', newline='') as f:
    content = csv.reader(f)
    # Read the header row
    header = next(content)
    print('heading=',header)
    # Create a namedtuple class whose fields are the column names
    Row = namedtuple('Row', header)
    for row in content:
        r = Row(*row)
        print(r.Name, r.Age)
        print(r)

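The key line here is r = Row(*row).

Step by step:

row is a list holding the data of one CSV line. Each element corresponds to one field value in that line.
*row uses the star operator (*) to unpack the list into separate elements, so that each element is passed to the function or constructor as its own positional argument.
Row(*row) calls the Row constructor with the unpacked arguments and creates a new Row instance. The arguments are matched to the fields in the order in which the fields were defined.

For example, if Row was defined with the fields ('Name', 'Age', 'Gender') and row is ['John', '25', 'Male'], the unpacked arguments 'John', '25', 'Male' are matched in that order, producing a Row instance that holds those three values.
Finally, the new Row instance is assigned to the variable r so that its fields can be accessed later.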
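The unpacking step can be tried on in-memory data. The sketch below is an illustration with made-up CSV text; it also shows Row._make(row), the namedtuple class method that builds an instance from an iterable and is equivalent to Row(*row).

```python
import csv
import io
from collections import namedtuple

# Made-up CSV content standing in for output.csv.
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,London\n"

reader = csv.reader(io.StringIO(csv_text, newline=''))
header = next(reader)            # ['Name', 'Age', 'City']
Row = namedtuple('Row', header)  # column names must be valid Python identifiers

people = []
for row in reader:
    by_unpacking = Row(*row)     # each list element becomes one positional argument
    by_make = Row._make(row)     # same result, built from the iterable directly
    assert by_unpacking == by_make
    people.append(by_unpacking)

print(people[0].Name, people[0].Age)  # Alice 25
print(people[1]._asdict())            # {'Name': 'Bob', 'Age': '30', 'City': 'London'}
```

Every value is still a string, because csv.reader does no type conversion; convert with int(r.Age) where a number is needed.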

You can also read the CSV rows into dicts:

import csv

with open(r'xxx\xxx\output.csv', 'r', newline='') as f:
    results = csv.DictReader(f)
    for row in results:
        print(row['City'])
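csv.DictReader takes the first row as the field names and yields one dict per remaining row. A minimal in-memory sketch, with made-up CSV text, shows the shape of the result and the manual type conversion that numeric columns need:

```python
import csv
import io

# Made-up CSV content standing in for output.csv.
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,London\n"

records = list(csv.DictReader(io.StringIO(csv_text, newline='')))

# Values arrive as strings, so numeric columns must be converted explicitly.
ages = [int(record['Age']) for record in records]
cities = [record['City'] for record in records]

print(records[0])  # {'Name': 'Alice', 'Age': '25', 'City': 'New York'}
print(ages)        # [25, 30]
print(cities)      # ['New York', 'London']
```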
