HTML
1 Reading HTML content
The top-level read_html() function accepts an HTML string, a file, or a URL and parses HTML tables into a list of pandas DataFrames.
Note: read_html returns a list of DataFrame objects even when the HTML content contains only a single table.
Let's look at a few examples.
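Before the larger examples below, a minimal sketch of the "always a list" behavior, using a tiny hypothetical single-table document (this assumes a parser backend such as lxml, or BeautifulSoup4 plus html5lib, is installed):

```python
import pandas as pd
from io import StringIO

# A hypothetical document containing exactly one table
html = """
<table>
  <tr><th>a</th><th>b</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""

# read_html returns a list of DataFrames even for a single table,
# so the result must be indexed to get at the DataFrame itself
dfs = pd.read_html(StringIO(html))
assert isinstance(dfs, list) and len(dfs) == 1
```

Wrapping the string in StringIO sidesteps the deprecation of passing raw HTML strings directly in recent pandas versions.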
In [295]: url = (
.....: "https://raw.githubusercontent.com/pandas-dev/pandas/master/"
.....: "pandas/tests/io/data/html/spam.html"
.....: )
.....:
In [296]: dfs = pd.read_html(url)
In [297]: dfs
Out[297]:
[ Nutrient Unit Value per 100.0g oz 1 NLEA serving 56g Unnamed: 4 Unnamed: 5
0 Proximates Proximates Proximates Proximates Proximates Proximates
1 Water g 51.70 28.95 NaN NaN
2 Energy kcal 315 176 NaN NaN
3 Protein g 13.40 7.50 NaN NaN
4 Total lipid (fat) g 26.60 14.90 NaN NaN
.. ... ... ... ... ... ...
32 Fatty acids, total monounsaturated g 13.505 7.563 NaN NaN
33 Fatty acids, total polyunsaturated g 2.019 1.131 NaN NaN
34 Cholesterol mg 71 40 NaN NaN
35 Other Other Other Other Other Other
36 Caffeine mg 0 0 NaN NaN
[37 rows x 6 columns]]
Read in the content of the banklist.html file and pass it to read_html as a string:
In [298]: with open(file_path, "r") as f:
.....: dfs = pd.read_html(f.read())
.....:
In [299]: dfs
Out[299]:
[ Bank Name City ... Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha ... May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale ... May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta ... May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville ... May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville ... April 26, 2013 May 16, 2013
.. ... ... ... ... ...
500 Superior Bank, FSB Hinsdale ... July 27, 2001 June 5, 2012
501 Malta National Bank Malta ... May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester ... February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis ... December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu ... October 13, 2000 March 17, 2005
[505 rows x 7 columns]]
If you prefer, you can even pass in an instance of StringIO:
In [300]: with open(file_path, "r") as f:
.....: sio = StringIO(f.read())
.....:
In [301]: dfs = pd.read_html(sio)
In [302]: dfs
Out[302]:
[ Bank Name City ... Closing Date Updated Date
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha ... May 31, 2013 May 31, 2013
1 Central Arizona Bank Scottsdale ... May 14, 2013 May 20, 2013
2 Sunrise Bank Valdosta ... May 10, 2013 May 21, 2013
3 Pisgah Community Bank Asheville ... May 10, 2013 May 14, 2013
4 Douglas County Bank Douglasville ... April 26, 2013 May 16, 2013
.. ... ... ... ... ...
500 Superior Bank, FSB Hinsdale ... July 27, 2001 June 5, 2012
501 Malta National Bank Malta ... May 3, 2001 November 18, 2002
502 First Alliance Bank & Trust Co. Manchester ... February 2, 2001 February 18, 2003
503 National State Bank of Metropolis Metropolis ... December 14, 2000 March 17, 2005
504 Bank of Honolulu Honolulu ... October 13, 2000 March 17, 2005
[505 rows x 7 columns]]
Read a URL and match a table that contains specific text:
match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)
Specify a header row (by default, <th> or <td> elements located within a <thead> are used to form the column index; if multiple rows are contained within the <thead>, a MultiIndex is created):
dfs = pd.read_html(url, header=0)
Specify an index column:
dfs = pd.read_html(url, index_col=0)
Specify a number of rows to skip:
dfs = pd.read_html(url, skiprows=0)
Specify the rows to skip using a list (range works as well):
dfs = pd.read_html(url, skiprows=range(2))
Specify an HTML attribute:
dfs1 = pd.read_html(url, attrs={"id": "table"})
dfs2 = pd.read_html(url, attrs={"class": "sortable"})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True
Specify values that should be converted to NaN:
dfs = pd.read_html(url, na_values=["No Acquirer"])
Specify whether to keep the default set of NaN values:
dfs = pd.read_html(url, keep_default_na=False)
Converters can be specified for columns. This is useful for numerical text data that has leading zeros.
By default, numerical columns are cast to numeric types and the leading zeros are lost. To avoid this, we can convert those columns to strings:
url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code"
dfs = pd.read_html(
url_mcc,
match="Telekom Albania",
header=0,
converters={"MNC": str},
)
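The leading-zero effect can be reproduced offline with a small hypothetical table, without fetching the Wikipedia page (the column name "MNC" here mirrors the example above; a parser backend such as lxml or bs4 + html5lib is assumed to be installed):

```python
import pandas as pd
from io import StringIO

# Hypothetical table whose codes carry a leading zero
html = "<table><tr><th>MNC</th></tr><tr><td>01</td></tr></table>"

# Without a converter the column is parsed as numeric: "01" becomes 1
as_num = pd.read_html(StringIO(html))[0]
# With converters={"MNC": str} the text is kept verbatim
as_str = pd.read_html(StringIO(html), converters={"MNC": str})[0]

assert as_num["MNC"].iloc[0] == 1      # leading zero lost
assert as_str["MNC"].iloc[0] == "01"   # leading zero preserved
```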
Combine the options above:
dfs = pd.read_html(url, match="Metcalf Bank", index_col=0)
Read in the output of to_html (with some loss of floating-point precision):
df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(s, index_col=0)
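The precision loss itself can be seen without any HTML parser at all: by default to_html rounds floats to six decimal places (pandas' display precision), so digits beyond that survive only if float_format is supplied. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [0.123456789]})

# Default rendering rounds to six decimal places
assert "0.123457" in df.to_html()

# An explicit float_format keeps the extra digits
assert "0.123456789" in df.to_html(float_format="{0:.10g}".format)
```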
If lxml is the only parser you provide, it will raise an error on a failed parse; the best practice is therefore to specify a list of flavors:
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml"])
# or
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor="lxml")
However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'], the parse will most likely succeed:
dfs = pd.read_html(url, "Metcalf Bank", index_col=0, flavor=["lxml", "bs4"])
2 Writing to HTML files
DataFrame objects have an instance method, to_html, which renders the contents of the DataFrame as an HTML table.
The function arguments are the same as in the method to_string described above.
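The examples below all print the markup; to actually write an HTML file, a path can be passed as the first argument (the buf parameter) instead of capturing the returned string. A minimal sketch, writing to a temporary directory:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# With a path as buf, to_html writes the markup to that file
# instead of returning it as a string
path = os.path.join(tempfile.mkdtemp(), "table.html")
df.to_html(path)

with open(path) as f:
    html = f.read()
assert "<table" in html and "</table>" in html
```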
In [303]: df = pd.DataFrame(np.random.randn(2, 2))
In [304]: df
Out[304]:
0 1
0 -0.184744 0.496971
1 -0.856240 1.857977
In [305]: print(df.to_html()) # raw html
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>

The columns argument will limit the columns shown:
In [306]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
</tr>
</tbody>
</table>

The float_format argument controls the precision of floating-point values:
In [307]: print(df.to_html(float_format="{0:.10f}".format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.1847438576</td>
<td>0.4969711327</td>
</tr>
<tr>
<th>1</th>
<td>-0.8562396763</td>
<td>1.8579766508</td>
</tr>
</tbody>
</table>

bold_rows will make the row labels bold by default, but you can turn that off:
In [308]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<td>1</td>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>

The classes argument provides the ability to give the resulting HTML table CSS classes.
Note that these classes are appended to the existing "dataframe" class:
In [309]: print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.184744</td>
<td>0.496971</td>
</tr>
<tr>
<th>1</th>
<td>-0.856240</td>
<td>1.857977</td>
</tr>
</tbody>
</table>
The render_links argument provides the ability to add hyperlinks to cells that contain URLs:
In [310]: url_df = pd.DataFrame(
.....: {
.....: "name": ["Python", "pandas"],
.....: "url": ["https://www.python.org/", "https://pandas.pydata.org"],
.....: }
.....: )
.....:
In [311]: print(url_df.to_html(render_links=True))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>url</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Python</td>
<td><a href="https://www.python.org/" target="_blank">https://www.python.org/</a></td>
</tr>
<tr>
<th>1</th>
<td>pandas</td>
<td><a href="https://pandas.pydata.org" target="_blank">https://pandas.pydata.org</a></td>
</tr>
</tbody>
</table>

Finally, the escape argument allows you to control whether the "<", ">" and "&" characters are escaped in the resulting HTML (it is True by default).
So to get HTML without escaped characters, pass escape=False:
In [312]: df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})
Escaped:
In [313]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-0.400654</td>
</tr>
</tbody>
</table>

Not escaped:
In [314]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-0.400654</td>
</tr>
</tbody>
</table>

Some browsers may not show a difference in the rendering of the previous two HTML tables.
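The difference is easier to verify in the raw markup than in a browser. A small sketch comparing the two outputs directly:

```python
import pandas as pd

df = pd.DataFrame({"a": ["&", "<", ">"]})

escaped = df.to_html()               # escape=True is the default
unescaped = df.to_html(escape=False)

# Escaped output uses HTML entities for the special characters
assert "&amp;" in escaped and "&lt;" in escaped and "&gt;" in escaped

# Unescaped output leaves the characters as-is in the cells
assert "<td>&</td>" in unescaped
```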
3 HTML table parsing gotchas
There are some issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function read_html.

Issues with lxml
- Benefits
  - lxml is very fast.
  - lxml requires Cython to install correctly.
- Drawbacks
  - lxml makes no guarantees about the results of its parse unless it is given strictly valid markup. An lxml-only backend can be chosen, but if lxml fails to parse, the fallback is html5lib. It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result even if lxml fails.

Issues with using lxml as a backend for BeautifulSoup4
- The issues above hold here as well, because BeautifulSoup4 is essentially just a wrapper around a parser backend.

Issues with using html5lib as a backend for BeautifulSoup4
- Benefits
  - html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way, rather than, e.g., dropping an element without notifying you.
  - html5lib automatically generates valid HTML5 markup from invalid markup. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does not mean the result is necessarily "correct", since the process of fixing markup has no single definition.
  - html5lib is pure Python and requires no additional build steps.
- Drawbacks
  - The biggest drawback to using html5lib is that it is very slow.