Getting table data from a web page with Python


恩喜玛生物


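Article tags: python

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/2401_84540063/article/details/138813632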


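Requirement

I need the gene symbols (the Gene Symbol column) listed on a web page, 371 in total.

Reading the web page table with pandas

read_html returns a list (a list of DataFrame objects), one per table found on the page.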
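To make the return type concrete, here is a minimal sketch that runs read_html on an inline HTML string instead of a live URL. The HTML, the gene symbols GENE_A and GENE_B, and the variable names are made up for illustration. It assumes pandas and one of its HTML parser backends (lxml, or bs4 with html5lib) are installed.

```python
import io

import pandas as pd

# Hypothetical page containing two tables (made-up data for illustration).
html = """
<table>
  <thead><tr><th>Gene Symbol</th><th>Species</th></tr></thead>
  <tbody>
    <tr><td>GENE_A</td><td>Homo sapiens</td></tr>
    <tr><td>GENE_B</td><td>Homo sapiens</td></tr>
  </tbody>
</table>
<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>rows</td><td>2</td></tr>
</table>
"""

# Wrap the string in StringIO: recent pandas versions deprecate passing literal HTML.
tables = pd.read_html(io.StringIO(html))

# read_html returns a list with one DataFrame per <table> element.
print(type(tables), len(tables))

# Pick the table you need by position, as the [0] in the article's code does.
first = tables[0]
print(first.columns.tolist())
```

Because the result is always a list, even a page with a single table needs an index such as [0] before DataFrame methods can be used.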

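import pandas as pd
import bioquest as bq  # third-party package; bq.tl.select is used below to pick columns

url = "http://exocarta.org/browse_results?org_name=&cont_type=&tissue=Bladder%20cancer%20cells&gene_symbol="

# read_html returns a list of DataFrames; [0] takes the first table on the page
df = pd.read_html(url, encoding="utf-8", header=0, index_col=0)[0]

# keep only the three columns of interest and save them without the index
bq.tl.select(df, columns=["Gene Name", "Gene Symbol", "Species"]).to_csv("gene.csv", index=False)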
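If bioquest is not available, the column selection can probably be done with plain pandas indexing. The sketch below assumes that bq.tl.select(df, columns=[...]) simply keeps the named columns; that is an assumption about the helper, not something verified here. The DataFrame is a made-up stand-in for the table returned by read_html.

```python
import io

import pandas as pd

# Hypothetical stand-in for the table returned by read_html (made-up rows).
df = pd.DataFrame(
    {
        "Gene Name": ["example gene one", "example gene two"],
        "Gene Symbol": ["GENE_A", "GENE_B"],
        "Species": ["Homo sapiens", "Homo sapiens"],
        "Extra": [1, 2],
    }
)

# Select columns with plain pandas indexing.
# Assumption: this matches what bq.tl.select(df, columns=[...]) does.
subset = df[["Gene Name", "Gene Symbol", "Species"]]

# Write to an in-memory buffer here; pass a path such as "gene.csv" to write a file.
buffer = io.StringIO()
subset.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False keeps the row index out of the CSV, so the file contains only the three selected columns.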

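I have never studied web scraping, so I was curious how read_html manages this and how it parses the page. The pandas documentation describes it as follows: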

This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for “table data”. This function attempts to properly handle colspan and rowspan attributes. If the function has a <thead> argument, it is used to construct the header, otherwise the function attempts to find the header within the body (by putting rows with only <th> elements into the header).

The HTML for a table in a web page looks roughly like this:

tr: defines a table row

th: defines a header cell

td: defines a data cell

<table class="..." id="...">
  <thead>
    <tr>
      <th>...</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>...</td>
    </tr>
    <tr>...</tr>
    <tr>...</tr>
    ...
    <tr>...</tr>
    <tr>...</tr>
  </tbody>
</table>

So read_html relies on a parsing library such as lxml to locate the tables by their HTML tags, and then converts each one into a DataFrame.
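To give a feel for the idea, here is a rough sketch that walks the tr, th and td tags with the standard library's html.parser. It is not how pandas is implemented and it ignores colspan, rowspan and nested tables. The class name TableRowParser and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> in the document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> or <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample table for illustration.
html = """
<table>
  <thead><tr><th>Gene Symbol</th><th>Species</th></tr></thead>
  <tbody>
    <tr><td>GENE_A</td><td>Homo sapiens</td></tr>
    <tr><td>GENE_B</td><td>Homo sapiens</td></tr>
  </tbody>
</table>
"""

parser = TableRowParser()
parser.feed(html)

# Treat the first row as the header and the rest as data rows.
header, *body = parser.rows
print(header)
print(body)
```

The header and body lists are what a DataFrame would then be built from, for example with pd.DataFrame(body, columns=header).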

Reference

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

https://zhuanlan.zhihu.com/p/51968879

https://blog.csdn.net/qq_40478273/article/details/103980288
