Python table handling: how to handle rowspan when scraping a wikitable with Python?

Latest recommended article published on 2024-06-13 09:46:10

weixin_39531780


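Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.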

Article link: https://blog.csdn.net/weixin_39531780/article/details/113647635


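This is the only solution I found for this problem. Here I convert a table that uses rowspan and colspan into a simple table, one where every row has a cell in every column.

I lost many days on this problem without finding a simple, good solution. In many Stack Overflow answers, developers scrape only the text. In my case I also needed the URL links, so I wrote this code.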

This worked for me:

    # Written for BeautifulSoup on Python 3.5.
    # Fetch one wikitable from Wikipedia as HTML, keeping its links.
    from bs4 import BeautifulSoup
    import requests
    import codecs
    import os

    url = "https://en.wikipedia.org/wiki/Ministry_of_Agriculture_%26_Farmers_Welfare"
    fullTable = '<table>'

    rPage = requests.get(url)
    soup = BeautifulSoup(rPage.content, "lxml")
    table = soup.find("table", {"class": "wikitable"})

    rows = table.find_all("tr")
    row_lengths = [len(r.find_all(['th', 'td'])) for r in rows]
    ncols = max(row_lengths)
    nrows = len(rows)

    # Convert each row into a list of its cells (a list of lists).
    for i in range(nrows):
        rows[i] = rows[i].find_all(['th', 'td'])

    # Header: repeat each colspan cell so it fills every column it spans.
    # A while loop is used because the row grows as cells are inserted.
    i = 0
    while i < len(rows[0]):
        col = rows[0][i]
        if col.get('colspan'):
            cSpanLen = int(col.get('colspan'))
            del col['colspan']
            for k in range(1, cSpanLen):
                rows[0].insert(i, col)
        i += 1

    # Whole table: copy each rowspan cell down into the rows it spans.
    for i in range(nrows):
        row = rows[i]
        for j in range(len(row)):
            col = row[j]
            del col['style']
            if col.get('rowspan'):
                rSpanLen = int(col.get('rowspan'))
                del col['rowspan']
                for k in range(1, rSpanLen):
                    if i + k < nrows:  # ignore spans that run past the last row
                        rows[i + k].insert(j, col)

    # Rebuild the table as plain rows without any spans.
    for row in rows:
        fullTable += '<tr>'
        for col in row:
            fullTable += str(col)
        fullTable += '</tr>'
    fullTable += '</table>'

    # Make the relative wiki links absolute, then strip newlines and
    # line-break tags (BeautifulSoup serializes them as <br/>).
    fullTable = fullTable.replace('href="/wiki/', 'href="https://en.wikipedia.org/wiki/')
    fullTable = fullTable.replace('\n', '')
    fullTable = fullTable.replace('<br/>', '')

    # Save the file under a name derived from the URL.
    page = os.path.split(url)[1]
    fname = 'output_{}.html'.format(page)
    with codecs.open(fname, 'w', 'utf-8') as singleTable:
        singleTable.write(fullTable)

    # The rowspan/colspan table is now a simple table, so scraping can start here.
    soupTable = BeautifulSoup(fullTable, "lxml")
    urlLinks = soupTable.find_all('a')
    print(urlLinks)
    # and so on ...
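The span handling in the code above can also be looked at apart from BeautifulSoup. What follows is a minimal sketch, not the author's code. It assumes each cell has already been reduced to a `(value, rowspan, colspan)` tuple, and the name `expand_spans` is made up for illustration. Unlike the code above, it expands colspan in every row rather than only in the header, and it assumes a well-formed table in which spans do not overlap.

```python
def expand_spans(rows):
    """Expand rowspan/colspan cells into a dense grid of values.

    `rows` is a list of rows; each row is a list of
    (value, rowspan, colspan) tuples in source order.
    This input shape is an assumption made for the sketch.
    """
    grid = []
    pending = {}  # column index -> (value, number of rows still to fill)
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if col in pending:
                # A cell from an earlier row spans down into this column.
                value, remaining = pending[col]
                out.append(value)
                if remaining > 1:
                    pending[col] = (value, remaining - 1)
                else:
                    del pending[col]
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break  # no more source cells in this row
            value, rowspan, colspan = cell
            for _ in range(colspan):
                # Repeat the value across every column it spans.
                out.append(value)
                if rowspan > 1:
                    pending[col] = (value, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


table = [
    [("Name", 1, 1), ("Contact", 1, 2)],
    [("Alice", 2, 1), ("a@x", 1, 1), ("111", 1, 1)],
    [("b@x", 1, 1), ("222", 1, 1)],
]
for line in expand_spans(table):
    print(line)
```

Here each `value` is a string, but it could just as well be the BeautifulSoup tag for the cell, which keeps the links inside it available.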

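A plain string replacement on `/wiki/` can also hit links that are already absolute. One possible alternative, sketched here under assumptions, is to resolve each `href` against the page URL with the standard library's `urljoin`. The helper name `absolutize_links`, the regular expression, and the `example.org` link are illustrative only, and the pattern only matches double-quoted `href` attributes.

```python
import re
from urllib.parse import urljoin

# Matches double-quoted href attributes only (an assumption of this sketch).
HREF_RE = re.compile(r'href="([^"]*)"')


def absolutize_links(html, base_url):
    """Rewrite every double-quoted href in `html` to an absolute URL."""
    def fix(match):
        return 'href="{}"'.format(urljoin(base_url, match.group(1)))
    return HREF_RE.sub(fix, html)


base = "https://en.wikipedia.org/wiki/Ministry_of_Agriculture_%26_Farmers_Welfare"
# Hypothetical cell markup: one relative link, one already-absolute link.
cell = ('<td><a href="/wiki/India">India</a> '
        '<a href="https://example.org/wiki/x">x</a></td>')
print(absolutize_links(cell, base))
```

The relative link is resolved against the page URL, while the link that is already absolute is left unchanged.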