Python table handling: how to handle rowspan when scraping a wikitable with Python?

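This is the only solution I have found for this problem. Here I convert a table that uses rowspan and colspan into a simple table, where every row has one cell per column.

I lost many days on this problem without finding a simple, good solution. In many Stack Overflow answers, developers scrape only the text. In my case I also need the URL links, so I wrote this code.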
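The idea of turning a spanned table into a simple one can be sketched without any HTML library. The following is a minimal illustration, not the code from this post: the function name expand_spans and the (text, rowspan, colspan) tuple layout are assumptions made for the example, and the sample data is invented.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) tuples into a plain grid.

    Every spanned cell is repeated so that each output row has one
    entry per column. Hypothetical helper, for illustration only.
    """
    grid = []
    pending = {}  # column index -> (text, number of rows still to fill)
    for cells in rows:
        cells = list(cells)
        out, col = [], 0
        while cells or col in pending:
            if col in pending:
                # a cell from an earlier row spans down into this column
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
                col += 1
                continue
            text, rowspan, colspan = cells.pop(0)
            for _ in range(colspan):
                # repeat the cell once per spanned column
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


# invented sample: a header with colspan=2 and a body cell with rowspan=2
table = [
    [("Name", 1, 1), ("Contact", 1, 2)],
    [("Alice", 2, 1), ("a@x.org", 1, 1), ("555-1", 1, 1)],
    [("a2@x.org", 1, 1), ("555-2", 1, 1)],
]
grid = expand_spans(table)
for line in grid:
    print(line)
```

Tracking pending cells by column index keeps a spanned cell in the right column even when several spans overlap in one row.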

This worked for me:

# This code was written with BeautifulSoup on Python 3.5.
# It fetches one wikitable, as HTML with its links, from Wikipedia.
from bs4 import BeautifulSoup
import requests
import codecs
import os

url = "https://en.wikipedia.org/wiki/Ministry_of_Agriculture_%26_Farmers_Welfare"
# accumulator for the rebuilt HTML, starting with the opening table tag
fullTable = '<table>'

rPage = requests.get(url)
soup = BeautifulSoup(rPage.content, "lxml")
table = soup.find("table", {"class": "wikitable"})
rows = table.findAll("tr")
row_lengths = [len(r.findAll(['th', 'td'])) for r in rows]
ncols = max(row_lengths)
nrows = len(rows)

# convert each row into a list of its cells (a list of lists)
for i in range(len(rows)):
    rows[i] = rows[i].findAll(['th', 'td'])

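# Header: expand colspan cells in the header row
# (a while loop is used because the row grows as cells are inserted)
i = 0
while i < len(rows[0]):
    col = rows[0][i]
    if col.get('colspan'):
        cSpanLen = int(col.get('colspan'))
        del col['colspan']
        # repeat the cell so that it fills cSpanLen columns
        for k in range(1, cSpanLen):
            rows[0].insert(i, col)
    i += 1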

# rowspan check in the full table
for i in range(len(rows)):
    row = rows[i]
    for j in range(len(row)):
        col = row[j]
        del col['style']
        if col.get('rowspan'):
            rSpanLen = int(col.get('rowspan'))
            del col['rowspan']
            # copy the cell into the rows below, never past the last row
            for k in range(1, min(rSpanLen, len(rows) - i)):
                rows[i + k].insert(j, col)

# build the table again
for i in range(len(rows)):
    row = rows[i]
    fullTable += '<tr>'
    for j in range(len(row)):
        col = row[j]
        rowStr = str(col)
        fullTable += rowStr
    fullTable += '</tr>'
fullTable += '</table>'

# make the table links absolute
fullTable = fullTable.replace('href="/wiki/', 'href="https://en.wikipedia.org/wiki/')
fullTable = fullTable.replace('\n', '')
# drop line-break tags (BeautifulSoup serializes them as <br/>)
fullTable = fullTable.replace('<br/>', '')

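# save the file under a name derived from the URL
page = os.path.split(url)[1]
fname = 'output_{}.html'.format(page)
# the with-block closes the file so the contents are flushed to disk
with codecs.open(fname, 'w', 'utf-8') as singleTable:
    singleTable.write(fullTable)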

# scraping can start here: the rowspan/colspan table is now a simple table
soupTable = BeautifulSoup(fullTable, "lxml")
urlLinks = soupTable.findAll('a')
print(urlLinks)
# and so on .............
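
The script above makes links absolute with a plain string replacement on /wiki/. A possible alternative, sketched here under the assumption that each href is handled individually (for example by reading a['href'] from every link tag), is urljoin from the standard library. The helper name absolutize and the sample hrefs are invented for the example.

```python
from urllib.parse import urljoin

# page the table was taken from; relative links are resolved against it
BASE = "https://en.wikipedia.org/wiki/Ministry_of_Agriculture_%26_Farmers_Welfare"


def absolutize(href, base=BASE):
    """Return href as an absolute URL (hypothetical helper)."""
    return urljoin(base, href)


# invented sample hrefs: site-relative, already absolute, and a fragment
samples = ["/wiki/India", "https://example.org/page", "#cite_note-1"]
links = [absolutize(h) for h in samples]
for link in links:
    print(link)
```

Unlike the string replacement, this leaves links that are already absolute unchanged and also resolves fragment-only links against the page URL.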
