HTML separator tags: how to separate the text inside an HTML tag element

Latest recommended article published on 2023-10-08 15:12:12

谭嘉豪


Tags: HTML separator tags

I would select every td cell that contains a span with class "foodTitle":

//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]

Then, for each of these td cells, get all of its text nodes:

cell.select('.//text()').extract()
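The two steps above (pick the cells that hold a title span, then collect their text nodes) can be tried without Scrapy. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`, which is a swapped-in tool and not the selector used in this post. ElementTree's XPath subset has no nested descendant predicate, so that part of the filter is done in Python. The `SAMPLE` markup and the `title_cells` helper are hypothetical, and the sample must be well-formed, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real page is not reproduced here.
SAMPLE = """
<div id="menu_list">
  <table>
    <tr>
      <td valign="top"><span class="foodTitle">Liat Towers</span><br/>
        541 Liat towers #01-01<br/>
        Telephone: 6737 8036</td>
      <td valign="top">no title here</td>
    </tr>
  </table>
</div>
"""


def title_cells(root):
    """Return td[@valign='top'] cells that have a span.foodTitle descendant."""
    return [
        td
        for td in root.iterfind(".//td[@valign='top']")
        # Equivalent of the [.//span[@class="foodTitle"]] predicate.
        if td.find(".//span[@class='foodTitle']") is not None
    ]


root = ET.fromstring(SAMPLE)
cells = title_cells(root)
# itertext() walks every text node under the cell, like './/text()'.
texts = [list(td.itertext()) for td in cells]
```

Only the first cell survives the filter, and its text nodes still carry the leading newlines and indentation that have to be stripped afterwards.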

You will get results like this:

['\n ',
 '\n ',
 'Century Square',
 '\n 2 Tampines Central 5',
 '\n #01-44-47 Century Square',
 '\n Singapore 529509',
 '\n ',
 '\n ',
 'Opening Hours:',
 u'\n 7am to 12am (Sun-Thu &\xa0PH)',
 u'\n 24 Hours (Fri & Sat\xa0&',
 '\n ',
 '\n Eve of PH)',
 '\n Telephone: 6789 0457',
 '\n ']

and

['\n ',
 'Liat Towers',
 '\n 541 Liat towers #01-01',
 '\n Orchard Road',
 '\n Singapore 238888',
 'Opening Hours: ',
 '\n 24 hours (Daily)',
 '\n Telephone: 6737 8036']
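The raw lists contain whitespace-only entries and non-breaking spaces (`\xa0`) in the middle of some lines. As a rough sketch, a small helper could normalize them before any further parsing. The name `clean_lines` is hypothetical, and collapsing interior whitespace is an assumption about what is wanted here; a plain `.strip()` only trims the ends of each line.

```python
def clean_lines(raw_lines):
    """Collapse whitespace runs (non-breaking spaces included) and drop empty entries."""
    # str.split() with no argument splits on any Unicode whitespace,
    # so joining the pieces with a single space normalizes each line.
    cleaned = (" ".join(line.split()) for line in raw_lines)
    return [line for line in cleaned if line]
```

Usage: `clean_lines(['\n ', 'Opening Hours:'])` keeps only the non-empty entry.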

Some of these text nodes are whitespace-only strings, so strip them out, then look for the "Opening Hours" and "Telephone" keywords to sort the remaining lines inside a loop:

    # Note: BaseSpider and HtmlXPathSelector are legacy Scrapy APIs;
    # newer Scrapy releases use scrapy.Spider and response.xpath() instead.
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import re

    from todo.items import wendyItem


    class wendySpider(BaseSpider):
        name = "wendyspider"
        allowed_domains = ["wendys.com.sg"]
        start_urls = ["http://www.wendys.com.sg/outlets.php"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cells = hxs.select('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]')
            items = []
            for cell in cells:
                item = wendyItem()

                # get all text nodes
                # some lines are blank so .strip() them
                lines = cell.select('.//text()').extract()
                lines = [l.strip() for l in lines if l.strip()]

                # first non-blank line is the place name
                item['name'] = lines.pop(0)

                # for the other lines, check for "Opening Hours" and "Telephone"
                # to store lines in the correct list container
                address_lines = []
                hours_lines = []
                telephone_lines = []
                opening_hours = False
                telephone = False
                for line in lines:
                    if 'Opening Hours' in line:
                        opening_hours = True
                    elif 'Telephone' in line:
                        telephone = True

                    if telephone:
                        telephone_lines.append(line)
                    elif opening_hours:
                        hours_lines.append(line)
                    else:
                        address_lines.append(line)

                # last address line is the postal code + town name
                item['address'] = "\n".join(address_lines[:-1])
                item['postal'] = address_lines[-1]

                # omit "Opening Hours" (first element in list)
                item['hours'] = "\n".join(hours_lines[1:])
                item['contact'] = "\n".join(telephone_lines)

                items.append(item)
            return items
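The keyword-based bucketing in the spider can also be tested on its own, without Scrapy or a network request. Below is a minimal sketch of the same logic as a plain function. The name `split_outlet_lines` and the returned dict layout are hypothetical, and the code assumes each cell yields at least one non-blank line (the place name).

```python
def split_outlet_lines(raw_lines):
    """Sort one cell's text nodes into name, address, postal, hours, contact."""
    lines = [l.strip() for l in raw_lines if l.strip()]
    name, rest = lines[0], lines[1:]  # assumes a non-empty cell

    buckets = {"address": [], "hours": [], "contact": []}
    current = "address"
    for line in rest:
        if "Telephone" in line:
            current = "contact"
        elif "Opening Hours" in line and current != "contact":
            current = "hours"
        buckets[current].append(line)

    address = buckets["address"]
    return {
        "name": name,
        # Last address line is the postal code + town name.
        "address": "\n".join(address[:-1]),
        "postal": address[-1] if address else "",
        # Drop the "Opening Hours" header itself.
        "hours": "\n".join(buckets["hours"][1:]),
        "contact": "\n".join(buckets["contact"]),
    }


# Text nodes of the second outlet, copied from the output shown earlier.
liat = split_outlet_lines([
    '\n ', 'Liat Towers', '\n 541 Liat towers #01-01', '\n Orchard Road',
    '\n Singapore 238888', 'Opening Hours: ', '\n 24 hours (Daily)',
    '\n Telephone: 6737 8036',
])
```

Using a single `current` state instead of two boolean flags keeps the precedence explicit: once a "Telephone" line has been seen, every later line goes to the contact bucket, which matches the order of the checks in the spider.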
