HTML separator tags: how to split text inside an HTML tag element

I would select all the td cells that contain a span with class "foodTitle":

//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]
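The nested predicate `[.//span[@class="foodTitle"]]` keeps only the cells that have such a span somewhere beneath them. As a rough illustration, the same filter can be emulated with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so the nested predicate becomes a Python condition. The markup below is a hypothetical, well-formed stand-in for the real page, not its actual HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the first cell is top-aligned AND holds a foodTitle span.
DOC = """<div id="menu_list"><table><tr>
<td valign="top"><span class="foodTitle">Century Square</span><br/>2 Tampines Central 5</td>
<td valign="top">spacer cell</td>
<td><span class="foodTitle">not top-aligned</span></td>
</tr></table></div>"""

root = ET.fromstring(DOC)

# ElementTree cannot express [.//span[...]] inside a predicate,
# so the descendant check is done in Python instead.
cells = [
    td for td in root.findall(".//td[@valign='top']")
    if td.find(".//span[@class='foodTitle']") is not None
]
names = [td.find(".//span[@class='foodTitle']").text for td in cells]
print(names)
```

With a full XPath 1.0 engine, such as the one behind Scrapy's selectors, the single expression above does all of this in one step.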

Then, for each of these td cells, get all the text nodes:

.//text()

You will get results like this:

['\n ',
 '\n ',
 'Century Square',
 '\n 2 Tampines Central 5',
 '\n #01-44-47 Century Square',
 '\n Singapore 529509',
 '\n ',
 '\n ',
 'Opening Hours:',
 u'\n 7am to 12am (Sun-Thu &\xa0PH)',
 u'\n 24 Hours (Fri & Sat\xa0&',
 '\n ',
 '\n Eve of PH)',
 '\n Telephone: 6789 0457',
 '\n ']

and

['\n ',
 'Liat Towers',
 '\n 541 Liat towers #01-01',
 '\n Orchard Road',
 '\n Singapore 238888',
 'Opening Hours: ',
 '\n 24 hours (Daily)',
 '\n Telephone: 6737 8036']
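To see why the text arrives as separate nodes, note that every separator tag inside the cell (such as a line break) ends one text node and starts the next. The sketch below uses the standard library's `xml.etree.ElementTree` and its `itertext()` method as a stand-in for `.//text()`. The cell markup is an assumption made for illustration; the real page may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell: lines separated by <br/> tags, indented as in typical page source.
CELL = """<td valign="top">
  <span class="foodTitle">Liat Towers</span><br/>
  541 Liat towers #01-01<br/>
  Orchard Road<br/>
  Singapore 238888
</td>"""

td = ET.fromstring(CELL)

# itertext() walks the element and yields each text node in document order,
# including the whitespace-only node before the span.
raw_nodes = list(td.itertext())
for node in raw_nodes:
    print(repr(node))
```

Each `<br/>` acts as a boundary, so the address lines come out as individual strings, padded with the newlines and indentation from the markup.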

Some of these text nodes are whitespace-only, so strip them out, then look for the "Opening Hours" and "Telephone" keywords to sort the remaining lines in a loop:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

from todo.items import wendyItem


class wendySpider(BaseSpider):
    name = "wendyspider"
    allowed_domains = ["wendys.com.sg"]
    start_urls = ["http://www.wendys.com.sg/outlets.php"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cells = hxs.select('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]')
        items = []
        for cell in cells:
            item = wendyItem()

            # get all text nodes
            # some lines are blank so .strip() them
            lines = cell.select('.//text()').extract()
            lines = [l.strip() for l in lines if l.strip()]

            # first non-blank line is the place name
            item['name'] = lines.pop(0)

            # for the other lines, check for "Opening Hours" and "Telephone"
            # to store lines in the correct list container
            address_lines = []
            hours_lines = []
            telephone_lines = []

            opening_hours = False
            telephone = False
            for line in lines:
                if 'Opening Hours' in line:
                    opening_hours = True
                elif 'Telephone' in line:
                    telephone = True

                if telephone:
                    telephone_lines.append(line)
                elif opening_hours:
                    hours_lines.append(line)
                else:
                    address_lines.append(line)

            # last address line is the postal code + town name
            item['address'] = "\n".join(address_lines[:-1])
            item['postal'] = address_lines[-1]

            # omit "Opening Hours" (first element in list)
            item['hours'] = "\n".join(hours_lines[1:])

            item['contact'] = "\n".join(telephone_lines)

            items.append(item)

        return items
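
The line-sorting logic can also be tried without Scrapy. The following is a minimal sketch that pulls the loop out into a standalone function, fed with the second result list shown earlier. The function name `split_outlet_lines` and the returned dict layout are assumptions made for this illustration. It tracks a single "current bucket" instead of two flags, so it is a variation on the spider's loop rather than a copy of it: a later "Opening Hours" line would switch buckets again.

```python
def split_outlet_lines(raw_lines):
    """Group the text nodes of one outlet cell into named fields (hypothetical helper)."""
    # Drop whitespace-only nodes and trim the rest.
    lines = [l.strip() for l in raw_lines if l.strip()]

    # First non-blank line is the place name.
    name = lines.pop(0)

    address, hours, telephone = [], [], []
    target = address  # bucket that receives the current line
    for line in lines:
        if 'Opening Hours' in line:
            target = hours
        elif 'Telephone' in line:
            target = telephone
        target.append(line)

    return {
        'name': name,
        # last address line is the postal code + town name
        'address': "\n".join(address[:-1]),
        'postal': address[-1] if address else '',
        # skip the "Opening Hours" heading itself
        'hours': "\n".join(hours[1:]),
        'contact': "\n".join(telephone),
    }


liat = ['\n ', 'Liat Towers', '\n 541 Liat towers #01-01', '\n Orchard Road',
        '\n Singapore 238888', 'Opening Hours: ', '\n 24 hours (Daily)',
        '\n Telephone: 6737 8036']
result = split_outlet_lines(liat)
print(result)
```

Keeping the grouping in a plain function like this makes it easy to check against sample lists before wiring it into the spider's `parse` method.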
