Target URL: https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent
As the screenshot shows, the part highlighted in yellow is what needs to be extracted.
Page source:
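Using //text() and then taking [0] does not return the content after the <br>. The <br> splits the cell into separate text nodes, and [0] keeps only the first one: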
# [0] keeps only the first matched text node, i.e. the date before the <br>
all_time_list_1 = response.etree.xpath('//div[starts-with(@style,"margin-left")]//table/tr/td[4]//text()')[0]
print(all_time_list_1)
Result:
2019-05-10
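Why only the date comes back can be reproduced offline. The following is a minimal sketch, assuming the cell looks like `<td>2019-05-10<br/>21:52:25</td>` (a hypothetical fragment modeled on the output in this post, not copied from the page). It uses the standard library's `xml.etree.ElementTree` in place of lxml; in both libraries the text after a `<br>` is stored as the `tail` of the `<br>` element rather than as the text of the cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell modeled on the output in this post; the real markup may differ.
cell = ET.fromstring("<td>2019-05-10<br/>21:52:25</td>")

first_text = cell.text              # text before the <br>
after_br = cell.find("br").tail     # text after the <br> hangs off the <br> element
text_nodes = list(cell.itertext())  # every text node, in document order

print(first_text)
print(after_br)
print(text_nodes)
print(text_nodes[0])  # indexing with [0] drops the time
```

So the time is present in the cell, but any approach that keeps just the first text node will lose it.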
Correct approach: select the cells first, then read all text nodes of each cell:
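all_time_list = response.etree.xpath('//div[starts-with(@style,"margin-left")]//table/tr/td[4]')
for i in all_time_list:
    # text() relative to the cell returns every text node, before and after the <br>
    print(i.xpath('./text()'))
    # join the text nodes into a single "date time" string
    print(" ".join(i.xpath('text()')))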
Result:
['2019-05-10', '21:52:25']
2019-05-10 21:52:25
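The per-cell join can also be wrapped in a small helper that works across every row. This is only a sketch under stated assumptions: the table below is a hypothetical stand-in for the real page (the second timestamp and the placeholder cells are invented for illustration), `cell_text` is a name introduced here, and the standard library's `xml.etree.ElementTree` is used instead of lxml so the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical table for illustration only; the real EDGAR markup may differ.
HTML = """
<table>
  <tr><td>a</td><td>b</td><td>c</td><td>2019-05-10<br/>21:52:25</td></tr>
  <tr><td>a</td><td>b</td><td>c</td><td>2019-05-10<br/>21:40:03</td></tr>
</table>
"""


def cell_text(cell):
    """Join every text node of a cell with single spaces, skipping blank ones."""
    parts = (t.strip() for t in cell.itertext())
    return " ".join(p for p in parts if p)


table = ET.fromstring(HTML.strip())
# td[4] selects the fourth <td> of each row, as in the XPath used in this post.
timestamps = [cell_text(td) for td in table.findall("./tr/td[4]")]
print(timestamps)
```

Stripping each text node before joining guards against stray whitespace or newlines around the `<br>`, which a plain `" ".join(...)` would carry into the result.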