Scraping a PDF web page with Python: fetching a PDF from a URL

I want to scrape the text from the URL "http://www.nycgo.com/venues/thalia-restaurant#menu"

The text I'm interested in is in the 'menu' tab on the page. I tried BeautifulSoup to get all the text on the page, but the return value from the following code misses all the text in the menu.

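import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup as BS  # assumed origin of the BS name used below

html = urllib2.urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html = html.read()
soup = BS(html)
print soup.get_text()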

When I inspect the menu elements in the browser, the menu content appears to be part of the page's HTML. I did notice that when browsing the page manually, the menu takes several seconds to load fully. I'm not sure whether that is why the code above fails to get the menu content.

Any insight would be appreciated.

Solution

While soup.get_text() returns all of the text in an HTML document (web page), the problem here is that the menu is embedded in the page as a PDF, which Beautiful Soup cannot access. The PDF file itself is referenced in JavaScript as follows:

{
    name: "menu",
    show: Boolean(1),
    url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf"
}

The simplest way to extract this is probably a regular expression. Parsing HTML with regular expressions is generally a bad idea, but here you're looking for one very specific thing: a file path wrapped in double quotes and ending in .pdf. The following code finds it and builds the full URL:

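import re
from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen

html = urlopen("http://www.nycgo.com/venues/thalia-restaurant#menu")
html_doc = html.read()

# Match a double-quoted string ending in .pdf; [^"] keeps the match within one pair of quotes
match = re.search(br'"([^"]*\.pdf)"', html_doc)
# The captured path is site-relative, so prepend the host
pdf_url = "http://www.nycgo.com" + match.group(1).decode('utf8')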

Now pdf_url is:

u'http://www.nycgo.com/assets/files/programs/rw/2016W/thalia-restaurant.pdf'
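For readers on Python 3, the same extraction step might be sketched as below. The names find_pdf_url and PDF_PATTERN are hypothetical, the sample string is modeled on the JavaScript fragment quoted above rather than fetched from the site, and urljoin is swapped in for the manual string concatenation. It also returns None when no match is found, a case the snippet above does not handle.

```python
import re
from urllib.parse import urljoin

# Matches a double-quoted string that ends in .pdf; [^"] keeps the match
# inside a single pair of quotes.
PDF_PATTERN = re.compile(r'"([^"]+\.pdf)"')


def find_pdf_url(html_text, page_url):
    """Return the absolute URL of the first quoted .pdf path, or None."""
    match = PDF_PATTERN.search(html_text)
    if match is None:
        return None
    # urljoin resolves a site-relative path such as /assets/... against the page URL.
    return urljoin(page_url, match.group(1))


# Sample input modeled on the JavaScript fragment (an assumption, not live page content).
sample = 'name: "menu", show: Boolean(1), url: "/assets/files/programs/rw/2016W/thalia-restaurant.pdf"'
print(find_pdf_url(sample, "http://www.nycgo.com/venues/thalia-restaurant#menu"))
```

Against a live page, html_text would be the decoded response body, for example urlopen(page_url).read().decode("utf8", errors="replace").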

However, extracting the text from the PDF is a little trickier. You can download the file first:

from urllib import urlretrieve  # Python 2; on Python 3 use: from urllib.request import urlretrieve

urlretrieve(pdf_url, "download.pdf")
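On Python 3, the download step could also be written with urllib.request.urlopen and a streamed copy. This is only a sketch: download_file is a hypothetical helper name, and the 30-second timeout is an arbitrary assumption.

```python
import shutil
from urllib.request import urlopen


def download_file(url, dest_path, timeout=30):
    """Stream the resource at url into dest_path and return dest_path."""
    with urlopen(url, timeout=timeout) as response, open(dest_path, "wb") as out:
        # copyfileobj copies in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(response, out)
    return dest_path
```

With the URL found earlier, the call would be download_file(pdf_url, "download.pdf").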

Then extract the text using the convert_pdf_to_txt function given in this answer to another question:

text = convert_pdf_to_txt("download.pdf")

print(text)

Returns:

NEW YOUR CITY

RESTAURANT WEEK

WINTER 2016

MONDAY - FRIDAY

828 Eighth Avenue

New York City, 10019

Tel: 212.399.4444

www.restaurantthalia.com

LUNCH $25

FIRST COURSE

CREAMY POLENTA

fricassee of truffle mushrooms

...
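The convert_pdf_to_txt function itself is not reproduced in this post. As a stand-in, and not necessarily how the referenced answer implements it, a minimal Python 3 version could rely on the extract_text helper from the third-party pdfminer.six package (assumed to be installed). The drop_blank_lines helper is a hypothetical addition for tidying output like the listing above.

```python
def convert_pdf_to_txt(path):
    """Return the text of the PDF at path (stand-in for the referenced helper)."""
    # Imported lazily so the rest of the script loads without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def drop_blank_lines(text):
    """Remove blank lines and surrounding whitespace from extracted text."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```

Typical use would be print(drop_blank_lines(convert_pdf_to_txt("download.pdf"))).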
