In fact, what you are describing is more accurately called web scraping, where you grab specific content from a given website:
Web scraping is a computer software technique of extracting
information from websites. This technique mostly focuses on the
transformation of unstructured data (HTML format) on the web into
structured data (database or spreadsheet).
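Without knowing the HTML structure of your page, it isn't possible to give you the exact code snippet you need. But I can suggest a few approaches you can use to scrape your website.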
1. Non-programming way: For those of you who need a non-programming way to extract
information out of web pages, you can also look at import.io. It
provides a GUI-driven interface to perform all basic web scraping
operations.
2. Programmer's way:
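You can find many Python libraries that perform the same function, so it is worth finding the best one to use. I prefer BeautifulSoup because it is easy and intuitive. Specifically, you use two Python modules to scrape data:
Urllib2: It is a Python module which can be used for fetching URLs. It defines functions and classes to help with URL actions (basic
and digest authentication, redirections, cookies, etc). For more
detail refer to the documentation page. (In Python 3 this functionality lives in urllib.request.)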
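To make the fetching step concrete, here is a minimal sketch using only the standard library. It assumes Python 3, where urllib2's functionality is available as urllib.request. The function name fetch_html and the User-Agent value are hypothetical, chosen just for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded to text."""
    # Some sites reject the default Python user agent, so send our own
    # (the value here is a made-up placeholder).
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset from the Content-Type header if there is one,
        # otherwise fall back to UTF-8 (an assumption, not a guarantee).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html with the address of your page should give you the raw HTML as a string, which you can then hand to a parser.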
BeautifulSoup: It is an incredible tool for pulling out information
from a webpage. You can use it to extract tables, lists, and paragraphs, and
you can also apply filters to extract information from web pages. The latest available version is BeautifulSoup 4. You can look
at the installation instructions on its documentation page.
BeautifulSoup does not fetch the web page for us. That is why you need to use urllib2 together with the BeautifulSoup library.
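As a rough sketch of how the two fit together, the code below fetches a page with urllib and passes the raw bytes to BeautifulSoup. It assumes Python 3 (so urllib.request stands in for urllib2) and that BeautifulSoup 4 is installed as the bs4 package. The function names extract_content and scrape are hypothetical, and since I don't know your HTML, the tags it looks for (paragraphs, links, table rows) are only examples of what you might extract.

```python
import urllib.request

from bs4 import BeautifulSoup


def extract_content(html):
    """Parse HTML (str or bytes) and pull out a few kinds of content."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # Text of every <p> element.
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
        # Filter: keep only <a> tags that actually have an href attribute.
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        # One list of cell texts per table row (header and data cells).
        "rows": [
            [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in soup.find_all("tr")
        ],
    }


def scrape(url):
    """Fetch a page with urllib, then let BeautifulSoup parse the bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_content(response.read())
```

You would adapt the find_all calls to the tags, classes, or ids your own page uses once you have looked at its HTML.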
Besides BeautifulSoup, Python has several other options for HTML scraping. Here are a few: