by Ahad Sheriff

How to build a URL crawler to map a website using Python

A simple project for learning the fundamentals of web scraping

Before we start, let’s make sure we understand what web scraping is:

Web scraping is the process of extracting data from websites to present it in a format users can easily make sense of.

In this tutorial, I want to demonstrate how easy it is to build a simple URL crawler in Python that you can use to map websites. While this program is relatively simple, it can provide a great introduction to the fundamentals of web scraping and automation. We will be focusing on recursively extracting links from web pages, but the same ideas can be applied to a myriad of other solutions.

Our program will work like this:

  1. Visit a web page
  2. Scrape all unique URLs found on the webpage and add them to a queue
  3. Recursively process URLs one by one until we exhaust the queue
  4. Print results

First Things First

The first thing we should do is import all the necessary libraries. We will be using BeautifulSoup, requests, and urllib for web scraping.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from urllib.parse import urlparse
from collections import deque

Next, we need to select a URL to start crawling from. While you can choose any webpage with HTML links, I recommend using ScrapeThisSite. It is a safe sandbox that you can crawl without getting in trouble.

url = "https://scrapethissite.com"

Next, we are going to need to create a new deque object so that we can easily add newly found links and remove them once we are finished processing them. Pre-populate the deque with your url variable:

# a queue of urls to be crawled next
new_urls = deque([url])

We can then use a set to store unique URLs once they have been processed:

# a set of urls that we have already processed
processed_urls = set()

We also want to keep track of local URLs (same domain as the target), foreign URLs (a different domain than the target), and broken URLs:

# a set of domains inside the target website
local_urls = set()
# a set of domains outside the target website
foreign_urls = set()
# a set of broken urls
broken_urls = set()

Time To Crawl

With all that in place, we can now start writing the actual code to crawl the website.

We want to look at each URL in the queue, see if there are any additional URLs within that page, and add each one to the end of the queue until there are none left. As soon as we finish scraping a URL, we will remove it from the queue and add it to the processed_urls set for later use.

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move url from the queue to processed url set
    url = new_urls.popleft()
    processed_urls.add(url)
    # print the current url
    print("Processing %s" % url)

Next, add exception handling to catch any broken web pages and add them to the broken_urls set for later use:

try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError,
        requests.exceptions.InvalidURL, requests.exceptions.InvalidSchema):
    # add broken urls to their own set, then continue
    broken_urls.add(url)
    continue

We then need to get the base URL of the webpage so that we can easily differentiate local and foreign addresses:

# extract base url to resolve relative links
parts = urlsplit(url)
base = "{0.netloc}".format(parts)
strip_base = base.replace("www.", "")
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url
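
For a concrete sense of what these variables hold, here is an illustrative example, assuming the crawler is currently processing https://scrapethissite.com/pages/:

# illustrative values, assuming url = "https://scrapethissite.com/pages/"
#   base       -> "scrapethissite.com"
#   strip_base -> "scrapethissite.com"  (there is no "www." to strip here)
#   base_url   -> "https://scrapethissite.com"
#   path       -> "https://scrapethissite.com/pages/"  (everything up to and including the last "/")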

Initialize BeautifulSoup to process the HTML document:

soup = BeautifulSoup(response.text, "lxml")

Now scrape the web page for all links and sort them into their corresponding sets:

for link in soup.find_all('a'):
    # extract link url from the anchor
    anchor = link.attrs["href"] if "href" in link.attrs else ''

    if anchor.startswith('/'):
        local_link = base_url + anchor
        local_urls.add(local_link)
    elif strip_base in anchor:
        local_urls.add(anchor)
    elif not anchor.startswith('http'):
        local_link = path + anchor
        local_urls.add(local_link)
    else:
        foreign_urls.add(anchor)

Since I want to limit my crawler to local addresses only, I use the following to add new URLs to our queue:

for i in local_urls:
    if not i in new_urls and not i in processed_urls:
        new_urls.append(i)

If you want to crawl all URLs, use the following inside the link loop instead:

if not anchor in new_urls and not anchor in processed_urls:
    new_urls.append(anchor)

Warning: The way the program currently works, crawling foreign URLs will take a VERY long time. You could possibly get into trouble for scraping websites without permission. Use at your own risk!

Here is all my code:

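Putting the snippets above together, the complete script looks roughly like this (the result printing at the end is one reasonable way to handle step 4; the exact output format is just illustrative):

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from urllib.parse import urlparse
from collections import deque

url = "https://scrapethissite.com"

# a queue of urls to be crawled next
new_urls = deque([url])

# a set of urls that we have already processed
processed_urls = set()

# a set of domains inside the target website
local_urls = set()
# a set of domains outside the target website
foreign_urls = set()
# a set of broken urls
broken_urls = set()

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move url from the queue to processed url set
    url = new_urls.popleft()
    processed_urls.add(url)
    # print the current url
    print("Processing %s" % url)

    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError,
            requests.exceptions.InvalidURL, requests.exceptions.InvalidSchema):
        # add broken urls to their own set, then continue
        broken_urls.add(url)
        continue

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base = "{0.netloc}".format(parts)
    strip_base = base.replace("www.", "")
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    soup = BeautifulSoup(response.text, "lxml")

    for link in soup.find_all('a'):
        # extract link url from the anchor
        anchor = link.attrs["href"] if "href" in link.attrs else ''

        if anchor.startswith('/'):
            local_link = base_url + anchor
            local_urls.add(local_link)
        elif strip_base in anchor:
            local_urls.add(anchor)
        elif not anchor.startswith('http'):
            local_link = path + anchor
            local_urls.add(local_link)
        else:
            foreign_urls.add(anchor)

    # queue newly found local urls that we have not seen before
    for i in local_urls:
        if not i in new_urls and not i in processed_urls:
            new_urls.append(i)

# print results (output format is illustrative)
print("Local URLs:")
for u in sorted(local_urls):
    print(u)
print("Foreign URLs:")
for u in sorted(foreign_urls):
    print(u)
print("Broken URLs:")
for u in sorted(broken_urls):
    print(u)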

And that should be it. You have just created a simple tool to crawl a website and map all URLs found!

In Conclusion

Feel free to build upon and improve this code. For example, you could modify the program to search web pages for email addresses or phone numbers as you crawl them. You could even extend functionality by adding command line arguments to provide the option to define output files, limit the search depth, and much more. Learn how to create command-line interfaces that accept arguments here.

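For instance, a minimal sketch of the argument parsing with argparse might look like this (the flag names --output and --max-depth are just illustrative, not part of the original script; you would still need to wire them into the crawler):

import argparse

# illustrative CLI: the flag names and defaults are assumptions
parser = argparse.ArgumentParser(description="Map a website by recursively crawling its links")
parser.add_argument("url", help="the URL to start crawling from")
parser.add_argument("--output", help="optional file to write the discovered URLs to")
parser.add_argument("--max-depth", type=int, default=None, help="optional limit on crawl depth")
args = parser.parse_args()

# args.url, args.output, and args.max_depth would then drive the crawler
print(args.url, args.output, args.max_depth)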

If you have additional recommendations, tips, or resources, please share in the comments!

Thanks for reading! If you liked this tutorial and want more content like this, be sure to smash that follow button. ❤️

Also be sure to check out my website, Twitter, LinkedIn, and Github.

Translated from: https://www.freecodecamp.org/news/how-to-build-a-url-crawler-to-map-a-website-using-python-6a287be1da11/
