An Introduction to Web Scraping with Python

I. What Is Web Scraping?

At its simplest, web scraping can be summarized as the following steps:

1. Retrieve HTML data from a domain (URL).

2. Parse the retrieved data for the target information.

3. Store the target information.

4. Optionally, move to another page and repeat the steps above.
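The four steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production scraper: the target (link URLs), the output path, and the sample HTML below are my own assumptions. The fetch step is defined but not called, and a local HTML snippet stands in for a downloaded page, so the sketch runs offline.

```python
from html.parser import HTMLParser
import urllib.request


# Step 1: retrieve HTML data from a URL (defined here, but not invoked
# in this offline sketch).
def fetch_html(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Step 2: parse out the target information -- here, every link's href.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Step 3: store the target information (a plain text file stands in
# for whatever database you would really use).
def store_links(links, path):
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(links))


# A hardcoded snippet stands in for the result of fetch_html(...).
sample = '<html><body><a href="/page1">one</a> <a href="/page2">two</a></body></html>'
print(extract_links(sample))  # ['/page1', '/page2']
```

Step 4 is simply a loop: feed each extracted link back into `fetch_html` and repeat. In later posts, third-party libraries such as BeautifulSoup will take over the parsing role that `HTMLParser` plays here.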

II. Why Web Scraping?

1. If we only ever access the Internet through a browser, we lose many possibilities. A browser is convenient for running JavaScript, displaying images, and presenting objects in a human-readable way, but a web scraper is far better suited to collecting and processing data in bulk. Instead of viewing one page at a time in a small window, you can process thousands or even millions of pages at once.

2. A web scraper can also do work that traditional search engines cannot. If you search for "the cheapest flights to city A", you will mostly get ads and flight-search sites. The search engine only knows about those sites' content pages; it does not know the precise answer to your specific question. A well-built web scraper, however, can visit a number of sites, record the prices of flights to city A, and ultimately tell you the best time to buy a ticket.

3. Some may ask: why not just use an API? Certainly, if you can find an API that fits your needs, that is ideal. But there are several reasons why the API you want may not exist:

1) The site you want data from does not provide an API.

2) The data you need is small or limited in scope, so the site's administrators never saw a need to build an API for it.

3) The data source's maintainers lack the infrastructure or technical ability to develop an API.

Even when a suitable API does exist, various constraints may still keep it from meeting your needs. So let's get started learning web scraping.



III. Appendix

This blog series is written with reference to the following book:



*Python Web Scraping - Second Edition*, by Katharine Jarmul and Richard Lawson
English | 30 May 2017 | ASIN: B0725BCPT1 | 220 pages | AZW3 | 3.52 MB

Key Features

- A hands-on guide to web scraping using Python with solutions to real-world problems
- Create a number of different web scrapers in Python to extract information
- Includes practical examples using popular, well-maintained Python libraries

Book Description

The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers. You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites. By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics.

What You Will Learn

- Extract data from web pages with simple Python programming
- Build a concurrent crawler to process web pages in parallel
- Follow links to crawl a website
- Extract features from the HTML
- Cache downloaded HTML for reuse
- Compare concurrent models to determine the fastest crawler
- Find out how to parse JavaScript-dependent websites
- Interact with forms and sessions

About the Authors

Katharine Jarmul is a data scientist and Pythonista based in Berlin, Germany. She runs a data science consulting company, Kjamistan, that provides services such as data extraction, acquisition, and modelling for small and large companies. She has been writing Python since 2008 and scraping the web with Python since 2010, and has worked at both small and large start-ups who use web scraping for data analysis and machine learning. When she's not scraping the web, you can follow her thoughts and activities via Twitter (@kjam).

Richard Lawson is from Australia and studied Computer Science at the University of Melbourne. Since graduating, he built a business specializing in web scraping while travelling the world, working remotely from over 50 countries. He is a fluent Esperanto speaker, conversational in Mandarin and Korean, and active in contributing to and translating open source software. He is currently undertaking postgraduate studies at Oxford University and in his spare time enjoys developing autonomous drones.

Table of Contents

1. Introduction
2. Scraping the Data
3. Caching Downloads
4. Concurrent Downloading
5. Dynamic Content
6. Interacting with Forms
7. Solving CAPTCHA
8. Scrapy
9. Putting It All Together