Web Scraping with Python, 2nd Edition (Reprint Edition)

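If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This expanded edition of the practical book not only introduces web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web. The first part of Web Scraping with Python (2nd edition, reprint edition, in English) focuses on the mechanics of web scraping: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated way. The second part explores a variety of more specific tools and applications to fit any web scraping scenario you are likely to encounter.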

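Ryan Mitchell is a senior software engineer at HedgeServ in Boston, where she develops the company's APIs and data analysis tools. She is a graduate of Olin College of Engineering and holds a master's degree in software engineering and a certificate in data science from Harvard University Extension School. Before joining HedgeServ, she worked at Abine, developing web scrapers and automation tools in Python. She regularly consults on web scraping projects in the retail, finance, and pharmaceutical industries, and has worked as a curriculum consultant and adjunct faculty member at Northeastern University and Olin College of Engineering.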

Preface

Part Ⅰ.Building Scrapers

1.Your First Web Scraper

Connecting

An Introduction to BeautifulSoup

Installing BeautifulSoup

Running BeautifulSoup

Connecting Reliably and Handling Exceptions

2.Advanced HTML Parsing

You Don't Always Need a Hammer

Another Serving of BeautifulSoup

find() and find_all() with BeautifulSoup

Other BeautifulSoup Objects

Navigating Trees

Regular Expressions

Regular Expressions and BeautifulSoup

Accessing Attributes

Lambda Expressions

3.Writing Web Crawlers

Traversing a Single Domain

Crawling an Entire Site

Collecting Data Across an Entire Site

Crawling Across the Internet

4.Web Crawling Models

Planning and Defining Objects

Dealing with Different Website Layouts

Structuring Crawlers

Crawling Sites Through Search

Crawling Sites Through Links

Crawling Multiple Page Types

Thinking About Web Crawler Models

5.Scrapy

Installing Scrapy

Initializing a New Spider

Writing a Simple Scraper

Spidering with Rules

Creating Items

Outputting Items

The Item Pipeline

Logging with Scrapy

More Resources

6.Storing Data

Media Files

Storing Data to CSV

MySQL

Installing MySQL

Some Basic Commands

Integrating with Python

Database Techniques and Good Practice

"Six Degrees" in MySQL

Email

Part Ⅱ.Advanced Scraping

7.Reading Documents

Document Encoding

Text

Text Encoding and the Global Internet

CSV

Reading CSV Files

PDF

Microsoft Word and .docx

8.Cleaning Your Dirty Data

Cleaning in Code

Data Normalization

Cleaning After the Fact

OpenRefine

9.Reading and Writing Natural Languages

Summarizing Data

Markov Models

Six Degrees of Wikipedia:Conclusion

Natural Language Toolkit

Installation and Setup

Statistical Analysis with NLTK

Lexicographical Analysis with NLTK

Additional Resources

10.Crawling Through Forms and Logins

Python Requests Library

Submitting a Basic Form

Radio Buttons,Checkboxes,and Other Inputs

Submitting Files and Images

Handling Logins and Cookies

HTTP Basic Access Authentication

Other Form Problems

11.Scraping JavaScript

A Brief Introduction to JavaScript

Common JavaScript Libraries

Ajax and Dynamic HTML

Executing JavaScript in Python with Selenium

Additional Selenium Webdrivers

Handling Redirects

A Final Note on JavaScript

12.Crawling Through APIs

A Brief Introduction to APIs

HTTP Methods and APIs

More About API Responses

Parsing JSON

Undocumented APIs

Finding Undocumented APIs

Documenting Undocumented APIs

Finding and Documenting APIs Automatically

Combining APIs with Other Data Sources

More About APIs

13.Image Processing and Text Recognition

Overview of Libraries

Pillow

Tesseract

NumPy

Processing Well-Formatted Text

Adjusting Images Automatically

Scraping Text from Images on Websites

Reading CAPTCHAs and Training Tesseract

Training Tesseract

Retrieving CAPTCHAs and Submitting Solutions

14.Avoiding Scraping Traps

A Note on Ethics

Looking Like a Human

Adjust Your Headers

Handling Cookies with JavaScript

Timing Is Everything

Common Form Security Features

Hidden Input Field Values

Avoiding Honeypots

The Human Checklist

15.Testing Your Website with Scrapers

An Introduction to Testing

What Are Unit Tests?

Python unittest

Testing Wikipedia

Testing with Selenium

Interacting with the Site

unittest or Selenium?

16.Web Crawling in Parallel

Processes versus Threads

Multithreaded Crawling

Race Conditions and Queues

The threading Module

Multiprocess Crawling

Multiprocess Crawling

Communicating Between Processes

Multiprocess Crawling--Another Approach

17.Scraping Remotely

Why Use Remote Servers?

Avoiding IP Address Blocking

Portability and Extensibility

Tor

PySocks

Remote Hosting

Running from a Website-Hosting Account

Running from the Cloud

Additional Resources

18.The Legalities and Ethics of Web Scraping

Trademarks,Copyrights,Patents,Oh My!

Copyright Law

Trespass to Chattels

The Computer Fraud and Abuse Act

robots.txt and Terms of Service

Three Web Scrapers

eBay versus Bidder's Edge and Trespass to Chattels

United States v.Auernheimer and The Computer Fraud and Abuse Act

Field v.Google:Copyright and robots.txt

Moving Forward

Index
