Author: 禅与计算机程序设计艺术
1. Introduction
Spark is a distributed computing framework that provides APIs in Java, Scala, Python, and R, letting developers write fast, scalable, fault-tolerant applications that process big data at scale. It offers high-level abstractions for big-data processing over the Hadoop Distributed File System (HDFS) or other distributed storage systems such as Amazon S3 and Google Cloud Storage. Its API is well documented, with clear examples and explanations of how it works under the hood. In this guide, we walk through a quick-start tutorial on using Apache Spark in your application development projects with Python. We assume the reader knows basic Python programming concepts such as variables, functions, control-flow statements, and file I/O operations.
Apache Spark’s key features include:
- Flexible parallelism - Spark supports different types of computation lik