快速开始Spark

最新推荐文章于 2024-09-15 16:13:23 发布

Marho11

最新推荐文章于 2024-09-15 16:13:23 发布

阅读量510

点赞数 1

分类专栏： Spark 文章标签： spark hadoop

Spark 专栏收录该内容

7 篇文章 0 订阅

订阅专栏

基础

Spark shell提供一个交互式的数据分析工具，可以用来学习API。
启动python的shell：

./bin/pyspark

Spark的最主要抽象是RDD（Resilient Distributed Dataset），数据在spark内部用RDD表示。可以使用Hadoop InputFormats（如HDFS）或其他RDDs来创建RDDs。
启动Spark shell后，会自动创建一个SparkContext，用变量sc表示。

>>> textFile = sc.textFile('file:///opt/spark/spark-2.0.0-bin-hadoop2.7/README.md')
>>> textFile.count()#RDD的行数
99
>>> textFile.first()#RDD的第一行数据。
u'# Apache Spark'

使用一个transformation操作filter():

>>> linesWithSpark = textFile.filter(lambda line:'Spark' in line);#有多少行包含Spark单词。
>>> linesWithSpark.collect()
[u'# Apache Spark', u'Spark is a fast and general cluster computing system for Big Data. It provides', u'rich set of higher-level tools including Spark SQL for SQL and DataFrames,', u'and Spark Streaming for stream processing.', u'You can find the latest Spark documentation, including a programming', u'## Building Spark', u'Spark is built using [Apache Maven](http://maven.apache.org/).', u'To build Spark and its example programs, run:', u'You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).', u'["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).', u'For developing Spark using an IDE, see [Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)', u'The easiest way to start using Spark is through the Scala shell:', u'Spark also comes with several sample programs in the `examples` directory.', u'    ./bin/run-example SparkPi', u'    MASTER=spark://host:7077 ./bin/run-example SparkPi', u'Testing first requires [building Spark](#building-spark). Once Spark is built, tests', u'Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported', u'Hadoop, you must build Spark against the same version that your cluster runs.', u'in the online documentation for an overview on how to configure Spark.']

向filter( )传入一个函数参数，将使该函数参数返回true的元素返回，并组成一个新的RDD。

更多的RDD操作

map()和reduce()

向map()传入一个函数参数，将改函数参数作用于RDD中的每一行，产生一个新的RDD，原RDD的每一个元素在新RDD中只有一个元素与之对应。
向reduce()中传入一个参数函数，先将RDD的前两个元素传入到参数函数中，并返回一个新值，然后将该值和RDD的下一个元素一起传入函数参数中，直到最后只有一个值为止。
用这两个函数来发现具有最多单词的行的单词数：

>>> textFile.map(lambda line:len(line.split())).reduce(lambda a,b:a if(a>b) else b)
22

上述代码首先将行映射为一个整形值产生一个新的RDD，然后在这个新的RDD上调用reduce()返回行中最大的个数。

map()和flatMap()：

flatMap与map类似，区别是原RDD中的一个元素经map处理后只能生成一个元素，而原RDD中的元素经flatmap处理后可生成多个元素来构建新RDD。通过将传入的函数作用到RDD中的所有元素，然后==将结果扁平化==，返回一个新的RDD。

具体区别看例子：

>>> textFile.map(lambda line:line.split()).collect()
[[u'#', u'Apache', u'Spark'], [], [u'Spark', u'is', u'a', u'fast', u'and', u'general', u'cluster', u'computing', u'system', u'for', u'Big', u'Data.', u'It', u'provides'], [u'high-level', u'APIs', u'in', u'Scala,', u'Java,', u'Python,', u'and', u'R,', u'and', u'an', u'optimized', u'engine', u'that'], [u'supports', u'general', u'computation', u'graphs', u'for', u'data', u'analysis.', u'It', u'also', u'supports', u'a'], [u'rich', u'set', u'of', u'higher-level', u'tools', u'including', u'Spark', u'SQL', u'for', u'SQL', u'and', u'DataFrames,'], [u'MLlib', u'for', u'machine', u'learning,', u'GraphX', u'for', u'graph', u'processing,'], [u'and', u'Spark', u'Streaming', u'for', u'stream', u'processing.'], [], [u'<http://spark.apache.org/>'], [], [], [u'##', u'Online', u'Documentation'], [], [u'You', u'can', u'find', u'the', u'latest', u'Spark', u'documentation,', u'including', u'a', u'programming'], [u'guide,', u'on', u'the', u'[project', u'web', u'page](http://spark.apache.org/documentation.html)'], [u'and', u'[project', u'wiki](https://cwiki.apache.org/confluence/display/SPARK).'], [u'This', u'README', u'file', u'only', u'contains', u'basic', u'setup', u'instructions.'], [], [u'##', u'Building', u'Spark'], [], [u'Spark', u'is', u'built', u'using', u'[Apache', u'Maven](http://maven.apache.org/).'], [u'To', u'build', u'Spark', u'and', u'its', u'example', u'programs,', u'run:'], [], [u'build/mvn', u'-DskipTests', u'clean', u'package'], [], [u'(You', u'do', u'not', u'need', u'to', u'do', u'this', u'if', u'you', u'downloaded', u'a', u'pre-built', u'package.)'], [], [u'You', u'can', u'build', u'Spark', u'using', u'more', u'than', u'one', u'thread', u'by', u'using', u'the', u'-T', u'option', u'with', u'Maven,', u'see', u'["Parallel', u'builds', u'in', u'Maven', u'3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).'], [u'More', u'detailed', u'documentation', u'is', u'available', u'from', u'the', u'project', u'site,', u'at'], [u'["Building', u'Spark"](http://spark.apache.org/docs/latest/building-spark.html).'], [u'For', u'developing', u'Spark', u'using', u'an', u'IDE,', u'see', u'[Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)'], [u'and', u'[IntelliJ](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ).'], [], [u'##', u'Interactive', u'Scala', u'Shell'], [], [u'The', u'easiest', u'way', u'to', u'start', u'using', u'Spark', u'is', u'through', u'the', u'Scala', u'shell:'], [], [u'./bin/spark-shell'], [], [u'Try', u'the', u'following', u'command,', u'which', u'should', u'return', u'1000:'], [], [u'scala>', u'sc.parallelize(1', u'to', u'1000).count()'], [], [u'##', u'Interactive', u'Python', u'Shell'], [], [u'Alternatively,', u'if', u'you', u'prefer', u'Python,', u'you', u'can', u'use', u'the', u'Python', u'shell:'], [], [u'./bin/pyspark'], [], [u'And', u'run', u'the', u'following', u'command,', u'which', u'should', u'also', u'return', u'1000:'], [], [u'>>>', u'sc.parallelize(range(1000)).count()'], [], [u'##', u'Example', u'Programs'], [], [u'Spark', u'also', u'comes', u'with', u'several', u'sample', u'programs', u'in', u'the', u'`examples`', u'directory.'], [u'To', u'run', u'one', u'of', u'them,', u'use', u'`./bin/run-example', u'<class>', u'[params]`.', u'For', u'example:'], [], [u'./bin/run-example', u'SparkPi'], [], [u'will', u'run', u'the', u'Pi', u'example', u'locally.'], [], [u'You', u'can', u'set', u'the', u'MASTER', u'environment', u'variable', u'when', u'running', u'examples', u'to', u'submit'], [u'examples', u'to', u'a', u'cluster.', u'This', u'can', u'be', u'a', u'mesos://', u'or', u'spark://', u'URL,'], [u'"yarn"', u'to', u'run', u'on', u'YARN,', u'and', u'"local"', u'to', u'run'], [u'locally', u'with', u'one', u'thread,', u'or', u'"local[N]"', u'to', u'run', u'locally', u'with', u'N', u'threads.', u'You'], [u'can', u'also', u'use', u'an', u'abbreviated', u'class', u'name', u'if', u'the', u'class', u'is', u'in', u'the', u'`examples`'], [u'package.', u'For', u'instance:'], [], [u'MASTER=spark://host:7077', u'./bin/run-example', u'SparkPi'], [], [u'Many', u'of', u'the', u'example', u'programs', u'print', u'usage', u'help', u'if', u'no', u'params', u'are', u'given.'], [], [u'##', u'Running', u'Tests'], [], [u'Testing', u'first', u'requires', u'[building', u'Spark](#building-spark).', u'Once', u'Spark', u'is', u'built,', u'tests'], [u'can', u'be', u'run', u'using:'], [], [u'./dev/run-tests'], [], [u'Please', u'see', u'the', u'guidance', u'on', u'how', u'to'], [u'[run', u'tests', u'for', u'a', u'module,', u'or', u'individual', u'tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).'], [], [u'##', u'A', u'Note', u'About', u'Hadoop', u'Versions'], [], [u'Spark', u'uses', u'the', u'Hadoop', u'core', u'library', u'to', u'talk', u'to', u'HDFS', u'and', u'other', u'Hadoop-supported'], [u'storage', u'systems.', u'Because', u'the', u'protocols', u'have', u'changed', u'in', u'different', u'versions', u'of'], [u'Hadoop,', u'you', u'must', u'build', u'Spark', u'against', u'the', u'same', u'version', u'that', u'your', u'cluster', u'runs.'], [], [u'Please', u'refer', u'to', u'the', u'build', u'documentation', u'at'], [u'["Specifying', u'the', u'Hadoop', u'Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)'], [u'for', u'detailed', u'guidance', u'on', u'building', u'for', u'a', u'particular', u'distribution', u'of', u'Hadoop,', u'including'], [u'building', u'for', u'particular', u'Hive', u'and', u'Hive', u'Thriftserver', u'distributions.'], [], [u'##', u'Configuration'], [], [u'Please', u'refer', u'to', u'the', u'[Configuration', u'Guide](http://spark.apache.org/docs/latest/configuration.html)'], [u'in', u'the', u'online', u'documentation', u'for', u'an', u'overview', u'on', u'how', u'to', u'configure', u'Spark.']]
>>> textFile.flatMap(lambda line:line.split()).collect()
[u'#', u'Apache', u'Spark', u'Spark', u'is', u'a', u'fast', u'and', u'general', u'cluster', u'computing', u'system', u'for', u'Big', u'Data.', u'It', u'provides', u'high-level', u'APIs', u'in', u'Scala,', u'Java,', u'Python,', u'and', u'R,', u'and', u'an', u'optimized', u'engine', u'that', u'supports', u'general', u'computation', u'graphs', u'for', u'data', u'analysis.', u'It', u'also', u'supports', u'a', u'rich', u'set', u'of', u'higher-level', u'tools', u'including', u'Spark', u'SQL', u'for', u'SQL', u'and', u'DataFrames,', u'MLlib', u'for', u'machine', u'learning,', u'GraphX', u'for', u'graph', u'processing,', u'and', u'Spark', u'Streaming', u'for', u'stream', u'processing.', u'<http://spark.apache.org/>', u'##', u'Online', u'Documentation', u'You', u'can', u'find', u'the', u'latest', u'Spark', u'documentation,', u'including', u'a', u'programming', u'guide,', u'on', u'the', u'[project', u'web', u'page](http://spark.apache.org/documentation.html)', u'and', u'[project', u'wiki](https://cwiki.apache.org/confluence/display/SPARK).', u'This', u'README', u'file', u'only', u'contains', u'basic', u'setup', u'instructions.', u'##', u'Building', u'Spark', u'Spark', u'is', u'built', u'using', u'[Apache', u'Maven](http://maven.apache.org/).', u'To', u'build', u'Spark', u'and', u'its', u'example', u'programs,', u'run:', u'build/mvn', u'-DskipTests', u'clean', u'package', u'(You', u'do', u'not', u'need', u'to', u'do', u'this', u'if', u'you', u'downloaded', u'a', u'pre-built', u'package.)', u'You', u'can', u'build', u'Spark', u'using', u'more', u'than', u'one', u'thread', u'by', u'using', u'the', u'-T', u'option', u'with', u'Maven,', u'see', u'["Parallel', u'builds', u'in', u'Maven', u'3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).', u'More', u'detailed', u'documentation', u'is', u'available', u'from', u'the', u'project', u'site,', u'at', u'["Building', u'Spark"](http://spark.apache.org/docs/latest/building-spark.html).', u'For', u'developing', u'Spark', u'using', u'an', u'IDE,', u'see', u'[Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)', u'and', u'[IntelliJ](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ).', u'##', u'Interactive', u'Scala', u'Shell', u'The', u'easiest', u'way', u'to', u'start', u'using', u'Spark', u'is', u'through', u'the', u'Scala', u'shell:', u'./bin/spark-shell', u'Try', u'the', u'following', u'command,', u'which', u'should', u'return', u'1000:', u'scala>', u'sc.parallelize(1', u'to', u'1000).count()', u'##', u'Interactive', u'Python', u'Shell', u'Alternatively,', u'if', u'you', u'prefer', u'Python,', u'you', u'can', u'use', u'the', u'Python', u'shell:', u'./bin/pyspark', u'And', u'run', u'the', u'following', u'command,', u'which', u'should', u'also', u'return', u'1000:', u'>>>', u'sc.parallelize(range(1000)).count()', u'##', u'Example', u'Programs', u'Spark', u'also', u'comes', u'with', u'several', u'sample', u'programs', u'in', u'the', u'`examples`', u'directory.', u'To', u'run', u'one', u'of', u'them,', u'use', u'`./bin/run-example', u'<class>', u'[params]`.', u'For', u'example:', u'./bin/run-example', u'SparkPi', u'will', u'run', u'the', u'Pi', u'example', u'locally.', u'You', u'can', u'set', u'the', u'MASTER', u'environment', u'variable', u'when', u'running', u'examples', u'to', u'submit', u'examples', u'to', u'a', u'cluster.', u'This', u'can', u'be', u'a', u'mesos://', u'or', u'spark://', u'URL,', u'"yarn"', u'to', u'run', u'on', u'YARN,', u'and', u'"local"', u'to', u'run', u'locally', u'with', u'one', u'thread,', u'or', u'"local[N]"', u'to', u'run', u'locally', u'with', u'N', u'threads.', u'You', u'can', u'also', u'use', u'an', u'abbreviated', u'class', u'name', u'if', u'the', u'class', u'is', u'in', u'the', u'`examples`', u'package.', u'For', u'instance:', u'MASTER=spark://host:7077', u'./bin/run-example', u'SparkPi', u'Many', u'of', u'the', u'example', u'programs', u'print', u'usage', u'help', u'if', u'no', u'params', u'are', u'given.', u'##', u'Running', u'Tests', u'Testing', u'first', u'requires', u'[building', u'Spark](#building-spark).', u'Once', u'Spark', u'is', u'built,', u'tests', u'can', u'be', u'run', u'using:', u'./dev/run-tests', u'Please', u'see', u'the', u'guidance', u'on', u'how', u'to', u'[run', u'tests', u'for', u'a', u'module,', u'or', u'individual', u'tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).', u'##', u'A', u'Note', u'About', u'Hadoop', u'Versions', u'Spark', u'uses', u'the', u'Hadoop', u'core', u'library', u'to', u'talk', u'to', u'HDFS', u'and', u'other', u'Hadoop-supported', u'storage', u'systems.', u'Because', u'the', u'protocols', u'have', u'changed', u'in', u'different', u'versions', u'of', u'Hadoop,', u'you', u'must', u'build', u'Spark', u'against', u'the', u'same', u'version', u'that', u'your', u'cluster', u'runs.', u'Please', u'refer', u'to', u'the', u'build', u'documentation', u'at', u'["Specifying', u'the', u'Hadoop', u'Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)', u'for', u'detailed', u'guidance', u'on', u'building', u'for', u'a', u'particular', u'distribution', u'of', u'Hadoop,', u'including', u'building', u'for', u'particular', u'Hive', u'and', u'Hive', u'Thriftserver', u'distributions.', u'##', u'Configuration', u'Please', u'refer', u'to', u'the', u'[Configuration', u'Guide](http://spark.apache.org/docs/latest/configuration.html)', u'in', u'the', u'online', u'documentation', u'for', u'an', u'overview', u'on', u'how', u'to', u'configure', u'Spark.']

reduceByKey()

reduceByKey就是==对元素为KV对的RDD中Key相同的元素的Value进行reduce==，因此，Key相同的多个元素的值被reduce为一个值，然后与原RDD中的Key组成一个新的KV对。

现计算单词的个数（并没有去除特殊字符）：

>>> textFile.flatMap(lambda line:line.split()).map(lambda word:(word,1)).reduceByKey(lambda a,b:a+b).collect()
[(u'when', 1), (u'R,', 1), (u'including', 3), (u'computation', 1), (u'using:', 1), (u'guidance', 2), (u'Scala,', 1), (u'environment', 1), (u'only', 1), (u'rich', 1), (u'Apache', 1), (u'sc.parallelize(range(1000)).count()', 1), (u'Building', 1), (u'And', 1), (u'guide,', 1), (u'return', 2), (u'Please', 3), (u'[Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)', 1), (u'Try', 1), (u'not', 1), (u'Spark', 15), (u'scala>', 1), (u'Note', 1), (u'cluster.', 1), (u'./bin/pyspark', 1), (u'params', 1), (u'through', 1), (u'GraphX', 1), (u'[run', 1), (u'abbreviated', 1), (u'For', 3), (u'##', 8), (u'library', 1), (u'see', 3), (u'"local"', 1), (u'[Apache', 1), (u'will', 1), (u'#', 1), (u'processing,', 1), (u'for', 11), (u'[building', 1), (u'Maven', 1), (u'["Parallel', 1), (u'provides', 1), (u'print', 1), (u'supports', 2), (u'built,', 1), (u'[params]`.', 1), (u'available', 1), (u'run', 7), (u'tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).', 1), (u'This', 2), (u'Hadoop,', 2), (u'Tests', 1), (u'example:', 1), (u'-DskipTests', 1), (u'Maven](http://maven.apache.org/).', 1), (u'thread', 1), (u'programming', 1), (u'running', 1), (u'against', 1), (u'site,', 1), (u'comes', 1), (u'package.', 1), (u'and', 11), (u'package.)', 1), (u'prefer', 1), (u'documentation,', 1), (u'submit', 1), (u'tools', 1), (u'use', 3), (u'from', 1), (u'[project', 2), (u'./bin/run-example', 2), (u'fast', 1), (u'systems.', 1), (u'<http://spark.apache.org/>', 1), (u'Hadoop-supported', 1), (u'way', 1), (u'README', 1), (u'MASTER', 1), (u'engine', 1), (u'building', 2), (u'usage', 1), (u'instance:', 1), (u'with', 4), (u'protocols', 1), (u'IDE,', 1), (u'this', 1), (u'setup', 1), (u'shell:', 2), (u'project', 1), (u'following', 2), (u'distribution', 1), (u'detailed', 2), (u'have', 1), (u'stream', 1), (u'is', 6), (u'higher-level', 1), (u'tests', 2), (u'1000:', 2), (u'sample', 1), (u'["Specifying', 1), (u'Alternatively,', 1), (u'file', 1), (u'need', 1), (u'You', 4), (u'instructions.', 1), (u'different', 1), (u'programs,', 1), (u'storage', 1), (u'same', 1), (u'machine', 1), (u'Running', 1), (u'which', 2), (u'you', 4), (u'A', 1), (u'About', 1), (u'sc.parallelize(1', 1), (u'locally.', 1), (u'Hive', 2), (u'optimized', 1), (u'uses', 1), (u'Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)', 1), (u'variable', 1), (u'The', 1), (u'data', 1), (u'a', 8), (u'"yarn"', 1), (u'Thriftserver', 1), (u'processing.', 1), (u'./bin/spark-shell', 1), (u'Python', 2), (u'Spark](#building-spark).', 1), (u'clean', 1), (u'the', 22), (u'requires', 1), (u'talk', 1), (u'help', 1), (u'Hadoop', 3), (u'-T', 1), (u'high-level', 1), (u'its', 1), (u'web', 1), (u'Shell', 2), (u'how', 2), (u'graph', 1), (u'run:', 1), (u'should', 2), (u'to', 14), (u'module,', 1), (u'given.', 1), (u'directory.', 1), (u'must', 1), (u'do', 2), (u'Programs', 1), (u'Many', 1), (u'YARN,', 1), (u'using', 5), (u'Example', 1), (u'Once', 1), (u'Spark"](http://spark.apache.org/docs/latest/building-spark.html).', 1), (u'Because', 1), (u'name', 1), (u'Testing', 1), (u'refer', 2), (u'Streaming', 1), (u'[IntelliJ](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ).', 1), (u'SQL', 2), (u'them,', 1), (u'analysis.', 1), (u'set', 2), (u'Scala', 2), (u'thread,', 1), (u'individual', 1), (u'examples', 2), (u'runs.', 1), (u'Pi', 1), (u'More', 1), (u'Python,', 2), (u'Versions', 1), (u'find', 1), (u'version', 1), (u'wiki](https://cwiki.apache.org/confluence/display/SPARK).', 1), (u'`./bin/run-example', 1), (u'Configuration', 1), (u'command,', 2), (u'Maven,', 1), (u'core', 1), (u'Guide](http://spark.apache.org/docs/latest/configuration.html)', 1), (u'MASTER=spark://host:7077', 1), (u'Documentation', 1), (u'downloaded', 1), (u'distributions.', 1), (u'Spark.', 1), (u'["Building', 1), (u'by', 1), (u'on', 5), (u'package', 1), (u'of', 5), (u'changed', 1), (u'pre-built', 1), (u'Big', 1), (u'3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).', 1), (u'or', 3), (u'learning,', 1), (u'locally', 2), (u'overview', 1), (u'one', 3), (u'(You', 1), (u'Online', 1), (u'versions', 1), (u'your', 1), (u'threads.', 1), (u'APIs', 1), (u'SparkPi', 2), (u'contains', 1), (u'system', 1), (u'`examples`', 2), (u'start', 1), (u'build/mvn', 1), (u'easiest', 1), (u'basic', 1), (u'more', 1), (u'option', 1), (u'that', 2), (u'N', 1), (u'"local[N]"', 1), (u'DataFrames,', 1), (u'particular', 2), (u'be', 2), (u'an', 4), (u'than', 1), (u'Interactive', 2), (u'builds', 1), (u'developing', 1), (u'programs', 2), (u'cluster', 2), (u'can', 7), (u'example', 3), (u'are', 1), (u'Data.', 1), (u'mesos://', 1), (u'computing', 1), (u'URL,', 1), (u'in', 6), (u'general', 2), (u'To', 2), (u'at', 2), (u'1000).count()', 1), (u'if', 4), (u'built', 1), (u'no', 1), (u'Java,', 1), (u'MLlib', 1), (u'also', 4), (u'other', 1), (u'build', 4), (u'online', 1), (u'several', 1), (u'HDFS', 1), (u'[Configuration', 1), (u'class', 2), (u'>>>', 1), (u'spark://', 1), (u'page](http://spark.apache.org/documentation.html)', 1), (u'documentation', 3), (u'It', 2), (u'graphs', 1), (u'./dev/run-tests', 1), (u'configure', 1), (u'<class>', 1), (u'first', 1), (u'latest', 1)]

缓存

当要重复访问一个小的数据集时，把它缓存到Spark集群内存中是很有用的。例如在一个小的hot数据集中查询，或运行一个像网页搜索排序这样的重复算法。

>>>textFile.cache()

独立应用程序

可以使用Spark API来写一个独立的应用。
如：

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "/opt/spark/spark-2.0.0-bin-hadoop2.7/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))