Reading Data
hadoopFile
Parameters:
- path – path to Hadoop file
- inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapred.TextInputFormat")
- keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
- valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
- keyConverter – (None by default)
- valueConverter – (None by default)
- conf – Hadoop configuration, passed in as a dict (None by default)
- batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# hadoopFile: returns key-value pairs; the key is the byte offset of each line, the value is the line's content
# log.txt:
# http://www.baidu.com
# http://www.google.com
# http://www.google.com
# ... ... ...
rdd = sc.hadoopFile("hdfs://centos03:9000/datas/log.txt",
                    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                    keyClass="org.apache.hadoop.io.LongWritable",
                    valueClass="org.apache.hadoop.io.Text")
print(rdd.collect()) #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect()) #2
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
#2
[['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]
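Since TextInputFormat keys are just byte offsets, they are usually dropped before further processing. A minimal sketch (reusing the rdd read above) that counts how often each URL appears:
# Drop the offset keys, then count occurrences of each URL.
counts = rdd.values().map(lambda url: (url, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('http://www.google.com', 2), ('http://cn.bing.com', 2), ...]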
newAPIHadoopFile
Parameters:
- path – path to Hadoop file
- inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
- keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
- valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
- keyConverter – (None by default)
- valueConverter – (None by default)
- conf – Hadoop configuration, passed in as a dict (None by default)
- batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# newAPIHadoopFile: returns key-value pairs; the key is the byte offset of each line, the value is the line's content
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
                          # inputFormatClass differs from the old API
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text")
print(rdd.collect()) #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect()) #2
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
#2
[['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]
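The conf argument is how extra Hadoop settings reach the new-API InputFormat. The sketch below turns on recursive directory listing; the property name is the standard Hadoop 2 key, and the datas directory with nested subdirectories is an assumption for illustration:
# Hypothetical: read every file under /datas, including nested subdirectories.
conf = {"mapreduce.input.fileinputformat.input.dir.recursive": "true"}  # standard Hadoop 2 property
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas",
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text",
                          conf=conf)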
hadoopRDD
Parameters:
- inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapred.TextInputFormat")
- keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
- valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
- keyConverter – (None by default)
- valueConverter – (None by default)
- conf – Hadoop configuration, passed in as a dict (None by default)
- batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# hadoopRDD: like hadoopFile, but the input path is supplied through the Hadoop conf dict instead of a path argument
confs = {"mapred.input.dir": "hdfs://centos03:9000/datas/log.txt"}
rdd = sc.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                   keyClass="org.apache.hadoop.io.LongWritable",
                   valueClass="org.apache.hadoop.io.Text",
                   conf=confs)
print(rdd.collect()) #1
#1
[(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
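Because the input path travels through the conf dict, the old-API mapred.input.dir key accepts standard Hadoop path syntax, including a comma-separated list of paths. A minimal sketch (log2.txt is a hypothetical second file):
confs = {"mapred.input.dir": "hdfs://centos03:9000/datas/log.txt,hdfs://centos03:9000/datas/log2.txt"}
rdd = sc.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                   keyClass="org.apache.hadoop.io.LongWritable",
                   valueClass="org.apache.hadoop.io.Text",
                   conf=confs)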
newAPIHadoopRDD
Parameters:
- inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
- keyClass – fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
- valueClass – fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
- keyConverter – (None by default)
- valueConverter – (None by default)
- conf – Hadoop configuration, passed in as a dict (None by default)
- batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# newAPIHadoopRDD: the new-API counterpart of hadoopRDD; note the different conf key for the input path
confs = {"mapreduce.input.fileinputformat.inputdir": "hdfs://centos03:9000/datas/log.txt"}
rdd = sc.newAPIHadoopRDD(inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                         keyClass="org.apache.hadoop.io.LongWritable",
                         valueClass="org.apache.hadoop.io.Text",
                         conf=confs)
print(rdd.collect())  # same (offset, line) pairs as hadoopRDD above
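For text input, all four readers produce the same (offset, line) pairs, so downstream code typically starts by discarding the keys. A small sketch against the RDD just read:
lines = rdd.values()             # drop the byte-offset keys
print(lines.distinct().count())  # number of distinct URLs in log.txt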