pyspark: Reading and Saving RDD Data

Reading Data

hadoopFile

Parameters:

  • path – path to Hadoop file
  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# hadoopFile: returns key-value pairs; the key is the line's byte offset and the value is the line's content
# log.txt:
# http://www.baidu.com
# http://www.google.com
# http://www.google.com
# ...	...		...

rdd = sc.hadoopFile("hdfs://centos03:9000/datas/log.txt",
                    inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                    keyClass="org.apache.hadoop.io.LongWritable",
                    valueClass="org.apache.hadoop.io.Text")
print(rdd.collect())  #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect())  #2

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

#2 [['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]
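For a plain text file like this one, sc.textFile is a simpler alternative that returns only the line contents, without the byte-offset keys. A minimal sketch, assuming the same HDFS path as above:

# textFile returns an RDD of lines (strings only, no offset keys)
rdd_text = sc.textFile("hdfs://centos03:9000/datas/log.txt")
print(rdd_text.collect())
# e.g. ['http://www.baidu.com', 'http://www.google.com', ...]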

newAPIHadoopFile

Parameters:

  • path – path to Hadoop file
  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
# newAPIHadoopFile: returns key-value pairs; the key is the line's byte offset and the value is the line's content
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
                          # inputFormatClass differs from the old-API version
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text")
print(rdd.collect())  #1
rdd1 = rdd.map(lambda x: x[1].split(":"))
print(rdd1.collect())  #2

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]

#2 [['http', '//www.baidu.com'], ['http', '//www.google.com'], ['http', '//www.google.com'], ['http', '//cn.bing.com'], ['http', '//cn.bing.com'], ['http', '//www.baidu.com'], ['http', '//www.sohu.com'], ['http', '//www.sina.com'], ['http', '//www.sin2a.com'], ['http', '//www.sin2desa.com'], ['http', '//www.sindsafa.com']]
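The conf parameter is how additional Hadoop settings reach the InputFormat. As a hedged sketch (assuming the Hadoop version in use honors the standard "textinputformat.record.delimiter" key of the new-API TextInputFormat), a custom record delimiter could be passed like this:

# Extra Hadoop configuration is passed as a dict through conf.
# "textinputformat.record.delimiter" sets the record separator used by the
# new-API TextInputFormat ("\n" here, i.e. the same as the default behavior).
confs = {"textinputformat.record.delimiter": "\n"}
rdd = sc.newAPIHadoopFile("hdfs://centos03:9000/datas/log.txt",
                          inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                          keyClass="org.apache.hadoop.io.LongWritable",
                          valueClass="org.apache.hadoop.io.Text",
                          conf=confs)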

hadoopRDD

Parameters:

  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapred.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
confs = {
   "mapred.input.dir": "hdfs://centos03:9000/datas/log.txt"}
rdd = sc.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                   keyClass="org.apache.hadoop.io.LongWritable",
                   valueClass="org.apache.hadoop.io.Text",
                   conf=confs)
print(rdd.collect())  #1

#1 [(0, 'http://www.baidu.com'), (22, 'http://www.google.com'), (45, 'http://www.google.com'), (68, 'http://cn.bing.com'), (88, 'http://cn.bing.com'), (108, 'http://www.baidu.com'), (130, 'http://www.sohu.com'), (151, 'http://www.sina.com'), (172, 'http://www.sin2a.com'), (194, 'http://www.sin2desa.com'), (219, 'http://www.sindsafa.com')]
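Because hadoopRDD takes its input from the conf dict rather than from a path argument, several inputs can be listed there: mapred.input.dir accepts a comma-separated list of paths. A minimal sketch (the second path is hypothetical, for illustration only):

# mapred.input.dir may list several input paths separated by commas
# (the second path below is hypothetical)
confs = {"mapred.input.dir": "hdfs://centos03:9000/datas/log.txt,"
                             "hdfs://centos03:9000/datas/log2.txt"}
rdd = sc.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                   keyClass="org.apache.hadoop.io.LongWritable",
                   valueClass="org.apache.hadoop.io.Text",
                   conf=confs)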

newAPIHadoopRDD

Parameters:

  • inputFormatClass – fully qualified classname of Hadoop InputFormat (e.g. “org.apache.hadoop.mapreduce.lib.input.TextInputFormat”)
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter – (None by default)
  • valueConverter – (None by default)
  • conf – Hadoop configuration, passed in as a dict (None by default)
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)
confs = {
    "mapreduce.input.fileinputformat.inputdir": "hdfs://centos03:9000/datas/log.txt"}
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf=confs)
print(rdd.collect())
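Each of these read methods has a saving counterpart on RDD (saveAsTextFile, saveAsHadoopFile, saveAsNewAPIHadoopFile, and so on). A minimal sketch using saveAsTextFile, with a hypothetical output directory:

# Saving counterpart: write just the line contents back to HDFS as text
# (the output directory is hypothetical, for illustration only)
rdd.map(lambda kv: kv[1]).saveAsTextFile("hdfs://centos03:9000/datas/log_out")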