DataFrame operations in Spark 2.2+
Reading data: converting from an RDD
from pyspark.sql.types import Row
def f(x):
    rel = {}
    rel['srcIP'] = x[14]
    rel['desIP'] = x[16]
    rel['signature'] = x[111]
    rel['eventClass'] = x[151]
    return rel

lines = sc.textFile("")
df = lines.map(lambda line: line.split('^')).map(lambda x: Row(**f(x))).toDF()
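What `f` does can be checked in plain Python without a Spark cluster; the sample record and its field values below are hypothetical, but the indices 14/16/111/151 follow the snippet above:

```python
# Plain-Python sketch of the field-extraction helper above; the
# sample record and its values are hypothetical.
def f(x):
    rel = {}
    rel['srcIP'] = x[14]
    rel['desIP'] = x[16]
    rel['signature'] = x[111]
    rel['eventClass'] = x[151]
    return rel

# Fake a caret-delimited record wide enough to cover index 151.
fields = ['-'] * 160
fields[14] = '10.0.0.1'
fields[16] = '10.0.0.2'
fields[111] = '376090'
fields[151] = 'trojan'

row_dict = f('^'.join(fields).split('^'))
print(row_dict['srcIP'])      # 10.0.0.1
print(row_dict['signature'])  # 376090
```

In the real pipeline each such dict is unpacked into a `Row(**f(x))`, and `toDF()` infers the schema from the keys.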
Reading data: from a JSON file
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("")
Method 1: querying with SQL
df.createOrReplaceTempView("datatable")
result = spark.sql("select * from datatable where signature == '376090'")
result.count()
result.limit(10).rdd.map(lambda t: "Name:" + t[0] + ",Age:" + str(t[1])).foreach(print)
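The string concatenation above only works when every field is (or is cast to) a string; a plain-Python sketch of the formatting on a hypothetical row tuple:

```python
# Hypothetical row tuple like those returned by result.limit(10).rdd
t = ('Alice', 21)
# str.format avoids a TypeError when the second field is an int
line = "Name:{},Age:{}".format(t[0], t[1])
print(line)  # Name:Alice,Age:21
```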
Method 2: querying with the DataFrame API
df.printSchema()
df.show(5)
df.select(df.name, df.age + 1).show()
df.filter(df.age > 20).show()
df.groupBy("age").count().show()
df.sort(df.age.desc(), df.name.asc()).show()
df.write.csv("")
DataFrame operations in Spark 1.5
Reading data: converting from an RDD (same `f` as above)
from pyspark.sql.types import Row
lines = sc.textFile("")
df = lines.map(lambda line: line.split('^')).map(lambda x: Row(**f(x))).toDF()
Reading data: from a JSON file
df = sqlContext.read.json("examples/src/main/resources/people.json")
Method 1: querying with SQL
df.registerTempTable("people")
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
Method 2: querying with the DataFrame API
Same as in the Spark 2.2+ section above.
Miscellaneous
Making field names case-sensitive. The original JSON file contains field names in mixed case, which Spark mishandles by default:
spark.sql("set spark.sql.caseSensitive=true")
Filtering an RDD for lines containing a given value:
lines.filter(lambda line: "Spark" in line).count()
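The same membership predicate can be tried in plain Python; the sample lines here are made up:

```python
# The predicate passed to filter(), applied to a local list of lines
lines_local = ["Apache Spark is fast", "Hadoop MapReduce", "Spark SQL"]
matches = [line for line in lines_local if "Spark" in line]
print(len(matches))  # 2
```

On an RDD, `filter` applies the same lambda to each partition in parallel and `count()` triggers the job.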
Querying a Hive table directly (verified working on 1.5.1):
result = sqlContext.sql("select * from table")
# result is a DataFrame; operate on it the same way as above