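DataFrame is the unified interface that Spark recommends for structured data.
It makes fast analysis of structured data possible.
It gives Spark the ability to process structured data at large scale.
It also signals Spark's determination and ambition to unify big data processing.
Through DataFrame, Spark hopes to meet the needs of every kind of data engineer, including R engineers and SQL business analysts.
The basic steps for working with a DataFrame are:
1. Create a sqlContext, which is the entry point for DataFrame. One is usually created automatically when spark-shell starts, and it can also be created by hand in your own program.
2. Read the data and convert it into a DataFrame. The data can come from a JSON file, or from an RDD, JDBC, Hive, MySQL, and so on.
3. After that, we can process the data and display it in various ways.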
df.show()
df.printSchema()
df.select("author").show()
df.filter(df['author'] !="tobe").show()
df.groupBy("author").count().show()
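The calls above are written in a Python-like style. As a rough illustration of what `select`, `filter` and `groupBy(...).count()` compute, here is a plain-Java sketch over a small in-memory list. It is not Spark API: the class `DataFrameOpsSketch`, its method names, the sample rows and the second column `title` are all assumptions made for this example. In the Java DataFrame API the same filter condition would most likely be written as `df.filter(df.col("author").notEqual("tobe"))`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogue of select / filter / groupBy().count(); this is NOT Spark API.
// DataFrameOpsSketch and its method names are hypothetical, chosen for illustration.
final class DataFrameOpsSketch {

    // Each row is {author, title}; "title" is an assumed second column.

    // Analogue of df.select("author"): keep only the author column.
    static List<String> selectAuthor(List<String[]> rows) {
        return rows.stream().map(r -> r[0]).collect(Collectors.toList());
    }

    // Analogue of df.filter(author != excluded): drop rows written by `excluded`.
    static List<String[]> filterAuthorNot(List<String[]> rows, String excluded) {
        return rows.stream()
                .filter(r -> !r[0].equals(excluded))
                .collect(Collectors.toList());
    }

    // Analogue of df.groupBy("author").count(): number of rows per author.
    static Map<String, Long> countByAuthor(List<String[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r[0], TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String[]> rows = Arrays.asList(
                new String[] {"tobe", "first post"},
                new String[] {"alice", "second post"},
                new String[] {"tobe", "third post"});

        System.out.println(selectAuthor(rows));
        System.out.println(filterAuthorNot(rows, "tobe").size());
        System.out.println(countByAuthor(rows));
    }
}
```

The real DataFrame versions run distributed and lazily, and only `show()` triggers the computation. The sketch only mirrors the result each call describes.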
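Below is a sample program that uses DataFrame to implement a text search: it counts the lines in a log that contain ERROR and prints the result.
The Java source is as follows: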
package sparkTest;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TextSearch {

    public static void main(String[] args) {
        String logFile = "file:///home/hadoop/workspace/sparkTest/input/spark-hadoop-org.apache.spark.deploy.master.Master-1-peter-HP-ENVY-Notebook.out"; // Should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Creates a DataFrame having a single column named "line"
        JavaRDD<String> textFile = sc.textFile(logFile);
        JavaRDD<Row> rowRDD = textFile.map(new Function<String, Row>() {
            public Row call(String line) throws Exception {
                return RowFactory.create(line); // build a Row from the String of each line
            }
        });
        System.out.println(rowRDD.toString());

        SQLContext sqlContext = new SQLContext(sc); // create the sqlContext, the entry point for DataFrame
        List<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
        StructType schema = DataTypes.createStructType(fields);
        DataFrame df = sqlContext.createDataFrame(rowRDD, schema); // create the DataFrame by applying the schema to the RDD
        System.out.println(df.toString());

        // col("line") selects the column by name and returns a Column; like() works as SQL LIKE does
        DataFrame errors = df.filter(df.col("line").like("%ERROR%"));
        System.out.println(errors.toString());
        // Counts all the errors
        long errorCount = errors.count();
        // Counts errors mentioning master
        long errorMaster = errors.filter(df.col("line").like("%master%")).count(); // filter the DataFrame by the given condition
        // Fetches the master errors as an array of Rows
        Row[] errorRow = errors.filter(df.col("line").like("%master%")).collect(); // returns an array holding every Row of the DataFrame

        System.out.println("error count is: " + errorCount);
        System.out.println("error count like master is: " + errorMaster);
        // Arrays.toString prints the rows themselves; Row[].toString() would only print the array's identity
        System.out.println("error rows like master are: " + Arrays.toString(errorRow));

        sc.stop(); // release the local Spark context
    }
}
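The program works in four steps:
1. Create an RDD of Row objects.
2. Create a sqlContext from sc, and then create the DataFrame.
3. Find the rows in the DataFrame that contain ERROR.
4. Count those rows, and return them as an array.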
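To make steps 3 and 4 easier to try without a Spark installation, the following plain-Java sketch approximates what the two `like` filters do. It is only an approximation under stated assumptions: `LikeSketch`, `likeToRegex` and `filterLike` are hypothetical names, the log lines are made up, and escape sequences in LIKE patterns are ignored. It also shows that the matching is case-sensitive, as SQL LIKE matching in Spark is, so `%master%` does not match a line that only contains `Master`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java approximation of the two like() filters; no Spark involved.
// LikeSketch, likeToRegex and filterLike are hypothetical names for illustration.
final class LikeSketch {

    // Translate a SQL LIKE pattern into a regex: % matches any run of characters,
    // _ matches exactly one character. Escape handling is left out of this sketch.
    static Pattern likeToRegex(String like) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Keep only the lines that match the whole LIKE pattern (case-sensitive).
    static List<String> filterLike(List<String> lines, String like) {
        Pattern pattern = likeToRegex(like);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up log lines, not taken from the real log file.
        List<String> lines = Arrays.asList(
                "INFO Master: Starting Spark master",
                "ERROR Master: Failed to bind",
                "ERROR Worker: lost connection to master");

        List<String> errors = filterLike(lines, "%ERROR%");
        List<String> masterErrors = filterLike(errors, "%master%");

        System.out.println("error count is: " + errors.size());
        System.out.println("error count like master is: " + masterErrors.size());
        System.out.println("master error rows: " + masterErrors);
    }
}
```

With these sample lines, two lines pass the `%ERROR%` filter and only one of them passes `%master%`, because the line with the capitalized `Master` is rejected. If the real log writes the word only as `Master`, a lowercase pattern would count nothing.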