Spark can read an entire database table over JDBC as a DataFrame, but doing that naively causes a couple of problems:
- 1. the table may be too large, leading to an OOM;
- 2. many rows that are never needed get read anyway, wasting work in later joins.

This post records how to read only selected rows over JDBC using a `select ... where` query. The Java code is as follows:
```java
package utils;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;

import java.io.Serializable;
import java.util.*;

public class IOUtils implements Serializable {

    /**
     * Reads from PostgreSQL over JDBC, keeping only the rows whose token
     * matches one of the given values.
     * @param sqlContext the SQLContext to read with
     * @param tokens     values to filter on in the WHERE clause
     * @return the filtered DataFrame, or null if tokens is empty
     */
    public static DataFrame readDataFromPSQL(SQLContext sqlContext, List<String> tokens) {
        if (tokens == null || tokens.isEmpty()) {
            return null;
        }
        StringBuilder builder = new StringBuilder();
        for (String s : tokens) {
            builder.append("'").append(s).append("'").append(",");
        }
        // drop the trailing comma before handing the list to the query
        return readDataFromPSQL(sqlContext, builder.deleteCharAt(builder.length() - 1).toString());
    }

    public static DataFrame readDataFromPSQL(SQLContext sqlContext, String token) {
        if (token == null || token.isEmpty()) {
            return null;
        }
        return sqlContext.read().format("jdbc")
                .option("url", "jdbc:postgresql://127.0.0.1:5432/test_macan")
                .option("user", "postgres")
                .option("password", "postgres")
                .option("driver", "org.postgresql.Driver")
                // the subquery (aliased as tmp) is what restricts the rows read
                .option("dbtable", "(select token, fequence from idf_feq where token in (" + token + ")) as tmp")
                .load();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LoadData")
                .setMaster("local[*]");
        Logger logger = Logger.getLogger("org.apache.spark");
        logger.setLevel(Level.WARN);
        JavaSparkContext jsc = new JavaSparkContext(conf);
        final SQLContext sqlContext = new SQLContext(jsc);
        // hdfsData2016 and tokenizer(...) are defined elsewhere in this project
        List<String> tokenList = sqlContext.read().parquet(IOUtils.hdfsData2016)
                .select("text")
                .toJavaRDD()
                .flatMap(new FlatMapFunction<Row, String>() {
                    @Override
                    public Iterable<String> call(Row row) throws Exception {
                        return new ArrayList<>(tokenizer(row.getString(0), 2).keySet());
                    }
                }).collect();
        DataFrame idfDF = IOUtils.readDataFromPSQL(sqlContext, tokenList /*Arrays.asList("百零", "驻紫", "了相关业")*/);
        idfDF.show();
        sqlContext.clearCache();
        jsc.close();
    }
}
```
The trick is the `dbtable` option: instead of a bare table name, pass a parenthesized subquery with an alias, and the database runs the filter before Spark ever sees the data.

```java
.option("dbtable", "(select token, fequence from idf_feq where token in (" + token + ")) as tmp")
```
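When the filter is a range rather than an IN list, `DataFrameReader` also has a `jdbc(url, table, predicates, properties)` overload that pushes one WHERE fragment per partition down to the database and reads the partitions in parallel. A minimal sketch using the same connection settings as above; the range boundaries here are made up for illustration:

```java
import java.util.Properties;

public class PredicateRead {

    // One WHERE fragment per entry -> Spark creates one partition per entry,
    // so the database filters and the read is parallelized at the same time.
    public static String[] buildPredicates() {
        return new String[] {
                "idf_frequence <= 10",
                "idf_frequence > 10 AND idf_frequence <= 100",
                "idf_frequence > 100"
        };
    }

    public static Properties pgProperties() {
        Properties props = new Properties();
        props.put("user", "postgres");
        props.put("password", "postgres");
        props.put("driver", "org.postgresql.Driver");
        return props;
    }

    // Against a live database this would be invoked as:
    // DataFrame df = sqlContext.read().jdbc(
    //         "jdbc:postgresql://127.0.0.1:5432/test_macan",
    //         "idf_feq", buildPredicates(), pgProperties());

    public static void main(String[] args) {
        System.out.println(String.join(" | ", buildPredicates()));
    }
}
```

Note that the predicates should be mutually exclusive and cover the whole range, otherwise rows are duplicated or dropped.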
The test table is simple, just two columns; in practice it stores IDF values, for example:

| token | idf_frequence |
|---|---|
| 北京 | 10 |
| 张三 | 20 |
| … | … |
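One caveat with the read method above: the IN list is built by plain string concatenation, so a token containing a single quote would break the generated SQL (and concatenating untrusted input into SQL is an injection risk). A minimal sketch of a safer builder; the class and method names here are hypothetical, not part of the original code:

```java
import java.util.Arrays;
import java.util.List;

public class InListBuilder {

    // Builds the quoted, comma-separated IN list, doubling any embedded
    // single quotes so the generated SQL literal stays valid.
    public static String buildInList(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append("'").append(tokens.get(i).replace("'", "''")).append("'");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildInList(Arrays.asList("北京", "O'Brien")));
        // prints: '北京','O''Brien'
    }
}
```

This also avoids the delete-the-trailing-comma step by joining with commas up front.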
For completeness, here is the companion method for writing a DataFrame back over JDBC:

```java
/**
 * Writes a DataFrame to PostgreSQL over JDBC.
 * @param dataFrame the DataFrame to write
 * @param table     the target table name
 */
public static void saveDataToPSQL(DataFrame dataFrame, String table) {
    if (dataFrame == null) {
        return;
    }
    Properties properties = new Properties();
    properties.put("user", "postgres");
    properties.put("password", "postgres");
    properties.put("driver", "org.postgresql.Driver");
    properties.put("fetchSize", "5000");
    dataFrame.write().mode(SaveMode.Append).jdbc("jdbc:postgresql://127.0.0.1:5432/test_macan",
            table,
            properties);
}
```
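A note on those properties: `user`, `password`, and `driver` are consumed when the connection is opened, while extra keys such as `fetchSize` are forwarded along with the connection; whether a given key actually takes effect depends on the Spark and driver versions (later Spark versions spell the read option `fetchsize`), so treat that line as version-dependent. Since the read and write methods duplicate the same settings, a small hypothetical helper could keep them in one place:

```java
import java.util.Properties;

public class PgConnection {

    // Shared connection settings for readDataFromPSQL and saveDataToPSQL;
    // centralizing them keeps the two copies from drifting apart.
    public static Properties properties(String user, String password) {
        Properties props = new Properties();
        props.put("user", user);
        props.put("password", password);
        props.put("driver", "org.postgresql.Driver");
        props.put("fetchSize", "5000"); // version-dependent; see note above
        return props;
    }

    public static void main(String[] args) {
        System.out.println(properties("postgres", "postgres"));
    }
}
```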
Summary

If you spot any problems, please point them out.