If a DataFrame column holds a composite type, i.e., one that has no corresponding SQL column type, writing the DataFrame to SQL fails. When doing machine learning with Spark I often produce feature values (vectors), and writing such a column straight to SQL raises an error like:
Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC type for vector
Spark's JDBC writer maps each column's Catalyst type to a JDBC type, and vector types have no such mapping, so these composite columns must be converted to strings before the DataFrame is stored in the database. I ran into this while doing natural language processing; to keep things simple, the code below uses a small NLP pipeline that turns text into feature vectors and then writes the result to SQL.
package spark2

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StringType

/*
 * For DataFrames whose columns hold composite types: cast every column to
 * String so the DataFrame can be written to a database.
 */
object DF2SQL2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DF2SQL2")
      .master("local").getOrCreate()
    val t3 = Seq(
      (0, "I heard about Spark and I love Spark"),
      (0, "I wish Java could use case classes"),
      (1, "Logistic regression models are neat"))
    val sentenceData = spark.createDataFrame(t3).toDF("label", "sentence")
    // sentenceData.take(10).foreach(println)
    // Tokenizer splits each sentence into words.
    val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
    val wordsData = tokenizer.transform(sentenceData)
    val hashingTF = new HashingTF().setInputCol("words")
      .setOutputCol("rawFeatures").setNumFeatures(100)
    val featurizedData = hashingTF.transform(wordsData)
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
    /*
     * Don't worry if the feature-extraction code above is unfamiliar; the point
     * is simply that rescaledData contains composite column types (i.e., not
     * plain integers, strings, and so on).
     */
    val colNames = rescaledData.columns
    import org.apache.spark.sql.functions._
    var df1 = rescaledData
    for (colName <- colNames) {
      // Cast every column to String; note the import of
      // org.apache.spark.sql.types.StringType above.
      df1 = df1.withColumn(colName, col(colName).cast(StringType))
    }
    df1.show()
    /*
    +-----+--------------------+--------------------+--------------------+--------------------+
    |label|            sentence|               words|         rawFeatures|            features|
    +-----+--------------------+--------------------+--------------------+--------------------+
    |    0|I heard about Spa...|[i, heard, about,...|(100,[5,29,33,40,...|(100,[5,29,33,40,...|
    |    0|I wish Java could...|[i, wish, java, c...|(100,[9,13,29,42,...|(100,[9,13,29,42,...|
    |    1|Logistic regressi...|[logistic, regres...|(100,[4,38,86,93,...|(100,[4,38,86,93,...|
    +-----+--------------------+--------------------+--------------------+--------------------+
    This is df1: its contents match rescaledData, but every column has been
    cast to String.
    */
    val url = "jdbc:mysql://172.30.16.80:3306/test?characterEncoding=UTF-8"
    val table = "test_table"
    val prop = new java.util.Properties
    prop.setProperty("user", "root")
    prop.setProperty("password", "123")
    df1.write.mode("append").jdbc(url, table, prop)
    println("Database write succeeded!")
  }
}
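One refinement: the loop above casts every column to String, including columns such as label that JDBC could already store natively as numbers. You could instead cast only the columns whose type has no JDBC mapping. The helper below is a hypothetical sketch of that idea; the allow-list of "JDBC-friendly" type names is my own assumption for illustration, not Spark's official mapping table.

```scala
// Hypothetical helper: given a column's Spark SQL type name
// (df.schema(name).dataType.typeName), decide whether the column should be
// cast to String before a JDBC write. The allow-list is an assumption for
// illustration only, not an exhaustive copy of Spark's JDBC type mapping.
object JdbcCastHelper {
  private val jdbcFriendly = Set(
    "integer", "long", "short", "byte", "float", "double",
    "decimal", "string", "boolean", "date", "timestamp")

  def needsStringCast(typeName: String): Boolean =
    !jdbcFriendly.contains(typeName.toLowerCase)
}
```

With a helper like this, the cast loop can filter columns, e.g. `for (c <- df1.columns if JdbcCastHelper.needsStringCast(df1.schema(c).dataType.typeName))`, so numeric columns keep their numeric types in MySQL and only the vector columns become strings.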