a. Business requirements
- As business volume has surged recently, new data is collected from the company's servers every day. The ETL team in the big data department cleans it daily and produces a batch of Excel-readable files (.csv). No single file contains duplicate rows; the total volume is roughly 10 million rows, with 400,000 to 500,000 rows per file on average. The business side now wants these files merged on a given attribute (e.g. company name) into one single CSV file, with duplicates removed so that the data stays unique.
b. Analysis
- On the surface the requirement looks simple and could be solved with plain Java, for instance with POI or EasyExcel. But the data volume and the need for reasonable runtime make a purely single-machine approach uncomfortably complex. Is there a simpler, practical way to solve it cleanly? Yes: with Spark SQL the whole job fits in fewer than 100 lines of code.
c. Implementation approach
The overall approach relies on Spark's distributed computing model (a minimal sketch of these steps follows the list):
- Read all of the data into memory
- Register the loaded data as a temporary view using Spark SQL
- Transform it with plain SQL into the result DataFrame and write that out to a target directory on disk
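The sketch below shows those three steps end to end with a SQLContext. The input/output paths and the single name column are illustrative placeholders, not the real schema; the full program follows in section d.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object MergeCsvSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MergeCsvSketch").setMaster("local[*]")
    val sqlContext = new SQLContext(new SparkContext(conf))

    // 1. Read every CSV file under the input directory into memory
    val df: DataFrame = sqlContext.read
      .option("header", "true")
      .csv("C:\\input") // placeholder input directory

    // 2. Expose the data to SQL as a temporary view
    df.createOrReplaceTempView("t_all")

    // 3. Deduplicate with SQL and write the merged result back as one CSV file
    sqlContext.sql("SELECT first(name) AS name FROM t_all GROUP BY name")
      .coalesce(1)
      .write.option("header", "true")
      .csv("C:\\output") // placeholder output directory
  }
}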
d. Implementation
- 1. Open IDEA and create a Maven project
- 2. log4j.properties
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=INFO
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=INFO
log4j.logger.parquet=INFO
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=INFO
log4j.logger.com.bigdata.spark=INFO
- 3. pom.xml
<properties>
<spark.version>2.1.1</spark.version>
<scala.version>2.11.8</scala.version>
<log4j.version>1.2.17</log4j.version>
<slf4j.version>1.7.22</slf4j.version>
</properties>
<dependencies>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<!-- The concrete logging implementation -->
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>${log4j.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!--<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>-->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.47</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-pool2</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-configuration2</artifactId>
<version>2.2</version>
</dependency>
<dependency>
<groupId>commons-beanutils</groupId>
<artifactId>commons-beanutils</artifactId>
<version>1.9.3</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.6</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.1.10</version>
</dependency>
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.4</version>
</dependency>
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-native_2.11</artifactId>
<version>3.2.11</version>
</dependency>
<dependency>
<groupId>org.json4s</groupId>
<artifactId>json4s-jackson_2.11</artifactId>
<version>3.2.11</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<!-- This plugin compiles the Scala sources into class files -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
- 4. The code
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object SparkExcelReadApp {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkExcelReadApp").setMaster("local[*]")
val sc: SparkContext = new SparkContext(conf)
sc.setLogLevel("ERROR")
val fieldSchema = StructType(Array(
StructField("busi_key", StringType, true),
StructField("sde_names", StringType, true),
StructField("sdeid_card_name", StringType, true),
StructField("sde_tax_number", StringType, true),
StructField("sde_card_type", StringType, true),
StructField("anenterprise_name", StringType, true),
StructField("anenterprise_status", StringType, true),
StructField("anenterprise_bodytype", StringType, true),
StructField("anEnterprise_enonce_name", StringType, true)
))
val sqlContext = new SQLContext(sc)
// Read every CSV file in the input directory; in Spark 2.x the
// "com.databricks.spark.csv" format name maps to the built-in CSV data source
val df: DataFrame = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
//.option("inferSchema", true.toString)
.schema(fieldSchema)
.load("C:\\text")
df.createOrReplaceTempView("t_table_init")
// Deduplicate: keep one row (chosen by first()) per company name
val nameDF: DataFrame = sqlContext.sql(
"""SELECT
|first(a.busi_key) AS busi_key,
|first(a.sde_names) AS sde_names,
|first(a.sdeid_card_name) AS sdeid_card_name,
|first(a.sde_tax_number) AS sde_tax_number,
|first(a.sde_card_type) AS sde_card_type,
|first(a.anenterprise_name) AS anenterprise_name,
|first(a.anenterprise_status) AS anenterprise_status,
|first(a.anenterprise_bodytype) AS anenterprise_bodytype,
|first(a.anEnterprise_enonce_name) AS anEnterprise_enonce_name
|FROM t_table_init a
|GROUP BY a.anenterprise_name""".stripMargin
)
// For reference: all rows whose company name appears more than once in the input
val lastDF: DataFrame = sqlContext.sql(
"""SELECT
|busi_key,
|sde_names,
|sdeid_card_name,
|sde_tax_number,
|sde_card_type,
|anenterprise_name,
|anenterprise_status,
|anenterprise_bodytype,
|anEnterprise_enonce_name
|FROM t_table_init
|WHERE anenterprise_name IN (
|SELECT anenterprise_name
|FROM t_table_init
|GROUP BY anenterprise_name
|HAVING COUNT(anenterprise_name) > 1
|)
|ORDER BY anenterprise_name""".stripMargin
)
// coalesce(1) forces a single output CSV file per directory
nameDF.coalesce(1).write.option("header", "true").csv("E:\\测试Excel\\name")
lastDF.coalesce(1).write.option("header", "true").csv("E:\\测试Excel\\out")
/*nameDF.write.option("header", "true").csv("E:\\测试Excel\\name")
lastDF.write.option("header", "true").csv("E:\\测试Excel\\out")*/
println("Done!")
}
}
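As a side note, the same deduplication by company name can also be expressed with the DataFrame API instead of SQL. A minimal sketch, reusing the df from the listing above (the output path is a hypothetical placeholder):

// Equivalent dedup via the DataFrame API: keeps one arbitrary row per anenterprise_name,
// much like the first()/GROUP BY query above
val dedupDF: DataFrame = df.dropDuplicates("anenterprise_name")
dedupDF.coalesce(1).write.option("header", "true").csv("E:\\测试Excel\\name_dropdup") // placeholder output path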
e. Wrap-up
After finishing this requirement it turned out that the resulting CSV file is very large, and the business side ultimately wants the data stored in MongoDB. The same approach extends naturally: the job can be adapted to write into MongoDB, HBase or Elasticsearch instead, since Spark itself is designed to integrate with many external systems. A rough sketch of a MongoDB write is shown below.
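Assuming the MongoDB Spark Connector (mongo-spark-connector_2.11, which is not in the pom above) were added as a dependency, the merged DataFrame could be written out roughly as follows. The URI, database and collection names are placeholders, and the exact option keys may differ between connector versions:

// Hypothetical sketch: write nameDF into MongoDB via the MongoDB Spark Connector
// (requires adding mongo-spark-connector_2.11 to the pom first)
nameDF.write
.format("com.mongodb.spark.sql.DefaultSource")
.mode("append")
.option("uri", "mongodb://127.0.0.1:27017") // placeholder connection string
.option("database", "company_db")           // placeholder database name
.option("collection", "merged_companies")   // placeholder collection name
.save()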