Example: running Spark code on a cluster
Java word segmentation usage example (Demo)
@Test
public void testDemo() {
    JiebaSegmenter segmenter = new JiebaSegmenter();
    String[] sentences =
        new String[] {"这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。", "我不喜欢日本和服。", "雷猴回归人间。",
            "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作", "结果婚的和尚未结过婚的"};
    for (String sentence : sentences) {
        // Print the token list produced in INDEX mode for each sentence
        System.out.println(segmenter.process(sentence, SegMode.INDEX).toString());
    }
}
Spark Chinese word segmentation code
package com.badou

import com.huaban.analysis.jieba.{JiebaSegmenter, SegToken}
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object JiebaSeg {
  def main(args: Array[String]): Unit = {
    // Configure serialization for the jieba segmenter class
    val conf = new SparkConf()
      // Register the custom class with Kryo (this also switches the serializer to KryoSerializer)
      .registerKryoClasses(Array(classOf[JiebaSegmenter]))
      // Raise the maximum RPC message size (in MiB)
      .set("spark.rpc.message.maxSize", "800")
    // Create the SparkSession and pass in the conf defined above
    val spark = SparkSession
      .builder()
      .appName("Jieba UDF")
      .enableHiveSupport()
      .config(conf)
      .getOrCreate()

    // Segmentation helper: takes a DataFrame and returns it with one extra column, seg (the segmented text)
    def jieba_seg(df: DataFrame, colname: String): DataFrame = {
      // Instantiate the segmenter on the driver
      val segmenter = new JiebaSegmenter()
      // Ship it to the executors as a broadcast variable (read-only once distributed)
      val seg = spark.sparkContext.broadcast(segmenter)
      val jieba_udf = udf { (sentence: String) =>
        // Fetch the broadcast JiebaSegmenter via .value
        val segV = seg.value
        segV.process(sentence.toString, SegMode.INDEX)
          // Cast each element back to SegToken and keep only its word
          .toArray().map(_.asInstanceOf[SegToken].word)
          // Drop single-character tokens and join the rest with spaces
          .filter(_.length > 1).mkString(" ")
      }
      df.withColumn("seg", jieba_udf(col(colname)))
    }

    val df = spark.sql("select content,label from badou.new_no_seg limit 300")
    val df_seg = jieba_seg(df, "content")
    df_seg.show()
    // saveAsTable writes the result straight into the given Hive table
    df_seg.write.mode("overwrite").saveAsTable("badou.news_jieba")
  }
}
+-------------------------------------+--------+--------------+
|                              content|   label|           seg|
+-------------------------------------+--------+--------------+
|成都猎豹飞腾经典版赠万元精美礼包搜...|    auto| 成都 猎豹 ...|
|预算法修正案二审重申地方不得自行发...|business| 预算 算法 ...|
|黄宗泽曝徐子珊私下犹如男生女人味不...|    yule|黄宗泽 曝 ,...|
+-------------------------------------+--------+--------------+
only showing top 20 rows
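The UDF above keeps only each token's word, drops single-character words, and joins what is left with spaces. A minimal sketch of that post-processing step in plain Java (the language of the demo at the top) could look like the block below. SegPostProcess and joinWords are hypothetical names, not part of jieba or Spark, and the null handling is an added assumption: the original UDF calls sentence.toString directly, which would likely fail on a null content value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of jieba or Spark.
class SegPostProcess {
    // Keeps words longer than one character and joins them with single spaces.
    // A null list yields an empty string (an assumption about the desired behaviour).
    static String joinWords(List<String> words) {
        if (words == null) {
            return "";
        }
        return words.stream()
                .filter(w -> w != null && w.length() > 1)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Words taken from the sample output above; the single-character ones are dropped.
        System.out.println(joinWords(Arrays.asList("黄宗泽", "曝", ",")));
        System.out.println(joinWords(Arrays.asList("成都", "猎豹")));
    }
}
```

Returning an empty string for missing input is only one option; returning null so that the seg column stays null may suit a downstream Hive table better.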
Requirement: segment Chinese text with jieba, then package the job and upload it to the cluster to run.
Packaging command:
mvn clean install assembly:assembly
Upload the file to the Linux system with scp (for example from a Git Bash shell):
scp -rp badou_spark_20-1.0-SNAPSHOT-jar-with-dependencies.jar root@master:/usr/local/src/badou_code/spark/sub
Note: at work, use rm -rf with great care.
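run_cluster.sh (file)
# --class: the main class to run
# --master yarn --deploy-mode cluster: submit to YARN in cluster mode
# --files: distribute hive-site.xml to the cluster
# The last argument is the application jar to upload
cd $SPARK_HOME
./bin/spark-submit \
  --class com.badou.JiebaSeg \
  --master yarn \
  --deploy-mode cluster \
  --files $HIVE_HOME/conf/hive-site.xml \
  /usr/local/src/badou_code/spark/sub/badou_spark_20-1.0-SNAPSHOT-jar-with-dependencies.jar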
run_local.sh (file)
# --master local[2]: run locally with 2 worker threads
cd $SPARK_HOME
./bin/spark-submit \
  --class com.badou.JiebaSeg \
  --master local[2] \
  --files $HIVE_HOME/conf/hive-site.xml \
  /usr/local/src/badou_code/spark/sub/badou_spark_20-1.0-SNAPSHOT-jar-with-dependencies.jar
Swap the segmentation function for the version below
to see why map(_.asInstanceOf[SegToken].word) is needed
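def jieba_seg(df: DataFrame, colname: String): DataFrame = {
  // Instantiate the segmenter
  val segmenter = new JiebaSegmenter()
  val seg = spark.sparkContext.broadcast(segmenter)
  val jieba_udf = udf { (sentence: String) =>
    // Fetch the broadcast JiebaSegmenter via .value
    val segV = seg.value
    // process returns a java.util.List of SegToken, one entry per token
    segV.process(sentence.toString, SegMode.INDEX)
      // toArray() on the Java list yields an array of plain objects; without the cast
      // back to SegToken, mkString falls back to SegToken.toString, i.e. [word, start, end]
      .toArray().mkString(" ")
  }
  df.withColumn("seg", jieba_udf(col(colname)))
}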
+-------------------------------------+--------+-------------------------+
| content| label| seg|
+-------------------------------------+--------+-------------------------+
|成都猎豹飞腾经典版赠万元精美礼包搜...| auto| [成都, 0, 2] [猎豹, 2...|
|预算法修正案二审重申地方不得自行发...|business| [预算, 0, 2] [算法, 1...|
|黄宗泽曝徐子珊私下犹如男生女人味不...| yule| [黄宗泽, 0, 3] [曝, 3...|
|成都英朗自动时尚版优惠现金万元搜狐...| auto| [成都, 0, 2] [英朗, 2...|
|彩合网排列三第期分析:个位路决杀搜...| sports|[彩合网, 0, 3] [排列, ...|
|金志文赞郭浩《其实你不懂我伤悲》男...| yule| [金志, 0, 2] [文赞, 2...|
|女超上海女足轻取长春尤佳一己之力定...| sports| [女超, 0, 2] [上海, 2...|
|从欧洲杯看理财阵型搭配:前锋犹如风...|business| [从, 0, 1] [欧洲, 1,...|
|不辱没无敌兔佳能售搜狐数码佳能报价...| it| [不, 0, 1] [辱没, 1,...|
|全国武术套路赛河南夺得两金项比赛团...| sports| [全国, 0, 2] [武术, 2...|
|法拉利预计年利润增长尚无上市计划搜...| auto|[法拉, 0, 2] [法拉利, ...|
|《向着炮火前进》质检:漫画式抗战剧...| yule| [《, 0, 1] [向着, 1,...|
|黑天鹅来袭金陵药业跌停机构卖出万元...|business| [黑天, 0, 2] [天鹅, 1...|
|全球经济向好家电业四季度将全面复苏...| it| [全球, 0, 2] [经济, 2...|
|郑洁透露晏紫有望中网复出赞张帅是要...| sports| [郑洁, 0, 2] [透露, 2...|
|网上掀起陪驾热价格猛涨搜狐闯红灯一...| it| [网上, 0, 2] [掀起, 2...|
|沪上四成大型企业已建首席信息官制搜...|business| [沪, 0, 1] [上, 1, ...|
| 家公司半年“报忧”周期行业成重灾区...|business| [家, 0, 1] [公司, 1,...|
|天津大众劲取现车有售直降元搜狐汽车...| auto| [天津, 0, 2] [大众, 2...|
|厦门购现代全系均可现金让利搜狐汽车...| auto| [厦门, 0, 2] [购, 2,...|
+-------------------------------------+--------+-------------------------+
only showing top 20 rows
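The seg column above shows what happens without the cast: the no-argument toArray() on a Java list returns Object[], so joining the elements goes through toString() and prints [word, start, end] instead of the bare word. The sketch below reproduces this in plain Java with no jieba dependency. Token is a hypothetical stand-in for SegToken (its field names and toString format are modelled on the output above), TokenJoinDemo is likewise a made-up name, and the offsets are illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for jieba's SegToken: a word plus its start and end offsets.
class Token {
    final String word;
    final int startOffset;
    final int endOffset;

    Token(String word, int startOffset, int endOffset) {
        this.word = word;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    @Override
    public String toString() {
        return "[" + word + ", " + startOffset + ", " + endOffset + "]";
    }
}

class TokenJoinDemo {
    // Joins the elements through toString(), as mkString does on an untyped array.
    static String joinRaw(List<Token> tokens) {
        Object[] raw = tokens.toArray(); // the element type is lost here: Object[]
        StringBuilder sb = new StringBuilder();
        for (Object o : raw) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(o); // calls Token.toString()
        }
        return sb.toString();
    }

    // Casts each element back to Token so that the word field becomes reachable.
    static String joinWords(List<Token> tokens) {
        Object[] raw = tokens.toArray();
        StringBuilder sb = new StringBuilder();
        for (Object o : raw) {
            Token t = (Token) o; // same role as asInstanceOf[SegToken] in the Scala UDF
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative tokens; not actual jieba output.
        List<Token> tokens = Arrays.asList(new Token("成都", 0, 2), new Token("猎豹", 2, 4));
        System.out.println(joinRaw(tokens));   // [成都, 0, 2] [猎豹, 2, 4]
        System.out.println(joinWords(tokens)); // 成都 猎豹
    }
}
```

In the Scala version, map(_.asInstanceOf[SegToken].word) plays the part of joinWords: it restores the element type and then selects only the word.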