Test data:
146.202.84.90 江西 2020-10-28 1603879301437 6285032924209569490 www.jd.com Login
146.202.84.90 江西 2020-10-28 1603879301438 6285032924209569490 www.gome.com.cn Login
146.202.84.90 江西 2020-10-28 1603879301438 6285032924209569490 www.taobao.com Comment
118.62.67.216 北京 2020-10-28 1603879301438 2988409670998681798 www.dangdang.com Click
118.62.67.216 北京 2020-10-28 1603879301438 2988409670998681798 www.suning.com Click
118.62.67.216 北京 2020-10-28 1603879301439 2988409670998681798 www.gome.com.cn Comment
100.214.27.58 河北 2020-10-28 1603879301441 6531278323337129900 www.taobao.com View
100.214.27.58 河北 2020-10-28 1603879301444 6531278323337129900 www.taobao.com Click
100.214.27.58 河北 2020-10-28 1603879301444 6531278323337129900 www.mi.com Regist
42.222.37.182 香港 2020-10-28 1603879301444 4579529561379204385 www.dangdang.com View
42.222.37.182 香港 2020-10-28 1603879301444 4579529561379204385 www.baidu.com Regist
42.222.37.182 香港 2020-10-28 1603879301445 4579529561379204385 www.suning.com Comment
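Each record carries seven fields: IP address, region, date, millisecond timestamp, user ID, site, and operation. The following is a minimal sketch of parsing one line into a typed record. It assumes the fields are tab-separated, as the `split("\t")` call in the example code implies (the listing above renders them with spaces). The `AccessLog` case class and the `parseLine` helper are hypothetical names introduced for illustration only.

```scala
// Hypothetical record type for one line of the access log (illustrative names).
final case class AccessLog(
    ip: String,
    region: String,
    date: String,
    ts: Long,
    uid: String,
    site: String,
    operation: String
)

object AccessLogParser {
  // Returns None instead of throwing when a line does not have exactly
  // seven tab-separated fields or when the timestamp is not a valid Long.
  def parseLine(line: String): Option[AccessLog] =
    line.split("\t", -1) match {
      case Array(ip, region, date, ts, uid, site, op) =>
        scala.util.Try(ts.toLong).toOption
          .map(t => AccessLog(ip, region, date, t, uid, site, op))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val line =
      "146.202.84.90\t江西\t2020-10-28\t1603879301437\t6285032924209569490\twww.jd.com\tLogin"
    println(parseLine(line))             // a defined record
    println(parseLine("malformed line")) // None
  }
}
```

Returning an `Option` keeps a single malformed line from failing the whole job, whereas indexing `arr(6)` directly throws on a short line. In a Spark job, something like `ds.flatMap(line => AccessLogParser.parseLine(line).toList)` could be used to drop bad lines, though that usage is an assumption and not part of the original example.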
Example code:
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

/**
 * Load a Dataset of tuples into a DataFrame.
 */
object ReadTupleDataSetToDF {
  def main(args: Array[String]): Unit = {
    val session: SparkSession = SparkSession.builder()
      .master("local")
      .appName("ReadTupleDataSetToDF")
      .getOrCreate()
    session.sparkContext.setLogLevel("ERROR")

    // Read the raw log file as a Dataset of lines.
    val ds: Dataset[String] = session.read.textFile("T:/code/spark_scala/data/pvuvdata")
    import session.implicits._

    // Split each tab-separated line into a 7-field tuple.
    val tupleDs: Dataset[(String, String, String, String, String, String, String)] = ds.map(line => {
      // e.g. 126.54.121.136 浙江 2020-07-13 1594648118250 4218643484448902621 www.jd.com Comment
      val arr: Array[String] = line.split("\t")
      (arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6))
    })

    // Name the tuple columns and register the DataFrame as a temporary view.
    val frame: DataFrame = tupleDs.toDF("ip", "local", "date", "ts", "uid", "site", "operator")
    frame.createOrReplaceTempView("t")

    // pv: number of rows per site
    session.sql(
      """
        |select site, count(*) as pv from t group by site order by pv
        |""".stripMargin).show()

    // uv: number of distinct IPs per site
    session.sql(
      """
        |select site, count(*) uv from (select distinct ip, site from t) t1 group by site order by uv
        |""".stripMargin).show()
  }
}
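The two SQL statements can also be expressed with the DataFrame API. The following is a minimal sketch rather than the original author's code: it builds a small in-memory DataFrame from four of the sample rows (keeping only the `ip` and `site` columns, which are all the queries use) instead of reading the file, and the object name `PvUvDataFrameApi` is an illustrative choice.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, countDistinct}

object PvUvDataFrameApi {
  // pv: number of rows per site, ascending by count.
  def pv(df: DataFrame): DataFrame =
    df.groupBy("site").agg(count("*").as("pv")).orderBy("pv")

  // uv: number of distinct IPs per site, ascending by count.
  // countDistinct ignores null IPs; that matches the SQL version here
  // only because every parsed row is assumed to have a non-null ip.
  def uv(df: DataFrame): DataFrame =
    df.groupBy("site").agg(countDistinct("ip").as("uv")).orderBy("uv")

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local")
      .appName("PvUvDataFrameApi")
      .getOrCreate()
    session.sparkContext.setLogLevel("ERROR")
    import session.implicits._

    // Four (ip, site) pairs taken from the sample rows.
    val df = Seq(
      ("146.202.84.90", "www.taobao.com"),
      ("100.214.27.58", "www.taobao.com"),
      ("100.214.27.58", "www.taobao.com"),
      ("100.214.27.58", "www.mi.com")
    ).toDF("ip", "site")

    pv(df).show()
    uv(df).show()
    session.stop()
  }
}
```

Which form to use is largely a matter of taste: the SQL strings are easier to read for analysts, while the API version is checked by the Scala compiler for method names (though not for column names).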
Output of ReadTupleDataSetToDF (pv table first, then uv table):
+----------------+-----+
| site| pv|
+----------------+-----+
| www.baidu.com|18293|
| www.suning.com|18320|
| www.taobao.com|18375|
|www.dangdang.com|18576|
| www.gome.com.cn|18587|
| www.jd.com|18600|
| www.mi.com|18667|
+----------------+-----+
+----------------+-----+
| site| uv|
+----------------+-----+
| www.suning.com|15442|
| www.baidu.com|15489|
| www.taobao.com|15582|
|www.dangdang.com|15609|
| www.mi.com|15619|
| www.gome.com.cn|15672|
| www.jd.com|15683|
+----------------+-----+
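To make the difference between the two tables concrete, the same aggregation can be applied to the 12 sample rows with plain Scala collections. This is only an illustrative sketch: the `PvUvSample` object is a hypothetical name, and only the `ip` and `site` fields are kept. It mirrors the queries' definition of uv as the number of distinct IPs per site.

```scala
object PvUvSample {
  // (ip, site) pairs taken from the 12 sample rows.
  val rows: Seq[(String, String)] = Seq(
    ("146.202.84.90", "www.jd.com"),
    ("146.202.84.90", "www.gome.com.cn"),
    ("146.202.84.90", "www.taobao.com"),
    ("118.62.67.216", "www.dangdang.com"),
    ("118.62.67.216", "www.suning.com"),
    ("118.62.67.216", "www.gome.com.cn"),
    ("100.214.27.58", "www.taobao.com"),
    ("100.214.27.58", "www.taobao.com"),
    ("100.214.27.58", "www.mi.com"),
    ("42.222.37.182", "www.dangdang.com"),
    ("42.222.37.182", "www.baidu.com"),
    ("42.222.37.182", "www.suning.com")
  )

  // pv: number of rows per site.
  def pv(rows: Seq[(String, String)]): Map[String, Int] =
    rows.groupBy(_._2).map { case (site, hits) => site -> hits.size }

  // uv: number of distinct IPs per site (deduplicate pairs, then count).
  def uv(rows: Seq[(String, String)]): Map[String, Int] =
    rows.distinct.groupBy(_._2).map { case (site, hits) => site -> hits.size }

  def main(args: Array[String]): Unit = {
    println(pv(rows)("www.taobao.com")) // prints 3
    println(uv(rows)("www.taobao.com")) // prints 2
  }
}
```

In the sample, www.taobao.com has a pv of 3 but a uv of 2 because 100.214.27.58 visited it twice. The same effect explains why every uv figure in the full output is lower than the corresponding pv figure.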