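SparkContext in practice: RDD operators and SQL-style exercises

This article walks through SQL-style exercises done with SparkContext.

The data sources, Broken to Harness.txt and tags.csv, are read from HDFS and processed with RDD operators on a SparkContext.

Excerpt from Broken to Harness.txt (for reference only)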


Title: Broken to Harness
       A Story of English Domestic Life

Author: Edmund Yates

CHAPTER I.

MR. CHURCHILL'S IDEAS ARE MONASTIC.


The office of the Statesman daily journal was not popular with the neighbours, although its existence unquestionably caused a diminution of rent in its immediate proximity. It was very difficult to find--which was an immense advantage to those connected with it, as no one had any right there but the affiliated; and strangers burning to express their views or to resent imaginary imputations cast upon them had plenty of time to cool down while they wandered about the adjacent lanes in vain quest of their object. If you had business there, and were not thoroughly acquainted with the way, your best plan was to take a sandwich in your pocket, to prepare for an afternoon's campaign, and then to turn to the right out of Fleet Street, down any street leading to the river, and to wander about until you quite unexpectedly came upon your destination. There you found it, a queer, dumpy, black-looking old building,--like a warehouse that had been sat upon and compressed,--nestling down in a quaint little dreary square, surrounded by the halls of Worshipful Companies which had never been heard of save by their own Liverymen, and large churches with an average congregation of nine, standing mildewed and blue-mouldy, with damp voters'-notices peeling off their doors, and green streaks down the stuccoed heads of the angels and cherubim supporting the dripping arch over the porch, in little dank reeking churchyards, where the rank grass overtopped the broken tombstones, and stuck nodding out through the dilapidated railing.

The windows were filthy with the stains of a thousand showers; the paint had blistered and peeled off the heavy old door, and round the gaping chasm of the letter-box; and in the daytime the place looked woebegone and deserted. Nobody came there till about two in the afternoon, when three or four quiet-looking gentlemen would drop in one by one, and after remaining an hour or two, depart as they had come. But at night the old house woke up with a roar; its windows blazed with light; its old sides echoed to the creaking throes of a huge steam-engine; its querulous bell was perpetually being tugged; boys in paper caps and smeary faces and shirt-sleeves were perpetually issuing from its portals, and returning, now with fluttering slips of paper, now with bibulous refreshment. Messengers from the Electric Telegraph Companies were there about every half-hour; and cabs that had dashed up with a stout gentleman in spectacles dashed away with a slim gentleman in a white hat, returning with a little man in a red beard, and flying off with the stout gentleman again. Blinds were down all round the neighbourhood; porters of the Worshipful Companies, sextons of the congregationless churches, agents for printing-ink and Cumberland black-lead, wood-engravers, box-block sellers, and the proprietors of the Never-say-die or Health-restoring Drops, who held the corner premises,--were all sleeping the sleep of the just, or at least doing the best they could towards it, in spite of the reverberation of the steam-engine at the office of the Statesman daily journal.

On a hot night in September Mr. Churchill sat in a large room on the first-floor of the Statesman office. On the desk before him stood a huge battered old despatch-box, overflowing with papers--some in manuscript, neatly folded and docketed; others long printed slips, scored and marked all over with ink-corrections. Immediately in front of him hung an almanac and a packet of half-sheets of note-paper, strung together on a large hook. A huge waste-paper basket by his side was filled, while the floor was littered with envelopes of all sizes and colours, fragments cut from newspapers, ink-splashes, and piles of books in paper parcels waiting for review. A solemn old clock, pointing to midnight, ticked gravely on the mantelpiece; a small library of grim old books of reference, in solemn brown bindings, with the flaming cover of the Post-Office Directory like a star in the midst of them, was ranged against the wall; three or four speaking tubes, with ivory mouthpieces, were curling round Mr. Churchill's feet; and Mr. Churchill himself was reading the last number of the Revue de Deux Mondes by the light of a shaded lamp, when a heavy hand was laid on his shoulder, and a cheery voice said,

"Still at the mill, Churchill? still at the mill?"

"Ah, Harding, my dear fellow, I'm delighted to see you!"

"I should think you were," said Harding, laughing; "for my presence here means a good deal to you,--bed, and rest, and country, eh? Well, how have you been?--not knocked up? You've done capitally, my boy! I've watched you carefully, and am more than content." (For Mr. Harding was the editor of the Statesman, and Churchill, one of his principal contributors, had been taking his place while he made holiday.)

"That's a relief," said Churchill. "I've been rather nervous about it; but I thought that Tooby and I between us had managed to push the ship along somehow. Tooby's a capital fellow!"

"Yes, yes," said Mr. Harding, seating himself; "Tooby is a capital fellow, and there's not a better 'sub' in London. But Tooby couldn't have written that article on the Castle-Hedingham dinner, or shown up the Teaser's blunders in classical quotation, Master Frank. Palman qui meruit. Who did the Bishops and the Crystal Palace?"

"Oh, Slummer wrote those. Weren't they good?"

"Very smart; very smart indeed. A thought too strong of Billingsgate, though. That young man is a very hard hitter, but wants training. Where's Hawker?"

"Just gone. He's been very kind and very useful, so have Williams and Burke, and all. And you--how have you enjoyed yourself?"

"Never so much in my life. I've read nothing but the paper. I've done nothing but lie upon the beach and play with the children."

"And the children--are they all right? and Mrs. Harding?"

"Splendid! I never saw the wife look so well for the last six years. She sent all kind remembrances to you, and the usual inquiry."

"What! if I was going to be married? No, no; you must take back my usual answer. She must find me a wife, and it must be one after her own pattern."

"Seriously, Frank Churchill, it's time you began to look after a wife. In our profession, especially, it's the greatest blessing to have some one to care for and to be petted by in the intervals of business-strife. There used to be a notion that a literary man required to be perpetually 'seeing life,' which meant 'getting drunk, and never going home;' but that's exploded, and I believe that our best character-painters owe half their powers of delineation to their wives' suggestions. Women,--by Jove, sir!--women read character wonderfully."

"Mrs. Harding has made a bad shot at mine, old friend," said Churchill, laughing, "if she thinks that I am in any way desirous to be married. No, no! So far as the seeing life is concerned, I began early, and all that has been over long since. But I've got rather a queer temper of my own. I'm not the most tolerant man in the world; and I've had my own way so long, that any little missy fal-lals and pettishness would jar upon me horribly. Besides, I've not got money enough to marry upon. I like my comforts, and to be able to buy occasional books and pictures, and to keep my horse, and my club, and--"

"Well, but a fellow like you might pick up a woman with money!" said Harding.

"That's the worst pick-up possible,--to have to be civil to your wife's trustees, or listen to reproaches as to how 'poor papa's money' is being spent. No, no, no! So long as my dear old mother lives, I shall have a decent home; and afterwards--well, I shall go into chambers, I suppose, and settle down into a club-haunting old fogey."

"Stuff, Frank; don't talk such rubbish. Affectation of cynicism and affectation of premature age are two of the most pernicious cants of the day. Very likely now at the watering-place to which you're going for your holiday, you'll meet some pretty girl who--"

"Watering-place!" cried Frank, shouting with laughter; "I'm going to my old godfather'
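
The word-count exercise in the code further down strips punctuation from each line of this text, counts words locally, and then merges the partial counts with reduceByKey. The following is a minimal plain-Scala sketch of that idea without Spark. `LocalWordCount`, `countPartition` and `merge` are illustrative names that do not appear in the original code, and the two in-memory sequences only stand in for HDFS partitions.

```scala
object LocalWordCount {
  // Count the words of one "partition" (a batch of lines).
  // This plays the role of the local pre-aggregation done before the shuffle.
  def countPartition(lines: Iterator[String]): Map[String, Int] =
    lines
      // Replace punctuation with a space, then split on runs of whitespace.
      .flatMap(_.replaceAll("[,.!?;\"-]", " ").trim.split("\\s+").toList)
      .filter(_.nonEmpty) // a blank line yields one empty token
      .foldLeft(Map.empty[String, Int])((acc, w) => acc.updated(w, acc.getOrElse(w, 0) + 1))

  // Merge two partial results, which is what reduceByKey(_ + _) does across partitions.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  def main(args: Array[String]): Unit = {
    // Two assumed partitions built from sentences of the excerpt.
    val partitions = Seq(
      Seq("Still at the mill, Churchill? still at the mill?"),
      Seq("Tooby is a capital fellow!", "", "Tooby's a capital fellow!")
    )
    val total = partitions.map(p => countPartition(p.iterator)).reduce(merge)
    total.toSeq.sortBy(-_._2).foreach(println)
  }
}
```

Counting is case-sensitive here, as in the exercise itself, so "Still" and "still" are separate words.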


Excerpt from tags.csv (for reference only)

userId,movieId,tag,timestamp
3,260,classic,1439472355
3,260,sci-fi,1439472256
4,1732,dark comedy,1573943598
4,1732,great dialogue,1573943604
4,7569,so bad it's good,1573943455
4,44665,unreliable narrators,1573943619
4,115569,tense,1573943077
4,115713,artificial intelligence,1573942979
4,115713,philosophical,1573943033
4,115713,tense,1573943042
4,148426,so bad it's good,1573942965
4,164909,cliche,1573943721
4,164909,musical,1573943714
4,168250,horror,1573945163
4,168250,unpredictable,1573945171
19,2160,Oscar (Best Supporting Actress),1446909853
19,7099,adventure,1445286141
19,7099,anime,1445286127
19,7099,ecology,1445286153
19,7099,fantasy,1445286144
19,7099,Hayao Miyazaki,1445286120
19,7099,Miyazaki,1445286148
19,7099,post-apocalyptic,1445286136
20,1210,bah,1155082282
43,434,Clint Eastwood,1170492549
68,3481,music,1472113217
84,194728,art,1549387440
84,194728,contemporary art,1549387437
84,194728,documentary,1549387432
87,1127,aliens,1542308477
87,1127,amazing photography,1542308501
87,1127,Director: James Cameron,1542308487
87,1127,first contact,1542308468
87,1127,James Cameron,1542308492
87,1127,Michael Biehn,1542308483
87,1127,sci-fi,1542308464
87,6537,android(s)/cyborg(s),1542309549
87,6537,apocalypse,1542309703
87,6537,Arnold Schwarzenegger,1542309595
87,6537,artificial intelligence,1542309599
87,6537,franchise,1542309536
87,6537,terminator,1542309735
87,6537,time travel,1542309532
87,72998,James Cameron,1542308389
87,72998,sci-fi,1542308408
87,72998,science fiction,1542308395
87,79132,sci-fi,1522675497
87,102445,inferior sequel,1522677007
87,102445,setting:London (UK) (future),1522677058
87,102445,unoriginal,1522677043
87,104841,bad science,1522676752
87,109487,good science,1522676693
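
Exercise 3 in the code below matches each row against two regular expressions, one for ordinary rows and one for rows whose tag is wrapped in double quotes because it contains a comma. The sketch below tries the same two patterns in plain Scala. `TagParsing` and `parse` are illustrative names, and the quoted sample row is made up for the demonstration since the excerpt above contains no such row.

```scala
object TagParsing {
  final case class Tag(userId: Int, movieId: Int, tag: String, timestamp: Long)

  // The same two patterns as in exercise 3.
  private val plain  = "(.*?),(.*?),(.*?),(.*?)".r
  private val quoted = "(.*?),(.*?),(\".*?,.*?\"+),(.*?)".r

  // The quoted pattern must be tried first: the plain pattern also matches a
  // quoted row, but splits the tag at its inner comma.
  val parse: PartialFunction[String, Tag] = {
    case quoted(u, m, t, ts) => Tag(u.toInt, m.toInt, t, ts.toLong)
    case plain(u, m, t, ts)  => Tag(u.toInt, m.toInt, t, ts.toLong)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "3,260,classic,1439472355",
      "4,1732,\"dark, dry comedy\",1573943598", // hypothetical row with a comma in the tag
      "not a tag row"                           // matches neither pattern and is skipped by collect
    )
    lines.collect(parse).foreach(println)
  }
}
```

The header row matches the plain pattern and would fail in `toInt`, so it still has to be dropped before parsing. The parsed tag also keeps its surrounding quotes.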

// Project-specific helper, not used in this file:
// import cn.kgc.util.mysql.{Batch, MySqlDao}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ArrayBuffer

/*
SparkContext: application
 */
object App {
  def main(args: Array[String]): Unit = {
    val config: SparkConf = new SparkConf()
      .setAppName("spark_rdd_03") // job name
      .setMaster("local[*]") // master: local mode; "*" uses every available CPU core, local[2] would use 2 cores
    val sc = new SparkContext(config)

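 /*
 // 1. Word count
    sc
      .textFile("hdfs://single01:9000/spark/resource/src/Broken to Harness.txt",3) // file path and number of partitions
        .mapPartitions(it=>{    // each partition is an iterator and is processed on its own (with a single partition, map would do)
          it    // iterate line by line
            .filter(_.trim.size>0)  // drop blank lines
            .flatMap(   // one line holds several words, so flatten (if a line were a single record, map would do)
              _
                .replaceAll(",|\\.|!|\\?|;|\"|-"," ") // replace punctuation with a space so that "find--which" does not become "findwhich"
                .trim
                .split("\\s+") // split on runs of whitespace
                .filter(_.nonEmpty) // a punctuation-only line leaves one empty token
                .map((_,1)) // (word, 1)
                .groupBy(_._1) // group by word within the line (advantage: less data to move in the later shuffle, similar to a combiner)
                .map(tp2=>(tp2._1,tp2._2.size)) // (word, count within the line)
            )
        })
        .reduceByKey(_+_) // groups by key (word) and adds up the values of equal keys pairwise
        .foreach(println)
    sc.stop()*/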

    //***************************************************************************************************************
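    // 2. Number of viewings per user (each tags.csv row counts as one viewing)
/*
    sc
    .textFile("hdfs://single01:9000/spark/resource/src/tags.csv",3) // file path and number of partitions
      .mapPartitionsWithIndex((index,it)=>{ // the partition index is passed in together with the iterator
        // The header sits in the first partition. Use the iterator returned by drop:
        // the original iterator must not be reused after drop has been called on it.
        val rows = if(index==0) it.drop(1) else it
        rows.map(line=>{
          val ps:Array[String]=line.split(",")
          (ps(0),1)
        }).toArray
          .groupBy(_._1)
          .map(tp2=>(tp2._1,tp2._2.size)) // (userId, count within the partition)
          .toIterator
      })
      .reduceByKey(_+_)
      .foreach(println)
    sc.stop()*/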


    //***************************************************************************************************************
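    // 3. Per user: the number of movies watched, the three most frequently watched movieIds, and the personal peak viewing period
    //    (within each 3-hour window, the hours (1,2), (2,3) or (1,2,3) that account for more than 80% of the window's total viewings)
    // Data: tags.csv
    // Case class
    case class Tag(userId: Int, movieId: Int, tag: String, timestamp: Long)
    // Parse the rows with regular expressions
    val r1 = "(.*?),(.*?),(.*?),(.*?)".r    // regular rows
    val r2 = "(.*?),(.*?),(\".*?,.*?\"+),(.*?)".r // irregular rows: a quoted tag that contains a comma
    // Partial function that handles both row shapes; r2 must be tried first
    val pf: PartialFunction[String, (Int, Tag)] = (line: String) => line match {
      case r2(a, b, c, d) => (a.toInt, Tag(a.toInt, b.toInt, c, d.toLong))
      case r1(a, b, c, d) => (a.toInt, Tag(a.toInt, b.toInt, c, d.toLong))
    }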
  // Fill in the hours missing from one day's data
    def fillIfNot24(arr:Array[(String,(Int,Int))]):Iterator[(String,(Int,Int,Int))]={
      val map: Map[Int, Int] = arr.map(_._2)
        .groupBy(_._1)  // group by hour
        .map(tp2 => (tp2._1, tp2._2.map(_._2).distinct.size)) // (hour, number of distinct movies watched in that hour)

      val seqToTp3=(seq:IndexedSeq[(Int,Int)])=>(seq(0)._2,seq(1)._2,seq(2)._2)

      (0 to 23 )
        // hours without data get a count of 0
        .map(hour=>(hour,map.getOrElse(hour,0)))
        .sliding(3,1)
        .map(seq=>{
          val h3: String = seq.map(_._1).mkString("_")
          // (3 consecutive hours, distinct-movie count of each of those hours), window step 1
          (h3,seqToTp3(seq))
        })

    }

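    val int3OutTp2=(map:Map[String,(Int,Int,Int)])=>{
      val fstHourSum: Int = map.map(_._2._1).sum
      val sndHourSum: Int = map.map(_._2._2).sum
      val trdHourSum: Int = map.map(_._2._3).sum
      val sum =fstHourSum+sndHourSum+trdHourSum
      val h3 = map.map(_._1).toArray.apply(0)
       val h3s= h3.split("_")

      var accSum =0.0
      val buffer = new ArrayBuffer[Int](1)
      import scala.util.control.Breaks._
      var lastIx = -1
      // Find the indices of the hours inside the 3-hour window that together reach the 80% threshold
      breakable(
        Array(fstHourSum, sndHourSum, trdHourSum)
          .zipWithIndex
          // sort by viewing count, descending
          .sortWith(_._1>_._1)
          .foreach(tp2=>{
            // Always accept the first (largest) hour; after that, only hours adjacent to the previous one.
            // Without the lastIx == -1 test the largest hour was skipped unless its index was 0.
            if(lastIx == -1 || Math.abs(lastIx-tp2._2)<=1){
              accSum +=tp2._1
              buffer.append(tp2._2)
            }
            // stop once 80% is reached; lastIx = -1 marks that the threshold was hit
            if(accSum/sum>=0.8){
              lastIx = -1
              break
            }
            lastIx=tp2._2
          })
      )
    // If the threshold was hit, sort the selected indices ascending and join the matching hours; otherwise keep the whole window
      (sum,if(lastIx != -1) h3 else buffer
        .sortWith(_<_)
        .map(h3s(_))
        .mkString("_"))
    }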

    implicit class LongExpand(v:Long){  // implicit class; v is interpreted as epoch milliseconds
      def toDate()="%tF".format(v)  // date as yyyy-MM-dd (default time zone)
      def toHour()="%tT".format(v).substring(0,2).toInt // hour of the day
    }


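    sc
      .textFile("hdfs://single01:9000/spark/resource/src/tags.csv", 3)
      .mapPartitionsWithIndex((index, it) => {  // the partition index is passed in together with the iterator
        // The header sits in the first partition. Use the iterator returned by drop:
        // the original iterator must not be reused after drop has been called on it.
        val rows = if (index == 0) it.drop(1) else it
        rows.collect(pf)  // parse every row
      })
      // shuffle into 3 partitions, bringing all rows of the same userId together
      .groupByKey(3)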
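      .mapValues(it=>{
        val arr = it.toArray // materialise as an array, since the three metrics all reuse it
        val uqMovieCount: Int= arr.map(_.movieId).distinct.size   // metric 1: number of distinct movies
        val top3MovieId: String = arr   // metric 2: the three most frequently watched movieIds
          .map(tag => (tag.movieId, 1))
          .groupBy(_._1).map(tp2 => (tp2._1, tp2._2.size)) // (movieId, number of times watched)
          .toArray
          .sortWith(_._2 > _._2)
          .take(3)
          .mkString("-")

        val tp2: (Int, String) = arr   // metric 3: peak viewing period
          // tags.csv timestamps are epoch seconds, while toDate/toHour expect milliseconds
          .map(tag => ((tag.timestamp * 1000L).toDate(), ((tag.timestamp * 1000L).toHour(), tag.movieId)))
          // group by day
          .groupBy(_._1)
          // toSeq first: flatMap directly on the Map would rebuild a Map keyed by window
          // and keep only one day per window
          .toSeq
          .flatMap(tp2 => fillIfNot24(tp2._2)) // per day: (window, (count of hour 1, hour 2, hour 3))
          // group all days by 3-hour window and add the per-hour counts up
          .groupBy(_._1)
          .toSeq
          .map { case (h3, days) =>
            val total = days.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
            int3OutTp2(Map(h3 -> total))
          }
          .sortWith(_._1 > _._1)
          .apply(0)
        (uqMovieCount,top3MovieId,tp2._2)
      })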
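        .foreach(println) // action: without one, the lazy pipeline above never runs

      // To write the processed data to HDFS instead, use this action in place of foreach:
//      .saveAsTextFile(s"hdfs://single01:9000/spark/source/src/tags_${System.currentTimeMillis().toDate()}_${System.currentTimeMillis()}")

    sc.stop()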
  }
}
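
The peak-period step in exercise 3 first pads a day to 24 hourly counts and then slides a 3-hour window over it with step 1. The sketch below isolates that step in plain Scala so it can be tried without a cluster. `HourWindows` and `windows` are illustrative names, and the input pairs are a small made-up day of (hour, movieId) viewings.

```scala
object HourWindows {
  // views: (hour, movieId) pairs of a single day.
  // Returns one entry per 3-hour window: ("h_h+1_h+2", (count1, count2, count3)),
  // where each count is the number of distinct movies seen in that hour.
  def windows(views: Seq[(Int, Int)]): Seq[(String, (Int, Int, Int))] = {
    val perHour: Map[Int, Int] =
      views.groupBy(_._1).map { case (h, vs) => h -> vs.map(_._2).distinct.size }

    (0 to 23)
      .map(h => h -> perHour.getOrElse(h, 0)) // hours without data count as 0
      .sliding(3, 1)                          // 22 overlapping windows
      .map(w => (w.map(_._1).mkString("_"), (w(0)._2, w(1)._2, w(2)._2)))
      .toList
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample: movie 260 tagged twice at 9 o'clock, two different movies at 10 o'clock.
    val day = Seq((9, 260), (9, 260), (10, 1732), (10, 7569))
    windows(day).filter(_._2 != ((0, 0, 0))).foreach(println)
  }
}
```

Because every day produces the same 22 window keys, results from several days should be kept in a sequence (not a Map) until they are grouped by window, otherwise later days overwrite earlier ones.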
