Spark
Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Cocktail_py
Spark 3.x cluster deployment
1. ZooKeeper cluster deployment

    # Configure hosts resolution
    cat > /etc/hosts <<EOF
    10.8.40.222 spark01
    10.8.111.220 spark02
    10.8.69.253 spark03
    EOF
    # Set up passwordless SSH: the active and standby master nodes must reach
    # every worker node without a password; run on all 3 nodes:
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    ssh-copy-id spark01
    ssh-copy-id spark02
    ssh- (preview truncated)

(Original post, 2021-11-04)
PySpark SQL: join queries against MySQL
1. Install the MySQL driver

    https://dev.mysql.com/downloads/connector/j/
    cp mysql-connector-java-8.0.27.jar $SPARK_HOME/jars

2. Read MySQL from PySpark

    # -*- coding: utf-8 -*-
    # @Time : 2021/10/29 20:23
    # @Author :
    from pyspark import SparkConf
    from pyspark.sql import SparkS (preview truncated)

(Original post, 2021-11-02)
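The truncated snippet above reads MySQL through Spark's JDBC data source. A minimal sketch of the full read pattern, assuming an existing SparkSession and the Connector/J jar already copied into $SPARK_HOME/jars; the host, database, table, and credentials below are placeholders, not values from the post:

```python
# JDBC read sketch; connection details are illustrative placeholders
JDBC_URL = "jdbc:mysql://spark01:3306/testdb?useSSL=false"

def read_mysql_table(spark, url, table, user, password):
    """Load one MySQL table as a Spark DataFrame via the JDBC source."""
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("driver", "com.mysql.cj.jdbc.Driver")  # Connector/J 8.x class
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .load())
```

For the join query in the title, load each table this way, register both DataFrames with `createOrReplaceTempView`, and express the join in `spark.sql(...)`.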
Single-node Spark deployment on CentOS && PySpark DataFrame quick start
1. Single-node Spark deployment

    # 1. Install a Java environment
    #    (omitted)
    # 2. Install a Scala environment
    wget https://downloads.lightbend.com/scala/2.13.6/scala-2.13.6.tgz
    tar -zxvf scala-2.13.6.tgz
    cd scala-2.13.6
    # Edit the profile to set SCALA_HOME
    vim /etc/profile
    export SCALA_HOME=/opt/scala-2.13.6
    export PATH=$PATH:$JAVA (preview truncated)

(Original post, 2021-10-30)
Pinning the Python version for the pyspark shell
    # Find the interpreter path first: whereis python3
    sudo -u hdfs pyspark \
      --conf "spark.pyspark.driver.python=/usr/bin/python3" \
      --conf "spark.pyspark.python=/usr/bin/python3"

Reference: PySpark spark-submit: submitting cluster jobs and bundling virtualenv dependency packages

(Original post, 2021-05-06)
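The same interpreter pinning can also be done from inside a driver script instead of shell flags, since Spark honors the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment variables when they are set before the SparkSession starts. A sketch, using the /usr/bin/python3 path from the post (adjust to whatever `whereis python3` reports):

```python
import os

# Must be set before the SparkContext/SparkSession is created;
# Spark launches its Python workers with this interpreter.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
```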
Could not find valid SPARK_HOME while searching ['/home/user', '/home/user/.local/bin']
    pip3 uninstall pyspark
    pip3 install --user pyspark

Reference: Running pyspark after pip install pyspark

(Repost, 2021-02-03)
IP address lookup with PySpark
The machine didn't have enough memory to run this job; recording the code for now.

    # -*- coding: utf-8 -*-
    # @Time : 2019/12/16 20:05
    # @Author :
    import os
    from pyspark.sql import SparkSession
    from pyspark import SparkConf
    os.environ['PYSPARK_PYTHON'] = "/u (preview truncated)

(Original post, 2019-12-17)
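An IP lookup job like this typically broadcasts a sorted (start, end, region) table and maps every log IP through a binary search. The two core helpers can be sketched in plain Python with no Spark required; the region table below is made up for illustration:

```python
import bisect

def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 address into a 32-bit integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Hypothetical region table, sorted by range start
RANGES = [
    (ip_to_int("1.0.1.0"), ip_to_int("1.0.3.255"), "region-A"),
    (ip_to_int("1.0.8.0"), ip_to_int("1.0.15.255"), "region-B"),
]
STARTS = [start for start, _, _ in RANGES]

def lookup(ip: str) -> str:
    """Binary-search the range table for the region owning `ip`."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return "unknown"
```

In the pyspark version, the table would be built on the driver, shipped with `sc.broadcast`, and `lookup` applied inside a `map` over the parsed log RDD.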
Clickstream log analysis with PySpark
Prepare the log

    # cat access.log
    194.237.142.21 - - [18/Sep/2019:06:49:18 +0000] "GET /wp-content/uploads/2019/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
    183.49.46.228 - - [18/Sep/2019:0 (preview truncated)

(Original post, 2019-12-15)
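A first pass over a log like this usually counts page views per client IP, and the parsing step is the interesting part. A plain-Python sketch using the sample line above; the regex assumes the combined log format shown, and `Counter` stands in for what `reduceByKey` would do in the Spark job:

```python
import re
from collections import Counter

# Combined-log-format fields: ip, timestamp, request, status, size
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<size>\d+)'
)

def parse_line(line: str):
    """Return (ip, path, status) for one access-log line, or None if malformed."""
    m = LOG_RE.match(line)
    return (m.group("ip"), m.group("path"), int(m.group("status"))) if m else None

SAMPLE = ('194.237.142.21 - - [18/Sep/2019:06:49:18 +0000] '
          '"GET /wp-content/uploads/2019/07/rstudio-git3.png HTTP/1.1" '
          '304 0 "-" "Mozilla/4.0 (compatible;)"')

hits = [rec for rec in map(parse_line, [SAMPLE]) if rec]
pv_per_ip = Counter(ip for ip, _, _ in hits)  # page views per client IP
```

In pyspark the same shape becomes `sc.textFile("access.log").map(parse_line).filter(bool)` followed by a `(ip, 1)` map and `reduceByKey`.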
Common basic PySpark operations
1. Creating an RDD from external data

    # 1. Create an RDD from local data
    # Prepare a local file
    cat myspark.txt
    > hello my name is Cocktail_py
    > welcome to my blogs
    # In the pyspark shell
    rdd = sc.textFile("file:////root/myspark.txt")
    rdd.collect()
    > ['hel (preview truncated)

(Original post, 2019-12-14)
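The canonical next step after `sc.textFile` is a wordcount. What `flatMap` / `map` / `reduceByKey` compute can be sketched in plain Python against the same two lines, no Spark required:

```python
from collections import defaultdict

lines = ["hello my name is Cocktail_py", "welcome to my blogs"]

# flatMap: one line in, many words out
words = [w for line in lines for w in line.split()]
# map: pair every word with an initial count of 1
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n
```

The pyspark equivalent is `rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`.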
Building a Spark 2.4.4 cluster
Spark is an in-memory distributed computing framework. This build follows on from <<Hadoop 2.7.7 HA fully distributed cluster setup>>.

1. Download the packages

    cd /usr/local
    # Download the Scala package
    wget https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
    # Download the Spark package
    wget https://m (preview truncated)

(Original post, 2019-12-10)