Exercise 8: Sqoop + PySpark in practice

1. Download and install

Download URL:
http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.9.3.tar.gz
Download sqoop-1.4.6-cdh5.9.3.tar.gz.

Extract it and install it under /home/hadoop/tools/.
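
A minimal shell sketch of this step; renaming the unpacked directory to sqoop is an assumption that matches the SQOOP_HOME path used in the next step:

# download Sqoop, unpack it under /home/hadoop/tools/, and rename the directory
cd /home/hadoop/tools
wget http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.9.3.tar.gz
tar -zxvf sqoop-1.4.6-cdh5.9.3.tar.gz
mv sqoop-1.4.6-cdh5.9.3 sqoop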

2. Edit the environment variables

sudo vim /etc/profile
# 
export PYTHONPATH=/home/hadoop/tools/spark2/python 
export PYSPARK_PYTHON=python3 
 
# pyspark 
export PYSPARK_DRIVER_PYTHON=ipython 
export PYSPARK_DRIVER_PYTHON_OPTS='notebook' 

# SQOOP
export SQOOP_HOME=/home/hadoop/tools/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
export HIVE_CONF_DIR=/home/hadoop/tools/hive/conf
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/
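
The new variables only take effect in a shell that re-reads /etc/profile; a quick check that Sqoop is on the PATH (assuming the paths above are correct):

source /etc/profile
echo $SQOOP_HOME     # should print /home/hadoop/tools/sqoop
sqoop version        # should print the Sqoop 1.4.6-cdh5.9.3 version banner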

3. Edit sqoop-env.sh:

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
export HADOOP_COMMON_HOME=/home/hadoop/tools/hadoop3
export HADOOP_MAPRED_HOME=/home/hadoop/tools/hadoop3
export HIVE_HOME=/home/hadoop/tools/hive
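
A fresh Sqoop distribution ships only a template for this file, so sqoop-env.sh is normally created from the template first (a sketch, assuming the install path above):

# create sqoop-env.sh from the shipped template, then add the exports above
cd /home/hadoop/tools/sqoop/conf
cp sqoop-env-template.sh sqoop-env.sh
vim sqoop-env.sh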

4. Edit bin/configure-sqoop: comment out the checks for HCAT_HOME, ACCUMULO_HOME and ZOOKEEPER_HOME

## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
#  echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
#  echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi

#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
#if [ ! -d "${ZOOKEEPER_HOME}" ]; then
#  echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'
#fi

5. Copy the MySQL JDBC driver (the mysql-connector-java-8.0.11.jar inside mysql-connector-java-8.0.11.zip) into Sqoop's lib directory
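
Sqoop picks up JDBC drivers from $SQOOP_HOME/lib and needs the .jar itself, not the .zip. A short sketch, assuming the zip was downloaded to the current directory and unpacks into a mysql-connector-java-8.0.11/ folder:

# extract the Connector/J archive and copy the driver jar into Sqoop's lib
unzip mysql-connector-java-8.0.11.zip
cp mysql-connector-java-8.0.11/mysql-connector-java-8.0.11.jar $SQOOP_HOME/lib/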

6. Install MySQL 8
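
The later examples assume a database named agydata containing a student table (split on an id column) and a wordcount table with a wordcountcol column. A hypothetical sketch of preparing them; the column types here are illustrative assumptions, not taken from the original setup:

# create the database and tables referenced by the import examples below
# (table layouts are illustrative assumptions)
mysql -u root -p -e "
CREATE DATABASE IF NOT EXISTS agydata;
USE agydata;
CREATE TABLE IF NOT EXISTS student (id INT PRIMARY KEY, name VARCHAR(64));
CREATE TABLE IF NOT EXISTS wordcount (id INT AUTO_INCREMENT PRIMARY KEY, wordcountcol TEXT);
"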

7. Import from MySQL into HDFS

sqoop import \
  --connect 'jdbc:mysql://192.168.1.15:3306/agydata?zeroDateTimeBehavior=round' \
  --username root --password hadoop \
  --query 'select * from student where $CONDITIONS' \
  --target-dir /Hadoop/Input/student \
  -m 3 --fields-terminated-by '\t' --split-by 'id'

-m N starts N map tasks to import the data in parallel (the default is 4); it is best not to set it higher than the number of nodes in the cluster. Because --query is used with more than one mapper, --split-by is required, and Sqoop substitutes each mapper's split predicate for $CONDITIONS.

If --target-dir is omitted, the output goes to /user/<username>/ on HDFS by default.
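
To confirm the import worked, list the target directory and look at one of the part files (a quick check using the paths from the command above):

hadoop fs -ls /Hadoop/Input/student                       # one part-m-* file per map task
hadoop fs -cat /Hadoop/Input/student/part-m-00000 | head  # tab-separated rows from the student table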

8. Word-count test

wordcount.py:

# -*- coding:utf-8 -*-
from pyspark import SparkContext, SparkConf

inputFile = 'hdfs://master:9000/Hadoop/Input/wordcount/part-m*'   # input files imported by Sqoop
outputFile = 'hdfs://master:9000/Hadoop/Output/wordcount'         # output directory (must not exist yet)

# the application name is also shown in the Spark web UI
appName = "wordcount"
# the master host name can also be given as an IP address
master = "spark://master:7077"

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

text_file = sc.textFile(inputFile)

# split each line on commas, map every word to (word, 1), then sum the counts per word
counts = text_file.flatMap(lambda line: line.split(',')) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(outputFile)

sc.stop()

The driver script (called run_wordcount.sh here for illustration) imports the wordcount table into HDFS, clears any old output, submits wordcount.py to Spark, and prints the result:

#!/bin/bash
echo -e "\033[31m ========Running SQOOP to HDFS !!!======== \033[0m"
sqoop import \
  --connect 'jdbc:mysql://192.168.1.15:3306/agydata?zeroDateTimeBehavior=round' \
  --username root --password hadoop \
  --columns "wordcountcol" \
  --delete-target-dir --target-dir /Hadoop/Input/wordcount \
  --mapreduce-job-name mysql2hdfs --table wordcount -m 3

echo -e "\033[31m ========Reading HDFS Data and Running wordcount.py Now !!!======== \033[0m"
hadoop fs -test -e /Hadoop/Output/wordcount
if [ $? -eq 0 ];then
	echo -e "\033[31m ========Deleting old output directory !!!======== \033[0m"
	hadoop fs -rm -r /Hadoop/Output/wordcount
fi

echo -e "\033[31m ========Running Wordcount Job !!!======== \033[0m"
export CURRENT=/home/hadoop/work
$SPARK_HOME/bin/spark-submit $CURRENT/wordcount.py

echo -e "\033[31m ========Result Output !!!======== \033[0m"
hadoop fs -cat /Hadoop/Output/wordcount/*
