Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).
Data science is not a new discipline. However, with the growth of big data and the adoption of big data technologies, the demand for better-quality data has grown exponentially. Today data science is applied to every facet of life: product validation through fault prediction, genome sequence analysis, personalized medicine through population studies and the Patient 360 view, credit card fraud detection, improvement in customer experience through sentiment analysis and purchase patterns, weather forecasting, detecting cyber or terrorist attacks, aircraft maintenance utilizing predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists are detecting patterns in data and providing actionable insights to influence organizational changes.
The data scientist's work broadly involves the acquisition, cleanup, and analysis of data. Being a cross-functional discipline, this work involves communication, collaboration, and interaction with other individuals, internal and possibly external to your organization. This is one reason why the "notebook" features in data analysis tools are gaining popularity: they ease organizing, sharing, and interactively working with long workflows. IPython Notebook is a great example, but it is limited to the Python language. Apache Zeppelin (incubating at the time of this writing) is a new web-based notebook that enables data-driven, interactive data analytics and visualization, with the added bonus of supporting multiple languages, including Python, Scala, Spark SQL, Hive, Shell, and Markdown. Zeppelin also provides Apache Spark integration by default, making use of Spark's fast, in-memory, distributed data processing engine to accomplish data science at lightning speed.
This post demonstrates how easy it is to install Apache Zeppelin notebook on CDH (for dev/test only, not supported). We assume familiarity with Linux (especially CentOS) commands, installation, and configuration.
System Setup and Configuration
Components
Listed below are the specs of our test Hadoop cluster.
Installed hardware
Installed software
These installation commands are specific to CentOS. If you do not log in as 'root', you must use sudo for all the commands.
- Update CentOS packages (yum update).
- Install the latest version of Java, preferably version 1.7 or later (yum install java-1.8.0-openjdk-devel).
- Install Git (yum install git).
- Install Node.js and npm (yum install nodejs npm).
- Bower is installed by npm.
- Install Apache Maven – refer to these steps for installation.
Important Note: When you are working in a corporate environment, you need to set the proxies for Git, npm, and Bower individually, along with Maven.
Setting Proxies
- For Git
- For npm
- For Bower
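The individual commands were not included above; a typical proxy setup for these three tools looks like the following, with http://proxy.example.com:8080 standing in as a placeholder for your corporate proxy:

```shell
# Git: store the proxy in the global git config
git config --global http.proxy  http://proxy.example.com:8080
git config --global https.proxy http://proxy.example.com:8080

# npm: set proxy and https-proxy in the npm config
npm config set proxy       http://proxy.example.com:8080
npm config set https-proxy http://proxy.example.com:8080

# Bower: reads its proxy settings from ~/.bowerrc
cat > ~/.bowerrc <<'EOF'
{
  "proxy": "http://proxy.example.com:8080",
  "https-proxy": "http://proxy.example.com:8080"
}
EOF
```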
Building Zeppelin Binaries
- Download and extract the latest version of Apache Zeppelin from GitHub.
- Now cd to /incubator-zeppelin-master.
- The current versions of CDH, Hadoop, and Spark are:
CDH 5.4.0
Spark 1.3.0
Hadoop 2.6.0
- Maven command to build Zeppelin (locally):
OR
- Maven command to build Zeppelin for YARN (all Spark queries are tracked in the YARN history):
Profiles included:
- -Pspark-1.3: Installs Spark framework support for Zeppelin
- -Ppyspark: Installs all configurations required to run the pyspark interpreter in Zeppelin
- -Phadoop-2.6: Installs Hadoop version support for Zeppelin
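The build commands themselves were elided above. Based on the profiles listed and the CDH 5.4.0 / Spark 1.3.0 / Hadoop 2.6.0 versions given earlier, they likely looked something like the following (the hadoop.version value is our assumption for this CDH release):

```shell
# Local build
mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 \
    -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests

# YARN build (adds the -Pyarn profile)
mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -Pyarn \
    -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests
```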
Once the build is successful, continue with the configuration.
General Configuration of Zeppelin
- To access the Hive metastore, copy hive-site.xml from HIVE_HOME/conf into the ZEPPELIN_HOME/conf folder (where HIVE_HOME and ZEPPELIN_HOME refer to the install locations of this software).
- In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-env.sh.template and rename it to zeppelin-env.sh.
- In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-site.xml.template and rename it to zeppelin-site.xml.
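The three steps above can be done from the shell, assuming HIVE_HOME and ZEPPELIN_HOME are set to the install locations:

```shell
# Copy the Hive config into Zeppelin's conf folder
cp $HIVE_HOME/conf/hive-site.xml $ZEPPELIN_HOME/conf/

# Create the env and site configs from their templates
cd $ZEPPELIN_HOME/conf
cp zeppelin-env.sh.template   zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
```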
YARN Configuration of Zeppelin
If you have built the binaries for YARN, set the master property for the Spark interpreter (master=yarn-client) via the Zeppelin UI (Interpreter tab).
- In the Zeppelin /conf directory, open the zeppelin-env.sh file, uncomment the export HADOOP_CONF_DIR line, and specify the configuration directory location of the yarn-site.xml file (e.g., export HADOOP_CONF_DIR=/etc/hadoop/conf).
Start Zeppelin: ./bin/zeppelin-daemon.sh start
(Note: Sometimes you may not be able to run the above command. In that case, make all scripts in the /bin folder executable with the following command: chmod -R 777 bin.)
After this, try the previous command again to start Zeppelin.
And now you can access your notebook at http://localhost:8080 or http://host.ip.address:8080.
Stop Zeppelin: ./bin/zeppelin-daemon.sh stop
Testing
- Start the Zeppelin application: ./bin/zeppelin-daemon.sh start and access http://localhost:8080 (or the IP address of the node it is installed on).
- If you already have data in the Apache Hive metastore, accessible via hive commands locally, let's test Zeppelin commands. Use the %hive interpreter to access the Hive metastore and list all available databases. In this example we already have some public genome databases available in our Hive metastore. If you do not have any data in your Hive metastore, you may want to load some data before starting this test, or skip to Step 4. Now, type these commands in the notebook:
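The commands were elided in the original; listing the databases via the %hive interpreter would plausibly look like:

```
%hive
show databases
```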
The code snippet is echoed back and the code execution output is displayed:
- To display tables in a specific database, such as “wellderly”, type these commands in the notebook:
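The snippet itself was elided; listing the tables of the "wellderly" database with the %hive interpreter would plausibly look like:

```
%hive
show tables in wellderly
```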
Again, the code snippet is echoed back and the code execution output is displayed:
- Download the test dataset (education.csv) and place it in your HDFS location. Using the Scala interpreter, register a table using the .csv file in HDFS. Use the code snippet to register the table. Note: The Scala interpreter is the default, so no prefix (like %hive for Hive) needs to be specified in Zeppelin.
After that, run the command below:
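Both snippets were elided in the original. A sketch of the table-registration step follows, using the Spark 1.3-era API available in Zeppelin's default Scala interpreter; the column names in the case class and the HDFS path are illustrative assumptions, not the actual schema of education.csv:

```scala
// Zeppelin provides sc and sqlContext out of the box
import sqlContext.implicits._

// Hypothetical two-column schema; replace with education.csv's real columns
case class Education(state: String, value: Double)

val education = sc.textFile("hdfs:///tmp/education.csv") // assumed HDFS path
  .map(_.split(","))
  .map(f => Education(f(0), f(1).toDouble))
  .toDF()

education.registerTempTable("education")
```

The follow-up command was presumably a query against the registered table via the SQL interpreter, along these lines:

```
%sql
select * from education limit 10
```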
You now have installed and configured Zeppelin correctly and you have been able to test the installation successfully. Documentation for Zeppelin is available here.
Sharing a Notebook
- If you want to share these notebook results with another user, you can simply send the URL of your notebook to that user. (That user must have access to the server node and cluster on which you created your notebook.) That user can not only view all your queries, but also run them to view the results.
- If you want to share only the results without any queries (report-mode), please follow these steps:
- Go to the right corner of the Zeppelin window, where you see a dropdown list next to the settings icon.
- Change it from default to report. In this mode, only results can be viewed, without queries.
- Copy the URL and share with others (who have access to the server node and cluster).
- Three modes are available for sharing your notebooks:
- Default – In this mode, the notebook can be edited by anyone who has access to the notebook (edit queries and re-run to display different results).
- Simple – This mode is similar to default, the only difference is that all the available options are invisible. Options are visible only when you hover your mouse over the cell. This mode gives a cleaner view of the results when shared.
- Report – When this mode is enabled, only the final results are visible (read only). The notebook cannot be edited.
Conclusion
Clearly, Apache Zeppelin is still in the incubator stage, but it shows promise as a notebook not tied to a particular platform, tool, or programming language. Our intent here was to demonstrate how you can install Apache Zeppelin on your own system and start experimenting with its many capabilities. In the future, we want to use Zeppelin for exploratory data analysis and also write more interpreters for it to improve its visualization capability, e.g., by incorporating Google Charts and similar tools.
Maven 3.2.0 reports an error when -DskipTests is used. It's better to use -Dmaven.test.skip=true instead…
Hi Alex,
Other way is to use latest version of Maven 3.3.
That should fix this.
Thanks
Karthik Vadla
Is it possible to run Impala queries with Zeppelin?
Not out of the box, but this user discovered that Zeppelin’s Postgres connector is actually a generic JDBC driver that perhaps can be used for other purposes (including for Impala?):
http://thedataist.com/tutorial-using-apache-zeppelin-with-mysql/
(We haven’t tested this, just passing it along)
Hello,
Thank you for the post, but I have some issues with the install.
I tried to use Zeppelin on CDH 5.5.1 (corporate server with 15 nodes, YARN). I did as you propose:
mvn clean package -Pspark-1.5 -Ppyspark -Dhadoop.version=2.6.0-cdh5.5.1 -Phadoop-2.6 -Pyarn -DskipTests (works!)
and set the env variables:
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/bin/../lib/hadoop
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/lib/spark
export MASTER=yarn-client
export SPARK_SUBMIT_OPTIONS="--conf spark.driver.port=54321 --conf spark.fileserver.port=54322 --conf spark.blockManager.port=54323 --deploy-mode client --master yarn --num-executors 2 --executor-memory 2g"
export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle
export PYSPARK_PYTHON=/opt/cloudera/extras/python27/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/extras/python27/bin/python
export PATH=$PATH:/opt/cloudera/extras/python27/bin/
On the YARN manager I see it Running, but the only interpreter I can use is Spark, like this:
%spark
sc (works)
I want to use pyspark for my work with %pyspark, but I get an error like "%pyspark not value". %sql gives the same error.
Please help.
Thanks in advance