How-to: Install Apache Zeppelin on CDH

http://blog.cloudera.com/blog/2015/07/how-to-install-apache-zeppelin-on-cdh/



Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).

Data science is not a new discipline. However, with the growth of big data and the adoption of big data technologies, the demand for better-quality data has grown exponentially. Today data science is applied to every facet of life: product validation through fault prediction, genome-sequence analysis, personalized medicine through population studies and the Patient 360 view, credit-card fraud detection, improved customer experience through sentiment analysis and purchase patterns, weather forecasting, detection of cyber or terrorist attacks, aircraft maintenance that uses predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists detect patterns in data and provide actionable insights that influence organizational change.

The data scientist’s work broadly involves acquisition, cleanup, and analysis of data. Being a cross-functional discipline, this work involves communication, collaboration, and interaction with other individuals, both internal and possibly external to your organization. This is one reason why the “notebook” features in data analysis tools are gaining popularity: they ease organizing, sharing, and interactively working with long workflows. IPython Notebook is a great example but is limited to the Python language. Apache Zeppelin (incubating at the time of this writing) is a new web-based notebook that enables interactive, data-driven analytics and visualization, with the added bonus of supporting multiple languages, including Python, Scala, Spark SQL, Hive, Shell, and Markdown. Zeppelin also provides Apache Spark integration by default, using Spark’s fast, in-memory, distributed data-processing engine to do data science at lightning speed.

This post demonstrates how easy it is to install the Apache Zeppelin notebook on CDH (for dev/test only; not supported). We assume familiarity with Linux (especially CentOS) commands, installation, and configuration.

System Setup and Configuration

Components

Listed below are the specs of our test Hadoop cluster.


Installed hardware


Installed software

These installation commands are specific to CentOS. If you do not log in as ‘root’, you must use sudo for all the commands.

  • Update CentOS packages (yum update).
  • Install the latest version of Java, version 1.7 or later (yum install java-1.8.0-openjdk-devel).
  • Install Git (yum install git).
  • Install Node.js and npm (yum install nodejs npm).
  • Install Bower (it is installed via npm).
  • Install Apache Maven; refer to these steps for installation.
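
Put together, the prerequisite setup looks roughly like this on CentOS (a sketch only; package names can vary by repository, and prefix with sudo if you are not root):

    yum update
    yum install java-1.8.0-openjdk-devel git nodejs npm
    npm install -g bower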

Important Note: When you are working in a corporate environment, you need to set the proxies for Git, npm, and Bower individually, in addition to Maven.

Setting Proxies
  • For Git
  • For npm
  • For Bower
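
For example (a sketch; the proxy host and port below are placeholders for your corporate proxy):

    # Git
    git config --global http.proxy http://proxy.example.com:3128
    git config --global https.proxy http://proxy.example.com:3128

    # npm
    npm config set proxy http://proxy.example.com:3128
    npm config set https-proxy http://proxy.example.com:3128

    # Bower: add the proxy to ~/.bowerrc
    echo '{ "proxy": "http://proxy.example.com:3128", "https-proxy": "http://proxy.example.com:3128" }' > ~/.bowerrc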
Building Zeppelin Binaries
  • Download and extract the latest version of Apache Zeppelin from GitHub.
  • cd into the extracted incubator-zeppelin-master directory.
  • The current versions of CDH, Spark, and Hadoop are:

    CDH 5.4.0

    Spark 1.3.0

    Hadoop 2.6.0

  • Maven command to build the Zeppelin (locally):
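
    The exact command appeared as an image in the original post; given the profiles listed below, a likely form is:

        mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -DskipTests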

    OR

    Maven command to build Zeppelin for YARN (all Spark queries are then tracked in the YARN history):
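
    Likewise a reconstruction (compare the build command in the comments below; set hadoop.version to match your CDH parcel):

        mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests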

    Profiles included:

    -Pspark-1.3: installs Spark framework support for Zeppelin

    -Ppyspark: installs all configuration required to run the pyspark interpreter in Zeppelin

    -Phadoop-2.6: installs Hadoop version support for Zeppelin

Once the build is successful, continue with the configuration.

General Configuration of Zeppelin
  • To access the Hive metastore, copy hive-site.xml from HIVE_HOME/conf into the ZEPPELIN_HOME/conf folder (where HIVE_HOME and ZEPPELIN_HOME refer to the install locations of the respective software).
  • In ZEPPELIN_HOME/conf folder duplicate zeppelin-env.sh.template and rename it to zeppelin-env.sh.
  • In ZEPPELIN_HOME/conf folder duplicate zeppelin-site.xml.template and rename it to zeppelin-site.xml.
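
As shell commands, those three steps amount to (HIVE_HOME and ZEPPELIN_HOME stand for your actual install paths):

    cp $HIVE_HOME/conf/hive-site.xml $ZEPPELIN_HOME/conf/
    cd $ZEPPELIN_HOME/conf
    cp zeppelin-env.sh.template zeppelin-env.sh
    cp zeppelin-site.xml.template zeppelin-site.xml
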
YARN Configuration of Zeppelin

If you have built the binaries for YARN, set the master property for the Spark interpreter, i.e., master=yarn-client, via the Zeppelin UI (Interpreter tab).

  • In the Zeppelin conf directory, open zeppelin-env.sh, uncomment the export HADOOP_CONF_DIR line, and set it to the directory containing yarn-site.xml (e.g., export HADOOP_CONF_DIR=/etc/hadoop/conf).
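
In short, a YARN build needs these two settings (the path shown is the default used in this post):

    # conf/zeppelin-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Zeppelin UI -> Interpreter tab -> spark interpreter
    master=yarn-client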

Start Zeppelin: ./bin/zeppelin-daemon.sh start

(Note: Sometimes you may not be able to run the above command. In that case, make all the scripts in the bin folder executable, e.g. chmod -R 777 bin, and then try the start command again.)

And now you can access your notebook at http://localhost:8080 or http://host.ip.address:8080.

Stop Zeppelin: ./bin/zeppelin-daemon.sh stop

Testing

    1. Start the Zeppelin application (./bin/zeppelin-daemon.sh start) and access http://localhost:8080 (or the IP address of the node it is installed on).
    2. If you already have data in the Apache Hive metastore that is accessible via hive commands locally, you can test Zeppelin against it. Use the %hive interpreter to access the Hive metastore and list all available databases. In this example we already have some public genome databases in our Hive metastore. If you do not have any data in your Hive metastore, you may want to load some data before starting this test, or skip to Step 4.

      Now, type these commands in the notebook:
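
      The snippet appeared as an image in the original post; in essence it is:

          %hive
          show databases;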

      The code snippet is echoed back and the code execution output is displayed.

    3. To display tables in a specific database, such as “wellderly”, type these commands in the notebook:
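
      For this step the snippet amounts to:

          %hive
          show tables in wellderly;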

      Again, the code snippet is echoed back and the code execution output is displayed.

    4. Download the test dataset (education.csv) and place it in your HDFS location. Using the Scala interpreter, register a table from the .csv file in HDFS with the code snippet below. (Note: The Scala interpreter is the default, so nothing needs to be specified at the start of the paragraph, unlike %hive when using Hive.)
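
      A minimal sketch in the default (Scala/Spark) interpreter, assuming the Spark 1.3 build above; the HDFS path and the columns of education.csv are illustrative, since the post does not show them:

          // education.csv uploaded to HDFS earlier (this path is an assumption)
          val raw = sc.textFile("/user/zeppelin/education.csv")

          // hypothetical two-column schema for the test dataset
          case class Education(state: String, spending: Double)

          import sqlContext.implicits._
          val education = raw.map(_.split(","))
                             .map(r => Education(r(0), r(1).trim.toDouble))
                             .toDF()
          education.registerTempTable("education")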

      After that, run the command below:
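
      A query via the %sql interpreter works against the registered temp table (names follow the sketch above):

          %sql
          select * from education limit 10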

You have now installed and configured Zeppelin correctly and tested the installation successfully. Documentation for Zeppelin is available here.

Sharing a Notebook

    1. If you want to share notebook results with another user, you can simply send the URL of your notebook to that user. (That user must have access to the server node and cluster on which you created the notebook.) The user can not only view all your queries but also run them to view the results.
    2. If you want to share only the results without any queries (report mode), follow these steps:
      1. Go to the top-right corner of the Zeppelin window, where you will see a dropdown list next to the settings icon.
      2. Change it from default to report. In this mode, only results can be viewed, without the queries.
      3. Copy the URL and share it with others (who have access to the server node and cluster).

    Three modes are available for sharing your notebooks:
      1. Default: the notebook can be edited by anyone who has access to it (edit queries and re-run to display different results).
      2. Simple: similar to default, except that the available options are hidden and appear only when you hover your mouse over a cell. This mode gives a cleaner view of the results when shared.
      3. Report: only the final results are visible (read-only); the notebook cannot be edited.

Conclusion

Apache Zeppelin is clearly still in the incubator stage, but it shows promise as a notebook not tied to a particular platform, tool, or programming language. Our intent here was to demonstrate how you can install Apache Zeppelin on your own system and start experimenting with its many capabilities. In the future, we want to use Zeppelin for exploratory data analysis and to write more interpreters for it to improve its visualization capability, e.g., by incorporating Google Charts and similar tools.


      5 responses on “How-to: Install Apache Zeppelin on CDH”

      1. Alex Ott, July 29, 2015 at 2:38 am

        Maven 3.2.0 reports an error when -DskipTests is used. It’s better to use -Dmaven.test.skip=true instead…

      2. Karthik Vadla, August 3, 2015 at 2:59 pm

        Hi Alex,

        Another way is to use the latest version of Maven, 3.3.
        That should fix this.

        Thanks
        Karthik Vadla

      3. Nir, February 24, 2016 at 3:07 am

        Is it possible to run Impala queries with Zeppelin?

        1. Justin Kestelyn (post author), February 24, 2016 at 11:49 am

          Not out of the box, but this user discovered that Zeppelin’s Postgres connector is actually a generic JDBC driver that perhaps can be used for other purposes (including for Impala?):

          http://thedataist.com/tutorial-using-apache-zeppelin-with-mysql/

          (We haven’t tested this, just passing it along)

      4. malouke, March 2, 2016 at 1:52 am

        hello,
        thank you for the topic, but I have some issues with the install.
        I tried to use Zeppelin on CDH 5.5.1 (corporate server with 15 nodes, YARN). I did as you propose:
        mvn clean package -Pspark-1.5 -Ppyspark -Dhadoop.version=2.6.0-cdh5.5.1 -Phadoop-2.6 -Pyarn -DskipTests (OK, works!)
        and the env variables:
        export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
        export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/bin/../lib/hadoop
        export SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/lib/spark
        export MASTER=yarn-client
        export SPARK_SUBMIT_OPTIONS="--conf spark.driver.port=54321 --conf spark.fileserver.port=54322 --conf spark.blockManager.port=54323 --deploy-mode client --master yarn --num-executors 2 --executor-memory 2g"
        export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle

        #Licensed to the Apache Software Foundation (ASF) under one or more
        export PYSPARK_PYTHON=/opt/cloudera/extras/python27/bin/python
        export PYSPARK_DRIVER_PYTHON=/opt/cloudera/extras/python27/bin/python
        export PATH=$PATH:/opt/cloudera/extras/python27/bin/

        In the YARN manager I see it Running, but the only interpreter I can use is Spark, like this:
        %spark
        sc (OK, works)

        I want to use pyspark for my work with %pyspark
        but I get errors like:
        %pyspark not value
        %sql same error
        please help,
        thanks in advance

