Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).
Data science is not a new discipline. However, with the growth of big data and the adoption of big data technologies, the demand for better-quality data has grown exponentially. Today data science is applied to every facet of life: product validation through fault prediction, genome sequence analysis, personalized medicine through population studies and the Patient 360 view, credit card fraud detection, improvement in customer experience through sentiment analysis and purchase patterns, weather forecasting, detecting cyber or terrorist attacks, aircraft maintenance utilizing predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists are detecting patterns in data and providing actionable insights to influence organizational changes.
The data scientist's work broadly involves the acquisition, cleanup, and analysis of data. Being a cross-functional discipline, this work involves communication, collaboration, and interaction with other individuals, internal and possibly external to your organization. This is one reason why the "notebook" features in data analysis tools are gaining popularity: they ease organizing, sharing, and interactively working with long workflows. IPython Notebook is a great example, but it is limited to the Python language. Apache Zeppelin (incubating at the time of this writing) is a new web-based notebook that enables data-driven, interactive data analytics and visualization, with the added bonus of supporting multiple languages, including Python, Scala, Spark SQL, Hive, Shell, and Markdown. Zeppelin also provides Apache Spark integration by default, making use of Spark's fast, in-memory, distributed data processing engine to accomplish data science at lightning speed.
This post demonstrates how easy it is to install Apache Zeppelin notebook on CDH (for dev/test only, not supported). We assume familiarity with Linux (especially CentOS) commands, installation, and configuration.
System Setup and Configuration
Components
Listed below are the specs of our test Hadoop cluster.
Installed hardware
Installed software
These installation commands are specific to CentOS. If you do not log in as 'root', you must use sudo for all the commands.
- Update CentOS packages (yum update).
- Install the latest version of Java, preferably version 1.7 or later (yum install java-1.8.0-openjdk-devel).
- Install Git (yum install git).
- Install Node.js and npm (yum install nodejs npm).
- Bower is installed by npm.
- Install Apache Maven – refer to these steps for installation.
Important Note: When you are working in a corporate environment, you need to set the proxies for Git, npm, and Bower individually, along with Maven.
Setting Proxies
- For Git
- For npm
- For Bower
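The individual commands were not included above; a typical proxy setup for these three tools looks like the following, with http://proxy.example.com:8080 standing in as a placeholder for your corporate proxy:

```shell
# Git: store the proxy in the global git config
git config --global http.proxy  http://proxy.example.com:8080
git config --global https.proxy http://proxy.example.com:8080

# npm: set proxy and https-proxy in the npm config
npm config set proxy       http://proxy.example.com:8080
npm config set https-proxy http://proxy.example.com:8080

# Bower: reads its proxy settings from ~/.bowerrc
cat > ~/.bowerrc <<'EOF'
{
  "proxy": "http://proxy.example.com:8080",
  "https-proxy": "http://proxy.example.com:8080"
}
EOF
```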
Building Zeppelin Binaries
- Download and extract the latest version of Apache Zeppelin from GitHub.
- Now cd to /incubator-zeppelin-master.
- The current versions of CDH, Hadoop, and Spark are:
CDH 5.4.0
Spark 1.3.0
Hadoop 2.6.0
- Maven command to build Zeppelin (locally):
OR
- Maven command to build Zeppelin for YARN (all Spark queries are tracked in the YARN history):
Profiles included:
- -Pspark-1.3: Installs Spark framework support for Zeppelin
- -Ppyspark: Installs all configurations required to run the pyspark interpreter in Zeppelin
- -Phadoop-2.6: Installs Hadoop version support for Zeppelin
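The build commands themselves were elided above. Based on the profiles listed and the CDH 5.4.0 / Spark 1.3.0 / Hadoop 2.6.0 versions given earlier, they likely looked something like the following (the hadoop.version value is our assumption for this CDH release):

```shell
# Local build
mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 \
    -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests

# YARN build (adds the -Pyarn profile)
mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -Pyarn \
    -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests
```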
Once the build is successful, continue with the configuration.
General Configuration of Zeppelin
- To access the Hive metastore, copy hive-site.xml from HIVE_HOME/conf into the ZEPPELIN_HOME/conf folder (where HIVE_HOME and ZEPPELIN_HOME refer to the install locations of this software).
- In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-env.sh.template and rename it to zeppelin-env.sh.
- In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-site.xml.template and rename it to zeppelin-site.xml.
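The three steps above can be done from the shell, assuming HIVE_HOME and ZEPPELIN_HOME are set to the install locations:

```shell
# Copy the Hive config into Zeppelin's conf folder
cp $HIVE_HOME/conf/hive-site.xml $ZEPPELIN_HOME/conf/

# Create the env and site configs from their templates
cd $ZEPPELIN_HOME/conf
cp zeppelin-env.sh.template   zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
```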
YARN Configuration of Zeppelin
If you have built the binaries for YARN, set the master property for the Spark interpreter (master=yarn-client) via the Zeppelin UI (Interpreter tab).
- In the Zeppelin /conf directory, open the zeppelin-env.sh file, uncomment the export HADOOP_CONF_DIR line, and specify the configuration directory location of the yarn-site.xml file (e.g., export HADOOP_CONF_DIR=/etc/hadoop/conf).
Start Zeppelin: ./bin/zeppelin-daemon.sh start
(Note: Sometimes you may not be able to run the above command. In that case, make all scripts in the /bin folder executable with the following command: chmod -R 777 bin.)
After this, try the previous command again to start Zeppelin.
And now you can access your notebook at http://localhost:8080 or http://host.ip.address:8080.
Stop Zeppelin: ./bin/zeppelin-daemon.sh stop
Testing
- Start the Zeppelin application: ./bin/zeppelin-daemon.sh start and access http://localhost:8080 (or the IP address of the node it is installed on).
- If you already have data in the Apache Hive metastore, accessible via hive commands locally, let's test Zeppelin commands. Use the %hive interpreter to access the Hive metastore and list all available databases. In this example we already have some public genome databases available in our Hive metastore. If you do not have any data in your Hive metastore, you may want to load some data before starting this test, or skip to Step 4. Now, type these commands in the notebook:
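The commands were elided in the original; listing the databases via the %hive interpreter would plausibly look like:

```
%hive
show databases
```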
The code snippet is echoed back and the code execution output is displayed:
- To display tables in a specific database, such as “wellderly”, type these commands in the notebook:
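The snippet itself was elided; listing the tables of the "wellderly" database with the %hive interpreter would plausibly look like:

```
%hive
show tables in wellderly
```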
Again, the code snippet is echoed back and the code execution output is displayed:
- Download the test dataset (education.csv) and place it in your HDFS location. Using the Scala interpreter, register a table using the .csv file in HDFS. Use the code snippet to register the table. Note: The Scala interpreter is the default, so no prefix (like %hive for Hive) needs to be specified in Zeppelin.
After that, run the command below:
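Both snippets were elided in the original. A sketch of the table-registration step follows, using the Spark 1.3-era API available in Zeppelin's default Scala interpreter; the column names in the case class and the HDFS path are illustrative assumptions, not the actual schema of education.csv:

```scala
// Zeppelin provides sc and sqlContext out of the box
import sqlContext.implicits._

// Hypothetical two-column schema; replace with education.csv's real columns
case class Education(state: String, value: Double)

val education = sc.textFile("hdfs:///tmp/education.csv") // assumed HDFS path
  .map(_.split(","))
  .map(f => Education(f(0), f(1).toDouble))
  .toDF()

education.registerTempTable("education")
```

The follow-up command was presumably a query against the registered table via the SQL interpreter, along these lines:

```
%sql
select * from education limit 10
```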
You now have installed and configured Zeppelin correctly and you have been able to test the installation successfully. Documentation for Zeppelin is available here.
Sharing a Notebook
- If you want to share these notebook results with another user, you can simply send the URL of your notebook to that user. (That user must have access to the server node and cluster on which you created your notebook.) That user can not only view all your queries, but also run them to view the results.
- If you want to share only the results without any queries (report-mode), please follow these steps:
- Go to the right corner of the Zeppelin window, where you see a dropdown list next to the settings icon.
- Change it from default to report. In this mode, only results can be viewed, without queries.
- Copy the URL and share with others (who have access to the server node and cluster).
- Three modes are available for sharing your notebooks:
- Default – In this mode, the notebook can be edited by anyone who has access to the notebook (edit queries and re-run to display different results).
- Simple – This mode is similar to default, the only difference is that all the available options are invisible. Options are visible only when you hover your mouse over the cell. This mode gives a cleaner view of the results when shared.
- Report – When this mode is enabled, only the final results are visible (read only). The notebook cannot be edited.
Conclusion
Clearly, Apache Zeppelin is still in the incubator stage, but it shows promise as a notebook not tied to a particular platform, tool, or programming language. Our intent here was to demonstrate how you can install Apache Zeppelin on your own system and start experimenting with its many capabilities. In the future, we want to use Zeppelin for exploratory data analysis and also write more interpreters for it to improve its visualization capability, e.g., by incorporating Google Charts and similar tools.
Maven 3.2.0 reports an error when -DskipTests is used. It's better to use -Dmaven.test.skip=true instead…
Hi Alex,
Other way is to use latest version of Maven 3.3.
That should fix this.
Thanks
Karthik Vadla
Is it possible to run Impala queries with Zeppelin?
Not out of the box, but this user discovered that Zeppelin’s Postgres connector is actually a generic JDBC driver that perhaps can be used for other purposes (including for Impala?):
http://thedataist.com/tutorial-using-apache-zeppelin-with-mysql/
(We haven’t tested this, just passing it along)
Hello,
Thank you for the post, but I have some issues with the install.
I tried to use Zeppelin on CDH 5.5.1 (corporate server with 15 nodes, YARN). I did as you propose:
mvn clean package -Pspark-1.5 -Ppyspark -Dhadoop.version=2.6.0-cdh5.5.1 -Phadoop-2.6 -Pyarn -DskipTests (works!)
and set the env variables:
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/bin/../lib/hadoop
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/lib/spark
export MASTER=yarn-client
export SPARK_SUBMIT_OPTIONS="--conf spark.driver.port=54321 --conf spark.fileserver.port=54322 --conf spark.blockManager.port=54323 --deploy-mode client --master yarn --num-executors 2 --executor-memory 2g"
export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle
export PYSPARK_PYTHON=/opt/cloudera/extras/python27/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/extras/python27/bin/python
export PATH=$PATH:/opt/cloudera/extras/python27/bin/
On the YARN manager I see it Running, but the only interpreter I can use is Spark, like this:
%spark
sc (works)
I want to use pyspark for my work with %pyspark, but I get an error like "%pyspark not value". %sql gives the same error.
Please help.
Thanks in advance