Hadoop Course Transcript

Welcome back. Let's take the Cloudera QuickStart VM tour together; I'll be your guide. If you download and open up your virtual machine, once the Cloudera QuickStart VM finishes launching you're gonna see a screen kind of like this. What we're going to do now is go inside this VM and see what kind of services and utilities this virtual machine provides us.

When you first start, you're gonna see a browser inside something that looks like a desktop, with a little welcome page open in it. We're gonna be exploring this browser, opening up additional tabs, and going through a number of different services to learn more about this environment. If you look across the top of this browser, you can see bookmarks for several different services. The first one, Cloudera, will take you to the Cloudera web page, but then you have Hue, Hadoop, HBase, Impala, Spark, Oozie, etc. These are the applications we mentioned and learned about in our previous lecture. Let's click through them to see what they allow us to do and what they provide. Let's click on Hadoop first. You can either open it in a new tab, or you can just click on it in this same tab.
What we can see here is an overview of our Hadoop stack. We can see when this particular instance of the stack was started and which version of the software we have, and it gives us a summary of the configuration, the security settings, the number of files stored, and so on. The VM we downloaded already comes preloaded with a number of files that we're gonna use throughout our exercises and tutorials, so some of the storage is already taken up.

Let's click on Datanodes next. We mentioned datanodes a little bit earlier when we talked about HDFS, the storage system underneath. This page allows us to look at all the datanodes that we have, and once we start running jobs, we're gonna be able to see what these datanodes hold and a summary of everything loaded onto them.
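The same storage picture is also available from the command line. As a minimal sketch, assuming you are inside the QuickStart VM's terminal, you can ask HDFS itself for a datanode and capacity report:

    # Cluster capacity and per-datanode details (may need to run as the hdfs user)
    sudo -u hdfs hdfs dfsadmin -report

    # Free and used space as seen through the HDFS client
    hadoop fs -df -h /

These print roughly the same numbers the NameNode and Datanodes pages display in the browser.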
Next, let's click on HBase, which we mentioned earlier. HBase is a columnar data store that stores our unstructured data within the Hadoop file system. This page shows us the number of read and write requests we make against HBase, and we can see all the different kinds of calls and tasks that have been submitted to the database. Right now there aren't any, because we haven't launched any tasks, but as we go through the tutorial we're gonna come back to this and look at the summaries of the jobs submitted.
I mentioned Impala earlier, and we said that Impala allows us to submit really high-performance, SQL-like queries against the data stored in HDFS. You can see here that once we start submitting queries, we're gonna be able to look at the last 25 completed queries, the queries that are running right now, and the different locations and fragments that these queries have been distributed to.

Next, let's click on Oozie. In Oozie we can see the jobs that have been submitted, when they were started, how large they were, and so on. Once we start submitting jobs, we're going to come back here and look in more detail at all of these different applications and services provided to us on top of the Hadoop framework.
So right now, let's go back to the original webpage, the Cloudera Live welcome page within the browser, and start the tutorial. Click on Start Tutorial. This page gives us an introduction to Cloudera's live tutorial. You can see that within this QuickStart VM, we're gonna be able to run a number of different jobs as part of the tutorial and understand how some of the tools within the Cloudera VM work. We're gonna learn how to set up and execute a number of different business intelligence and analytics use cases, and we'll be able to explain to other people on our team how these particular services work. So, let's click on Getting Started.
This will take us to another page where we can find the business question we're trying to answer in this particular tutorial. In this tutorial we're gonna imagine that we work for a corporation called DataCo, and our mission is to help the organization get better insight from its data and answer some bigger questions. Our first scenario for this part of the tutorial takes the idea of big data and gaining more insight: can we actually get some data into the platform and run some simple analytics on a large amount of data?
In this specific first scenario, DataCo's business question is: what products do our customers like to buy? To answer this question, the first thought might be to look at the transaction data, since that should indicate what customers actually buy and what they would like to buy, right? This is probably something you already do in your regular job, going to your RDBMS environment and submitting some SQL queries to try and answer the question. The benefit of the Hadoop platform is that now you can do this at greater scale and at lower cost, and you might be able to use the same system and the same data to answer additional questions and do additional analytics. So what this exercise will demonstrate is how to do exactly the same thing, but within the Hadoop environment, using some of the seamless integration that's available to us within this VM.
Here on the sidebar you can learn more about Sqoop. We talked about Sqoop in our previous lecture: it's a tool that uses MapReduce to transfer data between a Hadoop cluster and a relational database very efficiently. If you remember, Sqoop stands for "SQL to Hadoop". It works by spawning tasks on multiple datanodes to download various portions of the data in parallel, so the transfer runs faster. When the transfer is finished, each piece of data is replicated to ensure availability and spread out across the cluster, so you can process the data in parallel on the cluster. There are actually two versions of Sqoop included in the Cloudera platform. Sqoop 1 is a thick client and is what we're going to be using in this tutorial; the commands we run directly submit MapReduce jobs to transfer the data. Sqoop 2 consists of a central server that submits the MapReduce jobs on behalf of its clients, plus a much lighter-weight client that you use to connect to that server.
On this page we can see the table structure of the data we're bringing in. In order to analyze the transactional data on this new platform, we're going to need to ingest it into HDFS, and we have already established that Sqoop is probably our best tool for this job. We're gonna take the data from our relational MySQL database and load it into HDFS. With a few additional configuration parameters, we can take this a step further and load the relational data directly into a form ready to be queried by Impala. To do that, we want to leverage the power of something called Apache Avro, which we haven't mentioned much so far. Avro is a file format for data and data workflows on the Hadoop cluster, and it's specifically optimized to work really well on the Hadoop platform.

Let's scroll down through this page and make sure we read the rest of the instructions. Now we're gonna open the terminal, which we can do by clicking on the little dark square icon that represents a terminal; clicking on it opens a new window with a command prompt inside. Within the tutorial, each code box has a little button that says Copy, which copies the code inside that box so we can paste it into the Cloudera terminal window. Our first task is to copy the Sqoop command that launches the Sqoop job for moving this data. To paste it into the terminal window, you can either click Edit at the top of the terminal window and choose Paste, or just press Ctrl+Shift+V if that is easier for you.
We're gonna paste this command into the terminal. It might take a little while to complete, but it is doing a lot of things: it launches MapReduce jobs to export the data from the MySQL database and writes the exported files into HDFS in Avro format, and it also generates the Avro schema files, so that we can easily create our Hive tables and use them for querying in the Impala layer.
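For reference, the Sqoop command in the tutorial looks roughly like the sketch below; the host name, database, credentials, and warehouse directory shown here are the usual QuickStart defaults and should be treated as assumptions that may differ in your copy of the tutorial.

    # Import every table from the retail_db MySQL database into HDFS as Avro data files
    sqoop import-all-tables \
        -m 1 \
        --connect jdbc:mysql://quickstart:3306/retail_db \
        --username=retail_dba \
        --password=cloudera \
        --compression-codec=snappy \
        --as-avrodatafile \
        --warehouse-dir=/user/hive/warehouse

The --as-avrodatafile flag is what produces the Avro data files in HDFS, and Sqoop also writes the matching .avsc schema files into the local working directory, one per imported table.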
To confirm that the data files now exist in HDFS, we're gonna copy the next command into the terminal.
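A minimal sketch of that check, assuming the import used the warehouse directory above, is just a directory listing:

    # One subdirectory per imported table should now exist
    hadoop fs -ls /user/hive/warehouse

    # Look inside one of them (categories is one of the retail_db tables) to see the Avro part files
    hadoop fs -ls /user/hive/warehouse/categories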
Note that the schema and the data are stored in separate files; the schema is only applied to the data when the data is queried. This is what we call schema on read, which, if you remember, we mentioned in our previous class. It gives you the flexibility to query the data with SQL while it's still in a format usable by other systems as well. This is in contrast to traditional databases, which require you to have the schema well defined before entering any data. At this point we have already imported a lot of data, and now we just specify how its structure should be interpreted. Since we're gonna want to use Apache Hive, we will need the schema files as well, so let's copy them into HDFS where Hive can easily access them.
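A sketch of that copy step, assuming the .avsc files Sqoop generated are sitting in the home directory and that /user/examples is an acceptable target path (your copy of the tutorial may use a different location):

    # Create a directory in HDFS for the schema files and copy them in
    hadoop fs -mkdir -p /user/examples
    hadoop fs -put ~/*.avsc /user/examples/

    # Verify the schema files landed where Hive and Impala can reach them
    hadoop fs -ls /user/examples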
Now that we have the data, we can prepare it to be queried. We're going to do this in the next section using Impala. You might have noticed that we have imported this data into Hive's directories. Hive and Impala both read their data from files in HDFS, and they even share metadata about the tables. The main difference is that Hive executes queries by compiling them to MapReduce jobs, which, as you will see later, means it can be more flexible but typically slower, while Impala is a massively parallel query engine that reads the data directly from the file system itself, which allows it to execute queries very fast, in more of an interactive, exploratory mode.
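Later on, once the tables exist, you can see that difference for yourself by running the same statement through both engines from the terminal; this is only an illustrative comparison, and the table name is just an example:

    # Hive compiles the query into a MapReduce job: flexible, but slower to start
    hive -e 'SELECT COUNT(*) FROM categories;'

    # Impala runs the same query on its own massively parallel engine: much lower latency
    impala-shell -q 'SELECT COUNT(*) FROM categories;'

Both read the same files in HDFS and share the same table metadata, which is exactly the point of the shared metastore.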
Now that you have gone through the first steps, Sqooping the data into HDFS and transforming it into the Avro file format, we're gonna import the schema files and start using that schema by querying the data. So let's take the next step and click on Tutorial Exercise 1. We're gonna use Hue's Impala application to create the metadata for our tables, and then we're gonna query those tables through the Hue environment. Hue is really nice because it provides a web-based interface for many of the tools in our Cloudera Hadoop virtual machine, and it runs on the master node, which you can see here. To log in to Hue, we're gonna type in cloudera for both the username and the password. Once inside Hue, we're gonna go to Query Editors and click on Impala underneath. We're gonna copy and paste the code that creates the tables, then go back, clear the create-table statements out of the editor, and copy in the command show tables. This command will show us, in the window below the editor, all of the tables we have just created.
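As a rough sketch of what that pasted code does, each statement maps one of the Avro directories in HDFS to a table by pointing at the schema file we copied earlier. The table name, location, and schema path below follow the pattern of the tutorial but should be treated as assumptions, and the same statements can also be run from the terminal with impala-shell:

    # Create one external table over the Avro files, using the copied .avsc schema
    impala-shell -q "CREATE EXTERNAL TABLE IF NOT EXISTS categories STORED AS AVRO LOCATION '/user/hive/warehouse/categories' TBLPROPERTIES ('avro.schema.url'='hdfs:///user/examples/sqoop_import_categories.avsc')"

    # Refresh Impala's view of the metadata, then list the tables we just created
    impala-shell -q "INVALIDATE METADATA"
    impala-shell -q "SHOW TABLES"

The tutorial repeats the same create-table pattern once for each imported table.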
Now that our transaction data is readily available for structured queries, it's time to actually start addressing the question we asked at the beginning of the tutorial. Our business question was: can we find out which products our customers are interested in buying? So we're going to copy and paste the standard SQL example queries from the tutorial and look at which kinds of products are selling, the total revenue per product, and the top ten revenue-generating products.
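As an illustration of the kind of query the tutorial provides, a top-ten-revenue query over the imported tables might look roughly like this. The table and column names follow the retail_db sample data and should be treated as assumptions; the tutorial's own version is a bit more involved (it also joins to the orders table to exclude cancelled orders), but the shape is the same:

    # Top ten revenue-generating products, joining order line items back to products
    impala-shell -q "
      SELECT p.product_id,
             p.product_name,
             SUM(CAST(oi.order_item_subtotal AS DOUBLE)) AS revenue
      FROM order_items oi
      JOIN products p ON oi.order_item_product_id = p.product_id
      GROUP BY p.product_id, p.product_name
      ORDER BY revenue DESC
      LIMIT 10;"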
Now we have learned how to create and query a table using Impala, and how to use familiar interfaces, like the terminal and the web-based tools, within the Hadoop environment. The idea here is to take this data and start creating some interesting reports that give us better insight into the data we have, to understand how the traditional system and the Hadoop system do similar things in different ways, and to understand how the Hadoop system provides us with much larger scale while still offering a lot of flexibility.
A Hadoop course design typically covers the core components of the Apache Hadoop ecosystem, an open-source distributed computing framework for processing large-scale data sets. In designing such a course, students learn the following key parts:

1. **Hadoop fundamentals**: the course begins with Hadoop's background, purpose, and architecture, including the Hadoop Distributed File System (HDFS) and the MapReduce model.
2. **Hadoop installation and configuration**: how to install a Hadoop cluster and configure the core configuration files (such as core-site.xml, hdfs-site.xml, and mapred-site.xml) and environment variables.
3. **HDFS operations**: using the Hadoop command-line tools (such as hadoop fs and hdfs dfs) to manage the file system, e.g. uploading, downloading, copying, and deleting files.
4. **MapReduce programming**: writing MapReduce jobs, including implementing the Mapper and Reducer and understanding how the JobTracker and TaskTracker work.
5. **Hadoop stream processing**: other components such as Hadoop Streaming and task scheduling with YARN (Yet Another Resource Negotiator).
6. **Hadoop ecosystem extensions**: other components of the ecosystem, such as Hive (SQL queries), Pig (a dataflow language), HBase (a columnar NoSQL store), and Spark (a real-time data processing framework).
7. **Case studies and project practice**: applying Hadoop to real data-analysis problems, for example log analysis or social-network data mining.

**Related questions**:
1. What other components are in the Hadoop ecosystem?
2. In MapReduce programming, what are the main roles of the Mapper and the Reducer?
3. In a real project, how do you choose between Hadoop and Spark for data processing?