The Easiest Way to Play with a Hadoop Cluster

Hadoop is an indispensable platform and tool for big data, so today let's have some fun with it. It's dead simple, and I'll walk you through setting up the environment.

First, a quick run-through of the concepts:

Hadoop is made up of three main parts:

  • HDFS: Hadoop's distributed file system; all files are stored on it.
  • MapReduce: Hadoop's compute engine. Having the data alone isn't enough, we still need to compute on it, otherwise storing big data would be pointless. In practice hardly anyone develops directly against MapReduce anymore; Hive has taken its place, and there are plenty of other compute engines such as Spark.
  • YARN: Hadoop's resource and job scheduler.

Knowing this, you can see that the most useful part of today's Hadoop is really HDFS; the rest has long since changed beyond recognition.

Alright, let's go play with Hadoop.

Getting Hadoop ready

We'll do everything with Docker here, which makes things much simpler.

1. Pull the Hadoop Docker image

docker pull singularities/hadoop

Note: the pull can take a while, so be patient.
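
Once the pull finishes, you can confirm the image is available locally (a standard Docker check, nothing specific to this image):

docker images singularities/hadoop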

2. Create a docker-compose.yml file

This file is used to run the Hadoop Docker containers.

mkdir hadoop
cd hadoop
vim docker-compose.yml

Write the following into docker-compose.yml and save it:

version: "2"

services:
  namenode:
    image: singularities/hadoop
    command: start-hadoop namenode
    hostname: namenode
    environment:
      HDFS_USER: hdfsuser
    ports:
      - "8020:8020"
      - "14000:14000"
      - "50070:50070"
      - "50075:50075"
      - "10020:10020"
      - "13562:13562"
      - "19888:19888"
  datanode:
    image: singularities/hadoop
    command: start-hadoop datanode namenode
    environment:
      HDFS_USER: hdfsuser
    links:
      - namenode
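
Before starting anything, you can optionally ask docker-compose to validate the file and echo the resolved configuration; this is a generic docker-compose command, not anything specific to this image:

docker-compose config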

3. Start the Hadoop cluster

Before starting, you need to install docker-compose:

pip install docker-compose
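
To check that the install worked:

docker-compose --version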

Note: run the commands below from the directory where you just created the file.

Start commands:

docker-compose up -d
docker-compose scale datanode=3
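
On newer docker-compose releases the standalone scale command is deprecated; if yours complains, the same thing can be done in one go with the --scale flag:

docker-compose up -d --scale datanode=3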

Once it's up, we can take a look:

[root@localhost hadoop]# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                                                                                                                 NAMES
01981e71197c        singularities/hadoop   "start-hadoop datano…"   19 hours ago        Up 19 hours         8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp                                                                                                           hadoop_datanode_2
ffd6eb856746        singularities/hadoop   "start-hadoop datano…"   19 hours ago        Up 19 hours         8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp                                                                                                           hadoop_datanode_3
4e4d82b20c47        singularities/hadoop   "start-hadoop datano…"   19 hours ago        Up 19 hours         8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp                                                                                                           hadoop_datanode_1
cc7642836c26        singularities/hadoop   "start-hadoop nameno…"   19 hours ago        Up 19 hours         0.0.0.0:8020->8020/tcp, 0.0.0.0:10020->10020/tcp, 0.0.0.0:13562->13562/tcp, 0.0.0.0:14000->14000/tcp, 9000/tcp, 50010/tcp, 0.0.0.0:19888->19888/tcp, 0.0.0.0:50070->50070/tcp, 50020/tcp, 50090/tcp, 50470/tcp, 0.0.0.0:50075->50075/tcp, 50475/tcp   hadoop_namenode_1

Here we can see four running Docker containers, which are effectively four machines.

We can also try opening the web UI.

Note: the IP below is the IP of the host machine running the containers, not a container IP.

Because we declared port mappings in docker-compose.yml, the services inside Docker are reachable from outside:

http://192.168.XXX.XX:50070/

The NameNode overview page should be displayed.

From this web UI you can also browse the files stored on HDFS.

Since I've already stored some files, mine shows a few entries; if this is the first time you open it, the file system will be empty.

At this point our Hadoop setup is healthy. If the page doesn't open, check whether the relevant ports are open on the host machine.
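
If you prefer the command line, the same port also serves the WebHDFS REST API, so you can check the cluster with curl from the host (same placeholder IP as above; LISTSTATUS is a read-only metadata call answered by the NameNode):

curl "http://192.168.XXX.XX:50070/webhdfs/v1/?op=LISTSTATUS"

A JSON listing of the HDFS root directory means everything is wired up correctly.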

Next we can start working with files in Hadoop.

To do that we need to get inside one of the Docker containers. Since this is a cluster, any of them will do. Let's look again at the containers that are currently running.

Operating Hadoop

[root@localhost ~]# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                                                                                                                 NAMES
01981e71197c        singularities/hadoop   "start-hadoop datano…"   19 hours ago        Up 19 hours         8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp                                                                                                           hadoop_datanode_2
ffd6eb856746        singularities/hadoop   "start-hadoop datano…"   19 hours ago        Up 19 hours         8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp                                                                                                           hadoop_datanode_3
4e4d82b20c47        singularities/hadoop   "start-hadoop datano…"   19 hours ago        Up 19 hours         8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp                                                                                                           hadoop_datanode_1
cc7642836c26        singularities/hadoop   "start-hadoop nameno…"   19 hours ago        Up 19 hours         0.0.0.0:8020->8020/tcp, 0.0.0.0:10020->10020/tcp, 0.0.0.0:13562->13562/tcp, 0.0.0.0:14000->14000/tcp, 9000/tcp, 50010/tcp, 0.0.0.0:19888->19888/tcp, 0.0.0.0:50070->50070/tcp, 50020/tcp, 50090/tcp, 50470/tcp, 0.0.0.0:50075->50075/tcp, 50475/tcp   hadoop_namenode_1
[root@localhost ~]# docker exec -it 01981e71197c /bin/bash

Here I picked the container with ID 01981e71197c and entered it.
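
You could equally refer to a container by the name shown in the last column of docker ps, for example:

docker exec -it hadoop_namenode_1 /bin/bash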

Now we can operate Hadoop from inside it.

(1) List the files in the HDFS root directory

root@01981e71197c:/# hadoop fs -ls /
Found 3 items
drwxr-xr-x   - hdfsuser hdfsuser          0 2020-08-25 09:01 /AA
drwxr-xr-x   - hdfsuser hdfsuser          0 2020-08-25 07:56 /hdfs
-rwxr-xr-x   1 hdfsuser hdfsuser        316 2020-08-25 10:02 /test.csv

Check that this matches what you saw in the web UI earlier.

(2) Upload a local file to HDFS

hadoop fs -put python.txt /
hadoop fs -ls /

Result:

root@01981e71197c:/# hadoop fs -put python.txt /
root@01981e71197c:/# hadoop fs -ls /
Found 4 items
drwxr-xr-x   - hdfsuser hdfsuser          0 2020-08-25 09:01 /AA
drwxr-xr-x   - hdfsuser hdfsuser          0 2020-08-25 07:56 /hdfs
-rw-r--r--   1 hdfsuser hdfsuser         11 2020-08-26 02:35 /python.txt
-rwxr-xr-x   1 hdfsuser hdfsuser        316 2020-08-25 10:02 /test.csv
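
To double-check the upload, you can print the file's contents straight from HDFS (it's tiny here; for large files you would pipe through head instead):

hadoop fs -cat /python.txt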

(3) HDFS supports many more operations; I won't list them all here. The built-in help covers them:

hadoop fs -help
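
For reference, a few of the subcommands you'll reach for most often (all standard hadoop fs operations, shown with paths from this walkthrough):

hadoop fs -mkdir /mydir            # create a directory
hadoop fs -get /python.txt .       # download a file from HDFS to the local disk
hadoop fs -rm -r /mydir            # delete a directory recursively
hadoop fs -du -h /                 # show how much space each entry uses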

By now HDFS probably doesn't seem like such a big deal. But I also want to operate HDFS remotely from Python.

The following steps are no longer run inside the Docker containers; any remote machine will do.

First install the client library:

pip install pyhdfs

Upload a file from the local machine:

import pyhdfs

# hosts = the NameNode's WebHDFS address (host:port); 50070 is the HTTP port we mapped earlier
client = pyhdfs.HdfsClient(hosts="192.168.XX.XX:50070", user_name="hdfsuser")
client.copy_from_local("/opt/python.py", "/test.csv")  # upload a local file to HDFS

However, this will very likely fail, because your machine probably can't resolve the container hostnames. The error looks like this:

Traceback (most recent call last):
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/AI/AN3.5.2/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f956cc24240>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='ffd6eb856746', port=50075): Max retries exceeded with url: /webhdfs/v1/test.csv?op=CREATE&user.name=hdfsuser&namenoderpcaddress=namenode:8020&createflag=&createparent=true&overwrite=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f956cc24240>: Failed to establish a new connection: [Errno -2] Name or service not known',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/pyhdfs/__init__.py", line 875, in copy_from_local
    self.create(dest, f, **kwargs)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/pyhdfs/__init__.py", line 504, in create
    metadata_response.headers['location'], data=data, **self._requests_kwargs)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/api.py", line 134, in put
    return request('put', url, data=data, **kwargs)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ffd6eb856746', port=50075): Max retries exceeded with url: /webhdfs/v1/test.csv?op=CREATE&user.name=hdfsuser&namenoderpcaddress=namenode:8020&createflag=&createparent=true&overwrite=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f956cc24240>: Failed to establish a new connection: [Errno -2] Name or service not known',))
>>> 

The cause is that the remote machine has no hosts entries for the containers, so it can't resolve their hostnames (WebHDFS redirects the upload to a datanode, addressed by hostname, as the traceback shows).

The fix is:

[root@localhost hd]# vim  /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

172.19.0.4      01981e71197c
172.19.0.5      ffd6eb856746
172.19.0.3      4e4d82b20c47
172.19.0.2      namenode

Each line maps a container's IP address to its container ID (which is also its hostname), plus one entry for the namenode.

If it still doesn't work, add the same entries inside every Docker container as well, so the hostnames can be resolved everywhere.
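
Rather than guessing the container IPs, you can ask Docker for them directly; this is standard docker inspect templating, using one of the container IDs from above:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 01981e71197c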

One catch: editing the hosts file needs vim, which isn't included in these containers, so it has to be installed with apt.

So install it in each container:

mv /etc/apt/sources.list /etc/apt/sources.list.bak

echo "deb http://mirrors.163.com/debian/ jessie main non-free contrib" >> /etc/apt/sources.list
echo "deb http://mirrors.163.com/debian/ jessie-proposed-updates main non-free contrib" >> /etc/apt/sources.list
echo "deb-src http://mirrors.163.com/debian/ jessie main non-free contrib" >> /etc/apt/sources.list
echo "deb-src http://mirrors.163.com/debian/ jessie-proposed-updates main non-free contrib" >> /etc/apt/sources.list
# refresh the package index
apt-get update
# install vim itself
apt-get install -y vim

Apart from uploads (and other calls that WebHDFS redirects to a datanode), the pyhdfs package can be used happily without any hosts configuration, since metadata operations only talk to the NameNode.
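
For example, directory listings and other metadata calls work right away. A minimal sketch, using the same placeholder address as before (note that reading file contents with open or copy_to_local is redirected to a datanode just like uploads, so it also needs the hosts entries):

import pyhdfs

client = pyhdfs.HdfsClient(hosts="192.168.XX.XX:50070", user_name="hdfsuser")

print(client.listdir("/"))                    # same listing as `hadoop fs -ls /`
print(client.get_file_status("/python.txt"))  # size, owner, permissions, ...
client.mkdirs("/from_python")                 # create a directory on HDFS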

 
