Hadoop is an indispensable platform and toolset for big data, so today let's play with it. It's genuinely simple, and this walkthrough will get the environment set up for you.
First, a quick overview of the concepts.
Hadoop has three main components:
- HDFS: Hadoop's distributed file system; all files are stored here.
- MapReduce: Hadoop's computation engine. Storing data alone is pointless without computation, but hardly anyone develops directly against MapReduce anymore — Hive has largely replaced it, and there are plenty of other engines, such as Spark.
- YARN: Hadoop's resource management and job scheduling framework.
With that in mind, the most valuable piece of today's Hadoop is really HDFS; the other parts have long since evolved into different forms.
Alright, let's get Hadoop running.
Getting Hadoop ready
We'll use Docker here; it makes everything much simpler.
1. Pull the Hadoop Docker image
docker pull singularities/hadoop
Note: the pull can take a while; be patient.
2. Create a docker-compose.yml file
This file describes the Docker containers that will run Hadoop.
mkdir hadoop
cd hadoop
vim docker-compose.yml
Write the following into docker-compose.yml and save it:
version: "2"

services:
  namenode:
    image: singularities/hadoop
    command: start-hadoop namenode
    hostname: namenode
    environment:
      HDFS_USER: hdfsuser
    ports:
      - "8020:8020"
      - "14000:14000"
      - "50070:50070"
      - "50075:50075"
      - "10020:10020"
      - "13562:13562"
      - "19888:19888"
  datanode:
    image: singularities/hadoop
    command: start-hadoop datanode namenode
    environment:
      HDFS_USER: hdfsuser
    links:
      - namenode
3. Start the Hadoop cluster
Before starting, install docker-compose if you don't have it:
pip install docker-compose
Note: run the following commands from the directory where you created docker-compose.yml.
Start the cluster (the second command scales out to three datanodes):
docker-compose up -d
docker-compose scale datanode=3
Once it's up, let's take a look:
[root@localhost hadoop]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
01981e71197c singularities/hadoop "start-hadoop datano…" 19 hours ago Up 19 hours 8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp hadoop_datanode_2
ffd6eb856746 singularities/hadoop "start-hadoop datano…" 19 hours ago Up 19 hours 8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp hadoop_datanode_3
4e4d82b20c47 singularities/hadoop "start-hadoop datano…" 19 hours ago Up 19 hours 8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp hadoop_datanode_1
cc7642836c26 singularities/hadoop "start-hadoop nameno…" 19 hours ago Up 19 hours 0.0.0.0:8020->8020/tcp, 0.0.0.0:10020->10020/tcp, 0.0.0.0:13562->13562/tcp, 0.0.0.0:14000->14000/tcp, 9000/tcp, 50010/tcp, 0.0.0.0:19888->19888/tcp, 0.0.0.0:50070->50070/tcp, 50020/tcp, 50090/tcp, 50470/tcp, 0.0.0.0:50075->50075/tcp, 50475/tcp hadoop_namenode_1
Here we see four running containers, which behave like four separate machines.
We can also open the NameNode web UI in a browser (port 50070 on the host):
Note: this is the IP of the host machine running Docker, not a container IP.
It works because docker-compose.yml maps the ports, so resources inside the containers are reachable from the host.
The page looks like this:
In the web UI we can also browse the files stored on HDFS.
Result:
Since I had already uploaded some files, entries show up here; on a first run the file system will be empty.
If you can see this page, Hadoop is healthy. If it doesn't open, check whether the relevant ports are open on the host.
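Before blaming Hadoop, you can check from any machine with Python whether the mapped ports are reachable at all. A minimal sketch (the host IP is whatever your Docker host uses; 50070 is the NameNode UI port mapped in the compose file above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_open("192.168.XX.XX", 50070)` should return True once the cluster is up and the firewall allows it.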
Operating Hadoop
Next we can work with files in Hadoop. For that we need to get inside a container; since Docker is running a cluster here, any one of them will do. First, look at the currently running containers.
[root@localhost ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
01981e71197c singularities/hadoop "start-hadoop datano…" 19 hours ago Up 19 hours 8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp hadoop_datanode_2
ffd6eb856746 singularities/hadoop "start-hadoop datano…" 19 hours ago Up 19 hours 8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp hadoop_datanode_3
4e4d82b20c47 singularities/hadoop "start-hadoop datano…" 19 hours ago Up 19 hours 8020/tcp, 9000/tcp, 10020/tcp, 13562/tcp, 14000/tcp, 19888/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp, 50470/tcp, 50475/tcp hadoop_datanode_1
cc7642836c26 singularities/hadoop "start-hadoop nameno…" 19 hours ago Up 19 hours 0.0.0.0:8020->8020/tcp, 0.0.0.0:10020->10020/tcp, 0.0.0.0:13562->13562/tcp, 0.0.0.0:14000->14000/tcp, 9000/tcp, 50010/tcp, 0.0.0.0:19888->19888/tcp, 0.0.0.0:50070->50070/tcp, 50020/tcp, 50090/tcp, 50470/tcp, 0.0.0.0:50075->50075/tcp, 50475/tcp hadoop_namenode_1
[root@localhost ~]# docker exec -it 01981e71197c /bin/bash
Here I picked the container with ID 01981e71197c and entered it.
Now we can operate Hadoop from inside it.
(1) List the files in the HDFS root directory
root@01981e71197c:/# hadoop fs -ls /
Found 3 items
drwxr-xr-x - hdfsuser hdfsuser 0 2020-08-25 09:01 /AA
drwxr-xr-x - hdfsuser hdfsuser 0 2020-08-25 07:56 /hdfs
-rwxr-xr-x 1 hdfsuser hdfsuser 316 2020-08-25 10:02 /test.csv
Check that this matches what we just browsed in the web UI.
(2) Upload a local file to HDFS
hadoop fs -put python.txt /
hadoop fs -ls /
Result:
root@01981e71197c:/# hadoop fs -put python.txt /
root@01981e71197c:/# hadoop fs -ls /
Found 4 items
drwxr-xr-x - hdfsuser hdfsuser 0 2020-08-25 09:01 /AA
drwxr-xr-x - hdfsuser hdfsuser 0 2020-08-25 07:56 /hdfs
-rw-r--r-- 1 hdfsuser hdfsuser 11 2020-08-26 02:35 /python.txt
-rwxr-xr-x 1 hdfsuser hdfsuser 316 2020-08-25 10:02 /test.csv
(3) HDFS supports many more operations; rather than list them all here, consult the built-in help:
hadoop fs -help
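If you end up scripting these commands, it helps to build the argument list in one place. A small illustrative wrapper (the helper names are hypothetical; actually running it requires the hadoop CLI on PATH, e.g. inside one of the containers):

```python
import subprocess

def hdfs_cmd(action, *args):
    """Build the argv for an `hadoop fs` subcommand such as -ls, -put, -help."""
    return ["hadoop", "fs", "-" + action, *args]

def run_hdfs(action, *args):
    """Execute the command and return its stdout (requires the hadoop CLI)."""
    result = subprocess.run(hdfs_cmd(action, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout

# hdfs_cmd("put", "python.txt", "/") -> ["hadoop", "fs", "-put", "python.txt", "/"]
```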
At this point, HDFS probably feels fairly unremarkable. But I also want to operate HDFS remotely from Python.
The following is no longer run inside the Docker containers — any remote machine will do.
First install the client library:
pip install pyhdfs
Upload a local file:
import pyhdfs
# hosts is "host:port" pointing at the NameNode's WebHDFS endpoint
# (port 50070, the one mapped in docker-compose.yml above)
client = pyhdfs.HdfsClient(hosts="192.168.XX.XX:50070", user_name="hdfsuser")
client.copy_from_local("/opt/python.py", "/test.csv")
There's a good chance this fails, though, because your machine doesn't recognize the container hostnames. The error looks like this:
Traceback (most recent call last):
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/opt/AI/AN3.5.2/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/opt/AI/AN3.5.2/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connection.py", line 166, in connect
conn = self._new_conn()
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f956cc24240>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='ffd6eb856746', port=50075): Max retries exceeded with url: /webhdfs/v1/test.csv?op=CREATE&user.name=hdfsuser&namenoderpcaddress=namenode:8020&createflag=&createparent=true&overwrite=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f956cc24240>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/pyhdfs/__init__.py", line 875, in copy_from_local
self.create(dest, f, **kwargs)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/pyhdfs/__init__.py", line 504, in create
metadata_response.headers['location'], data=data, **self._requests_kwargs)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/api.py", line 134, in put
return request('put', url, data=data, **kwargs)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/opt/AI/AN3.5.2/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ffd6eb856746', port=50075): Max retries exceeded with url: /webhdfs/v1/test.csv?op=CREATE&user.name=hdfsuser&namenoderpcaddress=namenode:8020&createflag=&createparent=true&overwrite=false (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f956cc24240>: Failed to establish a new connection: [Errno -2] Name or service not known',))
>>>
The cause: the remote machine has no hosts entries for the containers, so it cannot resolve their hostnames (note that the traceback fails while connecting to host 'ffd6eb856746' — a container ID used as a hostname).
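You can confirm this diagnosis before changing any config: the traceback dies inside `socket.getaddrinfo`, so a quick resolution check tells you whether a given hostname is the problem. A minimal sketch:

```python
import socket

def resolves(hostname):
    """Return True if this machine can resolve hostname to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False
```

Here, `resolves("ffd6eb856746")` will return False until /etc/hosts is updated.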
The fix is to add the container hostnames to /etc/hosts:
[root@localhost hd]# vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.19.0.4 01981e71197c
172.19.0.5 ffd6eb856746
172.19.0.3 4e4d82b20c47
172.19.0.2 namenode
Each line maps a container's IP to its hostname (the container ID, except for the namenode).
If it still doesn't work, add the same entries inside each container as well, so that every node can resolve these hostnames.
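With several containers it's easy to typo these lines, so you can also generate them from a mapping. A small sketch (the IPs can be read from `docker inspect`; the names and addresses below are just this article's example cluster):

```python
def hosts_lines(ip_by_name):
    """Format /etc/hosts lines from a {hostname: ip} mapping."""
    return "\n".join(f"{ip} {name}" for name, ip in ip_by_name.items())

# The values below are the example cluster from this article.
cluster = {
    "01981e71197c": "172.19.0.4",
    "ffd6eb856746": "172.19.0.5",
    "4e4d82b20c47": "172.19.0.3",
    "namenode": "172.19.0.2",
}
print(hosts_lines(cluster))
```

Append the output to /etc/hosts on the remote machine (and in each container if needed).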
Editing the hosts file requires vim, which these containers don't ship with, so it has to be installed via apt.
Do the following in each container:
mv /etc/apt/sources.list /etc/apt/sources.list.bak
echo "deb http://mirrors.163.com/debian/ jessie main non-free contrib" >> /etc/apt/sources.list
echo "deb http://mirrors.163.com/debian/ jessie-proposed-updates main non-free contrib" >>/etc/apt/sources.list
echo "deb-src http://mirrors.163.com/debian/ jessie main non-free contrib" >>/etc/apt/sources.list
echo "deb-src http://mirrors.163.com/debian/ jessie-proposed-updates main non-free contrib" >>/etc/apt/sources.list
# Update the package index and install vim
apt-get update
apt-get install -y vim
Apart from file uploads, everything in pyhdfs works fine without any hosts configuration. Uploads are the exception because, as the traceback above shows, the NameNode redirects the client to a DataNode by hostname.