Operating HDFS with Python

Install the hdfs package

  pip3 install hdfs

 

View the HDFS directory tree


[root@hadoop hadoop]# hdfs dfs -ls -R /

drwxr-xr-x - root supergroup 0 2017-05-18 23:57 /Demo

-rw-r--r-- 1 root supergroup 3494 2017-05-18 23:57 /Demo/hadoop-env.sh

drwxr-xr-x - root supergroup 0 2017-05-18 19:01 /logs

-rw-r--r-- 1 root supergroup 2223 2017-05-18 19:01 /logs/anaconda-ks.cfg

-rw-r--r-- 1 root supergroup 57162 2017-05-18 18:32 /logs/install.log

  

Create an HDFS client instance


#!/usr/bin/env python

# -*- coding:utf-8 -*-

__Author__ = 'kongZhaGen'

 

import hdfs

client = hdfs.Client("http://172.10.236.21:50070")

  

list: returns the names of the files and directories contained in a remote folder; raises an error if the path does not exist.

  hdfs_path: path to the remote folder

  status: also return each file's status information


def list(self, hdfs_path, status=False):

    """Return names of files contained in a remote folder.

 

    :param hdfs_path: Remote path to a directory. If `hdfs_path` doesn't exist

      or points to a normal file, an :class:`HdfsError` will be raised.

    :param status: Also return each file's corresponding FileStatus_.

 

    """

  Example:


print(client.list("/", status=False))

Result:

[u'Demo', u'logs']

  

status: get the status information of a file or directory on HDFS

  hdfs_path: path name

  strict:

    False: return None if the remote path does not exist

    True: raise an exception if the remote path does not exist


def status(self, hdfs_path, strict=True):

    """Get FileStatus_ for a file or folder on HDFS.

 

    :param hdfs_path: Remote path.

    :param strict: If `False`, return `None` rather than raise an exception if

      the path doesn't exist.

 

    .. _FileStatus: FS_

    .. _FS: http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus

 

    """

  Example:


print(client.status(hdfs_path="/Demoo", strict=False))

Result:

None

  

makedirs: create a directory on HDFS, recursively creating intermediate directories if necessary

  hdfs_path: remote directory name

  permission: permission to set on the newly created directory


def makedirs(self, hdfs_path, permission=None):

   """Create a remote directory, recursively if necessary.

 

   :param hdfs_path: Remote path. Intermediate directories will be created

     appropriately.

   :param permission: Octal permission to set on the newly created directory.

     These permissions will only be set on directories that do not already

     exist.

 

   This function currently has no return value as WebHDFS doesn't return a

   meaningful flag.

 

   """

  Example:

  To create HDFS directories from a remote client script, first modify hdfs-site.xml:


<property>

<name>dfs.permissions</name>

<value>false</value>

</property>

  Then restart HDFS:


stop-dfs.sh

start-dfs.sh

  Create directories recursively:


client.makedirs("/data/rar/tmp", permission=755)

  

rename: move a file or folder

  hdfs_src_path: source path

  hdfs_dst_path: destination path. If the path exists and is a directory, the source is moved into it; if the path exists and is a file, an exception is raised.


def rename(self, hdfs_src_path, hdfs_dst_path):

    """Move a file or folder.

 

    :param hdfs_src_path: Source path.

    :param hdfs_dst_path: Destination path. If the path already exists and is

      a directory, the source will be moved into it. If the path exists and is

      a file, or if a parent destination directory is missing, this method will

      raise an :class:`HdfsError`.

 

    """

  Example:


client.rename("/SRC_DATA", "/dest_data")

  

delete: remove a file or directory from HDFS

  hdfs_path: path on the HDFS file system

  recursive: for a non-empty directory, True deletes it recursively; False raises an exception.


def delete(self, hdfs_path, recursive=False):

    """Remove a file or directory from HDFS.

 

    :param hdfs_path: HDFS path.

    :param recursive: Recursively delete files and directories. By default,

      this method will raise an :class:`HdfsError` if trying to delete a

      non-empty directory.

 

    This function returns `True` if the deletion was successful and `False` if

    no file or directory previously existed at `hdfs_path`.

 

    """

  Example:


client.delete("/dest_data", recursive=True)

  

upload: upload a file or directory to the HDFS file system. If the target directory already exists, the file or directory is uploaded into it; otherwise the target is created.


def upload(self, hdfs_path, local_path, overwrite=False, n_threads=1,

    temp_dir=None, chunk_size=2 ** 16, progress=None, cleanup=True, **kwargs):

    """Upload a file or directory to HDFS.

 

    :param hdfs_path: Target HDFS path. If it already exists and is a

      directory, files will be uploaded inside.

    :param local_path: Local path to file or folder. If a folder, all the files

      inside of it will be uploaded (note that this implies that folders empty

      of files will not be created remotely).

    :param overwrite: Overwrite any existing file or directory.

    :param n_threads: Number of threads to use for parallelization. A value of

      `0` (or negative) uses as many threads as there are files.

    :param temp_dir: Directory under which the files will first be uploaded

      when `overwrite=True` and the final remote path already exists. Once the

      upload successfully completes, it will be swapped in.

    :param chunk_size: Interval in bytes by which the files will be uploaded.

    :param progress: Callback function to track progress, called every

      `chunk_size` bytes. It will be passed two arguments, the path to the

      file being uploaded and the number of bytes transferred so far. On

      completion, it will be called once with `-1` as second argument.

    :param cleanup: Delete any uploaded files if an error occurs during the

      upload.

    :param \*\*kwargs: Keyword arguments forwarded to :meth:`write`.

 

    On success, this method returns the remote upload path.

 

    """

  Example:


>>> import hdfs

>>> client=hdfs.Client("http://172.10.236.21:50070")

>>> client.upload("/logs","/root/training/jdk-7u75-linux-i586.tar.gz")

'/logs/jdk-7u75-linux-i586.tar.gz'

>>> client.list("/logs")

[u'anaconda-ks.cfg', u'install.log', u'jdk-7u75-linux-i586.tar.gz']

  

content: get summary information for a file or directory on HDFS


print(client.content("/logs/install.log"))

Result:

{u'spaceConsumed': 57162, u'quota': -1, u'spaceQuota': -1, u'length': 57162, u'directoryCount': 0, u'fileCount': 1}

  

write: create a file on the HDFS file system; the data can be a string, a generator, or a file object.


def write(self, hdfs_path, data=None, overwrite=False, permission=None,

    blocksize=None, replication=None, buffersize=None, append=False,

    encoding=None):

    """Create a file on HDFS.

 

    :param hdfs_path: Path where to create file. The necessary directories will

      be created appropriately.

    :param data: Contents of file to write. Can be a string, a generator or a

      file object. The last two options will allow streaming upload (i.e.

      without having to load the entire contents into memory). If `None`, this

      method will return a file-like object and should be called using a `with`

      block (see below for examples).

    :param overwrite: Overwrite any existing file or directory.

    :param permission: Octal permission to set on the newly created file.

      Leading zeros may be omitted.

    :param blocksize: Block size of the file.

    :param replication: Number of replications of the file.

    :param buffersize: Size of upload buffer.

    :param append: Append to a file rather than create a new one.

    :param encoding: Encoding used to serialize data written.

    """
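  Unlike the methods above, no example is given for write; here is a minimal sketch (the paths are hypothetical, and `client` is assumed to be an `hdfs.Client` connected as shown earlier):

```python
def write_examples(client):
    """Sketch of hdfs.Client.write usage; the paths are hypothetical."""
    # 1) Write a string in a single call; overwrite an existing file if any.
    client.write("/data/hello.txt", data="hello hdfs\n",
                 overwrite=True, encoding="utf-8")
    # 2) With data=None, write() returns a file-like object intended to be
    #    used in a `with` block, which allows streaming writes without
    #    loading everything into memory.
    with client.write("/data/stream.txt", overwrite=True,
                      encoding="utf-8") as writer:
        writer.write("line 1\n")
        writer.write("line 2\n")
```

  Passing a generator or file object as `data` works the same way and likewise streams the upload.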

 

Common problems:

(1) Cannot import the hdfs package

Check whether the package is installed under the python2 or the python3 site-packages directory; many systems ship with python2 while python3 was installed separately. Either run the script with python3, or declare the interpreter version in the script.

Declare the interpreter version: #!/usr/bin/python3

Or start the interactive shell with python3.

Importing the package from a python2 shell may raise an error.

(2) Connection fails with "Name or service not known":

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/hdfs/client.py", line 611, in upload
    raise err
  File "/usr/local/lib/python3.6/site-packages/hdfs/client.py", line 600, in upload
    _upload(path_tuple)
  File "/usr/local/lib/python3.6/site-packages/hdfs/client.py", line 531, in _upload
    self.write(_temp_path, wrap(reader, chunk_size, progress), **kwargs)
  File "/usr/local/lib/python3.6/site-packages/hdfs/client.py", line 477, in write
    consumer(data)
  File "/usr/local/lib/python3.6/site-packages/hdfs/client.py", line 469, in consumer
    data=(c.encode(encoding) for c in _data) if encoding else _data,
  File "/usr/local/lib/python3.6/site-packages/hdfs/client.py", line 214, in _request
    **kwargs
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 467, in send
    low_conn.endheaders()
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 172, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f2b34f23f98>: Failed to establish a new connection: [Errno -2] Name or service not known

This happens because not all of the Hadoop cluster's nodes are listed in the /etc/hosts file. Add them to /etc/hosts.
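
For reference, a sketch of what the /etc/hosts entries on the client machine might look like (the first IP is taken from the examples above; the other addresses and all hostnames are hypothetical and must match your own cluster):

```
172.10.236.21  hadoop
172.10.236.22  hadoop-slave1
172.10.236.23  hadoop-slave2
```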

 
