Big Data Technology: HBase Programming with Python

Latest recommended article published on 2024-01-10 20:11:01

laufing

Article link: https://blog.csdn.net/weixin_45228198/article/details/119250610

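This article explains in detail how to use the Thrift framework for cross-language communication with HBase, a distributed database written in Java, and in particular how to build a Python client that creates and deletes tables and inserts, deletes, queries, and updates data. By defining interfaces in an IDL and generating client and server code for different languages, Thrift enables efficient communication between languages such as C++, Java, and Python. The article also gives the steps for installing Thrift on Ubuntu and shows concrete Python programming examples.

Summary generated by CSDN using AI.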

The Thrift Service

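HBase is a distributed database written in Java. Besides its native Java API, it can be accessed from other languages, but doing so requires the Thrift service to translate between languages.
Thrift is a client/server software framework that defines an interface definition language (IDL) for describing objects and services, and it is used for scalable, cross-language service development. It combines a software stack with a code generation engine: RPC interfaces and data types are defined in the IDL, and the Thrift compiler turns a user-defined service interface file into client and server code in different languages, for example a Python client and a Java server. The generated code implements the protocol and transport layers, which enables efficient communication among languages such as C++, Java, Python, PHP, and JavaScript.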

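Thrift has three important components: Protocol, Transport, and Server. Protocol is the protocol layer, which defines the format of the transmitted data. Transport is the transport layer, which defines how the data is moved; this can be TCP/IP or shared memory. Server defines the service model.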
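The separation between the protocol layer and the transport layer can be illustrated without Thrift itself. The toy sketch below is not Thrift's wire format: `encode_call`, `decode_call`, and the length-prefixed layout are made up purely for illustration. It only shows that the code deciding what the bytes look like is independent of the code deciding where the bytes go.

```python
import io
import struct


def encode_call(method, arg):
    """'Protocol' layer: turn a method name and one bytes argument into bytes."""
    name = method.encode("utf-8")
    # Hypothetical layout: 4-byte length + name, then 4-byte length + argument.
    return struct.pack(">I", len(name)) + name + struct.pack(">I", len(arg)) + arg


def decode_call(stream):
    """Read one encoded call back from any file-like 'transport'."""
    (name_len,) = struct.unpack(">I", stream.read(4))
    name = stream.read(name_len).decode("utf-8")
    (arg_len,) = struct.unpack(">I", stream.read(4))
    return name, stream.read(arg_len)


# 'Transport' layer: an in-memory buffer here; any file-like object would do.
transport = io.BytesIO()
transport.write(encode_call("getRow", b"001"))
transport.seek(0)
print(decode_call(transport))
```

Swapping `io.BytesIO` for another file-like object would leave `encode_call` and `decode_call` unchanged, which is the point of keeping the two layers apart.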

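Apache official website
All Apache projects are listed in the project list at the bottom of that page.
Thrift official website
Latest Thrift release
Historical Thrift releases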

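Installing Thrift

Download Thrift (Ubuntu 18.04 is used as the example)

# download with a browser
https://archive.apache.org/dist/thrift/0.11.0/thrift-0.11.0.tar.gz
# download with wget
wget https://archive.apache.org/dist/thrift/0.11.0/thrift-0.11.0.tar.gz

Extract the archive to a directory of your choice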

Install the Thrift dependencies

sudo apt-get install automake bison flex g++ git libboost-all-dev libevent-dev libssl-dev libtool make pkg-config

Build and install

tarena@tedu:~/thrift-0.11.0$ ./configure
tarena@tedu:~/thrift-0.11.0$ make
tarena@tedu:~/thrift-0.11.0$ sudo make install

Verify the installation

tarena@tedu:~/thrift-0.11.0$ thrift -version
Thrift version 0.11.0

Thrift is now installed on the server.

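Start the HBase Thrift service

$ hbase-daemon.sh start thrift   # binds to 0.0.0.0:9090 by default
# bind to a specific address and port, for example for remote clients
$ hbase-daemon.sh start thrift -b ip -p 9090

Run jps to check the processes; a ThriftServer process should be listed, listening on port 9090.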
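Besides jps, a quick TCP connection attempt can confirm that something is listening on the Thrift port. The helper below is a minimal sketch using only the standard library; `is_port_open` is a name chosen here, and a successful connection shows only that the port accepts connections, not that the listener is the HBase ThriftServer.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:9090 matches the default used by the client code in this article.
    print(is_port_open("127.0.0.1", 9090))
```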

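Compile the Hbase.thrift file to generate the Python client library
This step requires the source (src) distribution of HBase.
Compile the following interface file, which defines the HBase service:
hbase1.4.13-src/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift

$ thrift --gen py /usr/local/hbase1.4.13-src/hbase-thrift/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift

This creates a gen-py folder in the current directory. Copy the hbase package from that folder into the dist-packages/ directory of Python 3.
Use sys.path to locate the right dist-packages directory. You also need to know which interpreter the python3 command actually runs, for example Python 3.5 or 3.6,
and copy the package into the dist-packages directory of that version.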
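The lookup described above can be scripted. The following sketch prints the version of the interpreter that runs it and the dist-packages or site-packages entries found on sys.path; `package_dirs` is a helper name introduced here for illustration. Run it with the same python3 command that will later run the HBase client.

```python
import sys


def package_dirs(paths=None):
    """Return the path entries that are dist-packages or site-packages directories."""
    paths = sys.path if paths is None else paths
    suffixes = ("dist-packages", "site-packages")
    return [p for p in paths if p.rstrip("/\\").endswith(suffixes)]


if __name__ == "__main__":
    # The interpreter version tells you which pythonX.Y directory applies.
    print("Python %d.%d" % sys.version_info[:2])
    for path in package_dirs():
        print(path)
```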

The hbase package:

/hadoop/hbase/thrift/gen-py/hbase$ ls
constants.py  Hbase.py  Hbase-remote  __init__.py  ttypes.py

Install the thrift package

sudo pip3 install thrift

Python Programming in Practice

Do not run these operations in IPython3; an exception is raised after the first operation.

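Table operations in Python

List all table names
client.getTableNames()  # returns the table names as a list of byte strings
Get the schema of a table
client.getColumnDescriptors(b"stu3")
Create a table
cf1 = ColumnDescriptor(name=b"StuInfo")
client.createTable(b"stu4",[cf1,cf2,…])
Delete a table (it must be disabled first)
client.disableTable(b"stu3")
client.deleteTable(b"stu3")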
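Because getTableNames() returns byte strings, small helpers can make the common checks more convenient. The sketch below assumes only that the client object has a getTableNames() method that behaves as described above; `table_exists`, `decoded_table_names`, and the `FakeClient` stand-in are names introduced here so the example runs without an HBase server.

```python
def table_exists(client, table_name):
    """Return True if table_name (bytes) is among the names the client reports."""
    return table_name in client.getTableNames()


def decoded_table_names(client, encoding="utf-8"):
    """Return the table names as str instead of bytes."""
    return [name.decode(encoding) for name in client.getTableNames()]


class FakeClient:
    """Stand-in used only for this demo; it mimics getTableNames()."""

    def getTableNames(self):
        return [b"stu3", b"stu4"]


demo = FakeClient()
print(table_exists(demo, b"stu3"))
print(decoded_table_names(demo))
```

With a real connection, the same functions would be called with the generated Hbase.Client instance instead of `FakeClient`.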


# transport package: TSocket and TTransport modules
from thrift.transport import TSocket
from thrift.transport import TTransport
# binary protocol
from thrift.protocol import TBinaryProtocol

# hbase package generated by the Thrift compiler: Hbase module and its types
from hbase import Hbase
from hbase.ttypes import ColumnDescriptor, AlreadyExists

class HbaseClient(object):
    def __init__(self, host='127.0.0.1', port=9090):
        # socket connection to the ThriftServer
        conn = TSocket.TSocket(host,port)
        # transport layer
        transport = TTransport.TBufferedTransport(conn)

        # protocol layer
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        # create the client instance
        self.client = Hbase.Client(protocol)

        # open the transport
        transport.open()

    def get_tables(self):
        """
        Return all table names
        """
        return self.client.getTableNames()  # list of byte strings

    def create_table(self, tableName, *columns):
        """
        Create a table
        tableName: table name, byte string
        columns: column family names, byte strings
        """
        # A descriptor can also carry options, for example:
        # cd = ColumnDescriptor(name=b"StuInfo",maxVersions=3,timeToLive=10)  # cells expire after 10 s
        try:
            self.client.createTable(tableName, [ColumnDescriptor(name=column) for column in columns])
        except AlreadyExists:
            print("Table %s already exists" % tableName.decode())

    def delete_table(self,tableName):
        """
        Delete a table
        :param tableName: table name, byte string
        :return: None
        """
        # a table must be disabled before it can be deleted
        self.client.disableTable(tableName)
        self.client.deleteTable(tableName)

        print("Table %s has been deleted" % tableName.decode())

if __name__ == "__main__":
    client = HbaseClient()
    print(client.get_tables())

    # create a table with two column families
    client.create_table(b"stu2",b"info",b"score")

    # delete the table
    client.delete_table(b"stu2")
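The HbaseClient above opens the transport but never closes it. One way to make sure the connection is released, even when an operation raises, is a small context manager. This is a sketch under the assumption that the transport object offers open() and close(), as the Thrift transports used in this article do; `opened` and the `FakeTransport` stand-in are names introduced here so the example runs without a server.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open the transport, yield it, and always close it afterwards."""
    transport.open()
    try:
        yield transport
    finally:
        transport.close()


class FakeTransport:
    """Stand-in for a Thrift transport; only open() and close() are mimicked."""

    def __init__(self):
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False


fake = FakeTransport()
with opened(fake) as t:
    print(t.is_open)
print(fake.is_open)
```

In the client class, this would mean storing the transport on the instance and wrapping the calls in `with opened(transport):` instead of calling `transport.open()` directly.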

Data operations in Python

Inserting data

from hbase.ttypes import Mutation       # the value of one cell; a list of them writes one row
from hbase.ttypes import BatchMutation  # the mutations of one row; a list of them writes several rows

# insert one or more cells into a single row
mutations = [Mutation(column=b"StuInfo:Name",value=b"Jack")]  # append more Mutation objects as needed
client.mutateRow(b"stu3",b"001",mutations,attributes={})

# insert several rows at once
rows = [
    BatchMutation(row=b"2",mutations=[Mutation(column=b"StuInfo:Name",value=b"jack"),Mutation(column=b"StuInfo:Age",value=b"13")]),
    BatchMutation(row=b"3",mutations=[Mutation(column=b"StuInfo:Name",value=b"Lucy"),Mutation(column=b"StuInfo:Age",value=b"18")]),
    ]
client.mutateRows(b"stu4",rows,attributes={})
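Since every table name, row key, column, and value in the calls above is a byte string, it can help to validate and encode ordinary str data before building the Mutation objects. The helper below is a sketch; `encode_cells` is a name introduced here, and the check only enforces the family:qualifier shape used by the columns in this article.

```python
def encode_cells(cells, encoding="utf-8"):
    """Validate 'family:qualifier' keys and return (column, value) byte pairs."""
    pairs = []
    for column, value in cells.items():
        family, sep, _qualifier = column.partition(":")
        if not sep or not family:
            raise ValueError("column must look like 'family:qualifier': %r" % column)
        pairs.append((column.encode(encoding), value.encode(encoding)))
    return pairs


pairs = encode_cells({"StuInfo:Name": "Jack", "StuInfo:Age": "13"})
print(pairs)
# With the generated classes imported, the pairs could feed Mutation objects:
# mutations = [Mutation(column=c, value=v) for c, v in pairs]
```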

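Deleting data
client.deleteAll(b"stu3",b"001",b"StuInfo:Name",attributes={})  # delete a column or a column family of a row
client.deleteAllRow(b"stu3",b"001",attributes={})  # delete a whole row
Querying data
Cell data
cells = client.get(b"stu3",b"003",b"StuInfo:Name",attributes={})  # get a single cell; returns [TCell,…]
cells[0].value, cells[0].timestamp  # value and timestamp of the cell

Get a row
rows = client.getRow(b"stu3",b"001",attributes={})  # returns [TRowResult, ]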

[TRowResult(row=b'001', columns={b'Grades:BigData': TCell(value=b'80', timestamp=2), b'Grades:Computer': TCell(value=b'90', timestamp=2), b'Grades:Math': TCell(value=b'85', timestamp=2), b'StuInfo:Age': TCell(value=b'19', timestamp=3), b'StuInfo:Class': TCell(value=b'02', timestamp=2), b'StuInfo:Name': TCell(value=b'Tom Green', timestamp=1), b'StuInfo:Sex': TCell(value=b'Male', timestamp=1)}, sortedColumns=None)]

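tRowResult.row      # the row key
tRowResult.columns  # dict mapping each column name to its TCell

Get several columns of a row: client.getRowWithColumns
client.getRowWithColumns(b"stu3",b"001",[b"StuInfo:Name",b"StuInfo:Age"],attributes={})
Returns a list of TRowResult objects.
client.getRowWithColumnsTs works the same way but returns only versions whose timestamp is less than the given timestamp.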
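The nested TRowResult and TCell structure is easier to work with once it is flattened into plain Python types. The sketch below relies only on the attributes shown in the output above (row, columns, value, timestamp); `row_to_dict` is a name introduced here, and SimpleNamespace objects stand in for the generated classes so the example runs without a server.

```python
from types import SimpleNamespace


def row_to_dict(row_result, encoding="utf-8"):
    """Convert a TRowResult-like object into {'row': str, 'cells': {column: (value, ts)}}."""
    cells = {
        column.decode(encoding): (cell.value.decode(encoding), cell.timestamp)
        for column, cell in row_result.columns.items()
    }
    return {"row": row_result.row.decode(encoding), "cells": cells}


# Stand-in objects shaped like the TRowResult/TCell output shown above.
sample = SimpleNamespace(
    row=b"001",
    columns={
        b"StuInfo:Name": SimpleNamespace(value=b"Tom Green", timestamp=1),
        b"StuInfo:Age": SimpleNamespace(value=b"19", timestamp=3),
    },
)
print(row_to_dict(sample))
```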

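Scanning data
client.scannerOpen(table,startRow,columns,attributes)
table: table name, byte string
startRow: row key at which the scan starts
columns: a list of column families returns every column in those families; a list of columns returns only those columns
Returns a scannerId
Fetch data through the scannerId
client.scannerGet(scannerId): one row
client.scannerGetList(scannerId,n): up to n rows

# scan the columns listed in columns, from row 001 to the end of the table
scannerId = client.scannerOpen(b"stu3",b"001",columns,attributes={})
# fetch one row from the scanner; the scanner advances, so the next call starts at the following row
tRowResult_list = client.scannerGet(scannerId)
# fetch up to n rows from the scanner
tRowResult_lists = client.scannerGetList(scannerId,n)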
#[TRowResult(row=b'002', columns={b'StuInfo:Age': TCell(value=b'18', timestamp=1631895046584), b'StuInfo:Name': TCell(value=b'Tom Green', timestamp=1631895046584)}, sortedColumns=None), TRowResult(row=b'003', columns={b'StuInfo:Age': TCell(value=b'20', timestamp=1631895046588), b'StuInfo:Name': TCell(value=b'Lucy', timestamp=1631895046588)}, sortedColumns=None)]

If n is large enough, a single call returns all remaining scan results.
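Instead of guessing a value of n that is large enough, the scanner can be drained in fixed-size batches. The generator below is a sketch that assumes scannerGetList returns an empty list once the scanner is exhausted; `scan_rows` and the `FakeClient` stand-in are names introduced here so the example runs without a server.

```python
def scan_rows(client, scanner_id, batch_size=100):
    """Yield rows from an open scanner batch by batch, then close the scanner."""
    try:
        while True:
            batch = client.scannerGetList(scanner_id, batch_size)
            if not batch:
                break
            for row in batch:
                yield row
    finally:
        client.scannerClose(scanner_id)


class FakeClient:
    """Stand-in for demonstration; mimics scannerGetList() and scannerClose()."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.closed = False

    def scannerGetList(self, scanner_id, n):
        batch, self.rows = self.rows[:n], self.rows[n:]
        return batch

    def scannerClose(self, scanner_id):
        self.closed = True


fake = FakeClient([b"002", b"003", b"004"])
print(list(scan_rows(fake, scanner_id=1, batch_size=2)))
print(fake.closed)
```

Batching keeps each response small, and the finally clause closes the scanner even if the caller stops iterating early or an error occurs.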

PrefixFilter
Scan the rows whose row key starts with a given prefix

ss = client.scannerOpenWithPrefix(b"stu3",b"xx5",[b"StuInfo:Name",],attributes={})
print("ss:",client.scannerGetList(ss,100))
# close the scanner
client.scannerClose(ss)
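Because HBase orders row keys lexicographically by their bytes, a prefix scan covers the same rows as a range scan from the prefix up to the smallest key that no longer starts with it. The sketch below computes that upper bound; `prefix_stop_row` is a name introduced here and is not part of the Thrift API.

```python
def prefix_stop_row(prefix):
    """Return the smallest row key greater than every key starting with prefix.

    Returns b"" when no such key exists (the prefix is empty or consists only
    of 0xFF bytes), which callers can treat as "no upper bound".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    # Increment the last byte that is not 0xFF and drop everything after it.
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


print(prefix_stop_row(b"xx5"))
```

The result could be used as the stop row of a range scan that starts at the prefix itself, with the empty case handled separately by the caller.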

Using other filters
Use a TScan object

# finer-grained filtered scan
from hbase.ttypes import TScan
scan = TScan(startRow=b"002",stopRow=b"005",filterString=b"ColumnPrefixFilter('A')")  # filterString uses the same syntax as the HBase shell
scannerId1 = client.scannerOpenWithScan(b"stu3",scan,attributes={})
print(client.scannerGetList(scannerId1,100))
# close the scanner
client.scannerClose(scannerId1)

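# combined filter conditions
from hbase.ttypes import TScan
tscan = TScan(startRow=b"1",stopRow=b"9",filterString=b"SingleColumnValueFilter('StuInfo','Age',<,'binary:20') AND "
    b"SingleColumnValueFilter('StuInfo','Sex',=,'substring:Female')")
# open the scan with the TScan object (the table stu4 is assumed here)
scannerId = client.scannerOpenWithScan(b"stu4",tscan,attributes={})
trows_list = client.scannerGetList(scannerId,100)

# scan with a start row and a stop row; stopRow itself is not included
scannerId = client.scannerOpenWithStop(b"stu4",startRow=b"1",stopRow=b"3",columns=[b"StuInfo",b"Grades"],attributes={})
trows_list = client.scannerGetList(scannerId,100)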

Filtering up to and including the stop row

from hbase.ttypes import TScan
scan  = TScan(startRow=b"1",filterString=b"InclusiveStopFilter('2')")
scannerId = client.scannerOpenWithScan(b"stu4",scan,attributes={})
trows_list = client.scannerGetList(scannerId,100)
print(trows_list)
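Filter strings such as the combined SingleColumnValueFilter expression used earlier can be assembled from small functions instead of being typed by hand. This is a sketch: `single_column_value_filter` and `all_of` are helper names introduced here, and they only produce text in the same form as the filter strings in this article without validating it.

```python
def single_column_value_filter(family, qualifier, op, comparator):
    """Build a SingleColumnValueFilter expression as text."""
    return "SingleColumnValueFilter('%s','%s',%s,'%s')" % (family, qualifier, op, comparator)


def all_of(*filters):
    """Join filter expressions with AND and encode them for TScan.filterString."""
    return " AND ".join(filters).encode("utf-8")


filter_string = all_of(
    single_column_value_filter("StuInfo", "Age", "<", "binary:20"),
    single_column_value_filter("StuInfo", "Sex", "=", "substring:Female"),
)
print(filter_string)
```

The resulting bytes could then be passed as the filterString argument of TScan.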

Complete example code:


from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase
from hbase.ttypes import ColumnDescriptor,AlreadyExists,Mutation

class HbaseClient(object):
    def __init__(self, host='127.0.0.1', port=9090):
        # socket connection to the ThriftServer
        conn = TSocket.TSocket(host,port)
        # transport layer
        transport = TTransport.TBufferedTransport(conn)

        # protocol layer
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        # create the client instance
        self.client = Hbase.Client(protocol)
        # keep a reference so the transport can be closed later
        self.transport = transport
        # open the transport
        transport.open()

    def insert_data(self,tableName,row,column,value):
        """
            Insert a single cell
            tableName: table name, byte string
            row: row key, byte string
            column: column, byte string
            value: value, byte string
        """
        mutations = [Mutation(column=column,value=value),]
        self.client.mutateRow(tableName,row,mutations,attributes={})
        print("Inserted a single cell")

    def insert_datas(self,tableName,dict_):
        """
            Insert several cells into several rows
            tableName: table name, byte string
            dict_: dict of str keys and values, {row key: {column: value}}
        """
        # insert row by row
        for key,value in dict_.items():
            mutations = []
            for k,v in value.items():
                mutations.append(Mutation(column=k.encode(),value=v.encode()))

            self.client.mutateRow(tableName,key.encode(),mutations,attributes={})
        print("Inserted multiple cells")

    def get_cell(self,tableName,row,column,attributes={}):
        """
        Query a single cell; returns a list
        tableName: byte string
        row: byte string
        column: byte string
        attributes: defaults to {}
        """
        return self.client.get(tableName,row,column,attributes=attributes)

    def get_row(self, tableName, row, attributes={}):
        """
        Get one row; returns a list of TRowResult
        tableName: table name, byte string
        row: row key, byte string
        """
        return self.client.getRow(tableName,row,attributes=attributes)

    def get_row_with_columns(self,tableName,row,columns,attributes={}):
        """
        Get several columns of one row; returns a list
        :param tableName: table name, byte string
        :param row: row key, byte string
        :param columns: [b"family:qualifier",]
        :return: list of TRowResult
        """
        return self.client.getRowWithColumns(tableName,row,columns,attributes=attributes)

    def get_row_ts(self,tableName,row,ts,attributes={}):
        """
        Get the most recent version whose timestamp is less than ts
        :param tableName: table name, byte string
        :param row: row key, byte string
        :param ts: timestamp, int
        :param attributes:
        :return: list of TRowResult
        """
        return self.client.getRowTs(tableName,row,ts,attributes=attributes)

    def get_row_with_columns_ts(self,tableName,row,columns,ts,attributes={}):
        """
        Get several columns of one row, most recent version with a timestamp less than ts
        :param tableName: table name, byte string
        :param row: row key, byte string
        :param columns: [b"family:qualifier",]
        :param ts: timestamp, int
        :param attributes:
        :return: list of hbase.ttypes.TRowResult
        """
        return self.client.getRowWithColumnsTs(tableName,row,columns,ts,attributes=attributes)

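    def delete_all(self,tableName,row,col,attributes={}):
        """
        Delete a cell or a column family of one row
        :param tableName: table name, byte string
        :param row: row key, byte string
        :param col: column or column family, byte string
        :param attributes:
        :return: None
        """
        self.client.deleteAll(tableName, row, col,attributes=attributes)

        print("Data deleted")

    def delete_all_row(self,tableName,row,attributes={}):
        """
            Delete one row
            tableName: table name, byte string
            row: row key, byte string
        """
        self.client.deleteAllRow(tableName,row,attributes=attributes)
        print("Deleted row %r from table %r" % (row.decode(),tableName.decode()))

    def delete_all_ts(self,tableName,row,col,ts,attributes={}):
        """
        Delete cell or column family data with a timestamp less than ts
        :param tableName: table name, byte string
        :param row: row key, byte string
        :param col: column or column family, byte string
        :param ts: timestamp, int
        :param attributes:
        :return: None
        """
        # the generated method also expects the attributes argument
        self.client.deleteAllTs(tableName,row,col,ts,attributes=attributes)

    def scannerOpen(self,tableName,row,columns,attributes={}):
        """
            Scan the table from row to the end and fetch the given columns
            Returns a scannerId
        """
        return self.client.scannerOpen(tableName,row,columns,attributes=attributes)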


if __name__ == "__main__":
    client = HbaseClient()

    # insert a single cell
    client.insert_data(b"stu3",b"001",b"StuInfo:Name",b"Jack")
    # insert several cells into several rows
    my_dict = {
        "001":{"StuInfo:Age":"23","StuInfo:Sex":"Male"},
        "002":{"StuInfo:Name":"Tom Green","StuInfo:Age":"18"},
        "003":{"StuInfo:Name":"Lucy","StuInfo:Age":"20","StuInfo:Class":"2"},
        "004":{"StuInfo:Name":"Jack4"},
        "005":{"StuInfo:Name":"Jack5"},
    }
    client.insert_datas(b"stu3",my_dict)

    # delete a single cell
    client.delete_all(b'stu3',b'001',b'StuInfo:Name')
    # delete a whole row
    client.delete_all_row(b"stu3",b"001")

    # query a cell or a column family; returns a list of TCell
    cellValue = client.get_cell(b"stu3",b"003",b"StuInfo:Name")
    print("single cell:",cellValue)
    print("cell value:%r, cell timestamp:%r"%(cellValue[0].value,cellValue[0].timestamp))
    # query one row
    row = client.get_row(b"Student",b"001")  # returns the row inside a list
    print("single row:",row)
    print("rowKey:",row[0].row)
    print("kv:",row[0].columns)
    print("TCell:",row[0].columns[b"Grades:BigData"])
    print("value of TCell:",row[0].columns[b"Grades:BigData"].value)
    print("timestamp of TCell:",row[0].columns[b"Grades:BigData"].timestamp)

    # get several columns of one row
    r2 = client.get_row_with_columns(b"Student",b"001",[b"StuInfo:Name",b"Grades:BigData"])
    print("several columns of one row:",r2)

    # row versions with a timestamp less than the given one
    # r3 = client.get_row_ts(b"Student",b"001",3)
    # print("less than ts:",r3)

    # several columns of one row, versions with a timestamp less than the given one
    # r4 = client.get_row_with_columns_ts(b"Student",b"001",[b"StuInfo:Name",],2)
    # print("multiple columns and less than ts:",r4)

    # scan data
    scannerId = client.scannerOpen(b"stu3",b"004",[b"StuInfo:Name",b"StuInfo:Age"])
    print("scanner id:",scannerId)
    # fetch one row from the scanner
    # row = client.client.scannerGet(scannerId)
    # print("one row of the scan result:",row)
    # fetch several rows from the scanner
    rows = client.client.scannerGetList(scannerId,100)
    print("rows of the scan result (up to 100):",rows)

    # close the scanner
    client.client.scannerClose(scannerId)

    # PrefixFilter
    ss = client.client.scannerOpenWithPrefix(b"stu3",b"xx5",[b"StuInfo:Name",],attributes={})
    print("ss:",client.client.scannerGetList(ss,100))
    client.client.scannerClose(ss)

    # finer-grained filtered scan
    from hbase.ttypes import TScan
    # filterString uses the same syntax as filters in the HBase shell
    scan = TScan(startRow=b"002",stopRow=b"005",filterString=b"ColumnPrefixFilter('A')")
    scannerId1 = client.client.scannerOpenWithScan(b"stu3",scan,attributes={})
    print(client.client.scannerGetList(scannerId1,100))
    client.client.scannerClose(scannerId1)

    # finally, close the transport
    client.transport.close()