Getting a Hive table's row count and storage size from the metastore database
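-- One row per table; tables with no collected statistics show NULL
SELECT
    a.TBL_ID,
    d.`NAME` AS dbName,
    a.TBL_NAME,
    b.PARAM_VALUE AS numRows,
    c.PARAM_VALUE AS totalSize
FROM
    TBLS AS a
    -- Filter PARAM_KEY inside ON: a WHERE filter would discard tables that lack the key
    LEFT JOIN TABLE_PARAMS AS b
        ON a.TBL_ID = b.TBL_ID AND b.PARAM_KEY = 'numRows'
    LEFT JOIN TABLE_PARAMS AS c
        ON a.TBL_ID = c.TBL_ID AND c.PARAM_KEY = 'totalSize'
    LEFT JOIN DBS AS d
        ON d.DB_ID = a.DB_ID;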
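The same joins can be reused to rank tables by size. The query below is a minimal sketch, assuming a MySQL-backed metastore (it relies on `CAST(... AS SIGNED)` and `LIMIT`) and assuming totalSize is recorded in bytes; the alias totalSizeGB and the limit of 10 are illustrative choices, not part of the original note.

```sql
-- Sketch: ten largest tables by totalSize, shown in GB (MySQL metastore assumed).
SELECT
    d.`NAME` AS dbName,
    a.TBL_NAME,
    -- PARAM_VALUE is stored as text, so cast it before doing arithmetic
    ROUND(CAST(c.PARAM_VALUE AS SIGNED) / 1024 / 1024 / 1024, 2) AS totalSizeGB
FROM TBLS AS a
INNER JOIN TABLE_PARAMS AS c
    ON a.TBL_ID = c.TBL_ID AND c.PARAM_KEY = 'totalSize'
INNER JOIN DBS AS d
    ON d.DB_ID = a.DB_ID
ORDER BY CAST(c.PARAM_VALUE AS SIGNED) DESC
LIMIT 10;
```

Sorting on the cast value matters here: ordering by the raw text column would compare the sizes as strings rather than as numbers.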
Getting information about Hive partitioned tables from the metastore tables
SELECT main.*, b.numRows, b.totalSize FROM (
    SELECT t.TBL_ID AS metaObjectId
        ,t.TBL_NAME AS tabName
        ,p.PART_ID AS partId
        ,p.PART_NAME AS partitionName
        ,p.CREATE_TIME AS createTime        -- epoch seconds
        ,p.LAST_ACCESS_TIME AS updateTime   -- last access time, not a modification time
    FROM TBLS t
    INNER JOIN PARTITIONS p ON t.TBL_ID = p.TBL_ID
    WHERE p.PART_NAME IS NOT NULL
) main
LEFT JOIN
(
    -- Pivot the key/value rows into one row per partition.
    -- PARAM_VALUE is text, so cast it to make MAX compare numbers rather than strings.
    SELECT PART_ID AS partId,
        MAX(CASE PARAM_KEY WHEN 'numRows' THEN CAST(PARAM_VALUE AS SIGNED) ELSE 0 END) AS numRows,
        MAX(CASE PARAM_KEY WHEN 'totalSize' THEN CAST(PARAM_VALUE AS SIGNED) ELSE 0 END) AS totalSize
    FROM PARTITION_PARAMS
    GROUP BY PART_ID
) b ON main.partId = b.partId;
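Per-partition figures are often rolled up to one row per table. The following is a rough sketch of that rollup, again assuming a MySQL-backed metastore; the alias partitionCount is made up for illustration, and the totals are only as current as the statistics stored in PARTITION_PARAMS.

```sql
-- Sketch: sum partition-level statistics up to the table level.
SELECT
    t.TBL_ID,
    t.TBL_NAME,
    COUNT(DISTINCT p.PART_ID) AS partitionCount,
    -- Each partition has at most one row per PARAM_KEY, so SUM adds one value per partition
    SUM(CASE WHEN pp.PARAM_KEY = 'numRows'
             THEN CAST(pp.PARAM_VALUE AS SIGNED) ELSE 0 END) AS numRows,
    SUM(CASE WHEN pp.PARAM_KEY = 'totalSize'
             THEN CAST(pp.PARAM_VALUE AS SIGNED) ELSE 0 END) AS totalSize
FROM TBLS t
INNER JOIN PARTITIONS p
    ON t.TBL_ID = p.TBL_ID
LEFT JOIN PARTITION_PARAMS pp
    ON p.PART_ID = pp.PART_ID
   AND pp.PARAM_KEY IN ('numRows', 'totalSize')
GROUP BY t.TBL_ID, t.TBL_NAME;
```

The LEFT JOIN keeps partitions that have no statistics rows at all; they still count toward partitionCount but add nothing to the sums.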
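Key point: the most important metastore tables for Hive partitions
TBLS: basic information for every Hive table, including the table id, table name, and so on
PARTITIONS: one record per partition, including the partition id, creation time, partition name, and the id of the owning table in TBLS
PARTITION_PARAMS: key/value parameters for each partition, including its statistics
Main keys:
numFiles: number of files in the partition
numRows: number of rows in the partition
rawDataSize: size of the raw data
totalSize: storage space occupied on HDFS, in bytes
transient_lastDdlTime: time of the last DDL operation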
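To see these keys for a concrete table, the three tables can be joined directly. This is a minimal sketch: 'my_table' is a hypothetical table name, and the key list simply repeats the five keys named above.

```sql
-- Sketch: list the statistics keys stored for every partition of one table.
SELECT
    p.PART_NAME,
    pp.PARAM_KEY,
    pp.PARAM_VALUE
FROM TBLS t
INNER JOIN PARTITIONS p
    ON t.TBL_ID = p.TBL_ID
INNER JOIN PARTITION_PARAMS pp
    ON p.PART_ID = pp.PART_ID
WHERE t.TBL_NAME = 'my_table'   -- hypothetical table name
  AND pp.PARAM_KEY IN ('numFiles', 'numRows', 'rawDataSize',
                       'totalSize', 'transient_lastDdlTime')
ORDER BY p.PART_NAME, pp.PARAM_KEY;
```

If the same table name exists in more than one database, an extra join to DBS on DB_ID with a filter on its NAME column would be needed to tell them apart.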