This article provides a detailed description of an automated solution for managing and cleaning up HBase snapshots in an AWS EMR environment. By combining Shell scripts and AWS Lambda, this solution efficiently creates, exports, and cleans up HBase snapshots while ensuring data reliability and recoverability.
1. Overview
1.1 Core Features
- Snapshot Creation: Supports creating snapshots by namespace or table.
- S3 Export: Exports snapshots to an S3 bucket for off-site backup.
- Automatic Cleanup: Deletes expired HBase snapshots and S3 exports.
- Parallel Processing: Utilizes multi-threading to accelerate large-scale operations.
- Logging: Stores all operation logs in S3 and CloudWatch for auditing and monitoring.
1.2 Use Cases
- Data Backup: Regularly backs up HBase data to support disaster recovery.
- Data Migration: Exports HBase data to S3 for cross-region migration or analysis.
- Resource Optimization: Cleans up expired snapshots to free up storage space.
2. Technical Architecture
2.1 Components
| Component | Purpose |
| --- | --- |
| AWS EMR | Runs the HBase cluster and executes snapshot operations. |
| AWS Lambda | Automates the scheduling and management of snapshot tasks. |
| AWS S3 | Stores snapshot data and operation logs. |
| CloudWatch | Monitors Lambda execution status and logs. |
| Shell Scripts | Implement the core logic for snapshot creation and cleanup. |
2.2 Workflow
- Task Trigger: Initiated via CloudWatch Events or manual Lambda invocation (see the example after this list).
- Cluster Filtering: Lambda filters target EMR clusters based on tags.
- Task Submission: Lambda submits Shell scripts as EMR Steps for execution.
- Snapshot Operations: Shell scripts execute snapshot creation, export, or cleanup on the EMR cluster.
- Logging: Operation logs are uploaded to S3 and CloudWatch.
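For example, once deployed (Section 5), a run can be triggered manually. This is a minimal sketch assuming AWS CLI v2; the function name matches the deployment example later in this article, and the tag key/value ("team"/"hbase") are placeholders for whatever tags identify your clusters:
# Manually invoke the Lambda (AWS CLI v2 needs the binary-format flag for raw JSON payloads)
aws lambda invoke \
  --function-name HBaseSnapshotManager \
  --cli-binary-format raw-in-base64-out \
  --payload '{"tag_key": "team", "tag_value": "hbase", "action": "snapshot", "items": "ns1,ns2"}' \
  response.json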
3. Script Implementation
3.1 Snapshot Management Script (hbase-snapshot.sh)
#!/bin/bash
# Purpose: Create and export HBase snapshots on AWS EMR
set -eo pipefail
show_help() {
cat <<EOF
HBase Snapshot Manager v1.2
Usage: ${0##*/} [OPTIONS]
Required Parameters:
--opt <MODE> Operation mode: 'namespaces' or 'tables'
--items <LIST> Comma-separated list of namespaces/tables
Optional Parameters:
--export Enable S3 export (default: false)
--export-bucket <S3> Target S3 bucket for exports
--export-path <PATH> S3 path prefix (default: '')
--parallel <N> Parallel thread count (default: 1)
--help Show this help message
Examples:
./hbase-snapshot.sh --opt namespaces --items ns1,ns2
./hbase-snapshot.sh --opt tables --items tbl1,tbl2 --export --export-bucket my-bucket
EOF
exit 0
}
# Parameter initialization
declare -A PARAMS=(
[opt]="" [items]="" [export]="false"
[export-bucket]="" [export-path]="" [parallel]="1"
)
# Parse command-line arguments (--export is a boolean flag and takes no value)
while [[ $# -gt 0 ]]; do
  case "$1" in
    --help) show_help ;;
    --export) PARAMS[export]="true"; shift ;;
    --*) PARAMS["${1#--}"]="$2"; shift 2 ;;
    *) echo "Invalid option: $1"; show_help ;;
  esac
done
# Validate required parameters
[[ "${PARAMS[opt]}" =~ ^(namespaces|tables)$ ]] || {
  echo "ERROR: Invalid --opt value. Valid options: namespaces|tables"
  exit 1
}
[[ -n "${PARAMS[items]}" ]] || {
  echo "ERROR: --items parameter is required"
  exit 1
}
[[ "${PARAMS[export]}" != "true" || -n "${PARAMS[export-bucket]}" ]] || {
  echo "ERROR: --export-bucket is required when --export is enabled"
  exit 1
}
# Core functions
generate_snapshot_name() {
  # Generate standardized snapshot name: namespace_table_TIMESTAMP.snapshot
  local ts
  ts=$(date +%Y%m%d%H%M%S)
  echo "${1}_${2}_${ts}.snapshot"
}
snapshot_table() {
  # Create a snapshot of namespace:table and optionally export it to S3
  local namespace="$1" table="$2" snapshot
  snapshot=$(generate_snapshot_name "$namespace" "$table")
  hbase shell -n <<< "snapshot '${namespace}:${table}', '${snapshot}'"
  if [[ "$EXPORT" == "true" ]]; then
    # ExportSnapshot copies the snapshot metadata and HFiles to the target bucket
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot "$snapshot" \
      -copy-to "s3://${EXPORT_BUCKET}/${EXPORT_PATH}"
  fi
}
snapshot_namespace() {
  # Snapshot every table in a namespace, filtering out hbase shell chatter
  local namespace="$1" table
  for table in $(hbase shell -n <<< "list_namespace_tables '$namespace'" \
                   | grep -E '^[a-zA-Z0-9._-]+$'); do
    snapshot_table "$namespace" "$table"
  done
}
# GNU parallel runs each job in a fresh shell, so export what the jobs need
export EXPORT="${PARAMS[export]}" EXPORT_BUCKET="${PARAMS[export-bucket]}" EXPORT_PATH="${PARAMS[export-path]}"
export -f generate_snapshot_name snapshot_table snapshot_namespace
# Main process
if [[ "${PARAMS[opt]}" == "namespaces" ]]; then
  parallel -j "${PARAMS[parallel]}" snapshot_namespace {} ::: ${PARAMS[items]//,/ }
elif [[ "${PARAMS[opt]}" == "tables" ]]; then
  parallel -j "${PARAMS[parallel]}" -I{} '
    item={}
    if [[ "$item" == *:* ]]; then
      snapshot_table "${item%%:*}" "${item#*:}"
    else
      snapshot_table "default" "$item"
    fi
  ' ::: ${PARAMS[items]//,/ }
else
  echo "ERROR: Unsupported operation mode: ${PARAMS[opt]}"
  exit 1
fi
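To sanity-check a run, you can list the snapshots and restore one into a scratch table from the HBase shell. A minimal sketch follows; the snapshot and table names shown are illustrative, not output from the script:
# List existing snapshots (names created by the script end in .snapshot)
hbase shell -n <<< "list_snapshots"
# Clone a snapshot into a scratch table to confirm it is usable, then drop it
hbase shell -n <<< "clone_snapshot 'ns1_tbl1_20250101020000.snapshot', 'ns1:tbl1_verify'"
hbase shell -n <<< "disable 'ns1:tbl1_verify'"
hbase shell -n <<< "drop 'ns1:tbl1_verify'"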
3.2 Cleanup Script (hbase-cleanup.sh)
#!/bin/bash
# Purpose: Clean expired snapshots and S3 exports on AWS EMR
set -eo pipefail
show_help() {
cat <<EOF
HBase Cleanup Manager v1.2
Usage: ${0##*/} [OPTIONS]
Required Parameters:
--retention-days <N> Days to retain HBase snapshots
Optional Parameters:
--s3-retention <N> Days to retain S3 exports (default: 10)
--s3-bucket <NAME> S3 bucket for cleanup
--s3-path <PATH> S3 path prefix (default: '')
--parallel <N> Parallel thread count (default: 1)
--dry-run Simulation mode (no actual deletion)
--help Show this help message
Examples:
./hbase-cleanup.sh --retention-days 7
./hbase-cleanup.sh --retention-days 7 --s3-bucket my-bucket --s3-retention 30
EOF
exit 0
}
# Parameter initialization
declare -A PARAMS=(
[retention-days]="" [s3-retention]="10"
[s3-bucket]="" [s3-path]="" [parallel]="1" [dry-run]="false"
)
# Parse command-line arguments (--dry-run is a boolean flag and takes no value)
while [[ $# -gt 0 ]]; do
  case "$1" in
    --help) show_help ;;
    --dry-run) PARAMS[dry-run]="true"; shift ;;
    --*) PARAMS["${1#--}"]="$2"; shift 2 ;;
    *) echo "Invalid option: $1"; show_help ;;
  esac
done
# Validate required parameters
[[ "${PARAMS[retention-days]}" =~ ^[0-9]+$ ]] || {
echo "ERROR: Invalid retention days"
exit 1
}
# Core functions
clean_hbase_snapshots() {
  local cutoff snap snap_ts snap_date
  cutoff=$(date -d "${PARAMS[retention-days]} days ago" +%s)
  # grep may match nothing; '|| true' keeps pipefail from aborting the run
  hbase shell -n <<< "list_snapshots" | { grep -Po '\S+\.snapshot' || true; } | while read -r snap; do
    # Snapshot names embed a YYYYMMDDHHMMSS timestamp; compare on the date part
    snap_ts=$(grep -Po '\d{14}' <<< "$snap") || continue
    snap_date=$(date -d "${snap_ts:0:8}" +%s)
    if (( snap_date < cutoff )); then
      echo "Deleting snapshot: $snap"
      if [[ "${PARAMS[dry-run]}" == "false" ]]; then
        hbase shell -n <<< "delete_snapshot '$snap'"
      fi
    fi
  done
}
clean_s3_exports() {
  [[ -z "${PARAMS[s3-bucket]}" ]] && return 0
  local cutoff keys
  cutoff=$(date -u -d "${PARAMS[s3-retention]} days ago" +%Y-%m-%dT%H:%M:%SZ)
  # --output text emits tab-separated keys; 'None' means nothing matched
  keys=$(aws s3api list-objects-v2 --bucket "${PARAMS[s3-bucket]}" \
    --prefix "${PARAMS[s3-path]}" \
    --query "Contents[?LastModified<=\`$cutoff\`].Key" \
    --output text | tr '\t' '\n' | grep -v '^None$' || true)
  [[ -z "$keys" ]] && return 0
  echo "Deleting S3 objects under s3://${PARAMS[s3-bucket]}/${PARAMS[s3-path]}:"
  echo "$keys"
  if [[ "${PARAMS[dry-run]}" == "false" ]]; then
    # Delete one key per invocation; xargs -P runs deletions in parallel
    xargs -P "${PARAMS[parallel]}" -I{} aws s3 rm "s3://${PARAMS[s3-bucket]}/{}" <<< "$keys"
  fi
}
# Main process
exec &> >(tee "/tmp/hbase-cleanup-$(date +%s).log")
clean_hbase_snapshots
clean_s3_exports
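Because deletions are irreversible, a sensible pattern is to preview with --dry-run first and only then run for real:
# Preview what would be deleted (no snapshots or S3 objects are removed)
./hbase-cleanup.sh --retention-days 7 --s3-bucket my-bucket --s3-retention 30 --dry-run
# Perform the actual cleanup with 4 parallel S3 deletions
./hbase-cleanup.sh --retention-days 7 --s3-bucket my-bucket --s3-retention 30 --parallel 4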
4. Lambda Implementation
4.1 Lambda Function Code (lambda_function.py)
import boto3
import json
from functools import cached_property
from botocore.exceptions import ClientError
class HBaseManager:
"""AWS Lambda handler for managing HBase snapshots and cleanup operations."""
def __init__(self, event):
self.event = event
self._validate_params()
@cached_property
def emr(self):
"""Cached EMR client for AWS API calls."""
return boto3.client('emr')
def _validate_params(self):
"""Validate and normalize input parameters."""
required_params = {'tag_key', 'tag_value', 'action'}
missing = required_params - self.event.keys()
if missing:
raise ValueError(f"Missing required parameters: {', '.join(missing)}")
action = self.event['action'].lower()
if action not in ('snapshot', 'cleanup'):
raise ValueError("Invalid action. Must be 'snapshot' or 'cleanup'.")
# Validate snapshot-specific parameters
if action == 'snapshot':
if not isinstance(self.event.get('items', ''), str) or not self.event['items'].strip():
raise ValueError("Invalid or missing 'items' for snapshot action.")
if self.event.get('export') and not self.event.get('export_bucket'):
raise ValueError("'export_bucket' is required when export is enabled.")
        # Validate cleanup-specific parameters
        if action == 'cleanup':
            if not isinstance(self.event.get('retention_days'), int):
                raise ValueError("'retention_days' is required and must be an integer.")
            if 's3_retention' in self.event and not isinstance(self.event['s3_retention'], int):
                raise ValueError("'s3_retention' must be an integer.")
    def clusters(self):
        """Find running EMR clusters whose tags match the specified key/value."""
        try:
            matched = []
            # list_clusters cannot filter by tag, so describe each candidate cluster
            paginator = self.emr.get_paginator('list_clusters')
            for page in paginator.paginate(ClusterStates=['RUNNING', 'WAITING']):
                for summary in page['Clusters']:
                    tags = self.emr.describe_cluster(
                        ClusterId=summary['Id']
                    )['Cluster'].get('Tags', [])
                    if any(t['Key'] == self.event['tag_key'] and
                           t['Value'] == self.event['tag_value'] for t in tags):
                        matched.append(summary['Id'])
            return matched
        except ClientError as e:
            raise RuntimeError(f"Failed to list EMR clusters: {e}")
    def _build_command(self):
        """Construct the shell command for the EMR step."""
        script = f"hbase-{self.event['action']}.sh"
        cmd = [
            "bash", "-c",
            f"aws s3 cp s3://my-bucket/scripts/{script} /tmp/ && "
            f"chmod +x /tmp/{script} && /tmp/{script}"
        ]
        # Add parameters, excluding metadata fields. Event keys use underscores
        # while the scripts expect hyphenated flags (retention_days -> --retention-days).
        params = []
        for k, v in self.event.items():
            if k in ('tag_key', 'tag_value', 'action'):
                continue
            flag = k.replace('_', '-')
            if isinstance(v, bool):
                if v:
                    params.append(f"--{flag}")
            else:
                params.append(f"--{flag} {v}")
        cmd[-1] += ' ' + ' '.join(params)
        return cmd
def submit_step(self, cluster_id):
"""Submit a step to the EMR cluster."""
try:
return self.emr.add_steps(
ClusterId=cluster_id,
Steps=[{
'Name': f"HBase{self.event['action'].title()}",
'ActionOnFailure': 'CANCEL_AND_WAIT',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': self._build_command()
}
}]
)
except ClientError as e:
raise RuntimeError(f"Failed to submit EMR step: {e}")
def lambda_handler(event, context):
"""AWS Lambda entry point."""
try:
manager = HBaseManager(event)
clusters = manager.clusters()
if not clusters:
return {
'statusCode': 404,
'body': json.dumps({'message': 'No matching EMR clusters found'})
}
# Submit steps to all matching clusters
for cluster_id in clusters:
manager.submit_step(cluster_id)
return {
'statusCode': 200,
'body': json.dumps({
'message': 'Steps submitted successfully',
'clusters': clusters,
'action': event['action']
})
}
except Exception as e:
return {
'statusCode': 400,
'body': json.dumps({
'error_type': type(e).__name__,
'error_message': str(e)
})
}
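For reference, the handler expects event payloads shaped like the following. The tag values are placeholders; every key other than tag_key, tag_value, and action is passed through to the scripts, with underscores mapped to hyphens.
A snapshot event:
{
  "tag_key": "team",
  "tag_value": "hbase",
  "action": "snapshot",
  "items": "ns1,ns2",
  "export": true,
  "export_bucket": "my-bucket"
}
A cleanup event:
{
  "tag_key": "team",
  "tag_value": "hbase",
  "action": "cleanup",
  "retention_days": 7,
  "s3_bucket": "my-bucket",
  "s3_retention": 30
}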
5. Deployment and Usage
5.1 Deployment Steps
- Upload Scripts to S3:
aws s3 cp hbase-snapshot.sh s3://my-bucket/scripts/
aws s3 cp hbase-cleanup.sh s3://my-bucket/scripts/
- Package the Lambda Function:
zip -r lambda-package.zip lambda_function.py
- Create the Lambda Function:
aws lambda create-function \
  --function-name HBaseSnapshotManager \
  --runtime python3.9 \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::123456789012:role/lambda-execution-role \
  --zip-file fileb://lambda-package.zip
- Configure Triggers (e.g., CloudWatch Events):
aws events put-rule \
  --name "daily-hbase-snapshot" \
  --schedule-expression "cron(0 2 * * ? *)"
aws lambda add-permission \
  --function-name HBaseSnapshotManager \
  --statement-id "cloudwatch-daily" \
  --action "lambda:InvokeFunction" \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/daily-hbase-snapshot
aws events put-targets \
  --rule daily-hbase-snapshot \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:HBaseSnapshotManager"
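In addition to the standard CloudWatch Logs permissions, the Lambda execution role must be able to discover clusters and submit steps. A minimal policy sketch follows; tighten Resource to specific cluster ARNs for production use:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:AddJobFlowSteps"
      ],
      "Resource": "*"
    }
  ]
}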
6. Conclusion
This solution provides automated HBase snapshot management in an AWS EMR environment, offering the following advantages:
- Efficiency: Accelerates large-scale operations through parallel processing.
- Reliability: Ensures robust error handling and logging.
- Flexibility: Adapts to various scenarios through parameterized configurations.
- Ease of Use: Includes clear documentation and examples.
By combining Shell scripts and AWS Lambda, this solution significantly improves the management efficiency of HBase clusters while reducing operational costs.