This article provides a detailed description of an automated solution for managing and cleaning up HBase snapshots in an AWS EMR environment. By combining Shell scripts and AWS Lambda, this solution efficiently creates, exports, and cleans up HBase snapshots while ensuring data reliability and recoverability.
1. Overview
1.1 Core Features
- Snapshot Creation: Supports creating snapshots by namespace or table.
- S3 Export: Exports snapshots to an S3 bucket for off-site backup.
- Automatic Cleanup: Deletes expired HBase snapshots and S3 exports.
- Parallel Processing: Utilizes multi-threading to accelerate large-scale operations.
- Logging: Stores all operation logs in S3 and CloudWatch for auditing and monitoring.
1.2 Use Cases
- Data Backup: Regularly backs up HBase data to support disaster recovery.
- Data Migration: Exports HBase data to S3 for cross-region migration or analysis.
- Resource Optimization: Cleans up expired snapshots to free up storage space.
2. Technical Architecture
2.1 Components
| Component | Purpose |
| --- | --- |
| AWS EMR | Runs the HBase cluster and executes snapshot operations. |
| AWS Lambda | Automates the scheduling and management of snapshot tasks. |
| AWS S3 | Stores snapshot data and operation logs. |
| CloudWatch | Monitors Lambda execution status and logs. |
| Shell Scripts | Implement the core logic for snapshot creation and cleanup. |
2.2 Workflow
- Task Trigger: Initiated via CloudWatch Events or manual Lambda invocation (see the example after this list).
- Cluster Filtering: Lambda filters target EMR clusters based on tags.
- Task Submission: Lambda submits Shell scripts as EMR Steps for execution.
- Snapshot Operations: Shell scripts execute snapshot creation, export, or cleanup on the EMR cluster.
- Logging: Operation logs are uploaded to S3 and CloudWatch.
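For example, once deployed (Section 5), a run can be triggered manually. This is a minimal sketch assuming AWS CLI v2; the function name matches the deployment example later in this article, and the tag key/value ("team"/"hbase") are placeholders for whatever tags identify your clusters:
# Manually invoke the Lambda (AWS CLI v2 needs the binary-format flag for raw JSON payloads)
aws lambda invoke \
  --function-name HBaseSnapshotManager \
  --cli-binary-format raw-in-base64-out \
  --payload '{"tag_key": "team", "tag_value": "hbase", "action": "snapshot", "items": "ns1,ns2"}' \
  response.json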
3. Script Implementation
3.1 Snapshot Management Script (hbase-snapshot.sh)
#!/bin/bash
# Purpose: Create and export HBase snapshots on AWS EMR
set -eo pipefail
show_help() {
cat <<EOF
HBase Snapshot Manager v1.2
Usage: ${0##*/} [OPTIONS]
Required Parameters:
--opt <MODE> Operation mode: 'namespaces' or 'tables'
--items <LIST> Comma-separated list of namespaces/tables
Optional Parameters:
--export Enable S3 export (default: false)
--export-bucket <S3> Target S3 bucket for exports
--export-path <PATH> S3 path prefix (default: '')
--parallel <N> Parallel thread count (default: 1)
--help Show this help message
Examples:
./hbase-snapshot.sh --opt namespaces --items ns1,ns2
./hbase-snapshot.sh --opt tables --items tbl1,tbl2 --export --export-bucket my-bucket
EOF
exit 0
}
# Parameter initialization
declare -A PARAMS=(
[opt]="" [items]="" [export]="false"
[export-bucket]="" [export-path]="" [parallel]="1"
)
# Parse command-line arguments (--export is a boolean flag and takes no value)
while [[ $# -gt 0 ]]; do
  case "$1" in
    --help) show_help ;;
    --export) PARAMS[export]="true"; shift ;;
    --*) PARAMS["${1#--}"]="$2"; shift 2 ;;
    *) echo "Invalid option: $1"; show_help ;;
  esac
done
# Validate required parameters
[[ "${PARAMS[opt]}" =~ ^(namespaces|tables)$ ]] || {
  echo "ERROR: Invalid --opt value. Valid options: namespaces|tables"
  exit 1
}
[[ -n "${PARAMS[items]}" ]] || {
  echo "ERROR: --items parameter is required"
  exit 1
}
[[ "${PARAMS[export]}" != "true" || -n "${PARAMS[export-bucket]}" ]] || {
  echo "ERROR: --export-bucket is required when --export is enabled"
  exit 1
}
# Core functions
generate_snapshot_name() {
  # Generate standardized snapshot name: namespace_table_TIMESTAMP.snapshot
  local ts
  ts=$(date +%Y%m%d%H%M%S)
  echo "${1}_${2}_${ts}.snapshot"
}
snapshot_table() {
  # Create a snapshot of namespace:table and optionally export it to S3
  local namespace="$1" table="$2" snapshot
  snapshot=$(generate_snapshot_name "$namespace" "$table")
  hbase shell -n <<< "snapshot '${namespace}:${table}', '${snapshot}'"
  if [[ "$EXPORT" == "true" ]]; then
    # ExportSnapshot copies the snapshot metadata and HFiles to the target bucket
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot "$snapshot" \
      -copy-to "s3://${EXPORT_BUCKET}/${EXPORT_PATH}"
  fi
}
snapshot_namespace() {
  # Snapshot every table in a namespace, filtering out hbase shell chatter
  local namespace="$1" table
  for table in $(hbase shell -n <<< "list_namespace_tables '$namespace'" \
                   | grep -E '^[a-zA-Z0-9._-]+$'); do
    snapshot_table "$namespace" "$table"
  done
}
# GNU parallel runs each job in a fresh shell, so export what the jobs need
export EXPORT="${PARAMS[export]}" EXPORT_BUCKET="${PARAMS[export-bucket]}" EXPORT_PATH="${PARAMS[export-path]}"
export -f generate_snapshot_name snapshot_table snapshot_namespace
# Main process
if [[ "${PARAMS[opt]}" == "namespaces" ]]; then
  parallel -j "${PARAMS[parallel]}" snapshot_namespace {} ::: ${PARAMS[items]//,/ }
elif [[ "${PARAMS[opt]}" == "tables" ]]; then
  parallel -j "${PARAMS[parallel]}" -I{} '
    item={}
    if [[ "$item" == *:* ]]; then
      snapshot_table "${item%%:*}" "${item#*:}"
    else
      snapshot_table "default" "$item"
    fi
  ' ::: ${PARAMS[items]//,/ }
else
  echo "ERROR: Unsupported operation mode: ${PARAMS[opt]}"
  exit 1
fi
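To sanity-check a run, you can list the snapshots and restore one into a scratch table from the HBase shell. A minimal sketch follows; the snapshot and table names shown are illustrative, not output from the script:
# List existing snapshots (names created by the script end in .snapshot)
hbase shell -n <<< "list_snapshots"
# Clone a snapshot into a scratch table to confirm it is usable, then drop it
hbase shell -n <<< "clone_snapshot 'ns1_tbl1_20250101020000.snapshot', 'ns1:tbl1_verify'"
hbase shell -n <<< "disable 'ns1:tbl1_verify'"
hbase shell -n <<< "drop 'ns1:tbl1_verify'"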
3.2 Cleanup Script (hbase-cleanup.sh)
#!/bin/bash
# Purpose: Clean expired snapshots and S3 exports on AWS EMR
set -eo pipefail
show_help() {
cat <<EOF
HBase Cleanup Manager v1.2
Usage: ${0##*/} [OPTIONS]
Required Parameters:
--retention-days <N> Days to retain HBase snapshots
Optional Parameters:
--s3-retention <N> Days to retain S3 exports (default: 10)
--s3-bucket <NAME> S3 bucket for cleanup
--s3-path <PATH> S3 path prefix (default: '')
--parallel <N> Parallel thread count (default: 1)
--dry-run Simulation mode (no actual deletion)
--help Show this help message
Examples:
./hbase-cleanup.sh --retention-days 7
./hbase-cleanup.sh --retention-days 7 --s3-bucket my-bucket --s3-retention 30
EOF
exit 0
}
# Parameter initialization
declare -A PARAMS=(
[retention-days]="" [s3-retention]="10"
[s3-bucket]="" [s3-path]="" [parallel]="1" [dry-run]="false"
)
# Parse command-line arguments (--dry-run is a boolean flag and takes no value)
while [[ $# -gt 0 ]]; do
  case "$1" in
    --help) show_help ;;
    --dry-run) PARAMS[dry-run]="true"; shift ;;
    --*) PARAMS["${1#--}"]="$2"; shift 2 ;;
    *) echo "Invalid option: $1"; show_help ;;
  esac
done
# Validate required parameters
[[ "${PARAMS[retention-days]}" =~ ^[0-9]+$ ]] || {
echo "ERROR: Invalid retention days"
exit 1
}
# Core functions
clean_hbase_snapshots() {
  local cutoff snap snap_ts snap_date
  cutoff=$(date -d "${PARAMS[retention-days]} days ago" +%s)
  # grep may match nothing; '|| true' keeps pipefail from aborting the run
  hbase shell -n <<< "list_snapshots" | { grep -Po '\S+\.snapshot' || true; } | while read -r snap; do
    # Snapshot names embed a YYYYMMDDHHMMSS timestamp; compare on the date part
    snap_ts=$(grep -Po '\d{14}' <<< "$snap") || continue
    snap_date=$(date -d "${snap_ts:0:8}" +%s)
    if (( snap_date < cutoff )); then
      echo "Deleting snapshot: $snap"
      if [[ "${PARAMS[dry-run]}" == "false" ]]; then
        hbase shell -n <<< "delete_snapshot '$snap'"
      fi
    fi
  done
}
clean_s3_exports() {
  [[ -z "${PARAMS[s3-bucket]}" ]] && return 0
  local cutoff keys
  cutoff=$(date -u -d "${PARAMS[s3-retention]} days ago" +%Y-%m-%dT%H:%M:%SZ)
  # --output text emits tab-separated keys; 'None' means nothing matched
  keys=$(aws s3api list-objects-v2 --bucket "${PARAMS[s3-bucket]}" \
    --prefix "${PARAMS[s3-path]}" \
    --query "Contents[?LastModified<=\`$cutoff\`].Key" \
    --output text | tr '\t' '\n' | grep -v '^None$' || true)
  [[ -z "$keys" ]] && return 0
  echo "Deleting S3 objects under s3://${PARAMS[s3-bucket]}/${PARAMS[s3-path]}:"
  echo "$keys"
  if [[ "${PARAMS[dry-run]}" == "false" ]]; then
    # Delete one key per invocation; xargs -P runs deletions in parallel
    xargs -P "${PARAMS[parallel]}" -I{} aws s3 rm "s3://${PARAMS[s3-bucket]}/{}" <<< "$keys"
  fi
}
# Main process
exec &> >(tee "/tmp/hbase-cleanup-$(date +%s).log")
clean_hbase_snapshots
clean_s3_exports
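Because deletions are irreversible, a sensible pattern is to preview with --dry-run first and only then run for real:
# Preview what would be deleted (no snapshots or S3 objects are removed)
./hbase-cleanup.sh --retention-days 7 --s3-bucket my-bucket --s3-retention 30 --dry-run
# Perform the actual cleanup with 4 parallel S3 deletions
./hbase-cleanup.sh --retention-days 7 --s3-bucket my-bucket --s3-retention 30 --parallel 4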
4. Lambda Implementation
4.1 Lambda Function Code (lambda_function.py)
import boto3
import json
from functools import cached_property
from botocore.exceptions import ClientError
class HBaseManager:
"""AWS Lambda handler for managing HBase snapshots and cleanup operations."""
def __init__(self, event):
self.event = event
self._validate_params()
@cached_property
def emr(self):
"""Cached EMR client for AWS API calls."""
return boto3.client('emr')
def _validate_params(self):
"""Validate and normalize input parameters."""
required_params = {'tag_key', 'tag_value', 'action'}
missing = required_params - self.event.keys()
if missing:
raise ValueError(f"Missing required parameters: {', '.join(missing)}")
action = self.event['action'].lower()
if action not in ('snapshot', 'cleanup'):
raise ValueError("Invalid action. Must be 'snapshot' or 'cleanup'.")
# Validate snapshot-specific parameters
if action == 'snapshot':
if not isinstance(self.event.get('items', ''), str) or not self.event['items'].strip():
raise ValueError("Invalid or missing 'items' for snapshot action.")
if self.event.get('export') and not self.event.get('export_bucket'):
raise ValueError("'export_bucket' is required when export is enabled.")
        # Validate cleanup-specific parameters
        if action == 'cleanup':
            if not isinstance(self.event.get('retention_days'), int):
                raise ValueError("'retention_days' is required and must be an integer.")
            if 's3_retention' in self.event and not isinstance(self.event['s3_retention'], int):
                raise ValueError("'s3_retention' must be an integer.")
    def clusters(self):
        """Find running EMR clusters whose tags match the specified key/value."""
        try:
            matched = []
            # list_clusters cannot filter by tag, so describe each candidate cluster
            paginator = self.emr.get_paginator('list_clusters')
            for page in paginator.paginate(ClusterStates=['RUNNING', 'WAITING']):
                for summary in page['Clusters']:
                    tags = self.emr.describe_cluster(
                        ClusterId=summary['Id']
                    )['Cluster'].get('Tags', [])
                    if any(t['Key'] == self.event['tag_key'] and
                           t['Value'] == self.event['tag_value'] for t in tags):
                        matched.append(summary['Id'])
            return matched
        except ClientError as e:
            raise RuntimeError(f"Failed to list EMR clusters: {e}")
    def _build_command(self):
        """Construct the shell command for the EMR step."""
        script = f"hbase-{self.event['action']}.sh"
        cmd = [
            "bash", "-c",
            f"aws s3 cp s3://my-bucket/scripts/{script} /tmp/ && "
            f"chmod +x /tmp/{script} && /tmp/{script}"
        ]
        # Add parameters, excluding metadata fields. Event keys use underscores
        # while the scripts expect hyphenated flags (retention_days -> --retention-days).
        params = []
        for k, v in self.event.items():
            if k in ('tag_key', 'tag_value', 'action'):
                continue
            flag = k.replace('_', '-')
            if isinstance(v, bool):
                if v:
                    params.append(f"--{flag}")
            else:
                params.append(f"--{flag} {v}")
        cmd[-1] += ' ' + ' '.join(params)
        return cmd
def submit_step(self, cluster_id):
"""Submit a step to the EMR cluster."""
try:
return self.emr.add_steps(
ClusterId=cluster_id,
Steps=[{
'Name': f"HBase{self.event['action'].title()}",
'ActionOnFailure': 'CANCEL_AND_WAIT',
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': self._build_command()
}
}]
)
except ClientError as e:
raise RuntimeError(f"Failed to submit EMR step: {e}")
def lambda_handler(event, context):
"""AWS Lambda entry point."""
try:
manager = HBaseManager(event)
clusters = manager.clusters()
if not clusters:
return {
'statusCode': 404,
'body': json.dumps({'message': 'No matching EMR clusters found'})
}
# Submit steps to all matching clusters
for cluster_id in clusters:
manager.submit_step(cluster_id)
return {
'statusCode': 200,
'body': json.dumps({
'message': 'Steps submitted successfully',
'clusters': clusters,
'action': event['action']
})
}
except Exception as e:
return {
'statusCode': 400,
'body': json.dumps({
'error_type': type(e).__name__,
'error_message': str(e)
})
}
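For reference, the handler expects event payloads shaped like the following. The tag values are placeholders; every key other than tag_key, tag_value, and action is passed through to the scripts, with underscores mapped to hyphens.
A snapshot event:
{
  "tag_key": "team",
  "tag_value": "hbase",
  "action": "snapshot",
  "items": "ns1,ns2",
  "export": true,
  "export_bucket": "my-bucket"
}
A cleanup event:
{
  "tag_key": "team",
  "tag_value": "hbase",
  "action": "cleanup",
  "retention_days": 7,
  "s3_bucket": "my-bucket",
  "s3_retention": 30
}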
5. Deployment and Usage
5.1 Deployment Steps
- Upload Scripts to S3:
aws s3 cp hbase-snapshot.sh s3://my-bucket/scripts/
aws s3 cp hbase-cleanup.sh s3://my-bucket/scripts/
- Package the Lambda Function:
zip -r lambda-package.zip lambda_function.py
- Create the Lambda Function:
aws lambda create-function \
  --function-name HBaseSnapshotManager \
  --runtime python3.9 \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::123456789012:role/lambda-execution-role \
  --zip-file fileb://lambda-package.zip
- Configure Triggers (e.g., CloudWatch Events):
aws events put-rule \
  --name "daily-hbase-snapshot" \
  --schedule-expression "cron(0 2 * * ? *)"
aws lambda add-permission \
  --function-name HBaseSnapshotManager \
  --statement-id "cloudwatch-daily" \
  --action "lambda:InvokeFunction" \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/daily-hbase-snapshot
aws events put-targets \
  --rule daily-hbase-snapshot \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:HBaseSnapshotManager"
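In addition to the standard CloudWatch Logs permissions, the Lambda execution role must be able to discover clusters and submit steps. A minimal policy sketch follows; tighten Resource to specific cluster ARNs for production use:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:DescribeCluster",
        "elasticmapreduce:AddJobFlowSteps"
      ],
      "Resource": "*"
    }
  ]
}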
6. Conclusion
This solution provides automated HBase snapshot management in an AWS EMR environment, offering the following advantages:
- Efficiency: Accelerates large-scale operations through parallel processing.
- Reliability: Ensures robust error handling and logging.
- Flexibility: Adapts to various scenarios through parameterized configurations.
- Ease of Use: Includes clear documentation and examples.
By combining Shell scripts and AWS Lambda, this solution significantly improves the management efficiency of HBase clusters while reducing operational costs.