There are two ways to query Athena from Python: the first calls the service directly through boto3, the second goes through pyathenajdbc. Each approach has its own trade-offs.
Method 1: boto3
import boto3
import pandas as pd
import io
import time

athena = boto3.client('athena', region_name='ap-southeast-1')
s3 = boto3.client('s3')

def get_data_by_boto3(sql):
    bucket = 'bigdata-athena'
    print(sql)
    s3_output = "s3://bigdata-athena/staging/parquet/mysql/"
    # Submit the query; Athena writes the result file under s3_output.
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={
            'Database': 'systemlogs'
        },
        ResultConfiguration={
            'OutputLocation': s3_output
        }
    )
    print(response)
    # SELECT queries produce <QueryExecutionId>.csv, but statements such as
    # "show tables" produce <QueryExecutionId>.txt instead.
    file_name = response['QueryExecutionId'] + '.txt'
    key = "staging/parquet/mysql/" + file_name
    print(key)
    # The query runs asynchronously, so the result object may not exist
    # yet; retry a couple of times with a short pause.
    for i in range(2):
        try:
            time.sleep(5)
            obj = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
            df = pd.read_csv(io.BytesIO(obj))
            df.to_csv("txt", index=False)
            print(df)
            break
        except Exception as e:
            print(e)

if __name__ == '__main__':
    sql = "show tables"
    get_data_by_boto3(sql)
Query time with this method is relatively long, but it suits large result sets: as the data volume grows, the total time does not grow much more. Note that statements like show tables return a .txt result file; if you look for a .csv with that QueryExecutionId, it cannot be read.
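The fixed time.sleep(5) retry above is fragile for slow queries. A minimal polling sketch (assuming the same athena client as above; wait_for_query is a hypothetical helper, not part of the original code) waits for a terminal state via get_query_execution before reading the result from S3:

import time

def wait_for_query(athena, query_execution_id, poll_seconds=2):
    # Poll until Athena reports a terminal state for this execution.
    while True:
        response = athena.get_query_execution(QueryExecutionId=query_execution_id)
        state = response['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            return state
        time.sleep(poll_seconds)

Only after the state is SUCCEEDED is the .csv (or .txt) result object guaranteed to exist under s3_output.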
Method 2: pyathenajdbc
from pyathenajdbc import connect
import pandas as pd

def get_data_by_athena(sql):
    # schema_name is the Athena database (namespace) to query.
    conn = connect(
        s3_staging_dir="s3://bigdata-athena/staging/parquet/mysql/",
        region_name="ap-southeast-1",
        schema_name='systemlogs',
    )
    # Option 1, the simplest way: load the result straight into a DataFrame.
    # df = pd.read_sql(sql, con=conn)
    # Option 2: use a cursor and collect rows plus column names manually.
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
            results = cursor.fetchall()
            # cursor.description holds one tuple per result column;
            # its first element is the column name.
            ins = [desc[0] for desc in cursor.description]
            print(results, ins)
            return {"results": results, "ins": ins}
    finally:
        conn.close()

if __name__ == '__main__':
    sql = "show tables"
    get_data_by_athena(sql)
Query time depends on the amount of data. With this second method you can simply use df = pd.read_sql(sql, con=conn). When there is a lot of data the query takes too long, so this approach suits small result sets.
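As a small usage sketch, the dict returned by get_data_by_athena above can be turned into a DataFrame by pairing the fetched rows with the column names taken from cursor.description:

import pandas as pd

# Usage of the function defined above: "results" holds the fetched rows,
# "ins" the column names.
data = get_data_by_athena("show tables")
df = pd.DataFrame(data["results"], columns=data["ins"])
print(df)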