1. Install Docker Desktop
Download the installer from the official Docker documentation page (Install Docker Desktop on Windows | Docker Documentation) and install it.
After installation, run the following command to confirm the Docker version:
C:\Users\27178>docker --version
Docker version 20.10.14, build a224086
C:\Users\27178>
2. Pull the amazon/aws-glue-libs image
Pull the image from Docker Hub: https://hub.docker.com/r/amazon/aws-glue-libs/tags?page=1&ordering=last_updated
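The tag used in the rest of this walkthrough can be pulled directly from the command line; a minimal sketch:

```shell
# Pull the Glue 3.0 local-development image from Docker Hub
docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01

# Confirm the image is now available locally
docker images amazon/aws-glue-libs
```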
3. Run the container
docker run -itd -p 8888:8888 -p 4040:4040 --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh
Or, to make your local AWS credentials visible inside the container, mount the .aws directory (note the variable is $HOME, which works in PowerShell and Unix shells; in cmd.exe use %USERPROFILE%\.aws instead):
docker run -itd -p 8888:8888 -p 4040:4040 -v $HOME/.aws:/root/.aws:rw --name glue_jupyter amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/jupyter/jupyter_start.sh
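After either run command, you can check that the container is up and that Jupyter started (container name as above):

```shell
# The container should be listed as Up, with ports 8888 and 4040 published
docker ps --filter name=glue_jupyter

# The startup log should show the Jupyter server listening on port 8888
docker logs glue_jupyter
```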
4. Run a Glue script
1. Open the notebook at http://127.0.0.1:8888/
2. Create a persons.csv file with the following content:
Id,Name,Sallary,DepartmentId
1,Joe,70000,1
2,Herry,8000,""
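Note that the last field of the second row is a quoted empty string. A minimal sketch with Python's csv module (using an inline copy of the file, for illustration only) shows how that field parses:

```python
import csv
import io

# Inline copy of persons.csv, used only for this illustration
raw = '''Id,Name,Sallary,DepartmentId
1,Joe,70000,1
2,Herry,8000,""
'''

rows = list(csv.DictReader(io.StringIO(raw)))

# Each data row becomes a dict keyed by the header fields
assert rows[0]["Name"] == "Joe"

# The quoted "" parses as an empty string; Spark's CSV reader will
# typically surface such an empty field as null in the DataFrame
assert rows[1]["DepartmentId"] == ""
```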
3. Run the script. The script used is as follows:
from pyspark import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# inputDF = spark.read.option("header", "true").csv("jupyter_workspace/test/persons.csv")
inputDF = spark.read.csv("jupyter_workspace/test/persons.csv", header=True)
inputDF.printSchema()
inputDF.show()
inputDF.select("Id", "Name").show()  # print only the selected columns
inputDF.printSchema()
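Since the script reads the CSV without inferSchema, Spark treats every column as a string, so printSchema should report something like:

```text
root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sallary: string (nullable = true)
 |-- DepartmentId: string (nullable = true)
```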
5. Connect to AWS
1. Create an IAM user with programmatic access and permission to write to S3.
2. Open a terminal and configure the AWS CLI credentials for that user.
3. Verify that the credentials have the required permissions.
4. The full script used for debugging:
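Steps 2 and 3 can look like the following (the key values shown are placeholders, not real credentials):

```shell
# Configure credentials for the IAM user (values are placeholders)
aws configure
# AWS Access Key ID [None]: AKIA...
# AWS Secret Access Key [None]: ...
# Default region name [None]: us-east-1
# Default output format [None]: json

# Verify the credentials resolve to the expected IAM identity
aws sts get-caller-identity
```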
from pyspark import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# inputDF = spark.read.option("header", "true").csv("jupyter_workspace/test/persons.csv")
inputDF = spark.read.csv("jupyter_workspace/test/persons.csv", header=True)
inputDF.printSchema()
inputDF.show()
inputDF.select("Id", "Name").show()
inputDF.printSchema()

# Convert the Spark DataFrame into a Glue DynamicFrame
persons_data = DynamicFrame.fromDF(inputDF, glueContext, 'EmployeeData')
persons_data.printSchema()

# Write the DynamicFrame to S3 as JSON files
glueContext.write_dynamic_frame_from_options(
    frame=persons_data,
    connection_type='s3',
    connection_options={
        'path': 's3://testbigdata'
    },
    format='json')
Finally, download the objects (or list the bucket) to confirm that the files were written to S3 successfully.
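Instead of downloading, you can also list the output from the CLI (same testbigdata bucket as in the script):

```shell
# Each Spark partition becomes one JSON object file under the bucket
aws s3 ls s3://testbigdata/ --recursive
```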