The Virtual Internship: My Experience

Expedia Group Technology — Software

This summer, I interned at one of the largest travel companies in the world, Expedia Group™️. Who knew that during a pandemic, with travel at an all-time low, Expedia would still have a great intern program planned for us!

Luckily, my previous internship was remote too, so I was kind of used to working from home. In this internship, I mostly worked on cloud and data engineering.

The induction week

Expedia organized an induction week for us, and it involved workshops on various topics: how to become a better leader, what a designer does, and we even got to know how Expedia operates. We kept on having workshops throughout our internships, and there was a lot to learn from each session.

My setup

[Image: my desk setup]
Not the fanciest setup out there, but enough to get the work done :)

My Team

I was a part of the Stay Experience team, which looks after the post-booking experience for a traveller. The team was divided into pods, and I worked on the .Net Stack Modernisation pod. My work involved making a microservice and figuring out how to deploy it to the cloud.

I worked with a manager and a buddy. We synced up every week and discussed work updates every day. It motivated me to stay on track. I also synced up with our project sponsor a few times. I enjoyed these one-on-one sessions. They helped me connect with the team better.

All the Vrbo interns had a weekly sync with the Vrbo managers. We socialized, played some games, and got help with issues. It was nice to have someone for support throughout the program, and these calls also helped me get to know the other interns better.

The project: Create a Spark job to sync data from MongoDB to S3

Motivation

We have a database which stores the notifications for guests of guests in a homestay. We wanted a way to store this data in S3 and keep it updated.

Why are you storing the data in two different places?

This data is not only used for sending notifications; analysts and data scientists use it too. The MongoDB database also backs the guests of guests API, and if everyone ran their queries against MongoDB, it could severely affect the API.

We finally came up with the idea to use a Spark job. This job would take the data from MongoDB and store it in an S3 bucket.

Why a Spark job?

Spark and Scala are widely used across the industry for data management. The job will be able to handle huge amounts of data without any hiccups.

A quick revision

This blog is going to get more technical now, so here’s a quick reference if you need a reminder.

What’s MongoDB? It’s a NoSQL database.
What’s Scala? It’s a programming language widely used for data management. I did all my coding in Scala.
What’s Apache Spark? It’s a fast, general-purpose cluster computing system for large-scale data processing. Some important Spark terms to know:

  • Dataframe: This is data in the form of a table, just like a relational database table.
  • Dataset: Extension of Dataframes. They provide the functionality of being type-safe and support an object-oriented programming interface.
  • SparkSession vs SparkContext: SparkSession is a unified entry point of a Spark application, starting with Spark 2.0. It provides a way to interact with various Spark functionality, and encapsulates the Spark context, Hive context, and SQL context all together.
  • Prior to Spark 2.0, SparkContext was the entry point of any Spark application and was used to access all Spark features. Now it is encapsulated in a SparkSession.

What’s a Spark job? In a Spark application, when you invoke an action on a dataset, a job is created. It’s the unit of work that gets submitted to Spark.
What’s a vault? HashiCorp Vault is a tool for secrets management, encryption as a service, and privileged access management.
What’s an S3 bucket? Just a simple distributed file storage system; think of it as the hard disk on your computer.

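The Dataframe vs Dataset distinction can be felt without a Spark cluster. Here is a plain-Scala analogy (ordinary collections, not actual Spark code; the Book case class is made up for illustration): an untyped row is like a Map whose field names are only checked at runtime, while a typed collection of case classes is checked by the compiler, just as a Dataset[Book] would be.

```scala
// Plain-Scala analogy for DataFrame (untyped rows) vs Dataset (typed).
// `Book` is a hypothetical record type, not from the actual project.
case class Book(id: Int, author: String, title: String)

object TypedVsUntyped {
  // DataFrame-style: fields looked up by name, mistakes surface at runtime
  val rows: Seq[Map[String, Any]] =
    Seq(Map("id" -> 1, "author" -> "A1", "title" -> "B1"))

  // Dataset-style: field access is checked at compile time
  val books: Seq[Book] = Seq(Book(1, "A1", "B1"))

  def main(args: Array[String]): Unit = {
    // rows.head("autor")  // typo compiles, but throws at runtime
    // books.head.autor    // typo would not even compile
    println(books.head.author)
  }
}
```

In Spark itself, a DataFrame is just Dataset[Row], so the same trade-off applies: Row access is stringly typed, while a Dataset of case classes is compiler-checked.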
[Images: the technology stack]

Approach

The Spark job does the following:

  • Authenticates against the vault and gets the secrets
  • Reads the updated data from MongoDB
  • Reads the data stored in the S3 bucket (it’s in Apache Parquet format)
  • Does a left anti join on the S3 and Mongo data, yielding the data that did not change
  • Merges the data that didn’t change with the new data
  • Writes the data in Parquet format to the S3 bucket

The authentication process is more complicated than just sending a saved token:

  • Grab the EC2 instance metadata and a nonce
  • Send the metadata and nonce to Vault
  • Vault checks that the EC2 instance is allowed and that the nonce is correct
  • If the checks pass, Vault returns a token
  • Use the token to get secrets like the AWS access key, AWS secret key, and the MongoDB password
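As a simplified illustration of the login step: EC2 authentication boils down to POSTing the signed instance identity document plus the nonce to Vault's /v1/auth/aws-ec2/login endpoint. The sketch below only builds that request body in plain Scala; the role name and the pkcs7 placeholder are made up, and real code would fetch the pkcs7 document from the instance metadata service and then send the HTTP request.

```scala
// Sketch: build the JSON body for Vault's EC2 login endpoint
// (POST /v1/auth/aws-ec2/login). The pkcs7 value is the signed instance
// identity document from the EC2 metadata service; a placeholder is used here.
object VaultEc2Login {
  def loginPayload(role: String, pkcs7: String, nonce: String): String =
    s"""{"role":"$role","pkcs7":"$pkcs7","nonce":"$nonce"}"""

  def main(args: Array[String]): Unit =
    println(loginPayload("spark-job", "<pkcs7-from-metadata>", "<saved-nonce>"))
}
```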

Here’s a simplified example explaining the method:

Mongo data:
id author book updateTime
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T3
4 A4 B4 T4

S3 data:
id author book updateTime
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T2.1

Step 1:
Get all docs from Mongo between updateTime1 and updateTime2 where T2 < updateTime1 < updateTime2
Fetched records:
3 A3 B3 T3
4 A4 B4 T4

Step 2:
Read all records from S3.
Fetched records:
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T2.1

Step 3:
Left anti join on id between S3 and Mongo data so that we have the data that did not change
1 A1 B1 T1
2 A2 B2 T2

Step 4:
Merge results of Step 3 with data fetched in Step 1.
Resulting data set:
1 A1 B1 T1
2 A2 B2 T2
3 A3 B3 T3
4 A4 B4 T4

Save this back to S3. You can see how this data is in sync with the Mongo data.
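The same merge can be sketched with plain Scala collections (this mirrors the logic only; the real job does it with Spark Datasets, where the "left anti join" is a join with the leftanti join type):

```scala
// Logic sketch of the sync, using plain collections instead of Spark.
// A record is (id, author, book, updateTime), as in the example above.
case class Rec(id: Int, author: String, book: String, updateTime: String)

object SyncSketch {
  def merge(s3: Seq[Rec], mongoUpdates: Seq[Rec]): Seq[Rec] = {
    val changedIds = mongoUpdates.map(_.id).toSet
    // "Left anti join" on id: keep S3 rows NOT present in the Mongo updates
    val unchanged = s3.filterNot(r => changedIds.contains(r.id))
    // Union the unchanged rows with the fresh Mongo rows
    (unchanged ++ mongoUpdates).sortBy(_.id)
  }

  def main(args: Array[String]): Unit = {
    val s3    = Seq(Rec(1, "A1", "B1", "T1"), Rec(2, "A2", "B2", "T2"),
                    Rec(3, "A3", "B3", "T2.1"))
    val mongo = Seq(Rec(3, "A3", "B3", "T3"), Rec(4, "A4", "B4", "T4"))
    merge(s3, mongo).foreach(println)   // ids 1,2 kept; 3 refreshed; 4 added
  }
}
```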

Challenges (and how I tackled them)

VPN Issues: I needed to use a VPN to remotely access project documentation and tools, but the VPN I was using did not support Linux or WSL, so I had to work in Windows, which I’m not as familiar with. It was not fun to learn Windows versions of the one-line commands I know in Linux.

However, I found a hack for this. I created an EC2 instance on the Expedia network and shifted all my work to that, using Windows just for the VPN and tools like a terminal window. Because the instance is on the corporate network, I could access everything I needed from there, and as a bonus the scripts ran much faster on the instance than on my computer.

RAM Issues: I started off using IntelliJ since it was easy to set up a Scala project with it. But on my machine Intellij was terribly slow and took a lot of RAM.

  • I later switched to VSCode plus Metals. Although it was a little more time-consuming to set up, once I understood it, all my RAM issues were solved. Also, using Bloop instead of sbt decreased compile times.
  • Running a local MongoDB server for testing was also quite RAM-hungry; MongoDB Atlas helped me a lot here. I deployed a free cluster in the cloud, and my work was done :)

Spark and MongoDB

  • Spark has some underlying concepts that were a little nontrivial to understand, at least for a first-timer like me. On top of that, the documentation of the MongoDB Spark connector followed old Spark conventions in some places and new conventions in others, which confused me a bit.

Understanding Vault EC2 authentication

  • Learning how to do authentication took most of the time. As a college student, I hadn’t had to worry about securing my applications, since our projects never really went into production.
  • I learned about HashiCorp Vault (for storing secrets), Terraform (for building instances from code), and Consul (a backend for Vault).

    我了解了Hashicorp Vault(用于存储机密),Terraform(用于从代码构建实例)和Consul(Vault的后端)。
  • Rather than keeping a stored token to authenticate against Vault and get the secrets, a better method is to authenticate using the metadata of the EC2 instance the Spark job runs on.
  • I also practiced making EC2 instances with Terraform and Consul.

Setting up EC2 instances

  • Setting up an instance is easy, but determining the right VPC, subnet, security groups, etc. for production really is a security design issue, and one has to decide carefully.

  • The majority of AWS services that get hacked are compromised because of misconfiguration; it isn’t really Amazon’s fault.

Thanks

Big thanks to Himanshu Verma and Arushi Dhunna for guiding me and answering my endless doubts and queries.

I would also like to thank Rohini Gupta, Anu Chopra, and Bhupinder Guleria for the weekly calls and support throughout the internship.

Thanks to the Early Careers Team for organizing the amazing workshops from Expedians all around the world: Jackie Kam, Rachel Tan, Chetna Mahobia. Thanks to Jim McCoy for all the help throughout the workshops.

The folks on internal Slack channels #scala, #vault and #just-amazon-things also helped me with my doubts, and are very much appreciated.

Originally published at https://rohanrajpal.github.io on June 17, 2020.

http://lifeatexpediagroup.com/

Translated from: https://medium.com/expedia-group-tech/the-virtual-internship-my-experience-5cc52a997bb1
