Using AWS EMR and Step Functions to Process Extremely Wide Matrices

Our Mission

Our brains are what make us human. They give rise to our thoughts, actions, movements, and desires, store our memories, and enable us to navigate our world every day. Yet despite decades of research — and impressive knowledge gathered about other aspects of the human body, including our entire genetic sequence — the brain remains largely unknown. The Allen Institute for Brain Science was established to answer some of the most pressing questions in neuroscience, grounded in an understanding of the brain and inspired by our quest to uncover the essence of what makes us human.

Our mission is to accelerate the understanding of how the human brain works in health and disease. Using a big science approach, we generate useful public resources, drive technological and analytical advances, and discover fundamental brain properties through the integration of experiments, modeling, and theory.

Processing Extremely Wide Datasets

As a part of “big science”, one of our core principles, we seek to tackle scientific challenges at scales no one else has attempted before. One of these challenges is the processing of large-scale transcriptomic datasets. Transcriptomics is the study of RNA. In particular, we are interested in the genes that are expressed in individual neurons. The human brain contains almost 100 billion neurons — how do they differ from each other, and what genes do they express? After a series of complex analyses using cutting-edge techniques such as Smart-Seq and 10x Genomics Chromium Sequencing, we produce extremely large matrices of numeric values.

Such matrices are called feature matrices. Each column represents a feature of a cell, which in this case is a gene. The mouse genome contains over 50,000 genes, so a single matrix can have over 50,000 columns! We expect the number of rows in our matrices to increase over time, reaching tens of millions, if not more. These matrices can reach 500 GB or more in size. Over the next few years, we want to be able to ingest tens or hundreds of such matrices.

Our goal is to provide low-latency visualizations on such matrices, allowing researchers to aggregate, slice, and dissect our data in real-time. In order to do this, we run a series of precomputations that store expensive calculations in a database for future retrieval.

We wanted to create a flexible, scalable pipeline to run computations on these matrices and store the results for visualizations.

The Pipeline

We wanted to build a pipeline that would take these large matrices as inputs, run various Spark jobs, and store the outputs in an Apache HBase cluster. We wanted to create something flexible so that we could easily add additional Spark Transformations.

We decided on AWS Step Functions as our workflow-orchestration tool of choice. Much like the open-source Apache Airflow, AWS Step Functions allows us to create a state machine that orchestrates the dataflow from payload submission to database loading.

After close collaboration with the engineers at the AWS Data Lab, we came up with the pipeline architecture described below.
At a high level, our Step Function followed this workflow (a minimal sketch of a corresponding state machine definition follows the list):

  1. Trigger a Step Function from an upload event to an S3 bucket.
  2. Copy the input ZIP file containing a feature matrix into an S3 working directory.
  3. Store all intermediate results in a working directory on S3.
  4. Run Spark jobs on AWS EMR to transform input feature matrices into various pre-computed datasets.
  5. Output the results of the Spark jobs as HFiles.
  6. Bulk load the results of our Spark jobs into Apache HBase.

The architecture above is deceptively simple. We found a number of challenges during our initial implementation:
Challenge 1: Lack of Rollbacks/Transactions in Apache HBase

The results of our Spark Jobs are a number of precomputed views of our original input dataset. Each of these views is stored as a separate table in Apache HBase. A major drawback of Apache HBase is the lack of a native transactional system. HBase provides row-level atomicity, but nothing more. Our worst-case scenario is writing partial data: if some views are updated but others are not, different visualizations will show inconsistent results, leaving us with scientifically incorrect data!

We worked around this by rolling our own blue/green system on top of Apache HBase. We suffix each set of tables related to a dataset with a UUID. We then use DynamoDB to track the UUID associated with each individual dataset. When an update to a dataset is being written, the UUID is not switched in DynamoDB until we verify that all the new tables have been successfully written to Apache HBase. We have an API on top of HBase to facilitate reads. This API checks DynamoDB for the dataset UUID before querying HBase, so user traffic is never redirected toward a new view until we have confirmed a successful write. Our API involves an AWS Lambda function using HappyBase to connect to our HBase cluster, wrapped in an AWS API Gateway layer to provide a REST interface.

Diagram: A/B transactional writes in Apache HBase, using DynamoDB to maintain state.
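
The read path of that API can be summarized in a few lines. The sketch below is illustrative only: it assumes a DynamoDB table (here called dataset_versions, with a dataset_id key and a uuid attribute) and a HappyBase connection to the HBase cluster; the table, attribute, and function names are placeholders, not our production schema.

import boto3
import happybase

# Assumes AWS credentials and region are configured for boto3.
dynamodb = boto3.resource("dynamodb")
versions_table = dynamodb.Table("dataset_versions")  # placeholder table name


def get_active_table_name(dataset_id, view_name):
    '''Look up the UUID currently associated with a dataset and build the
    suffixed HBase table name for one of its precomputed views.'''
    item = versions_table.get_item(Key={"dataset_id": dataset_id})["Item"]
    return f"{dataset_id}_{view_name}_{item['uuid']}"


def read_row(dataset_id, view_name, row_key, hbase_host):
    '''Read a single row from the currently active version of a view.'''
    connection = happybase.Connection(hbase_host)
    try:
        table = connection.table(get_active_table_name(dataset_id, view_name))
        return table.row(row_key.encode("utf-8"))
    finally:
        connection.close()


def promote_new_version(dataset_id, new_uuid):
    '''Flip the pointer only after every new table has been verified in HBase,
    so readers never see a partially written dataset.'''
    versions_table.update_item(
        Key={"dataset_id": dataset_id},
        UpdateExpression="SET #u = :u",
        ExpressionAttributeNames={"#u": "uuid"},
        ExpressionAttributeValues={":u": new_uuid},
    )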

Challenge 2: Stalled Spark Jobs on extremely wide datasets

While Apache Spark is a fantastic engine for running distributed compute operations, it doesn’t do too well when scaling to extremely wide datasets. We routinely operate on data that surpasses 50,000 columns, which often causes issues such as a stalled JavaToPython step in our PySpark job. While we have more investigating to do in order to figure out why our Spark Jobs hang on these wide datasets, we found a simple workaround in the short term — batching!

A number of our jobs involve computing simple columnar aggregations on our data. This means that each calculation on a column is completely independent of all the other columns. This lends itself quite well to batching our compute. We can break our input columns into chunks, and run our compute on each chunk.

import src.spark_transforms.pyspark_jobs.pyspark_utilities as pyspark_utilities
from pyspark.sql import functions as pyspark_functions
from pyspark.sql import types as pyspark_datatype


def get_aggregation_for_matrix_and_metadata(matrix, metadata, group_by_arg, agg_func, cols_per_write):
    '''
    Performs an aggregation on the joined matrix, aggregating the desired
    columns by the given function. agg_func must be a valid Pandas UDF function.
    Runs in batches so we don't overload the Task Scheduler with 50,000 columns at once.
    '''
    # Chunk the data
    for col_group in pyspark_utilities.chunks(matrix.columns, cols_per_write):
        # Add the row key to the column group
        col_group.append(matrix.columns[0])
        selected_matrix = matrix.select(pyspark_utilities.escape_column_list(col_group))

        # Join the selected columns with the cell metadata so the group-by column is available
        joined = selected_matrix.join(metadata, on=matrix.columns[0])

        # Create the argument list for the group-by and then process
        cast_as_udf = pyspark_functions.pandas_udf(
            agg_func,
            pyspark_datatype.FloatType(),
            pyspark_functions.PandasUDFType.GROUPED_AGG)
        udf_input = [cast_as_udf(selected_matrix[column_name]).alias(column_name)
                     for column_name in selected_matrix.columns
                     if column_name != group_by_arg]

        yield joined.groupby(group_by_arg).agg(*udf_input)

Spark evaluation is lazy, so we don't want to join our resulting DataFrames together after each batch. Rather, we can just write the results of each batch to an HFile, which is then later bulk loaded into HBase.

Because the post-aggregation DataFrame is very small, we found a significant performance increase from coalescing the DataFrame after the aggregation and then eagerly checkpointing the results before writing the HFiles. This forces Spark to compute the aggregation before writing the HFiles. HFiles need to be sorted by row key, so it is easier to pass a smaller DataFrame to our HFile converter.
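
Putting the batching and the write path together, the driver loop looks roughly like the sketch below. It assumes a checkpoint directory has already been set via spark.sparkContext.setCheckpointDir, and it uses the write_hfiles helper shown later in this post; the function name, batch size, and output layout are illustrative, not from our original code.

def write_all_batches(matrix, metadata, group_by_arg, agg_func,
                      zookeeper_quorum_ip, table_name, column_family,
                      output_prefix, cols_per_write=500):
    '''Run the batched aggregation and write one HFile directory per batch.'''
    batches = get_aggregation_for_matrix_and_metadata(
        matrix, metadata, group_by_arg, agg_func, cols_per_write)

    for batch_index, batch_df in enumerate(batches):
        # The aggregated result is tiny, so collapse it to a single partition and
        # force computation now (eager checkpoint) rather than at HFile-writing time.
        small_df = batch_df.coalesce(1).checkpoint(eager=True)

        write_hfiles(small_df,
                     f'{output_prefix}/batch_{batch_index}',
                     zookeeper_quorum_ip,
                     table_name,
                     column_family)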

Challenge 3: Using Apache Spark to write DataFrames as HFiles

Apache Spark supports writing DataFrames in multiple formats, including as HFiles. However, the documentation for doing so leaves a lot to be desired. In order to write out our Spark DataFrames as HFiles, we had to take the following steps:

  1. Convert a DataFrame into an HFile-compatible format, assuming that the first column will be the HBase row key: (row_key, (row_key, column_family, col, value)).
  2. Create a JAR file containing a converter to convert input Python objects into Java KeyValue byte classes. This step took a lot of trial and error; we couldn't find clear documentation on how the Python object was serialized and passed into the Java function.
  3. Call the saveAsNewAPIHadoopFile function, passing in the relevant information: the ZooKeeper quorum IP, port, and cluster DNS of our Apache HBase EMR cluster, the HBase table name, the class name of our Java converter function, and more.

You can see the details of our implementation here:

import src.spark_transforms.pyspark_jobs.pyspark_utilities as pyspark_utilities
import src.spark_transforms.pyspark_jobs.output_handler.emr_constants as constants

def csv_to_key_value(row, sorted_cols, column_family):
    '''
    This method is an RDD mapping function that will map each
    row in an RDD to an hfile-formatted tuple for hfile creation:
    (rowkey, (rowkey, columnFamily, columnQualifier, value))
    '''
    result = []
    for index, col in enumerate(sorted_cols[constants.ROW_KEY_INDEX + 1:], 1):
        row_key = str(row[constants.ROW_KEY_INDEX])
        value = row[index]

        if value is None:
            raise ValueError(f'Null value found at {row_key}, {col}')

        # We store sparse representations, dropping all zeroes.
        if value != 0:
            result.append((row_key, (row_key, column_family, col, value)))

    return tuple(result)


def get_sorted_df_by_cols(df):
    '''
    Sorts the matrix by column. Retains the row key as the initial column.
    '''
    cols = [df.columns[0]] + sorted(df.columns[1:])
    escaped_cols = pyspark_utilities.escape_column_list(cols)
    return df.select(escaped_cols)


def flat_map_to_hfile_format(df, column_family):
    '''
    Flat maps the matrix DataFrame into an RDD formatted for conversion into HFiles.
    '''
    sorted_df = get_sorted_df_by_cols(df)
    columns = sorted_df.columns
    return sorted_df.rdd.flatMap(lambda row: csv_to_key_value(row, columns, column_family)).sortByKey(True)


def write_hfiles(df, output_path, zookeeper_quorum_ip, table_name, column_family):
    '''
    This method will sort and map the PySpark DataFrame and then write to
    HFiles in the output directory using the supplied HBase configuration.
    '''
    # Sort columns other than the row key (first column) and flatten into
    # HFile-formatted key/value tuples.
    rdd = flat_map_to_hfile_format(df, column_family)

    conf = {
        constants.HBASE_ZOOKEEPER_QUORUM: zookeeper_quorum_ip,
        constants.HBASE_ZOOKEEPER_CLIENTPORT: constants.ZOOKEEPER_CLIENTPORT,
        constants.ZOOKEEPER_ZNODE_PARENT: constants.ZOOKEEPER_PARENT,
        constants.HBASE_TABLE_NAME: table_name
    }

    rdd.saveAsNewAPIHadoopFile(output_path,
                               constants.OUTPUT_FORMAT_CLASS,
                               keyClass=constants.KEY_CLASS,
                               valueClass=constants.VALUE_CLASS,
                               keyConverter=constants.KEY_CONVERTER,
                               valueConverter=constants.VALUE_CONVERTER,
                               conf=conf)
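
The emr_constants module imported above isn't included in the post. For reference, the Hadoop configuration keys and output classes it wraps would look roughly like this; the two converter class names are placeholders standing in for the custom converters in the JAR described in step 2, and ROW_KEY_INDEX simply marks the position of the row key column.

# Plausible contents of emr_constants; the converter class names are placeholders.
ROW_KEY_INDEX = 0

# HBase / ZooKeeper configuration keys understood by HFileOutputFormat2.
HBASE_ZOOKEEPER_QUORUM = 'hbase.zookeeper.quorum'
HBASE_ZOOKEEPER_CLIENTPORT = 'hbase.zookeeper.property.clientPort'
ZOOKEEPER_ZNODE_PARENT = 'zookeeper.znode.parent'
HBASE_TABLE_NAME = 'hbase.mapreduce.hfileoutputformat.table.name'

# Values for an EMR HBase cluster (the znode parent is /hbase by default).
ZOOKEEPER_CLIENTPORT = '2181'
ZOOKEEPER_PARENT = '/hbase'

# Hadoop output format and key/value classes for HFile generation.
OUTPUT_FORMAT_CLASS = 'org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2'
KEY_CLASS = 'org.apache.hadoop.hbase.io.ImmutableBytesWritable'
VALUE_CLASS = 'org.apache.hadoop.hbase.KeyValue'

# Class names of the custom Python-to-Java converters built into our JAR
# (illustrative names; see step 2 above).
KEY_CONVERTER = 'org.example.converters.StringToImmutableBytesWritableConverter'
VALUE_CONVERTER = 'org.example.converters.TupleToKeyValueConverter'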

Looking Ahead

Our computation pipeline was a success, and you can see the resulting visualizations on https://transcriptomics.brain-map.org/. Since writing this blog post, we’ve done much more, including a cross-database transaction system, wide-matrix transposes in Spark, and more. Big Data problems in neuroscience never end, and we’re excited to share more with you in the future!

Translated from: https://medium.com/allen-institute-for-brain-science-engineering/using-aws-emr-and-step-functions-to-process-extremely-wide-matrices-40c8d2e932c
