Schema-on-Read vs Schema-on-Write

1. Question:

What do these two statements, which I encountered during a lecture, mean, and how do they differ?

1. Traditional databases enforce schema during load time.

and

2. Hive enforces schema during read time.

2. Answer:

You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis have probably contributed to the explosion of "data science", just because they make large-scale data analysis easier in general.

A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store.
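As a minimal sketch of schema on write, the snippet below uses Python's built-in sqlite3 module. The table name `people`, its columns, and the helper `try_insert` are made up for illustration. The CHECK constraint is an assumption added because SQLite is lenient about column types by default, whereas many relational databases reject a mistyped value on their own.

```python
import sqlite3

# In-memory database; the schema is declared before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE people (
        id  INTEGER PRIMARY KEY,
        age INTEGER NOT NULL CHECK (typeof(age) = 'integer')
    )
    """
)


def try_insert(row):
    """Attempt to write one (id, age) row; return True if the schema accepted it."""
    try:
        conn.execute("INSERT INTO people (id, age) VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        # The value did not fit the declared schema, so nothing was stored.
        return False


accepted = try_insert((1, 42))           # matches the declared schema
rejected = try_insert((2, "forty-two"))  # wrong type, refused at write time
stored_rows = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
```

The bad row never reaches the store, which is the point: the check happens on the way in, not when someone queries the data later.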

Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text:

A:B:C~E:F~G:H~~I::J~K~L

There are a couple of ways to interpret this. ~ could be the delimiter, or maybe : could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but hopefully it gets the point across.
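To make the two readings concrete, here is a small Python sketch that stores the line untouched and applies a delimiter only when reading. The function name `read_with_schema` is hypothetical, and treating the "schema" as nothing more than a delimiter is a simplification.

```python
# The raw line is stored exactly as it arrived; nothing is decided at write time.
raw_line = "A:B:C~E:F~G:H~~I::J~K~L"


def read_with_schema(line, delimiter):
    """Interpret the stored line at read time using the chosen delimiter."""
    return line.split(delimiter)


# Two readers apply two different "schemas" to the same stored bytes.
tilde_fields = read_with_schema(raw_line, "~")
colon_fields = read_with_schema(raw_line, ":")
```

Both interpretations yield seven fields, but the fields themselves are completely different, and the stored line supports either one.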

With schema on read, you just load your data into the data store and think about how to parse and interpret it later. At its core, schema on read means write your data first and figure out what it is later. Schema on write means figure out what your data is first, then write it.


There is a tradeoff here. Some of the points below are subjective and my own opinion.

Benefits of schema on write:

  • Better type safety and data cleansing done for the data at rest

  • Typically more efficient (storage size and computationally) since the data is already parsed
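The efficiency point can be illustrated with a rough Python sketch that compares one integer kept as raw text with the same integer kept in an already-parsed, fixed-width binary form. The sample value is arbitrary, and real databases use their own storage formats, so this only shows the general idea.

```python
import struct

value = 1234567890

# Schema on read: the number sits in the store as text and is parsed on every read.
as_text = str(value).encode("ascii")
parsed_each_read = int(as_text)

# Schema on write: the number was parsed once and stored as a 32-bit integer.
as_binary = struct.pack("<i", value)
unpacked = struct.unpack("<i", as_binary)[0]

text_size = len(as_text)      # bytes used by the text form
binary_size = len(as_binary)  # bytes used by the binary form
```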

Downsides of schema on write:

  • You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)

  • Typically you throw away the original data, which could be bad if you have a bug in your ingest process

  • It's harder to have different views of the same data

Benefits of schema on read:

  • Flexibility in defining how your data is interpreted at read time

    • This gives you the ability to evolve your "schema" as time goes on

    • This allows you to have different versions of your "schema"

    • This allows the original source data format to change without having to consolidate to one data format

  • You get to keep your original data

  • You can load your data before you know what to do with it (so you don't drop it on the ground)

  • Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
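The points about evolving and versioning a "schema" can be sketched in Python as two reader functions over the same untouched raw lines. The sample lines, the field names, and the functions `read_v1` and `read_v2` are all invented for illustration.

```python
# Raw lines are kept as they arrived; the second one has an extra field
# because the (hypothetical) source format changed over time.
raw_lines = [
    "2024-01-01,alice,3",
    "2024-01-02,bob,5,mobile",
]


def read_v1(line):
    """First version of the schema: date, user, clicks. Extra fields are ignored."""
    date, user, clicks = line.split(",")[:3]
    return {"date": date, "user": user, "clicks": int(clicks)}


def read_v2(line):
    """Later version: adds a device field, defaulting when old lines lack it."""
    parts = line.split(",")
    record = read_v1(line)
    record["device"] = parts[3] if len(parts) > 3 else "unknown"
    return record


v1_view = [read_v1(line) for line in raw_lines]
v2_view = [read_v2(line) for line in raw_lines]
```

Neither reader modifies `raw_lines`, so both views stay available and a third version can be added later without reloading anything.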

Downsides of schema on read:

  • Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)

  • The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)

  • More error prone and your analytics have to account for dirty data
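The dirty-data downside shows up as extra defensive code in every reader. A minimal Python sketch, assuming made-up sample lines and a hypothetical `read_rows` helper that expects an integer id and an integer age per line:

```python
# Nothing was validated at write time, so malformed lines are sitting in the store.
raw_lines = ["1,42", "2,forty-two", "3", "4,17"]


def read_rows(lines):
    """Apply the schema at read time, separating usable rows from dirty ones."""
    good, bad = [], []
    for line in lines:
        parts = line.split(",")
        try:
            row_id, age = parts  # raises ValueError unless there are exactly two fields
            good.append((int(row_id), int(age)))  # raises ValueError on non-numeric text
        except ValueError:
            bad.append(line)
    return good, bad


good_rows, bad_rows = read_rows(raw_lines)
```

Every analysis that touches this data has to repeat or share this kind of handling, and the parsing cost is paid again on each read.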




Reposted from: https://my.oschina.net/u/1416978/blog/667056

