A Collection of Must-Have Big Data Skills

Latest recommended article published on 2021-12-04 17:57:24

taya_a

Article tags: big data, programmer, mysql

Article link: https://blog.csdn.net/taya_a/article/details/84247479

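This post collects the key skills of the big data field, from relational database management systems such as MySQL and PostgreSQL, to distributed programming frameworks such as Hadoop and Spark, to a range of data models and database systems. It also covers distributed file systems such as HDFS, as well as real-time stream processing and machine learning. For programmers who want a deeper understanding of the big data ecosystem, it serves as a comprehensive guide.

Summary generated by CSDN using intelligent technology.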

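Relational Database Management Systems (RDBMS)

MySQL – the world's most popular open source database.
PostgreSQL – the world's most advanced open source database.
Oracle Database – object-relational database management system.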

Frameworks

Apache Hadoop – framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
Tigon – High Throughput Real-time Stream Processing Framework.

Distributed Programming

AddThis Hydra – distributed data processing and storage system originally developed at AddThis.
AMPLab SIMR – run Spark on Hadoop MapReduce v1.
Apache APEX – a unified, enterprise platform for big data stream and batch processing.
Apache Beam – a unified model and set of language-specific SDKs for defining and executing data processing workflows.
Apache Crunch – a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
Apache DataFu – collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
Apache Flink – high-performance runtime, and automatic program optimization.
Apache Gora – framework for in-memory data model and persistence.
Apache Hama – BSP (Bulk Synchronous Parallel) computing framework.
Apache MapReduce – programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Apache Pig – high level language to express data analysis programs for Hadoop.
Apache REEF – retainable evaluator execution framework to simplify and unify the lower layers of big data systems.
Apache S4 – framework for stream processing, implementation of S4.
Apache Spark – framework for in-memory cluster computing.
Apache Spark Streaming – framework for stream processing, part of Spark.
Apache Storm – framework for stream processing by Twitter, also runs on YARN.
Apache Samza – stream processing framework, based on Kafka and YARN.
Apache Tez – application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
Apache Twill – abstraction over YARN that reduces the complexity of developing distributed applications.
Cascalog – data processing and querying library.
Cheetah – High Performance, Custom Data Warehouse on Top of MapReduce.
Concurrent Cascading – framework for data management/analytics on Hadoop.
Damballa Parkour – MapReduce library for Clojure.
Datasalt Pangool – alternative MapReduce paradigm.
DataTorrent StrAM – real-time engine designed to enable distributed, asynchronous, real-time in-memory big data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
Facebook Corona – Hadoop enhancement which removes single point of failure.
Facebook Peregrine – Map Reduce framework.
Facebook Scuba – distributed in-memory datastore.
Google Dataflow – create data pipelines that help ingest, transform and analyze data.
Google MapReduce – map reduce framework.
Google MillWheel – fault tolerant stream processing framework.
JAQL – declarative programming language for working with structured, semi-structured and unstructured data.
Kite – a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
Metamarkets Druid – framework for real-time analysis of large datasets.
Netflix PigPen – map-reduce for Clojure which compiles to Apache Pig.
Nokia Disco – MapReduce framework developed by Nokia.
Onyx – Distributed computation for the cloud.
Pinterest Pinlater – asynchronous job execution system.
Pydoop – Python MapReduce and HDFS API for Hadoop.
Rackerlabs Blueflood – multi-tenant distributed metric processing system.
Stratosphere – general purpose cluster computing framework.
Streamdrill – useful for counting activities of event streams over different time windows and finding the most active one.
Tuktu – Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
Twitter Heron – realtime, distributed, fault-tolerant stream processing engine from Twitter, replacing Storm.
Twitter Scalding – Scala library for Map Reduce jobs, built on Cascading.
Twitter Summingbird – Streaming MapReduce with Scalding and Storm, by Twitter.
Twitter TSAR – TimeSeries AggregatoR by Twitter.
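
Many entries in this list build on the MapReduce programming model. As a rough illustration only, here is a minimal single-process sketch in plain Python of the map, shuffle and reduce steps, using word counting as the task. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real framework would run these steps in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each map call could run on a different node;
    # here everything runs sequentially in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big cluster", "data pipeline"]
    print(word_count(docs))
```

The point of the split is that the map and reduce functions are independent per document and per key, which is what lets a framework distribute them.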

Distributed Filesystems

Apache HDFS – a way to store large files across multiple machines.
BeeGFS – formerly FhGFS, parallel distributed file system.
Ceph Filesystem – software storage platform.
Disco DDFS – distributed filesystem.
Facebook Haystack – object storage system.
Google Colossus – distributed filesystem (GFS2).
Google GFS – distributed filesystem.
Google Megastore – scalable, highly available storage.
GridGain – GGFS, Hadoop compliant in-memory file system.
Lustre file system – high-performance distributed filesystem.
Quantcast File System QFS – open-source distributed file system.
Red Hat GlusterFS – scale-out network-attached storage file system.
Seaweed-FS – simple and highly scalable distributed file system.
Alluxio – reliable file sharing at memory speed across cluster frameworks.
Tahoe-LAFS – decentralized cloud storage system.

Document Data Model

Actian Versant – commercial object-oriented database management system.
Crate Data – open source, massively scalable data store that requires zero administration.
Facebook Apollo – Facebook’s Paxos-like NoSQL database.
jumboDB – document oriented datastore over Hadoop.
LinkedIn Espresso – horizontally scalable document-oriented NoSQL data store.
MarkLogic – Schema-agnostic Enterprise NoSQL database technology.
MongoDB – Document-oriented database system.
RavenDB – A transactional, open-source Document Database.
RethinkDB – document database that supports queries like table joins and group by.

Key Map Data Model

Note: There is some term confusion in the industry, and two different things are called “Columnar Databases”. Some, listed here, are distributed, persistent databases built around the “key-map” data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as “column families” (with value map keys being referred to as “columns”).

Another group of technologies that can also be called “columnar databases” is distinguished by how it stores data, on disk or in memory. Rather than storing data the traditional way, where all column values for a given key are stored next to each other, “row by row”, these systems store all values of a given column next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
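
To make the storage trade-off concrete, here is a small in-memory sketch in Python. It is only an illustration of the two layouts, not how any particular database stores data; the sample records, the `COLUMNS` tuple and all function names are invented for this example.

```python
COLUMNS = ("key", "name", "age")

# Hypothetical sample data used only for illustration.
RECORDS = [
    {"key": "u1", "name": "Ann", "age": 30},
    {"key": "u2", "name": "Bob", "age": 27},
    {"key": "u3", "name": "Eve", "age": 45},
]


def to_row_layout(records):
    """Row-oriented: all column values for one key sit next to each other."""
    return [tuple(rec[col] for col in COLUMNS) for rec in records]


def to_column_layout(records):
    """Column-oriented: all values of one column sit next to each other."""
    return {col: [rec[col] for rec in records] for col in COLUMNS}


def get_record(column_layout, index):
    # Rebuilding one full record touches every column list:
    # more work per key in the columnar layout.
    return {col: values[index] for col, values in column_layout.items()}


def average_age(column_layout):
    # Aggregating one column reads a single contiguous list:
    # less work per column in the columnar layout.
    ages = column_layout["age"]
    return sum(ages) / len(ages)


if __name__ == "__main__":
    columns = to_column_layout(RECORDS)
    print(to_row_layout(RECORDS))
    print(get_record(columns, 1))
    print(average_age(columns))
```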

The former group is referred to as “key map data model” here. The line between these and the Key-value Data Model stores is fairly blurry.

The latter, being more about the storage format than about the data model, is listed under Columnar Databases.

You can read more about this distinction on Prof. Daniel Abadi’s blog: Distinguishing two major types of Column Stores.
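
The key-map data model described in the note above can be sketched as nested maps: a (possibly composite) row key points to column families, and each column family is a map from column names to values. The Python class below is a toy, in-memory illustration under that assumption; `KeyMapStore` and its methods are hypothetical names and do not correspond to the API of HBase, Cassandra or any other store listed here.

```python
from collections import defaultdict


class KeyMapStore:
    """Toy key-map store: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        """Set one column value inside a column family of a row."""
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        """Read one column value without creating missing rows or families."""
        families = self._rows.get(row_key)
        if families is None:
            return default
        columns = families.get(family)
        if columns is None:
            return default
        return columns.get(column, default)

    def get_family(self, row_key, family):
        """Return a copy of the whole value map for one column family."""
        families = self._rows.get(row_key, {})
        return dict(families.get(family, {}))


if __name__ == "__main__":
    store = KeyMapStore()
    row = ("user", 42)  # a composite key, modelled here as a tuple
    store.put(row, "profile", "name", "Ann")
    store.put(row, "profile", "city", "Oslo")
    store.put(row, "stats", "logins", 7)
    print(store.get_family(row, "profile"))
    print(store.get(row, "stats", "logins"))
```

Note that different rows can carry different columns within the same family, which is part of why the line between this model and plain key-value stores is blurry.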

Apache Accumulo – distributed key/value store, built on Hadoop.
Apache Cassandra – column-oriented distributed datastore, inspired by BigTable.
Apache HBase – column-oriented distributed datastore, inspired by BigTable.
Facebook HydraBase – evolution of HBase made by Facebook.
Google BigTable – column-oriented distributed datastore.
