distributed system
macyang
Chance is waiting for prepared people and my Status is read the fucking source code.
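Apache Kylin in Practice at Baidu Maps
Abstract: The Data Intelligence Group of the Baidu Maps Open Platform Business Department is mainly responsible for big data computation and analysis for Baidu Maps' internal businesses. It processes data on the scale of tens of billions of records daily and provides different businesses with OLAP multidimensional analysis query services that answer a single SQL query in milliseconds. 1. Preface: The Data Intelligence Group of the Baidu Maps Open Platform Business Department is mainly responsible for big data computation and analysis for Baidu Maps' internal businesses. It processes data on the scale of tens of billions of records daily and provides different businesses with OLAP multidimensional analysis query services that answer a single SQL query in milliseconds. As for Apache Kyl… Reposted 2017-10-31 14:54:00 · 490 views · 0 comments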
Open sourcing Databus: LinkedIn's low latency change data capture system
We are pleased to announce the open source release of Databus - a real-time change data capture system. Originally developed in 2005, Databus has been in production in its latest revision at LinkedIn… Reposted 2013-03-01 22:05:59 · 1379 views · 0 comments
Tez: Accelerating processing of data stored in HDFS
MapReduce has served us well. For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created. While it is here to stay, new parad… Reposted 2013-02-23 23:01:12 · 765 views · 0 comments
The Stinger Initiative: Making Apache Hive 100 Times Faster
Introduced by Facebook in 2007, Apache Hive and its HiveQL interface have become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a fa… Reposted 2013-02-23 22:49:20 · 976 views · 0 comments
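Classic Paper Translation and Reading Guide: "Large-scale Incremental Processing Using Distributed Transactions and Notifications"
[Translator's introduction] Percolator is said to have made Google's index updates 100 times faster after it replaced MapReduce. How exactly does it achieve that striking figure of "100"? Does today's parallel computing world really have that much room for improvement? Just when we happily assume there is a new algorithm or a new parallel computing architecture to learn, why does it start talking about distributed transactions instead? This article reveals the answers. Abstract: In a search engine system, the web index must be updated after documents are crawled, and new documents will… Reposted 2013-02-21 21:32:20 · 1999 views · 0 comments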
DataSift Architecture: Realtime Datamining At 120,000 Tweets Per Second
I remember the excitement of when Twitter first opened up their firehose. As an early adopter of the Twitter API I could easily imagine some of the cool things you could do with all that data. I also… Reposted 2013-02-21 11:23:39 · 1942 views · 0 comments
Intra-cluster Replication in Apache Kafka
Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and became an Apache project in July, 2011. Today, Kafka is used by LinkedIn, Twitter, and Square f… Reposted 2013-02-06 14:19:27 · 875 views · 0 comments
Distributed Algorithms in NoSQL Databases
Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big u… Reposted 2012-10-09 13:43:42 · 1595 views · 0 comments
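Classic Paper Translation and Reading Guide: "Dremel: Interactive Analysis of WebScale Datasets"
English original: googleusercontent; compiled by ImportNew - 储晓颖. [Translator's note] Reading a classic foreign technical paper from start to finish, and truly understanding it, is something many technology enthusiasts have long wanted to do. This translation series aims to satisfy enthusiasts' wish to see the original papers for themselves, translating the original text as completely as possible. The original papers contain a fair number of obscure academic sentences, and possibly passages that do not interest you, so the translator adds sections such as [Translator's preview] and [Translator's summary] to help readers read selectively, or to help readers sum… Reposted 2013-02-02 22:19:42 · 5364 views · 1 comment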
An Unorthodox Approach To Database Design : The Coming Of The Shard
Update 4: Why you don’t want to shard, by Morgon on the MySQL Performance Blog. Optimize everything else first, and then if performance still isn’t good enough, it’s time to take a very bitter medici… Reposted 2013-01-18 22:35:41 · 750 views · 0 comments
Apache Kafka --- A high-throughput distributed messaging system.
Why we built this: Kafka is a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn's activity stream and operational data processing pipeline. It is now… Reposted 2013-01-27 20:26:30 · 6602 views · 0 comments
Scalable Web Architecture and Distributed Systems
Open source software has become a fundamental building block for some of the biggest websites. And as those websites have grown, best practices and guiding principles around their architectures have e… Reposted 2013-01-06 21:46:42 · 640 views · 0 comments
Scaling Up And Out
Most attention today is focused on adding nodes or cloud instances to scale out systems. Guest editor Nikita Shamgunov emphasizes the importance of scaling systems vertically as well. Recentl… Reposted 2013-01-06 21:10:21 · 660 views · 0 comments
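Impala/Hive: Current State and Future Outlook
An unofficial history of Impala and Hive: Any mention of Impala has to bring up Google's Dremel, a SQL-based, interactive, real-time data analysis system that handles petabyte-scale data. Dremel is the backend of BigQuery, the PaaS data analysis service launched by Google. Google already had MapReduce, so why develop Dremel? How do Dremel/Impala-style systems differ from MapReduce? Hadoop has now become the BigDa… Reposted 2012-12-29 20:10:00 · 8205 views · 0 comments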
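Cache Algorithms
Introduction: We have all heard of caches. When you ask people what a cache is, they give you a perfect answer, but they do not know how a cache is built, or they cannot tell you what criteria to use when choosing a caching framework. In this article we discuss caches, caching algorithms, caching frameworks, and which caching framework is better. The interview: "A cache is a temporary place to store data (frequently used data); because fetching the original data is too expensive, I can retrieve it faster this way." This is what progr… Reposted 2012-12-04 13:54:46 · 573 views · 0 comments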
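How Google Spanner Works - A Globally-Distributed Database
Introduction to Google Spanner: Spanner is Google's globally-distributed database. Its scalability reaches a staggering global level: it can scale to millions of machines, hundreds of data centers, and trillions of rows. Better still, beyond this extreme scalability, it also achieves external consistency through synchronous replication and multi-versioning, and its availability is good as well. Breaking the shackles of CAP, it strikes a perfect balance among the three. Reposted 2012-09-19 21:30:37 · 1606 views · 0 comments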
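How Google Dremel Works - Analyzing 1 PB in 3 Seconds
Introduction: Dremel is Google's "interactive" data analysis system. It can be deployed as clusters of thousands of machines and process petabyte-scale data. MapReduce needs minutes to process a dataset. As the originator of MapReduce, Google developed Dremel to cut processing time to seconds, as a powerful complement to MapReduce. As the report engine of Google BigQuery, Dremel has been very successful. Recently Apache has planned to launch Dreme… Reposted 2012-08-24 10:07:36 · 1356 views · 0 comments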
MemSQL Architecture - The Fast (MVCC, InMem, LockFree, CodeGen) And Familiar (SQL)
This is an interview with MemSQL cofounders Eric Frenkiel and Nikita Shamgunov, in which they try to answer critics by going into more depth about their technology. MemSQL ruffled a few feathers… Reposted 2012-08-20 13:40:50 · 2606 views · 1 comment
The CAP FAQ
Version 1.0, May 9th 2013. By: Henry Robinson / henry.robinson@gmail.com / @henryr, http://the-paper-trail.org/. 0. What is this document? No subject appears to be more controversial to distribut… Reposted 2013-05-13 20:23:41 · 832 views · 0 comments
TAO: The power of the graph
Facebook puts an extremely demanding workload on its data backend. Every time any one of over a billion active users visits Facebook through a desktop browser or on a mobile device, they are presented… Reposted 2013-06-27 22:04:11 · 1072 views · 0 comments
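Borg: Google's Large-Scale Cluster Management System
Editor's note: This article is a translation of Google's classic paper on distributed infrastructure; the original can be viewed here. Because the original is long, we suggest bookmarking this article first, then downloading the English original and reading it carefully alongside the translation for best results. Abstract: Google's Borg system is a cluster manager that runs many thousands of jobs. It manages many application clusters at once, each with many thousands of machines, and these clusters run many different Google applications. Borg uses admission control, efficient task-packing, over-commitment of resources, and… Reposted 2017-09-15 09:49:15 · 1373 views · 0 comments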
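Presto: Facebook's Distributed SQL Query Engine
Background: Facebook is a data-driven company. Data processing and analytics are at the heart of building and delivering products for Facebook's more than 1 billion active users. We have one of the largest data warehouses in the world, storing more than 300 PB of data. The data is used by a wide range of applications, including traditional batch data processing, graph analytics [1], machine learning, and real-time data analytics. Analysts, data scientists, and engineers need to process data, analyze data, and continually improve our… Reposted 2016-09-26 09:23:27 · 1233 views · 0 comments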
Basic Graph Traversals
This section will present basic graph traversals by way of examples on the simple property graph diagrammed below. gremlin> g = TinkerGraphFactory.createTinkerGraph() ==>tinkergraph[verti… Reposted 2014-09-03 22:26:50 · 859 views · 0 comments
Property Graph Model
Blueprints provides a set of interfaces for the property graph data model. An example instance is diagrammed above. In order to make a data management system “Blueprints-enabled,” the Blueprints inter… Reposted 2014-09-03 22:09:40 · 1675 views · 0 comments
Faunus Getting Started
Faunus requires that the user have access to a Hadoop cluster. If a Hadoop cluster is readily available to the user, then Faunus is easy to get up and running. If not, then the provided Whirr recipe… Reposted 2014-09-12 11:32:15 · 869 views · 0 comments
Type Definition Overview
In Titan, edge labels and property keys are types which can be individually configured to provide data verification, better storage efficiency, and higher performance. Types are uniquely identified by… Reposted 2014-09-11 17:50:51 · 723 views · 0 comments
A Solution to the Supernode Problem
In graph theory and network science, a “supernode” is a vertex with a disproportionately high number of incident edges. While supernodes are rare in natural graphs (as statistically demonstrated w… Reposted 2014-09-10 21:27:19 · 1069 views · 0 comments
gremlin docs
Gremlin is a graph traversal language. The documentation herein will provide all the information necessary to understand how to use Gremlin for graph query, analysis, and manipulation. Gremlin works o… Reposted 2014-09-10 11:16:11 · 894 views · 0 comments
Powers of Ten – Part II
“‘Curiouser and curiouser!’ cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English); ‘now I’m opening out like the largest telescope that ever was!”… Reposted 2014-09-10 11:02:01 · 905 views · 0 comments
Powers of Ten – Part I
“‘No, no! The adventures first,’ said the Gryphon in an impatient tone: ‘explanations take such a dreadful time.’” (Lewis Carroll, Alice’s Adventures in Wonderland) It is often quite simple… Reposted 2014-09-10 11:03:33 · 846 views · 0 comments
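GraphLab: A New Parallel Framework for Machine Learning
GraphLab is a new parallel framework for machine learning. 1.1 Introduction to GraphLab: Today, with massive data everywhere, large-scale parallel computing is ubiquitous. The emergence of the MapReduce framework in particular has driven the wide adoption of parallel computing for Internet-scale data processing. Machine learning on massive data, however, poses new challenges for parallel computing in performance, development complexity, and more. Machine learning algorithms have the following two characteristics: strong data dependencies, with frequent data exchange among machines during computation… Reposted 2014-09-18 22:09:58 · 1365 views · 0 comments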
GraphSON Reader and Writer Library
Maven dependency: com.tinkerpop.blueprints, blueprints-core, version ??. GraphSON is a JSON-based format for individual graph elements (i.e. vertices and edges). How these elements are organized and utilized when w… Reposted 2014-09-05 14:14:25 · 934 views · 0 comments
Titan - Using HBase
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store. Use HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of… Reposted 2014-09-04 18:57:45 · 2451 views · 0 comments
Managing Multiple Resources in Hadoop 2 with YARN
An overview of some of Cloudera’s contributions to YARN that help support management of multiple resources, from multi-resource scheduling in the Fair Scheduler to node-level enforcement. As Apache H… Reposted 2013-12-25 23:17:00 · 957 views · 0 comments
The Log: What every software engineer should know about real-time data's unifying abstraction
I joined LinkedIn about six years ago at a particularly interesting time. We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition… Reposted 2013-12-19 13:08:32 · 2372 views · 0 comments
Putting Spark to Use: Fast In-Memory Computing for Your Big Data Applications
Our thanks to Databricks, the company behind Apache Spark (incubating), for providing the guest post below. Cloudera and Databricks recently announced that Cloudera will distribute and support Spa… Reposted 2013-12-15 17:43:58 · 1133 views · 0 comments
In-Stream Big Data Processing
The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream p… Reposted 2013-08-22 15:11:26 · 2872 views · 0 comments
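Taobao's Database Sharding (TDDL)
The history of Taobao's data splitting: When the system first started out, it had only just launched and had few users, so all the data lived in a single database. With few users and little load, one database coped perfectly well. But after the operations folks' hard campaigning and relentless promotion, one day we suddenly found, oh God, the number of users had shot up, and the database could no longer take it. It finally went down one day when everyone was nice and relaxed. At that point we tech folks went to find out what exactly had caused it, and we… Reposted 2012-08-29 22:49:57 · 2617 views · 0 comments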
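Taobao Data Cube (数据魔方) Technical Architecture Analysis
Taobao holds the most commercially valuable massive dataset in China. To date, each day sees more than 3 billion shop and product browsing records, 1 billion online products, and tens of millions of transaction, favorite, and review records. Mining real commercial value from this data, thereby helping Taobao and its merchants run data-driven operations and helping consumers make rational purchasing decisions, is the mission of Taobao's Data Platform and Products Department. To that end, we have developed a series of data products, such as the well-known Quantum Statistics (量子统计), Data Cube (数据魔方), and Taobao Index (淘宝指数). Although, from a business perspective, data product… Reposted 2011-08-04 10:07:00 · 1536 views · 0 comments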
Hbase Map Reduce Example - Frequency Counter
Source: http://sujee.net/tech/articles/hbase-map-reduce-freq-counter/ This introductory example shows how to use the HBase API to write MapReduce programs and implement your own test cases. This is a tutorial on how to run a map reduce job on Hbase. This covers version 0.20 and later. Recommended Readings: - Hbase home, -… Original 2011-01-19 12:50:00 · 2424 views · 0 comments