Massive Data Processing Techniques under Next-Generation Computing Architectures

 

 

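Course overview:

Large-scale data processing plays an increasingly important role as the Internet grows. This course focuses on two parts: systems and methods for large-scale data processing, and research on machine learning algorithms based on large-scale data analysis. Google's success has drawn industry attention to large-scale data processing and has raised new research questions in academia. Building on Apache Hadoop, the open-source project used in the IBM and Google cloud computing initiative, the course introduces and explores in depth the data processing algorithms and learning problems that arise under next-generation computing architectures.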

 

http://apex.sjtu.edu.cn/apex_wiki/LDA 

 

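Core course content:

  • Building massive-data platforms
  • Distributing and parallelizing traditional machine learning algorithms on massive-data processing platforms
  • Research on machine learning algorithms for new computing platforms
  • Research on new computing algorithms across multiple kinds of computing devices
  • Research on new applications based on MapReduce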
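
As a rough illustration of what parallelizing a traditional learning algorithm can look like, the sketch below fits a one-variable least-squares line by having each data shard compute a few partial sums, which are then added together. This is only a minimal single-process sketch: the function names (`partial_stats`, `merge`, `fit_line`) and the sample data are hypothetical, and a real platform would run the per-shard step on separate machines.

```python
from functools import reduce


def partial_stats(chunk):
    """Map step: summarize one shard of (x, y) pairs as five running sums."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def merge(a, b):
    """Reduce step: statistics from two shards combine by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(chunks):
    """Combine per-shard sums, then solve the usual least-squares formulas once."""
    n, sx, sy, sxx, sxy = reduce(merge, map(partial_stats, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data drawn from y = 2x + 1, split across two shards.
    shards = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
    print(fit_line(shards))
```

Because the merge step is plain addition, the shards can be processed in any order and on any number of workers without changing the fitted line.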

Textbook:

  • Hadoop: The Definitive Guide, O'Reilly.

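Related research projects:

  • Naive Bayesian classification with a very large number of classes
  • Fast computation for Chinese statistical language models
  • Large-scale log mining algorithms
  • Very-large-scale SVM algorithms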

Reference papers

  • Introduction to Distributed Systems
  • MapReduce: Simplified Data Processing on Large Clusters

  • The Google File System
  • Bigtable: A Distributed Storage System for Structured Data

  • Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
  • Map-Reduce for Machine Learning on Multicore
  • Google's MapReduce Programming Model -- Revisited

  • Chinese translation of the Google File System paper: http://www.codechina.org/doc/google/gfs-paper/architecture.html

Reference courses

MyWiki: LDA (last edited 2010-03-08 05:50:41 by grxue)

 

 

 

http://cs.smith.edu/dftwiki/index.php/CSC352_Resources

 

http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html

 

 

Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges

Mountain View, Calif., and Armonk, N.Y. – October 8, 2007 – Google and IBM today announced an initiative to promote new software development methods which will help students and researchers address the challenges of internet-scale applications in the future.

The goal of this initiative is to improve computer science students’ knowledge of highly parallel computing practices to better address the emerging paradigm of large-scale distributed computing. IBM and Google are teaming up to provide hardware, software and services to augment university curricula and expand research horizons. With their combined resources, the companies hope to lower the financial and logistical barriers for the academic community to explore this emerging model of computing.

The University of Washington was the first to join the initiative. A small number of universities will also pilot the program, including Carnegie Mellon University, Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley and the University of Maryland. In the future, the program will be expanded to include additional researchers, educators and scientists.

"Google is excited to partner with IBM to provide resources which will better equip students and researchers to address today’s developing computational challenges," said Eric Schmidt, CEO of Google. "In order to most effectively serve the long-term interests of our users, it is imperative that students are adequately equipped to harness the potential of modern computing systems and for researchers to be able to innovate ways to address emerging problems."

Fundamental changes in computer architecture and increases in network capacity are encouraging software developers to take new approaches to computer-science problem solving. For web software such as search, social networking and mobile commerce to run quickly, computational tasks often need to be broken into hundreds or thousands of smaller pieces to run across many servers simultaneously. Parallel programming techniques are also used for complex scientific analysis such as gene sequencing and climate modeling.

"This project combines IBM’s historic strengths in scientific, business and secure-transaction computing with Google’s complementary expertise in Web computing and massively scaled clusters," said Samuel J. Palmisano, chairman, president and chief executive officer, IBM. "We’re aiming to train tomorrow’s programmers to write software that can support a tidal wave of global Web growth and trillions of secure transactions every day."

For this project, the two companies have dedicated a large cluster of several hundred computers (a combination of Google machines and IBM BladeCenter and System x servers) that is planned to grow to more than 1,600 processors. Students will access the cluster via the Internet to test their parallel programming course projects. The servers will run open source software including the Linux operating system, XEN systems virtualization and Apache’s Hadoop project, an open source implementation of Google’s published computing infrastructure, specifically MapReduce and the Google File System (GFS).
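
To give a feel for the MapReduce programming model mentioned above, here is a minimal single-process sketch of the classic word-count example. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and the grouping step that a real framework performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, group by word, then reduce each group."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one input split stored on the cluster.
    splits = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(splits))
```

Each call to `map_phase` depends only on its own split, and each call to `reduce_phase` depends only on one key, which is what lets a framework spread both phases across many servers.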

At the University of Washington, students were able to harness the power of distributed computing to produce complicated programs such as software that scans voluminous Wikipedia edits to identify spam and organizes global news articles by geographic location.

"In 2006, when I helped Christophe Bisciglia, a former UW student now a senior engineer at Google, to develop the program, our goal was to understand the challenges that universities face in teaching important new concepts such as large-scale computing and develop methods to address this issue," said Ed Lazowska, Bill & Melinda Gates Chair of Computer Science & Engineering at the University of Washington. "A year later, we’ve seen how our students have mastered many of the techniques that are critical for large-scale internet computing, benefiting our department and students."

"Carnegie Mellon applauds Google and IBM for helping to provide the resources that will help professors better prepare our students for the challenges presented by highly parallel computing," said Randal Bryant, Dean of the School of Computer Science at Carnegie Mellon University. "We are quite pleased to be among the first universities participating in this program this fall."

To simplify the development of massively parallel programs, Google and IBM have created the following resources:

  • A cluster of processors running an open source implementation of Google’s published computing infrastructure (MapReduce and GFS from Apache’s Hadoop project)
  • A Creative Commons licensed university curriculum developed by Google and the University of Washington focusing on massively parallel computing techniques available at: code.google.com/edu/content/parallel.html
  • Open source software designed by IBM to help students develop programs for clusters running Hadoop. The software works with Eclipse, an open source development platform. The plugin is currently available at: lucene.apache.org/hadoop/
  • Management, monitoring and dynamic resource provisioning of the cluster by IBM using IBM Tivoli systems management software
  • A website to encourage collaboration among universities in the program. This will be built on Web 2.0 technologies from IBM’s Innovation Factory.

About IBM and Education

IBM has a long-standing commitment to furthering education, including its IBM Academic Initiative, an innovative program offering a wide range of technology education benefits from free to fee that can scale to meet the goals of most colleges and universities. IBM will work with schools – that support open standards and seek to use open source and IBM technologies for teaching purposes – both directly and virtually via the Web. For more information on the IBM Academic Initiative, visit www.ibm.com/university.

About Google Inc.

Google’s innovative search technologies connect millions of people around the world with information every day. Founded in 1998 by Stanford Ph.D. students Larry Page and Sergey Brin, Google today is a top Web property in all major global markets. Google’s targeted advertising program provides businesses of all sizes with measurable results, while enhancing the overall Web experience for users. Google is headquartered in Silicon Valley with offices throughout the Americas, Europe and Asia. For more information, visit www.google.com.

Media Contacts:

Jon Murchinson
jonm@google.com
650-253-4437

Colleen Haikes
chaikes@us.ibm.com
415-545-4003

 
