Language: English Translation Practice (edX Data Scientist: A Very Short History Of Data Science)


A Very Short History Of Data Science


The story of how data scientists became sexy is mostly the story of the coupling of the mature discipline of statistics with a very young one: computer science. The term “Data Science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years. The following timeline traces the evolution of the term “Data Science” and its use, attempts to define it, and related terms.

1962 John W. Tukey writes in “The Future of Data Analysis”: “For a long time I thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and doubt… I have come to feel that my central interest is in data analysis… Data analysis, and the parts of statistics which adhere to it, must… take on the characteristics of science rather than those of mathematics… data analysis is intrinsically an empirical science… How vital and how important… is the rise of the stored-program electronic computer? In many instances the answer may surprise many by being ‘important but not vital,’ although in others there is no doubt but what the computer has been ‘vital.’” In 1947, Tukey coined the term “bit” which Claude Shannon used in his 1948 paper “A Mathematical Theory of Communication.” In 1977, Tukey published Exploratory Data Analysis, arguing that more emphasis needed to be placed on using data to suggest hypotheses to test and that Exploratory Data Analysis and Confirmatory Data Analysis “can – and should – proceed side by side.”

1974 Peter Naur publishes Concise Survey of Computer Methods in Sweden and the United States. The book is a survey of contemporary data processing methods that are used in a wide range of applications. It is organized around the concept of data as defined in the IFIP Guide to Concepts and Terms in Data Processing: “[Data is] a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process.” The Preface to the book tells the reader that a course plan was presented at the IFIP Congress in 1968, titled “Datalogy, the science of data and of data processes and its place in education,” and that in the text of the book, “the term ‘data science’ has been used freely.” Naur offers the following definition of data science: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”

1977 The International Association for Statistical Computing (IASC) is established as a Section of the ISI. “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”

1989 Gregory Piatetsky-Shapiro organizes and chairs the first Knowledge Discovery in Databases (KDD) workshop. In 1995, it became the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

September 1994 BusinessWeek publishes a cover story on “Database Marketing”: “Companies are collecting mountains of information about you, crunching it to predict how likely you are to buy a product, and using that knowledge to craft a marketing message precisely calibrated to get you to do so… An earlier flush of enthusiasm prompted by the spread of checkout scanners in the 1980s ended in widespread disappointment: Many companies were too overwhelmed by the sheer quantity of data to do anything useful with the information… Still, many companies believe they have no choice but to brave the database-marketing frontier.”

1996 Members of the International Federation of Classification Societies (IFCS) meet in Kobe, Japan, for their biennial conference. For the first time, the term “data science” is included in the title of the conference (“Data science, classification, and related methods”). The IFCS was founded in 1985 by six country- and language-specific classification societies, one of which, The Classification Society, was founded in 1964. The classification societies have variously used the terms data analysis, data mining, and data science in their publications.

1996 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth publish “From Data Mining to Knowledge Discovery in Databases.” They write: “Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing… In our view, KDD [Knowledge Discovery in Databases] refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data… the additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.”

1997 In his inaugural lecture for the H. C. Carver Chair in Statistics at the University of Michigan, Professor C. F. Jeff Wu (currently at the Georgia Institute of Technology) calls for statistics to be renamed data science and statisticians to be renamed data scientists.

1997 The journal Data Mining and Knowledge Discovery is launched; the reversal of the order of the two terms in its title reflects the ascendance of “data mining” as the more popular way to designate “extracting information from large databases.”

December 1999 Jacob Zahavi is quoted in “Mining Data for Nuggets of Knowledge” in Knowledge@Wharton: “Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Scalability is a huge issue in data mining. Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.”

2001 William S. Cleveland publishes “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” It is a plan “to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’” Cleveland puts the proposed new discipline in the context of computer science and the contemporary work in data mining: “…the benefit to the data analyst has been limited, because the knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of knowledge bases would produce a powerful force for innovation. This suggests that statisticians should look to computing for knowledge today just as data science looked to mathematics in the past. … departments of data science should contain faculty members who devote their careers to advances in computing with data and who form partnership with computer scientists.”

2001 Leo Breiman publishes “Statistical Modeling: The Two Cultures”: “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”

April 2002 Launch of Data Science Journal, publishing papers on “the management of data and databases in Science and Technology. The scope of the Journal includes descriptions of data systems, their publication on the internet, applications and legal issues.” The journal is published by the Committee on Data for Science and Technology (CODATA) of the International Council for Science (ICSU).

January 2003 Launch of Journal of Data Science: “By ‘Data Science’ we mean almost everything that has something to do with data: Collecting, analyzing, modeling… yet the most important part is its applications – all sorts of applications. This journal is devoted to applications of statistical methods at large… The Journal of Data Science will provide a platform for all data workers to present their views and exchange ideas.”

May 2005 Thomas H. Davenport, Don Cohen, and Al Jacobson publish “Competing on Analytics,” a Babson College Working Knowledge Research Center report, describing “the emergence of a new form of competition based on the extensive use of analytics, data, and fact-based decision making… Instead of competing on traditional factors, companies are beginning to employ statistical and quantitative analysis and predictive modeling as primary elements of competition.” The research is later published by Davenport in the Harvard Business Review (January 2006) and is expanded (with Jeanne G. Harris) into the book Competing on Analytics: The New Science of Winning (March 2007).

September 2005 The National Science Board publishes “Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century.” One of the recommendations of the report reads: “The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.” The report defines data scientists as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection.”

2007 The Research Center for Dataology and Data Science is established at Fudan University, Shanghai, China. In 2009, two of the center’s researchers, Yangyong Zhu and Yun Xiong, publish “Introduction to Dataology and Data Science,” in which they state “Different from natural science and social science, Dataology and Data Science takes data in cyberspace as its research object. It is a new science.” The center holds annual symposiums on Dataology and Data Science.

July 2008 The JISC publishes the final report of a study it commissioned to “examine and make recommendations on the role and career development of data scientists and the associated supply of specialist data curation skills to the research community.” The study’s final report, “The Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practice & Future Needs,” defines data scientists as “people who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology.”

January 2009 Harnessing the Power of Digital Data for Science and Society is published. This report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council states that “The nation needs to identify and promote the emergence of new disciplines and specialists expert in addressing the complex and dynamic challenges of digital preservation, sustained access, reuse and repurposing of data. Many disciplines are seeing the emergence of a new type of data science and management expert, accomplished in the computer, information, and data sciences arenas and in another domain science. These individuals are key to the current and future success of the scientific enterprise. However, these individuals often receive little recognition for their contributions and have limited career paths.”

January 2009 Hal Varian, Google’s Chief Economist, tells the McKinsey Quarterly: “I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades… Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it… I do think those skills – of being able to access, understand, and communicate the insights you get from data analysis – are going to be extremely important. Managers need to be able to access and understand the data themselves.”

(This covers only the first page of the history; the next page has not been read yet.)

 


Books on big data tend to fall into one of two categories: either they offer no explanation as to how things actually work or they are highly mathematical textbooks suitable only for graduate students. The aim of this book is to offer an alternative by providing an introduction to how big data works and is changing the world about us; the effect it has on our everyday lives; and the effect it has in the business world. Data used to mean documents and papers, with maybe a few photos, but it now means much more than that. Social networking sites generate large amounts of data in the form of images, videos, and movies on a minute by minute basis. Online shopping creates data as we enter our address and credit card details. We are now at a point where the collection and storage of data is growing at a rate unimaginable only a few decades ago but, as we will see in this book, new data analysis techniques are transforming this data into useful information. While writing this book, I found that big data cannot be meaningfully discussed without frequent reference to its collection, storage, analysis, and use by the big commercial players. Since research departments in companies such as Google and Amazon have been responsible for many of the major developments in big data, frequent reference will be made to them. The first chapter introduces the reader to the diversity of data in general before explaining how the digital age has led to changes in the way we define data. Big data is introduced informally through the idea of the data explosion, which involves computer science, statistics, and the interface between them. In Chapters 2 to 4, I have used diagrams quite extensively to help explain some of the new methods required by big data. The second chapter explores what makes big data special and, in doing so, leads us to a more specific definition. In Chapter 3, we discuss the problems related to storing and managing big data. Most people are familiar with the need to back