Data Science's Evolution and Mine

License: CC 4.0 BY-SA

Original article: https://medium.com/@caroline_clark/data-sciences-evolution-and-mine-fb12ce3156ba

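With the explosion of data in the 21st century, data science, machine learning, and artificial intelligence are developing at an unprecedented pace and shaping our daily lives. From search engines and social media to manufacturing and healthcare, these technologies are applied ever more widely. This article looks at how data science has evolved and how one data scientist built up new skills through deep learning study to meet the industry's challenges.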

The Evolution of Data Science

There’s perhaps nothing that sets the 21st century apart from others more than the concept of data. Every interaction we have with a connected device creates a data record, and beams it back to some data store for tracking and analysis. Internet-connected devices are ubiquitous and growing. In 2018, there were approximately 8 connected devices per person in the United States. That number is expected to grow to 13.6 by 2023.¹

The vast amounts of data that are being collected by organizations and individuals have enabled ever more powerful — and transformational — machine learning algorithms. Machine learning and artificial intelligence (AI) shape our experience when we use a search engine, visit a social media website, or interact with a large company’s customer service. AI enables SpaceX to safely land its rockets back on Earth for reuse. It fuels a growing population of robots in manufacturing, generates novel chemical compositions for drug research, and brings the possibility of fully autonomous vehicles closer every day.

Yes, advances in compute power and better algorithms have also been a critical part of this advancement. But without good data, hardware and mathematical equations can only do so much. “Garbage in, garbage out,” as the old adage goes.

Data Science vs. Machine Learning vs. Artificial Intelligence

It’s probably useful at this point to discuss what we mean when we talk about data science, machine learning, and artificial intelligence (AI).

Historically, data science has involved the process of analyzing data to gain insights, typically business insights. As Andrew Ng explains in his Coursera course, AI for Everyone, the output of a data science analysis would typically be a PowerPoint presentation (though this isn’t necessarily the case anymore — more on that in a moment).² Such an output would typically serve key stakeholders in an organization or on a project.

Arthur Samuel, one of its pioneers, defined machine learning as “the field of study that gives computers the ability to learn without being explicitly programmed.” The output of a machine learning project is typically some type of software, for example an algorithm that automatically optimizes the listings you see on a job search site based on a variety of factors. Such an output could serve thousands, millions, or even billions of users.
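
To make Samuel's definition a little more concrete, here is a minimal Python sketch. It is an illustration only and does not come from any course or product mentioned in this post: the function names `fit_line` and `predict` and the toy data are hypothetical. Instead of hard-coding the rule y = 2x + 1, the program recovers the slope and intercept from example pairs using ordinary least squares.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept by least squares."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Toy examples (an assumption for illustration); the rule is never written in code.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

model = fit_line(xs, ys)
print(model)               # (2.0, 1.0)
print(predict(model, 10))  # 21.0
```

The rule lives in the data rather than in the source code: give `fit_line` different examples and the same program behaves differently.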

Artificial intelligence is the field of study involving how to build intelligent machines, typically with at least human-level performance on a given task (narrow AI) or on a diverse set of tasks (artificial general intelligence — AGI). We don’t know when we will reach AGI, or how we might know when we reach it.³ But in recent years, researchers and practitioners have achieved human-level or better performance on a variety of tasks using a specific type of machine learning called deep learning. Deep learning leverages an artificial neural network architecture, so you might see deep learning, neural networks, and AI used interchangeably in some settings.

Evolution: Data Science’s and Mine

Advances in deep learning are being increasingly leveraged by data scientists to develop both useful insights and products. Take, for example, the analyst who uses a natural language processing algorithm to analyze customer sentiment regarding a new product and presents the findings to an executive team. Or the data scientist who builds a recommendation engine and delivers this software to an engineering team for back-end integration.

The rapid evolution of these fields, easy access to powerful compute platforms, and the ubiquity of high-quality technical MOOCs (Massive Open Online Courses) contribute to the blurring of lines between data scientists, machine learning engineers, and even deep learning engineers.

Google’s search algorithm is probably the most widely used and under-recognized machine learning technology of the past 20 years. I began my career at Google and spent six years working in a variety of roles including on search and analytics teams. A lot of this work came down to helping customers optimize their usage of Google’s algorithms. Even during these early days (2008–2014), we were actively using machine learning-powered tools to provide both insights for our customers and automated campaign solutions. But the truth was this was only the infancy of the AI revolution.

Deep learning took off in the public sphere after deep convolutional neural networks started smashing performance records.⁴ I took notice of the disruption in industry. While working as a consultant, I spoke with folks in the field and embarked on a self-study journey to transition into a machine learning career, absorbing Andrew Ng’s Deeplearning.ai Coursera specialization, among other courses, research papers, and texts. As I started to work with clients in the space through a consulting firm, the experience was extremely rewarding and interesting.

COVID-19 and General Assembly

Enter COVID-19.

Though I was grateful to be in a better position than many folks out there, COVID-19 still led to some non-negligible disruption. But instead of thinking about this thing happening TO me, I wanted to flip the script and do something with the flexibility that came with working from home. As a lifelong learner in the machine learning and analytics space I had always felt like I was missing the data science portion of the puzzle. Back at Google I loved helping clients understand what was going on and what they should do using analytics, but I had gotten pretty far away from that, not to mention the cornucopia of new tools that are being used now to conduct analysis and relay the information in useful ways. After a lot of different conversations with colleagues and many late nights searching for the right solution to upgrade my data science skills, I settled on General Assembly. Specifically, I enrolled in General Assembly’s 12-week Data Science Immersive.

My goals with this course are:

Become a data wrangling master
Build a solid foundation in statistics
Enhance my machine learning knowledge

I’m excited to bring data science skills to my machine learning work in the future. Deep learning isn’t always feasible or necessary in a project depending on the data set and goal — this is where having a robust machine learning toolkit comes in handy. A solid statistics foundation can also be a boon when collecting and evaluating data quality, or when examining the impact labeling errors have on machine learning algorithm performance.
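
As a small illustration of that last point, here is a hedged Python sketch of how labeling errors can shift a model's decision boundary. Everything in it is an assumption made for illustration: the one-dimensional toy data, the systematic mislabeling of the 20 smallest points, and the choice of a nearest-centroid classifier. It is not drawn from a real project.

```python
def nearest_centroid_threshold(xs, labels):
    """Fit a 1-D nearest-centroid classifier and return its decision threshold.

    Assumes the class-0 centroid lies below the class-1 centroid, so points
    above the midpoint between the two centroids are predicted as class 1.
    """
    class0 = [x for x, y in zip(xs, labels) if y == 0]
    class1 = [x for x, y in zip(xs, labels) if y == 1]
    centroid0 = sum(class0) / len(class0)
    centroid1 = sum(class1) / len(class1)
    return (centroid0 + centroid1) / 2


def accuracy(threshold, xs, true_labels):
    """Fraction of points whose thresholded prediction matches the true label."""
    predictions = [1 if x > threshold else 0 for x in xs]
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(xs)


# Toy data: 100 points, with the true class determined by x >= 50.
xs = list(range(100))
clean_labels = [1 if x >= 50 else 0 for x in xs]

# Hypothetical systematic labeling error: the 20 smallest points are marked class 1.
noisy_labels = [1 if x < 20 else y for x, y in zip(xs, clean_labels)]

clean_threshold = nearest_centroid_threshold(xs, clean_labels)
noisy_threshold = nearest_centroid_threshold(xs, noisy_labels)

# Both models are scored against the true (clean) labels.
clean_accuracy = accuracy(clean_threshold, xs, clean_labels)
noisy_accuracy = accuracy(noisy_threshold, xs, clean_labels)

print(clean_threshold, clean_accuracy)  # 49.5 1.0
print(noisy_accuracy)                   # 0.96
```

In this toy setup the mislabeled points pull the class centroids toward each other's territory, the threshold drops from 49.5 to roughly 45.2, and four points near the boundary end up misclassified.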

I’ll be sharing some of my journey on this blog over the coming months. If you’re interested, give me a follow.

¹ https://www.cisco.com/c/en/us/solutions/executive-perspectives/annual-internet-report/air-highlights.html#
² https://www.coursera.org/learn/ai-for-everyone
³ For more on the challenges AGI presents, see Max Tegmark’s book, Life 3.0.
⁴ https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
