Google Scholar citations: 18
Pages: 4
Published: 2017.08
Venue: Proceedings of the National Academy of Sciences (PNAS)
Authors: David M. Blei and Padhraic Smyth
Keywords: data science, statistics, machine learning
Abstract:
Data science has attracted a lot of attention, promising to turn vast amounts of data into useful predictions
and insights. In this article, we ask why scientists should care about data science. To answer, we discuss
data science from three perspectives: statistical, computational, and human. Although each of the three is
a critical component of data science, we argue that the effective combination of all three components is
the essence of what data science is about.
Conclusion:
- To solve real-world problems, data scientists will need to undertake tasks that are beyond their traditional training.
- Holistic data science requires that we understand the context of data, appreciate the responsibilities involved in using private and public data, and clearly communicate what a dataset can and cannot tell us about the world.
Introduction:
- Data science is the child of statistics and computer science.
- Genetic data can potentially aid researchers in studying the human genome, helping them understand how it evolves and how it governs observed traits.
- Connecting genes and traits at large scale is a problem that is beyond the limits of classical genome analysis, both computationally and statistically.
- Applying modern statistical and computational tools to modern scientific questions requires significant human judgment and deep disciplinary knowledge.
Paper outline:
1. Introduction
2. Statistical Perspective
3. Computational Perspective
4. Human Perspective
5. Summary
Excerpts from the main text:
2. Statistical Perspective
- All datasets involve uncertainty.
- Statistics relates to data science through multiple statistical subfields: complex and structured data, high dimensionality, and causality.
- To handle high-dimensional data, statisticians and computer scientists have developed powerful methods involving robustness, regularization, and stability.
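To make the regularization point concrete, here is a minimal sketch. It assumes a ridge (L2) penalty on a one-feature, no-intercept least-squares fit; the notes do not name a specific method, and the function name `ridge_slope` and the toy data are illustrative only. Minimizing sum((y - w*x)^2) + lam * w^2 gives w = sum(x*y) / (sum(x*x) + lam), so a larger penalty shrinks the estimate toward zero.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)**2) + lam * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data (hypothetical), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 is ordinary least squares; increasing lam shrinks the slope.
for lam in (0.0, 1.0, 10.0):
    print(f"lam={lam:5.1f}  slope={ridge_slope(xs, ys, lam):.4f}")
```

With many features the same penalty stabilizes estimates when there are more variables than observations, which is the setting the bullet above refers to.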
3. Computational Perspective
- Computational thinking provides a way to understand and compare the computational footprints of data analysis methods.
- One well-known example of computational thinking revolves around optimization.
- Another example of computational thinking is sampling methods. Sampling methods help compute approximate solutions to data analysis problems where the exact solutions are too complex for direct mathematical analysis. Examples include the bootstrap and, in Bayesian data analysis, Markov chain Monte Carlo (MCMC).
- A final example of computational thinking is in scaling data analysis with distributed computing.
- While statistical thinking offers a suite of methods for understanding data, computational thinking provides the crucial considerations of how to balance statistical accuracy with limited computational resources.
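The sampling idea mentioned above can be sketched with a percentile bootstrap. This is a minimal illustration, not the paper's own procedure: the function name `bootstrap_ci`, its parameters, and the sample values are all assumptions. It resamples the data with replacement, recomputes the statistic each time, and reads an approximate confidence interval off the sorted estimates. The `n_resamples` argument is where statistical accuracy is traded against computation.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Approximate (1 - alpha) percentile-bootstrap interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
sample = [2.3, 1.9, 3.1, 2.8, 2.2, 3.5, 2.7, 1.8, 2.9, 3.0]
low, high = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, approx. 95% CI = ({low:.2f}, {high:.2f})")
```

More resamples give a smoother interval at a higher computational cost, which mirrors the balance between accuracy and resources described in the last bullet.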
4. Human Perspective
- The data analysis process involves understanding a problem domain, deciding which data to acquire and how to process it, exploring and visualizing the data, selecting appropriate statistical models and computational methods, and communicating the results of the analyses.
- The human perspective reveals how aspects of the data analysis process, such as metadata, data provenance, data analysis workflows, and scientific reproducibility, are critical to modern scientific research.