Programming Languages for Data-Oriented Programming
The battle between programming languages has always been a hot topic in the tech world. And given how fast technology is advancing, we have a new programming language or framework every few months.
This makes it ever harder for developers, analysts, and researchers to choose the best language that will get their tasks done efficiently while incurring the lowest cost.
But I think we tend to choose a language for the wrong reasons. Many factors go into picking a language. And with data science projects flooding the market, the question is NOT "which is the best language?" but "which one suits your project requirements and environment (work setting)?"
So, with this post, I will present you with the right set of questions you should be asking in order to decide which is the best programming language for your data science project.
Most commonly used programming languages for Data Science
Python and R are the most widely used languages for statistical analysis and machine learning-centric projects. But there are others, like Java, Scala, or MATLAB.
Both Python and R are state-of-the-art open-source programming languages with great community support. And we keep learning about new libraries and tools that allow us to achieve greater levels of performance and complexity.
Python
Python is well known for its readable, easy-to-learn syntax. With a general-purpose (jack of all trades) language like Python, you can build complete scientific ecosystems without worrying much about compatibility or interfacing issues.
Python code has low maintenance costs and is arguably more robust. From data wrangling and feature selection to web scraping and deployment of machine learning models, Python can get almost everything done, and it integrates with all the major ML and deep learning libraries like Theano, TensorFlow, and PyTorch.
R
R was developed by academics and statisticians over two decades ago. Today it enables many statisticians, analysts, and developers to carry out their analyses effectively. More than 12,000 packages are available on CRAN (an open-source package repository).
Since it was developed with statisticians in mind, R is often the first choice for core scientific and statistical analysis. There is an R package for almost every kind of analysis.
Also, data analysis has been made very easy with tools like RStudio that allow you to communicate your results with concise and elegant reports.
4 questions to help you choose the best-suited language for your project
So, how do you make the right choice for your work at hand?
Try answering these 4 questions:
1. Which language/framework is preferred in your organisation/industry?
Look at the industry you are working in and the most commonly used language by your peers and competitors. It might be easier if you speak the same language.
Here is an analysis carried out by David Robinson, a data scientist. It’s a reflection of the popularity of R in each industry, and you can see that R is heavily used in Academia and Healthcare.
So, if you’re someone who wants to go into research, academia, or bioinformatics, you might consider R over Python.
The other side of this coin involves software industries, application-driven organizations, and product-based companies. You might have to use the tech stack of your organization’s infrastructure or the language that your colleagues/teams are using.
And most of these organizations and industries, academia included, have built their infrastructure on Python.
As an aspiring data scientist, therefore, you should focus on learning the language and tech that have the most applications and that can increase your chances of getting a job.
2. What is the scope of your project?
This is an important question, because before you pick up a language, you must have an agenda for your project.
For example, what if you simply want to solve a statistical problem using a dataset, perform some multivariate analyses, and prepare a report or a dashboard explaining the insights? In this case, R might be a better choice. It has some really powerful visualization and communication libraries.
On the other hand, what if your aim is to first carry out exploratory analysis, develop a deep learning model, and then deploy the model within a web application? Then Python’s web frameworks and support from all the major cloud providers make it a clear winner.
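To make the deployment step a little more concrete, here is a minimal sketch of a model served behind a web endpoint. It uses only Python's standard library (`wsgiref`) rather than any particular web framework, and the `predict` function is a hypothetical stand-in for a trained model, not a real deep learning model. The names `predict`, `app`, and `serve`, and the port number, are assumptions made for illustration.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def predict(x: float) -> float:
    """Hypothetical stand-in for a trained model: y = 2x + 1."""
    return 2.0 * x + 1.0


def app(environ, start_response):
    """WSGI app: answers a request with query string x=<number> with a JSON prediction."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        x = float(query["x"][0])
    except (KeyError, ValueError):
        # Missing or non-numeric input: report a client error.
        body = json.dumps({"error": "pass a numeric query parameter x"}).encode("utf-8")
        start_response("400 Bad Request", [("Content-Type", "application/json")])
        return [body]
    body = json.dumps({"x": x, "prediction": predict(x)}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def serve(port: int = 8000) -> None:
    """Run the app on localhost (the port is an arbitrary choice for this sketch)."""
    with make_server("localhost", port, app) as server:
        server.serve_forever()
```

Calling `serve()` and then opening `http://localhost:8000/?x=3` in a browser should return a small JSON document with the prediction. In a real project, `predict` would load and call the trained model, and a production-grade framework and server would typically replace `wsgiref`.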
3. How experienced are you in the field of data science?
For a beginner in data science who has limited familiarity with statistics and mathematical concepts, Python might be a better choice because it lets you code the fragments of an algorithm with ease.
With libraries like NumPy, you can manipulate matrices and code algorithms yourself. As a novice, it is better to learn to build things from scratch than to jump straight to machine learning libraries.
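As one possible illustration of coding an algorithm yourself with NumPy, the sketch below fits a linear regression by solving the normal equations directly. The function name `fit_linear_regression` and the toy data are assumptions made for this example; it is meant as a learning exercise, not as a replacement for a tested library.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations, with an intercept term."""
    # Prepend a column of ones so the first weight acts as the intercept.
    X_b = np.column_stack([np.ones(len(X)), X])
    # Solve (X^T X) w = X^T y instead of inverting the matrix explicitly.
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)


# Toy data generated from y = 2x + 1 (made up for this sketch).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

weights = fit_linear_regression(X, y)
print(weights)  # intercept first, then slope
```

On this toy data the recovered weights should be close to an intercept of 1 and a slope of 2. Writing the few lines of matrix algebra by hand makes it clearer what a library's `fit` method is doing for you.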
But if you already know the fundamentals of machine learning algorithms, you can pick up either of the languages and get started with them.
4. How much time do you have on hand, and what's the cost of learning?
The amount of time you can invest is another factor in your choice. Depending on your programming experience and your project's delivery deadline, you might choose one language over the other to get started in the field.
If you have a high-priority project and don't know either language, R might be the easier one to get started with, because it requires little or no programming experience. You can write statistical models in a few lines of code using existing libraries.
Python (often the programmer's choice) is a great option to start with if you have some bandwidth to explore the libraries and learn about methods of exploring datasets. (In the case of R, this can be done quickly within RStudio.)
Another important factor is that more mentors are available for Python than for R. If you need help with a Python or R project, you can look for a coding mentor here; signing up through this link also gets you a $10 credit toward your first mentor meeting.
Conclusion
In a nutshell, the gap between the capabilities of R and Python is getting narrower. Most jobs can be done by both languages. And both have rich ecosystems to support you.
Choosing a language for your project will then depend on:
- Your prior experience with data science (stats and math) and programming.
- The domain of the project at hand and the extent of statistical or scientific processing required.
- The future scope of your project.
- The language/framework that is most widely supported in your teams, organisation, and industry.
A video version of this blog is also available.
Data Science with Harshit
With this channel, I am planning to roll out a couple of series covering the entire data science space. Here is why you should be subscribing to the channel:
- The series will cover all the required, in-demand tutorials on each topic and subtopic, such as Python fundamentals for data science.
- Explanations of the mathematics and derivations behind why we do what we do in ML and deep learning.
- Podcasts with data scientists and engineers at Google, Microsoft, Amazon, etc., and CEOs of big data-driven companies.
- Projects and instructions to implement the topics learned so far.
You can connect with me on Twitter or LinkedIn.