13 Vital Machine Learning Interview Questions

Given the growing demand for Machine Learning experts, more and more developers are looking for the questions commonly asked in these interviews. In this story, I have listed the typical questions that other developers and I were asked while applying for jobs in the field.

Q1. What is the difference between Machine Learning and Deep Learning?

Machine Learning is the practice of using algorithms to organize data, learn from it, and make predictions about real-world problems. Instead of relying on a hand-written solution for each specific problem, the machine is trained on large amounts of data to find solutions on its own.

Deep Learning is a form of Machine Learning loosely modeled on how the human brain works. In short, Deep Learning is one technique for implementing Machine Learning. Neural networks, built on the principle of interconnected neurons, are its essential component.

Q2. What do Precision and Recall mean?

Put simply, precision is the number of relevant results returned over the total number of results returned. It is the fraction of the results that are relevant. So if you search for “brown dogs” on Google and only seven of the ten images returned are brown dogs, the precision is 7/10 = 0.7.

Recall is the number of relevant instances retrieved over the total number of relevant instances. Here we compare the correct entries that were shown to us against all the correct entries available. Say there are fifteen correct pages about brown dogs. Since we only got seven of them, our recall is 7/15 ≈ 0.47.
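The brown-dog numbers can be reproduced with a short sketch. The function name `precision_recall` and its count-based arguments are illustrative assumptions, not something from the original article.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw counts."""
    retrieved = true_positives + false_positives   # everything the search returned
    relevant = true_positives + false_negatives    # everything it should have returned
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Brown-dog example: 10 images returned, 7 of them relevant, 15 relevant pages in total.
precision, recall = precision_recall(true_positives=7, false_positives=3, false_negatives=8)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The three returned images that are not brown dogs count as false positives, and the eight relevant pages that never showed up count as false negatives.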

Q3. What is the F1 score? How would you use it?

In statistical analysis, F1 is a measure of a test’s accuracy. In the case of Machine Learning, it refers to the model’s performance.

It is calculated as the harmonic mean of the precision and recall of a model: F1 = 2 · precision · recall / (precision + recall). Scores closer to 1 are better and scores closer to 0 are worse. The F1 score is most useful when we need to balance precision and recall and the class distribution is uneven.
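As a small worked example, F1 is commonly defined as the harmonic mean of precision and recall. The sketch below applies that definition to the brown-dog numbers from Q2; the helper name `f1_score` is an assumption made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 7/10 and recall 7/15 from the brown-dog search example.
f1 = f1_score(0.7, 7 / 15)
print(round(f1, 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing precision alone or recall alone.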

Q4. Which is more important: model accuracy or model performance?

This is a rather simple trick question, one of many you might get. Model accuracy is only one part of model performance, so the two are not in competition. In general they move together: a better-performing model makes more accurate predictions.

Q5. Given a data set, how do you decide which Machine Learning algorithm to use?

There is certainly no single superior algorithm that can be used in every scenario. That is why there are so many of them, each targeted at a certain kind of problem or data set. To decide which one to use, consider the following questions:

  • How large is the data set? Is it continuous or categorical?
  • Is it labeled, unlabeled, or a combination?
  • Is the problem one of association, classification, clustering, or regression?
  • What is the goal of the algorithm?

Based on these questions, here are some examples of algorithms for each case:

  • Continuous output -> Linear Regression
  • Categorical output -> Decision Tree, KNN, Random Forest, Logistic Regression, Naive Bayes
  • Clustered output -> K-Means Clustering, Hierarchical Clustering, PCA (strictly a dimensionality-reduction method, often paired with clustering)
  • Non-linear interaction data -> Boosting, Bagging
  • Association output -> Apriori
  • Image or audio output -> Neural Networks

Q6. What is your favorite algorithm? Explain it in under two minutes.

The ability to summarize complex ideas and explain them simply and clearly is often tested in interviews. For this type of question, make sure you have several algorithms you can explain well, and practice explaining them within a short time limit.

Q7. How would you handle missing or corrupt data in a dataset?

If you find missing or corrupted data in a dataset, you could either drop the affected rows or columns or replace the bad values with another value.

For example, in Pandas, isnull() helps you find missing values and dropna() drops the rows or columns that contain them. You could also use fillna() to replace those values with zeros.
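A minimal sketch of those three Pandas calls follows. The toy DataFrame and its column names are made up for illustration, and the mean-based fill at the end is an extra option that goes beyond the zero fill mentioned in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, 58000],
})

missing_per_column = df.isnull().sum()   # count missing values in each column
rows_dropped = df.dropna()               # drop every row that has a missing value
cols_dropped = df.dropna(axis=1)         # drop every column that has a missing value
zero_filled = df.fillna(0)               # replace missing values with zeros

# Assumption beyond the text: fill each column with its own mean instead of zero.
mean_filled = df.fillna({"age": df["age"].mean(), "income": df["income"].mean()})

print(missing_per_column)
print(rows_dropped)
```

Dropping is the simplest choice but discards data; in this toy frame every column has a gap, so `dropna(axis=1)` leaves no columns at all, which is one reason filling is sometimes preferred.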

Q8. What are the differences between an array and a linked list?

Data structures are tested in all types of interviews, so it is generally good practice to prepare for them.

An array is a set of similar data objects stored sequentially under a common variable name.

A linked list is a data structure that contains a sequence of elements, where each one is linked to the next. Each item in a linked list commonly contains two fields: the data field and the link to the next item.

The most notable difference is in the way the two are stored. A classic array occupies one contiguous block of memory and has a fixed size, which has to be declared before it is used. The items of a linked list, on the other hand, can be stored separately and are connected only by their links, so the list can always be extended.
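The two-field item described above can be sketched as follows. The class and method names (`Node`, `LinkedList`, `push_front`, `get`) are illustrative choices. Note that a Python list is a dynamic array, so it stands in for the array here without the fixed-size restriction.

```python
class Node:
    """One item of a singly linked list: a data field plus a link to the next item."""

    def __init__(self, data, next_node=None):
        self.data = data
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, data):
        """Insert at the head in O(1); no existing item has to move."""
        self.head = Node(data, self.head)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.data
            node = node.next

    def get(self, index):
        """Reach the item at `index` by walking the links, which takes O(n)."""
        for position, data in enumerate(self):
            if position == index:
                return data
        raise IndexError(index)


linked = LinkedList()
for value in (3, 2, 1):
    linked.push_front(value)

array_like = [1, 2, 3]   # array-backed: indexing jumps straight to the element
print(array_like[2], linked.get(2), list(linked))
```

The trade-off shows up in the two access paths: the array indexes directly, while the linked list has to follow links from the head but can grow at the front without copying anything.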

Q9. What is decision tree classification?

Decision trees build models as tree structures. The data is broken down into smaller and smaller subsets, which follow the structure of a tree with nodes and branches. Decision trees can handle both numerical and categorical data.
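To make the "smaller subsets" idea concrete, here is a minimal sketch of a classifier tree that splits on numeric thresholds using Gini impurity. It is one common approach, not the only one: it handles numeric features only, and the nested-dict tree layout, function names, and toy data are all assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels):
    """Recursively split the data until a node is pure or no split helps."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left]),
        "right": build_tree([r for r, _ in right], [l for _, l in right]),
    }


def predict(tree, row):
    """Follow branches from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical toy data: [height_cm, weight_kg] -> species
rows = [[20, 4], [25, 5], [60, 30], [70, 35]]
labels = ["cat", "cat", "dog", "dog"]
tree = build_tree(rows, labels)
print(predict(tree, [22, 4]), predict(tree, [65, 32]))
```

On this toy data a single split on height is enough to separate the two classes, so the tree has one internal node and two leaves.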

Q10. What is pruning in decision trees? How is it done?

Pruning means simplifying or compressing a decision tree by removing redundant or inessential sections of the tree. It reduces overfitting and can therefore improve accuracy on unseen data.

Pruning can be done in two ways (directions):

  • Bottom-up: starting from the leaves and going up.
  • Top-down: starting from the root and going down.

Mentioning a specific pruning algorithm, such as reduced error pruning, demonstrates a practical understanding of the concept. Be sure to explain it quickly while covering its most important parts.
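As a rough illustration of bottom-up reduced error pruning, the sketch below collapses a node into a leaf whenever the leaf does no worse on held-out validation data. It is one simple variant under stated assumptions: the tree is stored as nested dicts, each internal node carries a `majority` field holding its majority training class, and the hand-built tree and validation rows are hypothetical.

```python
def predict(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def errors(tree, rows, labels):
    """Number of validation rows the (sub)tree misclassifies."""
    return sum(predict(tree, row) != label for row, label in zip(rows, labels))


def prune(tree, rows, labels):
    """Prune bottom-up using only the validation rows that reach this node."""
    if not isinstance(tree, dict):
        return tree                       # already a leaf
    f, t = tree["feature"], tree["threshold"]
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    pruned = dict(tree)
    pruned["left"] = prune(tree["left"], [r for r, _ in left], [l for _, l in left])
    pruned["right"] = prune(tree["right"], [r for r, _ in right], [l for _, l in right])
    leaf = tree["majority"]               # majority training class stored at the node
    # Collapse the node when a single leaf does no worse on the validation data.
    if errors(leaf, rows, labels) <= errors(pruned, rows, labels):
        return leaf
    return pruned


# Hypothetical overfit tree: the right subtree memorized one noisy training point.
overfit_tree = {
    "feature": 0, "threshold": 25, "majority": "dog",
    "left": "cat",
    "right": {"feature": 1, "threshold": 33, "majority": "dog",
              "left": "dog", "right": "cat"},
}
val_rows = [[22, 4], [24, 6], [65, 32], [72, 36], [68, 34]]
val_labels = ["cat", "cat", "dog", "dog", "dog"]

pruned_tree = prune(overfit_tree, val_rows, val_labels)
print(pruned_tree)
```

Here the inner split misclassifies two validation dogs, so it is replaced by a "dog" leaf, while the root split is kept because removing it would misclassify both cats.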

Q11. Compare K-means and KNN algorithms.

  • K-means is unsupervised, while KNN is supervised.
  • K-means is a clustering algorithm, while KNN is a classification (or regression) algorithm.
  • In K-means, the points within each cluster are similar to one another and the clusters differ from each other. KNN classifies an unlabeled observation based on its K nearest neighbors.
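The contrast can be sketched side by side in plain Python. This is a simplified illustration: the function names, toy points, and fixed starting centroids are assumptions, and practical K-means implementations usually pick starting centroids randomly and stop on convergence rather than after a fixed number of iterations.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k=3):
    """Supervised: majority vote among the k labeled points closest to `query`."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def k_means(points, centroids, iterations=10):
    """Unsupervised: alternate nearest-centroid assignment and centroid re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["a", "a", "a", "b", "b", "b"]          # needed by KNN only

print(knn_predict(points, labels, query=(2, 2)))  # uses the labels
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(clusters)                                   # found without any labels
```

KNN cannot run without `labels`, whereas K-means never sees them and only groups the points by distance.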

Q12. What are the last Machine Learning papers you have read?

These types of questions are aimed at finding out whether you are actually passionate about Machine Learning. Keeping up with the latest scientific literature is essential to demonstrate your interest in the position.

Search for some credible research papers and make sure you understand them well. As a general practice, it is best to actually read the literature in your field of study and work.

Q13. Where do you usually search for datasets?

If you are truly passionate about the field, you have probably done several projects on your own. This question is used to better understand your interest, and it is a chance to show your knowledge and mention your past projects. Check out which dataset sources are out there and be prepared to talk about them.

Final Thoughts

As you might have noticed, not every interview question is trying to test your programming skills. Proper skills are essential; however, an understanding of the concepts and a general interest in Machine Learning are no less important.

You can never know exactly which questions you will get in an interview; however, they tend to repeat quite often. It is quite likely that you will get one of these questions, or a similar one, in your next interview. Make sure to practice and take your time to understand these concepts.

Translated from: https://towardsdatascience.com/13-vital-machine-learning-interview-questions-476cf5b0aa43
