Over 100 Data Science Interview Questions 北美数据科学面试题和参考答案

最新推荐文章于 2024-05-11 09:34:57 发布

jackly231


文章标签：北美数据科学笔试题数据科学面试题数据分析笔试面试

本文链接：https://blog.csdn.net/liweijie231/article/details/81623656


Over 100 Data Science Interview Questions

General Questions

Apple

Suppose you’re given millions of users that each have hundreds of transactions and these millions of transactions are for tens of thousands of products. How would you group the users together in meaningful segments?

如果你有几百万用户，每个用户都会发生数百笔交易，这些交易存在于数十种产品中。你该如何把这些用户细分成有意义的几类？

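Example: users buying and trading stocks.

From a data analysis angle: multidimensional analysis
From a data mining angle: features
From a statistics angle: hypothesis testing
User segments:
- Regular users
- Regular followers
- Loyal users
- Core users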

Microsoft

Describe a project you’ve worked on and how it made a difference

描述一个你曾经参与的项目，以及它的优点。

Paper abstract
Algorithm and model
Conclusion

How would you approach a categorical feature with high-cardinality?

如何处理具有高基数（high-cardinality）的类属特征？

One-Hot encoding

Dummy encoding

First group the categories using business knowledge, convert them to ordinal data, and then encode them.

A good article on this topic: Beyond One-Hot: An Exploration of Categorical Variables
Simple Methods to deal with Categorical Variables in Predictive Modeling

Assume we have 500 levels in a categorical variable. Should we then
create 500 dummy variables? If you can automate it, very well. Otherwise,
I’d suggest first reducing the number of levels by combining them
and then using dummy coding. This would save you time. This method is also
known as “One Hot Encoding”.
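The "combine levels, then dummy-code" advice above can be sketched as follows. This is a minimal illustration, not the cited article's code: the helper names, the `min_count` threshold, and the `OTHER` label are all assumptions made for the example.

```python
from collections import Counter

def collapse_rare_levels(values, min_count=2, other_label="OTHER"):
    """Replace levels seen fewer than min_count times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

def one_hot(values):
    """Return the sorted levels and one 0/1 row per value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical data: three of the five cities appear only once.
cities = ["NYC", "SF", "NYC", "LA", "SF", "Boise", "Reno", "NYC"]
collapsed = collapse_rare_levels(cities, min_count=2)
levels, encoded = one_hot(collapsed)
print(levels)      # ['NYC', 'OTHER', 'SF']
print(encoded[3])  # [0, 1, 0]
```

Collapsing first keeps the number of dummy columns bounded by the number of frequent levels rather than by the raw cardinality.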
Multiple Regression with Categorical Variables
Level of measurement

| Incremental progress | Measure property | Mathematical operators | Advanced operations | Central tendency |
|---|---|---|---|---|
| Nominal | Classification, membership | =, != | Grouping | Mode |
| Ordinal | Comparison, level | >, < | Sorting | Median |
| Interval | Difference, affinity | +, - | Yardstick | Mean, deviation |
| Ratio | Magnitude, amount | *, / | Ratio | Geometric mean, coefficient of variation |

Twitter

What would you do to summarize a Twitter feed?

如果想要给 Twitter feed 写 summarize，你要怎么办？
What are the steps for wrangling and cleaning data before applying machine learning algorithms?

在应用机器学习算法之前纠正和清理数据的步骤是什么？

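Data preprocessing: check for and handle missing values, dirty data, and outliers

Normalization: min-max scaling, z-score, log transform, piecewise normalization, rank-based normalization

Feature selection: filter methods (based on correlation statistics), wrapper methods (feature subset search), embedded methods (lasso, ridge), dimensionality reduction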
How do you measure distance between data points?

如何测量数据点之间的距离？

Nominal data: Jaccard distance

Ordinal data: convert to numeric or nominal data first

Numeric data: p-norm (Minkowski) distance, cosine similarity
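The three measures named above can be written in a few lines of plain Python. This is a minimal sketch with illustrative function names; in practice a library such as SciPy would usually be used.

```python
import math

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of nominal values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def minkowski_distance(x, y, p=2):
    """p-norm distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(jaccard_distance({"a", "b"}, {"b", "c"}))  # 1 - 1/3
print(minkowski_distance([0, 0], [3, 4], p=2))   # 5.0
print(cosine_similarity([1, 0], [0, 1]))         # 0.0
```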
Define variance.

请定义一下方差。
Describe the differences between and use cases for box plots and histograms.

请描述箱形图（box plot）和直方图（histogram）之间的差异，以及它们的用例。

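Histogram: visualizes the distribution of the raw data

Box plot: visualizes summary features of the distribution (median, quartiles, outliers)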

Twitter

What features would you use to build a recommendation algorithm for users?

你会使用什么功能来为用户构建推荐算法？
- User segments
  - Regular users
  - Regular followers
  - Loyal users
  - Core users
- Collaborative filtering
- Content-based filtering
- Hybrid recommender systems

Uber

Pick any product or app that you really like and describe how you would improve it.

选择任何一个你真正喜欢的产品或应用程序，并描述如何改善它。
How would you find an anomaly in a distribution ?

如何在分布中发现异常？

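Parametric methods: Gaussian model

Non-parametric methods: histogram, box plot, scatter plot

Clustering: points in sparse clusters are more likely to be anomalies

Classification: One-Class SVM, KNN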
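As a rough illustration of the parametric and non-parametric ideas above, the sketch below flags outliers with a Gaussian z-score rule and with the box-plot (IQR) rule. The thresholds and the sample data are assumptions chosen for the example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Gaussian model: flag points more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Box-plot rule: flag points beyond k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one obvious anomaly.
data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 50]
print(zscore_outliers(data, threshold=2.5))  # [50]
print(iqr_outliers(data))                    # [50]
```

The z-score rule is itself sensitive to the outlier (it inflates the standard deviation), which is one reason the quartile-based rule is often preferred on small samples.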
How would you go about investigating if a certain trend in a distribution is due to an anomaly?

如何检查分布中的某个趋势是否是由于异常产生的？

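The question is not clearly stated.

"A certain trend is due to an anomaly"

Compare the Spearman and Pearson correlation coefficients. Pearson is sensitive to extreme values while the rank-based Spearman coefficient is not, so a large gap between the two suggests that outliers are driving the trend.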
How would you estimate the impact Uber has on traffic and driving conditions?

如何估算 Uber 对交通和驾驶环境造成的影响？

Correlation analysis
What metrics would you consider using to track if Uber’s paid advertising strategy to acquire new customers actually works? How would you then approach figuring out an ideal customer acquisition cost?

你会考虑用什么指标来跟踪 Uber 付费广告策略在吸引新用户上是否有效？然后，你想用什么办法估算出理想的客户购置成本？

(1)

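Compare the pirate metrics (AARRR) before and after the paid campaign.

Explain in more detail how the underlying data model would be implemented (E-R model, dimensional model).

(2)

A concrete example of how customer acquisition cost is calculated in my industry:

Metrics:

Average net profit per new active user per year

Average net profit per new "effective" account per year

Method:

A user and profit growth model

(Big Data Engineer) Can you explain what REST is?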

（大数据工程师）请解释 REST 是什么。

Machine Learning Questions

Google

Why do you use feature selection?

为什么要使用特征选择（feature selection）？

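Business perspective: interpretability, understanding how much each input affects the output

Algorithmic perspective: curse of dimensionality, dimensionality reduction, making the learning task easier

Feature selection methods: filter, wrapper, and embedded methods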
What is the effect on the coefficients of logistic regression if two predictors are highly correlated? What are the confidence intervals of the coefficients?

如果两个预测变量高度相关，它们对逻辑回归系数的影响是什么？系数的置信区间是什么？
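- Effect on the coefficients: the variance of the estimated coefficients increases
- Business impact: interpretability gets worse
- Confidence intervals of the coefficients: they become wider, because the standard errors are inflated
  
  What is the variance inflation factor (VIF)?
  
  What is the effect of having correlated predictors in a multiple regression model?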
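To make the VIF concrete: for a predictor $X_j$, $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing $X_j$ on the other predictors. With exactly two predictors, $R_j^2$ is the squared Pearson correlation between them, which the sketch below uses. The function names and data are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1, 2, 3, 4, 5]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1]  # nearly a copy of x1
x3 = [2, -1, 0, 1, -2]          # only moderately correlated with x1

print(vif_two_predictors(x1, x2))  # large: coefficient variance is heavily inflated
print(vif_two_predictors(x1, x3))  # 1.5625
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity.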
What’s the difference between Gaussian Mixture Model and K-Means?

高斯混合模型（Gaussian Mixture Model）和 K-Means 之间有什么区别？

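K-Means: non-parametric, non-probabilistic; measures membership by similarity (distance)

GMM: parametric, probabilistic; measures membership by probability

K-means can be viewed as a special case of a mixture of Gaussians (equal spherical covariances with hard assignments), and a mixture of Gaussians is typically fit with Expectation-Maximization.

The biggest difference between K-means and GMM in practice is:

K-means only detects spherical clusters.

GMM can adapt itself to elliptically shaped clusters.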

https://metacademy.org/graphs/concepts/gaussian_mixtures_vs_k_means#focus=k_means&mode=learn

https://www.quora.com/What-is-the-difference-between-K-means-and-the-mixture-model-of-Gaussian
How do you pick k for K-Means?

在 K-Means 中如何拾取 k？
- Based on business understanding
- Plot the cost function (e.g., RMSE) against k and look for the elbow of the curve
- Other methods: Unsupervised-model-11.pdf
How do you know when Gaussian Mixture Model is applicable?

你如何知道高斯混合模型是不是适用的？

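Quora: GMM vs K-means

Probabilistic model: it estimates the probability density of the samples.

Classification: the output is not a hard class label but a probability for each class.

Evaluating a sample under each Gaussian component gives its probability for each class; the class with the highest probability can then be taken as the decision.

In theory, by increasing the number of components, a GMM can approximate any probability distribution.

A GMM consists of K Gaussian distributions, each called a "component"; a weighted linear combination of these components forms the GMM probability density function:

$\begin{aligned} p(x) &= \sum_{k=1}^{K} p(k)\,p(x \mid k) \\ &= \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k) \end{aligned}$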
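The mixture density and the per-class probabilities described above can be sketched for the one-dimensional case. The parameters below are made up for illustration; a real model would estimate them with EM.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def gmm_density(x, weights, mus, sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def responsibilities(x, weights, mus, sigmas):
    """Posterior probability p(k | x) of each component for a sample x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component mixture.
weights, mus, sigmas = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
print(responsibilities(2.0, weights, mus, sigmas))  # equidistant: [0.5, 0.5]
print(responsibilities(0.0, weights, mus, sigmas))  # almost all mass on component 0
```

Picking the component with the largest responsibility gives the hard decision mentioned above, while the responsibilities themselves are the soft assignment that K-means lacks.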
Assuming a clustering model’s labels are known, how do you evaluate the performance of the model?

假设聚类模型的标签是已知的，你如何评估模型的性能？

Treat it as a classification problem: confusion matrix, precision, recall, and so on

Unsupervised-model-11.pdf

Microsoft

What’s an example of a machine learning project you’re proud of?

你有哪些引以为豪的机器学习项目？
Choose any machine learning algorithm and describe it.

随意选择一个机器学习算法，并描述它。
Describe how Gradient Boosting works.

请解释 Gradient Boosting 是如何工作的。
(Data Mining) Describe the decision tree model.

（数据挖掘工程师）请解释决策树模型。
(Data Mining) What is a neural network?

（数据挖掘工程师）什么是神经网络？
Explain the Bias-Variance Tradeoff

请解释偏差方差权衡（Bias-Variance Tradeoff）。
How do you deal with unbalanced binary classification?

如何处理不平衡二进制分类？
What’s the difference between L1 and L2 regularization?

L1 和 L2 正则化之间有什么区别？
- Bayesian perspective:
  
  L1 regularization corresponds to a Laplace prior on the coefficients, which are then found by maximum a posteriori (MAP) estimation.
  
  L2 regularization corresponds to a Gaussian prior, again solved by MAP estimation.
- Effect on the regression coefficients:
  L2 penalizes one big weight more than many small weights.
  L1 doesn’t.
  So with L2, you tend to end up with many small weights, while with L1, you tend to end up with larger weights, but more zeros.
- Differences between L1 and L2 as Loss Function and Regularization
- [What is the difference between L1 and L2 regularization?](https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization)
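The "more zeros" versus "many small weights" behavior is easiest to see in one dimension. Assuming the simplified objectives $\tfrac12 (w - z)^2 + \lambda |w|$ for L1 and $\tfrac12 (w - z)^2 + \tfrac{\lambda}{2} w^2$ for L2, where $z$ is the unregularized estimate, the minimizers have closed forms: soft-thresholding for L1 and uniform shrinkage for L2. The values of `z` and `lam` below are illustrative.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2 (proportional shrinkage)."""
    return z / (1.0 + lam)

unregularized = [3.0, 0.4, -0.2, -2.0]
lam = 0.5
print([l1_shrink(z, lam) for z in unregularized])  # [2.5, 0.0, 0.0, -1.5]
print([l2_shrink(z, lam) for z in unregularized])  # all non-zero, all smaller
```

L1 sets the small coefficients exactly to zero and subtracts a constant from the large ones; L2 scales every coefficient by the same factor and never produces exact zeros.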
  
Uber

What sort of features could you give an Uber driver to predict if they will accept a ride request or not? What supervised learning algorithm would you use to solve the problem and how would you compare the results of the algorithm?

你会通过哪种特征来预测 Uber 司机是否会接受订单请求？你会使用哪种监督学习算法来解决这个问题，如何比较算法的结果？

Name and describe three different kernel functions and in what situation you would use each.

点出及描述三种不同的内核函数，在哪些情况下使用哪种？
Describe a method used in machine learning.

随意解释机器学习里的一种方法。
How do you deal with sparse data?

如何应付稀疏数据？

Ridge can handle both sparse and nonsparse data.

Fast Learning from Sparse Data

IBM

How do you prevent overfitting?

如何防止过拟合（overfitting）？
- Penalty methods
- Holdout and Cross-validation methods
- Ensembles
How do you deal with outliers in your data?

如何处理数据中的离群值？
- Compare the model results with and without dropping the outliers
  Outliers: To Drop or Not to Drop
- robust statistic
- How to Deal with Outliers in Your Data
How do you analyze the performance of the predictions generated by regression models versus classification models?
- classification models: confusion matrix
- regression models: RMSE, MAE
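A minimal sketch of both kinds of evaluation, using only the standard library. The labels and predictions are invented for the example, and the binary case assumes classes coded as 0 and 1.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp), tp / (tp + fn)

def rmse(y_true, y_pred):
    """Root mean squared error for regression outputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error for regression outputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(labels, preds))  # (3, 1, 1, 3)
print(precision_recall(labels, preds))  # (0.75, 0.75)
print(rmse([3, 5, 2], [2, 5, 4]), mae([3, 5, 2], [2, 5, 4]))
```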
How do you assess logistic regression versus simple linear regression models?

如何确定逻辑回归与简单线性回归模型？
- linear
  - $p = a_0 + a_1*X_1 + a_2*X_2 + \dots + a_k*X_k$
  - Regression problem: continuous inputs, continuous output
  - Estimation: ordinary least squares
- logistic
  - $\ln[p/(1-p)] = b_0 + b_1*X_1 + b_2*X_2 + \dots + b_k*X_k$
  - Classification problem: continuous inputs, discrete output
  - Estimation: maximum likelihood (assuming $P(y \mid x)$ follows a Bernoulli distribution)
- Linear Regression vs Logistic Regression vs Poisson Regression
What’s the difference between supervised learning and unsupervised learning?
- Supervised learning: the samples are labeled
  - Classification
  - Regression
- Unsupervised learning
  - Clustering
  - Association
- Supervised and Unsupervised Machine Learning Algorithms
- What is the difference between supervised learning and unsupervised learning?
What is cross-validation and why would you use it?

什么是交叉验证（cross-validation），为什么要使用它？
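- Drawbacks of the holdout method:
  - Not enough samples
  - The split may not be random enough, so the trained model is not robust
- Cross-validation:
  - Random Subsampling
  - K-Fold Cross-Validation
  - Leave-one-out Cross-Validation
- Uses:
  - Model and hyperparameter selection
  - Performance evaluation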
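The K-fold idea listed above amounts to rotating which slice of the data is held out. A minimal sketch of the index bookkeeping follows; it does not shuffle, so in practice the samples would be shuffled first (or a library splitter used).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample is used for testing exactly once, and the k scores are averaged for model selection or performance estimation. Leave-one-out is the special case `k == n_samples`.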
What’s the name of the matrix used to evaluate predictive models?

用于评估预测模型的矩阵的称为什么？

Confusion matrix
What relationships exist between a logistic regression’s coefficient and the Odds Ratio?

逻辑回归系数和胜算比（Odds Ratio）之间存在怎样的关联？

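$\ln[p/(1-p)] = b_0 + b_1*X_1 + b_2*X_2 + \dots + b_k*X_k$

The left-hand side is the log-odds, so $e^{b_i}$ is the odds ratio associated with a one-unit increase in $X_i$, holding the other predictors fixed.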
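A quick numeric check of that relationship, assuming a single predictor and made-up coefficients `b0` and `b1`: raising $X$ by one unit multiplies the odds by $e^{b_1}$, whatever the starting value of $X$.

```python
import math

def sigmoid(z):
    """Inverse of the logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

b0, b1 = -1.0, 0.7  # hypothetical fitted coefficients

p_at_2 = sigmoid(b0 + b1 * 2.0)
p_at_3 = sigmoid(b0 + b1 * 3.0)
odds_ratio = odds(p_at_3) / odds(p_at_2)

print(odds_ratio, math.exp(b1))  # the two values agree
```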
What’s the relationship between Principal Component Analysis (PCA) and Linear & Quadratic Discriminant Analysis (LDA & QDA)

主成分分析（PCA）与线性判别分析（LDA）、二次判别分析（QDA）之间存在怎样的关联？
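- PCA and LDA can both be used for dimensionality reduction
- PCA
  - Unsupervised
  - Projects along the directions that spread the data out as much as possible, i.e. that maximize variance
- LDA, QDA (generative models)
  - Supervised
  - Projects along the directions that best separate the classes
- Linear Discriminant Analysis
- Linear and Quadratic Discriminant Analysis
  - Generative approach: learn the joint distribution P(X,Y) from the data, then derive the conditional distribution P(Y|X) as the predictive model (can this be viewed as a Bayesian model?)

$P(Y|X) = \frac{P(X,Y)}{P(X)}$

Generative models model the distribution ( $P(X|Y)$ ) of individual classes
- The model describes the generative relationship by which input X produces output Y.
- Examples: naive Bayes, hidden Markov models
- Characteristics: fast to learn, converges to the true model faster as the sample size grows, and can be used when latent variables are present.

Discriminative approach:
- Discriminative models learn the (hard or soft) boundary between classes
- Learn the decision function f(X) or the conditional distribution P(Y|X) directly from the data as the predictive model
- Examples: KNN, perceptron, decision trees, logistic regression, maximum entropy models, support vector machines, boosting, conditional random fields
- Characteristics: can simplify the learning problem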

If you had a categorical dependent variable and a mixture of categorical and continuous independent variables, what algorithms, methods, or tools would you use for analysis?

如果你有一个因变量分类，又有一个连续自变量的混合分类，你将使用什么算法，方法或工具进行分析？
- Dummy coding
- Multiple Regression with Categorical Variables

(Business Analytics) What’s the difference between logistic and linear regression? How do you avoid local minima?

（行业分析师）逻辑与线性回归有什么区别？如何避免局部极小值？
- For the differences, see the IBM question on logistic versus linear regression above
- Avoiding local minima: use cross-entropy loss as the cost function; it is convex for logistic regression
- Logistic Regression

Salesforce

What data and models would you use to measure attrition/churn? How would you measure the performance of your models?

你会使用哪些数据和模型来测量用户流失？如何测试模型性能？
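- Churn rate:
  
  $\text{churn rate} = \frac{\text{users lost during the period}}{\text{users at the start of the period}}$
- Adjusted churn rate:
  
  $\text{adjusted churn rate} = \frac{\text{users lost during the period}}{\text{users at the start of the period} + \text{net new users during the period}}$
- Day-to-day operations: new users, user retention, and user churn curves
- Churn prediction model:
  - Classification problem
  - Logistic regression model
  - Feature selection:
    - Behavioral features: activity level, feature usage, purchase behavior
    - Attribute features: gender, age, occupation, education, and so on
- Performance evaluation: confusion matrix
  - Precision
  - Recall
- Defining Churn Rate
- Why Modeling Churn is Difficult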
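The two churn-rate definitions above differ only in the denominator. A small sketch with invented numbers:

```python
def churn_rate(lost_users, users_at_start):
    """Users lost during the period divided by users at the start of the period."""
    return lost_users / users_at_start

def adjusted_churn_rate(lost_users, users_at_start, net_new_users):
    """Same numerator, but the base also counts users who joined during the period."""
    return lost_users / (users_at_start + net_new_users)

# Hypothetical month: 1000 users at the start, 250 net new users, 50 lost.
print(churn_rate(50, 1000))                # 0.05
print(adjusted_churn_rate(50, 1000, 250))  # 0.04
```

The adjusted rate is lower whenever the user base grows, because users who joined mid-period could also have churned.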
Explain a machine learning algorithm as if you’re talking to a non-technical person.

请尝试向非技术人员解释一种机器学习算法。

Capital One

How would you build a model to predict credit card fraud?

如何构建一个模型来预测信用卡诈骗？
- Classification problem
- Logistic Regression
- Feature selection
How do you handle missing or bad data?

如何处理丢失或不良数据？
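- Ignore the record
- Replace missing values with a global constant
- Replace missing values with the mean, the median, or the most probable value
- Use robust algorithms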
How would you derive new features from features that already exist?

如何从已存在的特征中导出新的特征？
If you’re attempting to predict a customer’s gender, and you only have 100 data points, what problems could arise?

如果你试图预测客户的性别，但只有 100 个数据点，可能会出现什么问题？
- Fat matrix (more features than samples)
- Overfitting
- Regularization: ridge regression
Suppose you were given two years of transaction history. What features would you use to predict credit risk?

在拥有两年交易历史的情况下，哪些特征可以用来预测信用风险？
Design an AI program for Tic-tac-toe

设计一个用来下井字棋的人工智能程序。

Zillow

Explain overfitting and what steps you can take to prevent it.

请解释过度拟合，以及如何防止过度拟合。
- Explanation:
- Method: regularization
Why does SVM need to maximize the margin between support vectors?

为什么 SVM 需要在支持向量之间最大化边缘？

SVM - Understanding the math - the optimal hyperplane

Hadoop

Twitter

How would you use Map/Reduce to split a very large graph into smaller pieces and parallelize the computation of edges according to the fast/dynamic change of data?

如何使用 Map/Reduce 将非常大的图形分割成更小的块，并根据数据的快速/动态变化并行计算它们的边缘？
(Data Engineer) Given a list of followers in the format: 123, 345; 234, 678; 345, 123; … where column one is the ID of the follower and column two is the ID of the followee, find all mutual following pairs (the pair 123, 345 in the example above). How would you use Map/Reduce to solve the problem when the list does not fit in memory?

（数据工程师）给定一个列表：123, 345234, 678345, 123…其中第一列是粉丝的 ID，第二列是被粉者的 ID。查找所有相互后续对（上面的示例中的对是 123，345）。当列表超出内存时，如何使用 Map / Reduce 来解决问题？
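One possible approach, simulated here in plain Python rather than on a real Hadoop cluster: the mapper emits each edge under a key that is the same for both directions (the sorted ID pair), and the reducer reports a key when it has seen both directions. The function names are illustrative, and the in-memory `shuffle` stands in for the framework's sort-and-group step.

```python
from collections import defaultdict

def map_phase(edges):
    """Emit (sorted pair, original edge) so both directions share one key."""
    for follower, followee in edges:
        key = (min(follower, followee), max(follower, followee))
        yield key, (follower, followee)

def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """A pair is mutual when both directed edges were seen for its key."""
    for key, values in groups.items():
        if len(set(values)) == 2:
            yield key

edges = [(123, 345), (234, 678), (345, 123)]
mutual = sorted(reduce_phase(shuffle(map_phase(edges))))
print(mutual)  # [(123, 345)]
```

Because each reducer only needs the values for its own keys, the full list never has to fit in one machine's memory.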

Capital One

(Data Engineer) What is Hadoop serialization?

（对数据工程师）什么是 Hadoop 序列化（serialization）？

Converting in-memory data into a sequence of bytes so that it can be stored
Explain a simple Map/Reduce problem.

阐述一个简单的 Map / Reduce 问题。
- (Figure: MapReduce example)

Hive

(Data Engineer) Write a Hive UDF that returns a sentiment score. For example, if good = 1, bad = -1, and average = 0, then a review of a restaurant states “Good food, bad service,” your score might be 1 – 1 = 0.

（数据工程师）请编写返回情感分数的 Hive UDF。例如，假如好=1，坏=-1，平均数=0，那么对餐厅做评价时因为「食物好，服务差」，你的分数可能为 1 - 1 = 0

Spark

Capital One

Data Engineer Explain how RDDs work with Scala in Spark

（数据工程师）阐释使用 Scala 语言时RDD 在 Spark 中是如何工作的？

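The relationship between Scala data structures and the RDD data model
- Scala data structures
- The RDD model
- Converting Scala data structures to RDDs
- The RDD is the core of Spark and the foundation of its architecture. Its properties can be summarized as follows:
- It is an immutable data structure
- It is a distributed data structure that spans the cluster
- It can be partitioned by the key of the data records
- It offers coarse-grained operations, all of which work on partitions
- It keeps data in memory, which gives low latency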
Statistics & Probability Questions

Google

Explain Cross-validation as if you’re talking to a non-technical person.

请尝试向非技术人员阐释交叉验证（Cross-validation）。
Describe a non-normal probability distribution and how to apply it.

请描述一下非正态概率分布以及该如何应用？
Non-Normal Distributions in the Real World

Normal Probability Distributions
Normal vs. Non-Normal Distributed Data: Comparing Results

Microsoft

(Data Mining) Explain what heteroskedasticity is and how to solve it

（数据挖掘）请解释异方差（heteroskedasticity）是什么，以及如何解决它。

Twitter

Given Twitter user data, how would you measure engagement?

在给定 Twitter 用户数据的情况下，你该如何衡量参与度？

Uber

What are some different Time Series forecasting techniques?

时间序列预测技术有什么不同？
Explain Principal Component Analysis (PCA) and equations PCA uses.

解释（PCA）及其使用的方程。
How do you solve Multicollinearity?

如何解决多重共线性（Multicollinearity）？
- Ridge regression
- principal component regression
- partial least squares regression
(Analyst) Write an equation that would optimize the ad spend between Twitter and Facebook.

（分析师）请尝试列出优化我们在推特和脸书上的广告费用支出的方程。

Facebook

What’s the probability you’ll draw two cards of the same suit from a single deck?

在一副牌中抽取两张，出现同一花色的概率是多少？

IBM

What are p-values and confidence intervals?

什么是 p-value 和置信区间？

Capital One

(Data Analyst) If you have 70 red marbles, and the ratio of green to red marbles is 2 to 7, how many green marbles are there?

（数据分析师）如果你有 70 个红色弹珠，绿色和红色弹珠的比例是 2 ：7，有多少绿色弹珠？
What would the distribution of daily commutes in New York City look like?

纽约市的通勤数据看起来应该遵从什么分布？
Given a die, would it be more likely to get a single 6 in six rolls, at least two 6s in twelve rolls, or at least one-hundred 6s in six-hundred rolls?

一个骰子，在扔 6 次的情况下出现 1 个 6 的几率，与扔 12 次的情况下出现至少两个 6 的几率，和扔 600 次出现至少 100 次 6 的几率相比哪个大？
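The three dice probabilities can be computed exactly from the binomial distribution. This sketch reads "a single 6 in six rolls" as "at least one 6", which is an interpretation of the question rather than something it states; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k, n, p=Fraction(1, 6)):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly via the complement."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below

p_six = float(prob_at_least(1, 6))             # at least one 6 in 6 rolls
p_twelve = float(prob_at_least(2, 12))         # at least two 6s in 12 rolls
p_six_hundred = float(prob_at_least(100, 600)) # at least 100 6s in 600 rolls

print(round(p_six, 4), round(p_twelve, 4), round(p_six_hundred, 4))
```

Under this reading the first event is the most likely and the probabilities decrease as the number of rolls grows, even though the expected share of sixes is the same in all three cases.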

PayPal

What’s the Central Limit Theorem, and how do you prove it? What are its applications?

什么是中心极限定理（Central Limit Theorem），如何证明它？它的应用方向是什么？
- Central limit theorem

Programming & Algorithms

Google

(Data Analyst) Write a program that can determine the height of an arbitrary binary tree

（数据分析师）请写一个程序可以判定二叉树的高度。

Write a Program to Find the Maximum Depth or Height of a Tree

     # -*- coding: utf-8 -*-

     class Node:
         """A binary tree node."""
         def __init__(self, data):
             self.data = data
             self.left = None
             self.right = None

     def maxDepth(node):
         """Return the number of nodes on the longest root-to-leaf path."""
         if node is None:
             return 0
         # The height is one more than the taller of the two subtrees.
         return 1 + max(maxDepth(node.left), maxDepth(node.right))

     def main():
         # Driver program to test the function above
         root = Node(1)
         root.left = Node(2)
         root.right = Node(3)
         root.left.left = Node(4)
         root.left.right = Node(5)
         length = maxDepth(root)
         print("Height of tree is %d" % length)

     if __name__ == '__main__':
         main()

Microsoft

Create a function that checks if a word is a palindrome.

请创建一个函数检查一个词是否具有回文结构。

     def is_palindrome(w):
         # A word is a palindrome if it equals its own reverse.
         return w == w[::-1]

     def main():
         result = is_palindrome("abcba")
         print(result)

     if __name__ == '__main__':
         main()

Twitter

Build a power set.

请构建一个幂集（power set）。

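The function below is a power function (x raised to y), which answers the later "pow() function" question rather than building a power set:

     def power(x, y):
         """Return x ** y for a non-negative integer y by repeated multiplication."""
         if y < 0:
             raise ValueError("y must be a non-negative integer")
         result = 1
         for _ in range(y):
             result *= x
         return result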
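For the power set question itself (the set of all subsets of a collection), one common iterative construction is sketched below: start from the empty subset and, for each item, add a copy of every existing subset extended with that item. The function name is illustrative.

```python
def power_set(items):
    """Return all 2**n subsets of items as lists, starting with the empty subset."""
    subsets = [[]]
    for item in items:
        # Each existing subset appears again with the new item appended.
        subsets += [subset + [item] for subset in subsets]
    return subsets

print(power_set([1, 2, 3]))
# [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]
```

The output size is 2^n, so time and space are both exponential in the number of items.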

How do you find the median of a very large dataset?

请问如何在一个巨大的数据集中找到中值？

Top-K selection problem: O(n + k log k)
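Treating the median as a selection problem, a quickselect sketch finds the k-th smallest element in expected linear time without sorting everything. This version assumes the data fits in memory; for data that does not, the same idea is usually applied to histogram buckets or to partitions on disk. The function names are illustrative.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        lows = [v for v in values if v < pivot]
        highs = [v for v in values if v > pivot]
        n_equal = len(values) - len(lows) - len(highs)
        if k < len(lows):
            values = lows
        elif k < len(lows) + n_equal:
            return pivot
        else:
            k -= len(lows) + n_equal
            values = highs

def median(values):
    """Median via selection; averages the two middle elements for even n."""
    n = len(values)
    if n % 2:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2

print(median([5, 1, 9, 3, 7]))  # 5
print(median([4, 1, 3, 2]))     # 2.5
```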

Uber

(Data Engineer) Code a function that calculates the square root (2-point precision) of a given number. Follow up: Avoid redundant calculations by now optimizing your function with a caching mechanism.

（数据工程师）编写一个函数用来计算给定数字的平方根（精确到百分位）。随后：避免冗余计算，现在使用缓存机制优化你的功能。

Facebook

Suppose you’re given two binary strings, write a function adds them together without using any builtin string-to-int conversion or parsing tools. For example, if you give your function binary strings 100 and 111, it should return 1011. What’s the space and time complexity of your solution?

假设给定两个二进制字符串，写一个函数将它们添加在一起，而不使用任何内置的字符串到 int 转换或解析工具。例如：如果给函数二进制字符串 100 和 111，它应该返回 1011。你的解决方案的空间和时间复杂性如何？
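One way to answer this is schoolbook addition from the rightmost digit with a carry, comparing characters directly so that no string-to-int conversion is used. A minimal sketch, with an illustrative function name:

```python
def add_binary(a, b):
    """Add two binary strings digit by digit without converting them to int."""
    i, j = len(a) - 1, len(b) - 1
    carry = 0
    digits = []
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += 1 if a[i] == "1" else 0
            i -= 1
        if j >= 0:
            total += 1 if b[j] == "1" else 0
            j -= 1
        digits.append("1" if total % 2 else "0")
        carry = total // 2
    return "".join(reversed(digits))

print(add_binary("100", "111"))  # 1011
```

Both time and space are O(max(len(a), len(b))), since each digit is visited once and the result has at most one more digit than the longer input.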
Write a function that accepts two already sorted lists and returns their union in a sorted list.

编写一个函数，它接受两个已排序的列表，并在排序列表中返回它们的并集。

(Data Engineer) Write some code that will determine if brackets in a string are balanced

（数据工程师）请编写一些代码来确定字符串中的左右括号是否是平衡的？
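A standard stack-based sketch: push every opening bracket, and on each closing bracket check that it matches the most recent unmatched opener. The set of bracket types handled here is an assumption, since the question does not list them.

```python
def is_balanced(text):
    """Return True if (), [] and {} in text are properly nested and closed."""
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(is_balanced("(a[b]{c})"))  # True
print(is_balanced("(]"))         # False
```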
How do you find the second largest element in a Binary Search Tree?

如何找到二叉搜索树中第二大的元素？
Write a function that takes two sorted vectors and returns a single sorted vector.

请编写一个函数，它接受两个排序的向量，并返回一个排序的向量。
If you have an incoming stream of numbers, how would you find the most frequent numbers on-the-fly?

如果你有一个输入的数字流，如何在运行过程中找到最频繁出现的数字？
Write a function that raises one number to another number, i.e. the pow() function.

编写一个函数，将一个数字增加到另一个数字，就像 pow（）函数一样。
Split a large string into valid words and store them in a dictionary. If the string cannot be split, return false. What’s your solution’s complexity?

将大字符串拆分成有效字段并将它们存储在 dictionary 中。如果字符串不能拆分，返回 false。你的解决方案的复杂性如何？

Salesforce

What’s the computational complexity of finding a document’s most frequently used words?

查找文档最常用的词的计算复杂性是什么？

O(n)
If you’re given 10 TBs of unstructured customer data, how would you go about finding and extracting valuable information from it?

如果给你10 TBs的非结构化客户数据,你会如何发现提取有价值的信息呢?

Capital One

Data Engineer How would you ‘disjoin’ two arrays (like JOIN for SQL, but the opposite)?

（对数据工程师）如何「拆散」两个数列（就像 SQL 中的 JOIN 反过来）？
Create a function that does addition where the numbers are represented as two linked lists.

请创建一个用于添加的函数，数字表示为两个链表。
Create a function that calculates matrix sums.

请创建一个计算矩阵的函数。
How would you use Python to read a very large tab-delimited file of numbers to count the frequency of each number?

如何使用 Python 读取一个非常大的制表符分隔的数字文件，来计算每个数字出现的频率？

PayPal

Write a function that takes a sentence and prints out the same sentence with each word backwards in O(n) time.

请编写一个函数，让它能在 O（n）的时间内取一个句子并逆向打印出来。
Write a function that takes an array, splits the array into every possible set of two arrays, and prints out the max differences between the two array’s minima in O(n) time.

请编写一个函数，从一个数组中拾取，将它们分成两个可能的数组，然后打印两个数组之间的最大差值（在 O(n) 时间内）。
Write a program that does merge sort.

请编写一个执行合并排序的程序。

SQL Questions

Microsoft

Data Analyst Define and explain the differences between clustered and non-clustered indexes.

（数据分析师）定义和解释聚簇索引和非聚簇索引之间的差异。
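- Clustered and Nonclustered Indexes Described
- A clustered index sorts and stores the data rows of a table or view by their key values. The index definition contains the clustered index columns. Each table can have only one clustered index, because the data rows themselves can be sorted in only one order.
- A non-clustered index has a structure separate from the data rows. It contains the non-clustered index key values, and each key value entry has a pointer to the data row that contains that key value.
- Example: partition by day, and use a non-clustered index on the primary key within each partition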
Data Analyst What are the different ways to return the rowcount of a table?

（数据分析师）返回表的行计数有哪些不同的方法？

count(*)
SQL Server: right-click the table, then Properties > Storage > Row count
SQL Server–HOW-TO: quickly retrieve accurate row count for table

Facebook

Data Engineer If you’re given a raw data table, how would you perform ETL (Extract, Transform, Load) with SQL to obtain the data in a desired format?

（数据工程师）如果给定一个原始数据表，如何使用 SQL 执行 ETL（提取，转换，加载）以获取所需格式的数据？
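- Understand the business process
- Understand the data sources and the data granularity
- Decide on the analysis dimensions
- Develop the metrics
- A concrete example would probably help here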
How would you write a SQL query to compute a frequency table of a certain attribute involving two joins? What changes would you need to make if you want to ORDER BY or GROUP BY some attribute? What would you do to account for NULLS?

如何编写 SQL 查询来计算涉及两个连接的某个属性的频率表？如果你想要 ORDER BY 或 GROUP BY 一些属性，你需要做什么变化？你该怎么解释 NULL？
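- The question is not clearly stated; it is probably about writing multi-table join queries. In "a JOIN b", make sure the join column of b is a primary key.
- Facebook's questions are rather abstract.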

(Data Engineer) How would you improve ETL (Extract, Transform, Load) throughput?

（数据工程师）如何改进 ETL（提取，转换，加载）的吞吐量？

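Depending on the business, consider the following:
- Monitoring: collect process metadata, record the run time of each piece of code, and optimize accordingly
- Extraction: data profiling, change data capture, incremental extraction based on business needs
- Data staging area: use sequential or flat files
- Compress data for transfer
- Partitioning and parallelization: partitioned tables, parallel processing (e.g. Sqoop in the Hadoop ecosystem)
- 7 Tips to Improve ETL Performance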

Brain Teasers & Word Problems

Google

Suppose you have ten bags of marbles with ten marbles in each bag. If one bag weighs differently than the other bags, and you could only perform a single weighing, how would you figure out which one is different?

假设你有 10 包弹球，每包里面都是 10 个弹球。如果其中一包的重量和其他的不同，但你只能进行一次称重，你该用什么办法？

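Number of weighings needed with a balance scale: log(20)/log(3), i.e. at least 3

How to weigh: with a single weighing on a digital scale, the usual trick is to take 1 marble from the first bag, 2 from the second, and so on up to 10 from the tenth, and weigh them together. Assuming the normal marble weight and the per-marble difference are known, the deviation from the expected total identifies the odd bag.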

Facebook

You are about to hop on a plane to Seattle and want to know if you should carry an umbrella. You call three friends of yours that live in Seattle and ask each, independently, if it’s raining. Each of your friends will tell you the truth 2/3 of the time and mess with you by lying 1/3 of the time. If all three friends answer “Yes, it’s raining,” what is the probability that it is actually raining in Seattle?

你打算坐飞机去西雅图，想知道是不是需要带伞，于是你分别打电话给三位在西雅图的朋友。每个朋友都有 2/3 的几率说真话，1/3 的几率在骗你。如果他们都说「会下雨」，西雅图下雨的概率是多少？

Assume: P(rain=1) = P(rain=0) = 1/2

P(X=1 | rain=1) = 2/3

P(X=0 | rain=1) = 1/3

P(rain=1 | X=1, Y=1, Z=1) = P(rain=1) * P(X=1, Y=1, Z=1 | rain=1) / ( P(rain=1) * P(X=1, Y=1, Z=1 | rain=1) + P(rain=0) * P(X=1, Y=1, Z=1 | rain=0) )

= (2/3 * 2/3 * 2/3) / (2/3 * 2/3 * 2/3 + 1/3 * 1/3 * 1/3) = 8/9

The equal priors of 1/2 cancel between the numerator and the denominator.
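The same Bayes computation with exact fractions, which also makes it easy to try a different prior. The function name and parameters are illustrative; the 1/2 prior is the assumption stated in the answer above, not part of the question.

```python
from fractions import Fraction

def posterior_rain(prior, p_truth, n_yes):
    """P(rain | n_yes independent friends all say 'yes')."""
    likelihood_rain = p_truth ** n_yes       # every friend told the truth
    likelihood_dry = (1 - p_truth) ** n_yes  # every friend lied
    numerator = prior * likelihood_rain
    return numerator / (numerator + (1 - prior) * likelihood_dry)

print(posterior_rain(Fraction(1, 2), Fraction(2, 3), 3))  # 8/9
print(posterior_rain(Fraction(1, 4), Fraction(2, 3), 3))  # 8/11
```

With a prior of 1/4 the posterior drops to 8/11, which shows how much the answer depends on the assumed base rate of rain.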

Uber

Imagine you are working with a hospital. Patients arrive at the hospital in a Poisson Distribution, and the doctors attend to the patients in a Uniform Distribution. Write a function or code block that outputs the patient’s average wait time and total number of patients that are attended to by doctors on a random day.
想象一下你在一家医院工作。患者来就诊的频率符合泊松分布，而医生照顾患者的频率符合均匀分布。请写一个函数或一段代码来输出患者的平均等待时间和医生在某日的参与度。
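- Arrivals follow a Poisson distribution and service times follow a uniform distribution; what is the scheduling policy?
- Derive expressions for the average waiting time and for the number of patients being served at a given moment
- As a math problem: see the M/M/1 queue; strictly, Poisson arrivals with uniform service times and a single doctor give an M/G/1 queue
- As a simulation: (1) implement the Poisson arrivals in code, (2) implement the uniform service times, (3) set a scheduling policy and compute the metrics
- Meaning of the Poisson distribution: it describes the number of occurrences of rare events
- What is a Poisson distribution?
- Queueing theory basics:
- Arrivals: the arrival rate (number of people per unit time) and the inter-arrival time between successive arrivals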
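The simulation route outlined above can be sketched as follows. Several things here are assumptions not given in the question: a single doctor, first-come-first-served scheduling, an 8-hour day, every patient who arrives before closing being seen, and the specific rate and service-time bounds. Poisson arrivals are generated through exponential inter-arrival times.

```python
import random

def simulate_day(arrival_rate, service_min, service_max, hours=8.0, seed=None):
    """Simulate one day; return (average wait in hours, patients seen)."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current patient
    doctor_free_at = 0.0  # time at which the doctor finishes the previous patient
    waits = []
    while True:
        # A Poisson process has exponentially distributed inter-arrival times.
        clock += rng.expovariate(arrival_rate)
        if clock > hours:
            break
        start = max(clock, doctor_free_at)
        waits.append(start - clock)
        # Service time is uniform between the two bounds.
        doctor_free_at = start + rng.uniform(service_min, service_max)
    average_wait = sum(waits) / len(waits) if waits else 0.0
    return average_wait, len(waits)

# Hypothetical parameters: 4 arrivals per hour, 6 to 18 minutes per patient.
average_wait, patients_seen = simulate_day(4.0, 0.1, 0.3, seed=0)
print(average_wait, patients_seen)
```

Running the function over many seeds and averaging the results gives a steadier estimate than a single simulated day.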

Facebook

Imagine there are three ants in each corner of an equilateral triangle, and each ant randomly picks a direction and starts traversing the edge of the triangle. What’s the probability that none of the ants collide? What about if there are N ants sitting in N corners of an equilateral polygon?

假如在一个等边三角形的三个角上都有一只蚂蚁，每只随机选择方向然后直走一直到另一个边缘，三只蚂蚁互相不交汇的几率是多少？如果有 n 只蚂蚁在 n 角形中，概率又是多少？

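For three ants: 2 * (1/2)^3 = 1/4, since the ants avoid a collision only if all of them walk clockwise or all of them walk counterclockwise.

For N ants: 2 * (1/2)^N = 1/2^(N-1)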

Three Ants on The Corners of a Triangle
How many trailing zeros are in 100 factorial (i.e. 100!)?

在 100! 的结果里有多少个零？

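Each trailing zero comes from a factor 10 = 2 * 5, so the count is the smaller of the exponents of 2 and 5 in 100!.

Factors of 5: [100/5] + [100/25] = 20 + 4 = 24. In general, for n! the number of trailing zeros is n/5 + n/5² + n/5³ + and so on (integer division).

Factors of 2 (from the powers $2, 2^2, 2^3, 2^4, 2^5, 2^6$): [100/2]+[100/4]+[100/8]+[100/16]+[100/32]+[100/64] = 50+25+12+6+3+1 = 97

Since 24 < 97, the number of trailing zeros is 24.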
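The factor-of-5 count can be checked against a brute-force count on the actual factorial. The function names are illustrative.

```python
import math

def trailing_zeros_factorial(n):
    """Count the factors of 5 in n!: n//5 + n//25 + n//125 + ..."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count

def count_trailing_zeros(number):
    """Brute force: count the zeros at the end of the decimal representation."""
    text = str(number)
    return len(text) - len(text.rstrip("0"))

print(trailing_zeros_factorial(100))               # 24
print(count_trailing_zeros(math.factorial(100)))   # 24
```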

Imagine you’re climbing a staircase that contains n stairs, and you can take any number k steps. How many distinct ways can you reach the top of the staircase? (This is a modification of the original stair step problem)

你正在攀爬一个 n 阶的楼梯，你可以采取任何数量的 k 个步骤。你到达楼梯顶部有多少不同的方式？（这是楼梯问题的修改版）

Reading the question as "each move climbs between 1 and k stairs":

$F(n) = \sum_{i=1}^{k} F(n-i)$

with $n - i \ge 0$ and $F(0) = 1$.
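A dynamic-programming sketch of the staircase count, under the interpretation that each move climbs between 1 and k stairs (the question's wording is ambiguous, so this reading is an assumption). The function name is illustrative.

```python
def count_ways(n, k):
    """Number of ordered ways to climb n stairs taking 1..k stairs per move."""
    ways = [0] * (n + 1)
    ways[0] = 1  # one way to stand at the bottom: take no steps
    for stair in range(1, n + 1):
        # The last move covered i stairs, for every allowed i that fits.
        ways[stair] = sum(ways[stair - i] for i in range(1, min(k, stair) + 1))
    return ways[n]

print(count_ways(4, 2))  # 5, the Fibonacci case
print(count_ways(4, 4))  # 8, i.e. 2**(n-1) when any step size is allowed
```

The table takes O(n * k) time and O(n) space. When k is at least n, the count is 2^(n-1), because each of the n-1 intermediate stairs is either stepped on or skipped.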
