Pandas: What I’ve Learned After My 1st On-site Technical Interview


李_涛


Original article: https://towardsdatascience.com/pandas-what-ive-learned-after-my-1st-on-site-technical-interview-4fb94dbc1b45


Technical Interviews

Preface

I know a lot of people get nervous in interviews. I am no exception: in 8 out of 10 interviews, I get ants in my pants. That means forgetting the usual tricks I have up my sleeve or blanking on key knowledge I should have known. Recently, I did an on-site technical interview for a Business Intelligence internship role, and guess what? It was all Pandas! Though the interview did not go as smoothly as I expected, I wanted to turn the experience into a reflection on what I think is crucial in Pandas, particularly text data manipulation.

Overview

This article is about “Working with Text Data in Pandas,” mainly the methods under pandas.Series.str. Most of the problem settings are similar to what I experienced during the interview. As you go through the article, you will get the chance to review what you have learned, and a friendly reminder if you have a technical interview coming up. If you are new to Pandas, I hope you can add the tricks illustrated here to your data science toolkit. Let’s get started and clean our way towards more structured data.

Figure: Pandas in the Taipei Zoo. Source: the author.

FYI:

The article is organized into small problem sets, each followed by a possible solution (or what I used to tackle the problem).
The data set is made-up sales data from “The Fruit” company, meant to illustrate scenarios you may face in interviews, jobs, or data projects.

The data is preloaded and saved in df. Don’t forget to:

import pandas as pd

Without further ado, let’s get started!


To start off, the interviewer hands you a dataset from The Fruit’s sales division. You are asked to tackle the problems in each column using Python. What do you find?

Sales_ID: Weird Exclamation Mark and Redundant Spaces

If we take a look at the first row of the Sales_ID column, we find that the ID is followed by unnecessary spaces and an exclamation mark. That is a problem, since we don’t want those redundant characters in the Sales_ID column.

df['Sales_ID'][0]
Out: 'AU_1382578        !'

What we can do is use the pandas.Series.str.strip method. strip() trims leading and trailing characters from each string in the column (whitespace by default, or any of the characters you pass in). There are also one-sided variants: rstrip() strips from the right and lstrip() strips from the left.

strip() is very useful when you have unwanted characters and symbols on either side of the data. Here we first right-strip the “!” and then strip the spaces on both sides.

df['Sales_ID'] = df['Sales_ID'].str.rstrip('!').str.strip()
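To see the difference between the strip variants, here is a minimal sketch on a few made-up IDs. The sample values and the names ids, cleaned, and cleaned_one_call are assumptions for illustration and are not part of the interview data.

```python
import pandas as pd

# Hypothetical sample IDs that mimic the messy values described above.
ids = pd.Series(["AU_1382578        !", "  UK_2047713!", "US_5501234  "])

# rstrip("!") removes trailing exclamation marks only;
# strip() with no argument then removes whitespace on both sides.
cleaned = ids.str.rstrip("!").str.strip()

# Passing a set of characters strips any of them from both ends in one call.
cleaned_one_call = ids.str.strip(" !")

print(cleaned.tolist())
```

Both approaches give the same result on this sample. The one-call version would also remove a leading “!”, which may or may not be what you want.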

Congrats on cleaning the Sales_ID column!


Sales_Branch: Missing? A Little Feature Engineering

It seems the Sales_Branch column is entirely missing! The interviewer asks you to fill it in using hints from the Sales_ID column. After a quick check, you find that sales are assigned to branches of The Fruit based on the first 2 characters of the Sales_ID.

We can use the pandas.Series.str.slice method to slice out the key part of each string. For instance: ‘AU_1382578’ in Sales_ID → ‘AU’ in Sales_Branch.

slice() takes 3 parameters:

start: where to start (defaults to the beginning of the string)
stop: where to stop (exclusive)
step: the step size (defaults to 1)

df['Sales_Branch'] = df['Sales_ID'].str.slice(stop=2)
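As a small sketch of how the slice parameters behave, the snippet below runs on made-up IDs. The sample values and variable names are assumptions for illustration. The last line shows an alternative that would still work if the branch prefix were not always 2 characters long.

```python
import pandas as pd

# Hypothetical, already-cleaned IDs.
sales_id = pd.Series(["AU_1382578", "UK_2047713", "TW_0093321"])

# stop=2 keeps characters at positions 0 and 1 (stop is exclusive).
branch_by_slice = sales_id.str.slice(stop=2)

# Plain indexing on the .str accessor is equivalent.
branch_by_index = sales_id.str[:2]

# start=3 skips the prefix and the underscore, leaving the numeric part.
number_part = sales_id.str.slice(start=3)

# Alternative: split on the underscore and take the first piece.
branch_by_split = sales_id.str.split("_").str[0]

print(branch_by_slice.tolist())
```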

Congrats on feature engineering the Sales_Branch column!


Product_Description: Long Descriptions Containing Which Products Are Sold

Often, a company’s system produces a transaction log full of long sentences. Those sentences nevertheless contain the valuable information your interviewer is asking for.

df['Product_Description'][0]
Out: 'The Fruit, a company that sells fruits from all over the world, has branches in Australia, the United Kingdom, the United States, and Taiwan. Product: Apple Mango Banana Watermelon Orange Blueberry Banana Watermelon Kiwifruit'

We can see that the first sentence is identical in every row, so we can split the column into 2 columns: Description and Product. This is where the pandas.Series.str.split method comes into play.

split() has 3 main parameters:

pat: what to split on, defaulting to whitespace
n: the maximum number of splits (all splits by default)
expand: when expand=True, the pieces are put into separate columns

df['Product_Description'].str.split(' Product: ', expand=True)

We can assign columns 0 and 1 to Description and Product in 2 ways:

df['Description'] = df['Product_Description'].str.split(' Product: ', expand=True)[0]
df['Product'] = df['Product_Description'].str.split(' Product: ', expand=True)[1]

or

df[['Description', 'Product']] = df['Product_Description'].str.split(' Product: ', expand=True)
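Here is a minimal sketch of how pat, n, and expand interact, using a shortened, made-up log line instead of the real Product_Description values. The names log, parts, and as_lists are assumptions for illustration.

```python
import pandas as pd

# Hypothetical, shortened log lines with the same "Product: " marker.
log = pd.Series([
    "The Fruit sells fruit worldwide. Product: Apple Mango Banana",
    "The Fruit sells fruit worldwide. Product: Kiwifruit Orange",
])

# n=1 guarantees at most one split, so exactly two columns come back;
# expand=True returns a DataFrame instead of a Series of lists.
parts = log.str.split(" Product: ", n=1, expand=True)
parts.columns = ["Description", "Product"]

# Without expand, each row holds a Python list of the pieces.
as_lists = log.str.split(" Product: ", n=1)

print(parts)
```

Setting n=1 is a defensive choice: if a description ever contained the marker twice, the two-column assignment would still work.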

Now the interviewer gets picky and adds a requirement: sort the list of products in the Product column.

No need to feel anxious! This can be done by first splitting the values in the Product column with split() and then applying Python’s built-in sorted function.

df['Product'] = df['Product'].str.split().apply(sorted)
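The split-then-sort step can be sketched on a couple of made-up product strings. The sample values and names are assumptions for illustration. The last line shows how to turn the sorted lists back into plain strings if a list-valued column is not wanted.

```python
import pandas as pd

# Hypothetical whitespace-separated product strings.
products = pd.Series([
    "Apple Mango Banana Watermelon Orange",
    "Kiwifruit Blueberry Apple",
])

# split() with no pattern splits on whitespace, giving one list per row;
# apply(sorted) sorts each list alphabetically.
sorted_lists = products.str.split().apply(sorted)

# str.join also works on lists, turning each sorted list back into a string.
sorted_text = sorted_lists.str.join(" ")

print(sorted_lists.tolist())
```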

Congrats on getting information out of the Product_Description column!


Product_Count: The Number of Fruits for Each Salesperson

The interviewer wants to get a sense of the variety of fruits their salespeople are selling. It’s good to know that the pandas.Series.str.len method also works on lists, so it returns the length of each list in the Product column. Note that this counts every entry in the list, repeated fruits included.

df['Product_Count'] = df['Product'].str.len()

Let’s find out who sells the widest variety of fruits in The Fruit!

df[df['Product_Count'] == df['Product_Count'].max()]

It seems like the Australian representative who sells 10 different fruits is the winner!
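A small sketch of the counting step on a made-up frame follows. The frame df_demo, its values, and the Distinct_Count column are assumptions for illustration. Because str.len() counts repeated entries too, the sketch also counts distinct fruits with a set, which may be closer to “variety” when a list contains duplicates.

```python
import pandas as pd

# Hypothetical frame with a list-valued Product column.
df_demo = pd.DataFrame({
    "Sales_Branch": ["AU", "UK", "TW"],
    "Product": [
        ["Apple", "Banana", "Banana", "Mango"],
        ["Kiwifruit", "Orange"],
        ["Apple", "Blueberry", "Orange"],
    ],
})

# Length of each list, duplicates included.
df_demo["Product_Count"] = df_demo["Product"].str.len()

# Number of distinct fruits per row.
df_demo["Distinct_Count"] = df_demo["Product"].apply(lambda items: len(set(items)))

# Rows whose count equals the column maximum (ties are all kept).
top = df_demo[df_demo["Product_Count"] == df_demo["Product_Count"].max()]

print(top)
```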

Congrats on counting the variety of fruits in the Product_Count column!


Product: More Than Just Lists of Products

So after getting Product out of Product_Description, there is another challenge in your way! The interviewer asks whether you can split up the Product column in 2 ways:

Split the column into Product_1, Product_2, …
Do one-hot encoding on each product

1. Split this column into Product_n

We can easily cope with this challenge by applying pd.Series to each row, which turns the lists of products into a pandas.core.frame.DataFrame with one column per list position.

Products = df['Product'].apply(pd.Series)
Products

Then we finish our work by renaming the columns!

# Number the columns from 1 so they read Product_1, Product_2, …
Products = Products.rename(columns=lambda x: 'Product_' + str(x + 1))
Products
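As an alternative sketch, the same wide layout can be built by passing the lists straight to the pd.DataFrame constructor. That is a swapped-in technique rather than the apply(pd.Series) approach above. The sample lists and the name wide are assumptions for illustration.

```python
import pandas as pd

# Hypothetical list-valued column with rows of different lengths.
products = pd.Series([
    ["Apple", "Banana", "Mango"],
    ["Kiwifruit"],
    ["Apple", "Orange"],
])

# One column per list position; shorter rows are padded with missing values.
wide = pd.DataFrame(products.tolist(), index=products.index)

# Name the columns Product_1, Product_2, … (numbering from 1).
wide.columns = [f"Product_{i + 1}" for i in range(wide.shape[1])]

print(wide)
```

Keeping the original index matters if the result is later placed next to the source frame, since column-wise concatenation aligns rows by index.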

2. Do one-hot encoding on each product

For delimiter-separated strings, Pandas offers one-hot encoding through pandas.Series.str.get_dummies, which gives each product a column of its own.

get_dummies() takes one main parameter:

sep: what to split on, with a default of ‘|’

We are using the original string form from the split, because the products there are separated by whitespace. (Feel free to try the list form with a loop!)

df['Product_Description'].str.split(': ', expand=True)[1]

By applying get_dummies(), we get a 10 rows × 27 columns data frame, with one column for each fruit (that is huge!).

Products2 = df['Product_Description'].str.split(': ', expand=True)[1].str.get_dummies(' ')
Products2
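To make the encoding concrete, here is a minimal sketch on made-up product strings. The sample values and names are assumptions for illustration. It also shows one loop-free way to handle the list form: join each list with ‘|’ (the default separator) and then call get_dummies().

```python
import pandas as pd

# Hypothetical whitespace-separated product strings; note the repeated Banana.
products = pd.Series(["Apple Banana Banana", "Kiwifruit Apple", "Orange"])

# One indicator column per distinct token; repeats within a row still give 1.
dummies = products.str.get_dummies(sep=" ")

# List form: join each list with "|" and rely on the default separator.
lists = products.str.split()
from_lists = lists.str.join("|").str.get_dummies()

print(dummies)
```

The indicator columns record presence only, so any information about how many times a fruit appears in a row is lost.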

After you ace the 2 requirements, the interviewer finally decides to pick option 1 as the final version. We can concatenate Products (option 1) with the original data frame df column-wise, resulting in a shape of 10 rows × 17 columns.

df = pd.concat([df, Products], axis=1)
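One thing worth knowing about pd.concat with axis=1 is that it aligns rows by index label, not by position. The sketch below uses two tiny made-up frames (left, right, and shifted are assumed names) to show the normal case and what happens when the indexes do not match.

```python
import pandas as pd

# Hypothetical frames that share the default index 0, 1.
left = pd.DataFrame({"Sales_ID": ["AU_1382578", "UK_2047713"]})
right = pd.DataFrame({"Product_1": ["Apple", "Kiwifruit"], "Product_2": ["Banana", None]})

# Matching indexes: columns are simply placed side by side.
combined = pd.concat([left, right], axis=1)

# Mismatched indexes: the result contains the union of the labels,
# with missing values wherever one frame has no row for a label.
shifted = right.copy()
shifted.index = [1, 2]
misaligned = pd.concat([left, shifted], axis=1)

print(combined.shape, misaligned.shape)
```

If the row count grows unexpectedly after a concat, a mismatched index (for example after filtering or sorting one of the frames) is a likely cause.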

FYI:

The scikit-learn library also provides a “MultiLabelBinarizer” that serves a similar purpose, but that is another story.

Congrats on separating each fruit from the Product column!


Recent_Sales_Date: Just Year-Month-Day

We are almost done with the interview; one last task concerns date-time handling. The interviewer asks you to turn the Recent_Sales_Date column into a “year-month-day” format, so that differing time-of-day values do not get in the way of later operations that select by date.

First, we cut out the date part we want, which can be done in 2 ways:

Dates = df['Recent_Sales_Date'].str[:10]

or

Dates = df['Recent_Sales_Date'].str.slice(stop=10)

Second, simply wrap it inside pd.to_datetime().


df['Recent_Sales_Date'] = pd.to_datetime(Dates)
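Here is a minimal sketch of the two steps, assuming the raw column holds strings such as '2020-08-01 14:23:05'. The actual timestamp format is not shown in the article, so the sample values are an assumption. The last lines show an alternative that parses the full timestamp first and then drops the time of day.

```python
import pandas as pd

# Hypothetical raw timestamps; the real column format is assumed, not known.
raw = pd.Series(["2020-08-01 14:23:05", "2020-07-15 09:00:00"])

# Keep the first 10 characters (YYYY-MM-DD), then parse them as dates.
dates = pd.to_datetime(raw.str[:10], format="%Y-%m-%d")

# Alternative: parse everything, then reset the time of day to midnight.
normalized = pd.to_datetime(raw).dt.normalize()

print(dates.dt.strftime("%Y-%m-%d").tolist())
```

Passing an explicit format makes the parse stricter, so a malformed value raises an error rather than being interpreted in an unexpected way.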

Congrats on reformatting the dates in the Recent_Sales_Date column!

Epilogue

The interview process is officially over! Congrats! I hope this article gives you a taste of what I experienced in my 1st technical interview. Though I did not do that well on this one, it won’t be the last one either! Now we are more confident in text data manipulation in Pandas. Thank you all for joining the ride!

As a short review, the strip(), slice(), and split() methods are all useful tools when dealing with text data: you can strip off irrelevant characters on either side, slice out the essential parts, and divide the data according to a splitting criterion.

Here is the GitHub repo with all the code from the article!

I love to learn about data and reflect on (write about) what I’ve learned in practical applications. You can contact me via LinkedIn and Twitter if you have more to discuss. Also, feel free to follow me on Medium for more data science articles to come!
