These days, ideas for using data science to gain some extra benefit keep popping up. For instance, Google can use your history of watched videos to recommend new ones. Online shops use recommendation systems to increase the size of your order. However… if companies use data for their own benefit, could we do the same for our own needs, such as finding an online English teacher?

Disclaimer

This approach is based on my own experience and may not suit your point of view, ideas, or principles.


Introduction

I am keen on learning English (my native language is Russian), and I usually use several different resources for it. One of them is… useful enough, except for one small downside: it has a rather limited toolset for finding a teacher. At the same time, a vast amount of data is hidden behind the scenes. The main point is that we can see only the tip of the iceberg and do not know what information is waiting for us underneath.

Are you ready to find a teacher who fits your expectations? If so, follow me and I will show you how deep the rabbit hole goes.

Step 1. Collect your expectations.

Okay, we are ready to start our journey. But… first things first. What are we looking for? If we know what our goal is, we can measure the success of our actions. I think you have your own criteria, and so do I. Let me share mine as an example.



  • A price of no more than $20 per hour

  • Teachers able to help me prepare for the Cambridge exam (FCE)

  • They have real experience

  • Agree to give me homework and check it

  • No more than 3 candidates


The first requirement (a price) looks like the easiest one. But in a nutshell, it is harder than it seems. Let me explain.

If you have a maximum amount of money that you are ready to pay for a lesson, you might be surprised to find that the advertised price is not for the kind of class you are looking for. I tried to find teachers who are ready to provide an "exam-oriented" lesson for $20. However, I received a set of results where $20 was the price for "basic English", while preparation for the FCE exam cost more. It is inconvenient, but we will be able to cope with it soon. For now, just keep it in mind.

The second (an exam) is my main goal. I suppose it does not require any clarification.

The third (real experience) is more complicated than the first one. Sometimes people present themselves as professionals after finishing some courses. In my opinion, a certificate alone is quite a weak argument: people may have finished a course and still have no relevant skills. So I would rather consider people with real teaching experience than people who only have evidence of completed courses.

The fourth (homework) also looks logical, at least to me. Learning theory, pleasant chatting during lessons… well, all of that is good. But if you want to learn how to swim, you need to swim. To paraphrase: "If you want to pass the exam, you need to try to pass it, at least on sample papers." Yes, practice makes perfect. And you want someone (like a sports coach) to help you and check how you are doing. You need feedback on your writing, speaking, listening, and reading tasks. And for that, you need homework.

The last one (3 candidates max.): the website gives you the opportunity to book three trial lessons at a reduced price. I would like to use these attempts as efficiently as possible.

Step 2. A rough filter.

We have some data received from the website. It needs almost no cleaning, apart from removing some useless information.

After that, our dataset could look like this:


At this stage we will consider the information in the column pro_course_detail. It is a repository of information about teachers and the courses they provide.

It is time to get our hands dirty.

First, we will find the teachers related to the main goal: the FCE exam.


Second, we will separate them by the price of a lesson. (Do you remember the situation where the price could differ from your expectations? We are going to resolve this problem.)


Okay, we have our initial criteria. Time to code them.
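The article does not show the layout of pro_course_detail, so what follows is only a minimal sketch under stated assumptions: each cell is taken to be a list of course records with hypothetical "title" and "price" keys, and the teacher_id and country columns are invented for the toy data. It checks the price of the FCE course itself, which is how the "basic English for $20" trap can be avoided.

```python
import pandas as pd

# Toy data. The list-of-dicts shape and the "title"/"price" keys are
# assumptions; the real structure of pro_course_detail may differ.
teachers = pd.DataFrame({
    "teacher_id": [1, 2, 3],
    "country": ["GB", "RU", "US"],
    "pro_course_detail": [
        [{"title": "Basic English", "price": 15.0},
         {"title": "FCE exam preparation", "price": 30.0}],
        [{"title": "FCE exam preparation", "price": 18.0}],
        [{"title": "Business English", "price": 12.0}],
    ],
})

MAX_PRICE = 20.0  # budget per hour, in USD


def has_affordable_fce_course(courses, max_price=MAX_PRICE):
    """True if at least one FCE course costs no more than max_price."""
    return any(
        "fce" in course["title"].lower() and course["price"] <= max_price
        for course in courses
    )


# Boolean mask: one True/False value per teacher.
fce_mask = teachers["pro_course_detail"].apply(has_affordable_fce_course)
fce_teachers = teachers[fce_mask]
print(fce_teachers[["teacher_id", "country"]])
```

In this toy data, teacher 1 is dropped even though a $15 course is listed, because the FCE course itself costs $30.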

The first impression

Let's visualize our first subset to get a general overview of the number of teachers per country.
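One possible way to get such an overview is sketched here. The column name "country" and the toy rows are assumptions, not the real dataset. A text bar chart keeps the sketch free of plotting dependencies.

```python
import pandas as pd

# Toy stand-in for the filtered subset; "country" is an assumed column name.
fce_teachers = pd.DataFrame({"country": ["GB", "GB", "GB", "RU", "US", "RU"]})

# Number of teachers per country, most frequent first.
teachers_per_country = fce_teachers["country"].value_counts()

# Print one text bar per country.
for country, count in teachers_per_country.items():
    print(f"{country:<3} {'#' * count} ({count})")
```

If matplotlib is installed, teachers_per_country.plot(kind="bar") should draw the same counts as a regular bar chart.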


Hmm… it looks like people from many countries are ready to support you on your way to a Cambridge exam. Most of them are British (the label "GB"). I suspect that is a consequence of "the nature" of the exam. However, I am glad to see people from my motherland ("RU" means Russia) who are also ready to give you a hand.

So, we can cross two items off the list:

  • A price of no more than $20 per hour

  • Teachers able to help me prepare for the Cambridge exam (FCE)

However, we still have the others to deal with:


  • They have real experience

  • Agree to give me homework and check it

  • No more than 3 candidates


Step 3. Filtering by description

Here begins something the website cannot provide: searching over a text description. Our dataframe has several columns that can offer some extra information.


A bit more about people who have nothing against becoming your teacher:


  • about_me - a short description of teachers as people: who they are and where they are from. Usually it covers the most basic facts about their lifestyle and things like that.

  • about_teacher - more related to professional skills. Some are good at test preparation (IELTS, TOEFL, etc.), others can help you get ready for a job interview or teach you how to use the language with your business partners. In short, it is a specialization.

  • teaching_style - information about the style of your future POTENTIAL classes, that is, how a teacher would conduct them.

  • introduction - usually people fill it with some information about themselves. Sometimes it is empty or copied from other text columns.

Okay, let's do our best and try to cover the remaining requirements from the list.


First, we write a function that filters by a specific word sequence. As a result, we get a boolean mask to apply to the dataframe.
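Such a function could look roughly like this. It is a sketch, not the code from the notebook: the helper name contains_phrase and the toy rows are hypothetical, while the four text columns are the ones listed above.

```python
import pandas as pd

TEXT_COLUMNS = ["about_me", "about_teacher", "teaching_style", "introduction"]


def contains_phrase(df, phrase, columns=TEXT_COLUMNS):
    """Return a boolean mask: True where any text column contains the phrase."""
    mask = pd.Series(False, index=df.index)
    for column in columns:
        # Missing descriptions are treated as empty strings;
        # the match is case-insensitive and literal (no regex).
        found = df[column].fillna("").str.contains(phrase, case=False, regex=False)
        mask = mask | found
    return mask


# Toy rows, invented for illustration only.
teachers = pd.DataFrame({
    "about_me": ["I love hiking.", None],
    "about_teacher": ["I have been teaching for 10 years.", "Certified tutor."],
    "teaching_style": ["I always give homework.", "Conversation only."],
    "introduction": ["", "Hello!"],
})

homework_mask = contains_phrase(teachers, "homework")
print(teachers[homework_mask])
```

Returning a mask rather than a filtered dataframe makes it easy to combine several conditions later with & and |.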


After that, we are going to create a chain (combination) of boolean masks to reduce the size of our dataset. I guess that looking for expressions like "I have been…" is a good way to find teachers who have real experience. At the same time, the word "homework" is a key indicator of people who will check your tasks.
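A minimal sketch of combining two such masks is shown here, on invented rows. For brevity it uses only two of the text columns, and the "country" column name is an assumption.

```python
import pandas as pd

# Toy rows, invented for illustration only.
teachers = pd.DataFrame({
    "country": ["GB", "RU", "US"],
    "about_teacher": [
        "I have been teaching English since 2010.",
        "I have been preparing students for FCE for five years.",
        "Certified TEFL tutor.",
    ],
    "teaching_style": [
        "Relaxed conversation classes.",
        "I give homework after every lesson and check it.",
        "Homework is part of every class.",
    ],
})

# Join the text columns so that each phrase is searched once per teacher.
text = teachers["about_teacher"].fillna("") + " " + teachers["teaching_style"].fillna("")

has_experience = text.str.contains("I have been", case=False, regex=False)
gives_homework = text.str.contains("homework", case=False, regex=False)

# Both conditions must hold, so the masks are combined with &.
candidates = teachers[has_experience & gives_homework]
print(len(candidates), "candidate(s) left")
print(candidates["country"].value_counts())
```

Keyword matching like this is crude: a phrase such as "no homework" would also match, so the surviving descriptions are still worth reading by eye.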

And then we show how many candidates we have.


Many teachers were excluded from our dataset. The biggest part of it (people from Great Britain) is gone. But there is room for optimism: people from 7 different countries fit our sophisticated criteria. The interesting thing is that someone from Russia is still there.

So, now we can cross off the other two. Done so far:

  • A price of no more than $20 per hour

  • Teachers able to help me prepare for the Cambridge exam (FCE)

  • They have real experience

  • Agree to give me homework and check it

But… there is a big "BUT":


  • No more than 3 candidates


And now it looks like… we are stuck at this step.


Summary

We picked the low-hanging fruit by using the explicit features of our dataset. Moreover, we used underappreciated pieces of information from the text descriptions. But unfortunately, that is not enough to get things done.

So… it is time to take a break, look at the "nature" of the subject domain from another point of view, and then cope with the problem. There is every indication that something will happen in the second part of this story…

P.S. The IPython notebook is located there.

Original article: https://habr.com/en/post/509114/






