The Four F’s

I teach a 3rd year undergraduate course on data science. It is not your typical course of lectures, practicals and tutorials. Lectures are few and far between and it is more about the practice of data science than it is about book-learning. After a one-week data science bootcamp, where we work together on a sample project from start to finish, students are supported through a data science project of their own creation. It can be a shock to the system because they are more used to pre-defined projects, and following a precise specification, so the idea of defining their own from scratch is as daunting as it is exciting.

In my experience the greatest challenge they confront is not the coding, but rather the definition of the research question(s) they wish to pursue. Getting this right helps to shape the entire project, and usually leads to a good outcome, but getting it wrong can leave the student in a never-ending struggle for clarity and purpose, which rarely results in a stand-out project.

In this article, I try to capture some of the advice I give to my students early on: what to look for in a research topic, how to think about their research objective, and how to translate this into appropriate research question(s) that will serve them well during their project. Although I have my 3rd years in mind as I write the article, I don’t think the advice is at all limited to them. Certainly, I frequently counsel my graduate students and other researchers on similar matters, as they confront many of the same challenges when establishing their research objectives. As such, I think that this article should be of interest to anyone working on data-driven projects or tasks.

The Four F’s

It’s always nice to start with a catchy checklist. In marketing, they have The Four P’s (Product, Price, Promotion, Place). The best I can come up with is the Four F’s – Fascinating, Focused, Falsifiable, Feasible – okay so it doesn’t exactly roll off the tongue, but it does a good job of capturing what is important to think about when defining your research and formulating a research question.

Fascinating

Try to find a topic that fascinates you. If you don’t care about your chosen topic then nobody else will. Plus, you won’t find the work satisfying, you won’t enjoy doing it, and the result you produce will be mediocre at best. It doesn’t have to be a topic that is so important and compelling that everyone agrees that it needs to be answered. Truth be told, such questions are few and far between anyway, and pursuing the obvious candidates runs the risk of your work being considered derivative!

I am forever encouraging my students to pursue their own niche interests. After all, these are the topics that excite them, and if they are excited then chances are others will be too. Pursuing your own interests also brings the added advantage of a topic in which you have some expertise, which can give you a valuable head-start. It usually makes it easier for you to intuit an interesting research question too, and your intuitions may be good enough to evaluate whether your findings are reasonable at an earlier stage in the process than might otherwise be the case. This can provide you with time to adjust your research or replan as necessary.

Whatever topic you choose, take the time early on to reflect on why it is interesting to you and who else might be interested in it. This will help you to better appreciate your own motivations and will enable you to better motivate your work for others. Rest assured, even if you choose a very niche topic you will find that there are others who are interested in it; that’s the nature of our connected world. Your passion for it will shine through, which can be a catalyst to capture the attention of others.

As an example, a few years ago I became more interested in running and started exploring marathon data collected online. It was something I was interested in for myself but I quickly found that the questions I was exploring were of interest to others too, and what started as a personal project outside of my core research has since emerged as one of the main research themes in my current work, resulting in numerous blog posts, scientific articles and even a few media invitations. I pursued the work because it was of interest to me and I wanted to know the answer to the questions I was asking. But when I talked about this work, when I wrote about it, others became interested too, revealing new opportunities and more research questions.

Focused

A good research question should focus on a single well-defined problem that is specific enough to be answerable in a thorough and rigorous way. But while a research question should be focused, it doesn’t need to start out that way. Indeed, it can be a useful exercise to start by asking a big and bold question – perhaps one that will have wide appeal – but the time will come to sharpen your focus on a more precise and practical version of this initial question.

A few years ago I was interested in the question: Does Hollywood ruin good books? It’s a familiar source of debate among friends, which makes it interesting and appealing – everyone understands books and movies, and everyone has a view on this topic – but what does this question mean exactly? I translated it into a narrower version: Are movie ratings usually better or worse than the ratings of the books they are adapted from? Now we are getting more focused, and in a way that facilitates the data science. We are talking about numbers (ratings) and comparing the (average) ratings of books and movies, and yet we can still capture the essence of the big, bold question that we started with.
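
To make the narrower question concrete, here is a minimal sketch of the comparison it implies. The records are invented placeholders rather than real ratings, the field names are hypothetical, and both ratings are assumed to have been rescaled to a common 0–10 scale beforehand.

```python
from statistics import mean

# Hypothetical, hand-made records for illustration only (not real ratings).
# Assumption: book and movie ratings are already on the same 0-10 scale.
adaptations = [
    {"title": "Example A", "book_rating": 8.2, "movie_rating": 7.1},
    {"title": "Example B", "book_rating": 7.4, "movie_rating": 7.9},
    {"title": "Example C", "book_rating": 8.8, "movie_rating": 6.5},
    {"title": "Example D", "book_rating": 6.9, "movie_rating": 7.0},
]


def rating_gaps(records):
    """Return movie rating minus book rating for each adaptation."""
    return [r["movie_rating"] - r["book_rating"] for r in records]


def summarise(records):
    """Summarise how movies compare with the books they were adapted from."""
    gaps = rating_gaps(records)
    return {
        # Negative mean gap: movies are rated lower than their books on average.
        "mean_gap": mean(gaps),
        # Fraction of adaptations where the movie out-rates the book.
        "share_movie_better": sum(g > 0 for g in gaps) / len(gaps),
    }


summary = summarise(adaptations)
print(summary)
```

Working with the per-title gap, rather than two separate averages, keeps the comparison like-for-like: each movie is only ever compared with its own book.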

In another example, one of the first topics of interest to me in my marathon work was how I should pace my own races to maximise my performance. And so the big, bold question became: What is the best pacing strategy in the marathon? But this was far too broad and far too vague to be useful, and I needed a more focused question on pacing. I was conscious of a common marathon recommendation, “don’t start too fast” and this led to the more practical question to ask of the data: Does starting too fast in the marathon impair your performance? This was much better, because it suggested a practical way forward, by comparing how runners paced themselves at the start of their race versus their overall finish-times.

Once again, this brings us back to numbers, averages, and comparisons – the stuff of data science, concrete and actionable – and it means that we can start our journey with a specific destination in mind and, if not a precise set of directions, at least we have a map and a compass!

Falsifiable

Falsifiability, or refutability, is a key concept in the philosophy of science, first introduced by Karl Popper in his book Logik der Forschung (1934). Simply put, a statement is falsifiable if it can be contradicted by evidence (or data). The classic example is, “all swans are white,” which is falsifiable because observing a single black swan contradicts it, and black swans do exist. Falsifiability underpins the scientific method by providing a principled way to translate observations and data into robust scientific conclusions.

“That (your hypothesis) is not only not right; it is not even wrong.” — Wolfgang Pauli (Nobel Prize in Physics, 1945)

In practice, this means you should frame your research question as a falsifiable hypothesis, by posing it as a statement that can be either true or false, and in a manner that can lead to an experiment or analysis that is capable of disproving it.

This helps us to avoid vague research questions such as “What is the best pacing strategy for the marathon?” and forces us instead to formulate more meaningful ones, such as “Does starting too fast impair your marathon performance?” or, better yet, as a falsifiable statement, “Runners who start too fast have slower finish-times, compared with runners who don’t start too fast.” Now we can disprove this statement by using data about starting paces and finish-times; if the mean finish-time of fast-starters is no different from the mean finish-time of runners who start more slowly, then, all other things being equal, this statement will be false.
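
As a rough illustration of how such a statement can be put to the data, the sketch below splits runners into fast-starters and everyone else and compares their mean finish-times. The runner records are invented, the `relative_start` field (start pace divided by overall pace) is assumed to be precomputed, and the 0.90 cut-off for “too fast” is an arbitrary assumption chosen only for the example.

```python
from statistics import mean

# Hypothetical runners, invented for illustration.
# relative_start = pace over the opening segment / average pace over the race,
# so values below 1.0 mean the runner started faster than their overall pace.
runners = [
    {"relative_start": 0.81, "finish_min": 240},
    {"relative_start": 0.98, "finish_min": 215},
    {"relative_start": 0.81, "finish_min": 250},
    {"relative_start": 0.99, "finish_min": 235},
    {"relative_start": 0.99, "finish_min": 222},
    {"relative_start": 0.83, "finish_min": 230},
]

# Assumption: starting more than 10% faster than overall pace counts as "too fast".
FAST_START_THRESHOLD = 0.90


def is_fast_starter(runner, threshold=FAST_START_THRESHOLD):
    """True when the runner's opening pace is well below their overall pace."""
    return runner["relative_start"] < threshold


def mean_finish_by_group(records):
    """Return (mean finish of fast-starters, mean finish of other runners)."""
    fast = [r["finish_min"] for r in records if is_fast_starter(r)]
    other = [r["finish_min"] for r in records if not is_fast_starter(r)]
    return mean(fast), mean(other)


fast_mean, other_mean = mean_finish_by_group(runners)
print(f"fast-starters: {fast_mean} min, others: {other_mean} min")
```

If the two means came out indistinguishable, the hypothesis would be in trouble, which is exactly what makes it falsifiable. A raw difference like this says nothing yet about chance or confounding factors such as runner ability.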

If you think that this is more about pandering to the philosophy of science than the practicalities of data science, then you would be mistaken. Framing your research question as a falsifiable hypothesis provides a more robust foundation for your work and this brings with it many important practical benefits. In particular, modern statistical techniques are based on this type of hypothesis testing and when the time comes to draw a conclusion from your work, you will find it is a better fit for the types of statistical tests that you will need to use to check that your findings are unlikely to be due to chance; a vital step when it comes to interpreting the results of your experiments.
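
One simple way to ask whether a difference between two groups could be due to chance is a permutation test. The sketch below is only one option among many (a t-test would be another), it uses the standard library alone, and the finish-times are invented for illustration.

```python
import random


def mean_of(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(group_a) - mean(group_b) arising by chance.

    Group labels are shuffled repeatedly; the p-value is the share of shuffles
    whose difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean_of(group_a) - mean_of(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean_of(pooled[:n_a]) - mean_of(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_permutations + 1)


# Hypothetical finish-times in minutes, invented for illustration.
fast_starters = [240, 250, 230, 245, 255, 238]
steady_starters = [215, 235, 222, 210, 225, 218]

p_value = permutation_p_value(fast_starters, steady_starters)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the slower finish-times of fast-starters are unlikely to be a fluke of this sample; it does not by itself show that the fast start caused them.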

Feasible

Last but not least, you must determine whether your research question is feasible. Simply put, can it be answered with the data you have, in the time that is available, and using the skills you’ve learned? Usually, if you have the right dataset then you will be able to frame a research question that fits your timeframe and skill set.

All too often, however, I have seen students derailed during their project because they have realised too late that their dataset is missing some vital piece of information that is needed to answer their research question. Usually this is because the question was not properly defined, duping them into thinking that their dataset was sufficient for their needs, until the time came to test it.

When I was working on whether Hollywood had a tendency to ruin good books, I knew I needed movie ratings and book ratings. IMDB provided a large-scale data-dump of movie data, including ratings, and Goodreads provided a handy API to get book data and ratings. However, millions of movie and book ratings weren’t enough on their own. I needed to be able to identify movies that were adapted from books, and I needed to align these movies with their corresponding books, so that ratings could be compared on a like-for-like basis. This was the most challenging aspect of that project. I needed to verify that the IMDB data included some indication that a movie was based on a book; it did in various ways via the writing credits. I also needed to test whether I could match movies and books reliably; I could, by querying the Goodreads search API with movie title, year, and author information.
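
The alignment step can be tried out on a small sample before committing to the project. The sketch below is not the matching procedure used in that work and makes no calls to IMDB or Goodreads; it assumes the movie and book records have already been downloaded into plain dictionaries with hypothetical field names, and matches them on a normalised title plus a credited author.

```python
import re


def normalise(text):
    """Lower-case, drop punctuation and collapse whitespace so names compare cleanly."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())


def match_movies_to_books(movies, books):
    """Pair each movie with a book sharing its normalised title and a credited author."""
    index = {}
    for book in books:
        index.setdefault(normalise(book["title"]), []).append(book)

    matches, unmatched = [], []
    for movie in movies:
        candidates = index.get(normalise(movie["title"]), [])
        credited = {normalise(name) for name in movie["writers"]}
        hit = next((b for b in candidates if normalise(b["author"]) in credited), None)
        if hit is not None:
            matches.append((movie["title"], hit["title"]))
        else:
            unmatched.append(movie["title"])
    return matches, unmatched


# Hypothetical placeholder records, invented for illustration.
movies = [
    {"title": "The Example: A Story", "writers": ["Jane Doe", "Sam Screenwriter"]},
    {"title": "Original Script", "writers": ["Sam Screenwriter"]},
]
books = [{"title": "The Example - A Story", "author": "Jane Doe"}]

matches, unmatched = match_movies_to_books(movies, books)
print(f"matched {len(matches)} of {len(movies)} movies; unmatched: {unmatched}")
```

The share of movies that find a match on a sample is a quick feasibility signal: if it is very low, either the matching rule or the dataset needs rethinking before any ratings are compared.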

Remember, you can’t create something from nothing, so if you need a particular piece of data to answer your research question then you need to ensure that it is available in your dataset, or that it can be derived from the available data. Sometimes, even if data is not missing, precision can be an issue. In my fast-starters marathon work I needed to measure the starting pace of marathon runners, but my dataset only included 5km split times, which meant I could only calculate pace every 5km. Would this be sufficient? Did it make sense to consider the first 5km as the start of the race? After running a few early tests I was able to conclude that 5km splits would be sufficient, which was an important confidence boost to allow me to continue; incidentally you can find the result of this marathon work here.
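
An early feasibility check of this kind can be very small. The sketch below assumes hypothetical records holding a 5km split and a finish-time in seconds, counts how many records actually have the split, and derives the start pace relative to overall pace; the field names and values are invented for illustration.

```python
MARATHON_KM = 42.195

# Hypothetical records: times in seconds; None marks a missing 5 km split.
results = [
    {"runner": "r1", "split_5km_s": 1500, "finish_s": 14400},
    {"runner": "r2", "split_5km_s": 1650, "finish_s": 13800},
    {"runner": "r3", "split_5km_s": None, "finish_s": 15000},
]


def pace_min_per_km(seconds, distance_km):
    """Convert an elapsed time over a distance into pace in minutes per km."""
    return seconds / 60 / distance_km


def relative_start_pace(record):
    """Start pace divided by overall pace; below 1.0 means a faster-than-average start."""
    start = pace_min_per_km(record["split_5km_s"], 5.0)
    overall = pace_min_per_km(record["finish_s"], MARATHON_KM)
    return start / overall


# Feasibility check: how much of the dataset carries the field we depend on?
usable = [r for r in results if r["split_5km_s"] is not None]
coverage = len(usable) / len(results)
print(f"{coverage:.0%} of records have a 5 km split")

for record in usable:
    print(record["runner"], round(relative_start_pace(record), 3))
```

Running a check like this on day one shows both whether the field exists often enough and whether the derived measure behaves sensibly, before the research question is locked in.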

I tell my students that this feasibility step is one of the most important aspects of their early work, because I have seen first-hand, and all too often, how they can fool themselves into believing that their data are sufficient, only to find out later that some vital piece of information is missing. Indeed, it has been a hard lesson for me to learn in my own work, but I have just about learned it at this stage.

Conclusions

Identifying a suitable research question is one of the most important tasks in any data science project. Getting it wrong can lead to major problems down the line, while getting it right can not only help to guarantee a successful project, but also guide your work from the start. When it comes to starting a new research project my advice to my students can be summarised as follows:

  • Choose a topic that fascinates you, even if it is a niche topic and, sometimes, especially if it is a niche topic, because “novelty!”
  • Focus your research objectives on one question at a time.
  • Translate your initial research question into a specific and falsifiable hypothesis.
  • Verify that your research question is feasible, given the data that you have available.

I believe that following these steps will make you more likely to produce a successful data science project, and it will be a lot more satisfying and enjoyable too.

Translated from: https://towardsdatascience.com/the-four-fs-9f03d7d66554
