A Step-by-Step Guide for Creating an Authentic Data Science Portfolio Project

As an aspiring data scientist, building interesting portfolio projects is key to showcasing your skills. When I learned coding and data science as a business student through online courses, I disliked that datasets were either made up of fake data or had already been solved many times before, like the Boston House Prices or the Titanic dataset on Kaggle.

In this blog post, I want to show you how I develop interesting data science project ideas and implement them step by step, using the exploration of Germany’s biggest frequent flyer forum, Vielfliegertreff, as an example. If you are short on time, feel free to skip ahead to the conclusion TLDR.

Step 1: Choose your passion topic that is relevant

As a first step, I think about a potential project that fulfills the following three requirements, which make it the most interesting and enjoyable:

  1. Solving my own problem or burning question

  2. Connected to some recent event to be relevant or especially interesting

  3. Has not been solved or covered before

As these ideas are still quite abstract, let me give you a rundown of how my three projects fulfilled the requirements:

Overview of my own data science portfolio projects fulfilling the three outlined requirements.

As a beginner, do not strive for perfection; choose something you are genuinely curious about and write down all the questions you want to explore in your topic.

Step 2: Start scraping together your own dataset

Given that you followed my third requirement, there will be no dataset publicly available, and you will have to scrape the data together yourself. Having scraped a couple of websites, I use 3 major frameworks for different scenarios:

Overview of the 3 major frameworks I use for scraping.

For Vielfliegertreff, I used scrapy as framework for the following reasons:

  1. There were no JavaScript-enabled elements hiding the data.

  2. The website structure was complex: you have to go from each forum subject to all of its threads, and from each thread to all of its post pages. With scrapy you can easily implement such complex logic, yielding requests that lead to new callback functions in an organized way.

  3. There were quite a lot of posts, so crawling the entire forum would definitely take some time. Scrapy allows you to asynchronously scrape websites at an incredible speed.

To give you an idea of how powerful scrapy is, I quickly benchmarked my MacBook Pro (13-inch, 2018, Four Thunderbolt 3 Ports) with a 2.3 GHz Quad-Core Intel Core i5, which was able to scrape around 3,000 pages/minute:

Scrapy scraping benchmark. (Image by Author)

To be nice and not get blocked, it is important that you scrape gently, for example by enabling scrapy’s AutoThrottle feature. Furthermore, I also saved all data to a SQLite database via an items pipeline to avoid duplicates, and enabled logging of each URL request to make sure I do not put more load on the server if I stop and restart the scraping process.

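In scrapy, most of this is configuration. A sketch of the relevant `settings.py` entries — the values are illustrative, and `SQLitePipeline` is a placeholder for a pipeline class you implement yourself:

```python
# settings.py (excerpt): throttle politely and make the crawl resumable.

# AutoThrottle adapts the request delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Persist the request queue and seen-request fingerprints to disk, so a
# stopped crawl resumes without re-requesting already-fetched pages.
JOBDIR = "crawls/vielfliegertreff"

# Route scraped items through a deduplicating SQLite pipeline
# (a hypothetical class implemented in your own project).
ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}
```
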
Knowing how to scrape gives you the freedom to collect datasets by yourself, and it also teaches you important concepts about how the internet works: what a request is and how HTML/XPath is structured.

For my project I ended up with 1.47 GB of data, which was close to 1 million posts from the forum.

Step 3: Cleaning your dataset

With your own messy scraped dataset comes the most challenging part of the portfolio project, the part where data scientists spend on average 60% of their time:

How data scientists spend their time. (CrowdFlower 2016)

Unlike clean Kaggle datasets, your own dataset allows you to build skills in data cleaning and to show a future employer that you are ready to deal with real-life messy datasets. Additionally, you can explore and take advantage of the python ecosystem by leveraging libraries that solve common data cleaning tasks others have solved before.

For my dataset from Vielfliegertreff, there were a couple of common tasks, like turning the dates into pandas timestamps, converting numbers from strings into actual numeric data types, and cleaning very messy HTML post text into something readable and usable for NLP tasks. While some tasks are a bit more complicated, I would like to share my top 3 favourite libraries that solved some of my common data cleaning problems:

  1. dateparser: Easily parse localized dates in almost any string format commonly found on web pages.

  2. clean-text: Preprocess your scraped data with clean-text to create a normalized text representation. It is also amazing for removing personally identifiable information, such as emails or phone numbers.

  3. fuzzywuzzy: Fuzzy string matching like a boss.

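The routine part of this cleaning (timestamps, numeric strings, HTML) can be sketched with pandas and the standard library alone. The column names and values below are made up for illustration:

```python
import pandas as pd
from html.parser import HTMLParser

# Toy frame mimicking raw scraped forum posts (made-up values).
raw = pd.DataFrame({
    "date": ["12.03.2020 14:55", "01.04.2020 09:10"],
    "likes": ["1.234", "56"],  # German thousands separator
    "html_text": ["<p>Great <b>flight</b>!</p>", "<div>Nice lounge</div>"],
})

# Dates -> pandas timestamps (day-first European format).
raw["date"] = pd.to_datetime(raw["date"], format="%d.%m.%Y %H:%M")

# Like counts: drop the thousands separator, then cast to int.
raw["likes"] = raw["likes"].str.replace(".", "", regex=False).astype(int)


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    extractor = TextExtractor()
    extractor.feed(markup)
    return "".join(extractor.parts)


raw["text"] = raw["html_text"].map(strip_html)
# raw["text"] now holds plain post text such as "Great flight!".
```
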
Step 4: Data Exploration and Analysis

When completing the Data Science Nanodegree on Udacity, I came across the Cross-Industry Standard Process for Data Mining (CRISP-DM), which I thought was quite an interesting framework for structuring your work in a systematic way.

With our current flow, we implicitly followed the CRISP-DM for our project:

We expressed business understanding by coming up with the following questions in step 1:

  1. How is COVID-19 impacting online frequent flyer forums like Vielfliegertreff?

  2. What are some of the best posts in the forums?

  3. Who are the experts that I should follow as a new joiner?

  4. What are some of the worst or best things people say about airlines or airports?

And with the scraped data, we are now able to translate our initial business questions from above into specific data exploration questions:

  1. How many posts are published on a monthly basis? Did posts decrease at the beginning of 2020 after COVID-19? Is there also some indication that fewer people joined the platform because they were not able to travel?

  2. What are the top 10 posts by number of likes?

  3. Who posts the most and also receives, on average, the most likes per post? These are the users I should follow regularly to see the best content.

  4. Could a sentiment analysis of every post, combined with named entity recognition to identify cities/airports/airlines, surface interesting positive or negative comments?

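The first of these questions boils down to a simple time-series aggregation. A sketch with made-up timestamps standing in for the scraped posts:

```python
import pandas as pd

# Made-up post creation dates standing in for the scraped forum data.
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2019-11-03", "2019-11-21", "2019-12-05",
        "2020-01-10", "2020-02-02",
    ])
})

# Posts per calendar month: resample on the timestamp index.
monthly = posts.set_index("created").resample("MS").size()
print(monthly)  # e.g. November 2019 -> 2 posts
```
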
For the Vielfliegertreff project, one can definitely say that there has been a trend of declining posts over the years. With COVID-19, we can clearly see a rapid decrease in posts from January 2020 onwards, when Europe was shutting down and closing borders, which heavily affected travelling:

Posts created by month. (Chart by Author)

User sign-ups have also gone down over the years, and the forum seems to see less and less of the rapid growth it enjoyed after its start in January 2009:

User sign-ups over the months. (Chart by Author)

Last but not least, I wanted to check what the most-liked post was about. Unfortunately, it is in German, but it was indeed a very interesting post, in which a German guy was allowed to spend some time on a US aircraft carrier and experienced a catapult take-off in a C2 airplane. The post has some very nice pictures and interesting details. Feel free to check it out here if you understand some German:

Embedded forum post preview by fleckenmann.

Step 5: Share your work via a blog post or web app

Once you are done with these steps, you can go one step further and create a model that classifies or predicts certain data points. For this project I did not attempt to use machine learning in a specific way, although I had some interesting ideas about classifying the sentiment of posts in connection with certain airlines.

In another project, however, I modeled a price prediction algorithm that allows a user to get a price estimate for any type of tractor. The model was then deployed with the awesome streamlit framework, which can be found here (be patient, it might load a bit slowly).

Another way to share your work is, like me, through blog posts on Medium, Hackernoon, KDNuggets or other popular websites. When writing blog posts about portfolio projects or other topics, such as awesome interactive AI applications, I always try to make them as fun, visual and interactive as possible. Here are some of my top tips:

  • Include nice pictures for easy understanding and to break up some of the long text

  • Include interactive elements, like tweets or videos that let the user interact

  • Swap boring tables or charts for interactive ones through tools and frameworks like airtable or plotly

Conclusion & TLDR

Come up with a project idea that answers a burning question you have or solves your own problem. Ideally the topic is timely and has not been analysed by anyone else before. Based on your experience and the website’s structure and complexity, choose a framework that matches the scraping job best. During data cleaning, leverage existing libraries to solve painful tasks like parsing timestamps or cleaning text. Finally, choose how you can best share your work: an interactive deployed model/dashboard or a well-written Medium blog post can both differentiate you from other applicants on the journey to becoming a data scientist.

As always feel free to share with me some great data science resources or some of your best portfolio projects!

Translated from: https://towardsdatascience.com/a-step-by-step-guide-for-creating-an-authentic-data-science-portfolio-project-aa641c2f2403
