My Dive Into Professional Machine Learning

I finished up my internship about a month back. The experience itself was hugely rewarding and the people on my team that I had the pleasure of working with will be lifelong mentors. Now that the project is finished, I wanted to reflect on the experience candidly and shed some light on what it was like going from hobbyist tinkerer to developing machine learning models in a professional environment.

Background

My role was an Artificial Intelligence Intern for SPIE, a not-for-profit company in Bellingham, Washington, where I was completing my undergraduate degree in MIS/Analytics. SPIE publishes research journals and other literature related to the study of photonics, along with hosting academic conferences throughout the year where researchers present new findings. When I joined, SPIE was just beginning to dip their toes into using machine learning to improve their business and needed to see what was possible using new technologies. Besides myself, the only other active member in developing the pipeline was a professor I had for a text mining class in the previous quarter. From my understanding, SPIE reached out to him to see if he knew of any students who would be interested in being part of a unique project that would eventually become the internship. I had written to my professor at the end of the quarter to see if he knew of opportunities for jobs to pursue after graduation, and this happened to fall into his lap at the right time.

Project Goals

SPIE’s goal for this project was to predict the success of a paper that a researcher submitted to a digital library. Using this prediction, they were hoping to be more efficient in using their resources on services related to these presentations, like translating the paper and getting a video of the presentation transcribed. These tasks have to be completed manually by people with domain experience due to the complex nature of the content (dense research papers).

Data

There were two datasets that I was working with during the project:

Presentation Data — Condensed database of ~2 million research papers from the SPIE library. This dataset contained 101 features, ranging from simple date features detailing when the paper was submitted to hierarchical information about what conference/sub-conference/symposium the paper was presented at. Also included in the dataset was attendance data detailing the number of participants that came to a given presentation. The result set in the data showed the downloads for each year dating back to 2015, so for each paper there were five columns of data covering 2015 through 2019, recording how many times the paper was downloaded from SPIE’s digital library that year (a small reshaping sketch follows these two dataset descriptions).

Taxon Data — SPIE invested in creating a taxonomy that tied each paper to certain topics. When a researcher published their paper to the digital library, an automated system recommended what tags should be associated with the paper to make it easier for people to search for it. If the system recommended the correct tags, they were included in the document. The process also allowed researchers to enter their own tags, which are also represented in the data. This document only had five features, of which only three were of use.

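As a quick aside on the presentation dataset’s layout: the five per-year download columns are often easier to work with once reshaped into a long (paper, year, downloads) format. This is only a sketch; the column names below (paper_id, downloads_2015, and so on) are hypothetical stand-ins rather than the dataset’s actual field names.

```python
import pandas as pd

# Hypothetical stand-in for the presentation data; the real column names differed.
papers = pd.DataFrame({
    "paper_id": [101, 102],
    "downloads_2015": [12, 0],
    "downloads_2016": [30, 5],
    "downloads_2017": [41, 9],
    "downloads_2018": [22, 14],
    "downloads_2019": [18, 21],
})

# Melt the five yearly columns into one row per (paper, year).
long_downloads = papers.melt(
    id_vars="paper_id",
    value_vars=[f"downloads_{y}" for y in range(2015, 2020)],
    var_name="year",
    value_name="downloads",
)
long_downloads["year"] = long_downloads["year"].str.slice(-4).astype(int)
print(long_downloads.sort_values(["paper_id", "year"]))
```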

Work Details

The project was capped at 20 hours/week, but given that I was taking 19 credits and it was my last semester at college, most weeks I probably worked 12–17 hours on the project and billed the hours accordingly. As far as compensation, I was offered the choice of either working as an independent contractor for $23/hour or being an official SPIE employee for $21.05/hour; I chose the latter because I thought the title of “Artificial Intelligence Intern” would be an attention grabber when future potential employers looked at my resumé. The project lasted from February 2020 through August 2020 and was fully remote except for a preliminary onsite visit. I typically prefer working in an office environment, but the remote nature of the internship ended up being beneficial with COVID-19 going into full swing a month after I started.

Tools and Technology

I could go into a lot of the specific packages we used for every part of the project, but I think the tools people use aren’t always as important as they’re made out to be. As long as you know what you’re doing and you can get where you want to go, I always favor the path of least resistance.

  • Python — Easy, robust, and has great packages for what we were trying to do.

  • Cleaning — Custom SQL and Python scripts.

  • Packages — Pandas/NumPy for manipulation, Scikit-learn for the main features of the model and exploratory analysis, and Gensim for the latent Dirichlet allocation work (a minimal LDA sketch follows this list).

  • Visualization — Matplotlib for the simple stuff, Tableau for the big stuff.

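Since Gensim only appears as a name in the list above, here is a minimal, illustrative sketch of the kind of LDA topic modeling it was used for on the taxon tags. The tag lists and parameter values are made-up placeholders, not SPIE’s actual data or settings.

```python
from gensim import corpora
from gensim.models import LdaModel

# Each "document" is the list of taxonomy tags attached to one paper (hypothetical data).
tagged_papers = [
    ["lasers", "fiber optics", "sensors"],
    ["medical imaging", "optical coherence tomography", "sensors"],
    ["lasers", "lithography", "metrology"],
]

dictionary = corpora.Dictionary(tagged_papers)
corpus = [dictionary.doc2bow(tags) for tags in tagged_papers]

# Fit a small LDA model; num_topics would be tuned on the real data.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Assign each paper its highest-probability topic, usable later as a categorical feature.
dominant_topic = [
    max(lda.get_document_topics(bow), key=lambda t: t[1])[0] for bow in corpus
]
print(dominant_topic)
```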

Timeline

Stage 1: Getting Acquainted

During the first month of the project, I acquired the data and had to clean it. My first roadblock was actually looking at it. I was still new to Pandas at the time, and the file sizes I was working with were massive. I spent an entire day trying to figure out what parameters would let me handle their custom encoding and delimiters, but I eventually had to ask for help from one of my previous professors. Once I got past this, I began to explore the data. I reached out to another member of the team with more domain knowledge and really had to put in the work to understand how the underlying organization of the data would contribute to the goal. During this stage, I also developed a pipeline for how I saw the project progressing and why each step in the chain was warranted. We also decided on how the data would be trimmed, split, and tested. There were some issues with data fidelity (downloads are only given per year as opposed to monthly or daily) as well as an issue with corruption of the attendance data, so we had to find some workarounds. We settled on advancing with about 40 features from the original presentation dataset and decided to use the download data as the dependent variable for the predictions.

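For the file-loading roadblock, the fix ultimately came down to passing the right arguments to pandas and streaming the file in chunks. A rough sketch of that kind of call follows; the file name, delimiter, and encoding here are placeholders rather than the values the SPIE export actually used.

```python
import pandas as pd

# Placeholder file name, delimiter, and encoding; the real export had its own conventions.
chunks = pd.read_csv(
    "presentations_export.txt",
    sep="|",                 # custom delimiter
    encoding="latin-1",      # custom encoding
    engine="python",         # more tolerant parser for irregular rows
    on_bad_lines="skip",     # skip malformed records instead of failing outright
    chunksize=100_000,       # stream the ~2 million rows instead of loading them at once
)

presentations = pd.concat(chunks, ignore_index=True)
print(presentations.shape)
```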

Stage 2: Feature Engineering

During the second stage, the primary activity was engineering new features. One of our first ideas was to use a version of forward chaining to see how the performance of topics in previous years could be used to predict the popularity of later papers on those topics. This approach essentially tried to track the momentum of certain subjects. We also used an unsupervised approach to cluster the taxons into their natural domains, which turned out to be one of the most significant predictors of success. Beyond this, we used a random forest approach suited to the mix of categorical and numerical variables.

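A rough sketch of the momentum idea: aggregate how each topic’s papers performed in prior years and attach that history as a feature, shifting by one year so a paper never sees its own year’s results. The frame and column names here are hypothetical, not the project’s actual ones.

```python
import pandas as pd

# Hypothetical long-format history: one row per (topic, year) observation.
history = pd.DataFrame({
    "topic":     ["lasers", "lasers", "lasers", "imaging", "imaging", "imaging"],
    "year":      [2015, 2016, 2017, 2015, 2016, 2017],
    "downloads": [120, 180, 260, 40, 35, 30],
})

# Mean downloads per topic per year, then an expanding mean over *prior* years only;
# shift(1) keeps the current year out of its own feature, mimicking forward chaining.
topic_year = (
    history.groupby(["topic", "year"], as_index=False)["downloads"].mean()
    .sort_values(["topic", "year"])
)
topic_year["topic_momentum"] = topic_year.groupby("topic")["downloads"].transform(
    lambda s: s.shift(1).expanding().mean()
)
print(topic_year)
```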

Stage 3: Testing

Because the initial dataset was so large, we always ran tests on randomly seeded subsets of data. We used standard splits for training/testing/validation; for instance, 20% for testing and 80% for training. This was one of the most frustrating parts of the project because the issues of bias and variance highlighted some earlier mistakes we had made. We had to tweak settings for about a week before we felt happy with the way the model was set up and confident that it would perform well in the worst-case scenarios. We used forward chaining in the project to train on the earlier papers and used the later years as testing data. For example, if a paper was published in 2018, we used 2015–2017 to train the model and judged the accuracy based on 2018. This was a beneficial step, but not entirely in line with the business question of predicting the performance of new papers. We ended up scrapping this approach for the decision tree portion and using it only in the LDA sub-model. During this stage, we also presented our findings to the executive committee, where we answered some of the questions they had about the characteristics that make a research paper popular and showed them how we got there.

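As a concrete sketch of that forward-chaining split combined with a random forest on mixed feature types: train on papers published through 2017 and evaluate on 2018. The tiny frame and column names below are synthetic stand-ins; the real modeling table had roughly 40 features.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the modeling table: one row per paper.
papers = pd.DataFrame({
    "year_published": [2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018],
    "conference":     ["photonics", "optics"] * 4,
    "topic_momentum": [0.0, 0.0, 120.0, 40.0, 150.0, 37.5, 173.0, 35.0],
    "downloads":      [120, 40, 180, 35, 260, 30, 240, 28],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
features = pd.get_dummies(papers.drop(columns=["downloads"]))

# Forward chaining: train on papers published up to 2017, test on 2018.
train_mask = papers["year_published"] <= 2017
test_mask = papers["year_published"] == 2018

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(features[train_mask], papers.loc[train_mask, "downloads"])

preds = model.predict(features[test_mask])
print("MAE on 2018 papers:", mean_absolute_error(papers.loc[test_mask, "downloads"], preds))
```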

Stage 4: Packaging and Deployment

Up until this point, we had been developing the pipeline in sections. Because there were so many individual components being treated in specific ways, the model was somewhat siloed. Certain parts required more preprocessing while others just waited for them to finish. I think we could have saved some effort in this stage if we had spent more time planning in the beginning, but I was still very new to professional data science work and might not have had the wisdom to actually make that happen. The model was going to be deployed on Azure, but we later found out that the deprecation of Jupyter Notebooks on the platform conflicted with the process we intended to use. Unfortunately, this meant that I didn’t get to see the model get deployed because my internship was wrapping up, but I believe we put in enough work to make it fairly straightforward for the team to deploy the model in the future. One of the more interesting parts of stitching my code together during this time was seeing my improvement over the course of the project. Reviewing everything I had written line by line (in all, about 2,000 lines) made me feel, for lack of a better word, nostalgic.

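The deployment itself never happened on my watch, but for context, the hand-off piece of packaging a scikit-learn model is usually as simple as persisting the fitted estimator so someone else can reload it in their own environment. This is a generic joblib sketch, not SPIE’s actual packaging process.

```python
import joblib
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the trained model from the earlier steps (left unfitted here for brevity).
model = RandomForestRegressor(n_estimators=200, random_state=42)

# Persist the estimator so a separate deployment process can reload the same object later.
joblib.dump(model, "paper_popularity_model.joblib")

# Later, in the deployment environment:
restored = joblib.load("paper_popularity_model.joblib")
```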

Takeaways

I also just wanted to note new things I learned along the way and what changed over the course of the project.

What Went Right

  • I couldn’t have been as successful as I was without my team and the support of the executive team. Being open and candid in communicating where we were and what we were doing was probably the single greatest contributor to the success of this project. As an example, during our exploratory phase, there was (healthy) conflict about whether to treat a paper’s MonthPublished feature as a categorical or numerical variable (a small sketch of the two encodings follows this list). I pushed for categorical so we would treat publishing as a cyclical process, but I ended up being swayed because our team’s communication style allowed healthy disagreement to take place.

  • The willingness for everyone to be curious and explore their hunches allowed for a lot of room for creativity in the way we set up the model. Because the data was as unique as it was, we had to think outside of the box with the way we treated it. Furthermore, I think the sub-components of the model were the best choices we could have made.

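For reference, here is the small sketch mentioned above: the two ways a MonthPublished column can be encoded, one-hot (pure categorical) versus a cyclical numeric encoding where December and January land close together. The sample values and derived column names are illustrative only.

```python
import numpy as np
import pandas as pd

papers = pd.DataFrame({"MonthPublished": [1, 4, 7, 12]})

# Option 1: treat the month as categorical (one-hot columns, no implied ordering).
one_hot = pd.get_dummies(papers["MonthPublished"], prefix="month")

# Option 2: treat the month as numeric but encode its cyclical nature,
# so month 12 and month 1 end up numerically close.
papers["month_sin"] = np.sin(2 * np.pi * papers["MonthPublished"] / 12)
papers["month_cos"] = np.cos(2 * np.pi * papers["MonthPublished"] / 12)

print(one_hot)
print(papers)
```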

Where I Need to Improve

  • Not spending enough time specifically planning the sequence of the pipeline led to bottlenecks in my workflow.

  • It can be fun to try and test parts of the model together at early stages, but relying too much on early findings skewed my decision making down the line and we ended up eliminating entire portions because they weren’t robust with more data.

  • We spent about a third of our time on exploratory research, but I think we could have still spent more time on it. I would have ideally spent half of the project getting familiar with the relationships between variables.

  • The process of working with such a large dataset went suspiciously smoothly, and I wish there had been more challenges during this stage.

  • I mentioned in the section above that I thought we used methodologies in creative ways (and we certainly did), but I wish I had more time to explore other approaches that I haven’t used, like transformers or PyTorch.

Other Things I Learned

  • Many of the packages I thought I knew well are incredibly deep and can accomplish a lot of common tasks I was doing manually until I learned about them. Read the docs!

  • A broad knowledge-base of math, computer science, and database management is very important before you move onto larger projects. You can get by doing small projects built on shoddy code with little understanding of the underlying theory, but you’ll end up plateauing.

  • Real-world data is messy. Cleanliness is next to godliness.

Overall, I would call the project successful based on the fact that the model was able to predict the future popularity of papers with significant confidence.

Thanks for reading!

Originally published at: https://medium.com/@thejackmccumber/my-dive-into-professional-machine-learning-b10a9a4e49bd
