Project Updates: July 2016

About a quarter ago (April), I posted my first regular update on all of the various projects I’m working on. As side projects tend to go, some fall into and out of favor, and occasionally new ones crop up. As I develop these projects I post regular updates, but it’s helpful to me (and hopefully to some of you) to occasionally take a 30,000 ft view of all of them in one place.

The vast majority of these projects are open source, and I’m trying to be better about actually including outside contributors, so if any of these projects pique your interest, please reach out, here or on GitHub.

Side Projects

First off, side projects.

I’ve posted a couple of times recently about www.pedalwrencher.com. It was launched about a year and a half ago, and one of my goals for this past quarter was to update its codebase, get it more stable, add some features (like streamlined onboarding, an administration interface, and email notification support), and try to get it growing again. Its user base has grown a bit over 10% over the course of the quarter, which is modest, but it has been pretty low-touch.

Open Source Projects

My open source projects have also continued to plug along, with some new ones being added, and an increased effort in actually presenting them and getting help with them. You can check out a couple of presentations I’ve given to that end here and here.

To simplify things, I’ll break up these projects into 3 categories:

Actively Developing: these are projects that others and I are regularly committing new features to, and that we are seeking more contributors and users for.
Maintenance Mode: these are stable projects without any new features planned, but they are actively maintained and will continue to be updated as required over time.
Defunct / Toy Projects: these are projects that have fallen into disrepair, have been superseded by another project, or are just one-off toy projects not intended for real long-term use.

Actively Developing

git-pandas: Still my favorite of the bunch, git-pandas continues to be developed. At this stage, most of the work is headed toward a v2 release, which has two primary goals: clarity and performance. To that end, I’ve made two main additions: a unified glob-style syntax and a parallelized cumulative blame function. The next big step is parallelizing at the Repo level. All project directories have loops that iterate over calls to their constituent repos, and those loops can be parallelized quite cleanly, speeding things up drastically. I’d love help with this, so if you’re interested in high-performance numeric Python, joblib, and the like, reach out.
twitter-pandas: Inspired by git-pandas, twitter-pandas is intended to provide a clean, clear interface to Twitter API data with a pandas DataFrame based construct. There are three of us working on this seriously, and we’ve gotten most of the tweepy API replicated, so now it’s on to the more interesting part: making it actually useful. This encompasses two main thrusts: making the inputs to the methods less complex, and consolidating the methods into a smaller set of obvious functionality. We do this by picking out use cases and working through them with those two goals in mind, refactoring along the way. Look out for a blog post next week on how this was applied to the friendship methods to make it trivial to find the people you follow who don’t follow you back.
categorical_encoding: category_encoders started out as a pair of blog posts about the concept of encoding categorical variables, and eventually grew into a production-grade, pip-installable library that is compatible with scikit-learn. It’s being used in production, and over the past quarter it’s gotten some stability and consistency upgrades, better testing, better support for edge cases and missing values, and a bunch of other fairly boring improvements. It’s now available on conda as well as PyPI, so check it out. In the next quarter I’d like to get good benchmarks written, covering both performance (time/CPU/memory) and quality of encoding. If you’re a data scientist interested in that sort of thing, please reach out.
DummyRDD: DummyRDD has continued to be extremely useful for unit testing pyspark-based software. We’ve got a bunch of the RDD functionality supported (no DataFrame or DataSet support yet). If you find it useful, let me know; if we’re missing something that would help you, let me know; and if you want to help out, there’s still a ton to do.
petersburg: Totally just my personal little experiment, petersburg continues to get development. We now support frequency and mixed-mode frequency/classification based estimation, but I haven’t really found a use case with open data where it is extremely useful. Still neat to hack on.
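
To make the repo-level parallelism mentioned for git-pandas a bit more concrete, here is a minimal sketch of the general pattern: a loop over a project’s constituent repos turned into a parallel map. This is not git-pandas code. `summarize_repo` is a hypothetical stand-in for an expensive per-repo call, and the standard library’s `concurrent.futures` is used here in place of joblib only to keep the example dependency-free; with joblib, `Parallel` and `delayed` would play the role of the executor.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_repo(repo_name):
    """Hypothetical stand-in for an expensive per-repo call (e.g. a blame or log query)."""
    return {"repo": repo_name, "name_length": len(repo_name)}


def summarize_project_serial(repo_names):
    """The serial pattern: loop over the constituent repos one at a time."""
    return [summarize_repo(name) for name in repo_names]


def summarize_project_parallel(repo_names, max_workers=4):
    """Same loop, but each repo is handed to a worker; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize_repo, repo_names))


repos = ["git-pandas", "twitter-pandas", "petersburg"]
serial = summarize_project_serial(repos)
parallel = summarize_project_parallel(repos)
```

Because each repo is processed independently and the results are collected in input order, the parallel version is a drop-in replacement for the serial loop. Whether threads or processes are the better fit depends on how the real per-repo work spends its time.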

Maintenance Mode

pypi-publisher: apparently used in production a good bit (search “pypi-publisher” in the code on GitHub), and by me nearly daily. It does what it’s supposed to and meets my needs perfectly, but could use support for things like bdist or conda. If anyone wants to work on that sort of thing, I’d welcome it, but I probably won’t get to it personally any time soon.
gitnoc: I use this about weekly. Currently it’s a bit slow, so it’s on the back burner until the repo-level parallelism in git-pandas is done.
cookiecutter-flask: people still seem to be using this pretty consistently, which is great. I haven’t touched it in a while.
pygeohash: this gets used in production at a couple of companies at least, and is stable, fast, and well supported. I’m not sure what of value I could add to this other than continued maintenance, but if you have ideas, let me know.
pyculiarity: this code is still kind of ugly, but it works and is being used in a few places. I’d really like to go through it at some point and simplify it a lot, but I don’t have plans to do so in the near term.

Defunct / Toy Projects

RogerRoger: this is a little monitoring API written in Scala. It’s mostly just so that I can learn some Scala and D3, so nothing to worry about at this stage. Later on it may be a neat little utility, but don’t hold your breath.
flink-python-examples: I think this has finally fallen out of sync with flink, so I’m not sure any of the examples still work. I’m not using flink for anything, so I haven’t had a free day to devote to updating it.
incomprehensible: still kinda fun, but it has served its purpose.
sklearn-extensions: I think there should be a single reference for scikit-learn style 3rd party packages, but I now think that it may be better off as a website or something like that than as a package. It’s a lot of work and a licensing nightmare to keep it all in one package. To that end, scikit-learn-contrib is interesting, so check that out.
