GSoC 2020 Final Report: Wikimedia transferpy Improvements

Introduction

Transferpy is a database backup and recovery tool intended to move large files between WMF production hosts and back up MariaDB remotely in an efficient way. It runs with root privileges and can open firewall ports and perform compression, encryption, and checksumming.

The aim of this proposal was to make the service more usable and faster. The project proposal comprised three significant changes:

1. Automatic free port detection: transferpy needs a port to perform transfers between hosts, and previously the user had to specify it explicitly. Automating this frees the user from the burden of finding a free port on the machine.

2. Parallel checksum: transferpy is intended to transfer huge files with sizes on the order of terabytes. The checksum computation could take as much time as the transfer itself. Calculating it in parallel with the data transfer improves the overall execution time.

3. Multiprocessing for multiple-destination transfers: transferpy can transfer data from one source to multiple destinations. Running these transfers in parallel processes should shorten the turnaround time, provided the network has enough bandwidth.

Automatic port detection was completed, along with fixes for the concurrency issues introduced as a consequence of this new functionality. The parallel checksum was completed and gives the user two options: a fully parallel checksum, which is faster but less reliable, and a source-only parallel checksum, which is slower but more reliable. Multiprocessing for transfers is still underway due to time constraints.

Possible improvements were identified in the project while working on the first task, so the decision was made to complete two more tasks: packaging transferpy for Debian, and Sphinx documentation.

Work

Initially, the task was to refactor the repository. Formerly, transferpy was part of the wmfmariadbpy project, and as these projects were logically separate, transferpy was moved to its own repository to make development easier and cleaner. The structure of transferpy was improved by modularising it into Firewall, Transferer, and MariaDB components with the required attributes and functions, making future development more straightforward. Logging was also improved by using the Python logging module and by adding additional error messages. A new parameter, verbose, was introduced to let the user choose the level of logging.

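As an illustration, here is a minimal sketch of how such a verbose flag can drive the standard Python logging module; the function name and log format are illustrative, not transferpy's actual code:

```python
import logging

def setup_logging(verbose: bool) -> None:
    """Map the user-facing verbose flag to a logging level."""
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s: %(message)s",
    )
```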

An integral part of the initial work was to find a way to detect a free port on the destination machine for netcat to listen on. The netstat utility proved useful for listing ports already in use and was used to find a free one. There was a potential race: two processes could find the same port free at the same instant (a concurrency issue). The resolution was an interprocess lock.

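The sketch below shows the general idea under some assumptions: it parses `netstat -ltn` for ports in LISTEN state and guards the search with an `fcntl` file lock. The lock path and port range are hypothetical, and transferpy's actual implementation differs (in particular, the lock would really need to be held until the listener has bound the chosen port):

```python
import fcntl
import subprocess

LOCK_FILE = "/tmp/transferpy_port.lock"  # hypothetical lock path
PORT_RANGE = range(4400, 4500)           # hypothetical search range

def used_ports():
    """Return the set of TCP ports currently in LISTEN state."""
    out = subprocess.run(["netstat", "-ltn"], capture_output=True,
                         text=True, check=True).stdout
    ports = set()
    for line in out.splitlines():
        fields = line.split()
        # The local address is the 4th column, e.g. "0.0.0.0:4444"
        if len(fields) >= 4 and ":" in fields[3]:
            port = fields[3].rsplit(":", 1)[1]
            if port.isdigit():
                ports.add(int(port))
    return ports

def find_free_port():
    """Pick a free port while holding an interprocess lock, so that two
    concurrent transfers cannot choose the same port."""
    with open(LOCK_FILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        busy = used_ports()
        for port in PORT_RANGE:
            if port not in busy:
                return port
    raise RuntimeError("no free port found in range")
```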

While working on the concurrency issue, I came to understand the project better. Based on further discussions, a few more tasks beyond the ones proposed initially were taken up. The first one was related to output generation. Because of Cumin, transferpy used to print output for every command it executed on the remote machines. Changes were made to suppress this Cumin output, providing a better experience to the user.

The second one was about the distribution of the code. There was a consensus that it would work well for users if a Debian package were available and distributed via the Wikimedia repository. As that makes it easier to maintain and roll out updates, the packaging was done.

The third one was related to the duplicated documentation (in the wiki and in the code). Enough information for the user was already present inside the code itself as comments and argument `help` options, which led to the decision to reuse that content rather than writing wiki pages separately. So Sphinx, a documentation tool that automatically indexes the code using the comments present in it, was adopted. Sphinx also takes information from the `help` option and generates documentation as needed. These modifications changed the look of transferpy to a great extent. The documentation has been published to doc.wikimedia.org.

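A minimal `conf.py` sketch of such a setup is shown below: `sphinx.ext.autodoc` pulls API documentation out of docstrings, and the sphinx-argparse extension renders argparse `help` text. This is one plausible wiring, not necessarily transferpy's exact configuration:

```python
# conf.py (sketch)
project = "transferpy"
extensions = [
    "sphinx.ext.autodoc",  # generate API docs from docstrings in the code
    "sphinxarg.ext",       # render argparse `help` output (sphinx-argparse)
]
```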

The next task was the parallel checksum. Since the files have sizes on the order of terabytes, the checksum calculation itself takes a significant amount of time. So an implementation of a parallel checksum using pipes and the tee utility was considered. The transfer script was modified to calculate the md5sum and write the checksum into a file; when the transfer completes, the tool compares the content of the saved md5sum file with the transferred file's checksum. A new parameter named parallel-checksum was created to enable this feature. The existing checksum option was also improved by calculating the source checksum during the transfer itself. Parallel-checksum is faster but can only detect network errors; the standard checksum is slower but can also detect disk errors. Hence both options were kept. As I was given two cloud machines with enough storage (2 TB) for testing, I benchmarked these options and observed a significant improvement in the performance of transferpy's checksum calculation.

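A hedged sketch of the tee-based idea follows: the data stream is duplicated with `tee`, one copy goes to the network via netcat, and the other is fed to `md5sum`, whose result lands in a temporary file. The function, paths, and commands here are illustrative and simplified compared to the real transfer pipeline:

```python
import subprocess

def send_with_parallel_checksum(path, dest_host, port, checksum_file):
    """Stream `path` to the destination while checksumming it in flight."""
    cmd = (
        f"tar -cf - '{path}' "
        f"| tee >(md5sum > '{checksum_file}') "
        f"| nc '{dest_host}' {port}"
    )
    # Process substitution >(...) requires bash rather than plain sh.
    subprocess.run(["bash", "-c", cmd], check=True)
```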

The number of parameters was increasing, and it would have been inconvenient for the user to remember all of them. So a configuration file was introduced: the user only needs to tweak it once, and the tool picks the options up from it in subsequent transfers. The feature was implemented with the help of the configparser Python module. Command-line arguments were given the highest priority, followed by the configuration file, with the built-in defaults having the lowest priority. Since temporary files are required for locking and for the checksum calculation, the necessary changes were made to clean them up properly at the end of execution, even when errors or exceptions occur.

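The precedence scheme can be sketched roughly as below with `argparse` and `configparser`; the option names and config path are illustrative rather than transferpy's actual ones:

```python
import argparse
import configparser

DEFAULTS = {"verbose": False, "parallel_checksum": False}

def load_options(config_path="transferpy.conf"):
    config = configparser.ConfigParser()
    config.read(config_path)  # a missing file is silently ignored

    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true", default=None)
    parser.add_argument("--parallel-checksum", action="store_true",
                        default=None)
    args = parser.parse_args()

    options = {}
    for key, default in DEFAULTS.items():
        if getattr(args, key) is not None:       # 1. command line wins
            options[key] = getattr(args, key)
        elif config.has_option("DEFAULT", key):  # 2. then the config file
            options[key] = config.getboolean("DEFAULT", key)
        else:                                    # 3. built-in default last
            options[key] = default
    return options
```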

Incorporating multiprocessing into the transfers was the next step. This was expected to help the user transfer source data to multiple destinations simultaneously. Testing revealed that multiprocessing improves performance in an environment with higher network bandwidth and good disk I/O. A proof-of-concept implementation was written and tested. It is a work in progress and currently fails when transferring larger workloads on the test machines.

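The proof of concept follows the shape sketched below: one worker process per destination via `multiprocessing.Pool`, where `transfer_to` stands in for the real per-destination send logic and is purely illustrative:

```python
from multiprocessing import Pool

def transfer_to(dest_host):
    """Stand-in for the per-destination send logic (find a free port,
    open the firewall, start the listener, stream the data, verify)."""
    return dest_host, 0  # (host, exit code)

def transfer_all(dest_hosts):
    # One worker per destination so the copies run concurrently; this
    # only pays off when network bandwidth and disk I/O are sufficient.
    with Pool(processes=len(dest_hosts)) as pool:
        results = pool.map(transfer_to, dest_hosts)
    failed = [host for host, code in results if code != 0]
    if failed:
        raise RuntimeError("transfer failed for: " + ", ".join(failed))
```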

I have followed best practices, added the required comments, and written documentation and tests for all the features described above. The final product has better usability and is in better shape for future development.

I am delighted that the code has reached production and that WMF machines are currently using it. The Debian package (version 1.0) is being used as the medium to distribute the final working product to production.

The proposal: Click here

All the merged PRs: Click here

My Phabricator dashboard: Click here

Transferpy documentation: Click here

Transferpy Debian package: Click here

Challenges

Some unanticipated challenges emerged during the course of the work. A major one, which came up multiple times, was the concurrency issue that occurs when multiple processes try to read from or write to the same shared resource, causing side effects.

The first occurrence was during the automatic port detection task. When two instances of transferpy ran simultaneously, both could detect the same port number as free. This was solved by implementing a locking mechanism. The issue occurred again while working on the parallel checksum: the intermediate files used for storing the calculated checksums caused concurrency issues. Creating these intermediate files was necessary because the checksum calculation of a directory with a large number of files resulted in a deadlock, due to the limitation of the pipe buffer size in Python. This was fixed by naming each file with the destination hostname and port, which together form a unique key.

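The fix amounts to deriving each temporary file name from the (hostname, port) pair of the transfer, roughly as in this sketch (the directory and name pattern are illustrative):

```python
import os

def checksum_file_path(dest_host, port, tmp_dir="/tmp"):
    """The (host, port) pair is unique per running transfer, so
    concurrent transferpy instances never collide on this file."""
    return os.path.join(tmp_dir, f"transferpy_{dest_host}_{port}.md5sum")
```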

Another challenge was related to the tox development environment. I was not very familiar with it, although running everything via tox is easy compared to using direct commands. So I learned it, and implemented and resolved the issues related to it while writing the documentation and packaging.

Packaging transferpy for Debian and releasing it was another exciting challenge, as it was new to me. Administering the assigned test machines from scratch was also an interesting challenge.

In all the challenges I faced, the interactions I had with the mentors were incredibly helpful. They gave timely input and encouraged me to come up with effective solutions.

Experience

I have learned a lot of new things this summer. First, I learned how Wikimedia projects work using Gerrit and Phabricator, and I communicated with community members over Zulip and IRC.

For benchmarking purposes, I was assigned a couple of machines, and I took care of their administration from scratch. This gave me knowledge of how Wikimedia servers are monitored using Horizon and configured using Puppet with the help of profile and role configurations. I was excited to configure the machines allotted to me using Puppet, with Cumin (a remote execution tool) and MariaDB with xtrabackup. It was an amazing experience; everything was new and exciting.

This GSoC with Wikimedia was an enjoyable experience. The mentors I was assigned were very friendly, helpful, and supportive. The community as a whole is fabulous and enthusiastic. The GSoC-Outreachy video calls were remarkable: everyone shared their stories in a biweekly report, and that was inspirational.

The coding experience was enlightening. I learned a lot of coding best practices and other GNU/Linux related concepts. The work was always enjoyable and insightful. The mentors were very knowledgeable and very generous in explaining why I should do things in a particular way. The overall experience with the Wikimedia community was awesome; I don't have words to portray how fantastic it was.

Acknowledgment

I am very thankful to my mentors, Jaime Crespo and Manuel Aróstegui. Without them, the work would never have been this joyful and rewarding. Jaime Crespo was always available to answer my questions promptly. I love the Wikimedia community for its welcoming and enthusiastic nature.

I thank Hashar of Wikimedia for his immense help during the packaging and documentation work. I also thank Srishti Sethi and Pavithra Eswaramoorthy of the Wikimedia GSoC organization for organizing the GSoC-Outreachy meets and taking care of all the students' needs and doubts.

I would also like to thank my friend Jyothis Jagan. He was generous and extremely helpful with proofreading and editing, from the very start of the proposal to this final report.

Finally, thanks to the GSoC program, without which I wouldn't have been part of this awesome project or gained this memorable experience. I am thankful to them for creating this kind of opportunity for students.

Future Work

- Complete multiprocess transfer.

- Improve the kill_job function in CuminExecution: currently, kill_job kills the subprocess on the machine running transferpy. Instead, it should kill the actual process on the remote machine.

- Improve transferpy with better GNU/Linux commands.

Originally published at: https://medium.com/@ajupazhamayil/gsoc-2020-final-report-wikimedia-transferpy-improvements-2f4fce77004a
