Smarter API Reviews With Gen AI
更智能的API评论与生成式人工智能
January 04, 2024 by Daniel Smith | Comments
2024年1月4日:Daniel Smith |评论
With the rise of generative AI, businesses are trying to figure out how they can implement it in their environment to speed up one process or another. Perhaps the best place to start looking are existing pain points in any process, then working through ideas of how AI can apply to the situation. (This blog post was written by a Real Human 😜) 随着生成式人工智能的兴起,企业正试图找出如何在自己的环境中实现它,以加快一个或另一个过程。也许最好的开始是寻找任何过程中现有的痛点,然后研究人工智能如何应用于这种情况。 | Dall-E 3 image prompt: A futuristic, intelligent robot with a sleek design sitting at a computer, analyzing code diffs on a screen. The robot has a friendly, approachable appearance, and the screen shows complex code with highlighted sections. The environment suggests a modern, high-tech office space. Dall-E 3图片提示:一个设计时尚的未来智能机器人坐在电脑前,分析屏幕上的代码差异。该机器人外观友好、平易近人,屏幕上显示了带有突出显示部分的复杂代码。这里的环境让人联想到一个现代化、高科技的办公空间。 |
History
历史
One major pain point in The Qt Project has historically been completing API Reviews on-time before a release. Since API additions and changes are significant alterations to how the Qt Framework is used, and such changes to existing APIs are often breaking for users, these changes must be carefully considered before being allowed into the final release. Adding new features to Qt often means adding new APIs, and we want those new APIs to be well designed and stable for users in the future. At times though, existing APIs need to be changed and we need to also ensure that those changes are well reasoned and have no alternatives that avoid breaking compatibility for users.
Qt项目的一个主要痛点是在发布前按时完成API评审。由于API的添加和更改是对Qt框架使用方式的重大更改,并且对现有API的此类更改通常会对用户造成破坏,因此在允许这些更改进入最终版本之前,必须仔细考虑这些更改。向Qt添加新功能通常意味着添加新的API,我们希望这些新的API在未来为用户设计良好且稳定。然而,有时需要更改现有的API,我们还需要确保这些更改是合理的,并且没有其他选择可以避免破坏用户的兼容性。
The way this has been done for years works, but has lead to missed release dates on a number of occasions where a significant change to the API was initially merged, but upon review at release time, needs to be retracted or significantly changed before final release. To combat this delay between implementation of an API change and final preparation for a release, we wanted to move the reviewing of API changes earlier in the cycle. But how?
多年来,这种方式一直有效,但在许多情况下,API的重大更改最初被合并,但在发布时审查后,需要撤回或在最终发布前进行重大更改,从而导致错过发布日期。为了克服API变更的实现和发布的最终准备之间的延迟,我们希望在周期的早期对API变更进行审查。但是怎么做呢?
Initial discussion about this centered around just simply tagging any header file changes for manual review, but the noise created by this was deemed to be excessive. Such a solution would just lead to even more work. But what if we could offload some of initial code analysis to an AI to at the least, determine if a change should be deemed "significant" or not?
关于这一点的最初讨论集中在简单地标记任何头文件更改以供手动审查,但由此产生的噪音被认为是过大的。这样的解决方案只会带来更多的工作。但是,如果我们可以将一些初始代码分析卸载给人工智能,至少确定一个更改是否应该被视为“重大”,该怎么办?
What is GPT?
什么是GPT?
GPT, short for "Generative Pretrained Transformer", is a complex mathematical model that "understands" the relationships between data. In uses like ChatGPT which you may have used, the model has been trained on language data to create a model of how words in a language relate to each other. By studying billions of words in books, documents, blog posts, recording transcripts and so on, the model understands how one word relates to another depending on contexts. In this way, it can also understand new inputs and generate outputs, word-by-word, forming a response similar to how we speak. When it generates a new word for output, it reviews the context and generates the most likely to occur next word in the sentence.
GPT,简称“生成预训练转换器”,是一个复杂的数学模型,可以“理解”数据之间的关系。在可能使用过的ChatGPT等用途中,该模型已根据语言数据进行了训练,以创建一个语言中单词如何相互关联的模型。通过研究书籍、文档、博客文章、录音记录等中的数十亿个单词,该模型了解了一个单词如何根据上下文与另一个单词相关联。通过这种方式,它还可以逐字理解新的输入并产生输出,形成类似于我们说话方式的反应。当它生成一个新单词用于输出时,它会查看上下文并生成句子中最有可能出现的下一个单词。
There are now many LLMs (Large Language Model) out there today, and OpenAI's GPT-4 model is one such LLM. There have been four major iterations in GPT thus far, with each leapfrogging the previous model in capability, memory, and comprehension.
现在有很多LLM(大型语言模型),OpenAI的GPT-4模型就是这样的LLM之一。到目前为止,GPT已经进行了四次主要迭代,每次迭代在能力、内存和理解方面都超越了以前的模型。
While GPT has traditionally been used for human language comprehension, it's just a model and can also be trained to understand relationships in other types of data as well, such as code. This means that we can have a chat with it about some block of code and ask it for analysis or have a back-and-forth conversation to better understand the code ourselves.
虽然GPT传统上用于人类语言理解,但它只是一个模型,也可以被训练来理解其他类型数据(如代码)中的关系。这意味着我们可以与它就某个代码块进行聊天,并要求它进行分析,或者进行来回对话,以更好地理解代码。
Where we stand today
我们今天的处境
Since mid-December, we've been operating a proof-of-concept bot which monitors changes submitted to codereview.qt-project.org and runs the generated diff through a GPT-4 analysis. Since generative AI does not act on its own, it must be prompted in order to generate an output. A prompt can contain instructions, a request, contextual information, and so on. Here's what we use for our API reviews, along with a raw diff of a given code change:
自12月中旬以来,我们一直在运行一个概念验证机器人,该机器人监控提交给codereview.qt-project.org的更改,并通过GPT-4分析运行生成的差异。由于生成性人工智能不会自行行动,因此必须对其进行提示才能生成输出。提示可以包含指令、请求、上下文信息等。以下是我们用于API审查的内容,以及给定代码更改的原始差异:
[abridged] "Task: Classify the change in a public header file as significant to the behavior and usage of the API or not. Additional qualifications: changes to 'private:' sections of public headers are not significant; changes to platform plugins, sometimes identified by the file path, are not significant; whitespace-only changes are not significant; ..."
[节略]“任务:将公共头文件中的更改分类为对API的行为和使用是否重要。其他限定条件:对公共头的'private:'部分的更改不重要;对平台插件的更改(有时由文件路径标识)不重要;仅空白区的更改不重大;…”
Along with some backend tricks to shoehorn GPT into providing a relevant response, the behavior has proven generally reliable when using GPT-4. In about one month of operation, over 230 changes have been tagged with the "Needs API-Review" hashtag, each receiving a brief analysis of what changes make it significant to the use and operation of the public API.
除了一些后端技巧来迫使GPT提供相关响应外,在使用GPT-4时,该行为已被证明总体上是可靠的。在大约一个月的运行过程中,超过230项更改被标记为“Needs API-Review”标签,每个更改都会收到对哪些更改对公共API的使用和运行具有重要意义的简要分析。
Example of an addition of a new API
新增API示例
In this example, new functionality is being added to QRemoteObjectHost.
在本例中,将向QRemoteObjectHost添加新功能。
diff --git a/.../qconnectionfactories.h b/.../qconnectionfactories.h
index b56a34f..7eabfd6 100644
--- a/src/remoteobjects/qconnectionfactories.h
+++ b/src/remoteobjects/qconnectionfactories.h
@@ -18,7 +18,7 @@
#include <QtNetwork/qabstractsocket.h>
#include <QtRemoteObjects/qtremoteobjectglobal.h>
-
+#include <QtRemoteObjects/qremoteobjectnode.h>
QT_BEGIN_NAMESPACE
@@ -116,6 +116,7 @@
Q_SIGNALS:
void shouldReconnect(QtROClientIoDevice*);
+ void setError(QRemoteObjectNode::ErrorCode);
protected:
virtual void doDisconnectFromServer() = 0;
Git diff: Show more
diff --git a/.../qremoteobjectnode.h b/.../qremoteobjectnode.h
index 18f75bc..13e2540 100644
--- a/src/remoteobjects/qremoteobjectnode.h
+++ b/src/remoteobjects/qremoteobjectnode.h
@@ -6,6 +6,7 @@
#include <QtCore/qsharedpointer.h>
#include <QtCore/qmetaobject.h>
+#include <QtNetwork/qlocalserver.h>
#include <QtRemoteObjects/qtremoteobjectglobal.h>
#include <QtRemoteObjects/qremoteobjectregistry.h>
#include <QtRemoteObjects/qremoteobjectdynamicreplica.h>
@@ -61,7 +62,8 @@
MissingObjectName,
HostUrlInvalid,
ProtocolMismatch,
- ListenFailed
+ ListenFailed,
+ SocketAccessError
};
Q_ENUM(ErrorCode)
@@ -181,6 +183,7 @@
~QRemoteObjectHost() override;
QUrl hostUrl() const override;
bool setHostUrl(const QUrl &hostAddress, AllowedSchemas allowedSchemas=BuiltInSchemasOnly) override;
+ static void setLocalServerOptions(QLocalServer::SocketOptions options);
Q_SIGNALS:
void hostUrlChanged();
When prompted, GPT-4 provides the following analysis of the new API additions, summarizing them nicely:
提示时,GPT-4对新添加的API进行了以下分析,并对其进行了很好的总结:
1.A new
#include <QtRemoteObjects/qremoteobjectnode.h>
directive is added inqconnectionfactories.h
, which could potentially expose additional symbols or types that were not available before in this header. This change can affect users who include this header.1.在qconnectionfactories.h中添加了一个新的#include<QtRemoteObjects/qremoteobjectnode.h>指令,该指令可能会暴露以前在此头文件中不可用的其他符号或类型。此更改可能会影响包含此头文件的用户。
2.The addition of the
setError(QRemoteObjectNode::ErrorCode)
signal in theqconnectionfactories.h
file is a significant change. It represents a new API for signaling errors with an error code, which will affect how users of the API can handle error conditions.2.在qconnectionfactories.h文件中添加setError(QRemoteObjectNode::ErrorCode)信号是一个重大更改。它代表了一个新的API,用于用错误代码发出错误信号,这将影响API用户处理错误条件的方式。
3.The addition of a new
SocketAccessError
enum value to theErrorCode
enum inqremoteobjectnode.h
is a significant change. It adds a new potential error state that users of the API will need to handle.3.在qremoteobjectnode.h中的ErrorCode枚举中添加新的SocketAccessError枚举值是一个重大更改。它添加了API用户需要处理的新的潜在错误状态。
4.The addition of the
static void setLocalServerOptions(QLocalServer::SocketOptions options)
function to theQRemoteObjectHost
class inqremoteobjectnode.h
is a significant change. It introduces a new API for setting options on the local server, which will influence how users configure their instances ofQRemoteObjectHost
.4.将static void setLocalServerOptions(QLocalServer::SocketOptions options)函数添加到qremoteobjectnode.h中的QRemoteObjectHost类是一个重大更改。它引入了一个新的API,用于在本地服务器上设置选项,这将影响用户如何配置他们的QRemoteObjectHost实例。
These changes can affect both the behavior of existing code (if it relies on previous error codes or does not expect the newsetError
signal) and the way developers use the API (by providing new methods and configuration options).这些更改可能会影响现有代码的行为(如果它依赖于以前的错误代码或不期望新的setError信号)和开发人员使用API的方式(通过提供新的方法和配置选项)。
In this case, GPT identifies the new behavior, but also makes a note that even though this is an addition, it has the potential to cause issues for existing users if their code is not ready to handle the new signal. Such a warning may be useful to remind a reviewer that additional scrutiny and perhaps a documentation note is needed before merging the change.
在这种情况下,GPT识别了新的行为,但也注意到,即使这是一个添加,如果现有用户的代码还没有准备好处理新信号,它也有可能给现有用户带来问题。这样的警告可能有助于提醒审查人员,在合并更改之前,可能需要额外的审查和文档说明。
Example of a change to an existing API
现有API变更示例
When a simple change is made to an existing API in qvariantanimation.h, this what the bot sees:
当对qvariantamation.h中现有的API进行简单更改时,机器人程序会看到:
diff --git a/.../qvariantanimation.h b/.../qvariantanimation.h
index 640c057..4bdb971 100644
--- a/src/corelib/animation/qvariantanimation.h
+++ b/src/corelib/animation/qvariantanimation.h
@@ -26,7 +26,7 @@
BINDABLE bindableEasingCurve)
public:
- typedef QPair<qreal, QVariant> KeyValue;
+ using KeyValue = std::pair<qreal, QVariant>;
typedef QList<KeyValue> KeyValues;
QVariantAnimation(QObject *parent = nullptr);
When prompted, GPT-4 provides the following analysis which can be used to prime the reviewer later about what change was made:
当出现提示时,GPT-4提供以下分析,可用于稍后向审查人员介绍所做的更改:
"In qvariantanimation.h, the change from a typedef to a using declaration for KeyValue alters the underlying type from QPair to std::pair, potentially impacting how API users interact with KeyValue instances."
“在qvariantamation.h中,KeyValue从typedef更改为using声明会将基础类型从QPair更改为std::pair,这可能会影响API用户与KeyValue实例的交互方式。”
Early Attempts
早期尝试
Generative AI is a fickle thing. As mentioned earlier, it just guesses the next word, one at a time, over and over until it thinks it should stop. (Check out how LLMs function, it's pretty mind boggling.) While GPT 3.5 is pretty great for having a chat with, it often forgets what you told it, and doesn't follow instructions very well. It also suffers from "recency bias", where the words later in your prompt are considered more important that earlier words. This a bit of a problem since some of the instructions may be ignored, or worse, some of the change's diff being passed are simply not considered when generating output. This resulted in a less than stellar hit rate when using GPT 3.5 for analysis. Even taking steps to request a more consistent and less creative output, it would still inconsistently ignore whole parts of the request, or even just hallucinate entirely what was added or removed in a change. 生成式人工智能是一件变化无常的事情。如前所述,它只是一遍又一遍地猜测下一个单词,直到它认为应该停止为止。(看看LLM是如何工作的,这真是令人难以置信。)虽然GPT 3.5非常适合聊天,但它经常忘记你告诉它的内容,而且没有很好地遵循说明。它还受到“近因偏见”的影响,提示中较晚的单词被认为比较早的单词更重要。这有点问题,因为一些指令可能会被忽略,或者更糟的是,在生成输出时根本不考虑传递的一些更改的diff。当使用GPT 3.5进行分析时,这导致了低于恒定的命中率。即使采取措施要求更一致、更不具创造性的输出,它仍然会不一致地忽略请求的整个部分,甚至只是对更改中添加或删除的内容产生幻觉。 | Dall-E 3 prompt: A confident robot in an office, proudly presenting incorrect information on a whiteboard. The robot has an air of self-assurance, using a marker to underline wrong answers with a smile. The scene is slightly humorous, highlighting the robot's misplaced confidence in its incorrect answers. The environment is a modern office setting, and the whiteboard is filled with equations and text, with some parts clearly marked as incorrect but the robot appears oblivious. Dall-E 3提示:一个自信的机器人在办公室里,骄傲地在白板上展示错误的信息。机器人有一种自信的表情,用记号笔微笑着在错误的答案下面划线。这一幕略显幽默,突出了机器人对错误答案的错误信心。这个环境是一个现代的办公环境,白板上写满了等式和文本,有些部分被清楚地标记为不正确,但机器人似乎没有注意到。 |
In November, Microsoft introduced a beefed up version of GPT 3.5, called 3.5-instruct, which was supposed to improve its instruction-following ability as the name suggests. While this worked better, the model still suffered from hallucinations of what was actually changed in a diff.
11月,微软推出了GPT 3.5的增强版,名为3.5指令,顾名思义,该版本旨在提高其指令遵循能力。虽然这效果更好,但该模型仍然存在实际变化的幻觉。
To get around some of these issues, a best-of-three model was tried such that two of the three shots needed to agree on the significance of the change before adopting a result. This at least improved overall accuracy, but outputs were still lacking detail and clearly suffering from not fully understanding the changes made at a software development level.
为了解决其中的一些问题,我们尝试了三局三胜制的模型,这样三局中的两局需要在采用结果之前就变化的意义达成一致。这至少提高了总体准确性,但产出仍然缺乏细节,而且显然是因为没有完全理解软件开发层面的变化。
When the cost of GPT-4 was lowered, even a cursory test showed greatly improved results. GPT-4 has a better memory for what it reads, and is able to more clearly comment on its own reasoning behind a decision. Due to higher accuracy, the bot was moved back to a single-shot configuration and only minor tweaks to the prompt have been needed since.
当GPT-4的成本降低时,即使是粗略的测试也显示出大大改善的结果。GPT-4对它所读的内容有更好的记忆,并且能够更清楚地评论它自己在决策背后的推理。由于精度更高,机器人被移回了单次射击配置,此后只需要对提示进行微小的调整。
Next Steps and beyond
下一步及以后
AI is not a magic bullet. If it ever is, that's still quite far off from today. Unlike a human, it has no idea what it's actually doing; it simply responds to a contextual prompt by choosing the most likely next word. That means that if it chooses wrongly at any point, the rest of the reply can be sent down the wrong path and producing a confident, but very wrong answer. Some steps can be taken to mitigate this, but not without time and cost. Despite the shortcomings of current generation generative AI, we have a solid start to reducing the effort for our API reviews.
人工智能不是灵丹妙药。如果真的是这样的话,那离今天还有很长的路要走。与人类不同,它不知道自己到底在做什么;它只是通过选择最有可能的下一个单词来响应上下文提示。这意味着,如果它在任何时候选择错误,其余的回复可能会被发送到错误的路径上,并产生一个自信但非常错误的答案。可以采取一些措施来缓解这种情况,但并非没有时间和成本。尽管当前一代生成式人工智能存在缺陷,但我们有一个坚实的开端来减少API审查的工作量。
Next, we want to explore larger contexts for evaluating the significance of changes, including multiple files in a single-shot evaluation so the full change is understood, instead of the file-by-file evaluation that occurs today. Further, we hope that future iterations of GPT or related LLM tech can improve the hit rate and instruction-following. Even though GPT-4 is miles ahead of 3.5 for this use case, it still makes mistakes and ignores context on occasion.
接下来,我们想探索更大的上下文来评估更改的重要性,包括单次评估中的多个文件,以便了解完整的更改,而不是今天发生的逐文件评估。此外,我们希望GPT或相关LLM技术的未来迭代能够提高命中率和指令遵循率。尽管GPT-4在这个用例中远远领先于3.5,但它仍然会出错,有时会忽略上下文。
Conclusion
结论
By shifting the review to a per-change effort at the time of approval for merging, we can save hours spent on reviews during the crunch time right before release. This new method also gives a bit more context so that discussions on the necessity of a change need to only happen once. In our traditional reviews, a rather complicated script performed a bunch of hard-coded logic to exclude irrelevant lines of code, creating throwaway commits that collected all the API changes together for review. Although the previous method provided a quick overview of the changes, it lacked the necessary context. Additionally, tracking down the source of the change and discussing its importance was a time-consuming process. Given the significant amount of time that could elapse between writing a change and the API review, it could be challenging to remember the reasoning behind each change.
通过在批准合并时将审查转移到每次更改的工作量,我们可以在发布前的关键时刻节省审查时间。这种新方法还提供了更多的上下文,因此关于更改必要性的讨论只需要进行一次。在我们的传统评审中,一个相当复杂的脚本执行了一系列硬编码逻辑,以排除不相关的代码行,创建一次性提交,将所有API更改收集在一起进行评审。尽管前面的方法提供了对变化的快速概述,但它缺乏必要的上下文。此外,追踪变化的来源并讨论其重要性是一个耗时的过程。考虑到编写变更和API评审之间可能需要相当长的时间,记住每次变更背后的原因可能是一项挑战。
This proof-of-concept API Review Bot is just another tool to help everyone in The Qt Project contribute more easily and get the attention a change deserves sooner. In the end, every change still requires human eyes on it, but we hope that this new bot makes that a little easier.
这个概念验证API Review Bot只是另一个工具,可以帮助Qt项目中的每个人更容易地做出贡献,并更快地获得变革所应得到的关注。最终,每一个变化都需要人类的关注,但我们希望这个新的机器人能让这变得更容易。