【Java-LangChain:使用 ChatGPT API 搭建系统-9】评估(上)-存在一个简单的正确答案时

第九章,评估(上)-存在一个简单的正确答案时

在之前的章节中,我们展示了如何使用 LLM 构建应用程序,包括评估输入、处理输入以及在向用户显示输出之前进行最终输出检查。 构建这样的系统后,如何知道它的工作情况?甚至在部署后并让用户使用它时,如何跟踪它的运行情况,发现任何缺陷,并持续改进系统的答案质量?

在本章中,我们想与您分享一些最佳实践,用于评估 LLM 的输出。

构建基于 LLM 的应用程序与传统的监督学习应用程序有所不同。由于可以快速构建基于 LLM 的应用程序,因此评估方法通常不从测试集开始。相反,通常会逐渐建立一组测试示例。

在传统的监督学习环境中,需要收集训练集、开发集或保留交叉验证集,然后在整个开发过程中使用它们。

然而,如果能够在几分钟内指定 Prompt,并在几个小时内得到相应结果,那么暂停很长时间去收集一千个测试样本将是一件极其痛苦的事情。因为现在,可以在零个训练样本的情况下获得这个成果。

因此,在使用 LLM 构建应用程序时,您将体会到如下的过程:

首先,您会在只有一到三个样本的小样本中调整 Prompt,并尝试让 Prompt 在它们身上起作用。
然后,当系统进行进一步的测试时,您可能会遇到一些棘手的例子。Prompt 在它们身上不起作用,或者算法在它们身上不起作用。
这就是使用 ChatGPT API 构建应用程序的开发者所经历的挑战。

在这种情况下,您可以将这些额外的几个示例添加到您正在测试的集合中,以机会主义地添加其他棘手的示例。
最终,您已经添加了足够的这些示例到您缓慢增长的开发集中,以至于通过手动运行每个示例来测试 Prompt 变得有些不方便。
然后,您开始开发在这些小示例集上用于衡量性能的指标,例如平均准确性。

这个过程的一个有趣方面是,如果您觉得您的系统已经足够好了,您可以随时停在那里,不再改进它。事实上,许多已部署的应用程序停在第一或第二个步骤,并且运行得非常好。 需要注意的是,有很多大模型的应用程序没有实质性的风险,即使它没有给出完全正确的答案。

但是,对于部分高风险应用,如果存在偏见或不适当的输出可能对某人造成伤害,那么收集测试集、严格评估系统的性能、确保在使用之前它能够做正确的事情,就变得更加重要。

但是,如果您只是使用它来总结文章供自己阅读,而不是给别人看,那么可能造成的危害风险更小,您可以在这个过程中早早停止,而不必去花费更大的代价去收集更大的数据集。

一,环境配置

参考第二章的 环境配置小节内容即可。

获取相关产品和类别

        

二,找出相关产品和类别名称

        String system = "您将提供客户服务查询。\n" +
                "    客户服务查询将用" + delimiter + "字符分隔。\n" +
                "    输出一个 Python 列表,列表中的每个对象都是 Json 对象,每个对象的格式如下:\n" +
                "        'category': <Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \n" +
                "    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders中的一个>,\n" +
                "    以及\n" +
                "        'products': <必须在下面允许的产品中找到的产品列表>\n" +
                "    \n" +
                "    其中类别和产品必须在客户服务查询中找到。\n" +
                "    如果提到了一个产品,它必须与下面允许的产品列表中的正确类别关联。\n" +
                "    如果没有找到产品或类别,输出一个空列表。\n" +
                "    \n" +
                "    根据产品名称和产品类别与客户服务查询的相关性,列出所有相关的产品。\n" +
                "    不要从产品的名称中假设任何特性或属性,如相对质量或价格。\n" +
                "    \n" +
                "    允许的产品以 JSON 格式提供。\n" +
                "    每个项目的键代表类别。\n" +
                "    每个项目的值是该类别中的产品列表。\n" +
                "    允许的产品:" + productsAndCategory + "";


        String fewShotUser_1 = "我想要最贵的电脑。";
        String fewShotAssistant_1 = "[{\"category\":\"Computers and Laptops\",\"products\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}]";


        List<ChatMessage> messages = new ArrayList<>();

        ChatMessage systemMessage = new ChatMessage();
        systemMessage.setRole("system");
        systemMessage.setContent(system);
        messages.add(systemMessage);

        ChatMessage userMessage = new ChatMessage();
        userMessage.setRole("user");
        userMessage.setContent(delimiter + fewShotUser_1 + delimiter);
        messages.add(userMessage);

        ChatMessage assistantMessage = new ChatMessage();
        assistantMessage.setRole("assistant");
        assistantMessage.setContent(fewShotAssistant_1);
        messages.add(assistantMessage);

        ChatMessage userInputMessage = new ChatMessage();
        userInputMessage.setRole("user");
        userInputMessage.setContent(delimiter + userInput + delimiter);
        messages.add(userInputMessage);

        return this.getCompletionFromMessage(messages, 0);

三,在一些查询上进行评估


        //第一个评估
        String result = findCategoryAndProductV1("如果我预算有限,我可以买哪款电视?", allProducts);

        log.info("test1: {}", result);
[{"category":"Televisions and Home Theater Systems","products":["CineView 4K TV","SoundMax Home Theater","CineView 8K TV","SoundMax Soundbar","CineView OLED TV"]}]
        //第二个评估
        String result = findCategoryAndProductV1("我需要一个智能手机的充电器", allProducts);

        log.info("test2: {}", result);
[{"category":"Smartphones and Accessories","products":["MobiTech PowerCase","MobiTech Wireless Charger"]}]
        //第三个评估
        String result = findCategoryAndProductV1("你们有哪些电脑?", allProducts);

        log.info("test3: {}", result);

[{"category":"Computers and Laptops","products":["TechPro Ultrabook","BlueWave Gaming Laptop","PowerLite Convertible","TechPro Desktop","BlueWave Chromebook"]}]
        //第四个评估
        String result = findCategoryAndProductV1(" 告诉我关于smartx pro手机和fotosnap相机的信息,那款DSLR的。\n" +
                "另外,你们有哪些电视?", allProducts);

        log.info("test4: {}", result);
[{"category":"Smartphones and Accessories","products":["SmartX ProPhone"]},{"category":"Cameras and Camcorders","products":["FotoSnap DSLR Camera"]},{"category":"Televisions and Home Theater Systems","products":["CineView 4K TV","SoundMax Home Theater","CineView 8K TV","SoundMax Soundbar","CineView OLED TV"]}]

四,更难的测试用例

找出一些在实际使用中,模型表现不如预期的查询。

        //第五个评估
        String result = findCategoryAndProductV1("告诉我关于CineView电视的信息,那款8K的,还有Gamesphere游戏机,X款的。\n" +
                "我预算有限,你们有哪些电脑?", allProducts);

        log.info("test5: {}", result);
[{"category":"Televisions and Home Theater Systems","products":["CineView 4K TV","SoundMax Home Theater","CineView 8K TV","SoundMax Soundbar","CineView OLED TV"]},{"category":"Gaming Consoles and Accessories","products":["GameSphere X","ProGamer Controller","GameSphere Y","ProGamer Racing Wheel","GameSphere VR Headset"]},{"category":"Computers and Laptops","products":["TechPro Ultrabook","BlueWave Gaming Laptop","PowerLite Convertible","TechPro Desktop","BlueWave Chromebook"]}]

具体来说,CineView 8K电视是一款高端电视,具有8K分辨率和OLED显示屏。GameSphere X是一款游戏机,具有高性能和多种游戏选择。对于预算有限的电脑,您可以考虑TechPro Chromebook或TechPro Ultrabook,它们都是较为经济实惠的选择。

五,修改指令以处理难测试用例

我们在提示中添加了以下内容,不要输出任何不在 JSON 格式中的附加文本,并添加了第二个示例,使用用户和助手消息进行 few-shot 提示。


        String system = "您将提供客户服务查询。\n" +
                "    客户服务查询将用" + delimiter + "字符分隔。\n" +
                "    输出一个列表,列表中的每个对象都是 JSON 对象,每个对象的格式如下:\n" +
                "        'category': <Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \n" +
                "    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders中的一个>,\n" +
                "    和\n" +
                "        'products': <必须在下面允许的产品中找到的产品列表>\n" +
                "    不要输出任何不是 JSON 格式的额外文本。\n" +
                "    输出请求的 JSON 后,不要写任何解释性的文本。\n" +
                "    \n" +
                "    其中类别和产品必须在客户服务查询中找到。\n" +
                "    如果提到了一个产品,它必须与下面允许的产品列表中的正确类别关联。\n" +
                "    如果没有找到产品或类别,输出一个空列表。\n" +
                "    \n" +
                "    根据产品名称和产品类别与客户服务查询的相关性,列出所有相关的产品。\n" +
                "    不要从产品的名称中假设任何特性或属性,如相对质量或价格。\n" +
                "    \n" +
                "    允许的产品以 JSON 格式提供。\n" +
                "    每个项目的键代表类别。\n" +
                "    每个项目的值是该类别中的产品列表。\n" +
                "    允许的产品:" + productsAndCategory + "";


        String fewShotUser_1 = "我想要最贵的电脑。";
        String fewShotAssistant_1 = "[{\"category\":\"Computers and Laptops\",\"products\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}]";

        //增加了一个 few-shot 提示
        String fewShotUser_2 = "我想要最便宜的电脑。你推荐哪款?";
        String fewShotAssistant_2 = "[{\"category\":\"Computers and Laptops\",\"products\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}]";


        List<ChatMessage> messages = new ArrayList<>();

        ChatMessage systemMessage = new ChatMessage();
        systemMessage.setRole("system");
        systemMessage.setContent(system);
        messages.add(systemMessage);

        ChatMessage userMessage = new ChatMessage();
        userMessage.setRole("user");
        userMessage.setContent(delimiter + fewShotUser_1 + delimiter);
        messages.add(userMessage);

        ChatMessage assistantMessage = new ChatMessage();
        assistantMessage.setRole("assistant");
        assistantMessage.setContent(fewShotAssistant_1);
        messages.add(assistantMessage);


        ChatMessage shot2 = new ChatMessage();
        shot2.setRole("user");
        shot2.setContent(delimiter + fewShotUser_2 + delimiter);
        messages.add(shot2);

        ChatMessage shotAssistant2 = new ChatMessage();
        shotAssistant2.setRole("assistant");
        shotAssistant2.setContent(fewShotAssistant_2);
        messages.add(shotAssistant2);


        ChatMessage userInputMessage = new ChatMessage();
        userInputMessage.setRole("user");
        userInputMessage.setContent(delimiter + userInput + delimiter);
        messages.add(userInputMessage);

        return this.getCompletionFromMessage(messages, 0);

六,在难测试用例上评估修改后的指令


        String result = findCategoryAndProductV2("告诉我关于smartx pro手机和fotosnap相机的信息,那款DSLR的。\n" +
        "另外,你们有哪些电视?", allProducts);

        log.info("test6: {}", result);
[{"category":"Smartphones and Accessories","products":["SmartX ProPhone"]},{"category":"Cameras and Camcorders","products":["FotoSnap DSLR Camera"]},{"category":"Televisions and Home Theater Systems","products":["CineView 4K TV","SoundMax Home Theater","CineView 8K TV","SoundMax Soundbar","CineView OLED TV"]}]

七,回归测试:验证模型在以前的测试用例上仍然有效

检查并修复模型以提高难以测试的用例效果,同时确保此修正不会对先前的测试用例性能造成负面影响。

        String result = findCategoryAndProductV2("如果我预算有限,我可以买哪款电视?", allProducts);

        log.info("test7: {}", result);
[{"category":"Televisions and Home Theater Systems","products":["CineView 4K TV","SoundMax Home Theater","CineView 8K TV","SoundMax Soundbar","CineView OLED TV"]}]

如果您的预算有限,我们建议您购买CineView 4K电视或SoundMax家庭影院。

八,收集开发集进行自动化测试

当您要调整的开发集不仅仅是一小部分示例时,开始自动化测试过程就变得有用了。

[{
	"customer_msg": "Which TV can I buy if I'm on a budget?",
	"ideal_answer": {
		"Televisions and Home Theater Systems": ["CineView 4K TV", "SoundMax Home Theater", "CineView 8K TV", "SoundMax Soundbar", "CineView OLED TV"]
	}
}, {
	"customer_msg": "I need a charger for my smartphone",
	"ideal_answer": {
		"Smartphones and Accessories": ["MobiTech PowerCase", "MobiTech Wireless Charger", "SmartX EarBuds"]
	}
}, {
	"customer_msg": "What computers do you have?",
	"ideal_answer": {
		"Computers and Laptops": ["TechPro Ultrabook", "BlueWave Gaming Laptop", "PowerLite Convertible", "TechPro Desktop", "BlueWave Chromebook"]
	}
}, {
	"customer_msg": "tell me about the smartx pro phone and the fotosnap camera, the dslr one. Also, what TVs do you have?",
	"ideal_answer": {
		"Smartphones and Accessories": ["SmartX ProPhone"],
		"Cameras and Camcorders": ["FotoSnap DSLR Camera"],
		"Televisions and Home Theater Systems": ["CineView 4K TV", "SoundMax Home Theater", "CineView 8K TV", "SoundMax Soundbar", "CineView OLED TV"]
	}
}, {
	"customer_msg": "tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\nI'm on a budget, what computers do you have?",
	"ideal_answer": {
		"Televisions and Home Theater Systems": ["CineView 8K TV"],
		"Gaming Consoles and Accessories": ["GameSphere X"],
		"Computers and Laptops": ["TechPro Ultrabook", "BlueWave Gaming Laptop", "PowerLite Convertible", "TechPro Desktop", "BlueWave Chromebook"]
	}
}, {
	"customer_msg": "What smartphones do you have?",
	"ideal_answer": {
		"Smartphones and Accessories": ["SmartX ProPhone", "MobiTech PowerCase", "SmartX MiniPhone", "MobiTech Wireless Charger", "SmartX EarBuds"]
	}
}, {
	"customer_msg": "I'm on a budget. Can you recommend some smartphones to me?",
	"ideal_answer": {
		"Smartphones and Accessories": ["SmartX EarBuds", "SmartX MiniPhone", "MobiTech PowerCase", "SmartX ProPhone", "MobiTech Wireless Charger"]
	}
}, {
	"customer_msg": "What Gaming consoles would be good for my friend who is into racing games?",
	"ideal_answer": {
		"Gaming Consoles and Accessories": ["GameSphere X", "ProGamer Controller", "GameSphere Y", "ProGamer Racing Wheel", "GameSphere VR Headset"]
	}
}, {
	"customer_msg": "What could be a good present for my videographer friend?",
	"ideal_answer": {
		"Cameras and Camcorders": ["FotoSnap DSLR Camera", "ActionCam 4K", "FotoSnap Mirrorless Camera", "ZoomMaster Camcorder", "FotoSnap Instant Camera"]
	}
}, {
	"customer_msg": "I would like a hot tub time machine.",
	"ideal_answer": []
}]

九,通过与理想答案比较来评估测试用例

        //测试用例
        String testCase = "[{\"customer_msg\":\"Which TV can I buy if I'm on a budget?\",\"ideal_answer\":{\"Televisions and Home Theater Systems\":[\"CineView 4K TV\",\"SoundMax Home Theater\",\"CineView 8K TV\",\"SoundMax Soundbar\",\"CineView OLED TV\"]}},{\"customer_msg\":\"I need a charger for my smartphone\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"MobiTech PowerCase\",\"MobiTech Wireless Charger\",\"SmartX EarBuds\"]}},{\"customer_msg\":\"What computers do you have?\",\"ideal_answer\":{\"Computers and Laptops\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}},{\"customer_msg\":\"tell me about the smartx pro phone and the fotosnap camera, the dslr one. Also, what TVs do you have?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX ProPhone\"],\"Cameras and Camcorders\":[\"FotoSnap DSLR Camera\"],\"Televisions and Home Theater Systems\":[\"CineView 4K TV\",\"SoundMax Home Theater\",\"CineView 8K TV\",\"SoundMax Soundbar\",\"CineView OLED TV\"]}},{\"customer_msg\":\"tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\\nI'm on a budget, what computers do you have?\",\"ideal_answer\":{\"Televisions and Home Theater Systems\":[\"CineView 8K TV\"],\"Gaming Consoles and Accessories\":[\"GameSphere X\"],\"Computers and Laptops\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}},{\"customer_msg\":\"What smartphones do you have?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX ProPhone\",\"MobiTech PowerCase\",\"SmartX MiniPhone\",\"MobiTech Wireless Charger\",\"SmartX EarBuds\"]}},{\"customer_msg\":\"I'm on a budget. Can you recommend some smartphones to me?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX EarBuds\",\"SmartX MiniPhone\",\"MobiTech PowerCase\",\"SmartX ProPhone\",\"MobiTech Wireless Charger\"]}},{\"customer_msg\":\"What Gaming consoles would be good for my friend who is into racing games?\",\"ideal_answer\":{\"Gaming Consoles and Accessories\":[\"GameSphere X\",\"ProGamer Controller\",\"GameSphere Y\",\"ProGamer Racing Wheel\",\"GameSphere VR Headset\"]}},{\"customer_msg\":\"What could be a good present for my videographer friend?\",\"ideal_answer\":{\"Cameras and Camcorders\":[\"FotoSnap DSLR Camera\",\"ActionCam 4K\",\"FotoSnap Mirrorless Camera\",\"ZoomMaster Camcorder\",\"FotoSnap Instant Camera\"]}},{\"customer_msg\":\"I would like a hot tub time machine.\",\"ideal_answer\":[]}]";
        JSONArray array = JSONUtil.parseArray(testCase);

        JSONObject cc = (JSONObject) array.get(7);
        String customerMsg = cc.getStr("customer_msg");
        Object idealAnswer = cc.getJSONObject("ideal_answer");

        log.info("用户提问: {}", customerMsg);
        log.info("标准答案: {}", idealAnswer);
用户提问: What Gaming consoles would be good for my friend who is into racing games?
标准答案: {"Gaming Consoles and Accessories":["GameSphere X","ProGamer Controller","GameSphere Y","ProGamer Racing Wheel","GameSphere VR Headset"]}

        public double evalResponseWithIdeal(String response, JSONObject ideal) {

        log.info("回复:{}", response);

        if (StrUtil.isBlank(response) && ideal == null) {
            return 1;
        }

        if (StrUtil.isBlank(response) || ideal == null) {
            return 0;
        }

        //统计正确总数
        int correct = 0;

        JSONArray array = JSONUtil.parseArray(response);

        for (Object o : array) {

            JSONObject jsonObject = (JSONObject) o;
    
            String category = jsonObject.getStr("category");
            List<String> products = jsonObject.getBeanList("products", String.class);
    
            if (category != null && CollectionUtil.isNotEmpty(products)) {
    
                List<String> fps = ideal.getJSONArray(category).toList(String.class);
        
                log.info("产品集合:{}", products);
                log.info("标签答案集合:{}", fps);
        
                if (CollectionUtil.isNotEmpty(fps)) {
                    //找到的产品数组和标准答案给到的商品数组一致
                    if (fps.containsAll(products)) {
                    log.info("正确");
                    correct++;
                    } else {
                        log.info("产品集合: {}", products);
                        log.info("标准的产品集合: {}", fps);
                        if (products.size() < fps.size()) {
                            log.info("回答是标准答案的一个子集");
                        } else {
                            log.info("回答是标准答案的一个超集");
                        }
                    }
                }
            }


        }

        double pccorrect = correct / array.size();

        return pccorrect;
        }
        //测试用例
        String testCase = "[{\"customer_msg\":\"Which TV can I buy if I'm on a budget?\",\"ideal_answer\":{\"Televisions and Home Theater Systems\":[\"CineView 4K TV\",\"SoundMax Home Theater\",\"CineView 8K TV\",\"SoundMax Soundbar\",\"CineView OLED TV\"]}},{\"customer_msg\":\"I need a charger for my smartphone\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"MobiTech PowerCase\",\"MobiTech Wireless Charger\",\"SmartX EarBuds\"]}},{\"customer_msg\":\"What computers do you have?\",\"ideal_answer\":{\"Computers and Laptops\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}},{\"customer_msg\":\"tell me about the smartx pro phone and the fotosnap camera, the dslr one. Also, what TVs do you have?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX ProPhone\"],\"Cameras and Camcorders\":[\"FotoSnap DSLR Camera\"],\"Televisions and Home Theater Systems\":[\"CineView 4K TV\",\"SoundMax Home Theater\",\"CineView 8K TV\",\"SoundMax Soundbar\",\"CineView OLED TV\"]}},{\"customer_msg\":\"tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\\nI'm on a budget, what computers do you have?\",\"ideal_answer\":{\"Televisions and Home Theater Systems\":[\"CineView 8K TV\"],\"Gaming Consoles and Accessories\":[\"GameSphere X\"],\"Computers and Laptops\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}},{\"customer_msg\":\"What smartphones do you have?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX ProPhone\",\"MobiTech PowerCase\",\"SmartX MiniPhone\",\"MobiTech Wireless Charger\",\"SmartX EarBuds\"]}},{\"customer_msg\":\"I'm on a budget. Can you recommend some smartphones to me?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX EarBuds\",\"SmartX MiniPhone\",\"MobiTech PowerCase\",\"SmartX ProPhone\",\"MobiTech Wireless Charger\"]}},{\"customer_msg\":\"What Gaming consoles would be good for my friend who is into racing games?\",\"ideal_answer\":{\"Gaming Consoles and Accessories\":[\"GameSphere X\",\"ProGamer Controller\",\"GameSphere Y\",\"ProGamer Racing Wheel\",\"GameSphere VR Headset\"]}},{\"customer_msg\":\"What could be a good present for my videographer friend?\",\"ideal_answer\":{\"Cameras and Camcorders\":[\"FotoSnap DSLR Camera\",\"ActionCam 4K\",\"FotoSnap Mirrorless Camera\",\"ZoomMaster Camcorder\",\"FotoSnap Instant Camera\"]}},{\"customer_msg\":\"I would like a hot tub time machine.\",\"ideal_answer\":[]}]";
        JSONArray array = JSONUtil.parseArray(testCase);

        JSONObject cc = (JSONObject) array.get(2);
        String customerMsg = cc.getStr("customer_msg");
        JSONObject idealAnswer = (JSONObject) cc.getJSONObject("ideal_answer");

        String result = findCategoryAndProductV2(customerMsg, allProducts);

        log.info("用户提问: {}", customerMsg);
        log.info("标准答案: {}", idealAnswer);

        evalResponseWithIdeal(result, idealAnswer);

十,在所有测试用例上运行评估,并计算正确的用例比例

注意:如果任何 API 调用超时,将无法运行完成

        //测试用例
        String testCase = "[{\"customer_msg\":\"Which TV can I buy if I'm on a budget?\",\"ideal_answer\":{\"Televisions and Home Theater Systems\":[\"CineView 4K TV\",\"SoundMax Home Theater\",\"CineView 8K TV\",\"SoundMax Soundbar\",\"CineView OLED TV\"]}},{\"customer_msg\":\"I need a charger for my smartphone\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"MobiTech PowerCase\",\"MobiTech Wireless Charger\",\"SmartX EarBuds\"]}},{\"customer_msg\":\"What computers do you have?\",\"ideal_answer\":{\"Computers and Laptops\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}},{\"customer_msg\":\"tell me about the smartx pro phone and the fotosnap camera, the dslr one. Also, what TVs do you have?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX ProPhone\"],\"Cameras and Camcorders\":[\"FotoSnap DSLR Camera\"],\"Televisions and Home Theater Systems\":[\"CineView 4K TV\",\"SoundMax Home Theater\",\"CineView 8K TV\",\"SoundMax Soundbar\",\"CineView OLED TV\"]}},{\"customer_msg\":\"tell me about the CineView TV, the 8K one, Gamesphere console, the X one.\\nI'm on a budget, what computers do you have?\",\"ideal_answer\":{\"Televisions and Home Theater Systems\":[\"CineView 8K TV\"],\"Gaming Consoles and Accessories\":[\"GameSphere X\"],\"Computers and Laptops\":[\"TechPro Ultrabook\",\"BlueWave Gaming Laptop\",\"PowerLite Convertible\",\"TechPro Desktop\",\"BlueWave Chromebook\"]}},{\"customer_msg\":\"What smartphones do you have?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX ProPhone\",\"MobiTech PowerCase\",\"SmartX MiniPhone\",\"MobiTech Wireless Charger\",\"SmartX EarBuds\"]}},{\"customer_msg\":\"I'm on a budget. Can you recommend some smartphones to me?\",\"ideal_answer\":{\"Smartphones and Accessories\":[\"SmartX EarBuds\",\"SmartX MiniPhone\",\"MobiTech PowerCase\",\"SmartX ProPhone\",\"MobiTech Wireless Charger\"]}},{\"customer_msg\":\"What Gaming consoles would be good for my friend who is into racing games?\",\"ideal_answer\":{\"Gaming Consoles and Accessories\":[\"GameSphere X\",\"ProGamer Controller\",\"GameSphere Y\",\"ProGamer Racing Wheel\",\"GameSphere VR Headset\"]}},{\"customer_msg\":\"What could be a good present for my videographer friend?\",\"ideal_answer\":{\"Cameras and Camcorders\":[\"FotoSnap DSLR Camera\",\"ActionCam 4K\",\"FotoSnap Mirrorless Camera\",\"ZoomMaster Camcorder\",\"FotoSnap Instant Camera\"]}},{\"customer_msg\":\"I would like a hot tub time machine.\",\"ideal_answer\":[]}]";
        JSONArray array = JSONUtil.parseArray(testCase);

        double sumScore = 0d;
        for (Object o : array) {
            JSONObject cc = (JSONObject) o;
            String customerMsg = cc.getStr("customer_msg");
            JSONObject idealAnswer = (JSONObject) cc.getJSONObject("ideal_answer");
            String result = findCategoryAndProductV2(customerMsg, allProducts);

            double score = evalResponseWithIdeal(result, idealAnswer);

            log.info("score: {}\n", score);
            sumScore = sumScore + score;
        }

        log.info("正确比例: {}/{}", array.size(), sumScore / array.size());

使用 Prompt 构建应用程序的工作流程与使用监督学习构建应用程序的工作流程非常不同。

因此,我们认为这是需要记住的一件好事,当您正在构建监督学习模型时,会感觉到迭代速度快了很多。
如果您并未亲身体验,可能会惊叹于仅有手动构建得极少样本,就可以产生高效的评估方法。您可能会认为,仅有 10 个样本是不具备统计意义的。但当您真正运用这种方式时,您可能会对向开发集中添加一些复杂样本所带来的效果提升感到惊讶。

这对于帮助您和您的团队找到有效的 Prompt 和有效的系统非常有帮助。
在本章中,输出可以被定量评估,就像有一个期望的输出一样,您可以判断它是否给出了这个期望的输出。

在下一章中,我们将探讨如何在更加模糊的情况下评估我们的输出。即正确答案可能不那么明确的情况。

Java快速转换到大模型开发:
配套课程的所有代码已经发布在:https://github.com/Starcloud-Cloud/java-langchain
课程合作请留言

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

df007df

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值