Software Testing | Test Development | Serving a codeBERT Classification Model with TorchServe

Background

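I have recently been working on code clone detection. Cloned code is common in software development: early on it can raise productivity, but as a system grows it also reduces stability, spreads bugs, and makes maintenance harder. For this work I trained a codeBERT-based classification model whose task is to decide whether two given function snippets are similar. TorchServe is a tool for deploying PyTorch models, and this post summarizes how I used it to build a clone detection service.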

Introduction to TorchServe

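TorchServe is a tool for serving PyTorch models. It was developed jointly by Facebook and AWS and is part of the PyTorch open-source project. It helps users move models into production faster and provides a low-latency inference API, hot-swapping of models, multi-model serving, model versioning for A/B testing, and monitoring metrics. The TorchServe architecture is shown in the figure below: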

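The TorchServe framework has four main parts: the Frontend handles requests and responses; the Worker Processes are a group of running model instances, whose number can be set through the management API; the Model Store is where models are stored and loaded from; and the Backend manages the Worker Processes.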
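As a minimal sketch of setting the worker count through the management API: the snippet below only builds the request and does not send it. It assumes the management API is on its default address `127.0.0.1:8081` and that the model is registered as `BERTClass` (the name used later in this post); `build_scale_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: TorchServe's management API listens on its default port 8081.
MANAGEMENT_URL = "http://127.0.0.1:8081"


def build_scale_request(model_name: str, min_worker: int) -> Request:
    """Build (but do not send) a PUT request that sets the minimum worker count."""
    query = urlencode({"min_worker": min_worker, "synchronous": "true"})
    return Request(f"{MANAGEMENT_URL}/models/{model_name}?{query}", method="PUT")


req = build_scale_request("BERTClass", 2)
print(req.get_method(), req.full_url)
```

To apply it against a running server, pass `req` to `urllib.request.urlopen` and read the response.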

What is codeBERT?

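codeBERT is a pre-trained language model released by Microsoft and Harbin Institute of Technology. The original BERT model targets natural language, whereas codeBERT targets both natural language and programming languages. It can handle Python, Java, JavaScript, and other languages, and it captures the semantic relationship between natural language and code. It can be used for natural-language code search, code documentation generation, bug detection, and code clone detection. We can also use CodeBERT directly to extract token embeddings for source code and build downstream tasks on top of them.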

Environment setup

Installing TorchServe

pip install torchserve
pip install torch-model-archiver

Writing the Handler class

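The Handler is a class we write ourselves, and TorchServe executes it at runtime. Its main job is to take the input data, run it through a series of processing steps, and return the result; it is also responsible for model initialization. The Handler class inherits from BaseHandler, and we override its initialize, preprocess, and inference methods.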
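The way these methods fit together can be sketched with a simplified stand-in class. This is not the real BaseHandler; the class name, the toy "model", and the request shape are all assumptions made for illustration. It only shows the order in which a request batch flows through initialize, preprocess, inference, and postprocess.

```python
class SketchHandler:
    """Simplified stand-in for a TorchServe handler (not the real BaseHandler)."""

    def __init__(self):
        self.initialized = False

    def initialize(self, ctx):
        # The real method would load the model and tokenizer here.
        self.initialized = True

    def preprocess(self, requests):
        # Turn each raw request into a model input (here: the length of the text).
        return [len(r["body"]) for r in requests]

    def inference(self, batch):
        # Toy "model": label 1 if the input is longer than 10 characters.
        return [1 if n > 10 else 0 for n in batch]

    def postprocess(self, outputs):
        # One response item per request in the batch.
        return [str(o) for o in outputs]

    def handle(self, requests, ctx):
        if not self.initialized:
            self.initialize(ctx)
        return self.postprocess(self.inference(self.preprocess(requests)))


handler = SketchHandler()
print(handler.handle([{"body": "int a;"}, {"body": "void f() { return; }"}], ctx=None))
```

The point of the split is that only preprocess and postprocess need to know about the request format, while inference only sees tensors (or, in this toy, plain numbers).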

  1. The initialize method
import logging
import os
from abc import ABC

import torch
from ts.torch_handler.base_handler import BaseHandler

# Model and MODEL_CLASSES come from the training code (not shown in this post).
logger = logging.getLogger(__name__)

class CloneDetectionHandler(BaseHandler, ABC):
    def __init__(self):
        super(CloneDetectionHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        # Locate the serialized weights inside the unpacked model archive.
        self.manifest = ctx.manifest
        logger.info(self.manifest)
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        serialized_file = self.manifest['model']['serializedFile']
        model_pt_path = os.path.join(model_dir, serialized_file)
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
        # Rebuild the model structure, then load the fine-tuned weights into it.
        config_class, model_class, tokenizer_class = MODEL_CLASSES['roberta']
        config = config_class.from_pretrained("microsoft/codebert-base")
        config.num_labels = 2
        self.tokenizer = tokenizer_class.from_pretrained("microsoft/codebert-base")
        self.bert = model_class(config)
        self.model = Model(self.bert, config, self.tokenizer)
        # map_location lets weights saved on a GPU load on a CPU-only host.
        self.model.load_state_dict(torch.load(model_pt_path, map_location=self.device))
        self.model.to(self.device)
        self.model.eval()
        logger.info('Clone codeBert model from path {0} loaded successfully'.format(model_dir))
        self.initialized = True
  2. The preprocess method
def preprocess(self, requests):
    # Method of CloneDetectionHandler; needs `import json` at the top of handler.py.
    input_batch = None
    for idx, data in enumerate(requests):
        input_text = data.get("data")
        if input_text is None:
            input_text = data.get("body")
        logger.info("Received codes:'%s'", input_text)
        # A raw byte payload must be decoded and parsed before it can be indexed by key.
        if isinstance(input_text, (bytes, bytearray)):
            input_text = json.loads(input_text.decode('utf-8'))
        code1 = input_text['code1']
        code2 = input_text['code2']
        # Collapse all whitespace so formatting differences do not affect the tokens.
        code1 = " ".join(code1.split())
        code2 = " ".join(code2.split())
        logger.info("code1:'%s'", code1)
        logger.info("code2:'%s'", code2)
        # padding/truncation replace the deprecated pad_to_max_length argument.
        inputs = self.tokenizer.encode_plus(code1, code2, max_length=512, padding="max_length", truncation=True, add_special_tokens=True, return_tensors="pt")
        input_ids = inputs["input_ids"].to(self.device)
        # Stack each request's ids along the batch dimension.
        if input_batch is None:
            input_batch = input_ids
        else:
            input_batch = torch.cat((input_batch, input_ids), 0)
    return input_batch
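To make the preprocessing step more concrete, here is a toy sketch of what happens to a code pair before it reaches the model: whitespace is collapsed, the two snippets are laid out RoBERTa-style as `<s> A </s> </s> B </s>`, and the sequence is cut or padded to a fixed length. Whitespace splitting stands in for the real BPE tokenizer, and the trimming policy and the names `normalize` and `build_pair` are assumptions for illustration, not the tokenizer's actual implementation.

```python
def normalize(code: str) -> str:
    """Collapse all runs of whitespace (including newlines) into single spaces."""
    return " ".join(code.split())


def build_pair(code1: str, code2: str, max_length: int = 16) -> list:
    """Toy pair layout: <s> A </s> </s> B </s>, then padded to max_length."""
    a, b = normalize(code1).split(), normalize(code2).split()
    budget = max_length - 4  # room left after the four special tokens
    # Trim the longer side first until the pair fits (a simple stand-in policy).
    while len(a) + len(b) > budget:
        (a if len(a) >= len(b) else b).pop()
    tokens = ["<s>", *a, "</s>", "</s>", *b, "</s>"]
    return tokens + ["<pad>"] * (max_length - len(tokens))


print(normalize("int  add(int a,\n int b)"))
print(build_pair("a b", "c", max_length=8))
```

Because every pair ends up with the same length, the per-request tensors can be concatenated into one batch, which is what the `torch.cat` call in preprocess relies on.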
  3. The inference method
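def inference(self, input_batch):
    inferences = []
    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        logits = self.model(input_batch)
    # Assumes logits[0] has shape (batch_size, num_labels).
    num_rows = logits[0].shape[0]
    for i in range(num_rows):
        out = logits[0][i].unsqueeze(0)
        # Take the argmax over the label dimension of the (1, num_labels) row.
        y_hat = out.argmax(1).item()
        predicted_idx = str(y_hat)
        inferences.append(predicted_idx)
    return inferences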
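The core of the inference step is a row-wise argmax that turns each pair's scores into a class index, returned as a string. The sketch below reproduces that logic on plain Python lists so it can be followed without a model. The `LABELS` mapping is hypothetical: the handler above returns only "0" or "1", and this post states that 1 means a clone pair, so the name for 0 is an assumption.

```python
# Hypothetical label names; the handler itself returns the strings "0" and "1".
LABELS = {"0": "not_clone", "1": "clone"}


def argmax(row):
    """Index of the largest score in one row."""
    return max(range(len(row)), key=row.__getitem__)


def to_predictions(logits):
    """One predicted class index (as a string) per row of scores."""
    return [str(argmax(row)) for row in logits]


scores = [[2.0, -1.0], [0.1, 0.9]]  # made-up scores for two code pairs
preds = to_predictions(scores)
print(preds, [LABELS[p] for p in preds])
```

Returning one item per input row matters because TorchServe matches response items to requests by position in the batch.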

Packaging the model

Use the torch-model-archiver tool to package the model parameter file together with the files it depends on. This generates a .mar file in the current directory.

torch-model-archiver --model-name BERTClass --version 1.0 \
    --serialized-file ./CloneDetection.bin \
    --model-file ./model.py \
    --handler ./handler.py
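The archive is written to the current directory, while the start command in this post reads from `./modelstore`. One way to line the two up is the archiver's `--export-path` option. The sketch below only assembles the command line; running it is left as a commented-out step. Treating `./modelstore` as the intended output directory is an assumption, and `archiver_command` is a hypothetical helper name.

```python
import shlex


def archiver_command(model_name, version, weights, model_file, handler, export_path):
    """Assemble the torch-model-archiver argument list without running it."""
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", weights,
        "--model-file", model_file,
        "--handler", handler,
        "--export-path", export_path,  # assumption: write the .mar into the model store
    ]


cmd = archiver_command("BERTClass", "1.0", "./CloneDetection.bin",
                       "./model.py", "./handler.py", "./modelstore")
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Alternatively, the generated .mar file can simply be moved into the model store directory by hand.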

Starting the service

torchserve --start --ncs --model-store ./modelstore --models BERTClass.mar
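The server takes a moment to load the model after this command returns, so a client may want to wait until it reports healthy before sending predictions. A minimal sketch using the inference API's `/ping` endpoint follows. It assumes the default inference address `127.0.0.1:8080`; the function names, retry count, and delay are hypothetical choices.

```python
import json
import time
from urllib.request import urlopen


def fetch_ping(base_url="http://127.0.0.1:8080"):
    """Call the /ping health endpoint and return the parsed JSON body."""
    with urlopen(f"{base_url}/ping", timeout=2) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(fetch=fetch_ping, attempts=10, delay=1.0):
    """Poll until the server reports a Healthy status or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "Healthy":
                return True
        except OSError:
            pass  # server is not accepting connections yet
        time.sleep(delay)
    return False

# Usage against a running server: ready = wait_until_healthy()
```

The `fetch` parameter is injectable so the polling logic can be exercised without a live server.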

Testing the service

import requests
import json
diff_codes = {
    "code1": "    private void loadProperties() {\n        if (properties == null) {\n            properties = new Properties();\n            try {\n                URL url = getClass().getResource(propsFile);\n                properties.load(url.openStream());\n            } catch (IOException ioe) {\n                ioe.printStackTrace();\n            }\n        }\n    }\n",
    "code2": "    public static void copyFile(File in, File out) throws IOException {\n        FileChannel inChannel = new FileInputStream(in).getChannel();\n        FileChannel outChannel = new FileOutputStream(out).getChannel();\n        try {\n            inChannel.transferTo(0, inChannel.size(), outChannel);\n        } catch (IOException e) {\n            throw e;\n        } finally {\n            if (inChannel != null) inChannel.close();\n            if (outChannel != null) outChannel.close();\n        }\n    }\n"
}
# First request: a pair of different (non-clone) functions.
res = requests.post('http://127.0.0.1:8080/predictions/BERTClass', json=diff_codes).text

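The second request sends a clone pair. The model predicts 1, meaning the two snippets are similar and form a clone pair. Code clones broadly fall into syntactic clones and semantic clones. This example shows a syntactic clone: a pair of snippets that are still the same code after function, class, and variable names are renamed and some fragments are added or removed.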

clone_codes = {
    "code1":"    public String kodetu(String testusoila) {\n        MessageDigest md = null;\n        try {\n            md = MessageDigest.getInstance(\"SHA\");\n            md.update(testusoila.getBytes(\"UTF-8\"));\n        } catch (NoSuchAlgorithmException e) {\n            new MezuLeiho(\"Ez da zifraketa algoritmoa aurkitu\", \"Ados\", \"Zifraketa Arazoa\", JOptionPane.ERROR_MESSAGE);\n            e.printStackTrace();\n        } catch (UnsupportedEncodingException e) {\n            new MezuLeiho(\"Errorea kodetzerakoan\", \"Ados\", \"Kodeketa Errorea\", JOptionPane.ERROR_MESSAGE);\n            e.printStackTrace();\n        }\n        byte raw[] = md.digest();\n        String hash = (new BASE64Encoder()).encode(raw);\n        return hash;\n    }\n",
    "code2":"    private StringBuffer encoder(String arg) {\n        if (arg == null) {\n            arg = \"\";\n        }\n        MessageDigest md5 = null;\n        try {\n            md5 = MessageDigest.getInstance(\"MD5\");\n            md5.update(arg.getBytes(SysConstant.charset));\n        } catch (Exception e) {\n            e.printStackTrace();\n        }\n        return toHex(md5.digest());\n    }\n"
}
res = requests.post('http://127.0.0.1:8080/predictions/BERTClass', json=clone_codes).text

Stopping the service

torchserve --stop

Summary

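This post walked through deploying a PyTorch model with TorchServe: first write the handler file, then package the model with torch-model-archiver, and finally start the service with torchserve. The overall deployment process is fairly simple.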
