Introduction

When we compare coding language models (LLMs) with natural language (NL) language models, such as Llama3 vs. CodeLlama, some distinctions are readily apparent. In fact, coding LLMs are significantly more challenging to develop and work with than NL LLMs, for the following reasons:

  1. Precision and Syntax Sensitivity: Code is a formal language with strict syntax rules and structures. A minor error, such as a misplaced bracket or a missing semicolon, can lead to errors that prevent the code from functioning. This requires the LLM to have a high degree of precision and an understanding of syntactic correctness, which is generally more stringent than the flexibility seen in natural language.
  2. Execution Semantics: Code not only needs to be syntactically correct, but it also has to be semantically valid—that is, it needs to perform the function it is supposed to do. Unlike natural language, where the meaning can be implicitly interpreted and still understood even if somewhat imprecisely expressed, code execution needs to yield very specific outcomes. If a code LLM gets the semantics wrong, the program might not work at all or might perform unintended operations.
  3. Context and Dependency Management: Code often involves multiple files or modules that interact with each other, and changes in one part can affect others. Understanding and managing these dependencies and contexts is crucial for a coding LLM, which adds a layer of complexity compared to handling standalone text in natural language.
  4. Variety of Programming Languages: There are many programming languages, each with its own syntax, idioms, and usage contexts. A coding LLM needs to potentially handle multiple languages, understand their unique characteristics, and switch contexts appropriately. This is analogous to a multilingual NL LLM but often with less tolerance for error.
  5. Data Availability and Diversity: While there is a vast amount of natural language data available from books, websites, and other sources, high-quality, annotated programming data can be more limited. Code also lacks the redundancy and variability of natural languages, which can make training more difficult.
  6. Understanding the Underlying Logic: Writing effective code involves understanding algorithms and logic. This requires not only language understanding but also computational thinking, which adds an additional layer of complexity for LLMs designed to generate or interpret code.
  7. Integration and Testing Requirements: For a coding LLM, the generated code often needs to be tested to ensure it works as intended. This involves integrating with software development environments and tools, which is more complex than the generally self-contained process of generating text in natural language.

Each of these aspects makes the development and effective operation of coding LLMs a challenging task, often requiring more specialized knowledge and sophisticated techniques compared to natural language LLMs.

The deployment and life-cycle management of an LLM-serving API is challenging because of the autoregressive nature of the transformer-based generation algorithm. For code LLMs, the problem is more acute for the following reasons:

  1. Real-Time Performance: In many applications, coding LLMs are expected to provide real-time assistance to developers, such as for code completion, debugging, or even generating code snippets on the fly. Meeting these performance expectations requires highly efficient models and infrastructure to minimize latency, which can be technically challenging and resource-intensive.
  2. Scalability and Resource Management: Code generation tasks can be computationally expensive, especially when handling complex codebases or generating lengthy code outputs. Efficiently scaling the service to handle multiple concurrent users without degrading performance demands sophisticated resource management and possibly significant computational resources. In addition, the attention computation at inference time scales quadratically with the input sequence length, and input sequences for code models are often significantly longer than for NL models (a back-of-envelope illustration follows this list).
  3. Context Management: Effective code generation often requires understanding not just the immediate code snippet but also broader project contexts, such as libraries used, the overall software architecture, and even the specific project's coding standards. Maintaining and accessing this contextual information in a way that is both accurate and efficient adds complexity to the serving infrastructure.
  4. Security Concerns: Serving a coding LLM involves potential security risks, not only in terms of the security of the model itself (e.g., preventing unauthorized access) but also in ensuring that the code it generates does not introduce security vulnerabilities into user projects. Ensuring both model and output security requires rigorous security measures and constant vigilance.
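
As a rough illustration of the quadratic attention cost mentioned in point 2, the back-of-envelope sketch below estimates the attention-only FLOPs of a single forward pass at different sequence lengths. The hidden size and layer count are illustrative placeholders, not measurements of any particular model.

# Back-of-envelope estimate of attention cost vs. sequence length.
# The constants below (hidden size, layer count) are illustrative only.

def attention_flops(seq_len: int, hidden_size: int = 4096, num_layers: int = 32) -> float:
    """Rough FLOPs for the attention score/value computation of one forward pass.

    QK^T and the attention-weighted sum over V each cost ~2 * seq_len^2 * hidden_size
    multiply-adds per layer; projections and MLPs are ignored here.
    """
    per_layer = 2 * 2 * (seq_len ** 2) * hidden_size
    return num_layers * per_layer

for n in (512, 2048, 8192):
    print(f"seq_len={n:5d}  ~{attention_flops(n) / 1e12:.2f} TFLOPs (attention only)")

Doubling the sequence length roughly quadruples this cost, which is why long code inputs stress the serving infrastructure more than typical NL prompts.
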
In summary, code LLMs are much harder to train and deploy for inference than NL LLMs. In this article, we cover API benchmarking for a code generation service developed entirely on Nutanix infrastructure.

Code Generation Workflow

Figure 1: Workflow of an LLM-assisted code generation system

Figure 1 shows an LLM-assisted code generation workflow. It combines a context and a prompt through a prompt template to generate the input sequence for a large language model (LLM). The LLM then generates the output, which is passed to the evaluation system. If the output is not satisfactory, the user can revise the prompt, the prompt template, or the LLM used. Table 1 presents the taxonomy for the LLM-assisted code generation workflow.

Term: Prompt
Description: Instruction to an LLM
Example: "Write unit tests for the following function."

Term: Context
Description: Code body on which the instruction is executed
Example:

def two_sum(nums, target):
    hash_map = {}

    for index, num in enumerate(nums):
        difference = target - num
        if difference in hash_map:
            return [hash_map[difference], index]
        hash_map[num] = index

    return None

Term: Prompt Template
Description: Template to combine prompt and context
Example:

Context:

Response:

Term: Input
Description: A combination of prompt and context through the prompt template
Example: The prompt and the two_sum context above, combined through the prompt template.

Term: Output
Description: Code generated by the LLM for the given input
Example:

import unittest

class TestTwoSum(unittest.TestCase):

    def test_two_sum_normal(self):
        self.assertEqual(two_sum([2, 7, 11, 15], 9), [0, 1])

    def test_two_sum_no_solution(self):
        self.assertIsNone(two_sum([1, 2, 3, 4], 10))

    def test_two_sum_negative_numbers(self):
        self.assertEqual(two_sum([-3, 4, 3, 90], 0), [0, 2])

    def test_two_sum_same_element_twice(self):
        self.assertEqual(two_sum([3, 3], 6), [0, 1])

    def test_two_sum_one_element(self):
        self.assertIsNone(two_sum([3], 3))

    def test_two_sum_empty_list(self):
        self.assertIsNone(two_sum([], 3))

Term: Evaluation
Description: Accuracy assessment by a subject matter expert
Example: Provide feedback on the quality of the generated output and experiment with the prompt, prompt template, and/or LLM for a given context.

Table 1: Taxonomy for the LLM-assisted code generation workflow
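
To make the taxonomy concrete, here is a minimal sketch of how the prompt, context, and prompt template in Table 1 might be combined into the model input. The template text and function name are hypothetical, not the exact template used in this benchmark.

# Illustrative sketch of the prompt/context/template assembly in Table 1.
# The template text and helper name are hypothetical.

PROMPT_TEMPLATE = """{prompt}

Context:
{context}

Response:
"""

def build_input(prompt: str, context: str) -> str:
    """Combine a prompt and a context through the prompt template."""
    return PROMPT_TEMPLATE.format(prompt=prompt, context=context)

prompt = "Write unit tests for the following function."
context = '''def two_sum(nums, target):
    hash_map = {}
    for index, num in enumerate(nums):
        difference = target - num
        if difference in hash_map:
            return [hash_map[difference], index]
        hash_map[num] = index
    return None
'''

model_input = build_input(prompt, context)
print(model_input)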

Nutanix Cloud Platform

At Nutanix, we are dedicated to enabling customers to build and deploy intelligent applications anywhere—edge, core data centers, service provider infrastructure, and public clouds. Figure 2 shows how AI/ML is integrated into the core Nutanix® infrastructure layer.

Figure 2: AI stack running on the cloud-native infrastructure stack of NCP. The stack provides holistic integration between the supporting cloud-native infrastructure layer, including the chip layer, followed by the virtual machine layer and the supporting library/tooling layer, and the AI stack layer, including foundation models (different variants of transformers) and task-specific AI app layers.

As shown in Figure 2, the App layer runs on top of the infrastructure layer. The infrastructure layer can be deployed in two steps, starting with Prism Element™ login followed by VM resource configuration. Figure 3 shows the UI for the Prism Element controller.

Figure 3: The UI showing the setup for a Prism Element on which the transformer model for this article was deployed. It shows the AHV® hypervisor summary, storage summary, VM summary, hardware summary, monitoring for cluster-wide controller IOPS, monitoring for cluster-wide controller I/O bandwidth, monitoring for cluster-wide controller latency, cluster CPU usage, cluster memory usage, granular health indicators, and data resiliency status.

After logging into Prism Element, we create a virtual machine (VM) hosted on our Nutanix AHV® cluster. As shown in Figure 4, the VM has the following resource configuration: Ubuntu® 22.04 operating system, 16 single-core vCPUs, 64 GB of RAM, and an NVIDIA® A100 Tensor Core passthrough GPU with 40 GB of memory. The GPU runs the NVIDIA RTX 15.0 driver for Ubuntu (NVIDIA-Linux-x86_64-525.60.13-grid.run). Large deep learning models with transformer architectures require GPUs or other compute accelerators with high memory bandwidth and large register and L1 memory capacity.
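
As a quick sanity check that the passthrough GPU is visible inside the VM, a minimal sketch like the following can be run, assuming a CUDA-enabled PyTorch build is installed; the exact output depends on the driver and GPU configuration.

# Quick sanity check that the passthrough GPU is visible inside the VM.
# Assumes a CUDA-enabled PyTorch build; output varies by setup.
import torch

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    print(f"GPU: {device.name}, {device.total_memory / 1024**3:.0f} GiB")
else:
    print("No CUDA device visible; check the GPU passthrough and driver install.")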

Figure 4: The VM resource configuration UI pane on Nutanix Prism Element. As shown, it helps a user configure the number of vCPUs, the number of cores per vCPU, the memory size (GiB), and the GPU choice. We used an NVIDIA A100 80G for this article.

The NVIDIA A100 Tensor Core GPU is designed to power the world’s highest-performing elastic datacenters for AI, data analytics, and HPC. Powered by the NVIDIA Ampere™ architecture, A100 is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands.

To peek into the detailed features of the A100 GPU, we run the `nvidia-smi` command, a command-line utility built on top of the NVIDIA Management Library (NVML) and intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 5. It shows the driver version to be 515.86.01 and the CUDA version to be 11.7. Figure 5 also shows several critical features of the A100 GPU we used. The details of these features are described in Table 2.

Figure 5: Output of `nvidia-smi` for the underlying A100 GPU

Feature | Value | Description
GPU | 0 | GPU index
Name | NVIDIA A100 | GPU name
Temp | 34C | Core GPU temperature
Persistence-M | On | Persistence mode
Pwr: Usage/Cap | 36W / 250W | GPU power usage and its capacity
Bus Id | 00000000:00:06.0 | domain:bus:device.function
Disp. A | Off | Display active
Memory-Usage | 25939MiB / 40960MiB | Memory allocated out of total memory
Volatile Uncorr. ECC | 0 | Counter of uncorrectable ECC memory errors
GPU-Util | 0% | GPU utilization
Compute M. | Default | Compute mode
MIG M. | Disabled | Multi-Instance GPU mode

Table 2: Description of the key features of the underlying A100 GPU.
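
The same fields summarized in Table 2 can also be read programmatically through NVML. The sketch below assumes the `pynvml` Python bindings are installed (e.g., via the nvidia-ml-py package); depending on the bindings version, some values such as the device name may be returned as bytes rather than strings.

# Reading a few of the Table 2 fields programmatically via NVML.
# Assumes the pynvml bindings are installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)            # GPU index 0

name = pynvml.nvmlDeviceGetName(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

print(f"Name: {name}")
print(f"Temp: {temp} C, Power: {power:.0f} W")
print(f"Memory: {mem.used / 1024**2:.0f} MiB / {mem.total / 1024**2:.0f} MiB")
print(f"GPU-Util: {util.gpu}%")

pynvml.nvmlShutdown()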

We aim to measure the impact of code size, input token count, output token count, and different code generation tasks on code generation latency across a large sample size. For the benchmarking, it is important to choose the right code dataset. There are benchmarking datasets such as HumanEval, MBPP, APPS, MultiPL-E, and GSM8K. For this article, we chose the Mostly Basic Programming Problems (MBPP) dataset (https://arxiv.org/abs/2108.07732), which consists of 974 programming tasks designed to be solvable by entry-level programmers. For the code LLM API, we used CodeLlama-7b-Instruct (https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf). The API server was implemented using FastAPI (https://fastapi.tiangolo.com/).
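
For illustration, the sketch below shows a minimal FastAPI endpoint wrapping CodeLlama-7b-Instruct through the Hugging Face transformers pipeline. This is not the exact server used for this benchmark; the endpoint path, request schema, and generation parameters are illustrative assumptions.

# Minimal sketch of a code generation endpoint (illustrative, not the exact
# server used for this benchmark).
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.float16,   # half precision to fit comfortably on the A100
    device=0,
)

class GenerationRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerationRequest):
    outputs = generator(
        req.prompt,
        max_new_tokens=req.max_new_tokens,
        do_sample=False,
        return_full_text=False,
    )
    return {"completion": outputs[0]["generated_text"]}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000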

The MBPP dataset has the structure shown in Table 3. It has 974 rows and six features: task_id, text, code, test_list, test_setup_code, and challenge_test_list.

Structure
DatasetDict({
    test: Dataset({
        features: [
            'task_id',
            'text',
            'code',
            'test_list',
            'test_setup_code',
            'challenge_test_list'],
        num_rows: 974
    })
})
Example
{
    'task_id': 1,
    'text': 'Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].',
    'code': 'R = 3\r\nC = 3\r\ndef min_cost(cost, m, n): \r\n\ttc = [[0 for x in range(C)] for x in range(R)] \r\n\ttc[0][0] = cost[0][0] \r\n\tfor i in range(1, m+1): \r\n\t\ttc[i][0] = tc[i-1][0] + cost[i][0] \r\n\tfor j in range(1, n+1): \r\n\t\ttc[0][j] = tc[0][j-1] + cost[0][j] \r\n\tfor i in range(1, m+1): \r\n\t\tfor j in range(1, n+1): \r\n\t\t\ttc[i][j] = min(tc[i-1][j-1], tc[i-1][j], tc[i][j-1]) + cost[i][j] \r\n\treturn tc[m][n]',
    'test_list': [
        'assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8',
        'assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12',
        'assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16'
    ],
    'test_setup_code': '',
    'challenge_test_list': []
}

Table 3: MBPP Dataset Structure
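
The dataset can be loaded with the Hugging Face `datasets` library, as in the minimal sketch below; the split names and sizes may differ slightly depending on the dataset configuration.

# Loading MBPP with the Hugging Face datasets library; split names and
# sizes may differ slightly depending on the dataset configuration.
from datasets import load_dataset

mbpp = load_dataset("mbpp")
print(mbpp)

sample = mbpp["test"][0]
print(sample["text"])        # natural-language task description
print(sample["code"])        # reference solution
print(sample["test_list"])   # assert statements used as tests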

  • We used CodeBLEU (https://arxiv.org/abs/2009.10297) to measure the fidelity of code generation and test generation with respect to the reference data provided in the MBPP dataset. Unlike the traditional BLEU score, CodeBLEU is designed specifically for code.
  • We measured the latency of each request and compared it with the corresponding input/output token counts (a measurement sketch follows this list). Specifically, we measured the following metrics:
    • Time to First Byte (TTFB): An indicator of the responsiveness of the API. TTFB measures the duration from the client making an HTTP request to the first byte being received by the client.
    • Time to Last Byte (TTLB): An indicator of the responsiveness of the API. TTLB measures the duration from the client making an HTTP request to the last byte being received by the client.
    • Input Token Count: The number of tokens in the API call query.
    • Output Token Count: The number of tokens in the API call response.
  • We investigated whether CodeBLEU is related to the TTFB, TTLB, input token count, and output token count.
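
The sketch below illustrates how TTFB, TTLB, and token counts can be measured for a single request, assuming the API streams its response (against a non-streaming endpoint, TTFB and TTLB effectively coincide). The endpoint URL and payload shape are hypothetical, and the CodeLlama tokenizer is used for token counting so that input/output lengths match what the model sees.

# Sketch of measuring TTFB/TTLB and token counts for a single streaming request.
# The endpoint URL and payload shape are hypothetical.
import time
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
API_URL = "http://localhost:8000/generate"           # hypothetical endpoint

def measure(prompt: str) -> dict:
    start = time.perf_counter()
    ttfb = None
    chunks = []
    with requests.post(API_URL, json={"prompt": prompt}, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=None):
            if ttfb is None:
                ttfb = time.perf_counter() - start   # first byte received
            chunks.append(chunk)
    ttlb = time.perf_counter() - start               # last byte received
    output_text = b"".join(chunks).decode("utf-8", errors="ignore")
    return {
        "ttfb_s": ttfb,
        "ttlb_s": ttlb,
        "input_tokens": len(tokenizer.encode(prompt)),
        "output_tokens": len(tokenizer.encode(output_text)),
    }

print(measure("Write a python function to find the minimum cost path in a cost matrix."))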

Results

Fidelity Benchmarking

Code Generation Use Case

Figure 6 shows the CodeBLEU scores for code generation tasks in the MBPP dataset. The scores are consistent with the reasonable pass@1 accuracy reported in the CodeLlama paper (https://arxiv.org/abs/2308.12950).

Figure 6: CodeBLEU scores for code generation tasks in the MBPP dataset

Test Generation Use Case

Figure 7 shows the CodeBLEU scores for test generation tasks in the MBPP dataset. The scores are consistent with the reasonable pass@1 accuracy reported in the CodeLlama paper (https://arxiv.org/abs/2308.12950).

Figure 7: CodeBLEU scores for test generation tasks in the MBPP dataset

Latency and Token Count Benchmarking

Code Generation

In this use case, a text string is passed in the API query and a code body is returned as the API response. 

Figure 8 shows the correlation matrix among {TTFB, TTLB, input token count, output token count, CodeBLEU score} for the code generation use case. We can make the following observations from Figure 8:

  • TTLB and output token count have a fairly high correlation score of 0.36 compared to other pairs.
  • Input token count (text) and output token count (code) have a correlation score of 0.18, which is expected: we send a text body as the input and receive a code body in the response, and in most cases a longer input text will yield a longer code block.
  • The CodeBLEU score has hardly any correlation with the other factors.

Figure 8: Correlation matrix for code generation among different fields: time to first byte (TTFB) (s), time to last byte (TTLB) (s), input token count, output token count, and CodeBLEU.

From Figure 8, the relationship between output token count and TTLB merits further scrutiny. Figure 9 shows the jointplot between time to last byte (TTLB) (s) and output token count. It clearly shows that TTLB increases with output token count. This proportionality can be explained by the fact that the LLM generates one token at a time.

Figure 9: Time to Last Byte (TTLB) vs Output Token Count for code generation
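
For reference, a correlation matrix and jointplot of this kind can be produced with pandas and seaborn, as in the sketch below; the results file and column names are hypothetical stand-ins for the per-request measurements.

# Sketch of how a correlation matrix (Figure 8) and a TTLB-vs-output-token
# jointplot (Figure 9) can be produced; file and column names are hypothetical.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("code_generation_benchmark.csv")    # hypothetical results file
cols = ["ttfb_s", "ttlb_s", "input_tokens", "output_tokens", "codebleu"]

corr = df[cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix (code generation)")
plt.show()

sns.jointplot(data=df, x="output_tokens", y="ttlb_s", kind="reg")
plt.show()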

Test Generation

In this use case, a code body is passed in the API query and a code body is returned as the API response. 

Figure 10 shows the correlation matrix among {TTFB, TTLB, input token count, output token count, code line count, CodeBLEU score} for the test generation use case. We can make the following observations from Figure 10:

  • TTLB and output token count have a fairly high correlation score of 0.86 compared to other pairs.
  • For test generation, we have an additional field, code line count, which is, unsurprisingly, highly correlated with the input token count (correlation score of 0.92).
  • Input token count (code) and output token count (test code) have a correlation score of 0.15, which is expected: a longer code body generally requires a longer test code body.

Figure 10: Correlation matrix for test generation among different fields: time to first byte (TTFB) (s), time to last byte (TTLB) (s), input token count, output token count, code line count, and CodeBLEU.

As with code generation, the relationship between output token count and TTLB in Figure 10 merits further scrutiny. Figure 11 shows the jointplot between time to last byte (TTLB) (s) and output token count. It clearly shows that TTLB increases with output token count, again explained by the fact that the LLM generates one token at a time.

Figure 11: Time to Last Byte (TTLB) vs Output Token Count for test generation

Empirical Best Practices/Insights for Code Generation and Unit Test Generation

  • For both code generation and test generation use cases, the response time varies proportionally with the output token count (a simple way to estimate the per-token latency from this relationship is sketched after this list).
  • CodeBLEU score remains relatively invariant with input/output token count. 
  • The response times for both use cases typically fall between 0 and 20 s.
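
Given the near-linear TTLB vs. output token count relationship, the per-token generation latency can be estimated with a simple linear fit, as in the sketch below; the results file and column names are the same hypothetical ones used in the correlation sketch above.

# Estimating the per-token generation latency from the TTLB vs. output-token
# relationship with a linear fit; file and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("code_generation_benchmark.csv")    # hypothetical results file
slope, intercept = np.polyfit(df["output_tokens"], df["ttlb_s"], deg=1)
print(f"~{slope * 1000:.1f} ms per generated token, ~{intercept:.2f} s fixed overhead")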

Impact

GitHub Copilot has delivered massive productivity gains, as high as 55%, for developers across verticals (Link). In fact, its economic impact is poised to grow beyond $1.5T in the next few years (Link). With the rapid advancement of the HuggingFace and Llama ecosystems, open large language models (LLMs) are progressing significantly, making LLM application development accessible to typical enterprises.

In this macroeconomic climate, the potential benefit of developing open-source LLM-based code assistants is enormous, but evaluating these code assistants is often challenging because of infrastructure management, data privacy, and dependency management concerns.

As you contemplate how AI will change your business, Nutanix GPT-in-a-Box 2.0 makes getting started with GenAI a snap, letting you deploy real use cases and solutions on standard hardware without the need for a special architecture. As demonstrated, LLMs change fast, and with Nutanix you can stay ahead of the curve with a secure, full-stack platform to run GenAI data and apps anywhere, enhancing developer productivity.

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

This post may contain express and implied forward-looking statements, which are not historical facts and are instead based on our current expectations, estimates and beliefs. The accuracy of such statements involves risks and uncertainties and depends upon future events, including those that may be beyond our control, and actual results may differ materially and adversely from those anticipated or implied by such statements. Any forward-looking statements included herein speak only as of the date hereof and, except as required by law, we assume no obligation to update or otherwise revise any of such forward-looking statements to reflect subsequent events or circumstances.