Introduction

Reasoning models are the talk of the town right now, thanks to the release of DeepSeek R1. The trend started when OpenAI announced o1, which marked the beginning of the reasoning and test-time compute era.

LLMs can only go so far with training alone. When asked a question, LLMs don’t have the thinking ability that humans do. Chain of Thought (CoT) was one of the first techniques to explore giving LLMs that thinking power for better responses, and it gave rise to other prompting strategies such as Tree of Thoughts and Self-Consistency. However, all these techniques relied heavily on the user prompting the model correctly. What if we didn’t need to prompt the model at all? What if the model intuitively knew when to spend extra time thinking and when to be confident about its answer? Here comes the era of reasoning: let the model use a scratchpad to collect its thoughts and arrive at the final solution.

Nutanix Enterprise AI helps you deploy and manage these models on your own hardware with a few easy clicks. In this article, we will explore how to run models with such superhuman reasoning capabilities on Nutanix at a predictable cost while prioritizing privacy and security. We use the recently released DeepSeek-R1-Distill-Llama-70B (a Llama 70B model distilled from DeepSeek R1). Note that the steps are identical for smaller models such as DeepSeek-R1-Distill-Llama-8B or the very recently released Mistral Small 24B, with appropriate resource values.

Disclaimer: The reasoning models referenced in this article are not officially validated by Nutanix Enterprise AI (NAI) and are still undergoing internal testing.

Prerequisites

For this blog, we assume you have a running setup of Nutanix Enterprise AI on your Kubernetes® clusters. Refer to this Installation doc for more details. We will use DeepSeek-R1-Distill-Llama-70B, which DeepSeek released as part of their DeepSeek R1 paper. This model requires approximately 180 GB of combined GPU memory (to run at its full context length of 128K) and 170 GB of storage space. Alternatively, if you want something lighter, you can pick DeepSeek-R1-Distill-Llama-8B or Mistral Small 24B. Please refer to the table below for approximate resource requirements for these models; a rough back-of-the-envelope memory estimate follows the table.

Pro tip: Nutanix Enterprise AI will validate if a given hardware configuration can run your favourite models at your desired context lengths.

Model Name                    | Disk Space (GiB) | vCPUs (#) | RAM (GiB) | Total GPU RAM (GiB)
DeepSeek-R1-Distill-Llama-70B | 170              | 16        | 64        | 180
DeepSeek-R1-Distill-Llama-8B  | 50               | 8         | 16        | 24
Mistral Small 24B             | 100              | 12        | 24        | 50
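
As a rough sanity check on the GPU memory column above, here is a back-of-the-envelope estimate for the 70B model. This is a hedged sketch: the layer, head, and dimension values are the commonly published Llama 70B architecture numbers, and it assumes BF16 weights and KV cache with no quantization, so treat it as an approximation rather than a sizing guarantee.

params_billion = 70.6            # approximate parameter count
bytes_per_param = 2              # BF16/FP16 weights
n_layers = 80                    # Llama 70B transformer layers (assumed)
n_kv_heads = 8                   # grouped-query attention KV heads (assumed)
head_dim = 128                   # per-head dimension (assumed)
bytes_per_kv_value = 2           # BF16/FP16 KV cache
context_len = 128 * 1024         # full 128K context

weights_gib = params_billion * 1e9 * bytes_per_param / 2**30
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_kv_value
kv_cache_gib = context_len * kv_per_token / 2**30

print(f"Weights:  ~{weights_gib:.0f} GiB")
print(f"KV cache: ~{kv_cache_gib:.0f} GiB at {context_len} tokens")
print(f"Total:    ~{weights_gib + kv_cache_gib:.0f} GiB plus activation and engine overhead")

For the 70B model this works out to roughly 130 GiB of weights plus about 40 GiB of KV cache, which is consistent with the ~180 GB of combined GPU memory in the table once engine overhead is included.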

Step 1: Downloading the model

First, we will download the DeepSeek-R1-Distill-Llama-70B model to an NFS file share or an S3-compatible object store. In this blog, we use a Nutanix Files share as an example, mounted at /mnt/model_pvc/.

(Note: Hugging Face model downloads require a user-access token/login using a Hugging Face account. More information here: https://huggingface.co/docs/hub/en/security-tokens)



import huggingface_hub as hfh

destination = '/mnt/model_pvc/models/DeepSeek-R1-Distill-Llama-70B'  # choose your destination
hf_token = 'hf_....'  # replace with your Hugging Face token

# download the full model snapshot (weight shards, tokenizer, config) to the file share
hfh.snapshot_download(
    repo_id='deepseek-ai/DeepSeek-R1-Distill-Llama-70B',
    local_dir=destination,
    local_dir_use_symlinks=False,
    token=hf_token,
)
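
Once the download completes, you can optionally confirm that the snapshot landed on the file share and that its total size is in the expected ~170 GB range. This is a minimal sketch using only the Python standard library; the destination path simply reuses the one from the snippet above.

from pathlib import Path

destination = Path('/mnt/model_pvc/models/DeepSeek-R1-Distill-Llama-70B')  # same path as above

# collect every downloaded file (weight shards, tokenizer, config, ...)
files = [p for p in destination.rglob('*') if p.is_file()]
total_bytes = sum(p.stat().st_size for p in files)
print(f"{len(files)} files, ~{total_bytes / 1e9:.0f} GB total")

# the safetensors shards hold the model weights and dominate the footprint
for shard in sorted(destination.glob('*.safetensors')):
    print(shard.name)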

Step 2: Importing the model to NAI

Next, go to the Models page on your NAI instance. You will see a list of all the models you have already imported. Click “Import models” to open the supported import options.

Model list page

Choose “Using Manual Upload” from the dropdown.

Model import options

Choose the “Custom Model” option.

Manual import options

Then, fill in the rest of the model details, such as the model name and developer. Ensure you enter the correct value for the Model Size (170 GB for the DeepSeek-R1-Distill-Llama-70B model).

Model resource specification

Here are the folder contents in the mounted file share for your reference.

Model files
NFS details

Once we are done with all the details, click the “Upload” button and confirm your choice.

Now, you can see the new entry created for `deepseek-llama70b`. The initial status is “Pending” and will change to “Active” once the import completes. The model import can take some time, depending on your network connection.

Model status

Step 3: Creating an Endpoint

On the Endpoints page, click the “Create Endpoint” button.

Create endpoint page

Fill out the endpoint details, such as the name and description. From the model instance name picker, choose the model we just created, deepseek-llama70b. Based on the selected model, choose the GPU count and type. In this case, we use four NVIDIA H100 GPUs for the deployment.

Endpoint details page

Choose vLLM as the inference engine, with 16 vCPUs and 64 GB of RAM. For now, we will create a single instance; you can configure the instance count based on your API scaling requirements and the availability of hardware resources.

Endpoint resource specification

Create an API key for the endpoint or reuse an existing one. Store the API key safely per your enterprise’s best practices and compliance requirements for future API usage, as the key will not be viewable again without recreating it.

API key creation

Once you have completed all the details, click the “Create” button to deploy the endpoint. The initial status is “Pending” and will change to “Active” once the deployment completes. Endpoint creation can take some time, depending on your hardware and network connection.

Congratulations—your endpoint is ready! You have successfully deployed the reasoning model on your servers.

Endpoint status

Step 4: Playing around and testing

Once the endpoint is active, click the “Test” button on the endpoint details page. You can test with either predefined sample requests or a custom request of your choice. A successful response validates the deployment end to end, and we are ready to use the inference API.

Model playground

NAI inference APIs are OpenAI API compatible. Click the “View Sample Request” button to view a sample curl request for interacting with the model. Ensure you replace $API_KEY with the API key configured during endpoint creation.
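
Under the hood, the sample request is a standard OpenAI-style chat-completions call. The sketch below shows an equivalent request made with the Python requests library; the /api/v1/chat/completions path mirrors the base URL used in the SDK example further down, and the NAI_URL and API_KEY placeholders are values you replace with your own instance URL and key.

import requests

NAI_URL = "https://<your-nai-instance>"   # replace with your NAI instance URL
API_KEY = "<your-api-key>"                # replace with the API key you created

resp = requests.post(
    f"{NAI_URL}/api/v1/chat/completions",  # OpenAI-compatible chat completions endpoint
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "deepseek-llama70b",       # the endpoint name created above
        "messages": [{"role": "user", "content": "What is 17 * 23?"}],
        "max_tokens": 2048,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])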

Sample API list

You can also use the OpenAI Python SDK to interact with the deployed endpoint.

(Note: The OpenAI Python library is available on GitHub and PyPI and is generated from the OpenAI API specification: https://github.com/openai/openai-python?tab=readme-ov-file)


from openai import OpenAI

NAI_URL = "https://<your-nai-instance>"  # replace with your NAI instance URL
API_KEY = "<your-api-key>"               # replace with the API key you copied

client = OpenAI(
    base_url=f"{NAI_URL}/api/v1",
    api_key=API_KEY,  # the client adds the "Bearer" prefix to the Authorization header itself
)

# defining the chat content (raw string so the LaTeX backslashes are kept as-is)
messages = [
    {
        "role": "user",
        "content": r"If a > 1, then the sum of the real solutions of \(\sqrt{a - \sqrt{a + x}} = x\) is equal to",
    }
]

# querying the endpoint with said question
resp_stream = client.chat.completions.create(
    model='deepseek-llama70b',  # replace with your endpoint name
    messages=messages,
    max_tokens=31000,  # adjust according to your preferences; remove for max context length
    stream=True,
)

# print the streamed tokens as they arrive
for chunk in resp_stream:
    token = chunk.choices[0].delta.content
    if token:
        print(token, end='')

Depending on your requirements, you can limit the maximum number of generated tokens with the max_tokens field or toggle streaming responses with the stream field, as in the example above.
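
For example, a non-streaming variant of the same call looks like the sketch below; it reuses the client and messages objects defined above. DeepSeek-R1-style distills typically wrap their reasoning in <think>...</think> tags, so splitting on the closing tag is one way to separate the scratchpad from the final answer; verify the tag handling against your model’s actual output.

# non-streaming variant: reuses `client` and `messages` from the snippet above
resp = client.chat.completions.create(
    model='deepseek-llama70b',   # replace with your endpoint name
    messages=messages,
    max_tokens=4096,             # cap the generated tokens for this call
    stream=False,
)

text = resp.choices[0].message.content

# DeepSeek-R1 distills usually emit their reasoning inside <think>...</think>;
# split on the closing tag so only the final answer is printed
if "</think>" in text:
    thinking, answer = text.split("</think>", 1)
    print(f"(reasoning omitted, ~{len(thinking.split())} words)")
    print(answer.strip())
else:
    print(text)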

A sample response is shown below. As a fun fact, this response took about 15k tokens, including the thinking steps.




Okay, I've got this equation to solve: √(a - √(a + x)) = x, where a > 1. Hmm, it seems a bit complicated with nested square roots, but I think I can break it down step by step.

First off, I know that with equations involving square roots, it often helps to square both sides to eliminate the radical. So, maybe I should start by doing that. Let me write that down.

Starting with the equation:
√(a - √(a + x)) = x ...
################## truncated the response for better visibility ######################

Thus, the sum of the real solutions is 
\boxed{\dfrac{\sqrt{4a - 3} - 1}{2}}
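
As a quick numeric sanity check on the boxed result (a hedged sketch, not part of the model output): for a concrete value such as a = 3, we can locate the real roots of sqrt(a - sqrt(a + x)) = x by bisection and compare their sum with (sqrt(4a - 3) - 1) / 2.

import math

def f(x, a):
    """Left side minus right side of sqrt(a - sqrt(a + x)) = x."""
    inner = a - math.sqrt(a + x)
    if inner < 0:
        return float('nan')  # outside the domain of the outer square root
    return math.sqrt(inner) - x

def real_roots(a, steps=20000):
    """Scan [0, sqrt(a)] for sign changes of f and refine each one by bisection."""
    roots, hi = [], math.sqrt(a)   # any solution satisfies 0 <= x = sqrt(...) <= sqrt(a)
    xs = [i * hi / steps for i in range(steps + 1)]
    for lo_x, hi_x in zip(xs, xs[1:]):
        flo, fhi = f(lo_x, a), f(hi_x, a)
        if math.isnan(flo) or math.isnan(fhi) or flo * fhi > 0:
            continue
        for _ in range(60):        # bisection refinement
            mid = (lo_x + hi_x) / 2
            if flo * f(mid, a) <= 0:
                hi_x = mid
            else:
                lo_x, flo = mid, f(mid, a)
        roots.append((lo_x + hi_x) / 2)
    return roots

a = 3.0
roots = real_roots(a)
print("real roots  :", roots)                           # expect a single root x = 1 for a = 3
print("sum of roots:", sum(roots))                      # 1.0
print("closed form :", (math.sqrt(4 * a - 3) - 1) / 2)  # (sqrt(9) - 1) / 2 = 1.0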


Conclusion

Today, we have shown how easy it is to create production-ready deployments of state-of-the-art reasoning models hosted on your infrastructure with just a few clicks on Nutanix Enterprise AI.

 

©2025 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).

Our decision to link to or reference an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to, or be based on, studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this paper, they have not been independently verified unless specifically stated, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from a third party.

All code samples are unofficial, are unsupported and will require extensive modification before use in a production environment. This content may reflect an experiment in a test environment. Results, benefits, savings, or other outcomes described depend on a variety of factors including use case, individual requirements, and operating environments, and this publication should not be construed as a promise or obligation to deliver specific outcomes.
