Introduction
Reasoning models are the talk of the town right now, thanks to the release of DeepSeek R1. The trend started when OpenAI announced o1, which marked the beginning of the era of reasoning and test-time compute.
LLMs can only go so far with training alone. When asked a question, they don't have the deliberate thinking ability that humans do. Chain of Thought (CoT) was one of the first techniques to explore giving LLMs thinking power for better responses, and it gave rise to other prompting strategies such as Tree of Thoughts, Self-Consistency, and many more. However, all of these techniques relied heavily on the user prompting the model correctly. What if we didn't need to prompt the model at all? What if the model intuitively knew when to spend extra time thinking and when to be confident in its answer? Here comes the era of reasoning: let the model use a scratchpad to collect its thoughts and arrive at the final solution.
Nutanix Enterprise AI helps you deploy and manage these models on your hardware with a few easy clicks. In this article, we explore how to run models with such strong reasoning capabilities on Nutanix at a predictable cost while prioritizing privacy and security. We use the recently released DeepSeek-R1-Distill-Llama-70B (Llama 70B distilled from DeepSeek R1). Note that the steps are identical for smaller models such as DeepSeek-R1-Distill-Llama-8B or the recently released Mistral Small 24B, with appropriately adjusted resource values.
Disclaimer: The reasoning models referenced in this article are not officially validated by Nutanix Enterprise AI (NAI) and are still undergoing internal testing.
Prerequisites
For this blog, we assume you have a running setup of Nutanix Enterprise AI on your Kubernetes® clusters. Refer to this Installation doc for more details. We will use DeepSeek-R1-Distill-Llama-70B, which DeepSeek released as part of their DeepSeek R1 paper. This model requires approximately 180GB of combined GPU memory (to run it at its full 128K context length) and 170GB of storage space. Alternatively, if you want something lighter, you can pick DeepSeek-R1-Distill-Llama-8B or Mistral Small 24B. Refer to the table below for approximate resource requirements for these models.
Pro tip: Nutanix Enterprise AI will validate whether a given hardware configuration can run your favorite model at your desired context length.
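As a rough sanity check on that 180GB figure (a back-of-the-envelope sketch, not an NAI sizing formula): the BF16 weights of a 70B-parameter model alone take about 140GB, and the KV cache at a 128K context accounts for much of the remainder. The architecture numbers below (80 layers, 8 KV heads, head dimension 128) are assumptions based on the Llama 70B family:

```python
# Back-of-the-envelope GPU memory estimate for DeepSeek-R1-Distill-Llama-70B in BF16.
# Assumptions: 70B parameters at 2 bytes each; KV cache sized for a Llama-70B-style
# architecture (80 layers, 8 KV heads via GQA, head dimension 128) at 2 bytes per value.
params = 70e9
weights_gb = params * 2 / 1e9                   # ~140 GB of model weights
layers, kv_heads, head_dim, context_len = 80, 8, 128, 128_000
# K and V caches: 2 tensors x layers x kv_heads x head_dim x context tokens x 2 bytes
kv_cache_gb = 2 * layers * kv_heads * head_dim * context_len * 2 / 1e9   # ~42 GB for one sequence
print(f"weights ~ {weights_gb:.0f} GB, KV cache @128K ~ {kv_cache_gb:.0f} GB, "
      f"total ~ {weights_gb + kv_cache_gb:.0f} GB")
```

This lines up with the approximately 180GB of combined GPU memory quoted above for the full context length.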
Step 1: Downloading the model
First, we download the DeepSeek-R1-Distill-Llama-70B model to an NFS file share or an S3-compatible object store. In this blog, we use a Nutanix File Share as an example, mounted at /mnt/model_pvc/.
(Note: Hugging Face model downloads require a user-access token/login using a Hugging Face account. More information here: https://huggingface.co/docs/hub/en/security-tokens)
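If you prefer to script the download, here is a minimal sketch using the huggingface_hub library. The target folder name under /mnt/model_pvc/ and the HF_TOKEN environment variable are assumptions to adapt to your environment:

```python
# A minimal sketch of downloading the model to the mounted file share.
# Assumptions: huggingface_hub is installed (pip install huggingface_hub), HF_TOKEN holds
# a valid Hugging Face user-access token, and the folder name "deepseek-llama70b" under
# the /mnt/model_pvc/ mount is an arbitrary choice for this walkthrough.
import os
from huggingface_hub import snapshot_download

local_dir = "/mnt/model_pvc/deepseek-llama70b"  # NFS file share mount used in this blog

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # or the 8B variant for a lighter setup
    local_dir=local_dir,
    token=os.environ["HF_TOKEN"],  # Hugging Face user-access token
)
print(f"Model files downloaded to {local_dir}")
```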
Step 2: Importing the model to NAI
Next, go to the Models page on your NAI instance. You will see a list of all the models you have already imported. Click "Import models" to open the supported import options.
Choose “Using Manual Upload” from the dropdown.
Choose the “Custom Model” option.
Then, fill in the rest of the model details, such as the model name and developer. Ensure you enter the correct value for Model Size (170GB for the DeepSeek-R1-Distill-Llama-70B model).
Here are the folder contents in the mounted file share for your reference.
Once we are done with all the details, click the “Upload” button and confirm your choice.
Now you can see the new entry created for `deepseek-llama70b`. The initial status is “Pending” and will change to “Active” once the import completes. The model import can take some time, depending on your network connection.
Step 3: Creating an Endpoint
On the Endpoints page, click the “Create Endpoint” button.
Fill out the endpoint details, such as the name and description. From the model instance name picker, choose the previously created model, deepseek-llama70b. Based on the selected model, choose the GPU count and type. In this case, we use four NVIDIA H100 GPUs for the deployment.
Choose vLLM as the inference engine, with 16 vCPUs and 64 GB of RAM. For now, we create a single instance; you can configure the instance count based on your API scaling requirements and the availability of hardware resources.
Create an API key for the endpoint, or select an existing one to reuse. Store the API key safely, per your enterprise’s best practices and compliance requirements, for future API usage; the key will not be viewable again without recreating it.
Once you have completed all the details, click the “Create” button to deploy the endpoint. The initial status is “Pending” and will change to “Active” once deployment completes. Endpoint creation can take some time, depending on your hardware and network connection.
Congratulations—your endpoint is ready! You have successfully deployed the reasoning model on your servers.
Step 4: Playing around and testing
Once the endpoint is active, click the “Test” button on the endpoint details page. You can test with either predefined sample requests or a custom request of your choice. A successful response validates the deployment end to end, and the inference API is ready to use.
NAI inference APIs are OpenAI API compatible. Click the “View Sample Request” button to view a sample curl request for interacting with the model. Ensure you replace $API_KEY with the API key configured during endpoint creation.
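For illustration, here is a rough Python equivalent of that sample request using the requests library. The endpoint URL below is a placeholder; copy the exact URL and model name from the “View Sample Request” dialog:

```python
# A minimal sketch of calling the OpenAI-compatible chat completions API exposed by NAI.
# Assumptions: ENDPOINT_URL is a placeholder for the full URL shown in "View Sample Request",
# and API_KEY is the key created during endpoint creation.
import os
import requests

ENDPOINT_URL = "https://<your-nai-endpoint>/v1/chat/completions"  # placeholder, copy from NAI
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "deepseek-llama70b",  # the model name used when importing the model
    "messages": [
        {"role": "user", "content": "How many r's are there in the word strawberry?"}
    ],
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,  # reasoning models can think for a while before answering
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```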
Alternatively, you can use the OpenAI Python SDK to interact with the deployed endpoint.
(Note: The OpenAI Python API library is available on GitHub and is generated from the OpenAI API specification: https://github.com/openai/openai-python?tab=readme-ov-file)
Depending on your requirements, you may limit the maximum number of generated tokens using the max_tokens field or set the stream field for streaming responses.
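Here is a minimal sketch of that integration. The base URL is a placeholder, and the model name assumes the `deepseek-llama70b` entry created earlier:

```python
# A minimal sketch using the OpenAI Python SDK against the NAI endpoint.
# Assumptions: the base_url placeholder must be replaced with your endpoint's URL,
# and API_KEY is the key created in Step 3.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-nai-endpoint>/v1",  # placeholder, copy from the sample request
    api_key=os.environ["API_KEY"],
)

# stream=True yields tokens as they are generated; max_tokens caps the response length,
# including the model's thinking steps.
stream = client.chat.completions.create(
    model="deepseek-llama70b",
    messages=[{"role": "user", "content": "Is 9.11 greater than 9.9? Think carefully."}],
    max_tokens=16384,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```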
A sample response is shown below. As a fun fact, this response consumed about 15K tokens, including the thinking steps.
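DeepSeek-R1 distilled models typically wrap their reasoning in <think>...</think> tags before the final answer. If your application only needs the final answer, a small post-processing step can separate the two; the sketch below assumes that tag convention:

```python
# A minimal sketch that separates the model's thinking trace from its final answer,
# assuming the response wraps reasoning in <think>...</think> tags.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if no <think> block is present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return thinking, answer

# Illustrative example (not an actual model response):
sample = "<think>The user wants a count of the letter r...</think>There are three r's in strawberry."
thinking, answer = split_reasoning(sample)
print("Thinking:", thinking)
print("Final answer:", answer)
```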
Conclusion
Today, we have shown how easy it is to create production-ready deployments of state-of-the-art reasoning models hosted on your infrastructure with just a few clicks on Nutanix Enterprise AI.
©2025 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. All other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s).
Our decision to link to or reference an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to, or be based on, studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified unless specifically stated, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.
All code samples are unofficial, are unsupported and will require extensive modification before use in a production environment. This content may reflect an experiment in a test environment. Results, benefits, savings, or other outcomes described depend on a variety of factors including use case, individual requirements, and operating environments, and this publication should not be construed as a promise or obligation to deliver specific outcomes.