Introduction

In the first part of this two-part blog series, we briefly discussed power and energy metrics and looked at what should be considered when analyzing results in Nutanix's X-Ray benchmarking tool. If you haven't read that post, please head over and do that now; the following might seem like it makes sense on its own, but it will make a lot more sense if you're already familiar with that post.

NB: For long-time users of the Nutanix X-Ray benchmarking platform: power data was not collected for any test run performed with a version prior to X-Ray 4.4 (January 2024), so it will not appear in those results. If this information is required, we recommend re-running the scenarios with the latest version of X-Ray.

In this second part of the series, the intention is to move on from theory and take a look at some actual results.

Baselining - Idle Scenario

First, I ran a 30-minute HCI benchmark test with minimal settings to create a baseline. I deliberately set out to make this scenario as close to an idle cluster as possible.

Figure: X-Ray UI screenshot showing the scenario variables and values.

So that’s basically an HCI Benchmark intended to produce a grand total of 1 IOP. This is as idle as you can get whilst still actually running a test scenario. As per the first part of this blog series, this is a “fixed time” test set to run for 1800 seconds (30 mins), as opposed to a fixed work test that we’ll look at next.

Please note, in a wider sense, what actually constitutes an “idle” platform or server may depend on your particular circumstances. It could be a server booted to the BIOS with no actual load or it could be a server with an OS / hypervisor but no work besides background tests. It could be the low utilization point in your environment. It’s not as easy to define as a maximum that you will see in hardware documentation.

For our baseline HCI benchmark, the results summary is as follows:

Figure: Results summary of an HCI Benchmark producing 1 IOP.

From the results, we can see that the test time came to 29.96 minutes, near enough the 30 minutes requested. (This is the time between the first and last polls, so it will always be close to, but never exactly, the requested duration.)

We can also see that the energy used during this test run was 0.26 kWh. Consuming 0.26 kWh in 30 minutes is a rate of 0.52 kWh per hour, or to put it another way, an average power draw of 0.52 kW, which is 520 W.
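This unit conversion trips people up surprisingly often, so here is the arithmetic as a small Python sketch, using the figures from this test run:

```python
def average_power_kw(energy_kwh: float, duration_hours: float) -> float:
    """Average power (kW) is energy (kWh) divided by elapsed time (hours)."""
    return energy_kwh / duration_hours

# 0.26 kWh consumed over a 30-minute (0.5 h) run:
avg_kw = average_power_kw(0.26, 0.5)
print(round(avg_kw * 1000))  # 520 (watts)
```

The same function works in reverse, of course: multiply average power by elapsed time to get back to energy.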

If we then look at the Cluster Power Usage graph, we can see that the power consumption was around 500W for the entirety of the test.

Figure: Cluster power usage graph.

This should clearly illustrate the difference between power and energy. But, to state it one last time: 520 W of power drawn for 1 hour is 520 Wh, or 0.52 kWh, of energy.

Also note that there are four nodes in the cluster, so one would expect that energy use to be fairly consistent across all the nodes if the workloads are balanced between them.

Please note: the terms draw, consumption, and usage are all synonymous in this context.

Also note that the cluster power usage tracks pretty well with the cluster CPU utilization:

Figure: Cluster power usage graph overlaid with CPU utilization.

CPU utilization is sometimes used as a proxy for power draw where no direct measurements are possible. But in scenarios where other hardware components (disks, network cards, GPUs) are heavily used, this methodology can become inaccurate.
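As a rough illustration of how that proxy works, here is a minimal sketch of a first-order linear power model. The idle and full-load wattages are illustrative assumptions, not measurements from this cluster, and a real model would also need terms for disks, NICs, and GPUs:

```python
def estimate_power_w(cpu_util: float,
                     idle_w: float = 450.0,
                     max_w: float = 700.0) -> float:
    """First-order power estimate: linear interpolation between idle and
    full-load draw based on CPU utilization (0.0-1.0).
    idle_w and max_w are illustrative assumptions, not measured values."""
    if not 0.0 <= cpu_util <= 1.0:
        raise ValueError("cpu_util must be between 0.0 and 1.0")
    return idle_w + cpu_util * (max_w - idle_w)

# Low utilization gives an estimate close to the idle draw:
print(estimate_power_w(0.10))
```

This is exactly why the proxy breaks down when non-CPU components do the heavy lifting: their power draw never appears in the model's inputs.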

Ok, that’s a start, but what about something more interesting?

Comparison

To do that, we’re going to make use of X-Ray’s comparison feature to put similar test runs next to each other and see what we might surmise from the results.  So, to help make sure we get noticeably different results, we’re going to use two different clusters, each with different Nutanix software versions, one running AOS 5.20 and one running AOS 6.8, but with identical hardware configurations.

Comparison - Big Data Ingestion

We used the “Big Data Ingestion” test configured to process 1TiB of data. The power and energy comparisons are as follows:

Figure: A table of run time and energy use results from an X-Ray comparison. The AOS 5.20 cluster is on the left (dark blue) and the AOS 6.8 cluster is on the right (light blue).
Figure: X-Ray comparison power draw graph. The AOS 5.20 cluster is dark blue and the AOS 6.8 cluster is light blue.

From the graph above, the light blue line is higher, therefore the cluster it represents must be drawing more power.  But does that mean it's using more energy?  If you look closely, you can see that the light blue line (the AOS 6.8 cluster) starts slightly later and appears to finish earlier.

The table above confirms that the AOS 6.8 cluster completed the 1 TiB data ingestion over a minute quicker than the AOS 5.20 cluster. So although the AOS 6.8 cluster draws slightly more power at its peak, it completes the task more quickly, and the overall energy used to complete the task is the same: 0.19 kWh.
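The underlying arithmetic, that a higher-power but shorter run can cost the same energy as a lower-power but longer one, can be sketched as follows. The wattages and runtimes below are illustrative assumptions, not the measured figures from this comparison:

```python
def energy_kwh(avg_power_w: float, runtime_min: float) -> float:
    """Energy (kWh) = average power (kW) x elapsed time (hours)."""
    return (avg_power_w / 1000.0) * (runtime_min / 60.0)

# Illustrative numbers only: the faster cluster draws more power
# but finishes sooner, so the energy for the task comes out the same.
slow = energy_kwh(avg_power_w=600.0, runtime_min=19.0)
fast = energy_kwh(avg_power_w=633.0, runtime_min=18.0)
print(round(slow, 2), round(fast, 2))  # both come to ~0.19 kWh
```

Comparing peak power draw alone, without the runtime, would have pointed you at the wrong conclusion.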

Conclusion

The key learning here is that, just because a system can draw more power at peak, it doesn't mean that it will consume more energy to complete the same work, or over its lifetime.

If you are trying to understand an IT infrastructure platform's energy use for budgetary or carbon impact estimation purposes, then this single task is interesting, but perhaps not the whole story.  To get the full picture, the entire lifecycle of a system should be assessed, including:

  1. Multiple workloads and tasks running at the same time.
  2. Measurements for idle or low-utilization periods, which are often a far greater percentage of a lifetime of use than peak periods.
  3. Hardware costs and embodied emissions.
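As a back-of-the-envelope illustration of the second point, a simple two-state duty-cycle model shows how much idle periods can dominate lifetime energy. Every figure here is an assumption for illustration only, not a measurement:

```python
def lifetime_energy_kwh(idle_w: float, peak_w: float,
                        idle_fraction: float, hours: float) -> float:
    """Estimate lifetime energy from a simple two-state duty cycle:
    a fraction of time at idle draw, the rest at peak draw.
    All inputs here are illustrative assumptions, not measured values."""
    peak_fraction = 1.0 - idle_fraction
    avg_w = idle_w * idle_fraction + peak_w * peak_fraction
    return (avg_w / 1000.0) * hours

# Five years of runtime, 80% of it near the idle draw:
five_years_h = 5 * 365 * 24
print(round(lifetime_energy_kwh(520.0, 700.0, 0.80, five_years_h)))
```

Even with peak draw well above idle, the idle figure dominates the total here because of the 80/20 time split, which is why idle measurements matter so much for budgeting.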

That last point is particularly interesting because, if a software upgrade alone is able to extract more performance from the same hardware, then it's quite possible that the new software is reducing the need for further hardware (i.e., adding another node) or even extending the life of the existing hardware investment.

In this case you might reasonably conclude that the AOS 6.8 cluster is able to push the physical hardware harder, allowing organizations to get more value from the same infrastructure simply by upgrading the software.  

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). The content herein reflects an experiment in a test environment. Results, benefits, savings or other outcomes depend on a variety of factors including their use case, individual requirements, and operating environments, and this publication should not be construed to be a promise or obligation to deliver specific outcomes.