Introduction

Nutanix Life Cycle Manager (LCM) is a software update tool that manages the lifecycle of software and firmware components from one centralized control plane. Before upgrading, LCM must find out which versions are currently installed, along with any additional context about them that helps determine which upgrade versions are available and compatible. This process is known as the Inventory operation.

The speed and efficiency of the inventory operation are crucial because it is a prerequisite for any LCM upgrade. To keep the LCM orchestration layer scalable, LCM uses a plugin model: owners of the various components in the Nutanix stack provide their own plugins that check which versions are installed and which upgrades are available.

To work correctly in different situations (for example, when a new cluster is set up, when LCM is used for the first time, or when upgrades are done outside LCM), LCM must run these plugins to find accurate information about installed versions and identify compatible upgrades.

As LCM adds more components (such as additional software and firmware), keeping the inventory operation fast and accurate becomes increasingly important. In this blog, we'll describe how we improved the inventory operation's performance in LCM 2.6 by making it independent of the number of components in LCM. Eventually, users might not even notice when the inventory operation happens, an outcome we call invisible inventory.

Understanding the Challenge

Choosing the right tool

When undertaking performance optimization, it is crucial to identify the current bottlenecks in a codebase. Since our project was initially built in Python 2.7 (we have now migrated to Python 3) and made use of greenlets, we required a Python profiler that could accommodate both. To get precise insight into the bottlenecks, we opted for a deterministic profiler rather than a statistical one.

To determine the most suitable profiler for our requirements, we explored several options, including cProfile, pyinstrument, and yappi. Each profiler had its strengths and weaknesses, and after careful evaluation we chose yappi because it aligned best with our specific profiling needs.
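As a rough illustration of what deterministic profiling looks like, the sketch below uses the standard library's cProfile as a stand-in; it is not the yappi setup used for LCM, and it does not address greenlets. The functions `download_plugin` and `run_inventory` are hypothetical placeholders for a slow workload.

```python
import cProfile
import io
import pstats
import time


def download_plugin(name):
    # Hypothetical stand-in for a slow per-plugin download.
    time.sleep(0.01)
    return name


def run_inventory(plugins):
    # Hypothetical workload: one download per plugin.
    return [download_plugin(p) for p in plugins]


def profile_call(func, *args):
    """Run func under a deterministic profiler; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(run_inventory, ["plugin_a", "plugin_b", "plugin_c"])
print(report)
```

A deterministic profiler records every function call, so the report attributes time to each function exactly rather than by sampling, at the cost of some overhead.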

Challenge

To simplify component management and promote distributed ownership, LCM adopts a pluggable model that lets each component define and control the logic for detecting currently running versions and executing upgrades. Consequently, LCM downloads component plugins onto the cluster to execute these operations. To limit cluster space consumption during these operations, LCM builds component plugins separately. This approach allows LCM to download only the plugins needed for each operation, minimizing disk usage by avoiding the simultaneous download of all plugins. LCM also cleans up these plugins after execution to free the disk space they used.

Upon analyzing the profiling results, we identified that a significant portion of inventory time is occupied by plugin download and cleanup. The discovery workflow, which detects the currently installed version for each component, can be broadly divided into five subtasks:

  1. Perform Space Check: Before initiating any plugin download, LCM ensures that the cluster has sufficient available space.
  2. Download: The management of plugins is handled by a distinct Catalog service, an internal repository for image management that facilitates the download and management of the required artifacts, storing them within an underlying distributed storage infrastructure. LCM employs the Catalog service API to retrieve the essential plugins as needed on the target machine.
  3. Extract: LCM builds a source distribution of the plugins for each component and extracts the gzipped tar bundle into a designated staging directory in the environment managed by LCM.
  4. Perform Operation: During this step, LCM executes the detect logic using the extracted plugin and stores the relevant information for future reference.
  5. Cleanup: As part of the cleanup process, LCM clears the staging directory to release the consumed disk space.
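The five subtasks above can be sketched for a single plugin as follows. This is a minimal illustration under stated assumptions, not LCM's implementation: a local directory stands in for the Catalog service, a file copy stands in for the Catalog API call, and the names `make_fake_plugin`, `detect_one`, and the `VERSION` file are hypothetical.

```python
import io
import shutil
import tarfile
import tempfile
import time
from pathlib import Path

PHASES = ("space_check", "download", "extract", "operation", "cleanup")


def make_fake_plugin(catalog_dir, name, version):
    """Create a tiny .tar.gz 'plugin' holding a version file (illustrative only)."""
    bundle = Path(catalog_dir) / f"{name}.tar.gz"
    payload = version.encode()
    with tarfile.open(bundle, "w:gz") as tar:
        info = tarfile.TarInfo(name=f"{name}/VERSION")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return bundle


def detect_one(bundle, staging_root, min_free_bytes=1024):
    """Run the five subtasks for one plugin; return (version, seconds per phase)."""
    timings = {}
    staging = Path(staging_root) / bundle.name.replace(".tar.gz", "")

    start = time.perf_counter()
    if shutil.disk_usage(staging_root).free < min_free_bytes:
        raise RuntimeError("not enough free space in staging area")
    timings["space_check"] = time.perf_counter() - start

    start = time.perf_counter()
    staging.mkdir(parents=True)
    local_copy = Path(shutil.copy(bundle, staging))  # stand-in for a Catalog API call
    timings["download"] = time.perf_counter() - start

    start = time.perf_counter()
    with tarfile.open(local_copy, "r:gz") as tar:
        tar.extractall(staging)
    timings["extract"] = time.perf_counter() - start

    start = time.perf_counter()
    version = next(staging.rglob("VERSION")).read_text()  # the "detect" logic
    timings["operation"] = time.perf_counter() - start

    start = time.perf_counter()
    shutil.rmtree(staging)  # release the staging space
    timings["cleanup"] = time.perf_counter() - start
    return version, timings


with tempfile.TemporaryDirectory() as catalog, tempfile.TemporaryDirectory() as root:
    bundle = make_fake_plugin(catalog, "bios", "1.2.3")
    version, timings = detect_one(bundle, root)
    print(version, {phase: round(timings[phase], 4) for phase in PHASES})
```

Timing each phase separately, as done here, is what makes it possible to see which subtasks dominate the total.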

The profiling results below (time in seconds) are the mean of 3 instances on a 3-node cluster (AHV) with LCM 2.5 (see Table 1):

| Total Time (sec) | Space Check (sec) | Download Time (sec) | Extract (sec) | Operation (sec) | Cleanup (sec) |
|---|---|---|---|---|---|
| 417.58 | 48.54 | 205.86 | 93.76 | 37.27 | 7.56 |

Reevaluating Space vs. Time Trade-offs

Having understood the challenges posed by the inventory operation, we investigated the primary reason behind the prolonged preparation (space check, download, extract) and cleanup time. We found that it wasn't solely due to the size of these modules but rather to the overhead caused by a large number of independent plugin download requests. Compounding the issue, certain common library plugins were needed by multiple LCM entities, leading to redundant prepare and cleanup operations.
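One way to see why the number of requests can matter more than the payload size is a toy cost model in which every request pays a fixed overhead on top of the raw transfer time. The plugin count, overhead, and bandwidth below are illustrative assumptions, not measurements from LCM.

```python
def prep_time(num_requests, total_bytes, overhead_s, bandwidth_bps):
    """Estimated time: a fixed cost per request plus the raw transfer time."""
    return num_requests * overhead_s + total_bytes / bandwidth_bps


# All numbers below are illustrative assumptions, not measurements from LCM.
NUM_PLUGINS = 100
TOTAL_BYTES = 9 * 1024 * 1024       # 9 MB spread across all plugins
OVERHEAD_S = 1.0                    # assumed fixed cost per download request
BANDWIDTH_BPS = 50 * 1024 * 1024    # assumed 50 MB/s transfer rate

separate = prep_time(NUM_PLUGINS, TOTAL_BYTES, OVERHEAD_S, BANDWIDTH_BPS)
bundled = prep_time(1, TOTAL_BYTES, OVERHEAD_S, BANDWIDTH_BPS)
print(f"separate requests: {separate:.2f} s, single bundle: {bundled:.2f} s")
```

In this model the transfer term is identical in both cases, so the whole difference comes from the per-request overhead being paid once instead of once per plugin.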

To chart our path forward, we delved into the plugin characteristics and made the following key observations:

  • Approximately 95% of the plugins had a size of less than 1MB.
  • The total size of plugins, each with an individual size less than 1MB, summed up to less than 9MB.
  • Only 1% of the plugins exceeded a size of 10MB.
  • Plugins larger than 1MB predominantly comprised third-party library images from supported OEMs and were specifically required for certain entities.
  • About 10% of the plugins consisted of third-party library images, while the cumulative size of the remaining plugins was around 1MB.

Armed with these insights, we set out to reduce the preparation and cleanup time for these plugins. To achieve this, we reevaluated the previous approach, which prioritized minimal disk space consumption, and adopted a hybrid solution to reduce the overall time taken. Because approximately 90% of the plugins had a cumulative size of only about 1MB, we began bundling these small plugins together in one comprehensive package called "one-module." Prior to executing a batch of detect tasks, "one-module" is downloaded and extracted only once, and cleanup occurs after all detect tasks in the batch have completed on a given node. Meanwhile, the preparation and cleanup of library images continue to be performed on demand, as before, which keeps disk space consumption at a similar level.
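The hybrid approach can be sketched as follows. This is a simplified illustration, not LCM's code: the 1 MB cut-off, the plugin names, and the helpers `split_plugins`, `prepared`, and `run_detect_batch` are assumptions, and each large plugin is treated as its own detect task for brevity.

```python
from contextlib import contextmanager

SMALL_PLUGIN_LIMIT = 1 * 1024 * 1024  # assumed 1 MB cut-off


def split_plugins(plugin_sizes, limit=SMALL_PLUGIN_LIMIT):
    """Split {name: size_in_bytes} into (bundled names, on-demand names)."""
    bundled = sorted(n for n, s in plugin_sizes.items() if s < limit)
    on_demand = sorted(n for n, s in plugin_sizes.items() if s >= limit)
    return bundled, on_demand


@contextmanager
def prepared(label, log):
    """Record one prepare/cleanup pair around a block of work."""
    log.append(f"prepare {label}")
    try:
        yield
    finally:
        log.append(f"cleanup {label}")


def run_detect_batch(plugin_sizes):
    """Prepare the shared bundle once, then run every detect task in the batch."""
    log = []
    bundled, on_demand = split_plugins(plugin_sizes)
    with prepared("one-module", log):
        for name in bundled:
            log.append(f"detect {name}")
        for name in on_demand:
            # Large library images are still fetched and removed per task.
            with prepared(name, log):
                log.append(f"detect {name}")
    return log


sizes = {"bios": 40_000, "bmc": 55_000, "oem_library": 12_000_000}
for line in run_detect_batch(sizes):
    print(line)
```

The small plugins share a single prepare/cleanup pair for the whole batch, while the large image keeps its own pair, so its disk footprint is held only while it is needed.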

By implementing this hybrid solution, we successfully optimized the inventory operation, reducing both preparation and cleanup times while effectively managing disk space usage. This enhancement represents a significant step towards achieving superior performance in LCM inventory operations within the Nutanix product.

Performance Improvement

After adopting the "one-module" approach in our inventory operations, we noticed significant improvements in the overall duration of inventory tasks. Nevertheless, it's crucial to recognize that the outcomes attained are subject to various factors such as the hardware model of the cluster, deployed software, and firmware versions, which can result in varying inventory performance.

The profiling results below (time in seconds) are the mean of 3 instances on a 3-node cluster (AHV) with LCM 2.6 (see Table 2):

| Total Time (sec) | Space Check (sec) | Download Time (sec) | Extract (sec) | Operation (sec) | Cleanup (sec) |
|---|---|---|---|---|---|
| 48.91 | 0.88 | 5.71 | 1.91 | 30.94 | 0.46 |
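To compare the two sets of means phase by phase, the short calculation below computes the relative decrease for each column. The values are the reported LCM 2.5 and LCM 2.6 means, taking the first column as total time, as in the appendix tables. Because the inputs are rounded means, the results are approximate.

```python
PHASES = ["Total", "Space Check", "Download", "Extract", "Operation", "Cleanup"]
LCM_2_5 = [417.58, 48.54, 205.86, 93.76, 37.27, 7.56]   # mean seconds, LCM 2.5
LCM_2_6 = [48.91, 0.88, 5.71, 1.91, 30.94, 0.46]        # mean seconds, LCM 2.6


def percent_decrease(before, after):
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100.0


reduction = {
    phase: percent_decrease(old, new)
    for phase, old, new in zip(PHASES, LCM_2_5, LCM_2_6)
}
for phase, value in reduction.items():
    print(f"{phase:<12} {value:5.1f}% decrease")
```

On these numbers, the space check, download, extract, and cleanup phases account for nearly all of the saving, while the operation phase changes comparatively little.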

We observed similar performance improvements on clusters with different configurations, varying in cluster size and hypervisors. The "Decrease in Total Detect time" indicates the reduction achieved in inventory time when comparing the prior LCM 2.5 release and LCM 2.6 on the same cluster configuration. However, it is essential to note that comparing results between different cluster configurations may not be appropriate, as inventory time is influenced by multiple factors, not limited to cluster size and hypervisor.

Below are the approximate decreases in Total Detect time and Total Inventory time for different cluster configurations:

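| Cluster Size (number of nodes) | Hypervisor | Decrease in Total Detect Time (approx) | Decrease in Total Inventory Time (approx) |
|---|---|---|---|
| 3 | AHV | 87% | 60% |
| 3 | VMware ESXi | 67% | 47% |
| 8 | AHV | 59% | 35% |
| 8 | VMware ESXi | 66% | 43% |
| 32 | AHV | 84% | 46% |
| 48 | VMware ESXi | 77% | 40% |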

Conclusion

In this blog, we explored the approach adopted by Nutanix Life Cycle Manager (LCM) to optimize inventory performance and reduce total inventory time. By introducing the "one-module" concept in LCM 2.6, we significantly streamlined the package preparation and cleanup processes, resulting in noticeable improvements in the overall time taken for inventory operations.

Appendix

Table 1

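| Instance # | Node # | Total Time (sec) | Space Check (sec) | Download Time (sec) | Extract (sec) | Operation (sec) | Cleanup (sec) |
|---|---|---|---|---|---|---|---|
| 1 | Node 1 | 402.340 | 46.828 | 199.727 | 90.282 | 35.695 | 7.159 |
| 1 | Node 2 | 440.874 | 51.855 | 216.294 | 98.520 | 39.200 | 8.124 |
| 1 | Node 3 | 412.356 | 47.601 | 203.449 | 92.328 | 36.418 | 7.720 |
| 2 | Node 1 | 404.435 | 47.104 | 201.013 | 89.687 | 36.458 | 7.226 |
| 2 | Node 2 | 434.523 | 50.666 | 213.613 | 97.829 | 38.956 | 7.795 |
| 2 | Node 3 | 411.255 | 47.546 | 203.094 | 91.421 | 37.182 | 7.486 |
| 3 | Node 1 | 402.200 | 46.774 | 198.530 | 90.818 | 35.633 | 7.172 |
| 3 | Node 2 | 437.746 | 50.747 | 211.862 | 100.804 | 39.529 | 8.022 |
| 3 | Node 3 | 412.500 | 47.773 | 205.116 | 92.174 | 36.421 | 7.330 |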

Table 2

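| Instance # | Node # | Total Time (sec) | Space Check (sec) | Download Time (sec) | Extract (sec) | Operation (sec) | Cleanup (sec) |
|---|---|---|---|---|---|---|---|
| 1 | Node 1 | 46.885 | 0.930 | 5.591 | 1.776 | 30.534 | 0.419 |
| 1 | Node 2 | 50.964 | 1.081 | 5.987 | 2.117 | 31.649 | 0.444 |
| 1 | Node 3 | 48.181 | 0.977 | 5.467 | 1.812 | 30.474 | 0.475 |
| 2 | Node 1 | 46.891 | 0.955 | 5.453 | 1.823 | 30.384 | 0.435 |
| 2 | Node 2 | 50.795 | 1.027 | 5.822 | 2.103 | 31.876 | 0.466 |
| 2 | Node 3 | 48.111 | 0.936 | 5.506 | 2.022 | 30.208 | 0.423 |
| 3 | Node 1 | 49.917 | 0.955 | 5.565 | 1.680 | 30.951 | 0.511 |
| 3 | Node 2 | 51.521 | 1.029 | 6.066 | 1.999 | 32.038 | 0.560 |
| 3 | Node 3 | 48.933 | 0.0979 | 5.921 | 1.840 | 30.329 | 0.442 |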


© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo, and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Nutanix, Inc. is not affiliated with VMware by Broadcom or Broadcom. VMware and the various VMware product names recited herein are registered or unregistered trademarks of Broadcom in the United States and/or other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

This post may contain express and implied forward-looking statements, which are not historical facts and are instead based on our current expectations, estimates and beliefs. The accuracy of such statements involves risks and uncertainties and depends upon future events, including those that may be beyond our control, and actual results may differ materially and adversely from those anticipated or implied by such statements. Any forward-looking statements included herein speak only as of the date hereof and, except as required by law, we assume no obligation to update or otherwise revise any of such forward-looking statements to reflect subsequent events or circumstances.