Understanding VMware Capacity


As a Capacity Manager, do you fully understand the technology, how to monitor it, and how to determine what headroom exists, so that you can increase efficiency and cut costs?

At our webinar with Syncsort on July 26, attendees asked many questions on the topic of understanding VMware capacity. The topics covered included: why OS monitoring can be misleading; five key metrics; measuring processor capacity; measuring memory capacity; and calculating headroom in VMs.

As VMware remains the go-to option for virtualization for the majority of organizations and has been for some time, it’s critical to gain an in-depth understanding of the technology. That’s why we thought it would be helpful to list the questions and answers that were generated from our webinar.

Perhaps one or two of your questions will be answered here. Check them out:

You seem to treat all VMs equally. Can you break your measurement of the VMs down into use categories, Production, Development, Test, and so on?

  • Within Athene we certainly do categorize/group VMs based on their use or business service. This enables us to monitor and report on those groups. 
  • Within VMware this can also be done using Resource Pools. If Resource Pools have different reservations, limits, and shares, then you also have to take those into account.

Can Athene help me identify inactive VMs that I can delete?

  • Yes. Athene gives you the ability to identify VMs that are “idle” or over configured. You can then make the decision on what to do with those VMs.
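To make the idea concrete, here is a minimal sketch of how idle or over-configured VMs might be flagged from summary statistics. The field names and thresholds are illustrative assumptions only; they are not Athene's actual classification logic.

```python
# Toy sketch of flagging "idle" or over-configured VMs from summary
# statistics. Thresholds and field names are illustrative assumptions,
# not Athene's actual rules.

IDLE_CPU_PCT = 2.0    # average CPU utilization below this suggests idle
IDLE_NET_KBPS = 5.0   # average network throughput below this suggests idle

def classify_vm(stats):
    """stats: dict with avg_cpu_pct, avg_net_kbps, vcpus, peak_cpu_pct."""
    if stats["avg_cpu_pct"] < IDLE_CPU_PCT and stats["avg_net_kbps"] < IDLE_NET_KBPS:
        return "idle-candidate"
    # A VM whose peak load never approaches even one vCPU's worth of its
    # configured capacity may be over-configured.
    if stats["vcpus"] > 1 and stats["peak_cpu_pct"] < 100 / stats["vcpus"]:
        return "over-configured"
    return "active"

vms = {
    "dev-build-01": {"avg_cpu_pct": 0.4, "avg_net_kbps": 1.2,
                     "vcpus": 2, "peak_cpu_pct": 3.0},
    "prod-web-01":  {"avg_cpu_pct": 35.0, "avg_net_kbps": 900.0,
                     "vcpus": 4, "peak_cpu_pct": 80.0},
}
for name, stats in vms.items():
    print(name, "->", classify_vm(stats))  # dev-build-01 -> idle-candidate
```

The tool itself still only surfaces candidates; as the answer notes, the decision on what to do with those VMs stays with you.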

You mentioned a rule of thumb (ROT) that for a ready time of >10%, intervention is recommended or warranted. Do you have an ROT for co-stop time?

  • If either metric, or the sum of both, were to go over 10%, I’d want to investigate why, to see if action was needed.
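Applying that rule of thumb requires converting the raw counters into percentages first. vCenter reports ready and co-stop as "summation" counters in milliseconds per sample interval (20 seconds for real-time charts); the sketch below uses VMware's published conversion, normalised per vCPU so the threshold applies consistently to VMs of different sizes.

```python
# Convert a CPU Ready (or Co-Stop) "summation" counter, reported in
# milliseconds per sample interval, into a percentage of the interval.
# The 20-second default is vCenter's real-time sampling interval.

def summation_ms_to_pct(summation_ms, interval_s=20, vcpus=1):
    """Percentage of the interval spent ready (or co-stopped),
    normalised per vCPU."""
    return summation_ms / (interval_s * 1000 * vcpus) * 100

# A 4-vCPU VM reporting 9,600 ms of ready time in a 20 s sample:
ready_pct = summation_ms_to_pct(9600, vcpus=4)
print(f"ready: {ready_pct:.1f}%")  # 12.0% -- over the 10% rule of thumb
print("investigate" if ready_pct > 10 else "ok")
```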

Regarding vCPUs having to be scheduled all at once. I found this in a VMware white paper: Relaxed co-scheduling replaced the strict co-scheduling in ESX 3.x and has been refined in subsequent releases to achieve better CPU utilization and to support wide multiprocessor virtual machines. Can you elaborate?

  • Yes. “All at once” is a simplification to help us understand the issue. Relaxed co-scheduling enables the VM’s vCPUs to be scheduled individually, but within a very tight timeframe. If all the vCPUs can’t find a physical CPU in that timeframe, processing stops, and Co-Stop time is recorded until all the vCPUs can get their required time on a physical CPU.
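A toy model can make the mechanism clearer. This is not VMware's actual algorithm, just an illustration of the principle: vCPUs run individually, a "skew" between the most- and least-advanced vCPU is tracked, and when the skew grows past a bound the VM is co-stopped until the lagging vCPUs catch up. The skew limit and tick granularity here are arbitrary.

```python
# Toy illustration (not VMware's actual algorithm) of why co-stop time
# accrues under relaxed co-scheduling: vCPUs are scheduled
# individually, but if the progress skew between the fastest and
# slowest vCPU exceeds a bound, the VM is co-stopped until the lagging
# vCPUs catch up.

SKEW_LIMIT = 3  # illustrative bound on allowed progress skew, in ticks

def simulate(schedule, vcpus=2):
    """schedule: per tick, the set of vCPU indices granted a physical CPU."""
    progress = [0] * vcpus
    co_stop_ticks = 0
    for granted in schedule:
        if max(progress) - min(progress) > SKEW_LIMIT:
            # Co-stop: only lagging vCPUs may run until the skew shrinks.
            granted = {v for v in granted if progress[v] < max(progress)}
            co_stop_ticks += 1
        for v in granted:
            progress[v] += 1
    return co_stop_ticks

# vCPU 1 rarely finds a free physical CPU, so skew builds and co-stop accrues:
contended = [{0}] * 5 + [{0, 1}] * 2 + [{1}] * 3
print("co-stop ticks:", simulate(contended))
```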

What does it mean when the Co-stop Graph has a spike?

  • At that point the VM had to stop processing so that all vCPUs could catch up with one or more that had already been allocated onto a physical CPU. Essentially, at that point there were not enough physical CPUs for all the processing that the VM needed. The odd spike isn’t a big issue, but if you are consistently seeing spikes then you need to consider reducing the number of vCPUs in use in the cluster, particularly on that VM.
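The "odd spike versus consistent spikes" distinction can be automated. The sketch below flags a VM only when co-stop exceeds a threshold in a meaningful fraction of recent samples; the 10% threshold echoes the rule of thumb above, while the 25% spike-fraction cutoff is an illustrative assumption, not a fixed recommendation.

```python
# Distinguish the odd co-stop spike from a consistent pattern: flag a
# VM only when co-stop exceeds the threshold in a meaningful fraction
# of recent samples. The 25% spike fraction is an illustrative choice.

def costop_verdict(costop_pct_samples, threshold=10.0, spike_fraction=0.25):
    spikes = sum(1 for s in costop_pct_samples if s > threshold)
    if spikes == 0:
        return "ok"
    if spikes / len(costop_pct_samples) < spike_fraction:
        return "ok (isolated spike)"
    return "consider reducing vCPUs"

print(costop_verdict([1, 2, 14, 1, 2, 1, 1, 2]))       # one spike in eight samples
print(costop_verdict([12, 15, 2, 13, 14, 11, 2, 12]))  # spikes in most samples
```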

Why didn’t any network related metrics make it into your Top 5 list?

  • Network statistics are much the same as they would be on other platforms. We also don’t tend to find that network capacity at the host is the bottleneck in the way that vCPU counts and memory are.

What are you using to chart this? Is there a chart for this in VMware?

  • The charts in the presentation were created using Syncsort’s Athene tool for Capacity Management. Athene is a cross-platform Capacity Management tool.