CMG Board Member, Igor Trubin, Master Data Engineer, Cloud Engineering, is a contributor to Capital One’s technical blog. You can read more from Igor and the rest of the technical team here.
The public cloud has unlimited capacity, provided you have an unlimited budget. In reality, budgets are never truly unlimited, so organizations need to rightsize their cloud objects to avoid allocating unused or unneeded cloud capacity. In this article, I will discuss some best practices for tracking and reporting on cloud usage, and how cloud cost optimization can demonstrate the efficiency of individual applications, lines of business, or an organization's entire infrastructure.
If we take the statements in the introduction to be true then:
Cloud consumers will often use the following tools to analyze costs and get rightsizing recommendations:
These are powerful tools, but some businesses will find that they are not always as helpful as they need them to be. After all, for businesses that are growing and developing more products, cloud cost management tools will most of the time show growing expenses regardless of rightsizing efforts. The typical trend is shown in the graph below.
That trend is typical in that it reflects the additional spending a business needs over time for new product and tool development, but it can also reflect the impact of not properly rightsizing cloud objects. This can prove a challenge to explain to investors or business owners.
To show how effectively the cloud is used beyond just the CPU, multidimensional capacity utilization reports are a powerful tool to add to your process. The best approach for this type of reporting covers four main subsystems:

- Compute (CPU)
- Memory (RAM)
- Disk I/Os
- Network
Let’s focus here on the main subsystem: the CPU of a virtual cloud server.
To show how effectively compute capacity is used, we should normalize and aggregate all of our different sizes of virtual servers. One approach is to use AWS Elastic Compute Units (ECUs), a comparable indicator of the “horsepower” of servers that can be obtained from the AWS EC2 price list. ECU usage was discussed at the CMG Impact 2020 conference in my talk, Optimizing your Cloud, and the method is becoming common practice. Here is another example of it: Using ECU Based Cost Analysis on AWS for Better Cost Optimization
For instance, based on the AWS price list, the “m5.xlarge” server type has ECU = 16, and the “c5.4xlarge” has ECU = 68.
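These per-type ratings can be kept in a simple lookup table. Below is a minimal sketch in Python; the two ECU values shown come from the AWS price list as cited above, and the table would need to be extended (and verified against the current price list) for a real fleet:

```python
# ECU "horsepower" ratings per instance type, taken from the AWS EC2
# price list. Only two illustrative entries are shown here.
ECU_BY_INSTANCE_TYPE = {
    "m5.xlarge": 16,
    "c5.4xlarge": 68,
}

def ecu(instance_type: str) -> int:
    """Return the ECU rating for an instance type."""
    return ECU_BY_INSTANCE_TYPE[instance_type]

print(ecu("m5.xlarge"))   # 16
print(ecu("c5.4xlarge"))  # 68
```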
This can be aggregated into Compute Capacity Utilization (CCUt). CCUt is expressed as a percentage and has a natural, intuitive way to check progress: the closer to 100%, the better.
How do we tabulate this percentage? Let’s look at Compute Capacity Available and Compute Capacity Used.
Compute Capacity Available is the amount of capacity purchased and available: the overall sum of the ECUs of all servers “i”. It is calculated as follows: CCA = ∑ ECUi
For example, a combined compute capacity of m5.xlarge and c5.4xlarge would be 16+68=84 ECUs.
Compute Capacity Used is how much of that compute capacity has actually been used. It is calculated as follows: CCU = ∑ (ECUi * CPUi% / 100)
Here CPUi% is the CPU utilization of server “i” (an EC2 instance). We could get this from AWS CloudWatch or another performance monitoring tool such as Datadog.
Finally, with these two figures, Compute Capacity Utilization can be calculated as follows: CCUt% = (CCU / CCA) * 100%
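Putting the three formulas together, the calculation can be sketched in a few lines of Python. The ECU values below are from the price list examples above; the CPU utilization percentages are assumed, purely illustrative CloudWatch averages:

```python
# Sketch of the CCA / CCU / CCUt calculation for a small, hypothetical fleet.
fleet = [
    # (instance type, ECU rating, average CPU utilization %)
    ("m5.xlarge", 16, 25.0),   # CPU% values are illustrative assumptions
    ("c5.4xlarge", 68, 10.0),
]

cca = sum(e for _, e, _ in fleet)                    # CCA  = Σ ECUi
ccu = sum(e * c / 100 for _, e, c in fleet)          # CCU  = Σ (ECUi * CPUi% / 100)
ccut = ccu / cca * 100                               # CCUt% = (CCU / CCA) * 100%

print(f"CCA = {cca} ECUs, CCU = {ccu:.1f} ECUs, CCUt = {ccut:.1f}%")
# CCA = 84 ECUs, CCU = 10.8 ECUs, CCUt = 12.9%
```

A CCUt of roughly 13% for this hypothetical fleet would signal significant downsizing opportunity.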
CCA vs. CCU can be used to compare the size and efficiency of cloud usage for two (or more) applications. Below is an example comparing two applications, which shows that application APP_1 has much more opportunity to be downsized.
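The same calculation extends naturally to a per-application comparison. The fleets and utilization figures below are hypothetical, chosen only to illustrate the pattern of a large, mostly idle APP_1 against a smaller, busier APP_2:

```python
def ccut(servers):
    """Return (CCA, CCU, CCUt%) for a list of (ecu, cpu_pct) pairs."""
    cca = sum(e for e, _ in servers)
    ccu = sum(e * c / 100 for e, c in servers)
    return cca, ccu, ccu / cca * 100

# Hypothetical fleets: APP_1 is large but mostly idle; APP_2 is smaller and busier.
app_1 = [(68, 8.0), (68, 12.0), (16, 5.0)]
app_2 = [(16, 55.0), (16, 65.0)]

for name, servers in [("APP_1", app_1), ("APP_2", app_2)]:
    cca, ccu, pct = ccut(servers)
    print(f"{name}: CCA = {cca} ECUs, CCU = {ccu:.1f} ECUs, CCUt = {pct:.1f}%")
```

With these assumed numbers, APP_1 runs at under 10% CCUt while APP_2 is around 60%, making APP_1 the clearer downsizing candidate.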
So far, we have focused in great detail on just one dimension, the CPU subsystem, but the three remaining subsystems should be added to the analysis as well. The calculations are similar to the above.
Why is that important? Because for workloads that are memory- and/or I/O-intensive, downsizing based on compute capacity alone cannot be done correctly without analyzing those additional subsystems.
Considering all four dimensions of capacity usage – compute capacity, RAM, disk I/Os, and network capacity utilization – allows us to see how rightsizing works by showing current capacity usage vs. the usage if all recommendations were implemented. The example below shows how the capacity usage of all four dimensions improves for the two applications from above:
Note: the culprit – the bottleneck or least optimized dimension – may change or stay the same after the recommended rightsizing.
To simplify, one can estimate server utilization as the maximum or the average (simple or weighted) of the four described metrics, called the OR (Operating Ratio).
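The OR reduces the four-dimensional picture to a single number per server. A small sketch, with purely hypothetical utilization percentages; note that the "max" variant also identifies the culprit dimension, since the maximum is by definition the bottleneck:

```python
def operating_ratio(cpu, ram, io, net, method="max"):
    """Estimate server utilization (OR) over the four capacity dimensions.

    method: "max" takes the bottleneck (culprit) dimension;
            "avg" takes the simple average of the four metrics.
    """
    dims = [cpu, ram, io, net]
    return max(dims) if method == "max" else sum(dims) / len(dims)

# Hypothetical utilization percentages for one server: RAM is the bottleneck.
print(operating_ratio(15.0, 70.0, 20.0, 5.0, "max"))  # 70.0
print(operating_ratio(15.0, 70.0, 20.0, 5.0, "avg"))  # 27.5
```

A weighted average is also possible, with weights reflecting which dimensions matter most for the workload type.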
As we have covered:

- Cloud cost management tools alone often show growing expenses regardless of rightsizing efforts.
- Normalizing servers with ECUs allows compute capacity to be aggregated and compared across instance types.
- Compute Capacity Utilization (CCUt) gives an intuitive percentage for tracking rightsizing progress.
- Adding the RAM, disk I/O, and network dimensions makes rightsizing recommendations correct for all workload types.
Igor Trubin, Master Data Engineer, Cloud Engineering
Igor Trubin started in tech in 1979 as an IBM/370 system engineer. In 1986 he received his PhD in Robotics at St. Petersburg Technical University (Russia) and then worked as a professor teaching CAD/CAM and Robotics for about 12 years. In 1999 he moved to the US and began working as a Capacity Planner. After working for more than 2 years as the Capacity team lead for IBM, he worked for SunTrust Bank for 3 years and then at IBM for 2+ years as a Sr. IT Architect. He now works for Capital One as an IT Manager/Master Data Engineer in the Cloud Engineering department, and since 2015 he has been a member of the CMG.org Board of Directors. He runs his tech blog at www.Trub.in and a YouTube channel at https://www.youtube.com/