It’s now common for me to visit customer environments running Fibre Channel SANs where various factions, be they the server, VM, network or Storage teams, complain they are suffering performance issues due to a lack of bandwidth or throughput. In every single instance FC utilization has actually been incredibly low, with peaks of 10% at most, and that's in 4 Gb/s environments, not 8 Gb/s! At worst there may be an extremely busy backup server that single-handedly causes bottlenecks and creates the impression that the whole infrastructure is saturated, but even these occasions are rare. What seems to be the cause of this misconception is the lack of clarity between what is deemed throughput and what actually causes bottlenecks and performance slowdowns, i.e. I/O latency.
Sadly (and I am the first to admit that I was also once duped), Storage folk have been hoodwinked into accepting metrics that just aren’t sufficient for their requirements. Much like the folklore and fables of Santa Claus told to children at Christmas, storage administrators, architects and engineers have been spun a yarn that MB/s and IOPS are somehow an accurate measure of performance and a sound basis for design decisions. In a world where application owners, server and VM admins are busily speaking the language of response times, Storage folk are engrossed in a foreign vocabulary that revolves around RAID levels, IOPS and MB/s, and then in numerous calculations to try and correlate the two languages. But what if an application owner could request Storage with a 10ms response time and the Storage Administrator could allocate it with a guarantee of that performance? That would entail the Storage engineer looking not just at a one-dimensional view from the back end of the Storage Array but at the comprehensive transaction time, i.e. from the Server through the Switch port to the LUN. That would mean considering the Exchange Completion Time.
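To make that idea concrete, here is a minimal Python sketch of how such a request could be validated: instead of quoting RAID levels or IOPS, the measured end-to-end response times are checked against the agreed 10ms target. The sample latencies, the function name and the choice of a 95th-percentile check are my own illustrative assumptions, not part of any vendor tool.

```python
import math

def meets_response_time_target(latencies_ms, target_ms=10.0, percentile=95):
    """Return True if the given percentile of response times is within the target."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: the value below which ~percentile% of samples fall.
    rank = max(0, math.ceil(percentile / 100 * len(ordered)) - 1)
    return ordered[rank] <= target_ms

# Hypothetical per-transaction response times (server -> switch -> LUN and back), in ms.
measured = [2.1, 3.4, 2.8, 9.7, 4.2, 3.1, 12.5, 2.9, 3.3, 4.0]

print(meets_response_time_target(measured))  # False: the 95th percentile exceeds 10 ms
```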
To elaborate, using MB/s as a measurement of performance is almost akin to counting cars as a measurement of road traffic. Harking back to my days as a student, before all of the high-tech cameras and satellites that now monitor road traffic, I was ‘lucky’ enough to have a job counting the number of cars that went through Trafalgar Square at lunchtime. It was an easy job: I'd see five cars and I'd click five times. But this was hardly accurate, because when there was a traffic jam and all of the lanes were occupied I was still clicking the same five cars, even though they were barely moving. Herein lies the problem with relying on MB/s as a measurement of performance. As with the car counting, a more accurate approach would have been to watch each individual car and measure its time from origin to destination. In the same vein, to truly measure performance in a SAN Storage infrastructure you need to measure how long a transaction takes from being initiated by the host, to being received by the storage, to being acknowledged back to the host, in real time as opposed to averages. This is what is termed the Exchange Completion Time.
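As a rough illustration of the difference, here is a small Python sketch that derives a per-exchange completion time from the moment the host issues the command to the moment it sees the final status, and shows how an average understates the slow exchanges that actually hurt the application. The exchange IDs and timestamps are hypothetical; in practice they would come from a fabric-level capture or analyser, not from the array's back-end counters.

```python
# Hedged sketch: per-exchange completion time versus a misleading average.
# Timestamps are hypothetical, in seconds.

exchanges = [
    # (exchange id, host command initiated, final status seen by host)
    ("0x1a01", 100.0000, 100.0021),
    ("0x1a02", 100.0005, 100.0019),
    ("0x1a03", 100.0010, 100.0930),  # a slow exchange, ~92 ms
    ("0x1a04", 100.0012, 100.0031),
]

# Exchange Completion Time per exchange, in milliseconds.
completion_times_ms = {xid: (end - start) * 1000.0 for xid, start, end in exchanges}

for xid, ect in completion_times_ms.items():
    print(f"exchange {xid}: {ect:.1f} ms")

average = sum(completion_times_ms.values()) / len(completion_times_ms)
worst = max(completion_times_ms.values())
print(f"average ECT: {average:.1f} ms, worst ECT: {worst:.1f} ms")
# The per-exchange view exposes the ~92 ms outlier immediately; the mean
# understates it considerably, and a raw MB/s figure would show nothing at all.
```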
While many storage arrays have tools that provide information on IOPS and MB/s, to get a better picture of a SAN Storage environment and its underlying latency it's also key to consider the number of frames per second. In Fibre Channel a Frame is comparable to a word, a Sequence to a sentence and an Exchange to the conversation. A standard FC Frame has a data payload of 2112 bytes, i.e. roughly a 2K payload. So, for example, an application that issues an 8K I/O will require 4 FC Frames to carry that data portion. In this instance one I/O equates to 4 Frames, and 100 IOPS of the same size equates to 400 Frames per second. Hence, to get a true picture of utilization, looking at IOPS alone is not sufficient, because I/O sizes differ enormously between applications, ranging from 2K up to 256K, and with backup applications the I/O sizes can be larger still. It is therefore a mistake not to take the frames per second into consideration when trying to measure SAN performance or to identify whether data is being passed efficiently.

For example, even if you are witnessing high throughput in MB/s you may be missing the fact that frames are carrying a minimal data payload and the Exchange (the conversation) is failing to complete. This is often the case when there's a slow-drain device, a flapping SFP or similar in the FC SAN network, where instead of data frames making up the traffic you have a stream of management frames dealing with issues such as logins and logouts, loss of sync, or some other optic degradation or physical-layer problem. Imagine the scenario: a Storage Administrator is measuring the performance of their infrastructure, or troubleshooting a performance issue, and is seeing lots of traffic in MB/s – unaware that many of the environment's transactions are actually being cancelled across the Fabric!
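To put rough numbers on the frame arithmetic above, here is a short Python sketch, assuming the standard maximum data payload of 2112 bytes per frame and ignoring headers, CRC and any other protocol overhead, that converts an I/O size and an IOPS figure into data frames per second. The function names are my own; the 8K example matches the one in the text.

```python
import math

FC_MAX_DATA_PAYLOAD = 2112  # bytes of data payload in a standard FC frame (~2 KB)

def frames_per_io(io_size_bytes: int) -> int:
    """Number of data frames needed to carry one I/O of the given size."""
    return math.ceil(io_size_bytes / FC_MAX_DATA_PAYLOAD)

def frames_per_second(iops: int, io_size_bytes: int) -> int:
    """Data frames per second generated by a workload of the given IOPS and I/O size."""
    return iops * frames_per_io(io_size_bytes)

# The article's example: an 8K I/O needs 4 frames, so 100 IOPS of 8K is ~400 frames/sec.
print(frames_per_io(8 * 1024))            # 4
print(frames_per_second(100, 8 * 1024))   # 400

# The same IOPS figure with different I/O sizes produces very different frame rates.
for size_kb in (2, 8, 64, 256):
    print(f"{size_kb:>3} KB I/O @ 100 IOPS -> {frames_per_second(100, size_kb * 1024)} frames/sec")
```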
Monitoring your FC SAN Storage environment in a comprehensive manner that incorporates the SAN fabric and provides metrics such as the Exchange Completion Time rapidly changes FC SAN troubleshooting from a reactive to a proactive exercise. It also enables Server, Storage and Application administrators to share a common language of ‘response times’, eliminating potential silos. With knowledge of application I/O latency down to the millisecond, FC SAN Storage administrators can quickly be transformed from the initial point of blame into the initial point of resolution, while also ensuring optimum performance and availability of their mission-critical data.