Wednesday, August 31, 2016

VMware SCSI Controller Options

This blog article describes the different SCSI controllers in VMware ESXi and why one can be a better choice than another. I have had multiple discussions with customers at my current job about why, and in which situations, the LSI Logic SAS or Parallel adapter makes more sense than the Paravirtual SCSI adapter (PVSCSI), and I didn’t really find a good blog article or KB explaining what I think is needed to really understand the differences. What I see in most environments is the standard adapter for the chosen operating system, and in many cases that is absolutely fine and works well. The problem starts when there is a limitation somewhere, but how do you find that out? Let’s start from the beginning. With the current version, ESXi 6.0, there are five SCSI controller options, which are shown in the following table:

SCSI controller comparison

| Adapter Type | OS Type | Minimum Requirements | Maximum SCSI Adapters | Use Cases |
|---|---|---|---|---|
| BusLogic Parallel | Server | - | 4 | 15 devices per controller, issues with 64-bit OS, 2 TB VMDK limit, VMware suggests migrating off this adapter |
| LSI Logic Parallel (formerly LSI Logic) | Server/Desktop | - | 4 | 15 devices per controller, required for Microsoft Cluster Service (older than Windows 2008) |
| LSI Logic SAS | Server/Desktop | HWv7 | 4 | 15 devices per controller, the standard SCSI controller for most operating systems, required for MSCS (Windows 2008 or newer) |
| PVSCSI | Server | HWv7 | 4 | 15 devices per controller, lower CPU cost in many higher I/O use cases, suggested for high I/O use cases, no MSCS support for ESXi 5.5 U2 and lower |
| AHCI SATA | Server/Desktop | HWv10 | 4 (on top of the existing SCSI controllers) | 30 devices per controller, not recommended for high I/O environments, not as efficient as LSI Logic SAS or PVSCSI |
Table 1: Adapter Types


AHCI SATA controllers don’t really bring a huge benefit to the table, but this controller can help if there is a need for a large number of additional disks and performance is not the main concern. You can add AHCI SATA controllers on top of the maximum supported SCSI controllers you already use.


It is also good to know since when the different adapters have been supported, which is shown in the following table.


| Feature | ESXi 6.0 | ESXi 5.5 | ESXi 5.1 | ESXi 5.0 | ESXi 4.x | ESXi 3.5 |
|---|---|---|---|---|---|---|
| Hardware Version | 11 | 10 | 9 | 8 | 7 | 4 |
| Supported SCSI Adapters | BusLogic, LSI Parallel, LSI SAS, PVSCSI, AHCI | BusLogic, LSI Parallel, LSI SAS, PVSCSI, AHCI | BusLogic, LSI Parallel, LSI SAS, PVSCSI | BusLogic, LSI Parallel, LSI SAS, PVSCSI | BusLogic, LSI Parallel, LSI SAS, PVSCSI | BusLogic, LSI Logic |
Table 2: ESXi Support

x86 Architecture

So I think you agree that the two really interesting adapters are the LSI Logic SAS and the PVSCSI adapter. What does the word “Paravirtual” even mean? Before getting into that, let’s first understand where the drivers sit in the stack. In an x86 architecture you will always find four privilege levels. They are called rings or CPU modes and are numbered Ring 0 to Ring 3. Traditional operating systems like Windows use only two rings because, at the time, not all available processors supported more than two modes. These are Ring 3 for all applications as the least privileged ring (user mode) and Ring 0 as the most privileged one (kernel mode). So every time a user application wants to use the hardware, the CPU has to switch from user mode into kernel mode. If you would like to learn more about kernel vs. user mode, I suggest reading this blog article. The following figure shows the traditional modes in an x86 architecture.
Figure 1: Traditional x86 Architecture

Binary translation using VMM

In a virtualized environment the hypervisor itself sits on top of the physical hardware, so it becomes very difficult for a guest OS to run in Ring 0 because Ring 0 is now in use by the hypervisor. What makes this even more complicated is the fact that some instructions can only be executed in Ring 0. So what to do now? VMware introduced binary translation techniques that allow the Virtual Machine Monitor (VMM) to run in Ring 0. This helps the VM because it can now execute these instructions with the help of the VMM in Ring 0. For the application itself everything stays as it is. The following figure shows what that looks like.


Figure 2: Binary Translation of OS Requests

Paravirtual implementation in ESXi

The name Paravirtual SCSI adapter is a bit of a misnomer, as all the virtual hardware in a guest VM is paravirtual. The same is true for the VMXNET3 adapter, which also uses a VMware-specific driver. For both you need to install a driver in the guest OS to be able to use these adapters. For VMXNET3 this already happens when VMware Tools is installed.

The paravirtual driver has access to the ESXi kernel and does not need to communicate with the system hardware via the VMM. It makes a "hypercall" to ESXi for certain critical operations like scheduling, interrupts and memory management, so from the driver's perspective it is pretty much a direct channel to the kernel. The PVSCSI adapter in general offers better performance with lower CPU usage compared to the other SCSI controller options. But as with everything, it depends. I will compare the LSI Logic SAS and the PVSCSI adapter a bit more in the next section. The following figure shows the general implementation of the "hypercalls" using a paravirtual adapter.


Figure 3: Paravirtual implementation in ESXi

Coalescing

Coalescing is a synonym for merging, joining or assembling, but what does that have to do with a SCSI controller in a guest VM? Very simple: it optimises I/O in a very intelligent way. Let me point out what coalescing means:


  • A technique for storage driver efficiency
  • Coalescing can be thought of as buffering, where multiple events are queued for simultaneous processing
  • Improves efficiency and reduces interrupts, but I/O must stream fast enough to create a large batch request
  • If the incoming stream of I/O is too slow, the timeout window will pass and the I/O will be delayed unnecessarily
  • Both the LSI Logic SAS and the PVSCSI adapter coalesce interrupts, but they weigh two measures differently:


    • Outstanding I/O: VM demand for I/O
    • IOPS: Storage system supply of I/O
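
The timeout effect described in the list above can be illustrated with a toy model. This is a minimal sketch and not how the LSI Logic SAS or PVSCSI drivers are implemented: the policy (raise one interrupt when `batch_size` completions are pending, or when `timeout` has passed since the oldest pending completion), the function name and all numbers are assumptions chosen for illustration. Times are in microseconds.

```python
def coalesce(completion_times, batch_size, timeout):
    """Toy interrupt coalescing model (illustrative assumption only).

    completion_times must be sorted. Returns a tuple of
    (number of interrupts raised, mean delay added per I/O).
    """
    if not completion_times:
        return 0, 0.0
    interrupts = 0
    total_delay = 0
    pending = []
    for t in completion_times:
        # The timeout window of the pending batch expired before this I/O.
        if pending and t - pending[0] >= timeout:
            fire = pending[0] + timeout
            total_delay += sum(fire - p for p in pending)
            interrupts += 1
            pending = []
        pending.append(t)
        # The batch is full: raise one interrupt for all pending I/Os.
        if len(pending) == batch_size:
            total_delay += sum(t - p for p in pending)
            interrupts += 1
            pending = []
    if pending:
        # Whatever is left over waits for the timeout.
        fire = pending[0] + timeout
        total_delay += sum(fire - p for p in pending)
        interrupts += 1
    return interrupts, total_delay / len(completion_times)


# Fast stream: 8 completions, 10 us apart -> 2 interrupts instead of 8.
print(coalesce(list(range(0, 80, 10)), batch_size=4, timeout=100))  # (2, 15.0)

# Slow stream: 4 completions, 1000 us apart -> no interrupt saved,
# and every I/O waits for the full timeout window.
print(coalesce([0, 1000, 2000, 3000], batch_size=4, timeout=100))   # (4, 100.0)
```

In the fast case the batch fills before the timeout, so interrupts are saved at a small latency cost. In the slow case every I/O still gets its own interrupt and only picks up the extra delay.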


Let's now compare how this works differently with the two adapters.


  1. LSI SAS: The driver increases coalescing as Outstanding I/O (OIO) and IOPS increase. No coalescing is used with few OIO or low throughput, so the driver is very efficient where OIO and I/O throughput are small.


  2. PVSCSI: The driver coalesces based on OIO only, not on throughput. When the VM is doing a lot of I/O but the storage does not deliver right away, the PVSCSI driver coalesces interrupts. Without the storage supplying a steady stream of I/Os there are obviously no interrupts to coalesce. As a result there is slightly increased latency and no gain for the PVSCSI controller in low-throughput environments.


  3. CPU Cost: The difference between the LSI Logic SAS and the PVSCSI controller at very low IOPS is not measurable, but with larger numbers of IOPS the PVSCSI controller saves a huge amount of CPU cycles.


According to VMware KB 1017652, any workload with a peak of 2,000 IOPS and 4 OIO or more is a good reason to use the PVSCSI adapter. For later versions of ESXi, VMware suggests using the PVSCSI adapter also with lower IOPS and OIO requirements.
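
The rule of thumb from KB 1017652 can be written down as a small helper. This is only a sketch: the function name is hypothetical, and treating the two thresholds as "both must be reached" is an assumption about how to read the guideline, not something the KB spells out here.

```python
IOPS_THRESHOLD = 2000  # peak IOPS from the guideline above
OIO_THRESHOLD = 4      # outstanding I/Os from the guideline above


def suggest_scsi_adapter(peak_iops, outstanding_io):
    """Return the adapter suggested by the 2,000 IOPS / 4 OIO rule of thumb.

    Assumption of this sketch: both thresholds must be reached before
    PVSCSI is suggested.
    """
    if peak_iops >= IOPS_THRESHOLD and outstanding_io >= OIO_THRESHOLD:
        return "PVSCSI"
    return "LSI Logic SAS"


print(suggest_scsi_adapter(peak_iops=5000, outstanding_io=16))  # PVSCSI
print(suggest_scsi_adapter(peak_iops=300, outstanding_io=1))    # LSI Logic SAS
```

Since the suggestion for later ESXi versions is to use PVSCSI with lower requirements as well, such a helper is at most a starting point for older environments.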

Queue Depth

To maximize performance, virtual disks should be distributed across multiple vSCSI adapters. A maximum of 4 vSCSI adapters can be configured per VM, with a maximum of 15 vDisks per vSCSI adapter. By using multiple vSCSI adapters you open up more I/O queues. The following table shows the queue depths of the PVSCSI adapter compared to the LSI Logic SAS adapter.


| Queue | PVSCSI | LSI Logic SAS |
|---|---|---|
| Default Adapter Queue Depth | 256 | 128 |
| Maximum Adapter Queue Depth | 1024 | 128 |
| Default Virtual Disk Queue Depth | 64 | 32 |
| Maximum Virtual Disk Queue Depth | 254 | 32 |
Table 3: Queue Depth
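
To see why spreading disks over several adapters opens up more queues, the default values from Table 3 can be plugged into a simple estimate. This is a rough sketch under a stated assumption: each adapter can keep in flight at most the smaller of its own queue depth and the sum of the queue depths of its disks. The function and constant names are illustrative and not part of any VMware tool.

```python
# Default queue depths taken from Table 3.
DEFAULT_QUEUE_DEPTHS = {
    "PVSCSI": {"adapter": 256, "disk": 64},
    "LSI Logic SAS": {"adapter": 128, "disk": 32},
}

MAX_ADAPTERS_PER_VM = 4
MAX_DISKS_PER_ADAPTER = 15


def max_outstanding_io(adapter_type, disks_per_adapter):
    """Estimate how many I/Os a VM can keep in flight.

    disks_per_adapter has one entry per vSCSI adapter, giving the number
    of virtual disks attached to it. Assumption of this sketch: an adapter
    is capped by min(adapter queue depth, disks * disk queue depth).
    """
    if len(disks_per_adapter) > MAX_ADAPTERS_PER_VM:
        raise ValueError("at most 4 vSCSI adapters per VM")
    if any(n > MAX_DISKS_PER_ADAPTER for n in disks_per_adapter):
        raise ValueError("at most 15 virtual disks per vSCSI adapter")
    depths = DEFAULT_QUEUE_DEPTHS[adapter_type]
    return sum(min(depths["adapter"], n * depths["disk"])
               for n in disks_per_adapter)


# Eight disks on one adapter versus two disks on each of four adapters.
print(max_outstanding_io("PVSCSI", [8]))                  # 256
print(max_outstanding_io("PVSCSI", [2, 2, 2, 2]))         # 512
print(max_outstanding_io("LSI Logic SAS", [8]))           # 128
print(max_outstanding_io("LSI Logic SAS", [2, 2, 2, 2]))  # 256
```

With the same eight disks, the four-adapter layout doubles the estimate in both cases, because the single adapter queue is the limit in the one-adapter layout.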

Tuning of PVSCSI

There is a very good KB, VMware KB 2053145, that you should follow to understand how to tune the different values in ESXi itself as well as in the VM. Two settings can be tuned:


  • Adjust the queue depth for the HBAs on the ESXi host
  • Increase the PVSCSI queue inside the Windows or Linux guest


Note: The default number of ring pages is 8, each of which is 4 KB in size. One entry of 128 bytes is needed for a queued I/O, which means that 32 entries fit on a single page of 4096 bytes, resulting in 256 entries (8x32). The Windows PVSCSI driver adapter queue was hard coded in previous releases, but it can be adjusted up to the maximum of 32 pages since the versions delivered with VMware Tools.

I took this note from the KB itself, but what does it mean? Let me explain what ring pages and queue depth really stand for, as they are often misunderstood.


Ring Pages:
The 128-byte entry describes the I/O. It is not the actual memory that the I/O is directed to or from. In other words, the pages used for the ring buffer describe the actual I/O operation; they are not used for the I/O itself. Think of them as holding pointers to the pages that will be used for the DMA operation. A portion of a (non-ring) page may be used for one I/O and another portion, possibly even an overlapping one, may be used for another I/O.


Queue Depth:
The queue depth is a number that in the case of PVSCSI reflects the limits of the adapter.
The adapter uses 8 ring pages and thus can support a queue depth of 256. It is really an artificial number, since PVSCSI is not a real device but a VMware paravirtualized SCSI device. For other devices (real adapters), however, it reflects an actual hardware limit: the ring is in the hardware, it has a limit, and hence so does the queue depth. The queue depth is per adapter. So if you have 4 PVSCSI adapters, or 4 LSI adapters for that matter, you will get 4 * Queue Depth. As a consequence, there will be 4 * Ring Pages as well.


LSI Logic:
It is a similar construct. The only difference is that the hardware controller has an upper limit based on real queue resources on the HBA. However, the driver may artificially allow you to use a lower value than what the hardware can support. If the driver lets you set a higher value than what the HBA claims, the excess I/O will be queued in the driver instead of in the ESXi SCSI mid-layer queues. In that case it will be accounted as part of the queueing delays.
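
The ring-page arithmetic from the KB note can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function and constant names are illustrative.

```python
PAGE_SIZE_BYTES = 4096   # one ring page is 4 KB
ENTRY_SIZE_BYTES = 128   # one ring entry describes one queued I/O


def ring_entries(ring_pages):
    """Number of I/O descriptors that fit into the given number of ring pages."""
    entries_per_page = PAGE_SIZE_BYTES // ENTRY_SIZE_BYTES  # 32
    return ring_pages * entries_per_page


print(ring_entries(8))      # 256, the default adapter queue depth
print(ring_entries(32))     # 1024, the maximum with 32 ring pages
print(4 * ring_entries(8))  # 1024 entries in total across 4 default adapters
```

The two results match the default and maximum PVSCSI adapter queue depths in Table 3.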

Conclusion

So what is the conclusion here? As with everything in life, I would say it depends on what you need. My perspective, though, is that you should use the PVSCSI controller more often rather than less. At my former company we were running 1500+ VMs. At one point, I guess in 2010, when we were 80-90% virtualized, I asked myself how the hell we could find out where the LSI and where the PVSCSI controller makes sense. Knowing that the monitoring tools at that time were much worse than today, I decided as the Infrastructure Architect that we would use the PVSCSI controller in every template. No matter if it was a small site with a single ESXi server and 3 VMs or the big ERP systems in the centralized datacenters, we used the same configuration everywhere. In existing VMs we replaced the existing SCSI controller with the PVSCSI adapter in the next maintenance window. Maybe in the end it was not beneficial for all the small VMs with very low I/O, but it certainly helped the VMs that required higher I/O. Obviously this was my own decision, but in my opinion it was the right one, as downtime to change a SCSI controller in a VM is always a cost.
Tuning, on the other hand, is something I would do very selectively, because it depends on how your storage system performs; otherwise tuning these parameters simply does not make any difference. A few years back I thought that adapting the queue depth on the HBA or SCSI controller would always help improve performance, but that really depends on what your storage system and the stack between the server and the storage can deliver. If nobody is taking the I/O out of the queue, it does not make sense to fill it up with many I/Os, but if the storage system performs very well you can most likely solve your bottleneck on the VM side this way. What does it mean that the storage system performs very well? That is something you have to find out with testing and tweaking. The easiest way is to use local OS tools like Perfmon in Windows and see if your average disk queue length is always at the limit of the adapter. Also, additional PVSCSI adapters only bring a benefit when the storage is able to handle the load. I remember one colleague who always wanted to configure as many separate drives and controllers as possible to spread the load, but if your storage is the bottleneck it just complicates the configuration. So more is not always more: keep it simple so that it works well for 98% of your systems, and then focus on the last 2% and tweak them if there is a need.
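
The Perfmon check mentioned above can be turned into a quick calculation once the queue length samples have been exported. This is a minimal sketch: the function name, the 90% "near the limit" margin and the sample values are assumptions for illustration, and the samples would have to come from your own monitoring export.

```python
def queue_saturation(samples, queue_limit, near_limit=0.9):
    """Fraction of queue length samples at or near the queue limit.

    near_limit is an illustrative margin: a sample counts as saturated
    when it reaches at least near_limit * queue_limit.
    """
    if not samples:
        raise ValueError("need at least one sample")
    busy = sum(1 for s in samples if s >= near_limit * queue_limit)
    return busy / len(samples)


# Hypothetical samples against a virtual disk queue depth of 64.
print(queue_saturation([64, 60, 63, 10], queue_limit=64))  # 0.75
print(queue_saturation([1, 2, 3], queue_limit=64))         # 0.0
```

A value that stays close to 1.0 over a long period suggests the queue is the limit on the VM side, while a low value suggests that tuning the queue depth will not change much.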

If you have any questions please let me know.

