We have been working on implementing M10 vGPUs in our VMware environment and have been experiencing performance issues. We worked with NVIDIA to verify that the environment is set up correctly. Here is a quick bullet-point list of the environment:
- vSphere 6.7
- Hosts run VMware ESXi 6.7.0
- PowerEdge R740
- Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
- 768 GB RAM
- 2x M10 GPUs
- Horizon 7.4.0
- Linked Clones
- Windows 10 1803 (also tried 1809 & 1709 builds)
- 4 cores
- 8 GB RAM
- M10-1B vGPU profile
- Teradici-based zero clients (PCoIP)
What we noticed: after a first small round of user testing with no issues, we began a larger test and found that once we hit about 15 users per M10 we started getting reports of performance problems. Users see lag in the interface; a right-click on the desktop can take 5-10 seconds for the context menu to appear, and the same happens with the Start menu. These issues occur only on vGPU VMs; non-vGPU VMs on the same host do not experience the slowdowns.

In Task Manager we noticed that pcoip_server_win32.exe was using a lot of CPU and GPU time. We tried different versions of the VMware agent and direct connect, various revisions of the ESXi driver, and fresh standalone copies of Windows 10 at various build numbers. So far no combination we have attempted has resolved the issue for vGPU machines once there are more than a few users per host. The performance problem appears even if we use the Horizon software client and have it set to PCoIP.
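As a side note, the per-process load we were reading off Task Manager can also be logged from the command line with Windows performance counters, which makes it easier to correlate with session counts. This is just a sketch: the counter paths are from the standard Windows 10 counter set, and the process instance name is my assumption based on the executable name, so verify them on your hosts first.

```
:: CPU time of the PCoIP server process, sampled every 5 seconds
typeperf "\Process(pcoip_server_win32)\% Processor Time" -si 5

:: Per-engine GPU utilization (per-process GPU counters exist on Windows 10 1709 and later)
typeperf "\GPU Engine(*)\Utilization Percentage" -si 5
```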
We tried running it with different Horizon Agent versions (6, 7.0.2, 7.2.0, 7.4.0, 7.5.1, 7.6 & 7.7) and also using direct connect, bypassing the Horizon Server.
We also tried running VMs with the VMware Blast protocol, which did not have the high GPU usage issue, but unfortunately almost all of our thin clients only support PCoIP.
Attaching a screenshot below; please note the GPU utilization of the PCoIP Server (32-bit) process.
UPDATE 1:
After a long discussion with NVIDIA, they concluded that the issue is not on their side.
They pointed us at this KB: https://nvidia.custhelp.com/app/answers/detail/a_id/4156/~/nvidia-smi-shows-high-gpu-utilization-for-vgpu-vms-with-active-horizon-session
Looks like it's a known issue with the Teradici PCoIP protocol that hasn't been fixed yet.
UPDATE 2:
I tried downloading and installing Teradici's PCoIP agent (PCoIP_agent_release_installer_graphics.exe) directly from Teradici, then ran "NvFBCEnable.exe -disable", which disables NvFBC capture and switches back to CPU-based capture. That works great: no GPU spike when idle and much better performance overall.
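For anyone following along, the workaround on the standalone Teradici agent amounts to the following, run from an elevated command prompt inside the VM. NvFBCEnable.exe ships with the NVIDIA guest driver package; the -checkstatus flag is my assumption from NVIDIA's tooling docs, so run the tool with no arguments to confirm the flags on your build.

```
:: Check whether NvFBC capture is currently enabled
NvFBCEnable.exe -checkstatus

:: Disable NvFBC so the PCoIP server falls back to CPU-based capture
NvFBCEnable.exe -disable
```

A session reconnect (or reboot) may be needed before the change takes effect.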
However, when I try this with the Horizon Agent's Teradici protocol, it disables NvFBC only briefly; as soon as I reconnect via PCoIP it re-enables it. See the extract from the log below:
- Svgadevtap:NvFBC–Fixed capture by enabling NvFBC
Is there a way to permanently disable NvFBC on the Horizon Agent's Teradici protocol?