PVS Failover graceful - a network view

Patrick Matula included in categories Citrix Troubleshooting PVS

2022-08-07

The setup

Our environment has the following servers:

Citrix Provisioning Server (version 2206):
- ctxpvs1 (192.168.0.30)
- ctxpvs2 (192.168.0.31)
Target-Device:
- ctxvda1master (192.168.0.103)

Failover process

We want to look at how a target device fails over from one PVS server to the other. To simulate the failover, we will stop the Citrix PVS Stream Service via services.msc.

Requirements

As general information: PVS HA - docs.citrix.com.

For a successful failover, the following is necessary:

The vDisk must be exactly the same on all PVS servers (even different timestamps are problematic).
The vDisk must be set to Use the load balancing algorithm in the PVS console.
- The subnet affinity setting Best Effort is also fine and allows failover across subnet boundaries. Fixed prohibits failover across subnet boundaries. Reference: CTX138933
The PVS servers and the target devices must be able to reach each other over the network.
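
The first requirement can be checked with a small script. The following is only a minimal Python sketch, assuming both copies of the vDisk file can be read from one machine (for example via a share). The function names and the example paths are hypothetical and not part of any Citrix tooling. It compares size, modification time and a SHA-256 hash of the two files.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large vDisk does not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_vdisk_copies(path_a, path_b):
    """Return which properties of the two vDisk copies match."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return {
        "size": stat_a.st_size == stat_b.st_size,
        # Compared at whole-second resolution (an assumption of this sketch).
        "mtime": int(stat_a.st_mtime) == int(stat_b.st_mtime),
        "content": file_sha256(path_a) == file_sha256(path_b),
    }


# Example call with hypothetical share paths (adjust to the real store):
# compare_vdisk_copies(r"\\ctxpvs1\Store\vdisk.vhdx", r"\\ctxpvs2\Store\vdisk.vhdx")
```

If any of the three values is False, the copies differ in a way that may prevent a failover.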

For network connectivity, we check the Citrix port matrix.

Failover

In the test lab, the target device (ctxvda1master - 192.168.0.103) is connected to the PVS server ctxpvs2 (192.168.0.31). We stop the Citrix PVS Stream Service, and at the point where the failover should happen, nothing happens: the target device hangs.

Troubleshooting

PVS is a network product, so it makes sense to take a network trace. One way to do this is as follows (Windows Server 2019 and later):

network tracing: 
pktmon start --capture
{reproduce the issue}
pktmon stop
pktmon etl2pcap PktMon.etl --out PktMon.pcapng
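
To get a quick overview of the converted capture, the UDP packets can be counted per flow. The following Python sketch assumes the packets were exported to CSV with the columns source IP, destination IP, source port and destination port. One way to get such a file would be tshark with `-T fields -E separator=, -e ip.src -e ip.dst -e udp.srcport -e udp.dstport`; that export step is an assumption and not part of the workflow above. The function name is hypothetical.

```python
import csv
import io
from collections import Counter


def count_udp_flows(csv_text):
    """Count packets per (src_ip, src_port, dst_ip, dst_port) flow."""
    flows = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines, header rows and rows without numeric UDP ports.
        if len(row) != 4 or not row[2].isdigit() or not row[3].isdigit():
            continue
        src_ip, dst_ip, src_port, dst_port = row
        flows[(src_ip, int(src_port), dst_ip, int(dst_port))] += 1
    return flows


# Illustrative rows only, not taken from the real trace.
sample_csv = """192.168.0.31,192.168.0.103,6930,6905
192.168.0.31,192.168.0.103,6930,6905
"""

for flow, packet_count in count_udp_flows(sample_csv).most_common():
    print(flow, packet_count)
```

Sorting by packet count makes the streaming flow stand out, so that rare flows on other ports are easier to spot.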

In theory, a CDF trace (using CDFControl) would also be useful, but Citrix does not provide public symbols for StreamProcess.exe (although it does for SoapServer.exe!). I am assuming up front that CDF traces of SoapServer.exe would not help here.

How it works

To troubleshoot the problem at all, we first need to know what is communicating and how. The port matrix shows the following communication between the target device and the provisioning server:

Source	Destination	Type	Port	Details
Target Device	PVS Server	UDP	6910-6930	vDisk Streaming
Target Device	PVS Server	UDP	6901,6902,6905	??

In the default configuration (which can be changed) the UDP ports 6910-6930 are responsible for the “content” streaming (i.e. the content from the vDisk to the target device). But there are still the ports 6901, 6902 and 6905. I’m not aware of any publicly available documentation that describes what exactly these ports are used for.
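
When reading a trace, it helps to label each UDP port with what the port matrix says about it. The following Python sketch encodes only the two table rows above, assuming the default port configuration. The names `STREAMING_PORTS`, `UNDOCUMENTED_PORTS` and `describe_port` are hypothetical.

```python
# Default range from the port matrix; it can be changed in the PVS configuration.
STREAMING_PORTS = range(6910, 6931)  # 6910-6930 inclusive
# Listed in the port matrix without a description of their purpose.
UNDOCUMENTED_PORTS = {6901, 6902, 6905}


def describe_port(port):
    """Map a UDP port to the role given in the table above."""
    if port in STREAMING_PORTS:
        return "vDisk streaming"
    if port in UNDOCUMENTED_PORTS:
        return "listed, purpose undocumented"
    return "not in the target device rows of the port matrix"


for port in (6905, 6930, 6902):
    print(port, describe_port(port))
```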

Analysis

The “normal” streaming activity from vDisk to the target device looks like this in network traffic:

ctxpvs2 sends data from port 6930 to port 6905 on the target device. Port 6905 on the target device is the service where the vDisk data is processed.

What actually happens on ctxpvs1 (the PVS server to which the target device should failover):

Two ports are used here: 6903 as well as 6895. The port 6895 is specified in the port matrix under “Inter-server communication”, so we can match that. This was actually all the communication between the two PVS servers.

If we look at the network traffic from ctxpvs2, we see the following regarding the failover:

Packet 9013 is the last packet sent as a “normal” streaming packet. After that, we see a new UDP stream in which the PVS server tries to contact the target device on port 6902. This port is blocked on the target device (because it is not specified in the port matrix).

We also see the packet on the target device:

There is no response to the request from the target device. Whatever port 6902 is responsible for, it looks like ctxpvs2 wants to say to the target device ctxvda1master: “My Citrix PVS stream service is stopped, please switch to the other PVS server”.

If port 6902 is allowed on the target device firewall, the target device network trace looks like this:

So again we see a UDP packet from ctxpvs2 to ctxvda1master on port 6902, but this time with an important difference: the target device now contacts the other PVS server (ctxpvs1) on port 6910. After that, we see some more communication between ctxvda1master (port 6901) and ctxpvs1 (port 6930). At the end of the network trace, we see the already familiar pattern between ports 6905 and 6930.
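
The pattern described above can be searched for in a list of packets. The following Python sketch is based only on the behavior observed in this trace, not on any documented protocol: it looks for a packet to the target device on port 6902, followed by the target device contacting a different server on a streaming port. The function name and the sample packets are hypothetical.

```python
def find_graceful_failover(packets, target_ip, notify_port=6902,
                           streaming_ports=range(6910, 6931)):
    """Return (old_server, new_server) if the observed failover pattern occurs.

    packets is an ordered list of (src_ip, src_port, dst_ip, dst_port) tuples.
    """
    old_server = None
    for src_ip, _src_port, dst_ip, dst_port in packets:
        if old_server is None:
            # Step 1: a server contacts the target device on the notify port.
            if dst_ip == target_ip and dst_port == notify_port:
                old_server = src_ip
        elif (src_ip == target_ip and dst_ip != old_server
              and dst_port in streaming_ports):
            # Step 2: the target device contacts a different server.
            return old_server, dst_ip
    return None


# Illustrative packets following the described pattern.
# Source ports marked "hypothetical" are not taken from the trace.
sample_packets = [
    ("192.168.0.31", 6930, "192.168.0.103", 6905),  # normal streaming
    ("192.168.0.31", 6911, "192.168.0.103", 6902),  # source port hypothetical
    ("192.168.0.103", 6901, "192.168.0.30", 6910),  # source port hypothetical
    ("192.168.0.30", 6930, "192.168.0.103", 6905),  # streaming from ctxpvs1
]

print(find_graceful_failover(sample_packets, "192.168.0.103"))
```

If port 6902 is blocked, the second step never appears in the trace and the function returns None.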

Wait a minute, the failover works for me…

Obviously, the Citrix port matrix is missing an entry for a graceful failover. But most people who install the target device software will not run into this problem. Why? Because the setup automatically creates a firewall rule, namely this one:

firewall rule added by target device setup

Summary

In summary, based on the above analysis, it seems that a firewall rule allowing port 6902 on the target device is necessary to guarantee a graceful failover. The target device setup creates this firewall rule. However, it is missing from the Citrix documentation. Additionally, the system requirements state that port 6901 must be allowed on the target device. The setup does not create a local firewall rule for this requirement, nor is it listed in the port matrix.

A good approach is probably to open all of these UDP ports (6901, 6902, 6905) between the PVS servers and the target devices in the firewall to avoid current and future problems.
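
For the Windows firewall on the target device, such a rule could be created with `netsh advfirewall`. The following Python sketch only builds the command line so it can be reviewed before it is run. The rule name is hypothetical, and restricting the rule to the two lab PVS servers via `remoteip` is an additional assumption, not something the target device setup does.

```python
def build_netsh_rule(name, ports, remote_ips):
    """Build a netsh command for an inbound UDP allow rule."""
    port_list = ",".join(str(port) for port in sorted(ports))
    ip_list = ",".join(remote_ips)
    return (
        f'netsh advfirewall firewall add rule name="{name}" '
        f"dir=in action=allow protocol=UDP "
        f"localport={port_list} remoteip={ip_list}"
    )


command = build_netsh_rule(
    "PVS target device UDP (example)",  # hypothetical rule name
    {6901, 6902, 6905},
    ["192.168.0.30", "192.168.0.31"],   # ctxpvs1 and ctxpvs2 from the lab setup
)
print(command)
```

The printed command has to be run from an elevated prompt on the target device (or in the master image) to take effect.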

I wrote about this about a year ago:

PVS failover doesn't work or only very slowly? Thanks to a Citrix case we finally found the solution. It's necessary to add an inbound firewall rule for the corresponding ports. #citrix #citrixpvs #pvs #provisioning #failover #slow
— Patrick Matula (@p_matula) October 15, 2021

Since the port matrix has still not been updated, I have now rebuilt the scenario with a recent version to check whether the behavior is still there.

Finally, I would like to point out that this cannot be the only failover mechanism. We have looked at graceful failover here, but if a PVS server dies from one moment to the next, this communication can no longer take place. So there is obviously still a “plan B”. That would probably be a topic for another blog post.

Happy troubleshooting.