Troubleshooting vSAN Encryption and KMS Connectivity

Troubleshooting vSAN Encryption

 

Checklist

  • Ensure the KMS server is reachable and responding on the KMIP port (5696 by default). For initial configuration of vSAN Encryption, both vCenter and the ESXi hosts in the cluster require connectivity to the KMS, but for ongoing operation only the hosts require it. vCenter is only required when the configuration needs to be changed or when enabling/disabling encryption.
  • Ensure the host has the right credentials to communicate with the KMS – i.e. the client cert exists and is the right type to establish trust with the KMS.
  • Ensure the host can enter crypto-safe mode. To do this, it requires access to its HostKey, which is a different key to that required to mount any encrypted diskgroups. Without this key, the host is not deemed secure enough to host encrypted diskgroups or VMs.
  • Ensure the host has access to and can retrieve the required vSAN KEK.

Remember that once a host has possession of a key, it keeps that key in memory until the host is rebooted. A loss of connectivity to the KMS therefore causes no issues unless the host is rebooted. If the KMS has suffered a permanent failure and you know the keys cannot be retrieved again, DO NOT reboot any hosts.

 

Testing KMS connectivity

  • In the absence of vCenter, you will need to verify which KMS servers the ESXi server will attempt to contact to retrieve any keys. To do this, grep for ‘kmip’ in the esx.conf file.
[root@vsan-esxi-01:~] grep kmip /etc/vmware/esx.conf
/vsan/kmipServer/child[0001]/old = "false"
/vsan/kmipServer/child[0001]/port = "5696"
/vsan/kmipServer/child[0001]/address = "10.27.46.207"
/vsan/kmipServer/child[0001]/name = "KMS2"
/vsan/kmipServer/child[0001]/kmipClusterId = "HyTrust"
/vsan/kmipServer/child[0001]/kmskey = "HyTrust/KMS2"
/vsan/kmipServer/child[0000]/kmskey = "HyTrust/KMS1"
/vsan/kmipServer/child[0000]/kmipClusterId = "HyTrust"
/vsan/kmipServer/child[0000]/name = "KMS1"
/vsan/kmipServer/child[0000]/address = "10.27.46.209"
/vsan/kmipServer/child[0000]/port = "5696"
/vsan/kmipServer/child[0000]/old = "false"
/vsan/kmipClusterId = "HyTrust"

 

In this example, there is one KMIP cluster called ‘HyTrust’ with two KMIP servers in it, indicated by the child[0000] and child[0001] entries. You can validate the IP/FQDN and port of each server by checking these entries.
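
The grep output above can be condensed to one line per KMS server. The following is a minimal sketch rather than a VMware-provided tool: the function name list_kms_servers is made up for illustration, and it assumes that awk and sort are available in the host shell and that esx.conf keeps the layout shown above.

```shell
# list_kms_servers [FILE]
# Hypothetical helper: prints "name address port" for every
# /vsan/kmipServer/child[NNNN] entry found in an esx.conf-style file.
list_kms_servers() {
    # Default path is the one used in this article; pass another file to test offline.
    conf="${1:-/etc/vmware/esx.conf}"
    awk -F'"' '
        # $1 is e.g.: /vsan/kmipServer/child[0001]/port =      $2 is the quoted value
        index($1, "/vsan/kmipServer/child[") == 1 {
            split($1, parts, "/")
            child = parts[4]              # child[0001]
            field = parts[5]              # "port = "
            sub(/ *= *$/, "", field)      # strip the trailing " = "
            value[child, field] = $2
            seen[child] = 1
        }
        END {
            for (child in seen)
                print value[child, "name"], value[child, "address"], value[child, "port"]
        }
    ' "$conf" | sort
}
```

Running list_kms_servers with no argument reads /etc/vmware/esx.conf; the output is sorted by server name so that it is easy to compare between hosts.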

If the original KMS server has been removed from vCenter, it must be added back to vCenter using exactly the same kmipClusterId. Otherwise the hosts will treat it as a brand new cluster, and any keys that reference the original cluster as their source will not be retrievable.

To test connectivity, you can simply use

nc -z <KMS Address> 5696
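
When the cluster has more than one KMS server, the same nc test can be looped over all of them. This is a rough sketch: probe_port and check_kms_servers are hypothetical names, the input format ("name address port", one server per line) is my own choice, and the 5-second -w timeout is an assumption about the nc build on the host.

```shell
# probe_port ADDRESS PORT
# Returns 0 if a TCP connection can be opened. The -w 5 timeout is an assumption;
# drop it if the local nc does not accept it.
probe_port() {
    nc -z -w 5 "$1" "$2" </dev/null >/dev/null 2>&1
}

# check_kms_servers
# Reads "name address port" lines on stdin and reports the state of each server.
check_kms_servers() {
    while read -r name address port; do
        [ -n "$address" ] || continue    # skip blank lines
        if probe_port "$address" "$port"; then
            printf '%s %s:%s reachable\n' "$name" "$address" "$port"
        else
            printf '%s %s:%s UNREACHABLE\n' "$name" "$address" "$port"
        fi
    done
}
```

For example: printf 'KMS1 10.27.46.209 5696\nKMS2 10.27.46.207 5696\n' | check_kms_servers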

Locating the KMIP Client cert

[root@vsan-esxi-01:/var/log] cd /etc/vmware/ssl/
[root@vsan-esxi-01:/etc/vmware/ssl] ls
castore.pem               openssl.cnf               rui.crt                   vsan_kms_castore.pem      vsan_kms_client.crt       vsan_kms_client_old.crt   vsanvp_castore.pem
iofiltervp.pem            rui.bak                   rui.key                   vsan_kms_castore_old.pem  vsan_kms_client.key       vsan_kms_client_old.key

 

Check the /etc/vmware/ssl folder on the host to ensure that a copy of the vsan_kms_client.crt exists along with a copy of the private key (vsan_kms_client.key). These files should be identical on all hosts in the cluster.
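
One way to confirm that the files are present, non-empty and identical across hosts is to print a checksum for each and compare the output host by host. This is only a sketch: kms_file_digests is a made-up name, and it assumes sha256sum and cut are available in the host shell.

```shell
# kms_file_digests [DIR]
# Hypothetical helper: prints "<file> <sha256>" for each KMS-related file,
# or "<file> MISSING_OR_EMPTY" if the file is absent or has no content.
kms_file_digests() {
    dir="${1:-/etc/vmware/ssl}"
    for f in vsan_kms_client.crt vsan_kms_client.key vsan_kms_castore.pem; do
        if [ -s "$dir/$f" ]; then
            # Read via stdin so only the digest (not the path) is printed.
            printf '%s %s\n' "$f" "$(sha256sum < "$dir/$f" | cut -d' ' -f1)"
        else
            printf '%s MISSING_OR_EMPTY\n' "$f"
        fi
    done
}
```

Run kms_file_digests on each host in the cluster and compare the output; any line that differs, or that reports MISSING_OR_EMPTY, points at the host that needs attention.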

The vsan_kms_castore.pem file is a copy of the server certificate that the host uses to compare with the cert returned by the KMIP server during initial SSL handshake. If the server cert has been changed and does not match what ESXi has stored here, the connection will not be established.

 

If vCenter is available and the host is missing any of this information, vCenter will provide the host with copies of the certificates it has stored in VECS. The certificates that will be provided to the host can be found in the

Ensure the host can enter crypto-safe mode

 

The next step is to check if the host is in crypto-safe mode. To enter crypto-safe mode, the host must be able to retrieve a special key called the HostKey. This key is separate to any other keys that would be required to encrypt VMs or the vSAN datastore. It is the key used by the host to encrypt core dumps. Without access to this key, the host cannot even request any other keys from the KMS server, even if it is accessible.

When vSAN Encryption was first enabled on the cluster, the host transitioned to ‘crypto-safe’ mode for the first time and was assigned a key to install as its HostKey. The host will always look for this key, based on the key identifier, when booting up. The host will NOT attempt to retrieve, nor will it request, a different key if the original key is not available. So for the host to re-enter crypto-safe, this key MUST be available.

To determine if a HostKey has been installed (i.e. the host is crypto-safe), you can use the UI (if available).

Select the host in the inventory and go to Configure > Security Profile > Host Encryption Mode


Check that Encryption Mode is enabled. If it is not, try to enable it through the UI. If the host will not enter encryption mode, then it cannot retrieve its HostKey.

 

If the UI is not available, you can use the crypto-util utility on the host to see if a HostKey has been installed or not.

[root@vsan-esxi-01:~] crypto-util keys getkidbyname HostKey
vmware:key/fqid/<VMWARE-NULL>/HyTrust/04f631cc%2d84dd%2d11e8%2d8194%2d00505698ddb6

If a key value is returned, the host is in crypto-safe mode. If the message indicates that a HostKey has not been established, then the host is not in crypto-safe mode.
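
The value returned by crypto-util carries the key ID in percent-encoded form (%2d is a hyphen), whereas the logs show it as a plain UUID. The sketch below splits the string into provider and key ID so the two can be compared. It assumes the layout seen in the sample output above (provider name second to last, key ID last); parse_host_key_id is a made-up name.

```shell
# parse_host_key_id FQID
# Hypothetical helper: prints "<provider> <key-uuid>" for a string such as
#   vmware:key/fqid/<VMWARE-NULL>/HyTrust/04f631cc%2d84dd%2d...
# Assumption: the last two "/"-separated fields are provider and key ID.
parse_host_key_id() {
    fqid="$1"
    provider=$(printf '%s\n' "$fqid" | awk -F/ '{ print $(NF-1) }')
    # Decode only the %2d (hyphen) escapes seen in the sample output.
    key_id=$(printf '%s\n' "$fqid" | awk -F/ '{ print $NF }' | sed 's/%2[dD]/-/g')
    printf '%s %s\n' "$provider" "$key_id"
}
```

On a host this could be called as: parse_host_key_id "$(crypto-util keys getkidbyname HostKey)"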

 

If you want to know which key the host requires to enter crypto-safe mode, you can find this value by looking in the vCenter MOB. (The MOB on the host itself is no longer available, but the host objects can be accessed via the vCenter MOB.)

Navigate to the host page in the MOB, for example https://vcsa.domain.local/mob/?moid=host-40 (where host-40 is the MoRef for the host).

  1. Click Runtime
  2. Click CryptoKeyId

Here you can see the UUID of the key the host will require to enter crypto-safe mode.

  3. Click ProviderId.

Here you can see the name of the KMIP Cluster from which the host will request the key.

 

If vCenter is not available, you will only be able to determine the HostKey identifier through logging. Grep for the term ‘CryptoManager’ in the hostd.log to see the host adding keys to its keyCache. For example, my host logged the following when it successfully added the HostKey to the cache:

[root@vsan-esxi-02:~] grep CryptoManager /var/log/hostd.log
2018-07-11T07:37:45.992Z info hostd[2099589] [Originator@6876 sub=Solo.Vmomi opID=4b3daa3a-84dd-11e8-4b-bc3b user=:com.vmware.vsan.health] Activation [N5Vmomi10ActivationE:0x000000a14601e520] : Invoke done [IsEnabled] on [vim.encryption.CryptoManagerHost:ha-crypto-manager]
-->    object = 'vim.encryption.CryptoManagerHost:ha-crypto-manager',
2018-07-11T07:37:46.159Z info hostd[2099206] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc43 user=vpxuser:com.vmware.vsan.health] Host has been placed in Crypto-prepared state
2018-07-11T07:37:46.166Z info hostd[2099589] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc45 user=vpxuser:com.vmware.vsan.health] Adding host key 04f631cc-84dd-11e8-8194-00505698ddb6 to the Key Cache
2018-07-11T07:37:46.166Z info hostd[2099589] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc45 user=vpxuser:com.vmware.vsan.health] Host has been placed in Crypto-safe state
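
To pull just the key ID out of hostd.log, the ‘Adding host key’ message can be filtered with sed. This is a sketch based only on the message format in the sample above; host_key_from_hostd_log is a made-up name, and the message text may differ between ESXi releases.

```shell
# host_key_from_hostd_log [FILE]
# Hypothetical helper: prints the key ID from the most recent
# "Adding host key <id> to the Key Cache" message in the given log.
host_key_from_hostd_log() {
    log="${1:-/var/log/hostd.log}"
    sed -n 's/.*Adding host key \([0-9a-fA-F-]*\) to the Key Cache.*/\1/p' "$log" | tail -n 1
}
```

If nothing is printed, the current hostd.log contains no such message; the log may have rotated since the host last entered crypto-safe mode.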

 

You can also check the syslog.log for errors. If there are TCP communication errors (i.e. the port is blocked, the KMS server is not responding etc.), you will see errors such as the following:

2018-07-11T09:27:32Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:27:32Z jumpstart[2097479]: 2018-07-11T09:27:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-00505698ddb6 from KMS KMS1: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: 2018-07-11T09:28:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-00505698ddb6 from KMS KMS2: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: 2018-07-11T09:28:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Failed to retrieve key from key management server cluster HyTrust. Will have 1 retries.
2018-07-11T09:28:37Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T09:29:37Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:29:37Z jumpstart[2097479]: 2018-07-11T09:29:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-00505698ddb6 from KMS KMS1: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: 2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-00505698ddb6 from KMS KMS2: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: 2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Failed to retrieve key from key management server cluster HyTrust. Will have 0 retries.
2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Failed to retrieve key from key management server cluster HyTrust

Note how it tries to communicate with each server in the cluster in turn. The ‘QLC_ERR_COMMUNICATE’ error indicates a networking issue that must be resolved.

 

 

If there is a problem with the client certificate or private key, you will see errors like this in syslog.log:

2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:0
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Joining vSAN cluster 52faacd9-6a43-a600-e0b8-0485de72758b
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: SyncConfigurationCallback called
2018-07-11T10:19:16Z jumpstart[2097479]: VsanSysinfo: Loading module cmmds
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Retrieving the host key with keyId: 04f631cc-84dd-11e8-8194-00505698ddb6
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 1 retries.
2018-07-11T10:19:21Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:19:21Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:19:21Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 0 retries.
2018-07-11T10:19:21Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Invalid key or certs

 

I recreated this issue by deleting the client cert and key files. Following the reboot of the host, vCenter was down and so could not send the client cert to the host. Looking in the /etc/vmware/ssl folder, I found that the files had been recreated but with no content. E.g.

[root@vsan-esxi-01:/etc/vmware/ssl] cat vsan_kms_client.crt
[root@vsan-esxi-01:/etc/vmware/ssl]

 

Because the client cert has not been populated, it is impossible to establish trust with the KMS.

If other hosts have connected successfully, then they should have a valid copy of the certificate, which can be copied.

[root@vsan-esxi-02:/etc/vmware/ssl] cat vsan_kms_client.crt
-----BEGIN CERTIFICATE-----
MIIDkzCCAnugAwIBAgIFANEDIiowDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
vm0H5PwefRocE/is0Zhjz08+4DQXNYbRjt3yvymS/052jgF+6FKFMh6rQBbSyo5T
Jarp8kprRtoT9mmwX+Dn/NuaH8KGjWInps/saHQ8vIeMPLRhzXTjkTdsYcDNjBmu
xvEftbcR/A==
-----END CERTIFICATE-----
[root@vsan-esxi-02:/etc/vmware/ssl]

Copy this file and the vsan_kms_client.key file from the working host to the non-working host and reboot.

 

If there is a problem validating the server’s certificate, the issue will show up slightly differently. The syslog.log will show something like this if there is no KMS server cert saved in the /etc/vmware/ssl folder (note the ‘GetKmsServerCerts KMS certs not found’ line):

2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:0
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts KMS certs not found
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Joining vSAN cluster 52faacd9-6a43-a600-e0b8-0485de72758b
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: SyncConfigurationCallback called
2018-07-11T10:45:50Z jumpstart[2097479]: VsanSysinfo: Loading module cmmds
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Retrieving the host key with keyId: 04f631cc-84dd-11e8-8194-00505698ddb6
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 1 retries.
2018-07-11T10:45:55Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:45:55Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:45:55Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 0 retries.
2018-07-11T10:45:55Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Invalid key or certs

This shows that the KMS’s cert has not been stored on the host.

 

For a working host, there should be a copy of the server cert for each server in the cluster providing the key. E.g.:

[root@vsan-esxi-02:/etc/vmware/ssl] cat vsan_kms_castore.pem
-----BEGIN CERTIFICATE-----
MIIDvTCCAqWgAwIBAgIFANEDIiYwDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
SpQQLt8G3Zk9Yz75yfjSREHbJ0XHLqX25k9SwJaP20vf+Bz/tQFilpg+To6plw2z
xYzApJGjNEL0+k7W5YquUr5foFjAlrNW3GNzzYtt3CqKDSt201BchE82UYBgTzlb
MA==
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDvTCCAqWgAwIBAgIFAM+OvdYwDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
m6hsrmBfRTSTbPpRimDXXQ7weBehjCHkIpKOqBUtNRVN4qArvkSO/cwZCB/7y7Gr
3A==
-----END CERTIFICATE-----
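
A quick sanity check is to count the certificates in the file and compare that number with the number of KMS servers configured for the cluster. A minimal sketch, with count_pem_certs as a made-up name:

```shell
# count_pem_certs [FILE]
# Hypothetical helper: prints how many PEM certificates a file contains.
count_pem_certs() {
    file="${1:-/etc/vmware/ssl/vsan_kms_castore.pem}"
    # -e keeps grep from treating the leading dashes as options.
    grep -c -e '-----BEGIN CERTIFICATE-----' "$file"
}
```

In the two-server example used in this post, the expected count would be 2.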

 

Option 1 – You can repopulate this file by opening a browser, pointing it at https://<KMS_Address>:5696 and copying the cert presented by the browser. Convert it to a PEM file and copy it into the vsan_kms_castore.pem file. (If more than one server exists per cluster, append any additional certs to the file so they appear one after another, with no spaces, in the vsan_kms_castore.pem file). You will need to use this option if the server cert has been changed.

Option 2 – copy the file from a working host if the server cert has not been changed.
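
For Option 1, the certificate can also be captured from the command line instead of a browser, for example with openssl s_client from a workstation that can reach the KMS. The sketch below only does the PEM extraction; extract_pem_certs is a made-up name, and whether a given KMS completes the handshake far enough to present its certificate without a client cert is an assumption that needs to be verified in your environment.

```shell
# extract_pem_certs
# Hypothetical helper: reads text on stdin (such as openssl s_client output)
# and prints only the PEM certificate blocks it contains.
extract_pem_certs() {
    sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p'
}

# Example use (not executed here; <KMS_Address> is a placeholder):
#   openssl s_client -connect <KMS_Address>:5696 </dev/null 2>/dev/null \
#       | extract_pem_certs > kms_server.pem
```

Inspect the resulting file before copying it into vsan_kms_castore.pem, and repeat for each KMS server in the cluster.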

 

If there is a server certificate saved in the /etc/vmware/ssl folder, but it is not the correct certificate, you should see errors like the following in the syslog.log:

2018-07-11T10:59:34Z jumpstart[2097476]: VsanUtil: Failed to connect to key server, QLC_ERR_NEED_AUTH
2018-07-11T10:59:34Z jumpstart[2097476]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-00505698ddb6 from KMS KMS1: QLC_ERR_NEED_AUTH

The QLC_ERR_NEED_AUTH error is a clear indication that the host’s copy of the server cert does not match the cert the server is presenting when the SSL handshake is taking place. If this is the case and vCenter is not available, you will have to use Option 1 above.

If vCenter is available, use the UI options to re-establish trust with the KMS.

Make VC Trust KMS

This action will need to be performed for each KMS server individually.
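
The log signatures covered in this post can be pulled together into a rough triage helper. This is a sketch built only from the sample log lines shown above, not an exhaustive or official mapping: triage_kms_log and its output labels are made up, and telling a missing server cert from a client cert problem by the ‘GetKmsServerCerts KMS certs not found’ line is an inference from those samples.

```shell
# triage_kms_log [FILE]
# Hypothetical helper: scans a syslog.log-style file for the error
# signatures discussed in this post and prints a short hint for each.
triage_kms_log() {
    log="${1:-/var/log/syslog.log}"
    found=0
    if grep -q 'QLC_ERR_COMMUNICATE' "$log"; then
        echo "network: KMS not reachable on the KMIP port (QLC_ERR_COMMUNICATE)"
        found=1
    fi
    if grep -q 'QLC_ERR_NEED_AUTH' "$log"; then
        echo "server-cert-mismatch: stored KMS server cert does not match (QLC_ERR_NEED_AUTH)"
        found=1
    fi
    if grep -q 'Invalid key or certs' "$log"; then
        # "Old KMS certs not found" also appears in the client-cert case,
        # so only the message without "Old" is treated as a missing cert store.
        if grep -q 'GetKmsServerCerts KMS certs not found' "$log"; then
            echo "server-cert-missing: no KMS server cert stored on the host"
        else
            echo "client-cert: client cert or key missing or invalid"
        fi
        found=1
    fi
    [ "$found" -eq 1 ] || echo "no-known-signature"
}
```

Treat the output as a pointer to the relevant section above rather than a diagnosis; the log itself should always be read in full.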

