Customizing VMware Aria Operations Cloud Proxies to use non-default Docker networks

I ran into an issue for a customer with a dual-site VCF deployment, where the management network of the primary VCF site used a subnet within 172.17.0.0/16, the range Docker claims by default.

VMware Aria Operations Cloud Proxy VMs run Docker internally for their services, and everything is deployed using the default Docker networking values.
(For information about default Docker networking, see Networking | Docker Docs)

By default, Docker sets up two kinds of bridge networks on the appliance: the default bridge network, used as a sort of fallback by containers that are created without a user-defined network, and one bridge per user-defined network.

Docker will try to use 172.17.0.0/16 for the default bridge network, which appears as the docker0 interface when you run ip addr or ifconfig on your appliance. For user-defined networks it starts from 172.18.0.0/16, with an interface created on the appliance named something like br-xxxxxxxxxxxx.

If your appliance already has an interface within one of these ranges, Docker will increment the second octet by one and use the next /16 instead. For example, if your appliance has eth0 (the management IP address) configured within 172.17.10.0/24, then Docker will use 172.18.0.0/16 for the default bridge network and 172.19.0.0/16 for the user-defined networks instead.
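As an aside, you can watch this subnet selection happen on any Docker host, not just a Cloud Proxy: create a throwaway user-defined network and Docker will carve it out of the next free range. The network name demo-net below is just an example.

# Create a throwaway user-defined network and let Docker choose the subnet
docker network create demo-net

# Show which subnet Docker selected for it
docker network inspect demo-net --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'

# Clean up
docker network rm demo-net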

In my case, the management IP for the Cloud Proxy appliance was configured in the 172.18.0.0/16 range, so Docker used the default 172.17.0.0/16 for the default bridge network, but shifted the user-defined networks to 172.19.0.0/16 to avoid a conflict with the management network. See below:

root@cloudproxy1 [ ~ ]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:89:30:6c brd ff:ff:ff:ff:ff:ff
    altname eno1
    altname enp11s0
    altname ens192
    inet 172.18.41.31/24 brd 172.18.41.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe89:306c/64 scope link
       valid_lft forever preferred_lft forever
3: br-12c410b77095: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:00:89:f7:eb brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.1/16 brd 172.19.255.255 scope global br-12c410b77095
       valid_lft forever preferred_lft forever
    inet6 fe80::42:ff:fe89:f7eb/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:5b:2d:f9:16 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5bff:fe2d:f916/64 scope link
       valid_lft forever preferred_lft forever
6: vethb100a06@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-12c410b77095 state UP group default
    link/ether d2:62:f0:00:c1:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::d062:f0ff:fe00:c1a9/64 scope link
       valid_lft forever preferred_lft forever
8: vethe9e3ea1@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 26:62:5d:b6:45:7f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::2462:5dff:feb6:457f/64 scope link
       valid_lft forever preferred_lft forever

However, the management components in the Primary Datacenter from which the Cloud Proxies were to collect metrics were all configured on a network within the 172.17.0.0/16 range. This conflicted with the range Docker had claimed on the appliance, and the routing table sent all traffic destined for 172.17.0.0/16 out of the docker0 interface instead of eth0.

root@cloudproxy1 [ ~ ]# ip route
default via 172.18.41.250 dev eth0 proto static
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.18.41.0/24 dev eth0 proto kernel scope link src 172.18.41.31
172.19.0.0/16 dev br-12c410b77095 proto kernel scope link src 172.19.0.1
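If you want to prove to yourself which interface the kernel will pick for a given destination, ip route get does exactly that. The address below is just an illustrative host in the conflicting range, not one of the customer's real management components; on the unmodified appliance it comes back via docker0 rather than eth0.

# Ask the kernel which route would be used for a destination inside 172.17.0.0/16
# (172.17.10.5 is an example address only)
ip route get 172.17.10.5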

There is a KB from VMware which addresses this issue to some extent, but it assumes that you have used 172.17.0.0/16 for your management IP, and that the resulting bridge network of 172.18.0.0/16 was the one causing the conflict:
Cloud Proxies becoming dysfunctional after upgrading to vRealize Operations 8.10 or above

I’ll be honest, I could not figure out from the KB which network was the default bridge and which was the user-defined one, as it is not very well explained, so I decided to customize both networks to be sure there would be no conflicts.
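One trick that removes some of the guesswork is mapping each interface back to its Docker network: the default bridge network is the one literally named bridge (backed by docker0), while a br-xxxxxxxxxxxx interface takes the first 12 characters of its network ID as its name. So, for the appliance above, something like the following would have told me which network owned br-12c410b77095 (this is a general Docker technique, not something documented in the KB):

# List the Docker networks and their IDs; 'bridge' is the default (docker0) network
docker network ls

# Inspect the user-defined network whose ID prefix matches the br- interface name
docker network inspect 12c410b77095 --format '{{.Name}}: {{range .IPAM.Config}}{{.Subnet}}{{end}}'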

As the KB describes, you must edit the /etc/docker/daemon.json file and enter a custom value for the bip (bridge IP). This threw me off, because I assumed that the interface called br-xxxxxxxxxxxx would naturally be the bridge network. But in my case, the br-xxxxxxxxxxxx interface was not the one causing the conflict!

Therefore, I set about trying to find out how to change not only the Bridge IP, but also the network used for the user-defined networks.

It wasn’t too hard in the end, but it took a while for me to get the syntax correct. You need to add the two lines shown below, remembering the comma after the existing "icc": false entry and after the new "default-address-pools" line.

root@cloudproxy1 [ /etc/docker ]# vi daemon.json
{
  "iptables": true,
  "log-driver": "syslog",
  "log-opts": {"syslog-address": "udp://127.0.0.1:514", "syslog-facility": "daemon"},
  "log-level": "info",
  "live-restore": true,
  "no-new-privileges": true,
  "icc": false,
  "default-address-pools": [{"base": "172.19.0.0/24", "size": 24}],
  "bip": "172.20.0.1/24"
}

Once you have added the lines, save and exit the vi editor by pressing ESC, then typing :wq and pressing Enter.
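Before rebooting, it is worth checking that the file is still valid JSON, because a missing or stray comma will stop the Docker daemon from starting. Assuming python3 is available on the appliance (check first), a quick syntax check looks like this:

# Validate the JSON syntax of daemon.json; errors are reported with a line number
python3 -m json.tool /etc/docker/daemon.json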

Then reboot the appliance with reboot -f.

Interestingly, when the appliance comes back up, /etc/docker/daemon.json appears to revert to its previous contents, but fear not: using the ip addr command, I could see that the networks I had defined before the reboot had been applied to the interfaces.

root@cloudproxy1 [ ~ ]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:89:30:6c brd ff:ff:ff:ff:ff:ff
    altname eno1
    altname enp11s0
    altname ens192
    inet 172.18.41.31/24 brd 172.18.41.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe89:306c/64 scope link
       valid_lft forever preferred_lft forever
3: br-12c410b77095: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:a4:e7:f6:d5 brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.1/16 brd 172.19.255.255 scope global br-12c410b77095
       valid_lft forever preferred_lft forever
    inet6 fe80::42:a4ff:fee7:f6d5/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:f4:e9:d1:d2 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.1/24 brd 172.20.0.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:f4ff:fee9:d1d2/64 scope link
       valid_lft forever preferred_lft forever
6: veth250c9eb@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-12c410b77095 state UP group default
    link/ether 86:62:45:8c:9e:d5 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::8462:45ff:fe8c:9ed5/64 scope link
       valid_lft forever preferred_lft forever
8: veth11e21f6@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether ae:32:c1:dd:6b:2c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ac32:c1ff:fedd:6b2c/64 scope link
       valid_lft forever preferred_lft forever

As you can see, the IP address defined in the “bip” entry of daemon.json was actually applied to the docker0 interface, not the br-xxxxxxxxxxxx interface as I had assumed. Nevertheless, modifying these two networks ensured I no longer had any internal conflicts, and the Cloud Proxies could then connect to all of my management components.
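If you want confirmation from Docker's side as well as from the interfaces, the same checks as earlier do the job:

# Confirm the default bridge (docker0) now uses the subnet set via "bip"
docker network inspect bridge --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'

# Confirm the routing table no longer sends 172.17.0.0/16 to docker0
ip route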

Addendum

All of the above was required to allow the Cloud Proxies within the Primary VCF site to communicate with the management components within the Primary site.

In the Secondary VCF site, the 172.17.0.0/16 range was not in use, so there was no such issue. However, I subsequently realised that the Cloud Proxies in the secondary site would also need to be reachable from the SDDC Manager in the Primary site (where the Aria Suite was deployed in VCF-aware mode), so that SDDC Manager could validate passwords and certificates as part of its built-in Health Checks.

Before I modified the Docker networks in the secondary site, the Password and Certificate health checks were failing in SDDC Manager. It was also preventing Aria Suite Lifecycle Manager from completing Inventory Sync requests with an error stating that ‘SDDC Manager vcf01.domain has timed out while giving API response.’

Digging into the operationsmanager.log file on the Primary site SDDC Manager, I could see it was continuously failing to connect to the secondary site Cloud Proxies via SSH to validate their passwords, causing the timeout seen in vRSLCM.
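If you are chasing a similar symptom, the log is typically found at /var/log/vmware/vcf/operationsmanager/operationsmanager.log on the SDDC Manager appliance; a filter along these lines (adjust the search terms for your environment) surfaces the failed connection attempts quickly:

# On the SDDC Manager appliance: look for SSH/connection failures against the Cloud Proxies
grep -iE 'ssh|connect' /var/log/vmware/vcf/operationsmanager/operationsmanager.log | tail -n 50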

Customizing the Docker networks to ensure that 172.17.0.0/16 was not being used by Docker resolved all these issues and the Health Checks and Inventory Sync operations all succeeded.