In the previous post, I covered how Docker uses Linux virtual interfaces and bridge interfaces to facilitate communication between containers over bridge networks. In this post, I will discuss how Docker uses vxlan technology to create the overlay networks that are used in swarm clusters, and how you can view and inspect this configuration. I will also discuss the different network types used to facilitate the various connectivity needs of containers launched in swarm clusters.
It is assumed that readers are already familiar with setting up swarm clusters and launching services in Docker Swarm. I will also link to a number of helpful resources at the end of the post that provide more details and context around the topics discussed here. As always, I appreciate any feedback.
Docker Swarm and Overlay Networks
Docker overlay networks are used in the context of docker clusters (Docker Swarm), where a virtual network used by containers needs to span multiple physical hosts running the docker engine. When a container is launched in a swarm cluster (as part of a service), multiple networks are attached to it by default, each to meet different connectivity requirements.
For example, I have a three node docker swarm cluster:
$ docker node ls
ID                           HOSTNAME       STATUS  AVAILABILITY  MANAGER STATUS
50ogqwz3vkweor5i35eneygmi *  swarm-manager  Ready   Active        Leader
7zg6c719vaj8az2tmiga4twju    swarm02        Ready   Active
b3p76fz1zga5njh5cr91wszfi    swarm01        Ready   Active
I will first create an overlay network named my-overlay-network:
$ docker network create --driver overlay --subnet 10.10.10.0/24 my-overlay-network
bwii51jaglps6a3xkohm57xe6
I then launch a service running a simple web server that exposes port 8080 to the outside world. The service runs 3 replicas, and I specify that it is connected to only one network (my-overlay-network):
$ docker service create --name webapp --replicas=3 --network my-overlay-network -p 8080:80 akittana/dockerwebapp:1.1
8zpocbn9mv8gb2hqwjpa1stuq

$ docker service ls
ID            NAME    REPLICAS  IMAGE                      COMMAND
8zpocbn9mv8g  webapp  3/3       akittana/dockerwebapp:1.1
If I then list the interfaces available inside one of the running containers, I see three interfaces instead of the single interface we would expect for a container running on a standalone docker host:
$ docker ps
CONTAINER ID  IMAGE                      COMMAND                 CREATED         STATUS         PORTS   NAMES
9ec607870ed9  akittana/dockerwebapp:1.1  "/usr/sbin/apache2ctl"  58 seconds ago  Up 57 seconds  80/tcp  webapp.2.ebrd6mf98r4baogleca60ckjf

$ docker exec 9ec607870ed9 ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0a:ff:00:06
          inet addr:10.255.0.6  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:aff:feff:6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:648 (648.0 B)

eth1      Link encap:Ethernet  HWaddr 02:42:ac:13:00:03
          inet addr:172.19.0.3  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe13:3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:648 (648.0 B)

eth2      Link encap:Ethernet  HWaddr 02:42:0a:0a:0a:03
          inet addr:10.10.10.3  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::42:aff:fe0a:a03/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:15 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1206 (1.2 KB)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
The container is connected to my-overlay-network through eth2 (as you can tell by the IP address). eth0 and eth1 are connected to other networks. If we run docker network ls, we can see that two extra networks were added: docker_gwbridge and ingress, and from the subnet addresses we can see that they correspond to eth1 and eth0 respectively:
$ docker network ls
NETWORK ID    NAME                DRIVER   SCOPE
e96b3056294c  bridge              bridge   local
b85b2b9fdadf  docker_gwbridge     bridge   local
9f0dd1556cd0  host                host     local
3kuba8yq3c27  ingress             overlay  swarm
bwii51jaglps  my-overlay-network  overlay  swarm
2b9d08c6067a  none                null     local
$ docker network inspect ingress | grep Subnet
            "Subnet": "10.255.0.0/16",

$ docker network inspect docker_gwbridge | grep Subnet
            "Subnet": "172.19.0.0/16",
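As a side note, instead of grepping the JSON, the subnets can be pulled out directly with a Go template. This is just a convenience sketch; the template path follows the IPAM section of the docker network inspect output:

$ docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' ingress
10.255.0.0/16
$ docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' docker_gwbridge
172.19.0.0/16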
Overlay
An overlay network creates a subnet that can be used by containers across multiple hosts in the swarm cluster. Containers running on different physical hosts can talk to each other over the overlay network, as long as they are all attached to the same network.
For example, for the webapp service we started, we can see that there is one container running on each host in the swarm cluster:
$ docker service ps webapp
ID                         NAME      IMAGE                      NODE           DESIRED STATE  CURRENT STATE          ERROR
agng82g4qm19ascc1udlnvy1k  webapp.1  akittana/dockerwebapp:1.1  swarm-manager  Running        Running 3 minutes ago
0vxnym0djag47o94dcmi8yptk  webapp.2  akittana/dockerwebapp:1.1  swarm01        Running        Running 3 minutes ago
d38uyent358pm02jb7inqq8up  webapp.3  akittana/dockerwebapp:1.1  swarm02        Running        Running 3 minutes ago
I can get the overlay IP address of each container by executing ifconfig eth2 inside it (eth2 being the interface connected to the overlay network).
On swarm01:
$ docker ps
CONTAINER ID  IMAGE                      COMMAND                 CREATED        STATUS        PORTS   NAMES
2b0abe2956c6  akittana/dockerwebapp:1.1  "/usr/sbin/apache2ctl"  5 minutes ago  Up 5 minutes  80/tcp  webapp.2.0vxnym0djag47o94dcmi8yptk

$ docker exec 2b0abe2956c6 ifconfig eth2 | grep addr
eth2      Link encap:Ethernet  HWaddr 02:42:0a:0a:0a:05
          inet addr:10.10.10.5  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::42:aff:fe0a:a05/64 Scope:Link
Then, from the container running on swarm02, I should be able to ping 10.10.10.5 (the IP of the container on swarm01):
$ docker ps
CONTAINER ID  IMAGE                      COMMAND                 CREATED         STATUS         PORTS   NAMES
a1ca9a0d2364  akittana/dockerwebapp:1.1  "/usr/sbin/apache2ctl"  55 minutes ago  Up 55 minutes  80/tcp  webapp.3.d38uyent358pm02jb7inqq8up

$ docker exec a1ca9a0d2364 ping 10.10.10.5
PING 10.10.10.5 (10.10.10.5) 56(84) bytes of data.
64 bytes from 10.10.10.5: icmp_seq=1 ttl=64 time=0.778 ms
64 bytes from 10.10.10.5: icmp_seq=2 ttl=64 time=0.823 ms
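As an aside, if the image does not ship ifconfig, the same overlay IP can usually be read from the docker host with an inspect template. This is a sketch that assumes the container is attached to my-overlay-network (the network name is the map key under NetworkSettings.Networks):

$ docker inspect -f '{{(index .NetworkSettings.Networks "my-overlay-network").IPAddress}}' 2b0abe2956c6
10.10.10.5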
vxlan
Docker’s overlay networks use vxlan technology, which encapsulates layer 2 frames inside layer 4 (UDP/IP) packets. This allows docker to create virtual networks on top of existing connections between hosts, whether or not those hosts are in the same subnet. The network endpoints participating in such a virtual network see each other as if they were connected to the same switch, without having to care about the underlying physical network.
To see this in action, we can run a traffic capture on the docker hosts participating in the overlay network. In the example above, a traffic capture on swarm01 or swarm02 will show us the icmp traffic between the containers running on them (vxlan uses udp port 4789):
ubuntu@swarm01:~$ sudo tcpdump -i eth0 udp and port 4789
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:20:37.030201 IP 172.17.1.50.40368 > 172.17.1.142.4789: VXLAN, flags [I] (0x08), vni 257
IP 10.10.10.5 > 10.10.10.4: ICMP echo request, id 16, seq 19157, length 64
01:20:37.030289 IP 172.17.1.142.49108 > 172.17.1.50.4789: VXLAN, flags [I] (0x08), vni 257
IP 10.10.10.4 > 10.10.10.5: ICMP echo reply, id 16, seq 19157, length 64
You can see two layers in the packets above: the outer layer is the udp vxlan tunnel traffic between the docker hosts on port 4789, and inside it you can see the icmp traffic carrying the container IP addresses.
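The vni value in the capture is the VXLAN network identifier that docker assigned to the overlay network. It appears to be stored as a driver option on the network, so a template like the one below should show which VNI belongs to my-overlay-network (the option key is taken from the docker network inspect output shown later for the ingress network; on this cluster it should return 257, matching the capture):

$ docker network inspect -f '{{index .Options "com.docker.network.driver.overlay.vxlanid_list"}}' my-overlay-network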
Encryption
The traffic capture above showed that anyone who can see the traffic between the docker hosts can also see inter-container traffic going over an overlay network. This is why docker includes an encryption option, which enables automatic IPSec encryption of the vxlan tunnels simply by adding --opt encrypted when creating the network.
Repeating the same test as above, but using an encrypted overlay network, we only see encrypted packets between the docker hosts:
$ docker network create --driver overlay --opt encrypted --subnet 10.20.20.0/24 enc-overlay-network
0aha4giv5yylp6l5nzgev4tel

$ docker service create --name webapp --replicas=3 --network enc-overlay-network -p 8080:80 akittana/dockerwebapp:1.1
5hjbv2mmqemto5krgylmen076

$ docker ps
CONTAINER ID  IMAGE                      COMMAND                 CREATED         STATUS         PORTS   NAMES
6ba03d127212  akittana/dockerwebapp:1.1  "/usr/sbin/apache2ctl"  20 seconds ago  Up 19 seconds  80/tcp  webapp.1.3axjnerwc6

$ docker exec 6ba03d127212 ifconfig | grep 10.20.20
          inet addr:10.20.20.3  Bcast:0.0.0.0  Mask:255.255.255.0

$ docker exec 6ba03d127212 ping 10.20.20.5
PING 10.20.20.5 (10.20.20.5) 56(84) bytes of data.
64 bytes from 10.20.20.5: icmp_seq=1 ttl=64 time=0.781 ms
64 bytes from 10.20.20.5: icmp_seq=2 ttl=64 time=0.785 ms

ubuntu@swarm02:~$ sudo tcpdump -i eth0 esp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
01:37:38.342817 IP 172.17.1.50 > 172.17.1.65: ESP(spi=0x378b5e6f,seq=0xe9), length 140
01:37:38.342936 IP 172.17.1.65 > 172.17.1.50: ESP(spi=0x29ade773,seq=0x49), length 140
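If you want to confirm on the host that IPSec is really in place, the kernel's IPSec state can be listed with iproute2 (assuming ip xfrm is available on the docker hosts). It should show ESP security associations between the docker hosts' IP addresses, with SPIs matching the ones seen in the tcpdump output:

ubuntu@swarm02:~$ sudo ip xfrm state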
Inspecting vxlan tunnel interfaces
Similar to bridge networks, docker creates a bridge interface for each overlay network, which connects the virtual tunnel interfaces that terminate the vxlan tunnels between the hosts. However, these bridge and vxlan tunnel interfaces are not created directly on the docker host; instead, they live in separate network namespaces that docker creates for each overlay network.
To actually inspect these interfaces, you have to use nsenter to run commands within the network namespace that manages the tunnels and virtual interfaces. These commands have to be run on a docker host that has containers participating in the overlay network.
Also, you have to edit /etc/systemd/system/multi-user.target.wants/docker.service on the docker host and comment out MountFlags=slave as discussed here.
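The namespace files live under /run/docker/netns, and the ones backing overlay networks appear to be named 1- followed by the beginning of the network ID. A quick way to match them up is to list the overlay network IDs next to their names (a sketch using the standard --filter and --format options of docker network ls):

$ docker network ls --filter driver=overlay --format '{{.ID}}  {{.Name}}'
3kuba8yq3c27  ingress
bwii51jaglps  my-overlay-network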
ubuntu@swarm02:~$ sudo ls -l /run/docker/netns/
total 0
-r--r--r-- 1 root root 0 Dec 15 19:52 1-3kuba8yq3c
-r--r--r-- 1 root root 0 Dec 16 02:00 1-bwii51jagl
-r--r--r-- 1 root root 0 Dec 16 02:00 4d950aa1386e
-r--r--r-- 1 root root 0 Dec 15 19:52 ingress_sbox

ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl ifconfig
br0       Link encap:Ethernet  HWaddr 22:14:63:f9:8b:f5
          inet addr:10.10.10.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::b0f7:dfff:fe61:6098/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:536 (536.0 B)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

veth2     Link encap:Ethernet  HWaddr 7a:48:1b:9a:ef:ec
          inet6 addr: fe80::7848:1bff:fe9a:efec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:1038 (1.0 KB)

vxlan1    Link encap:Ethernet  HWaddr 22:14:63:f9:8b:f5
          inet6 addr: fe80::2014:63ff:fef9:8bf5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:20 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl bridge fdb show
33:33:00:00:00:01 dev br0 self permanent
01:00:5e:00:00:01 dev br0 self permanent
33:33:ff:61:60:98 dev br0 self permanent
22:14:63:f9:8b:f5 dev vxlan1 vlan 1 master br0 permanent
22:14:63:f9:8b:f5 dev vxlan1 master br0 permanent
02:42:0a:0a:0a:05 dev vxlan1 dst 172.17.1.142 link-netnsid 0 self permanent
02:42:0a:0a:0a:04 dev vxlan1 dst 172.17.1.50 link-netnsid 0 self permanent
7a:48:1b:9a:ef:ec dev veth2 master br0 permanent
7a:48:1b:9a:ef:ec dev veth2 vlan 1 master br0 permanent
02:42:0a:0a:0a:03 dev veth2 master br0
33:33:00:00:00:01 dev veth2 self permanent
01:00:5e:00:00:01 dev veth2 self permanent
33:33:ff:9a:ef:ec dev veth2 self permanent
Finally, running a traffic capture on the veth interface shows us the traffic as it leaves the container, before it is routed into the vxlan tunnel (the ping from earlier was still running):
ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl tcpdump -i veth2 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on veth2, link-type EN10MB (Ethernet), capture size 262144 bytes
02:04:06.653684 IP 10.10.10.3 > 10.10.10.5: ICMP echo request, id 16, seq 21, length 64
02:04:06.654426 IP 10.10.10.5 > 10.10.10.3: ICMP echo reply, id 16, seq 21, length 64
02:04:06.958298 IP 10.10.10.3 > 10.10.10.4: ICMP echo request, id 20, seq 3, length 64
02:04:06.959198 IP 10.10.10.4 > 10.10.10.3: ICMP echo reply, id 20, seq 3, length 64
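It is also possible to look at the vxlan interface itself in more detail from inside the same namespace. Assuming the host's iproute2 prints vxlan details, ip -d link should show the vxlan id (matching the vni seen in the capture earlier) and the UDP destination port 4789:

ubuntu@swarm02:~$ sudo nsenter --net=/run/docker/netns/1-bwii51jagl ip -d link show vxlan1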
ingress
The second network the containers were connected to was the ingress network. Ingress is an overlay network, but it is created by default when a swarm cluster is initialized. This network is used to provide connectivity when connections are made to containers from the outside world, and it is where the load balancing provided by the swarm cluster takes place.
The load balancing is handled by IPVS, which runs inside a dedicated network namespace (ingress_sbox) that docker swarm sets up by default. We can see its endpoint attached to the ingress network (I used the same web service as before, which publishes port 8080 and maps it to port 80 in the containers):
$ docker service create --name webapp --replicas=3 --network my-overlay-network -p 8080:80 akittana/dockerwebapp:1.1
3ncich3lzh9nj3upcir39mxzm

$ docker network inspect ingress
[
    {
        "Name": "ingress",
        "Id": "3kuba8yq3c27p2vwo7hcm987i",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.255.0.0/16",
                    "Gateway": "10.255.0.1"
                }
            ]
        },
        "Internal": false,
        "Containers": {
            "02f1067a00b3a83d5f03cec17265bf7d7925fc6a326cc23cd46c5ab73cf57f20": {
                "Name": "webapp.1.a4s04msunrycihp4pk7hc9kmy",
                "EndpointID": "8f4bcf1fafa1e058d920ccadbec3b0172cf0432759377b442ac2331a388ad144",
                "MacAddress": "02:42:0a:ff:00:06",
                "IPv4Address": "10.255.0.6/16",
                "IPv6Address": ""
            },
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "f3030184ba93cd214c77b5811db7a20209bfd6a8daad4b56e7dea00bd022d536",
                "MacAddress": "02:42:0a:ff:00:04",
                "IPv4Address": "10.255.0.4/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "256"
        },
        "Labels": {}
    }
]
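The published port mapping itself can also be confirmed from the service definition. This is a sketch; the Endpoint.Ports path follows the docker service inspect JSON and should show target port 80 published as 8080:

$ docker service inspect webapp -f '{{json .Endpoint.Ports}}'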
First, let’s take a look at the iptables NAT rules on the docker host (any of the hosts participating in the swarm cluster):
ubuntu@swarm01:~$ sudo iptables -t nat -n -L
Chain PREROUTING (policy ACCEPT)
target          prot opt source         destination
DOCKER-INGRESS  all  --  0.0.0.0/0      0.0.0.0/0     ADDRTYPE match dst-type LOCAL
DOCKER          all  --  0.0.0.0/0      0.0.0.0/0     ADDRTYPE match dst-type LOCAL

Chain INPUT (policy ACCEPT)
target          prot opt source         destination

Chain OUTPUT (policy ACCEPT)
target          prot opt source         destination
DOCKER-INGRESS  all  --  0.0.0.0/0      0.0.0.0/0     ADDRTYPE match dst-type LOCAL
DOCKER          all  --  0.0.0.0/0      !127.0.0.0/8  ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT)
target          prot opt source         destination
MASQUERADE      all  --  0.0.0.0/0      0.0.0.0/0     ADDRTYPE match src-type LOCAL
MASQUERADE      all  --  172.18.0.0/16  0.0.0.0/0
MASQUERADE      all  --  172.26.0.0/16  0.0.0.0/0
MASQUERADE      all  --  172.19.0.0/16  0.0.0.0/0
MASQUERADE      all  --  172.25.0.0/16  0.0.0.0/0
MASQUERADE      all  --  10.0.3.0/24    !10.0.3.0/24

Chain DOCKER (2 references)
target          prot opt source         destination
RETURN          all  --  0.0.0.0/0      0.0.0.0/0
RETURN          all  --  0.0.0.0/0      0.0.0.0/0
RETURN          all  --  0.0.0.0/0      0.0.0.0/0
RETURN          all  --  0.0.0.0/0      0.0.0.0/0

Chain DOCKER-INGRESS (2 references)
target          prot opt source         destination
DNAT            tcp  --  0.0.0.0/0      0.0.0.0/0     tcp dpt:8080 to:172.19.0.2:8080
RETURN          all  --  0.0.0.0/0      0.0.0.0/0
You can see the rule that matches traffic destined to port 8080 and forwards it to 172.19.0.2. We can see that 172.19.0.2 belongs to the ingress_sbox namespace if we inspect its interfaces:
ubuntu@swarm01:~$ sudo ls -l /run/docker/netns
total 0
-rw-r--r-- 1 root root 0 Dec 16 16:03 1-3kuba8yq3c
-rw-r--r-- 1 root root 0 Dec 16 16:05 1-bwii51jagl
-rw-r--r-- 1 root root 0 Dec 16 16:05 fb3e041d72b7
-r--r--r-- 1 root root 0 Dec 16 15:54 ingress_sbox

ubuntu@swarm01:~$ sudo nsenter --net=/run/docker/netns/ingress_sbox ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0a:ff:00:04
          inet addr:10.255.0.4  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:aff:feff:4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:23 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1854 (1.8 KB)  TX bytes:648 (648.0 B)

eth1      Link encap:Ethernet  HWaddr 02:42:ac:13:00:02
          inet addr:172.19.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe13:2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:24 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1944 (1.9 KB)  TX bytes:648 (648.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
Docker then uses an iptables mangle rule to mark packets destined to port 8080 with the firewall mark 0x102 (258 in decimal), which IPVS then uses to load balance to the appropriate containers:
ubuntu@swarm01:~$ sudo nsenter --net=/run/docker/netns/ingress_sbox iptables -t mangle -L -n
Chain PREROUTING (policy ACCEPT)
target  prot opt source      destination
MARK    tcp  --  0.0.0.0/0   0.0.0.0/0    tcp dpt:8080 MARK set 0x102

Chain INPUT (policy ACCEPT)
target  prot opt source      destination

Chain FORWARD (policy ACCEPT)
target  prot opt source      destination

Chain OUTPUT (policy ACCEPT)
target  prot opt source      destination
MARK    all  --  0.0.0.0/0   10.255.0.2   MARK set 0x102
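The mark 0x102 (258 in decimal) corresponds to an IPVS virtual service keyed on that firewall mark. If ipvsadm is installed on the host (it usually is not by default), the IPVS table inside the same namespace can be listed; it should show an FWM 258 entry with the webapp task IPs on the ingress network as its destinations:

ubuntu@swarm01:~$ sudo apt-get install -y ipvsadm
ubuntu@swarm01:~$ sudo nsenter --net=/run/docker/netns/ingress_sbox ipvsadm -Ln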
More details on how docker swarm uses iptables and IPVS to load balance to containers are presented in this talk.
docker_gwbridge
Finally, there is the docker_gwbridge network. This is a bridge network with a corresponding interface named docker_gwbridge created on each host participating in the swarm cluster. The docker_gwbridge network provides connectivity to the outside world for traffic originating from the containers in the swarm cluster (for example, if we ping google.com, that traffic goes out through the docker_gwbridge network).
I won’t go into details of the internals of this network as this is the same as the bridge networks covered in the previous post.
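That said, a quick way to confirm the bridge on the host is to look at the docker_gwbridge interface directly; it should carry an address in the 172.19.0.0/16 subnet seen earlier, and (if bridge-utils is installed) brctl will list the veth interfaces of the local containers attached to it:

ubuntu@swarm01:~$ ip addr show docker_gwbridge
ubuntu@swarm01:~$ brctl show docker_gwbridge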
Summary
When launching a container in a swarm cluster, the container can be attached to three (or more) networks by default. First there is the docker_gwbridge network, which allows containers to communicate with the outside world; then the ingress network, which is only involved if containers need to accept inbound connections from the outside world; and finally the user-created overlay networks that can be attached to containers. An overlay network serves as a shared subnet for containers launched onto the same network, over which they can communicate directly (even if they are running on different physical hosts).
We also saw that docker creates separate network namespaces by default in a swarm cluster; these manage the vxlan tunnels used for the overlay networks, as well as the load-balancing rules for inbound connections to containers.