neuca-guest-tools 1.7: VM configuration at instantiation and “Picking yourself up by the bootstraps”

Do I dare
Disturb the universe?
In a minute there is time
For decisions and revisions which a minute will reverse.

— T. S. Eliot, “The Love Song of J. Alfred Prufrock”

ExoGENI experimenters!

Have you ever wanted to reboot your VMs, but found yourself unable to log back into them after having done so?

If so, then fear the reboot no more; we here at ExoGENI Central have heard your laments, and have worked hard to address them!

We proudly announce the availability of neuca-guest-tools 1.7.
For those who are unaware – the neuca-guest-tools are included in most ExoGENI VM images, and handle the business of performing certain types of configuration (e.g. network address assignment) when VMs are created. In this respect, they are similar to cloud-init – but perform several additional tasks.

In this latest release, we have performed a significant clean-up and re-organization of the code. Several known and latent bugs were fixed (though, others may well have been introduced), and all python code has been PEP8-ified (for ease of reading and modification).

As to new features and changes in behavior?

  • We ensure that network interfaces retain their device names across reboots. This is accomplished by generating udev rules for network interface devices on Linux (see the sketch just after this list). Pinning the names prevents the management interface from being subverted by the kernel’s probe order during a reboot, which was the primary reason that VMs with multiple interfaces became unreachable after rebooting.
  • In Linux VMs that use NetworkManager (I’m looking at you, CentOS 7), NetworkManager is not allowed to interfere with the configuration of interfaces that are meant to be under the management of the neuca-guest-tools. This is done by having neuca-guest-tools mark dataplane interfaces as “unmanaged” within the context of NetworkManager.
  • Dataplane interface address configurations are only modified when the neuca-guest-tools are restarted, or when a change has been made to the request. Therefore, if you make manual changes to a dataplane interface while the VM is running (for example, via ifconfig), that change should persist until you reboot the VM, restart the neuca-guest-tools, or alter that interface in your request. Dataplane interfaces can still be excluded from any address configuration changes by adding their MAC addresses (comma-separated, if there are multiple interfaces you wish to ignore) to the “dataplane-macs-to-ignore” configuration item in /etc/neuca/config (see the example after this list).
  • Both System V init scripts and systemd unit files should now be named: neuca-guest-tools
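For the curious, the persistent naming works by pinning each interface name to its MAC address with a udev rule. A generated rule looks roughly like the following (the file path and MAC address here are purely illustrative; one such rule is written per interface):

# e.g. in /etc/udev/rules.d/70-persistent-net.rules (illustrative path)
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="fe:16:3e:00:1a:b2", NAME="eth1"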
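And here is a sketch of the “dataplane-macs-to-ignore” item mentioned above; the MAC addresses are illustrative, and the surrounding layout of the file may differ between releases, so check the comments in /etc/neuca/config on your own VM:

# in /etc/neuca/config
dataplane-macs-to-ignore = fe:16:3e:00:1a:b2,fe:16:3e:00:1a:b3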

As has traditionally been the case, the neuca-guest-tools primarily target the various flavors of Linux. Any Unix-like OS should be supportable, however, and contributions to enable those OSes are welcome. Support for Windows is on the long-term horizon.

The source repository can be found here:
https://github.com/RENCI-NRIG/neuca-guest-tools

Following Ubuntu’s lead, however, we’re no longer packaging (or supporting) the tools for Ubuntu 12.04; while the python code should still work with the version of python distributed in 12.04 (2.6!), maintaining packages for distributions without vendor support seemed somewhat counter-productive.

Packages for recent and supported versions of CentOS, Fedora, and Ubuntu can be found here:
http://software.exogeni.net/repo/exogeni/neuca-guest-tools/

A source release can also be found at that location, for those who wish to attempt installing on versions of Linux for which we do not have packages.
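If you do go the source route, installation should look roughly like the usual Python/setuptools dance. This is only a sketch (the tarball name is illustrative, and the exact steps may differ between releases), so please defer to the README shipped with the source:

tar xzf neuca-guest-tools-1.7.tar.gz    # illustrative file name
cd neuca-guest-tools-1.7
python setup.py install                 # assumes a standard setuptools setup.py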

We have also provided new VM images that contain the latest release of neuca-guest-tools; these are:

  • CentOS 6.9 v1.0.2
  • CentOS 7.4 v1.0.2
  • Fedora 25 v1.0.6
  • Fedora 26 v1.0.1
  • Ubuntu 14.04 v1.0.3
  • Ubuntu 16.04 v1.0.3

As a reminder – if you’d like to check what version of the neuca-guest-tools you’re running in your VM, you can run:

neuca-version

A sample run might look like the following:


[root@Node0 ~]# neuca-version
NEuca version 1.7

Remember: unless you’re running neuca-guest-tools 1.7, any VMs having multiple interfaces are unlikely to survive a reboot.

Finally – if you’d like to create your own images from one of the ones that has already been provided for you, we suggest taking a look at the “image capture” script, which can be found here:
http://geni-images.renci.org/images/tools/imgcapture.sh

We’ve recently made some changes to it as well, so that custom images are captured more reliably. We’ve also added the ability to capture xattrs (if they are set on filesystems within your image); this should make it possible to boot SELinux-enabled images. If there is interest in performing experiments using SELinux-enabled images, we will provide base VM images with SELinux enabled (from which customized images can be derived).

If you would like an example of how to use the image capture script, please take a look at the following fine ExoBlog entry:
Creating a Custom Image from an Existing Virtual Machine

We hope you enjoy the new release of neuca-guest-tools!

Lies, damn lies, and iperf: Dataplane network tuning in ExoGENI today

Welcome, intrepid ExoGENI experimenters, to another episode of “Performing awesome, high-bandwidth experiments for fame, fortune, and profit”!

Ahem. We may have to use our imaginations, if we’re expecting any of that to be true. 🙂

Let’s focus, instead, on what it takes to ensure that your high-bandwidth experiments proceed successfully.

First, let’s set the scene.

To test out how things perform for your experiment, you’ve allocated a pair of ExoGENI VMs in separate racks, connected by a 5 gigabit link. You’ve also allocated a pair of ExoGENI bare metal machines that are likewise in separate racks from one another and connected by a 5 gigabit link.

Once these two topologies have successfully become active, you log into the VMs and do what anyone who’s interested in testing networking performance does: you run iperf in server mode on one VM, and run iperf as a client in the other VM.
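For reference, the server side is nothing exotic; on the VM holding 172.16.0.1 in this example, it is simply:

iperf -s

The client run on the other VM then looks like this: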

-bash-4.1# iperf -t 60 -i 10 -c 172.16.0.1
------------------------------------------------------------
Client connecting to 172.16.0.1, TCP port 5001
TCP window size: 19.3 KByte (default)
------------------------------------------------------------
[ 3] local 172.16.0.2 port 35805 connected with 172.16.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 2.10 GBytes 1.80 Gbits/sec
[ 3] 10.0-20.0 sec 1.81 GBytes 1.55 Gbits/sec
[ 3] 20.0-30.0 sec 1.69 GBytes 1.45 Gbits/sec
[ 3] 30.0-40.0 sec 1.68 GBytes 1.44 Gbits/sec
[ 3] 40.0-50.0 sec 1.68 GBytes 1.44 Gbits/sec
[ 3] 50.0-60.0 sec 1.97 GBytes 1.69 Gbits/sec
[ 3] 0.0-60.0 sec 10.9 GBytes 1.56 Gbits/sec

Gee. That’s TCP, and the performance is pretty awful. Let’s see how bad UDP is, just to compare.

-bash-4.1# iperf -t 60 -i 10 -c 172.16.0.1 -u
------------------------------------------------------------
Client connecting to 172.16.0.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 122 KByte (default)
------------------------------------------------------------
[ 3] local 172.16.0.2 port 35132 connected with 172.16.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] 10.0-20.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] 20.0-30.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] 30.0-40.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] 40.0-50.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] 50.0-60.0 sec 1.25 MBytes 1.05 Mbits/sec
[ 3] 0.0-60.0 sec 7.50 MBytes 1.05 Mbits/sec
[ 3] Sent 5351 datagrams
[ 3] Server Report:
[ 3] 0.0-59.8 sec 7.48 MBytes 1.05 Mbits/sec 0.023 ms 14/ 5351 (0.26%)

That’s weird; the loss rate looks fine, but we’re barely using any of the link’s bandwidth. That’s because iperf defaults to a mere 1 Mbit/s for UDP tests; let’s increase it from the default, using iperf’s “-b” flag.

-bash-4.1# iperf -t 60 -i 10 -c 172.16.0.1 -u -b 100M
------------------------------------------------------------
Client connecting to 172.16.0.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 122 KByte (default)
------------------------------------------------------------
[ 3] local 172.16.0.2 port 58836 connected with 172.16.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 120 MBytes 101 Mbits/sec
[ 3] 10.0-20.0 sec 120 MBytes 101 Mbits/sec
[ 3] 20.0-30.0 sec 120 MBytes 101 Mbits/sec
[ 3] 30.0-40.0 sec 120 MBytes 101 Mbits/sec
[ 3] 40.0-50.0 sec 120 MBytes 101 Mbits/sec
[ 3] 50.0-60.0 sec 120 MBytes 101 Mbits/sec
[ 3] 0.0-60.0 sec 719 MBytes 101 Mbits/sec
[ 3] Sent 512817 datagrams
[ 3] Server Report:
[ 3] 0.0-60.0 sec 719 MBytes 101 Mbits/sec 0.139 ms 0/512816 (0%)
[ 3] 0.0-60.0 sec 1 datagrams received out-of-order

A packet arrived out of order, but that looks much better. Let’s try something closer to what we asked for in the slice:

-bash-4.1# iperf -t 60 -i 10 -c 172.16.0.1 -u -b 4000M
------------------------------------------------------------
Client connecting to 172.16.0.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 122 KByte (default)
------------------------------------------------------------
[ 3] local 172.16.0.2 port 38042 connected with 172.16.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 965 MBytes 809 Mbits/sec
[ 3] 10.0-20.0 sec 965 MBytes 809 Mbits/sec
[ 3] 20.0-30.0 sec 965 MBytes 809 Mbits/sec
[ 3] 30.0-40.0 sec 965 MBytes 809 Mbits/sec
[ 3] 40.0-50.0 sec 965 MBytes 809 Mbits/sec
[ 3] 50.0-60.0 sec 964 MBytes 809 Mbits/sec
[ 3] 0.0-60.0 sec 5.65 GBytes 809 Mbits/sec
[ 3] Sent 4128674 datagrams
[ 3] Server Report:
[ 3] 0.0-60.0 sec 5.62 GBytes 805 Mbits/sec 0.009 ms 20078/4128673 (0.49%)
[ 3] 0.0-60.0 sec 526 datagrams received out-of-order

OK – there’s something seriously wrong. The bandwidth is far below what we specified (both on the command line, and in our slice request), but only 0.49% of the packets were lost?

We’ve learned our first lesson: iperf is full of lies.

This has got to be a bug; let’s check http://iperf.sourceforge.net

Oh. Wow. Iperf was last released in 2008. It’s not only full of lies, it’s old.

There is good news. The fine folks at ESnet have re-implemented iperf as iperf3, and have worked hard to clean it up, make it honest, and keep it maintained. We at ExoGENI Central use and recommend it.

ESnet’s page comparing the old and new versions is here:

https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf-and-iperf3/

Their github repository containing iperf3’s source is linked from that page (if you want to compile iperf3 from source), or you can install it on most RPM-based Linux distributions by issuing the following command:

yum -y install iperf3
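On Debian or Ubuntu, the equivalent should be an apt-get install (assuming your release carries the package). And, just as with the old iperf, you’ll need iperf3 running in server mode on the other end before any of the client runs below will work:

apt-get -y install iperf3   # Debian/Ubuntu equivalent, if your release packages it
iperf3 -s                   # run on the server node (172.16.0.1 in these examples)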

Let’s try testing again in our pair of VMs, first with TCP:

-bash-4.1# iperf3 -t 60 -i 10 -c 172.16.0.1
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 52374 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 1.71 GBytes 1.46 Gbits/sec 0 3.05 MBytes
[ 4] 10.00-20.00 sec 1.90 GBytes 1.63 Gbits/sec 0 3.05 MBytes
[ 4] 20.00-30.00 sec 1.75 GBytes 1.50 Gbits/sec 0 3.05 MBytes
[ 4] 30.00-40.00 sec 1.71 GBytes 1.47 Gbits/sec 0 3.05 MBytes
[ 4] 40.00-50.00 sec 1.92 GBytes 1.65 Gbits/sec 0 3.05 MBytes
[ 4] 50.00-60.01 sec 1.79 GBytes 1.53 Gbits/sec 0 3.05 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-60.01 sec 10.8 GBytes 1.54 Gbits/sec 0 sender
[ 4] 0.00-60.01 sec 10.8 GBytes 1.54 Gbits/sec receiver

Still terrible, but consistent with the results we got from iperf, and now we also get to see the congestion window (Cwnd) that TCP decided to use. Let’s try UDP again, starting at 100 megabits:

-bash-4.1# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 100M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 44764 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 124 MBytes 104 Mbits/sec 15844
[ 4] 10.00-20.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 20.00-30.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 30.00-40.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 40.00-50.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 50.00-60.00 sec 125 MBytes 105 Mbits/sec 16000
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 749 MBytes 105 Mbits/sec 0.056 ms 25208/95844 (26%)
[ 4] Sent 95844 datagrams

Ouch. That’s terrible, but believable. Let’s try UDP at 4 gigabits again, just to see how bad it can be:

-bash-4.1# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 4000M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 34263 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 4.42 GBytes 3.80 Gbits/sec 579139
[ 4] 10.00-20.00 sec 4.63 GBytes 3.98 Gbits/sec 607206
[ 4] 20.00-30.00 sec 4.95 GBytes 4.26 Gbits/sec 649286
[ 4] 30.00-40.00 sec 4.61 GBytes 3.96 Gbits/sec 603633
[ 4] 40.00-50.00 sec 4.85 GBytes 4.17 Gbits/sec 635802
[ 4] 50.00-60.00 sec 4.92 GBytes 4.23 Gbits/sec 645493
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 28.4 GBytes 4.06 Gbits/sec 0.081 ms 2354862/3720471 (63%)
[ 4] Sent 3720471 datagrams

Wowzers. 63% loss. What can we do?

Luckily, the ESnet folks have got our backs again.

If we go to: https://fasterdata.es.net/host-tuning/linux/

we find a wealth of information on tuning options for Linux.

In our experience here in ExoGENI, some of these values apply to VMs, and some don’t. Let’s try adding the sysctl options from that page to /etc/sysctl.conf on both of our VMs; we’ll use the values that are specified for the 100 ms RTT.

Besides the values specified there, we at ExoGENI also add the following:

net.core.rmem_default = 16777216
net.core.wmem_default = 16777216

In order to apply those values “live”, we have to issue the following command in each VM, after adding the new values to /etc/sysctl.conf:

sysctl -p
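To double-check that the new values actually took effect, you can query individual keys back out (the key names are the ones listed in appendix “A” below):

sysctl net.core.rmem_max
sysctl net.ipv4.tcp_rmem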

One of the things that *doesn’t* apply inside VMs, in our experience, is increasing the length of the transmission queue with ifconfig. So we won’t do that.

Increasing the MTU would also be useful, but we can’t do that yet in ExoGENI VMs; we’re working hard to rectify that in the near future.

Now, let’s re-try our tests, with the new values. First TCP:

-bash-4.1# iperf3 -t 60 -i 10 -c 172.16.0.1
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 52574 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 3.25 GBytes 2.79 Gbits/sec 0 8.97 MBytes
[ 4] 10.00-20.00 sec 4.90 GBytes 4.21 Gbits/sec 0 12.0 MBytes
[ 4] 20.00-30.00 sec 4.90 GBytes 4.21 Gbits/sec 434 7.02 MBytes
[ 4] 30.00-40.00 sec 4.77 GBytes 4.10 Gbits/sec 0 12.0 MBytes
[ 4] 40.00-50.00 sec 5.27 GBytes 4.52 Gbits/sec 1717 7.73 MBytes
[ 4] 50.00-60.00 sec 5.63 GBytes 4.84 Gbits/sec 2247 6.20 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-60.00 sec 28.7 GBytes 4.11 Gbits/sec 4398 sender
[ 4] 0.00-60.00 sec 28.7 GBytes 4.11 Gbits/sec receiver

That’s much better. Not perfect, because there is some obvious variance; some of that stems from how ExoGENI implements policing of link bandwidth for VMs, and we’re working to correct it.

Having gotten good results from TCP, let’s try UDP again at 100 megabits:

-bash-4.1# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 100M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 37860 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 124 MBytes 104 Mbits/sec 15844
[ 4] 10.00-20.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 20.00-30.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 30.00-40.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 40.00-50.00 sec 125 MBytes 105 Mbits/sec 16000
[ 4] 50.00-60.00 sec 125 MBytes 105 Mbits/sec 16000
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 749 MBytes 105 Mbits/sec 0.006 ms 0/95844 (0%)
[ 4] Sent 95844 datagrams

Wow. That’s nearly perfect, but it’s only 100 megabits. Let’s try 4 gigabits again:

-bash-4.1# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 4000M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 33143 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 4.67 GBytes 4.01 Gbits/sec 612436
[ 4] 10.00-20.00 sec 4.81 GBytes 4.13 Gbits/sec 629906
[ 4] 20.00-30.00 sec 4.70 GBytes 4.04 Gbits/sec 616525
[ 4] 30.00-40.00 sec 4.39 GBytes 3.77 Gbits/sec 575003
[ 4] 40.00-50.00 sec 4.92 GBytes 4.22 Gbits/sec 644481
[ 4] 50.00-60.00 sec 5.60 GBytes 4.81 Gbits/sec 734351
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 29.1 GBytes 4.16 Gbits/sec 0.067 ms 2549105/3812619 (67%)
[ 4] Sent 3812619 datagrams

Still terrible. That leads us to a dirty little secret: the folks developing the virtio “hardware” in KVM (the software providing the virtual machines in ExoGENI) are still working hard on improving the host side of the virtio paravirtualized network device when it comes to UDP performance. Getting even close to line rate with UDP in ExoGENI VMs is unlikely at this time. We’re keeping an eye on developments in KVM, as well as investigating other avenues for improving UDP performance (like allowing jumbo frames to be used in VMs, and implementing SR-IOV support for ExoGENI VMs).

That brings us to the wonderful world of bare metal nodes, and the pair that we allocated earlier.

First, let’s see how iperf3 measures the untuned TCP performance of our pair of bare metal nodes.

[root@ufl-w10 ~]# iperf3 -t 60 -i 10 -c 172.16.0.1
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 50276 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 1.79 GBytes 1.54 Gbits/sec 0 2.56 MBytes
[ 4] 10.00-20.00 sec 1.92 GBytes 1.65 Gbits/sec 0 2.56 MBytes
[ 4] 20.00-30.00 sec 1.92 GBytes 1.65 Gbits/sec 0 2.56 MBytes
[ 4] 30.00-40.00 sec 1.92 GBytes 1.65 Gbits/sec 0 2.56 MBytes
[ 4] 40.00-50.00 sec 1.92 GBytes 1.65 Gbits/sec 0 2.56 MBytes
[ 4] 50.00-60.00 sec 1.92 GBytes 1.65 Gbits/sec 0 2.56 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-60.00 sec 11.4 GBytes 1.63 Gbits/sec 0 sender
[ 4] 0.00-60.00 sec 11.4 GBytes 1.63 Gbits/sec receiver

Not great. Let’s see untuned UDP performance next, at 4 gigabits.

[root@ufl-w10 ~]# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 4000M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 34828 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 4.87 GBytes 4.18 Gbits/sec 638511
[ 4] 10.00-20.00 sec 4.87 GBytes 4.19 Gbits/sec 638739
[ 4] 20.00-30.00 sec 4.88 GBytes 4.19 Gbits/sec 639684
[ 4] 30.00-40.00 sec 4.89 GBytes 4.20 Gbits/sec 640903
[ 4] 40.00-50.00 sec 4.89 GBytes 4.20 Gbits/sec 640537
[ 4] 50.00-60.00 sec 4.87 GBytes 4.19 Gbits/sec 638848
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 29.3 GBytes 4.19 Gbits/sec 0.006 ms 983521/3837222 (26%)
[ 4] Sent 3837222 datagrams

Far better than the VMs, but also not great. We need to do some tuning.

First, we advise setting the same sysctl options on both bare metal nodes as we set on the VMs (the full list is reproduced in appendix “A” below), and then applying them with sysctl -p. Next, we advise increasing the MTU on the dataplane interfaces to 9000. After that, the length of the transmit queue (txqueuelen) on the dataplane interface should be increased to 10000. There are, however, some caveats.

As a result of GENI’s VLAN-based slicing model, the dataplane interface in a bare metal node is VLAN tagged; you will therefore need to make these changes on the parent interface first, and then on the VLAN-tagged child interface. So, if the parent interface for your dataplane interface were named p2p1 and your interface were tagged onto VLAN 1445, you would achieve this by issuing the following commands:

ifconfig p2p1 mtu 9000
ifconfig p2p1.1445 mtu 9000
ifconfig p2p1 txqueuelen 10000
ifconfig p2p1.1445 txqueuelen 10000
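If you’re on a distribution where ifconfig is deprecated or missing, the iproute2 equivalents should do the same job (interface names as in the hypothetical example above):

ip link set dev p2p1 mtu 9000
ip link set dev p2p1.1445 mtu 9000
ip link set dev p2p1 txqueuelen 10000
ip link set dev p2p1.1445 txqueuelen 10000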

Having done this on both our bare metal nodes, we now re-try our TCP measurements:

[root@ufl-w10 ~]# iperf3 -t 60 -i 10 -c 172.16.0.1
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 50279 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.01 sec 8.40 GBytes 7.21 Gbits/sec 0 12.0 MBytes
[ 4] 10.01-20.00 sec 8.61 GBytes 7.40 Gbits/sec 0 12.0 MBytes
[ 4] 20.00-30.00 sec 8.62 GBytes 7.40 Gbits/sec 0 12.0 MBytes
[ 4] 30.00-40.00 sec 8.62 GBytes 7.41 Gbits/sec 0 12.0 MBytes
[ 4] 40.00-50.00 sec 8.61 GBytes 7.40 Gbits/sec 0 12.0 MBytes
[ 4] 50.00-60.00 sec 8.60 GBytes 7.39 Gbits/sec 0 12.0 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-60.00 sec 51.5 GBytes 7.37 Gbits/sec 0 sender
[ 4] 0.00-60.00 sec 51.5 GBytes 7.37 Gbits/sec receiver

Over 7 gigabits? Great, but I thought our slice only provisioned 5 gigabits to the link.

Time for another dirty secret: ExoGENI doesn’t, at this time, police the interfaces on bare metal nodes.

Intra-rack, you should be able to achieve a full 10 gigabits per second between two bare metal nodes.

Inter-rack, you’ll achieve whatever the provider of the link (BEN, Internet2, FLR, etc.) provides. In the example above, FLR is providing us with a link capable of over 7 gigabits between the FIU and UFL racks.

With that secret out in the open, let’s try UDP, first at 4 gigabits per second:

[root@ufl-w10 ~]# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 4000M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 37348 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 4.85 GBytes 4.17 Gbits/sec 636207
[ 4] 10.00-20.00 sec 4.88 GBytes 4.19 Gbits/sec 639999
[ 4] 20.00-30.00 sec 4.88 GBytes 4.19 Gbits/sec 639963
[ 4] 30.00-40.00 sec 4.88 GBytes 4.19 Gbits/sec 640004
[ 4] 40.00-50.00 sec 4.88 GBytes 4.20 Gbits/sec 640128
[ 4] 50.00-60.00 sec 4.88 GBytes 4.19 Gbits/sec 639873
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 29.3 GBytes 4.19 Gbits/sec 0.004 ms 1/3836174 (2.6e-05%)
[ 4] Sent 3836174 datagrams

Perfect. 🙂 Now let’s try UDP at 7 gigabits, since that’s what the TCP test showed us was possible:

[root@ufl-w10 ~]# iperf3 -t 60 -i 10 -c 172.16.0.1 -u -b 7000M
Connecting to host 172.16.0.1, port 5201
[ 4] local 172.16.0.2 port 56723 connected to 172.16.0.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-10.00 sec 8.52 GBytes 7.32 Gbits/sec 1116849
[ 4] 10.00-20.00 sec 8.52 GBytes 7.32 Gbits/sec 1116800
[ 4] 20.00-30.00 sec 8.56 GBytes 7.36 Gbits/sec 1122464
[ 4] 30.00-40.00 sec 8.55 GBytes 7.35 Gbits/sec 1120834
[ 4] 40.00-50.00 sec 8.52 GBytes 7.32 Gbits/sec 1116522
[ 4] 50.00-60.00 sec 8.56 GBytes 7.36 Gbits/sec 1122586
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-60.00 sec 51.2 GBytes 7.34 Gbits/sec 0.004 ms 1/6716055 (1.5e-05%)
[ 4] Sent 6716055 datagrams

Again, perfect.

So, congratulations if you made it this far! Your prize is having learned what it takes, in depth, to make your high-bandwidth experiments successful on ExoGENI.

For the rest of you?

TL;DR:

  1. iperf is full of lies; use iperf3.
  2. Bare metal performs much better than VMs.
  3. VMs perform better with TCP than UDP.
  4. Tuning options in /etc/sysctl.conf should contain items from appendix “A” below.
  5. Apply the options in appendix “A” with: sysctl -p
  6. On bare metal only, increase the txqueuelen and MTU of the dataplane interface.
  7. To make (6) happen, you’ll need to do the same to the parent interface first, as in appendix “B”.

That concludes today’s adventure, experimenters! Take this knowledge, and use it well!

Appendices:

A:

net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
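One note on the congestion control line: htcp is built as a kernel module on most distributions, so if sysctl complains that the value is invalid, you will likely need to load the module first. These are the standard Linux module and sysctl names, but verify on your own node:

sysctl net.ipv4.tcp_available_congestion_control   # see what the kernel currently offers
modprobe tcp_htcp                                  # load H-TCP if it isn't listed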

B:
For a dataplane interface named p2p1.1445, increase the MTU and txqueuelen with the following commands:

ifconfig p2p1 mtu 9000
ifconfig p2p1.1445 mtu 9000
ifconfig p2p1 txqueuelen 10000
ifconfig p2p1.1445 txqueuelen 10000