Debugging sluggish performance of OOD 2.0 on virtual instance


assigned to @louistw

I've uploaded images of the network load times in a Firefox session from on campus for both the OOD2.0 dev site and the current OOD1.6 prod site.

You can see that on the OOD 2.0 site, multiple resources each take 2+ seconds to load on the initial page load.

By comparison, the load time for the production OOD 1.6 site is about 1.8 seconds for the core document, but then almost nothing for the remaining content.

added Debugging label

I set up an iperf3 test point on the Ubuntu20.04 debugger node on the same network segment as the v007 OOD2.0 instance.

iperf3 -s

Testing from the OOD2.0 instance shows acceptable throughput. The speeds are multiple Gbps, which is the minimum we should expect from the underlying 10+Gbps networking. The variation in speed is a little concerning since that can result from packet loss, but the TCP channel isn't showing any retransmits (the Retr column is 0 throughout), so it doesn't seem like the network is dropping packets.

[root@v007 ~]# iperf3 -c 10.250.0.93
Connecting to host 10.250.0.93, port 5201
[  5] local 10.250.0.180 port 52454 connected to 10.250.0.93 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   567 MBytes  4.75 Gbits/sec    0   3.08 MBytes       
[  5]   1.00-2.00   sec   550 MBytes  4.61 Gbits/sec    0   3.08 MBytes       
[  5]   2.00-3.00   sec  1.11 GBytes  9.56 Gbits/sec    0   3.08 MBytes       
[  5]   3.00-4.00   sec  1.17 GBytes  10.0 Gbits/sec    0   3.08 MBytes       
[  5]   4.00-5.00   sec   275 MBytes  2.31 Gbits/sec    0   3.08 MBytes       
[  5]   5.00-6.00   sec   250 MBytes  2.10 Gbits/sec    0   3.08 MBytes       
[  5]   6.00-7.00   sec  1.07 GBytes  9.16 Gbits/sec    0   3.08 MBytes       
[  5]   7.00-8.00   sec  1.22 GBytes  10.5 Gbits/sec    0   3.08 MBytes       
[  5]   8.00-9.00   sec  1.42 GBytes  12.2 Gbits/sec    0   3.08 MBytes       
[  5]   9.00-10.00  sec  1.21 GBytes  10.4 Gbits/sec    0   3.08 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  8.81 GBytes  7.57 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  8.81 GBytes  7.56 Gbits/sec                  receiver

iperf Done.
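To put a number on the variation noted above, here is a minimal Python sketch that parses the per-interval lines of an iperf3 client report and summarizes the spread. The function names (`interval_bitrates`, `summarize`) are illustrative and not part of iperf3; the sample text is copied from the run above, and the regular expression assumes the default human-readable output format shown there.

```python
import re
import statistics

# Matches per-interval lines such as:
# [  5]   0.00-1.00   sec   567 MBytes  4.75 Gbits/sec    0   3.08 MBytes
# Summary lines (ending in "sender"/"receiver") and headers do not match.
INTERVAL_RE = re.compile(
    r"^\[\s*\d+\]\s+[\d.]+-[\d.]+\s+sec\s+[\d.]+\s+\w+\s+"
    r"([\d.]+)\s+([KMG])bits/sec\s+\d+\s+[\d.]+\s+\w+$"
)
SCALE = {"K": 1e3, "M": 1e6, "G": 1e9}

# Per-interval lines from the first run in this thread (v007 -> 10.250.0.93).
FIRST_RUN = """
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   567 MBytes  4.75 Gbits/sec    0   3.08 MBytes
[  5]   1.00-2.00   sec   550 MBytes  4.61 Gbits/sec    0   3.08 MBytes
[  5]   2.00-3.00   sec  1.11 GBytes  9.56 Gbits/sec    0   3.08 MBytes
[  5]   3.00-4.00   sec  1.17 GBytes  10.0 Gbits/sec    0   3.08 MBytes
[  5]   4.00-5.00   sec   275 MBytes  2.31 Gbits/sec    0   3.08 MBytes
[  5]   5.00-6.00   sec   250 MBytes  2.10 Gbits/sec    0   3.08 MBytes
[  5]   6.00-7.00   sec  1.07 GBytes  9.16 Gbits/sec    0   3.08 MBytes
[  5]   7.00-8.00   sec  1.22 GBytes  10.5 Gbits/sec    0   3.08 MBytes
[  5]   8.00-9.00   sec  1.42 GBytes  12.2 Gbits/sec    0   3.08 MBytes
[  5]   9.00-10.00  sec  1.21 GBytes  10.4 Gbits/sec    0   3.08 MBytes
[  5]   0.00-10.00  sec  8.81 GBytes  7.57 Gbits/sec    0             sender
"""


def interval_bitrates(report):
    """Return the per-interval bitrates (bits/sec) found in an iperf3 report."""
    rates = []
    for line in report.splitlines():
        match = INTERVAL_RE.match(line.strip())
        if match:
            rates.append(float(match.group(1)) * SCALE[match.group(2)])
    return rates


def summarize(rates):
    """Mean, min, max (in Gbits/sec) and coefficient of variation."""
    mean = statistics.mean(rates)
    return {
        "mean_gbps": mean / 1e9,
        "min_gbps": min(rates) / 1e9,
        "max_gbps": max(rates) / 1e9,
        "cv": statistics.pstdev(rates) / mean,  # spread relative to the mean
    }


if __name__ == "__main__":
    print(summarize(interval_bitrates(FIRST_RUN)))
```

For the run above, the mean of the ten intervals comes out at about 7.56 Gbits/sec, in line with the summary line iperf3 prints, with individual intervals ranging from 2.10 to 12.2 Gbits/sec.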

Running a test from the OOD2.0 node to the production OOD1.6 node also shows acceptable performance. This demonstrates that the route between the cluster network segment (in TIC) and the openstack-cheaha-internal segment (in DCB) is decent. The throughput varies some, but not as much as on the local segment.

[root@v007 ~]# iperf3 -c 172.20.0.30
Connecting to host 172.20.0.30, port 5201
[  5] local 10.250.0.180 port 44324 connected to 172.20.0.30 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   394 MBytes  3.31 Gbits/sec    0   3.04 MBytes       
[  5]   1.00-2.00   sec   511 MBytes  4.29 Gbits/sec    0   3.04 MBytes       
[  5]   2.00-3.00   sec   728 MBytes  6.10 Gbits/sec    0   3.04 MBytes       
[  5]   3.00-4.00   sec   798 MBytes  6.69 Gbits/sec    0   3.04 MBytes       
[  5]   4.00-5.00   sec  1.12 GBytes  9.60 Gbits/sec    0   3.04 MBytes       
[  5]   5.00-6.00   sec   542 MBytes  4.55 Gbits/sec    0   3.04 MBytes       
[  5]   6.00-7.00   sec  1.08 GBytes  9.28 Gbits/sec    0   3.04 MBytes       
[  5]   7.00-8.00   sec  1.00 GBytes  8.61 Gbits/sec    0   3.04 MBytes       
[  5]   8.00-9.00   sec  1.06 GBytes  9.07 Gbits/sec    0   3.04 MBytes       
[  5]   9.00-10.00  sec  1.14 GBytes  9.78 Gbits/sec    0   3.04 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  8.30 GBytes  7.13 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  8.30 GBytes  7.13 Gbits/sec                  receiver

iperf Done.

I ran a test between my laptop on UAB wifi and the debugger iperf3 test point in the cheaha-cloud project. The performance is much lower, as expected from a wifi link.

root@laptop:/etc/apt/sources.list.d# iperf3 -c 138.26.48.250
Connecting to host 138.26.48.250, port 5201
[  5] local 172.24.253.70 port 49054 connected to 138.26.48.250 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  5.74 MBytes  48.1 Mbits/sec    0    290 KBytes       
[  5]   1.00-2.00   sec  9.30 MBytes  78.0 Mbits/sec    0    706 KBytes       
[  5]   2.00-3.00   sec  14.3 MBytes   120 Mbits/sec    0   1.37 MBytes       
[  5]   3.00-4.00   sec  12.5 MBytes   105 Mbits/sec    0   1.85 MBytes       
[  5]   4.00-5.00   sec  16.2 MBytes   136 Mbits/sec    0   2.45 MBytes       
[  5]   5.00-6.00   sec  10.0 MBytes  83.9 Mbits/sec    0   2.91 MBytes       
[  5]   6.00-7.00   sec  16.2 MBytes   136 Mbits/sec    0   3.12 MBytes       
[  5]   7.00-8.00   sec  7.50 MBytes  62.9 Mbits/sec    0   3.12 MBytes       
[  5]   8.00-9.00   sec  5.00 MBytes  41.9 Mbits/sec    0   3.12 MBytes       
[  5]   9.00-10.00  sec  8.75 MBytes  73.4 Mbits/sec    0   3.12 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   106 MBytes  88.6 Mbits/sec    0             sender
[  5]   0.00-10.01  sec   103 MBytes  86.1 Mbits/sec                  receiver

iperf Done.

Even though the wifi speed is slower by about a factor of 100, that doesn't mean the network is to blame for the sluggish UI.
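A rough back-of-the-envelope check of this point, as a Python sketch. The two bitrates are the sender totals from the iperf3 runs above; the 1 MB resource size is a hypothetical round number, since the actual artifact sizes are not given in this thread. The calculation also ignores latency and TCP slow start, so it is only a lower bound on transfer time.

```python
def transfer_seconds(size_bytes, bitrate_bits_per_sec):
    """Time to move size_bytes at a sustained bitrate, ignoring latency."""
    return size_bytes * 8 / bitrate_bits_per_sec


LAN_BPS = 7.57e9    # sender total, v007 -> debugger node (from the run above)
WIFI_BPS = 88.6e6   # sender total, laptop on wifi -> debugger node
ASSUMED_RESOURCE_BYTES = 1_000_000  # hypothetical 1 MB artifact, not measured

slowdown = LAN_BPS / WIFI_BPS
wifi_time = transfer_seconds(ASSUMED_RESOURCE_BYTES, WIFI_BPS)

if __name__ == "__main__":
    print(f"wifi is slower by a factor of about {slowdown:.0f}")
    print(f"1 MB over wifi takes about {wifi_time:.2f} s")
```

With these numbers the wifi link is roughly 85 times slower, yet a 1 MB resource would still transfer in about 0.09 seconds, far below the 2+ second loads seen in the browser.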

I can't run an iperf3 test to the production OOD1.6 endpoint because we don't have that port open on that node.

These tests are mainly to provide a network baseline that we can take into consideration while debugging.

I suspect the root cause may be related to how fast the OOD2.0 instance is able to read and serve the files to the browser. That is, the delays are related to the file system.

Does the site rely on NFS to serve up content or is all the site content on the local volume?

We explored these results in our Zoom meeting. It looks like an issue related to loading the dashboard page and its artifacts (css & png) via the PUN processes. We suspect this has something to do with the processing of the artifacts through the Ruby engine.

We will deploy a v1.6.5 OOD instance to compare performance and see whether this is a new issue in OOD 2.0.2. We also need to compare this to the deployments on the personal bright dev clusters.

These diagrams of the OOD architecture should help isolate debugging efforts to the different components involved. We still suspect that the passenger/ruby space is involved, since the delays are in the time it takes the server to process the request after it is received from the browser.

https://osc.github.io/ood-documentation/latest/architecture.html
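One way to measure that server-side wait outside the browser is to time how long a request takes to produce response headers once the connection is already open. The following is a minimal Python sketch using only the standard library; the function name `time_to_first_byte` is illustrative, any URL passed to it is a placeholder, and a real OOD endpoint would also need authentication, which is not handled here.

```python
import http.client
import time
from urllib.parse import urlsplit


def time_to_first_byte(url, timeout=10.0):
    """Return (wait_seconds, download_seconds, body_length) for a GET.

    The connection is opened before the clock starts, so wait_seconds
    covers sending the request and waiting for the response headers,
    which approximates the server's processing time.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.connect()                    # TCP (and TLS) setup, not timed
        start = time.perf_counter()
        conn.request("GET", target)
        response = conn.getresponse()     # returns once headers are read
        waited = time.perf_counter() - start
        body = response.read()
        total = time.perf_counter() - start
    finally:
        conn.close()
    return waited, total - waited, len(body)
```

If the wait portion dominates while the download portion is negligible, that would be consistent with the delay sitting in request processing rather than in the network.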

added 1 design

The sketch of the network paths between virtual instances and storage:

This is just to highlight that the components involved in the data flows aren't expected to suffer from bandwidth issues, at least at the physical/ethernet layer. That doesn't mean we don't have application (ceph/openstack/centos-instance/ood-app) performance issues. Isolating those is the purpose of further debugging.

marked this issue as related to #11 (closed)

added 1 design

I'm still seeing an approximately 2s load time for the dashboard on prod (login005), but the remainder of the artifacts are fast.

added 1 design

So an incognito tab against production does show significant delays for all the artifacts. This seems consistent with the performance observed on 2.x and 1.6 in the virtual instance.

marked this issue as related to #19

A joint debug session tracked this down to a no-store cache setting in the Apache conf file template.

https://github.com/OSC/ondemand/blob/v2.0.27/ood-portal-generator/templates/ood-portal.conf.erb#L95

This prevents the browser from caching artifacts and is the source of the sluggish page load experience. Disabling this Header line resolves the issues with poor performance.
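To confirm whether a given response is affected, the Cache-Control header can be inspected for the no-store directive. The following Python sketch shows the check; the function names are illustrative, and the header values used with it are examples only, since the exact value set by the template is not quoted in this thread.

```python
def cache_directives(header_value):
    """Split a Cache-Control header value into a {directive: value} dict.

    Directive names are lowercased; directives without a value map to None.
    """
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def browser_may_store(header_value):
    """True unless the response forbids storing it in any cache."""
    return "no-store" not in cache_directives(header_value)


if __name__ == "__main__":
    # Example values only, not copied from the OOD template.
    for value in ("max-age=0, no-store", "public, max-age=3600"):
        print(value, "->", browser_may_store(value))
```

A response carrying no-store has to be fetched again on every page load, which matches the behaviour seen in the incognito test, where nothing is cached to begin with.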

This was reported in Nov 2021.

The upstream fix is expected in OOD2.1. In the meantime, we will maintain the patch locally.

closed

added Sprint 22-14 label and removed Debugging label

mentioned in issue #46 (closed)

mentioned in issue #167 (closed)

mentioned in issue #485
