We are seeing very slow performance on the OOD 2.0 UI deployed to the virtual instance. This is most apparent on the initial page load but affects all interaction generally. The instance is provisioned with 16 cores and 64 GB of RAM, which should be plenty, especially without any real load.
We need to figure out why the site is performing poorly.
I've uploaded images of the network load times in a Firefox session from on campus for both the OOD2.0 dev site and the current OOD1.6 prod site.
On the OOD 2.0 site, you can see that multiple resources each take 2+ seconds on the initial page load.
By comparison, the production OOD 1.6 site takes about 1.8 seconds for the core document but then almost nothing for the remaining content.
I set up an iperf3 test point on the Ubuntu 20.04 debugger node, which is on the same network segment as the v007 OOD2.0 instance.
iperf3 -s
Testing the performance from the OOD2.0 instance shows acceptable throughput. The speeds are multiple Gbps which is the minimum we should expect from the underlying 10+Gbps networking. The variation in speed is a little concerning since that can occur as a result of packet loss, but the TCP channel isn't showing any retransmits so it doesn't seem like the network is chewing up packets.
Running a test from the OOD2.0 node to the production OOD1.6 node also shows acceptable performance. This demonstrates that the route between the cluster network segment (in TIC) and the openstack-cheaha-internal segment (in DCB) is decent. It varies some but not as much as on the local segment.
I ran a test between my laptop on UAB wifi and the debugger iperf3 test point in the cheaha-cloud project and the performance is much lower, as expected from a wifi link.
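The throughput variation and retransmit counts discussed above can be read from iperf3's JSON output (the -J client flag) rather than eyeballed from the console. The following is a minimal sketch, not something that was run as part of this ticket: the function name and the sample numbers are made up for illustration, and it assumes a single TCP test whose report carries the usual "intervals" and "end" sections.

```python
import json
from statistics import mean, pstdev


def summarize_iperf3(report):
    """Summarize throughput spread and retransmits from an iperf3 JSON report."""
    # Per-interval aggregate throughput, converted from bits/s to Gbit/s.
    rates = [iv["sum"]["bits_per_second"] / 1e9 for iv in report["intervals"]]
    avg = mean(rates)
    return {
        "mean_gbps": avg,
        "min_gbps": min(rates),
        "max_gbps": max(rates),
        # Coefficient of variation: spread of the intervals relative to the mean.
        "cv": pstdev(rates) / avg if avg else 0.0,
        # Sender-side TCP retransmits for the whole run (key is absent for UDP).
        "retransmits": report["end"]["sum_sent"].get("retransmits", 0),
    }


# Made-up numbers in the shape of an iperf3 JSON report, not measured results.
sample_report = {
    "intervals": [
        {"sum": {"bits_per_second": 4.0e9}},
        {"sum": {"bits_per_second": 6.0e9}},
        {"sum": {"bits_per_second": 8.0e9}},
    ],
    "end": {"sum_sent": {"retransmits": 0}},
}
summary = summarize_iperf3(sample_report)
print(summary)

# With a real run saved to a file (file name is hypothetical):
#   summarize_iperf3(json.load(open("run.json")))
```

A high coefficient of variation together with zero retransmits would match the observation above: the speed moves around, but not because TCP is resending lost packets.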
I suspect the root cause may be related to how fast the OOD2.0 instance is able to read and serve the files to the browser. That is, the delays are related to the file system.
Does the site rely on NFS to serve up content or is all the site content on the local volume?
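One way to answer this on the instance itself is to look up which mount a content directory lives on. The sketch below is an assumption-laden illustration: the function name, the sample mount table, and the example paths are all hypothetical, and the real OOD content directories would need to be substituted. It parses text in the /proc/mounts format and returns the longest mount point that contains the path.

```python
def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains path, or None."""
    best = None
    target = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Longest matching mount point wins; on a tie the later entry wins,
        # since later mounts shadow earlier ones at the same location.
        if target.startswith(prefix):
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Hypothetical mount table in /proc/mounts format, for illustration only.
sample_mounts = """\
/dev/vda1 / xfs rw,relatime 0 0
nfs.example.edu:/export/home /home nfs4 rw,relatime 0 0
"""

print(fs_type_for("/var/www/ood/public", sample_mounts))
print(fs_type_for("/home/someuser/ondemand", sample_mounts))
```

On a Linux host the real table can be passed in with open("/proc/mounts").read(). An fs_type of nfs or nfs4 for a content path would mean those files are served over NFS rather than from the local volume. Mount points containing spaces are escaped in /proc/mounts and are not handled by this sketch.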
We explored these results in our Zoom meeting. It looks like an issue related to loading of the dashboard page and its artifacts (css & png) via the PUN processes. We suspect this has something to do with the processing of the artifacts through the Ruby engine.
We will deploy a v1.6.5 OOD instance and compare performance to see whether this is a new issue in OOD 2.0.2. We also need to compare this to the deployments on the personal bright dev clusters.
These diagrams of the OOD architecture should help isolate debugging efforts to the different components involved. We still suspect that the passenger/ruby space is involved, since the delays are in the time it takes the server to process the request after it is received from the browser.
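The server-side delay described above corresponds to the "wait" timing in a HAR export, which Firefox's Network panel can save for a session. As a rough sketch (the function name, the 1000 ms threshold, and the sample URLs and timings are assumptions for illustration, not data from this ticket), the slow requests can be ranked like this:

```python
def slowest_waits(har, threshold_ms=1000.0):
    """List (url, wait_ms) for requests whose server wait time meets the threshold."""
    rows = []
    for entry in har["log"]["entries"]:
        # HAR "wait" is the time spent waiting for the server's response, in ms.
        # A value of -1 means the timing does not apply to this request.
        wait = entry["timings"].get("wait", -1)
        if wait is not None and wait >= threshold_ms:
            rows.append((entry["request"]["url"], wait))
    # Slowest first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical HAR fragment; URLs and timings are invented for illustration.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://ood.example.edu/assets/app.css"},
             "timings": {"wait": 2100.0}},
            {"request": {"url": "https://ood.example.edu/dashboard"},
             "timings": {"wait": 2300.0}},
            {"request": {"url": "https://ood.example.edu/assets/logo.png"},
             "timings": {"wait": 40.0}},
        ]
    }
}

for url, wait_ms in slowest_waits(sample_har):
    print(f"{wait_ms:8.1f} ms  {url}")
```

If static artifacts such as css and png files show up with multi-second wait times, that points at server-side processing rather than transfer time, which is what the passenger/ruby suspicion predicts.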
The sketch of the network paths between virtual instances and storage:
This is just to highlight that the components involved in the data flows aren't expected to suffer from bandwidth issues, at least at the physical/ethernet layer. That doesn't mean we don't have performance issues at the application layer (ceph/openstack/centos-instance/ood-app). That is the purpose of further debugging.
Testing in an incognito tab against production does show significant delays for all the artifacts. This seems consistent with the performance observed with both 2.x and 1.6 in the virtual instance.
This prevents the browser from caching artifacts and is the source of the sluggish page load experience. Disabling this Header line resolves the issues with poor performance.
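To confirm the fix, or to catch a regression later, the response headers for an artifact can be checked for directives that stop the browser from reusing its cached copy. The thread does not show which header line was disabled, so the sketch below makes an assumption: it looks only at the standard Cache-Control and Pragma response headers. The function name and sample headers are hypothetical.

```python
def caching_blockers(headers):
    """Return header directives that prevent reuse of a cached copy without a server round trip."""
    lowered = {name.lower(): value for name, value in headers.items()}
    found = []
    for directive in lowered.get("cache-control", "").split(","):
        d = directive.strip().lower().replace(" ", "")
        # no-store forbids storing the response at all; no-cache and max-age=0
        # allow storing but force revalidation with the server on every use.
        if d in ("no-store", "no-cache", "max-age=0"):
            found.append("Cache-Control: " + d)
    # Legacy HTTP/1.0 header that some servers still send.
    if "no-cache" in lowered.get("pragma", "").lower():
        found.append("Pragma: no-cache")
    return found


# Invented header sets for illustration, not captured from the OOD instance.
uncacheable = {"Content-Type": "text/css", "Cache-Control": "no-store, max-age=0"}
cacheable = {"Content-Type": "text/css", "Cache-Control": "public, max-age=31536000"}

print(caching_blockers(uncacheable))
print(caching_blockers(cacheable))
```

An empty list for the css and png artifacts after the change would be consistent with the browser being allowed to cache them again. The headers themselves can be copied from the browser's network panel or from a HEAD request such as curl -I against the artifact URL.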