CephFS frozen/MDS issues during large file copy
During the .oldscratch migration from GPFS to LTS CephFS, the CephFS mount froze. The copy went smoothly until it reached a user directory containing 300 million+ files. I found the following messages on the Ceph cluster:
ansible@ar-saltmgr:~> sudo ceph -s
  cluster:
    id:     f3bf2186-40db-11eb-9ecb-1c34da011540
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report oversized cache
            1 MDSs report slow metadata IOs
ansible@ar-saltmgr:~> sudo ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs report slow metadata IOs
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs-lts is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.cephfs-lts.ar-mon-02.kuxulv(mds.0): MDS cache is too large (65GB/4GB); 1 inodes in use by clients, 0 stray files
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs-lts.ar-mon-02.kuxulv(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 89 secs
The MDS_CACHE_OVERSIZED and MDS_SLOW_METADATA_IO warnings clear periodically, then reappear, and the cache size quickly grows to 100GB+.
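One mitigation worth trying is raising the MDS cache limit from its 4GB default so the MDS has more headroom while a client walks the directory. A minimal sketch, assuming a cephadm-managed cluster where the limit is still at the default; the 64GB value is an arbitrary example, not a tested recommendation:

# check the current MDS cache limit (4294967296 bytes = the 4GB default)
sudo ceph config get mds mds_cache_memory_limit

# raise it, e.g. to 64GB (value is an example, size it to the MDS host's RAM)
sudo ceph config set mds mds_cache_memory_limit 68719476736

This only buys headroom; the underlying pressure comes from the client holding caps on tens of millions of inodes in a single pass.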
These errors occur whenever rsync or find commands are run against the large directory.
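Another option is to keep any single walk smaller so the MDS can trim between passes. A sketch only: the MDS daemon name is taken from the health output above, and the source/destination paths are hypothetical placeholders for the real migration paths:

# split the copy into per-subdirectory rsync passes so one run never
# holds caps on all 300M+ files at once (paths are placeholders;
# files sitting directly in the top-level directory need one extra pass)
for d in /gpfs/.oldscratch/bigdir/*/; do
    rsync -a "${d%/}" /mnt/cephfs-lts/.oldscratch/bigdir/
done

# ask the active MDS to trim its cache between passes (Nautilus and later)
sudo ceph tell mds.cephfs-lts.ar-mon-02.kuxulv cache drop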