CephFS frozen/MDS issues during large file copy
During the .oldscratch migration from GPFS to LTS CephFS, the CephFS mount froze. The copy went smoothly until it reached a user directory containing 300 million+ files. I found the following messages on the Ceph cluster:
ansible@ar-saltmgr:~> sudo ceph -s
  cluster:
    id:     f3bf2186-40db-11eb-9ecb-1c34da011540
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report oversized cache
            1 MDSs report slow metadata IOs
ansible@ar-saltmgr:~> sudo ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs report slow metadata IOs
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs-lts is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.cephfs-lts.ar-mon-02.kuxulv(mds.0): MDS cache is too large (65GB/4GB); 1 inodes in use by clients, 0 stray files
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs-lts.ar-mon-02.kuxulv(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 89 secs
The MDS_CACHE_OVERSIZED and MDS_SLOW_METADATA_IO warnings clear periodically, then reappear, and the cache size quickly grows to 100GB+.
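One mitigation worth trying is raising the MDS cache limit from its 4GB default so the MDS has more headroom while a client walks the directory. A minimal sketch, assuming a cephadm-managed cluster where the limit is still at the default; the 64GB value is an arbitrary example, not a tested recommendation:

# check the current MDS cache limit (4294967296 bytes = the 4GB default)
sudo ceph config get mds mds_cache_memory_limit

# raise it, e.g. to 64GB (value is an example, size it to the MDS host's RAM)
sudo ceph config set mds mds_cache_memory_limit 68719476736

This only buys headroom; the underlying pressure comes from the client holding caps on tens of millions of inodes in a single pass.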
These errors occur whenever rsync or find commands are run against the large directory.
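Another option is to keep any single walk smaller so the MDS can trim between passes. A sketch only: the MDS daemon name is taken from the health output above, and the source/destination paths are hypothetical placeholders for the real migration paths:

# split the copy into per-subdirectory rsync passes so one run never
# holds caps on all 300M+ files at once (paths are placeholders;
# files sitting directly in the top-level directory need one extra pass)
for d in /gpfs/.oldscratch/bigdir/*/; do
    rsync -a "${d%/}" /mnt/cephfs-lts/.oldscratch/bigdir/
done

# ask the active MDS to trim its cache between passes (Nautilus and later)
sudo ceph tell mds.cephfs-lts.ar-mon-02.kuxulv cache drop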