{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example GPFS Workflow\n",
"\n",
"This notebook aims to give an example GPFS workflow using an early version of the `rc_gpfs` package. This will start with a raw GPFS log file moving through stages for splitting and conversion to parquet using the Python API, although both steps could be done using the built-in CLI as well. Once the parquet dataset is created, it will create the basic `tld` summarized report on file count and storage use. From there, some example plots showing breakdowns of storage used by file age will be created.\n",
"\n",
"~~This notebook assumes being run in a job with at least one GPU, although some options can be changed to use a CPU-only backend. A GPU is recommended for analyzing larger logs, such as for `/data/project`.~~\n",
"\n",
"**Update: Polars is now the backend for the entire package. No GPU is required for processing**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Suggested Compute Resources\n",
"\n",
"For analyzing a `/data/project` report, I suggest the following resources:\n",
"\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Package and Input Setup"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"# Import the three main Python API submodules for analyzing a report\n",
"from rc_gpfs import process, policy\n",
"from rc_gpfs.report import plotting\n",
"import polars as pl"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# File patch setup\n",
"log_dir = '/data/rc/gpfs-policy/data/list-policy_data-project_list-path-external_slurm-32856491_2025-04-14T00:00:15'\n",
"gpfs_log_root = Path(log_dir)\n",
"raw_log = list(gpfs_log_root.joinpath('raw').glob('*.gz'))[0]\n",
"acq = re.search(r\"\\d{4}-\\d{2}-\\d{2}\",log_dir).group()"
]
},
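{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick optional check that the raw log and acquisition date were picked up as expected:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: the raw log path and the acquisition date parsed from the directory name\n",
"print(raw_log)\n",
"print(acq)"
]
},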
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Log Preprocessing\n",
"\n",
"This section shows how to use the `policy` module to split the large GPFS log into smaller parts and then convert each part to a separate parquet file for analysis elsewhere. You can specify different filepaths for the outputs of each stage if you want, but it's generally easier to let the functions use the standard log directory structure seen below:\n",
"\n",
"``` text\n",
".\n",
"└── log_root/\n",
" ├── raw/\n",
" │ └── raw_log.gz\n",
" ├── chunks/\n",
" │ ├── list-000.gz\n",
" │ ├── list-001.gz\n",
" │ └── ...\n",
" ├── parquet/\n",
" │ ├── list-000.parquet\n",
" │ ├── list-001.parquet\n",
" │ └── ...\n",
" ├── reports/\n",
" └── misc/\n",
"```\n",
"\n",
"The directory structure will be automatically created as needed by each function. It's generally easier to not specify output paths for consistency in organization."
]
},
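{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick orientation, the optional cell below (plain `pathlib`, not part of the `rc_gpfs` API) shows which of the standard subdirectories already exist under the log root and how many files each contains:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect which stages of the standard layout already exist\n",
"for sub in ['raw', 'chunks', 'parquet', 'reports', 'misc']:\n",
"    d = gpfs_log_root.joinpath(sub)\n",
"    n_files = len(list(d.glob('*'))) if d.exists() else 0\n",
"    print(f'{sub:8s} exists={d.exists()} files={n_files}')"
]
},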
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Split"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Split raw log\n",
"policy.split(log=raw_log)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Convert\n",
"\n",
"Opposed to split, it's much faster to submit the parquet conversion as an array job (built into the `cli` submodule), but it's possible to do here via the multiprocessing library as well."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from multiprocessing import Pool\n",
"from rc_gpfs.utils import parse_scontrol\n",
"\n",
"cores,_ = parse_scontrol()\n",
"split_files = list(gpfs_log_root.joinpath('chunks').glob('*.gz'))"
]
},
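{
"cell_type": "markdown",
"metadata": {},
"source": [
"`parse_scontrol` reads the core count from the current Slurm allocation. If this notebook is run outside of a Slurm job, it may not find one, so the optional cell below (an assumption, not part of the package) falls back to `os.cpu_count()` and avoids starting more workers than there are chunk files:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Optional fallback (assumption): use all visible CPUs if no Slurm core count was found\n",
"cores = int(cores) if cores else os.cpu_count()\n",
"\n",
"# Don't start more workers than there are chunk files\n",
"cores = min(cores, len(split_files))"
]
},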
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with Pool(cores) as pool:\n",
" pool.map(policy.convert,split_files)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Aggregate"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"pq_path = gpfs_log_root.joinpath('parquet')"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"df = pl.scan_parquet(pq_path.joinpath('*.parquet'))"
]
},
{
"cell_type": "code",
"df_agg = process.aggregate_gpfs_dataset(\n",
" df,\n",
" acq = acq,\n",
" time_breakpoints=[30,60,90,180,365],\n",
" time_unit='D',\n",
" time_val='access'\n",
")"
]
},
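{
"cell_type": "markdown",
"metadata": {},
"source": [
"`aggregate_gpfs_dataset` bins each file into an age group using the access-time breakpoints above (30, 60, 90, 180, and 365 days before the acquisition date) and tallies storage and file counts per group. The optional cell below is a rough sketch of that binning done directly in Polars; the raw column names (`access`, `size`) are assumptions about the parquet schema, not the package's actual implementation:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# Illustrative sketch only: bin files by access-time age and tally bytes/file counts.\n",
"# Column names 'access' and 'size' are assumed; adjust to the actual parquet schema.\n",
"acq_date = datetime.strptime(acq, '%Y-%m-%d')\n",
"\n",
"age_days = (pl.lit(acq_date) - pl.col('access')).dt.total_days()\n",
"\n",
"df_preview = (\n",
"    df\n",
"    .with_columns(age_days.cut(breaks=[30, 60, 90, 180, 365]).alias('age_grp'))\n",
"    .group_by('age_grp')\n",
"    .agg(\n",
"        pl.col('size').sum().alias('bytes'),\n",
"        pl.len().alias('file_count'),\n",
"    )\n",
"    .collect()\n",
")"
]
},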
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"df_agg_grp = (\n",
" df_agg\n",
" .group_by('dt_grp')\n",
" .agg(pl.col('bytes','file_count').sum())\n",
" .sort('dt_grp',descending=True)\n",
")"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"exp,unit = plotting.choose_appropriate_storage_unit(df_agg['bytes'])\n",
"df_agg_grp = (\n",
" df_agg_grp\n",
" .with_columns((pl.col('bytes')/(1024**exp)).alias(unit))\n",
" .with_columns([\n",
" pl.cum_sum('file_count').alias('file_count_cum'),\n",
" pl.col(unit).cum_sum().alias(f'{unit}_cum'),\n",
" ])\n",
")"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"f = plotting.pareto_chart(df_agg_grp,x='dt_grp',y=unit)"
]
},
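{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same grouped data can also be charted by file count rather than storage; the optional cell below reuses `pareto_chart` with the `file_count` column:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# Optional: the same breakdown by file count instead of storage used\n",
"f_counts = plotting.pareto_chart(df_agg_grp, x='dt_grp', y='file_count')"
]
},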
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}