Newer
Older
{
"cells": [
{
"cell_type": "markdown",
"id": "5fb66d11",
"metadata": {},
"source": [
"# run report on pickled list policy data\n",
"\n",
"The script reads pickled files that match the `glob_pattern` from the `pickledir` derived from `dirname` and runs the report saving it as a csv to the subdir \"`dirname`/reports\" dir by default.\n",
"\n",
"Some progress info is available via the `verbose` flag.\n",
"\n",
"The current report aggrates storage stats by top-level-dir and age (year) of data's last access. The goal of this report is to understand the distribution of lesser used data."
]
},
{
"cell_type": "markdown",
"id": "51c07f66",
"metadata": {},
"source": [
"!conda info --envs"
]
},
{
"cell_type": "markdown",
"id": "15997b7d",
"metadata": {},
"source": [
"!conda list"
]
},
{
"cell_type": "markdown",
"id": "c740ad5f",
"metadata": {},
"source": [
"!pip list -freeze"
]
},
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
{
"cell_type": "code",
"execution_count": null,
"id": "5059337b",
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from urllib.parse import unquote\n",
"import sys\n",
"import os\n",
"import pathlib\n",
"import re"
]
},
{
"cell_type": "markdown",
"id": "5f4c10d1",
"metadata": {},
"source": [
"## input vars"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92ddc402",
"metadata": {},
"outputs": [],
"source": [
"dirname=\"\" # directory to fine files to pickle\n",
"glob_pattern = \"*.gz\" # file name glob pattern to match, can be file name for individual file\n",
"line_regex_filter = \".*\" # regex to match lines of interest in file\n",
"pickledir=f\"{dirname}/pickles\"\n",
"reportdir=f\"{dirname}/reports\"\n",
"tldpath=\"/\"\n",
"\n",
"verbose = False"
]
},
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
{
"cell_type": "code",
"execution_count": null,
"id": "f6e28615",
"metadata": {},
"outputs": [],
"source": [
"dirname=\"data/list-17404604.list.gather-info.d\" # directory to fine files to pickle\n",
"glob_pattern = \"*.gz\" # file name glob pattern to match, can be file name for individual file\n",
"line_regex_filter = \".*\" # regex to match lines of interest in file\n",
"pickledir=f\"{dirname}/pickles\"\n",
"reportdir=f\"{dirname}/reports\"\n",
"tldpath=\"/data/projects\"\n",
"\n",
"verbose = True\n",
"limit = 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a07a7f",
"metadata": {},
"outputs": [],
"source": [
"dirname=\"data/list-20859348.list.gather-info.d\" # directory to fine files to pickle\n",
"glob_pattern = \"*.gz\" # file name glob pattern to match, can be file name for individual file\n",
"line_regex_filter = \".*\" # regex to match lines of interest in file\n",
"pickledir=f\"{dirname}/pickles\"\n",
"reportdir=f\"{dirname}/reports\"\n",
"tldpath=\"/data/project/datascienceteam\"\n",
"\n",
"verbose = True\n",
"limit = 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9533a4c",
"metadata": {},
"outputs": [],
"source": [
"dirname=\"data/list-20191520.list.gather-info.d\" # directory to fine files to pickle\n",
"glob_pattern = \"*.gz\" # file name glob pattern to match, can be file name for individual file\n",
"line_regex_filter = \".*\" # regex to match lines of interest in file\n",
"pickledir=f\"{dirname}/pickles\"\n",
"reportdir=f\"{dirname}/reports\"\n",
"tldpath=\"/data/project/thymelab\"\n",
"\n",
"verbose = True\n",
"limit = 0"
]
},
{
"cell_type": "markdown",
"id": "a28d0f15",
"metadata": {},
"source": [
"## Utilities"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed367712",
"metadata": {},
"outputs": [],
"source": [
"# get top level dir on which to aggregate\n",
"\n",
"def get_tld(df, dirname):\n",
" dirpaths = dirname.split(\"/\")\n",
" new=df[\"path\"].str.split(\"/\", n=len(dirpaths)+1, expand=True)\n",
" df[\"tld\"] = new[len(dirpaths)]\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bc11b96",
"metadata": {},
"outputs": [],
"source": [
"def report_tld_year(df):\n",
" report = df.groupby(['tld', df.access.dt.year]).agg({\"size\": [\"sum\", \"count\"]})\n",
" return report"
]
},
{
"cell_type": "markdown",
"id": "dd92dd03",
"metadata": {},
"source": [
"## Read and parse the files according to glob_pattern"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20315d88",
"metadata": {},
"outputs": [],
"source": [
"dirpath = pathlib.Path(pickledir)\n",
"\n",
"files = list()\n",
"for file in list(dirpath.glob(glob_pattern)):\n",
" files.append(str(file))"
]
},
{
"cell_type": "markdown",
"id": "4352f00c",
"metadata": {},
"source": [
"## Aggregate stats into running totals"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#report = pd.DataFrame()\n",
"\n",
"reports=[]\n",
"\n",
"for count, file in enumerate(files):\n",
" if (verbose): print(f\"parse: {file}\")\n",
" filename=os.path.basename(file)\n",
" df = get_tld(pd.read_pickle(file), tldpath)\n",
" df = report_tld_year(df)\n",
" if (limit and count == limit):\n",
" break\n",
" # roll up into running total https://stackoverflow.com/a/55828762/8928529\n",
" reports.append(df) \n",
" del(df)\n",
" \n",
"report=pd.concat(reports)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"report=report.groupby(['tld', 'access']).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create final report\n",
"\n",
"Create summary format for gigabyte and terabyte columns https://stackoverflow.com/a/20937592/8928529"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"report.columns = [col[1] for col in report.columns.values]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"report[\"average_size\"] = report[\"sum\"]/report[\"count\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"report[\"terabytes\"] = report[\"sum\"]/(10**12)\n",
"report[\"terabytes\"] = report[\"terabytes\"].map('{:,.2f}'.format)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"report[\"gigabytes\"] = report[\"sum\"]/(10**9)\n",
"report[\"gigabytes\"] = report[\"gigabytes\"].map('{:,.2f}'.format)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4de0256",
"metadata": {},
"outputs": [],
"source": [
"if (verbose): print(report)"
]
},
{
"cell_type": "markdown",
"id": "b5472320",
"metadata": {},
"source": [
"## Save report as CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffc99a54",
"metadata": {},
"outputs": [],
"source": [
"# only create dir if there is data to pickle\n",
"if (len(report) and not os.path.isdir(reportdir)):\n",
" os.mkdir(reportdir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d0ec8cf",
"metadata": {},
"outputs": [],
"source": [
"reportdir"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12d02352",
"metadata": {},
"outputs": [],
"source": [
"if (verbose): print(f\"report: groupby-tld\")\n",
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
"report.to_csv(f\"{reportdir}/groupby-tld.csv.gz\")\n",
"report.to_pickle(f\"{reportdir}/groupby-tld-year.pkl.gz\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae4a6d7a",
"metadata": {},
"outputs": [],
"source": [
"report[report[\"sum\"] == report[\"sum\"].max()]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d2f464a",
"metadata": {},
"outputs": [],
"source": [
"report[(report[\"sum\"] > 5*10**13)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bea3b7a5",
"metadata": {},
"outputs": [],
"source": [
"report=report.reset_index()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35ebda85",
"metadata": {},
"outputs": [],
"source": [
"summer = report.groupby(\"tld\").agg(\"sum\", \"sum\") #[report[\"sum\"] > 10**13"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2122a150",
"metadata": {},
"outputs": [],
"source": [
"summer[\"terabytes\"] = summer[\"sum\"]/(10**12)\n",
"summer[\"terabytes\"] = summer[\"terabytes\"].map('{:,.2f}'.format)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0929c902",
"metadata": {},
"outputs": [],
"source": [
"print(summer[summer[\"sum\"] > 10**13].sort_values(\"sum\", ascending=False)[['count', 'terabytes']])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06e80054",
"metadata": {},
"outputs": [],
"source": [
"report[(report[\"sum\"] > 10**13) & (report[\"access\"] <= 2021)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "504235d8",
"metadata": {},
"outputs": [],
"source": [
"report[(report[\"sum\"] > 10**13) & (report[\"access\"] <= 2021)][\"sum\"].sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30373be5",
"metadata": {},
"outputs": [],
"source": [
"report[(report[\"sum\"] <= 10**13) & (report[\"access\"] <= 2021)][\"sum\"].sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e98ab4a",
"metadata": {},
"outputs": [],
"source": [
"report[(report[\"sum\"] > 10**13) & (report[\"access\"] < 2023)][\"sum\"].sum()/10**12"