Commit 2ff61dbf authored by John-Paul Robinson

Merge branch '4-create-report-to-aggregate-stored-data-by-year-of-last-access' into 'main'

Resolve "Create report to aggregate stored data by year of last access"

Closes #4

See merge request !6
parents 3dec66b4 317934e1
%% Cell type:markdown id:5fb66d11 tags:
# run report on pickled list policy data
The script reads pickled files matching `glob_pattern` from `pickledir` (derived from `dirname`), runs the report, and saves it as a CSV, by default to the peer "`dirname`-reports" dir.
Some progress info is available via the `verbose` flag.
The current report aggregates storage stats by top-level dir and by the year of the data's last access. The goal of this report is to understand the distribution of less frequently used data.
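
To make the shape of the report concrete, here is a minimal sketch of the same aggregation on a toy data frame. The rows are made up for illustration; only the column names (`tld`, `access`, `size`) follow this notebook, and `toy` and `toy_report` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy records; column names mirror those used in this notebook.
toy = pd.DataFrame({
    "tld": ["alice", "alice", "bob", "bob"],
    "access": pd.to_datetime(
        ["2019-03-01", "2019-07-15", "2021-01-10", "2022-05-05"]
    ),
    "size": [1_000_000_000, 3_000_000_000, 500_000_000, 2_000_000_000],
})

# Group by top-level dir and by the year of last access.
toy_report = toy.groupby(["tld", toy["access"].dt.year]).agg(
    {"size": ["sum", "count"]}
)

# Flatten the two-level column index down to the aggregate names.
toy_report.columns = [col[1] for col in toy_report.columns.values]

# Decimal gigabytes (10^9 bytes), as in the report below.
toy_report["gigabytes"] = toy_report["sum"] / 1000 / 1000 / 1000

print(toy_report)
```

Each row of the result is one (top-level dir, access year) pair, so old, rarely touched data shows up as rows with early years and large `gigabytes` values.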
%% Cell type:code id:5059337b tags:
```
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from urllib.parse import unquote
import sys
import os
import pathlib
import re
```
%% Cell type:markdown id:5f4c10d1 tags:
## input vars
%% Cell type:code id:92ddc402 tags:
```
dirname="" # directory to find files to pickle
glob_pattern = "*.gz" # file name glob pattern to match, can be file name for individual file
line_regex_filter = ".*" # regex to match lines of interest in file
pickledir=f"{dirname}/pickles"
reportdir=f"{dirname}-reports"
tldpath="/"
verbose = False
```
%% Cell type:code id:ed367712 tags:
```
# get top level dir on which to aggregate
def get_tld(df, dirname):
    # number of path components in the parent dir decides which column holds the tld
    dirpaths = dirname.split("/")
    new=df["path"].str.split("/", n=len(dirpaths)+1, expand=True)
    df["tld"] = new[len(dirpaths)]
    return df
```
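
As a quick illustration of how the split picks the aggregation directory, the sketch below applies the same function to two made-up paths. The paths, the `/data` parent, and the name `toy_paths` are assumptions for this example only.

```python
import pandas as pd

# Same logic as get_tld above, repeated so the example is self-contained.
def get_tld(df, dirname):
    dirpaths = dirname.split("/")
    new = df["path"].str.split("/", n=len(dirpaths) + 1, expand=True)
    df["tld"] = new[len(dirpaths)]
    return df

# Hypothetical paths under an assumed parent directory "/data".
toy_paths = pd.DataFrame(
    {"path": ["/data/alice/project/file.txt", "/data/bob/notes.md"]}
)
toy_paths = get_tld(toy_paths, "/data")

# The component directly below "/data" becomes the tld.
print(toy_paths["tld"].tolist())
```

Note that `"/data".split("/")` yields two components (an empty string and `data`), so the function selects the third component of each path.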
%% Cell type:markdown id:dd92dd03 tags:
## Read and parse the files according to glob_pattern
%% Cell type:code id:20315d88 tags:
```
dirpath = pathlib.Path(pickledir)
files = list()
for file in list(dirpath.glob(glob_pattern)):
    files.append(str(file))
```
%% Cell type:code id:cbad833f tags:
```
parsedfiles = list()
for file in files:
    if (verbose): print(f"parse: {file}")
    filename=os.path.basename(file)
    parsedfiles.append(pd.read_pickle(file))
```
%% Cell type:code id:4ed9ca1b tags:
```
df=pd.concat(parsedfiles)
del(parsedfiles)
```
%% Cell type:code id:b69c9fde tags:
```
df = get_tld(df, tldpath)
```
%% Cell type:markdown id:4352f00c tags:
## Run report
%% Cell type:code id:e3fe4e71 tags:
```
report = df.groupby(['tld', df.access.dt.year]).agg({"size": ["sum", "mean", "median", "min", "max", "std", "count"]})
```
%% Cell type:code id:329bc196 tags:
```
del(df)
```
%% Cell type:code id:754fcc89 tags:
```
report.columns.values
```
%% Cell type:code id:f279c061 tags:
```
report.columns = [col[1] for col in report.columns.values]
```
%% Cell type:code id:8ef9b007 tags:
```
report["gigabytes"] = report["sum"]/1000/1000/1000
```
%% Cell type:code id:d4de0256 tags:
```
if (verbose): print(report)
```
%% Cell type:code id:ffc99a54 tags:
```
# only create dir if there is data to report
if (len(report) and not os.path.isdir(reportdir)):
    os.mkdir(reportdir)
```
%% Cell type:code id:12d02352 tags:
```
if (verbose): print(f"report: groupby-tld")
report.to_csv(f"{reportdir}/groupby-tld.csv.gz")
```