Commit 6e5ebe6c authored by John-Paul Robinson

Add uid to username resolution for dataframe

Add a column and update the logic to fill in username values from
the system password database.
parent c247a661
%% Cell type:markdown id: tags:
# Notebook to explore parsing of the gpfs policy outputs
This is a collection of cells to understand data.
No particular endpoint in mind.
%% Cell type:code id: tags:
```
import pandas as pd
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
This is the format of each line in the policy output:
5001:000fffffffffffff:0000000000004741:4b8f012b:0:2c172b:10002:0:40!basedir/path/to/file:13!scratch_tier1;253!|size=444|kballoc=0|access=2022-01-01 06:58:37.177440|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1|path=/rootdir/basedir/path/to/file|misc=FAu|
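As a quick check of that layout, the sample line can be taken apart in plain Python: split on `|`, set aside the leading policy-record prefix, and split each remaining `name=value` field on the first `=`. This is only a sketch on the example line above; the variable names are illustrative and not part of the notebook.

```python
# Sample record copied from the format description above.
line = ("5001:000fffffffffffff:0000000000004741:4b8f012b:0:2c172b:10002:0:"
        "40!basedir/path/to/file:13!scratch_tier1;253!"
        "|size=444|kballoc=0|access=2022-01-01 06:58:37.177440"
        "|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273"
        "|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1"
        "|path=/rootdir/basedir/path/to/file|misc=FAu|")

parts = line.split("|")
# parts[0] is the policy prefix; the trailing "|" leaves an empty last element.
record = dict(p.split("=", 1) for p in parts[1:] if p)

print(record["size"], record["uid"], record["path"])
```

The trailing `|` on every line means a naive split yields one more field than there are names, which is worth keeping in mind when reading the file with a CSV parser.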
%% Cell type:code id: tags:
```
file="data/mmapplypolicy.61746.962D9400.list.no_extern_list_list-30day-with-excludes_slurm-12551165_2022-03-03-04:00:09"
```
%% Cell type:code id: tags:
```
file="data/mmapplypolicy.54197.413B7AB5.list.no_extern_list_list-only-temporary-scratch_slurm-12790116_2022-03-14-18:47:51"
```
%% Cell type:code id: tags:
```
file="data/mmapplypolicy.120904.9DBFF7E6.list.no_extern_list_list-30day-with-excludes_slurm-13113652_2022-04-05-04:00:28"
```
%% Cell type:markdown id: tags:
## Parser functions
First we define the structure of the file, then the columns we want to use.
%% Cell type:code id: tags:
```
fields=['ignore', 'size', 'kballoc', 'atime', 'ctime', 'mtime', 'uid', 'gid', 'heat', 'pool', 'path', 'misc']
usecols=['size', 'kballoc', 'atime', 'ctime', 'mtime', 'uid', 'gid', 'heat', 'pool', 'path', 'misc']
```
%% Cell type:code id: tags:
```
def splitter(x):
    '''
    split each name=value field on = and return the value
    '''
    return x.split("=", 1)[1]
```
%% Cell type:markdown id: tags:
Set up a splitters dictionary to process all the used fields with the splitter function.
https://realpython.com/python-defaultdict/
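The same mapping can also be written as a dictionary comprehension or with `dict.fromkeys`; `setdefault` is not strictly needed because each column name appears only once. A minimal sketch, using a shortened `usecols` list for illustration:

```python
def splitter(x):
    """Split a name=value field on the first '=' and return the value."""
    return x.split("=", 1)[1]

# Shortened column list, for illustration only.
usecols = ['size', 'kballoc', 'uid']

# Loop form, as used in this notebook.
splitters = {}
for name in usecols:
    splitters.setdefault(name, splitter)

# Equivalent one-liners.
by_comprehension = {name: splitter for name in usecols}
by_fromkeys = dict.fromkeys(usecols, splitter)

print(splitters == by_comprehension == by_fromkeys)
print(splitters["size"]("size=444"))
```

Because the split is limited to the first `=`, values that themselves contain `=` (for example in a path) are returned intact.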
%% Cell type:code id: tags:
```
splitters = {}
for name in usecols:
    splitters.setdefault(name, splitter)
```
%% Cell type:code id: tags:
```
%%time
df = pd.read_csv(file,
                 lineterminator='\n',
                 sep="|",
                 # the list file has no header row; header=0 would discard the first record
                 header=None,
                 #on_bad_lines="warn",
                 index_col=False,
                 #nrows=1000000,
                 names=fields,
                 usecols=usecols,
                 converters=splitters,
                 parse_dates=['atime', 'ctime', 'mtime'],
                 )
```
%% Cell type:code id: tags:
```
df.info()
```
%% Cell type:markdown id: tags:
Clean up data types for numeric values
%% Cell type:code id: tags:
```
for intcol in ['size', 'kballoc', 'uid', 'gid']:
    df[intcol] = df[intcol].astype("int")
```
%% Cell type:code id: tags:
```
df.head(3)
```
%% Cell type:markdown id: tags:
Quick summary of the total storage allocated to and used by 30+ day files
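The cells in this notebook divide by powers of 1024 in some places and by powers of 1000 in others, so the results come out in different units (binary MiB/GiB versus decimal GB/TB). The helper below is a hypothetical convenience, not part of the notebook, that makes the unit explicit. It assumes `size` is in bytes and `kballoc` is in units of 1024 bytes.

```python
BINARY = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
DECIMAL = {"kB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}

def convert(nbytes, unit):
    """Convert a byte count to the named binary or decimal unit."""
    factor = {**BINARY, **DECIMAL}[unit]
    return nbytes / factor

total_bytes = 5 * 1024**4          # example value: exactly 5 TiB
print(convert(total_bytes, "TiB"))
print(round(convert(total_bytes, "TB"), 3))

kballoc_total = 2048               # example value, assumed to be in KiB
print(convert(kballoc_total * 1024, "MiB"))
```

The roughly 10% gap between TiB and TB is large enough to matter when comparing these totals against quota or capacity figures.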
%% Cell type:code id: tags:
```
df["kballoc"].sum()/1024
```
%% Cell type:code id: tags:
```
df["size"].sum()/1024/1024
```
%% Cell type:code id: tags:
```
df["atime"].min()
```
%% Cell type:code id: tags:
```
df[["atime","uid"]].sort_values(by="atime")
```
%% Cell type:code id: tags:
```
df[["uid","size"]].groupby("uid").sum()/1000/1000/1000/1000
```
%% Cell type:code id: tags:
```
(df[["uid","size"]].groupby("uid").sum()/1000/1000/1000/1000).sum()
```
%% Cell type:code id: tags:
```
df["atime"].sort_values().head()
```
%% Cell type:code id: tags:
```
df["uid"].head()
```
%% Cell type:code id: tags:
```
df["misc"].unique()
```
%% Cell type:code id: tags:
```
df["isfile"]=df["misc"].str.contains('F')
```
%% Cell type:code id: tags:
```
len(df["uid"].unique())
```
%% Cell type:code id: tags:
```
df["uid"].unique()
```
%% Cell type:markdown id: tags:
Get usernames from uid values using the `pwd` password database module: https://stackoverflow.com/a/421670/8928529
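`pwd.getpwuid` raises `KeyError` when a uid has no entry in the password database, which can happen for accounts that have since been removed. Below is a hedged sketch of a lookup that falls back to the numeric uid. The function name `uid_to_name` and its `lookup` parameter are illustrative, and `pwd` is available on Unix only.

```python
import pwd

def uid_to_name(uid, lookup=pwd.getpwuid):
    """Return the login name for uid, or the uid as a string if unknown."""
    try:
        return lookup(int(uid)).pw_name
    except KeyError:
        return str(uid)

# Resolve each distinct uid once and keep the results in a dict.
uids = [0, 10973]
names = {uid: uid_to_name(uid) for uid in uids}
print(names)
```

The `pw_name` attribute is already the bare login name, so no further splitting on `:` is needed.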
%% Cell type:code id: tags:
```
import pwd
```
%% Cell type:code id: tags:
```
pwd.getpwuid(12137)[0].split(":")
```
%% Cell type:code id: tags:
```
def getuser(uid):
    return pwd.getpwuid(int(uid))[0].split(":")[0]
```
%% Cell type:code id: tags:
```
getuser(10973)
```
%% Cell type:code id: tags:
```
# add new column for resolved uids
df["uname"]=""
```
%% Cell type:code id: tags:
```
# set uname for uid
for uid in sorted(df["uid"].unique()):
    uname = pwd.getpwuid(int(uid))[0].split(":")[0]
    print("uid: {} name: {}".format(uid, uname))
    df.loc[df["uid"]==uid, ["uname"]] = uname
```
%% Cell type:code id: tags:
```
df[df["uid"]==10005]
```
%% Cell type:code id: tags:
```
sorted(df["heat"].unique())
```
%% Cell type:code id: tags:
```
df["path"] = df["path"].astype("str")
```
%% Cell type:code id: tags:
```
df = pd.concat([df, df["path"].str.split("/", n=4, expand=True)[[1,3,4]].rename(columns={1: "fs", 3:"scratchdir", 4:"filename"})], axis=1)
```
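%% Cell type:markdown id: tags:
To see which path components end up in which column, the same split can be tried on the example path from the format description. A small sketch in plain Python, using the limit of 4 splits from the cell above:

```python
path = "/rootdir/basedir/path/to/file"

# At most 4 splits: the leading "/" yields an empty first element and
# everything after the fourth "/" stays together in the last element.
parts = path.split("/", 4)
print(parts)

# Positions 1, 3 and 4 are the ones kept as fs, scratchdir and filename.
fs, scratchdir, filename = parts[1], parts[3], parts[4]
print(fs, scratchdir, filename)
```

Paths with fewer than four directory levels produce fewer elements, so the corresponding columns would be missing values for those rows.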
%% Cell type:code id: tags:
```
df = df.rename(columns={"sratchdir": "scratchdir"})
```
%% Cell type:code id: tags:
```
df.columns
```
%% Cell type:code id: tags:
```
userdata = df[["scratchdir", "size", "kballoc", "isfile"]].groupby(["scratchdir"]).sum()
```
%% Cell type:code id: tags:
```
userdata
```
%% Cell type:code id: tags:
```
userdata["size"]/1000/1000/1000
```
%% Cell type:code id: tags:
```
df["path"].str.split("/", n=4, expand=True)[[3,4]]
```
%% Cell type:code id: tags:
```
df["path"].str.split("/", n=4, expand=True)
```
%% Cell type:code id: tags:
```
bytesdays=df[["atime","size"]]
```
%% Cell type:code id: tags:
```
bd=bytesdays.set_index("atime")
```
%% Cell type:code id: tags:
```
bd=bd.resample('D').sum()
```
%% Cell type:code id: tags:
```
bd["sum"]=bd["size"].cumsum()
```
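%% Cell type:markdown id: tags:
The two steps above (sum sizes per calendar day of last access, then accumulate) can be illustrated on a handful of made-up records in plain Python. This is a sketch of the idea rather than of pandas internals; like `resample('D')`, it emits a zero for days on which no file was accessed.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate

# Made-up (access date, size in bytes) pairs, for illustration only.
records = [
    (date(2022, 1, 1), 100),
    (date(2022, 1, 1), 50),
    (date(2022, 1, 3), 25),
]

# Sum sizes per day.
per_day = Counter()
for day, size in records:
    per_day[day] += size

# Fill every day between the first and last access date, zero if absent.
first, last = min(per_day), max(per_day)
days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
daily = [per_day.get(d, 0) for d in days]

# Running total, analogous to cumsum().
running = list(accumulate(daily))
for d, s, r in zip(days, daily, running):
    print(d, s, r)
```

The running total on a given day is the number of bytes in files last accessed on or before that day.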
%% Cell type:code id: tags:
```
bd[:"2022-02-15"]
```
%% Cell type:code id: tags:
```
```
%% Cell type:code id: tags:
```
size, gb = bd[bd["size"]>0].loc[:"2022-01-01"].sum()
```
%% Cell type:code id: tags:
```
gb
```
%% Cell type:code id: tags:
```
bd.loc[:"2021-12-31"].sum()
```
%% Cell type:code id: tags:
```
bd.loc[:"2022-01-01"].sum()
```
%% Cell type:code id: tags:
```
bd.loc["2022-01-01":]
```
%% Cell type:code id: tags:
```
bd[bd["size"]>0]/1024/1024/1024 #.plot()
```
%% Cell type:code id: tags:
```
bd["gb"] = bd["sum"]/1024/1024/1024
```
%% Cell type:code id: tags:
```
bd["gb"]
```
%% Cell type:code id: tags:
```
b2d=bd["2021-10-01":]
```
%% Cell type:code id: tags:
```
1024*1024*1024*1024
```
%% Cell type:code id: tags:
```
bd7=b2d[["gb"]].rolling(7, center=True).sum()
```
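%% Cell type:markdown id: tags:
A centered 7-day rolling sum adds each day to the three days before and after it, and is undefined where the window runs past either end of the series. The plain-Python sketch below illustrates that on made-up values. The function name is hypothetical and it assumes an odd window size.

```python
def centered_rolling_sum(values, window=7):
    """Sum over a centered window; None where the window is incomplete."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = i - half, i + half + 1
        if lo < 0 or hi > len(values):
            out.append(None)
        else:
            out.append(sum(values[lo:hi]))
    return out

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up daily values
sums = centered_rolling_sum(values)
print(sums)

# Dividing by the window size turns the rolling sum into a rolling mean.
means = [None if s is None else s / 7 for s in sums]
print(means)
```

A rolling sum is 7 times the rolling mean over the same window, so the two should not be plotted against the same axis without saying which one is shown.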
%% Cell type:code id: tags:
```
# Plot daily cumulative size and its 7-day rolling sum
fig, ax = plt.subplots()
#ax.plot(kW, marker='.', markersize=2, color='gray', linestyle='None', label='Hourly Average')
ax.plot(b2d["gb"], color='brown', linewidth=2, label='Daily Cumulative Size')
ax.plot(bd7["gb"], color='black', linewidth=1, label='Trend (7 day Rolling Sum)')
ax.legend()
ax.set_ylabel('Size (GBytes)')
ax.set_title('Cheaha Trends in Scratch Usage');
```
%% Cell type:code id: tags:
```
```