Add uid to username resolution for dataframe

Add column and update logic to add username values from the system password database.

6e5ebe6c · John-Paul Robinson · c247a661 · 6e5ebe6c
Commit 6e5ebe6c authored 2 years ago by John-Paul Robinson
--- a/scratch-log-explorations.ipynb
+++ b/scratch-log-explorations.ipynb
@@ -329,8 +329,30 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "# add new column for resolved uids\n",
+    "df[\"uname\"]=\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set uname for uid\n",
    "for uid in sorted(df[\"uid\"].unique()):\n",
-    "    print(\"uid: {} name: {}\".format(uid, pwd.getpwuid(int(uid))[0].split(\":\")[0]))"
+    "    uname = pwd.getpwuid(int(uid))[0].split(\":\")[0]\n",
+    "    print(\"uid: {} name: {}\".format(uid, uname))\n",
+    "    df.loc[df[\"uid\"]==uid, [\"uname\"]] = uname"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df[df[\"uid\"]==10005]"
   ]
  },
  {

 %% Cell type:markdown id: tags:

 # Notebook to explore parsing of the gpfs policy outputs

 This is a collection of cells to understand data.
 No particular endpoint in mind.

 %% Cell type:code id: tags:

 ``` 
 import pandas as pd
 import matplotlib.pyplot as plt
 ```

 %% Cell type:markdown id: tags:

 This is the format of each line in the policy output:

    5001:000fffffffffffff:0000000000004741:4b8f012b:0:2c172b:10002:0:40!basedir/path/to/file:13!scratch_tier1;253!|size=444|kballoc=0|access=2022-01-01 06:58:37.177440|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1|path=/rootdir/basedir/path/to/file|misc=FAu|
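
To make the record layout above concrete, here is a minimal sketch that splits the sample line into its name=value fields with plain Python. The helper name `parse_record` is hypothetical and is not part of the notebook, and the sketch assumes that no field value (including the path) contains a `|` character.

```python
# Sample record copied from the format description above.
record = (
    "5001:000fffffffffffff:0000000000004741:4b8f012b:0:2c172b:10002:0:"
    "40!basedir/path/to/file:13!scratch_tier1;253!"
    "|size=444|kballoc=0|access=2022-01-01 06:58:37.177440"
    "|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273"
    "|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1"
    "|path=/rootdir/basedir/path/to/file|misc=FAu|"
)


def parse_record(line):
    """Hypothetical helper: return the name=value fields of one record as a dict."""
    parts = line.rstrip("\n").split("|")
    # parts[0] is the colon-separated prefix, which carries no name=value pair.
    # The trailing "|" leaves an empty final item, which the filter skips too.
    return dict(part.split("=", 1) for part in parts[1:] if "=" in part)


parsed = parse_record(record)
print(parsed["uid"], parsed["size"], parsed["path"])
```

All values come back as strings here; type conversion is left to a later step, as in the notebook.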

 %% Cell type:code id: tags:

 ``` 
 file="data/mmapplypolicy.61746.962D9400.list.no_extern_list_list-30day-with-excludes_slurm-12551165_2022-03-03-04:00:09"
 ```

 %% Cell type:code id: tags:

 ``` 
 file="data/mmapplypolicy.54197.413B7AB5.list.no_extern_list_list-only-temporary-scratch_slurm-12790116_2022-03-14-18:47:51"
 ```

 %% Cell type:code id: tags:

 ``` 
 file="data/mmapplypolicy.120904.9DBFF7E6.list.no_extern_list_list-30day-with-excludes_slurm-13113652_2022-04-05-04:00:28"
 ```

 %% Cell type:markdown id: tags:

 ## Parser functions

 First we define the structure of the file, then the columns we want to use.

 %% Cell type:code id: tags:

 ``` 
 fields=['ignore', 'size', 'kballoc', 'atime', 'ctime', 'mtime', 'uid', 'gid', 'heat', 'pool', 'path', 'misc']

 usecols=['size', 'kballoc', 'atime', 'ctime', 'mtime', 'uid', 'gid', 'heat', 'pool', 'path', 'misc']
 ```

 %% Cell type:code id: tags:

 ``` 
 def splitter(x):
    '''
    split each name=value field on = and return the value
    '''
    return x.split("=", 1)[1]
 ```

 %% Cell type:markdown id: tags:

 Set up a splitters dictionary to process all the used fields with the splitter function.
 https://realpython.com/python-defaultdict/

 %% Cell type:code id: tags:

 ``` 
 splitters = {}

 for name in usecols:
    splitters.setdefault(name, splitter)
 ```
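
The linked article covers `defaultdict`, while the cell above uses `dict.setdefault`. Because each column name appears only once, a dict comprehension should produce the same mapping. The following is a minimal sketch with a shortened, illustrative column list rather than the notebook's full `usecols`.

```python
def splitter(x):
    """Split a name=value field on the first '=' and return the value."""
    return x.split("=", 1)[1]


# Shortened column list, for illustration only.
usecols = ["size", "kballoc", "uid"]

# Loop form, as in the notebook.
loop_splitters = {}
for name in usecols:
    loop_splitters.setdefault(name, splitter)

# Comprehension form: one converter per column name.
splitters = {name: splitter for name in usecols}

print(splitters == loop_splitters)
print(splitters["uid"]("uid=10973"))
```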

 %% Cell type:code id: tags:

 ``` 
 %%time
 df = pd.read_csv(file,
                 lineterminator='\n',
                 sep="|", header=0,
                 #on_bad_lines="warn",
                 index_col=False,
                 #nrows=1000000,
                 names=fields,
                 usecols=usecols,
                 converters=splitters,
                 parse_dates=['atime', 'ctime', 'mtime'],
                )
 ```

 %% Cell type:code id: tags:

 ``` 
 df.info()
 ```

 %% Cell type:markdown id: tags:

 Clean up data types for numeric values

 %% Cell type:code id: tags:

 ``` 
 for intcol in ['size', 'kballoc', 'uid', 'gid']:
    df[intcol] = df[intcol].astype("int")
 ```

 %% Cell type:code id: tags:

 ``` 
 df.head(3)
 ```

 %% Cell type:markdown id: tags:

 Quick summary of total storage allocated and used by 30+ day files

 %% Cell type:code id: tags:

 ``` 
 df["kballoc"].sum()/1024
 ```

 %% Cell type:code id: tags:

 ``` 
 df["size"].sum()/1024/1024
 ```
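
The cells in this notebook divide by 1024 in some places and by 1000 in others. A small helper that names the unit can make the intent explicit. This is only a sketch: `to_unit` and `UNIT_FACTORS` are hypothetical names that do not appear in the notebook, and the input is assumed to be a byte count.

```python
# Hypothetical helper, not part of the notebook: express a byte count in a named unit.
UNIT_FACTORS = {
    "kB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,      # decimal (SI)
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,  # binary (IEC)
}


def to_unit(nbytes, unit):
    """Return nbytes expressed in the given unit (assumes nbytes is in bytes)."""
    return nbytes / UNIT_FACTORS[unit]


# The same byte count reads differently in decimal and binary units.
print(to_unit(10**12, "TB"), to_unit(10**12, "TiB"))
```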

 %% Cell type:code id: tags:

 ``` 
 df["atime"].min()
 ```

 %% Cell type:code id: tags:

 ``` 
 df[["atime","uid"]].sort_values(by="atime")
 ```

 %% Cell type:code id: tags:

 ``` 
 df[["uid","size"]].groupby("uid").sum()/1000/1000/1000/1000
 ```

 %% Cell type:code id: tags:

 ``` 
 (df[["uid","size"]].groupby("uid").sum()/1000/1000/1000/1000).sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["atime"].sort_values().head()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["uid"].head()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["misc"].unique()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["isfile"]=df["misc"].str.contains('F')
 ```

 %% Cell type:code id: tags:

 ``` 
 len(df["uid"].unique())
 ```

 %% Cell type:code id: tags:

 ``` 
 df["uid"].unique()
 ```

 %% Cell type:markdown id: tags:

 Get usernames from uid values via the pwd password database module https://stackoverflow.com/a/421670/8928529

 %% Cell type:code id: tags:

 ``` 
 import pwd
 ```

 %% Cell type:code id: tags:

 ``` 
 pwd.getpwuid(12137)[0].split(":")
 ```

 %% Cell type:code id: tags:

 ``` 
 def getuser(uid):
    return pwd.getpwuid(int(uid))[0].split(":")[0]
 ```

 %% Cell type:code id: tags:

 ``` 
 getuser(10973)
 ```

 %% Cell type:code id: tags:

 ``` 
+# add new column for resolved uids
+df["uname"]=""
+```
+
+%% Cell type:code id: tags:
+
+``` 
+# set uname for uid
 for uid in sorted(df["uid"].unique()):
-    print("uid: {} name: {}".format(uid, pwd.getpwuid(int(uid))[0].split(":")[0]))
+    uname = pwd.getpwuid(int(uid))[0].split(":")[0]
+    print("uid: {} name: {}".format(uid, uname))
+    df.loc[df["uid"]==uid, ["uname"]] = uname
+```
+
+%% Cell type:code id: tags:
+
+``` 
+df[df["uid"]==10005]
 ```
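
The loop in this change calls `pwd.getpwuid` once per distinct uid, which raises `KeyError` when a uid has no entry in the password database (for example, a deleted account that still owns files). One possible variation, sketched below, builds a uid-to-name dictionary with a fallback. The helper `resolve_uids` and its `lookup` parameter are hypothetical and not part of the commit.

```python
import pwd


def resolve_uids(uids, lookup=pwd.getpwuid):
    """Hypothetical helper: map each distinct uid to a username.

    A uid with no password database entry falls back to the uid as a string.
    """
    names = {}
    for uid in sorted(set(int(u) for u in uids)):
        try:
            names[uid] = lookup(uid).pw_name
        except KeyError:
            # No entry for this uid; keep the numeric value so the row stays identifiable.
            names[uid] = str(uid)
    return names


names = resolve_uids([0])
print(names)

# With the notebook's dataframe, the column could then be filled in one step
# (assumes pandas and the df defined earlier in the notebook):
# df["uname"] = df["uid"].map(names)
```

Mapping through a dictionary avoids one boolean-mask assignment per uid, which may matter on a dataframe with many rows, though this has not been measured here.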

 %% Cell type:code id: tags:

 ``` 
 sorted(df["heat"].unique())
 ```

 %% Cell type:code id: tags:

 ``` 
 df["path"] = df["path"].astype("str")
 ```

 %% Cell type:code id: tags:

 ``` 
 df = pd.concat([df, df["path"].apply("str").split("/", 4, expand=True)[[1,3,4]].rename(columns={1: "fs", 3:"scratchdir", 4:"filename"})], axis=1)
 ```

 %% Cell type:code id: tags:

 ``` 
 df = df.rename(columns={"sratchdir": "scratchdir"})
 ```

 %% Cell type:code id: tags:

 ``` 
 df.columns
 ```

 %% Cell type:code id: tags:

 ``` 
 userdata = df[["scratchdir", "size", "kballoc", "isfile"]].groupby(["scratchdir"]).sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 userdata
 ```

 %% Cell type:code id: tags:

 ``` 
 userdata["size"]/1000/1000/1000
 ```

 %% Cell type:code id: tags:

 ``` 
 df["path"].apply("str").split("/", 4, expand=True)[[3,4]]
 ```

 %% Cell type:code id: tags:

 ``` 
 df["path"].apply("str").split("/", 4, expand=True)
 ```

 %% Cell type:code id: tags:

 ``` 
 bytesdays=df[["atime","size"]]
 ```

 %% Cell type:code id: tags:

 ``` 
 bd=bytesdays.set_index("atime")
 ```

 %% Cell type:code id: tags:

 ``` 
 bd=bd.resample('D').sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd["sum"]=bd.cumsum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd[:"2022-02-15"]
 ```

 %% Cell type:code id: tags:

 ``` 
 ```

 %% Cell type:code id: tags:

 ``` 
 size, gb = bd[bd["size"]>0].loc[:"2022-01-01"].sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 gb
 ```

 %% Cell type:code id: tags:

 ``` 
 bd.loc[:"2021-12-31"].sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd.loc[:"2022-01-01"].sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd.loc["2022-01-01":]
 ```

 %% Cell type:code id: tags:

 ``` 
 bd[bd["size"]>0]/1024/1024/1024 #.plot()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd["gb"] = bd["sum"]/1024/1024/1024
 ```

 %% Cell type:code id: tags:

 ``` 
 bd["gb"]
 ```

 %% Cell type:code id: tags:

 ``` 
 b2d=bd["2021-10-01":]
 ```

 %% Cell type:code id: tags:

 ``` 
 1024*1024*1024*1024
 ```

 %% Cell type:code id: tags:

 ``` 
 bd7=b2d[["gb"]].rolling(7, center=True).sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 # Plot hourly, daily, 7-day rolling mean
 fig, ax = plt.subplots()
 #ax.plot(kW, marker='.', markersize=2, color='gray', linestyle='None', label='Hourly Average')
 ax.plot(b2d["gb"], color='brown', linewidth=2, label='1-day Average')
 ax.plot(bd7["gb"], color='black', linewidth=1, label='7-day Rolling Average')
 label='Trend (7 day Rolling Sum)'
 ax.legend()
 ax.set_ylabel('Size (GBytes)')
 ax.set_title('Cheaha Trends in Scratch Usage');
 ```

 %% Cell type:code id: tags:

 ``` 
 ```