Add time magic to csv parsing cell to get familiar with parsing times

c247a661 · John-Paul Robinson · be11fd51 · c247a661
Commit c247a661 authored 2 years ago by John-Paul Robinson
--- a/scratch-log-explorations.ipynb
+++ b/scratch-log-explorations.ipynb
@@ -115,6 +115,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "%%time\n",
    "df = pd.read_csv(file,\n",
    "                 lineterminator='\\n',\n",
    "                 sep=\"|\", header=0, \n",

 %% Cell type:markdown id: tags:

 # Notebook to explore parsing of the gpfs policy outputs

 This is a collection of cells to understand data.
 No particular endpoint in mind.

 %% Cell type:code id: tags:

 ``` 
 import pandas as pd
 import matplotlib.pyplot as plt
 ```

 %% Cell type:markdown id: tags:

 This is the format of each line in the policy output:

    5001:000fffffffffffff:0000000000004741:4b8f012b:0:2c172b:10002:0:40!basedir/path/to/file:13!scratch_tier1;253!|size=444|kballoc=0|access=2022-01-01 06:58:37.177440|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1|path=/rootdir/basedir/path/to/file|misc=FAu|
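
To make the layout concrete, here is a minimal standard-library sketch that splits the sample line above into a dictionary. The helper name `parse_policy_line` is hypothetical, the timestamp format is inferred from the sample, and the sketch assumes paths never contain a `|` character.

```python
from datetime import datetime

SAMPLE = ("5001:000fffffffffffff:0000000000004741:4b8f012b:0:2c172b:10002:0:40!"
          "basedir/path/to/file:13!scratch_tier1;253!"
          "|size=444|kballoc=0|access=2022-01-01 06:58:37.177440"
          "|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273"
          "|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1"
          "|path=/rootdir/basedir/path/to/file|misc=FAu|")


def parse_policy_line(line):
    """Return the name=value fields of one policy line as a dict (hypothetical helper)."""
    record = {}
    # The first "|"-separated field is the inode/list prefix, so skip it.
    for field in line.rstrip("\n").split("|")[1:]:
        if not field:  # the trailing "|" leaves an empty last field
            continue
        name, _, value = field.partition("=")
        record[name] = value
    # Convert the three timestamps (format assumed from the sample line).
    for key in ("access", "create", "modify"):
        record[key] = datetime.strptime(record[key], "%Y-%m-%d %H:%M:%S.%f")
    return record


record = parse_policy_line(SAMPLE)
print(record["size"], record["access"], record["path"])
```

The values stay strings apart from the timestamps, which mirrors what the converters later in the notebook hand back to pandas.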

 %% Cell type:code id: tags:

 ``` 
 file="data/mmapplypolicy.61746.962D9400.list.no_extern_list_list-30day-with-excludes_slurm-12551165_2022-03-03-04:00:09"
 ```

 %% Cell type:code id: tags:

 ``` 
 file="data/mmapplypolicy.54197.413B7AB5.list.no_extern_list_list-only-temporary-scratch_slurm-12790116_2022-03-14-18:47:51"
 ```

 %% Cell type:code id: tags:

 ``` 
 file="data/mmapplypolicy.120904.9DBFF7E6.list.no_extern_list_list-30day-with-excludes_slurm-13113652_2022-04-05-04:00:28"
 ```

 %% Cell type:markdown id: tags:

 ## Parser functions

 First we define the structure of the file, then the columns we want to use.

 %% Cell type:code id: tags:

 ``` 
 fields=['ignore', 'size', 'kballoc', 'atime', 'ctime', 'mtime', 'uid', 'gid', 'heat', 'pool', 'path', 'misc']

 usecols=['size', 'kballoc', 'atime', 'ctime', 'mtime', 'uid', 'gid', 'heat', 'pool', 'path', 'misc']
 ```

 %% Cell type:code id: tags:

 ``` 
 def splitter(x):
    '''
    split each name=value field on = and return the value
    '''
    return x.split("=", 1)[1]
 ```

 %% Cell type:markdown id: tags:

 Set up a splitters dictionary to process all the used fields with the splitter function.
 https://realpython.com/python-defaultdict/

 %% Cell type:code id: tags:

 ``` 
 splitters = {}

 for name in usecols:
    splitters.setdefault(name, splitter)
 ```
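
Because every used column gets the same converter, the loop above could probably be written as a dict comprehension or with `dict.fromkeys`. The sketch below checks that the three forms build the same mapping; the names `by_comprehension` and `by_fromkeys` are illustrative only.

```python
def splitter(x):
    """Split a name=value field on the first "=" and return the value."""
    return x.split("=", 1)[1]


usecols = ['size', 'kballoc', 'atime', 'ctime', 'mtime',
           'uid', 'gid', 'heat', 'pool', 'path', 'misc']

# Form used in the notebook: setdefault on a plain dict.
splitters = {}
for name in usecols:
    splitters.setdefault(name, splitter)

# Equivalent one-liners (illustrative names).
by_comprehension = {name: splitter for name in usecols}
by_fromkeys = dict.fromkeys(usecols, splitter)

print(splitters == by_comprehension == by_fromkeys)

# Splitting on the first "=" only keeps any later "=" inside the value.
print(splitter("path=/a/b=c"))
```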

 %% Cell type:code id: tags:

 ``` 
+%%time
 df = pd.read_csv(file,
                 lineterminator='\n',
                 sep="|", header=0,
                 #on_bad_lines="warn",
                 index_col=False,
                 #nrows=1000000,
                 names=fields,
                 usecols=usecols,
                 converters=splitters,
                 parse_dates=['atime', 'ctime', 'mtime'],
                )
 ```
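
The `%%time` magic only works inside IPython or Jupyter. Outside a notebook, a rough equivalent is to wrap the parse in `time.perf_counter()`. The sketch below does this on synthetic in-memory data, not real policy output. It departs from the cell above in two labeled ways: it adds a hypothetical `trailing` column name for the empty field left by the final `|`, and it converts the timestamps with an explicit `pd.to_datetime` call instead of `parse_dates`.

```python
import io
import time

import pandas as pd


def splitter(x):
    """Split a name=value field on the first "=" and return the value."""
    return x.split("=", 1)[1]


# 'trailing' is a hypothetical extra name for the empty field after the final "|".
fields = ['ignore', 'size', 'kballoc', 'atime', 'ctime', 'mtime',
          'uid', 'gid', 'heat', 'pool', 'path', 'misc', 'trailing']
usecols = fields[1:-1]
splitters = {name: splitter for name in usecols}

line = ("prefix!|size=444|kballoc=0|access=2022-01-01 06:58:37.177440"
        "|create=2022-01-01 06:21:33.356110|modify=2022-01-01 06:23:47.011273"
        "|uid=10973|gid=10973|heat=+0.00000000000000E+000|pool=scratch_tier1"
        "|path=/rootdir/basedir/path/to/file|misc=FAu|\n")
data = line * 10_000  # synthetic input for timing purposes only

start = time.perf_counter()
df = pd.read_csv(io.StringIO(data), sep="|", header=None,
                 names=fields, usecols=usecols, converters=splitters)
for col in ['atime', 'ctime', 'mtime']:
    df[col] = pd.to_datetime(df[col])
elapsed = time.perf_counter() - start

print(f"parsed {len(df)} rows in {elapsed:.3f} s")
```

Timing the converter step and the date conversion separately would show which of the two dominates on a real list file.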

 %% Cell type:code id: tags:

 ``` 
 df.info()
 ```

 %% Cell type:markdown id: tags:

 Clean up data types for numeric values

 %% Cell type:code id: tags:

 ``` 
 for intcol in ['size', 'kballoc', 'uid', 'gid']:
    df[intcol] = df[intcol].astype("int")
 ```

 %% Cell type:code id: tags:

 ``` 
 df.head(3)
 ```

 %% Cell type:markdown id: tags:

 Quick summary of the total storage allocated and used by 30+ day files

 %% Cell type:code id: tags:

 ``` 
 df["kballoc"].sum()/1024
 ```

 %% Cell type:code id: tags:

 ``` 
 df["size"].sum()/1024/1024
 ```
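
The notebook divides by 1024 in some cells and by 1000 in others, so the results come out in different units. The sketch below spells out the conversions with hypothetical helper names. It assumes `kballoc` is reported in units of 1024 bytes and `size` in bytes.

```python
KIB = 1024


def kib_to_mib(kib):
    """Binary kilobytes to binary megabytes (assumes kballoc is in KiB)."""
    return kib / KIB


def bytes_to_mib(nbytes):
    """Bytes to binary megabytes, as in df["size"].sum()/1024/1024."""
    return nbytes / KIB / KIB


def bytes_to_tb(nbytes):
    """Bytes to decimal terabytes, as in the /1000/1000/1000/1000 cells."""
    return nbytes / 1000 ** 4


def bytes_to_tib(nbytes):
    """Bytes to binary terabytes (powers of 1024)."""
    return nbytes / KIB ** 4


nbytes = 5 * KIB ** 4
print(bytes_to_tib(nbytes), round(bytes_to_tb(nbytes), 3))
```

The same byte count reads about 10% larger in decimal terabytes than in binary ones, which matters when comparing these totals against a quota.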

 %% Cell type:code id: tags:

 ``` 
 df["atime"].min()
 ```

 %% Cell type:code id: tags:

 ``` 
 df[["atime","uid"]].sort_values(by="atime")
 ```

 %% Cell type:code id: tags:

 ``` 
 df[["uid","size"]].groupby("uid").sum()/1000/1000/1000/1000
 ```

 %% Cell type:code id: tags:

 ``` 
 (df[["uid","size"]].groupby("uid").sum()/1000/1000/1000/1000).sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["atime"].sort_values().head()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["uid"].head()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["misc"].unique()
 ```

 %% Cell type:code id: tags:

 ``` 
 df["isfile"]=df["misc"].str.contains('F')
 ```

 %% Cell type:code id: tags:

 ``` 
 len(df["uid"].unique())
 ```

 %% Cell type:code id: tags:

 ``` 
 df["uid"].unique()
 ```

 %% Cell type:markdown id: tags:

 Get usernames from uid values via the `pwd` password database module: https://stackoverflow.com/a/421670/8928529

 %% Cell type:code id: tags:

 ``` 
 import pwd
 ```

 %% Cell type:code id: tags:

 ``` 
 pwd.getpwuid(12137)[0].split(":")
 ```

 %% Cell type:code id: tags:

 ``` 
 def getuser(uid):
    return pwd.getpwuid(int(uid))[0].split(":")[0]
 ```
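
`pwd.getpwuid` raises `KeyError` when a uid has no entry in the password database, which can happen for accounts that have since been removed. A minimal sketch of a more forgiving lookup is shown below. Falling back to the uid as a string is an assumption about what is useful here, not something the notebook does.

```python
import os
import pwd


def getuser(uid):
    """Return the login name for uid, or the uid as a string if no entry exists."""
    try:
        # int() also accepts NumPy integer types coming out of a DataFrame.
        return pwd.getpwuid(int(uid)).pw_name
    except KeyError:
        return str(uid)


print(getuser(os.getuid()))
```

The `pw_name` attribute is the same value as index `[0]` of the returned record, so no further splitting on `:` is needed.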

 %% Cell type:code id: tags:

 ``` 
 getuser(10973)
 ```

 %% Cell type:code id: tags:

 ``` 
 for uid in sorted(df["uid"].unique()):
    print("uid: {} name: {}".format(uid, pwd.getpwuid(int(uid))[0].split(":")[0]))
 ```

 %% Cell type:code id: tags:

 ``` 
 sorted(df["heat"].unique())
 ```

 %% Cell type:code id: tags:

 ``` 
 df["path"] = df["path"].astype("str")
 ```

 %% Cell type:code id: tags:

 ``` 
 df = pd.concat([df, df["path"].str.split("/", n=4, expand=True)[[1,3,4]].rename(columns={1: "fs", 3:"scratchdir", 4:"filename"})], axis=1)
 ```
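
As a small worked example of the path split, the sketch below applies `Series.str.split` with `n=4` to the sample path from the policy output format. The variable names `parts` and `named` are illustrative.

```python
import pandas as pd

paths = pd.Series(["/rootdir/basedir/path/to/file"])

# n=4 stops after four splits, so everything past the fourth "/" stays together.
parts = paths.str.split("/", n=4, expand=True)

# Column 0 is the empty string before the leading "/".
named = parts[[1, 3, 4]].rename(columns={1: "fs", 3: "scratchdir", 4: "filename"})
print(named)
```

With this sample, `fs` is `rootdir`, `scratchdir` is `path`, and `filename` is `to/file`.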

 %% Cell type:code id: tags:

 ``` 
 df = df.rename(columns={"sratchdir": "scratchdir"})
 ```

 %% Cell type:code id: tags:

 ``` 
 df.columns
 ```

 %% Cell type:code id: tags:

 ``` 
 userdata = df[["scratchdir", "size", "kballoc", "isfile"]].groupby(["scratchdir"]).sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 userdata
 ```

 %% Cell type:code id: tags:

 ``` 
 userdata["size"]/1000/1000/1000
 ```

 %% Cell type:code id: tags:

 ``` 
 df["path"].str.split("/", n=4, expand=True)[[3,4]]
 ```

 %% Cell type:code id: tags:

 ``` 
 df["path"].str.split("/", n=4, expand=True)
 ```

 %% Cell type:code id: tags:

 ``` 
 bytesdays=df[["atime","size"]]
 ```

 %% Cell type:code id: tags:

 ``` 
 bd=bytesdays.set_index("atime")
 ```

 %% Cell type:code id: tags:

 ``` 
 bd=bd.resample('D').sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd["sum"]=bd.cumsum()
 ```
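
The three cells above bucket file sizes by access day and then accumulate them. A minimal sketch on made-up data shows what each step produces. The `files` frame is invented for illustration.

```python
import pandas as pd

# Invented sample: three files accessed on two different days.
files = pd.DataFrame({
    "atime": pd.to_datetime(["2022-01-01 06:58:37",
                             "2022-01-01 23:10:00",
                             "2022-01-03 12:00:00"]),
    "size": [444, 556, 1000],
})

# Daily buckets: days with no files get a sum of 0.
bd = files.set_index("atime").resample("D").sum()

# Running total of bytes last accessed on or before each day.
bd["sum"] = bd["size"].cumsum()
print(bd)
```

Reading the `sum` column at a given date gives the bytes last accessed on or before that day.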

 %% Cell type:code id: tags:

 ``` 
 bd[:"2022-02-15"]
 ```

 %% Cell type:code id: tags:

 ``` 
 ```

 %% Cell type:code id: tags:

 ``` 
 size, gb = bd[bd["size"]>0].loc[:"2022-01-01"].sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 gb
 ```

 %% Cell type:code id: tags:

 ``` 
 bd.loc[:"2021-12-31"].sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd.loc[:"2022-01-01"].sum()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd.loc["2022-01-01":]
 ```

 %% Cell type:code id: tags:

 ``` 
 bd[bd["size"]>0]/1024/1024/1024 #.plot()
 ```

 %% Cell type:code id: tags:

 ``` 
 bd["gb"] = bd["sum"]/1024/1024/1024
 ```

 %% Cell type:code id: tags:

 ``` 
 bd["gb"]
 ```

 %% Cell type:code id: tags:

 ``` 
 b2d=bd["2021-10-01":]
 ```

 %% Cell type:code id: tags:

 ``` 
 1024*1024*1024*1024
 ```

 %% Cell type:code id: tags:

 ``` 
 bd7=b2d[["gb"]].rolling(7, center=True).sum()
 ```
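
Note that `bd7` is a centered rolling sum, not a mean. For a full 7-day window the sum is seven times the mean, and a centered window leaves the first and last three days empty. The sketch below illustrates this on an invented series.

```python
import pandas as pd

# Invented daily series with values 1..10.
s = pd.Series(range(1, 11),
              index=pd.date_range("2022-01-01", periods=10, freq="D"),
              dtype="float64")

rolling_sum = s.rolling(7, center=True).sum()
rolling_mean = s.rolling(7, center=True).mean()

# Compare the two where the window is complete.
print(pd.DataFrame({"sum": rolling_sum, "mean": rolling_mean}).dropna())
```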

 %% Cell type:code id: tags:

 ``` 
 # Plot daily values and the 7-day rolling sum
 fig, ax = plt.subplots()
 #ax.plot(kW, marker='.', markersize=2, color='gray', linestyle='None', label='Hourly Average')
 ax.plot(b2d["gb"], color='brown', linewidth=2, label='1-day Average')
 ax.plot(bd7["gb"], color='black', linewidth=1, label='Trend (7-day Rolling Sum)')
 ax.legend()
 ax.set_ylabel('Size (GBytes)')
 ax.set_title('Cheaha Trends in Scratch Usage');
 ```

 %% Cell type:code id: tags:

 ``` 
 ```