This notebook performs a clustering analysis of ReqMemCPU, AllocCPUS, and Elapsed.
ReqMemCPU is the amount of RAM, in gigabytes, requested by the user for each job.
AllocCPUS is the number of cores allocated to each job by Slurm.
Elapsed is the time, in hours, that each job took to run.
%% Cell type:markdown id: tags:
# Assumptions and Restrictions
%% Cell type:markdown id: tags:
Based on extensive data and clustering exploration, this notebook is set to graph up to 4 clusters (n_clusters = 4 in kmeans). Raising the number of clusters requires adding code to draw a 2D histogram for each extra cluster group. Lowering it requires modifying the code to expect fewer than 4 clusters as input.
%% Cell type:markdown id: tags:
# Data Setup Options
%% Cell type:markdown id: tags:
There are 6 decisions to make when setting up the data.
Date Range: Choose the start date and end date of the data you want to cluster.
The format is yyyy-mm-dd.
Bracketing Values: Choose a minimum and maximum value for ReqMemCPU, AllocCPUS, and Elapsed.
These values will allow you to "zoom in" or "zoom out" on your data.
1. LowerlimitGB/UpperlimitGB - min/max ReqMemCPU: most of the data lies between 1 and 150 gigabytes.
Most ReqMemCPU values above 150 are outliers.
2. LowerlimitAllocCPU/UpperlimitAllocCPU - min/max AllocCPUS: most of the data lies between 1 and 260 cores.
Most AllocCPUS values above 260 are outliers.
3. LowerlimitElapsed/UpperlimitElapsed - min/max Elapsed: the highest Elapsed value is 150.02 hours.
Data Normalization: There are three choices for normalization of the data - 'none', '0-1', or 'log'
1. 'none' - no data normalization. Data is clustered and graphed as is.
2. '0-1' - all data in the chosen date range and bracketing ranges will be scaled to values between 0 and 1.
This is useful if your bracketing ranges differ greatly from each other.
3. 'log' - all data in the chosen date range and bracketing ranges will be replaced by its log values.
This is useful if your bracketing ranges produce very large values that are easier to visualize on a log scale.
2D Histogram X and Y Axes: Sets the min and max of the x and y axes in the 2D histograms of each of the four clusters. All the 2D histograms share the same x and y axes, which lets the user "zoom" in or out of the data.
%% Cell type:markdown id: tags:
## Date Range
%% Cell type:code id: tags:
```
# start_date = '2020-11-01'  # overridden by the value below
# end_date = '2020-11-23'    # overridden by the value below
start_date = '2021-01-01'
end_date = '2021-01-08'
```
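%% Cell type:markdown id: tags:
Because the dates are plain strings, a typo is easy to miss. The sketch below is one possible sanity check and is not part of the original notebook. The helper name `check_date_range` is hypothetical. It parses both strings as yyyy-mm-dd and confirms that the end date does not precede the start date.

```python
from datetime import datetime

def check_date_range(start_date: str, end_date: str) -> int:
    """Parse two yyyy-mm-dd strings and return the span between them in days.

    Hypothetical helper for illustration; the notebook itself does not define it.
    """
    # strptime raises ValueError for strings it cannot parse with this format,
    # for example '01/08/2021'
    start = datetime.strptime(start_date, '%Y-%m-%d').date()
    end = datetime.strptime(end_date, '%Y-%m-%d').date()
    if end < start:
        raise ValueError(f"end_date {end_date} is before start_date {start_date}")
    return (end - start).days

span_days = check_date_range('2021-01-01', '2021-01-08')
print(span_days)
```

Running a check like this right after setting the dates would surface a bad range before any data is pulled.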
%% Cell type:markdown id: tags:
## Bracketing Values
%% Cell type:code id: tags:
```
# sets min and max parameters for ReqMemCPU - user requested
LowerlimitGB = 0
UpperlimitGB = 150
```
%% Cell type:code id: tags:
```
# sets min and max parameters for AllocCPUS - allocated by slurm
LowerlimitAllocCPU = 0
UpperlimitAllocCPU = 260
```
%% Cell type:code id: tags:
```
# sets min and max parameters for Elapsed
LowerlimitElapsed = 0
UpperlimitElapsed = 150.02 # = 6.25 days
```
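%% Cell type:markdown id: tags:
To illustrate what the bracketing values do, the sketch below keeps only the jobs whose three fields fall inside the limits. It uses plain Python and made-up job records so it runs on its own. In the notebook the same comparisons would presumably be applied to DataFrame columns. Treating the limits as inclusive is an assumption, and `in_brackets` is a hypothetical name.

```python
# Limits copied from the cells above
LowerlimitGB, UpperlimitGB = 0, 150
LowerlimitAllocCPU, UpperlimitAllocCPU = 0, 260
LowerlimitElapsed, UpperlimitElapsed = 0, 150.02

# Made-up job records, for illustration only
jobs = [
    {"ReqMemCPU": 4, "AllocCPUS": 1, "Elapsed": 0.5},
    {"ReqMemCPU": 200, "AllocCPUS": 24, "Elapsed": 12.0},   # memory above the upper limit
    {"ReqMemCPU": 64, "AllocCPUS": 512, "Elapsed": 3.0},    # cores above the upper limit
    {"ReqMemCPU": 16, "AllocCPUS": 8, "Elapsed": 150.02},   # exactly on the Elapsed limit
]

def in_brackets(job):
    """Return True if all three fields lie within their limits (assumed inclusive)."""
    return (LowerlimitGB <= job["ReqMemCPU"] <= UpperlimitGB
            and LowerlimitAllocCPU <= job["AllocCPUS"] <= UpperlimitAllocCPU
            and LowerlimitElapsed <= job["Elapsed"] <= UpperlimitElapsed)

bracketed = [job for job in jobs if in_brackets(job)]
print(len(bracketed), "of", len(jobs), "jobs kept")
```

Tightening any upper limit "zooms in" by dropping more of the large jobs before clustering.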
%% Cell type:markdown id: tags:
## Data Normalization
%% Cell type:code id: tags:
```
# Enter 'none', '0-1', or 'log' as a choice for data normalization
Data_Normalization_Choice = 'none'
```
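%% Cell type:markdown id: tags:
As a rough illustration of the three choices, the sketch below applies each one to a short list of numbers. It is not the notebook's own implementation. The function name `normalize` is hypothetical, '0-1' is assumed to mean min-max scaling, and 'log' is assumed to mean the natural logarithm, which is only defined for positive values.

```python
import math

def normalize(values, choice):
    """Apply one of the three normalization choices to a list of numbers.

    Hypothetical helper for illustration only.
    """
    if choice == 'none':
        # Data is used as is
        return list(values)
    if choice == '0-1':
        # Min-max scaling (assumption): smallest value maps to 0, largest to 1
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]
    if choice == 'log':
        # Natural log (assumption); zero and negative values have no logarithm
        if any(v <= 0 for v in values):
            raise ValueError("log scaling needs strictly positive values")
        return [math.log(v) for v in values]
    raise ValueError(f"unknown normalization choice: {choice!r}")

print(normalize([0, 75, 150], '0-1'))
print(normalize([1, 10, 100], 'log'))
```

Note that a lower bracketing limit of 0 can admit zero values, which a plain logarithm cannot handle, so some adjustment would be needed before using 'log' on such data.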
%% Cell type:markdown id: tags:
## 2D Histogram X and Y Axes
%% Cell type:code id: tags:
```
xaxis_min = 0
# xaxis_max = 140  # overridden by the value below
xaxis_max = 80
yaxis_min = 0
# yaxis_max = 100  # overridden by the value below
yaxis_max = 20
```
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
```
# must run
import sqlite3
import slurm2sql
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import seaborn as sb
import plotly.express as px
import matplotlib.ticker as ticker
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import os
from RC_styles import rc_styles as style
from sklearn.cluster import KMeans
```
%% Cell type:markdown id: tags:
# Database Creation
%% Cell type:code id: tags:
```
# connect to the database file named after the chosen date range
db = sqlite3.connect(f'cluster_analysis_from_{start_date}to{end_date}.db')
```
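%% Cell type:markdown id: tags:
`sqlite3.connect` creates the file if it does not already exist, so each date range gets its own database file. The sketch below shows the file name that the dates above produce and one generic way to list the tables in a SQLite database. It uses an in-memory database and a hypothetical `demo_jobs` table, so it does not reflect the schema that slurm2sql actually writes.

```python
import sqlite3

start_date = '2021-01-01'
end_date = '2021-01-08'

# Same naming pattern as the connection cell above
db_name = f'cluster_analysis_from_{start_date}to{end_date}.db'
print(db_name)

# An in-memory database stands in for the real file so nothing is written to disk
demo_db = sqlite3.connect(':memory:')

# Hypothetical table, for illustration only
demo_db.execute(
    'CREATE TABLE demo_jobs (ReqMemCPU REAL, AllocCPUS INTEGER, Elapsed REAL)'
)

# sqlite_master is SQLite's built-in catalog of tables, views, and indexes
tables = [
    row[0]
    for row in demo_db.execute("SELECT name FROM sqlite_master WHERE type='table'")
]
print(tables)
demo_db.close()
```

Running the same catalog query against the real connection after it has been populated would show which tables are available to read.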
%% Cell type:code id: tags:
```
# populate the database with jobs between the start and end dates
slurm2sql.slurm2sql(db, ['-S', start_date, '-E', end_date, '-X', '-a'])  # -X is allocations, -a is all users