"This notebook is for the clustering analysis of ReqMemCPU, AllocCPUS, and Elapsed.\n",
"This notebook is for clustering analysis of ReqMemCPU, AllocCPUS, and Elapsed.\n",
"ReqMemCPU is the amount of RAM in gigs for each job as requested by the user.\n",
"ReqMemCPU is the amount of RAM in gigs for each job as requested by the user.\n",
"AllocCPUS is the amount of cores that were used for each job.\n",
"AllocCPUS is the amount of cores that were used for each job.\n",
"Elapsed is the amount of time, in hours, that job took to run."
"Elapsed is the amount of time, in hours, that job took to run."
...
@@ -81,8 +81,8 @@
...
@@ -81,8 +81,8 @@
"metadata": {},
"metadata": {},
"outputs": [],
"outputs": [],
"source": [
"source": [
"start_date = '2021-01-01'\n",
"start_date = 'yyyy-mm-dd'\n",
"end_date = '2021-01-08'"
"end_date = 'yyyy-mm-dd'"
]
]
},
},
{
{
...
...
%% Cell type:markdown id: tags:
# Purpose
%% Cell type:markdown id: tags:
This notebook is for the clustering analysis of ReqMemCPU, AllocCPUS, and Elapsed.
ReqMemCPU is the amount of RAM, in gigs, requested by the user for each job.
AllocCPUS is the number of cores used for each job.
Elapsed is the amount of time, in hours, that each job took to run.
%% Cell type:markdown id: tags:
# Assumptions and Restrictions
%% Cell type:markdown id: tags:
Based on extensive data and clustering exploration, this notebook is set to graph up to 4 clusters (n_clusters = 4 in kmeans). To raise the number of clusters, more code must be added to create 2d histograms for the extra cluster groups. To lower the number of clusters, the code must be modified to expect fewer than 4 clusters as an input.
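A rough, hedged illustration (not this notebook's exact code) of what n_clusters = 4 means in practice: the sketch below fits scikit-learn's KMeans with four clusters on the three job metrics. The small example DataFrame is made up purely for the sketch; in the notebook the real data comes from the slurm2sql database built further down.
```
import pandas as pd
from sklearn.cluster import KMeans

# made-up example frame purely for this sketch
df_example = pd.DataFrame({
    'ReqMemCPU': [1, 2, 16, 150],
    'AllocCPUS': [1, 4, 8, 260],
    'Elapsed':   [0.5, 2.0, 24.0, 150.0],
})

# n_clusters = 4 matches the four cluster groups this notebook graphs
kmeans = KMeans(n_clusters=4, random_state=0)
df_example['cluster'] = kmeans.fit_predict(df_example[['ReqMemCPU', 'AllocCPUS', 'Elapsed']])
```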
%% Cell type:markdown id: tags:
# Data Setup Options
%% Cell type:markdown id: tags:
There are 6 decisions to make in the setup of the data.
Date Range: Choose a start date and an end date for the data you want to cluster.
The format is yyyy-mm-dd.
Bracketing Values: Choose a minimum and maximum value for ReqMemCPU, AllocCPUS, and Elapsed.
These values let you "zoom in" or "zoom out" on your data.
1. Upper/LowerGB - min/max ReqMemCPU: most of the data lies between 1 and 150 gigs.
Most ReqMemCPU values above 150 are outliers.
2. Upper/LowerAllocCPU - min/max AllocCPUS: most of the data lies between 1 and 260 cores.
Most AllocCPUS values above 260 are outliers.
3. Upper/LowerElapsed - min/max Elapsed: 150.02 hours is the highest value Elapsed reaches.
Data Normalization: There are three choices for normalizing the data - 'none', '0-1', or 'log'.
1. 'none' - no data normalization. Data is clustered and graphed as is.
2. '0-1' - all data in the chosen date range and bracketing ranges is scaled to values between 0 and 1.
This is useful if your bracketing ranges differ greatly from each other.
3. 'log' - all data in the chosen date range and bracketing ranges is scaled to log values.
This is useful if your bracketing ranges produce very large values that are easier to visualize on a log scale.
2D Histogram X and Y Axes: This sets a min and max for the x and y axes in the 2D histograms of each of the four clusters. All x and y axes are the same across the 2d histograms. This lets the user "zoom" in or out of the data.
%% Cell type:markdown id: tags:
## Date Range
%% Cell type:code id: tags:
```
# choose the start and end dates of the data to cluster (format: yyyy-mm-dd)
start_date = '2021-01-01'
end_date = '2021-01-08'
```
%% Cell type:markdown id: tags:
## Bracketing Values
%% Cell type:code id: tags:
```
# sets min and max parameters for ReqMemCPU - user requested
LowerlimitGB = 0
UpperlimitGB = 150
```
%% Cell type:code id: tags:
```
# sets min and max parameters for AllocCPUS - allocated by slurm
LowerlimitAllocCPU = 0
UpperlimitAllocCPU = 260
```
%% Cell type:code id: tags:
```
# sets min and max parameters for Elapsed
LowerlimitElapsed = 0
UpperlimitElapsed = 150.02 # = 6.25 days
```
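A minimal sketch (an assumption, not this notebook's exact code) of how these bracketing limits might be applied, assuming the jobs sit in a pandas DataFrame named `df` with the three columns described above:
```
# assumed DataFrame `df` with columns 'ReqMemCPU', 'AllocCPUS', 'Elapsed'
mask = (
    df['ReqMemCPU'].between(LowerlimitGB, UpperlimitGB)
    & df['AllocCPUS'].between(LowerlimitAllocCPU, UpperlimitAllocCPU)
    & df['Elapsed'].between(LowerlimitElapsed, UpperlimitElapsed)
)
df_bracketed = df[mask]
```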
%% Cell type:markdown id: tags:
## Data Normalization
%% Cell type:code id: tags:
```
# Enter 'none', '0-1', or 'log' as the choice for data normalization
Data_Normalization_Choice = 'none'
```
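A minimal sketch (an assumption, not this notebook's exact code) of how Data_Normalization_Choice could be applied to the bracketed data from the sketch above:
```
import numpy as np

if Data_Normalization_Choice == '0-1':
    # min-max scale each column to the range [0, 1]
    df_scaled = (df_bracketed - df_bracketed.min()) / (df_bracketed.max() - df_bracketed.min())
elif Data_Normalization_Choice == 'log':
    # log1p avoids log(0) for jobs with zero-valued fields
    df_scaled = np.log1p(df_bracketed)
else:
    # 'none' - cluster and graph the data as is
    df_scaled = df_bracketed
```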
%% Cell type:markdown id: tags:
## 2D Histogram X and Y Axes
%% Cell type:code id: tags:
```
# sets min and max for the x and y axes of the cluster 2D histograms
xaxis_min = 0
xaxis_max = 80
yaxis_min = 0
yaxis_max = 20
```
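A minimal sketch (an assumption, not this notebook's exact code) of how these axis limits might be used for one cluster's 2D histogram, assuming a DataFrame `df_scaled` carrying a k-means label column named 'cluster':
```
import matplotlib.pyplot as plt

cluster0 = df_scaled[df_scaled['cluster'] == 0]  # jobs in the first cluster
fig, ax = plt.subplots()
ax.hist2d(cluster0['ReqMemCPU'], cluster0['AllocCPUS'], bins=50,
          range=[[xaxis_min, xaxis_max], [yaxis_min, yaxis_max]])
ax.set_xlabel('ReqMemCPU (gigs)')
ax.set_ylabel('AllocCPUS (cores)')
plt.show()
```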
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
```
# must run
import sqlite3
import slurm2sql
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import seaborn as sb
import plotly.express as px
import matplotlib.ticker as ticker
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import os
from RC_styles import rc_styles as style
from sklearn.cluster import KMeans
```
%% Cell type:markdown id: tags:
# Database Creation
%% Cell type:code id: tags:
```
# connecting to the database
db = sqlite3.connect('cluster_analysis_from_' + str(start_date) + '_to_' + str(end_date) + '.db')
```
%% Cell type:code id: tags:
```
# filling the database with job data between the start and end dates
slurm2sql.slurm2sql(db, ['-S', start_date, '-E', end_date, '-X', '-a'])  # -X is allocations, -a is all users
```
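A minimal sketch (an assumption, not this notebook's exact code) of pulling the job data out of the freshly built database into pandas; the table name 'slurm' and the column names are taken here as assumptions, mirroring the fields described in the Purpose section.
```
# table name 'slurm' and these column names are assumptions for this sketch
df = pd.read_sql('SELECT ReqMemCPU, AllocCPUS, Elapsed FROM slurm', db)
df.head()
```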
# Release Notes version - 1.0
This initial version creates the dataset, kmeans clustering, and resulting graphs to analyze how our users are utilizing the cluster.
Features included:
- User input to choose the date range of data to analyze
- User input to choose min and max values for ReqMemCPU, AllocCPUS, and Elapsed
- User input to choose how data is normalized: 0-1, log, or no normalization
- User input to choose min and max x and y axes for the 2D histogram graphs
# Next Release Planned Features (coming December 2020)
- data on job counts for each density spot in the 2d histograms
- summary statistics for each cluster in the form of the count of jobs and the count of users per cluster
# Release Notes version - 1.1 Bug Fix (12/15/2020)
The dataset for completed jobs originally had all jobs and each of their job steps. This skewed the clustering graphs, as there were more data points than individual jobs ran. The data is now pulled into the dataset using only allocated jobs (done with -X in the slurm2sql.slurm2sql command), which results in each row of the dataset being a different job.
# Release Notes version - 2.0 (12/22/2020)
Added summary stats for each cluster. This includes the count of both jobs run and users running those jobs for each of the four clusters (a minimal sketch of this computation follows these notes).
- summary statistics in the form of a table showing the job and user count for each cluster
* Data on stats for each density spot in the 2d histograms will come in another notebook. That notebook will be a deeper analysis of each 2d histogram for each cluster and should be released by the end of January 2021.
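A minimal sketch (an assumption, not the notebook's exact code) of the per-cluster summary statistics described in the 2.0 notes, assuming a jobs DataFrame `df` with a k-means label column 'cluster' plus 'JobID' and 'User' columns:
```
# job and user counts per cluster; the column names here are assumptions
summary = df.groupby('cluster').agg(
    job_count=('JobID', 'count'),
    user_count=('User', 'nunique'),
)
summary
```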