explore-power-stats.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Explore Cluster Power Stats \n",
    "\n",
    "Use rest api for now to avoid RPC issues with python client library.\n",
    "\n",
    "Convert a curl command for currnet power to rest query\n",
    "https://curl.trillworks.com/\n",
    "\n",
    "This is the curl command that confirms authenticated access to Bright's CMDaeamon. It was derived from the [RestAPI intro in the Bright Developer Manual](https://support.brightcomputing.com/manuals/8.2/developer-manual.pdf).\n",
    "\n",
    "```\n",
    "curl --cert ~/.cm/cert.pem --key ~/.cm/cert.key --cacert pythoncm/etc/cacert.pem \\\n",
    "     \"https://master:8081/rest/v1/monitoring/latest?measurable=Pwr_Consumption&indent=1\"\n",
    "```\n",
    "\n",
    "Getting the total instantaneous power used from all nodes monitored by cmd.  The sum on the deals with weird outlier data from one of the nods:\n",
    "\n",
    "```\n",
    "curl --cert ~/.cm/cert.pem --key ~/.cm/cert.key --cacert pythoncm/etc/cacert.pem \\\n",
    "     \"https://master:8081/rest/v1/monitoring/latest?measurable=Pwr_Consumption&indent=1\" | \\\n",
    "     jq '.data[] | select(.raw < 10000) .raw' | \\\n",
    "     awk '{sum=$1+sum} END {print sum}'\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import pprint\n",
    "import datetime\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.dates as mdates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp = pprint.PrettyPrinter(indent=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up credentials to query RestAPI. Bright controls access based on the user identity.  The user's cert.pem and cert.key are automatically generated but the cacert.pem needs to be constructed from the  certs returned by the master."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cert_file='/home/jpr/.cm/cert.pem'\n",
    "key_file='/home/jpr/.cm/cert.key'\n",
    "ca_file='/home/jpr/projects/power-study/pythoncm/etc/cacert.pem'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = (\n",
    "    ('measurable', 'Pwr_Consumption'),\n",
    "    ('indent', '1'),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cert=(cert_file, key_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define the client certs with the cert line, note the order is (cert, key)\n",
    "# https://requests.readthedocs.io/en/master/user/advanced/#client-side-certificates\n",
    "#\n",
    "# define the verify bundle via verify, note False means do not verify\n",
    "# https://stackoverflow.com/a/48636689/8928529\n",
    "response = requests.get('https://master:8081/rest/v1/monitoring/latest', params=params, cert=cert, verify='cheaha-cmd-cabundle.pem')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Manually construct a pthyon datastructure hash of nodes and tuples of power sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "debug=False\n",
    "power=0.0\n",
    "count=0\n",
    "for num, doc in enumerate(response.json()[\"data\"]):\n",
    "    if doc[\"age\"] < 1000:\n",
    "       if debug: print(\"{}: {}\\n\".format(num, pp.pprint(doc)))\n",
    "       if doc[\"value\"] != \"no data\":\n",
    "         power=power + float(doc[\"raw\"])\n",
    "         count+=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "power"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "power/count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the time is in milleseconds so the unix converstion needs to drop the last three digits."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get power use history\n",
    "\n",
    "From beginning of july for starters. Based on the blog https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = (\n",
    "    ('start', '2020/01/01 00:00'),\n",
    "    #('entity', 'c0109'),\n",
    "    ('measurable', 'Pwr_Consumption'),\n",
    "    ('indent', '1'),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = requests.get('https://master:8081/rest/v1/monitoring/dump', params=params, cert=cert, verify='cheaha-cmd-cabundle.pem')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "response.json()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's easy to [convert a list of dictionaries to a pandas data frame](https://pbpython.com/pandas-list-dict.html). This conversion path is the default for the pandas DataFrame constructor and serves our currnt needs well. We can create data frame from the power dump response and easily enhance the data to serve our plotting needs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(response.json()[\"data\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "convert result date to unix time stamp for easier plotting and time comparison."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```entity \tmeasurable \traw \ttime \tvalue \tdatetime \tutime\n",
    "428473 \tc0063 \tPwr_Consumption \t742849.191667 \t2020/05/17 23:22:00 \t742KW \t2020-05-17 23:22:00 \t1589757720\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove problematic entries, like unrealistic data points.  Nodes can't consume 100's kW, like c0063 on 2020/05/17\n",
    "# https://www.interviewqs.com/ddi_code_snippets/rows_cols_python\n",
    "df.loc[df['raw'] >  100000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.loc[df['raw'] <  100000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# improve performace by providing format string to avoid per-entry parsing deduction\n",
    "\n",
    "df.loc['datetime'] = pd.to_datetime(df.time, format=\"%Y/%m/%d %H:%M:%S\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initial Data Viz with Seaborn Plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# add column with datetime converted to unix time (in seconds) \n",
    "# to preserve spacial relationships on the axis\n",
    "# note: the original data has no time zone so our reference time stamp needs to be timezone free\n",
    "df['utime'] = (df['datetime'] - pd.Timestamp(\"1970-01-01T00:00:00.000\")) // pd.Timedelta('1s') "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our default utility function to fix labels on the x-axis of time series seaborn plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def timeticks(ax, tformat=\"%H:%M:%S\\n%Y-%m-%d\"):\n",
    "    xticks = ax.get_xticks()\n",
    "    xticks_dates = [datetime.datetime.fromtimestamp(x).strftime(tformat) for x in xticks]\n",
    "    hush = ax.set_xticklabels(xticks_dates)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot each data point for power used. Observe that this plot does not aggregate power use across nodes.  It simply plots power used at all available time plots.\n",
    "\n",
    "Also note the time outliers.  We requested data since July 1, 2020 but the results include information from 2018."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# build the replot and capture the handle\n",
    "g = sns.relplot(x=\"utime\", y=\"raw\",\n",
    "            palette=\"bright\",\n",
    "            #height=5,\n",
    "            aspect=2,\n",
    "            data=df,\n",
    "            s=100)\n",
    "# update the axis labels\n",
    "g = (g.set_axis_labels(\"Date\", \"Power (Watts)\"))\n",
    "# update the x tickmarks from unix time to hour minute seconds\n",
    "ax = g.axes\n",
    "ax = ax[0,0]\n",
    "timeticks(ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore Resampling to Hourly Sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are more interested in a plot of total power used over time.\n",
    "\n",
    "We can [resample a data frame on a time interval](https://stackoverflow.com/a/52057318/8928529), which is our interest at this point.  In particular we would find it interesting to see the hourly total (sum) of the raw power used across the cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly = df.resample('H', on='datetime').size().reset_index(name='sum')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is isn't quite the sum we wanted to add up.\n",
    "\n",
    "Understand how setting the datatime index moves it out of the column collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily = df.set_index('datetime')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily['hourly'] = daily.index.hour"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The date indexing by setting the datetime index is helpful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.loc['2020-07-05']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use seaborn style defaults and set the default figure size\n",
    "sns.set(rc={'figure.figsize':(11, 4)})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily['raw'].plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "axes = daily['raw'].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 4), subplots=True)\n",
    "for ax in axes:\n",
    "    ax.set_ylabel('Usage (Watts)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.loc['2020-06', 'raw'].plot(marker='.', alpha=0.5, linestyle='None')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.loc['2020-06-14', 'raw'].plot(marker='.', alpha=0.5, linestyle='None')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.loc['2020-06-15', 'raw'].plot(marker='.', alpha=0.5, linestyle='None')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily.loc['2020-06-16', 'raw'].plot(marker='.', alpha=0.5, linestyle='None')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly = daily['raw'].resample('H').sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "daily['raw'].resample('H').mean().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These plots are interesting but don't seem to capture the data accurately.\n",
    "\n",
    "It seems more appopriate to work with each inidividual node and structure its values into a standard bin arrangement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Explore performance of an individual node."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node=df[df.entity=='c0002']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node = node.set_index(\"datetime\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A series of plots for the single node shows times when the node was computing vs when not. Notice the saw tooth graph after the high load events."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node['raw'].plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node['raw'].plot(marker='.', alpha=0.5, linestyle='None')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the dip in power use right at the the interesting time point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node.loc['2020-06-15':'2020-06-16', 'raw'].plot(marker='o', linestyle='-')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly=pd.date_range('2020-05-01', 'now', freq='H')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly=node[['raw']].resample('H').mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly['raw'].plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Understand summing over the nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly['raw'].isNull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_idx=pd.date_range('2020-05-01', '2020-07-09', freq='H')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(hourly_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.zeros((1,10)).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr=pd.DataFrame(np.zeros((1,len(hourly_idx))).T, index=hourly_idx, columns=['raw'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get first or last entry from index. https://stackoverflow.com/a/31269098/8928529\n",
    "\n",
    "It was easier to append columns to the data frame than to try an add them in a loop.\n",
    "Do the addition after the columns are built."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr=pd.DataFrame(np.zeros((1,len(hourly_idx))).T, index=hourly_idx, columns=['raw'])\n",
    "\n",
    "for num, entity in enumerate(df.entity.unique()):\n",
    "    if entity not in ['c0108', 'c0009']:\n",
    "        node_pwr=df[df.entity==entity].set_index(\"datetime\")\n",
    "        node_pwr=node_pwr[['raw']].resample('H').mean()\n",
    "        node_pwr=node_pwr['2020-05-01':'2020-07-09'].fillna(method=\"ffill\")\n",
    "        print(node_pwr)\n",
    "        missing = node_pwr['raw'].isnull().sum()\n",
    "        print(\"{}: {} missing {}\\n\".format(num, entity, missing))\n",
    "        if num < 149:\n",
    "            #hourly_pwr.add(node_pwr['2020-05-01':'2020-07-09'], ['raw'])#, axis='columns', fill_value=0.0)\n",
    "            #hourly_pwr+=node_pwr['2020-05-01':'2020-07-09']\n",
    "            # its easier to add columns and do the sum later\n",
    "            # https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/\n",
    "            hourly_pwr[entity]= node_pwr['2020-05-01':'2020-07-09']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr.sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hourly_pwr.sum(axis=1).plot()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}