When System Performance meets Data Science
February 3, 2016


Data Science combines mathematics, statistics, and programming, applied to collected data, together with the work of cleaning, preparing, and staging that data. In a few words, it is the scientific approach to extracting knowledge from data. Ideally, the knowledge extracted is aligned with business needs and of real value!

Done correctly, Data Science provides actionable, valuable intelligence from massive volumes of data and delivers the predictive and prescriptive analytics that help organizations make better decisions.

CoreValue has been applying Data Science in a number of scenarios to help our clients gain actionable insights. In one instance, our data team was asked to analyze a complicated data pipeline for a large organization. The client struggled to meet its daily data delivery SLAs because various nightly data cleansing and processing jobs ran long, failed altogether, or behaved in unexpected ways. Over time, as scale and complexity increased with computing moving to the cloud and the adoption of microservice architectures, their existing monitoring techniques and tools needed to be extended.

Challenges: data comes from several different sources (Netezza, Oracle, RedShift, and PostgreSQL), yet reporting should be uniform across all of them. With many metrics, applications, and high-performance systems, keeping track of performance became a difficult task.

Outcomes from the Data Science Engagement:   

  • Database Capacity Prediction. Predictive analysis, applying statistical techniques to historical data, was used to forecast future capacity.
    One of the most common methods in predictive modeling is linear regression. Unfortunately, applying regression to storage capacity time series is challenging because behavior changes over time: system administrators may change retention policies, or simply delete data. Blindly applying regression to the entire data set therefore often leads to poor predictions. Significantly more accurate models were obtained by finding the optimal subset of “clean” data for each database and applying linear regression to only that subset. Capacity prediction is performed for databases in the Netezza and RedShift clusters. The R programming language was used for modeling, and Tableau was used to present the results on a weekly basis.
  • Automated anomaly detection. Big data needs effective detection of anomalies, i.e. deviations from what past history suggests, so that engineers can focus on real issues. We therefore proposed enhancements that enable real-time identification of anomalies. The resulting Alarm System lets users react to issues immediately. It was built for a set of jobs from the Appworx Scheduler and has three main components: storage, a model, and a shell script that performs the check and sends an alarm if an anomaly is detected. PostgreSQL serves as the central repository for historical data, the model, and the alarm history. The R programming language (packages dplyr, tidyr, lubridate, foreach, jsonlite, and PivotalR) is used for modeling. The idea behind the model is to build upper and lower thresholds for each module’s start time and duration for every weekday; this works because there is no visible trend in the data. The model can find start-time and duration thresholds for a single job, for a sequence of jobs, or for jobs executed in parallel. For every weekday, a number of historical observations is selected and statistics are calculated. We then search the given sequence to pick the parameters that minimize errors, that is, maximize true positives and minimize false alerts. The shell script is scheduled to run every 5 minutes from 12 am to 9 am.
  • Reports. We created meaningful, uniform dashboards for all systems that make it possible to understand correlations between different metrics. In the reports we monitor:
      • KPIs
      • Metrics crossing into unacceptable ranges
      • Unexpected changes or trends
      • Variance in metrics data
      • Data sliced by host, cluster, geographical tags, etc.
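The capacity-prediction idea described above, fitting linear regression only to a “clean” subset of the time series, can be sketched as follows. This is an illustrative Python sketch, not the actual R model: the drop-detection heuristic, the 5% threshold, and all function names are assumptions for illustration.

```python
# Sketch: find the "clean" recent subset of a capacity time series (the run
# since the last large drop, e.g. after a retention-policy change or bulk
# delete) and fit a linear trend to it to predict future capacity.

def clean_subset(usage, drop_threshold=0.05):
    """Return the start index of the clean window: the run of observations
    since the last drop larger than drop_threshold (fractional)."""
    start = 0
    for i in range(1, len(usage)):
        if usage[i] < usage[i - 1] * (1 - drop_threshold):
            start = i  # restart the clean window after every large drop
    return start

def fit_trend(usage):
    """Ordinary least-squares slope/intercept over the clean subset."""
    s = clean_subset(usage)
    ys = usage[s:]
    xs = list(range(len(ys)))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return s, slope, intercept

def predict(usage, days_ahead):
    """Extrapolate capacity days_ahead past the last observation."""
    s, slope, intercept = fit_trend(usage)
    x_future = (len(usage) - s - 1) + days_ahead
    return slope * x_future + intercept

# Example: steady growth, a bulk delete, then growth resumes. Regression over
# the whole history would be skewed by the drop; the clean subset is not.
history = [100, 110, 120, 130, 60, 70, 80, 90, 100]
print(predict(history, days_ahead=3))  # prints 130.0
```

Regression over only the post-drop window is what makes the forecast track the current growth rate rather than averaging across a deletion event.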

Reports are built in Tableau and distributed to a mailing list on a daily basis.

In sum, we proposed a full cycle of system-performance data analysis: data collection, data preprocessing, data modeling and threshold identification, anomaly notification and alarms, and data visualization and reporting.
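The per-weekday threshold model used for anomaly detection can be sketched roughly like this. The sketch is illustrative only: the original model was built in R, and the mean ± k·stdev band below is an assumed statistic standing in for the unspecified thresholds, with made-up job durations.

```python
# Sketch of the per-weekday threshold model: for each weekday, compute upper
# and lower bounds on a job's duration from historical runs, then flag a new
# run as anomalous if it falls outside the band. The same idea applies to
# start times.

from statistics import mean, stdev

def weekday_thresholds(history, k=3.0):
    """history: dict weekday -> list of past durations (minutes).
    Returns dict weekday -> (lower, upper) using mean +/- k * stdev."""
    thresholds = {}
    for day, durations in history.items():
        m, s = mean(durations), stdev(durations)
        thresholds[day] = (m - k * s, m + k * s)
    return thresholds

def is_anomaly(thresholds, day, duration):
    """True if the observed duration falls outside that weekday's band."""
    low, high = thresholds[day]
    return not (low <= duration <= high)

# Illustrative history: Monday runs cluster near 31 min, Tuesday near 45 min.
history = {
    "Mon": [30, 32, 31, 29, 30, 33],
    "Tue": [45, 44, 46, 47, 45, 44],
}
t = weekday_thresholds(history)
print(is_anomaly(t, "Mon", 31))   # within the band -> False
print(is_anomaly(t, "Mon", 55))   # far above the band -> True
```

In production such a check would run from a scheduler (the article's shell script runs every 5 minutes between 12 am and 9 am) and write any alarm back to the PostgreSQL repository; the width factor k is the kind of parameter tuned to maximize true positives while minimizing false alerts.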



