Make Data Science an Essential Part of Your Advanced Data Quality Program

September 8, 2016

Data quality is key to the productivity of any business. Bad data can have a significant negative impact on decision-making, slow down progress, and cost a fortune to fix.

Based on work pioneered by Thomas C. Redman, CoreValue understands the importance of implementing an advanced data quality program that uses data quality metrics to identify areas for improvement and can ensure uninterrupted resiliency and enhanced data quality, especially in large environments, where even small wins can add up to large savings. To provide an effective program for its clients, CoreValue embraces the following elements:

  • Identify a problem
  • Identify the impact of the problem
  • Build a model for reprocessing data
  • Reintegrate data
  • Update all reports

To illustrate the point, one of CoreValue’s clients was faced with the need to continuously download and evaluate huge quantities of data. The data was generated by customers’ activities across a wide spectrum of services, as well as by on-premise equipment provided by the company. Due to the complexity and quantity of data processed daily, data processing often led to missed service level agreements (SLAs), which in turn hurt the client’s ability to understand and support its customers. It was critical for the client to establish effective data management and consistent reporting for all data sources.

By applying the five elements of an advanced data quality program noted above, CoreValue delivered a 360-degree view of the client’s data quality that identified data problems and their impact, allowed all data to be reprocessed and reintegrated, and updated all reports based on the newly cleaned data.

This also allowed us to address individual items in more depth. For example, data modeling encompassed the following:

  • Database capacity prediction. Predictive analysis uses statistical techniques and historical data to make predictions about future capacity. One of the most common methods employed in predictive modeling is linear regression. Unfortunately, applying regression is challenging because system behavior changes: administrators may change retention policies or simply delete data, which can lead to poor predictions. We obtained significantly more accurate models by finding the optimal subset of “clean” data for each database and applying linear regression to only that subset.
  • Automated anomaly detection. Because “big data” needs effective anomaly detection, we proposed enhancements that enable real-time anomaly identification. Using R and PostgreSQL, we built an alarm system to monitor jobs from Scheduler so that users could react to issues immediately. The alarm system used Storage, Model, and Shell script components to perform checks and sent an alarm whenever an anomaly was detected. In essence, these alarms monitor upper and lower thresholds for each module’s start time and duration for every weekday.
  • Uniform dashboards. By using Tableau for reporting, we created uniform dashboards for all systems so that correlations between different metrics became easier to understand. In the reports we monitor:

          – KPIs

          – Metrics crossing into unacceptable ranges

          – Unexpected changes or trends

          – Variance in metrics data

          – Sliced data by host, cluster, geographical tags, etc.
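The capacity prediction idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the model we built: it is written in Python rather than R, the helper names (fit_line, best_recent_window, days_until_full) and the sample numbers are hypothetical, and the "clean" subset is assumed to be the most recent stretch of history that a straight line fits best (highest R², longest window on ties).

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


def best_recent_window(days, used_gb, min_points=5):
    """Assumed selection rule: try every suffix of the history with at least
    min_points samples and keep the one with the highest R^2. The strict '>'
    keeps the earliest start (longest window) when scores tie."""
    best = None
    for start in range(len(days) - min_points + 1):
        slope, intercept, r2 = fit_line(days[start:], used_gb[start:])
        if best is None or r2 > best[3]:
            best = (start, slope, intercept, r2)
    return best


def days_until_full(slope, intercept, capacity_gb, today):
    """Days from 'today' until the fitted line reaches capacity_gb.
    Returns None when usage is flat or shrinking."""
    if slope <= 0:
        return None
    return (capacity_gb - intercept) / slope - today


# Hypothetical history: steady growth, a purge on day 10, then growth again.
days = list(range(20))
used_gb = [100 + 10 * d for d in range(10)] + [50 + 10 * d for d in range(10)]

start, slope, intercept, r2 = best_recent_window(days, used_gb)
remaining = days_until_full(slope, intercept, capacity_gb=500, today=days[-1])
print(f"fit from day {start}: {slope:.1f} GB/day, about {remaining:.0f} days left")
```

Fitting only the post-purge window keeps the deletion from flattening the slope. In practice the selection rule would likely need to be more robust than a plain R² comparison, for example by penalizing very short windows.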
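The per-weekday threshold check behind the alarm system can be illustrated in a similar way. The sketch below is an assumption-laden stand-in, not the R and PostgreSQL implementation: the function names (build_thresholds, check_run), the in-memory history, and the choice of mean ± 3 standard deviations as the upper and lower thresholds are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_thresholds(history, k=3.0):
    """history: iterable of (weekday, start_minute, duration_minutes),
    with weekday 0 = Monday and start_minute counted from midnight.
    Returns {weekday: {"start": (low, high), "duration": (low, high)}}.
    The mean +/- k * stdev rule is an assumption made for this sketch."""
    by_day = defaultdict(list)
    for weekday, start_min, duration_min in history:
        by_day[weekday].append((start_min, duration_min))

    thresholds = {}
    for weekday, runs in by_day.items():
        if len(runs) < 2:
            continue  # stdev needs at least two observations
        bounds = {}
        for name, values in (("start", [r[0] for r in runs]),
                             ("duration", [r[1] for r in runs])):
            centre, spread = mean(values), stdev(values)
            bounds[name] = (centre - k * spread, centre + k * spread)
        thresholds[weekday] = bounds
    return thresholds


def check_run(thresholds, weekday, start_min, duration_min):
    """Return a list of alarm messages; an empty list means no anomaly."""
    alarms = []
    bounds = thresholds.get(weekday)
    if bounds is None:
        return alarms  # no history for this weekday, nothing to compare
    for name, value in (("start", start_min), ("duration", duration_min)):
        low, high = bounds[name]
        if not low <= value <= high:
            alarms.append(
                f"{name}={value} outside [{low:.1f}, {high:.1f}] "
                f"for weekday {weekday}"
            )
    return alarms


# Hypothetical Monday history: the job starts near 02:00 and runs ~30 minutes.
monday_history = [(0, 120, 30), (0, 122, 32), (0, 118, 29),
                  (0, 121, 31), (0, 119, 28)]
limits = build_thresholds(monday_history)

for message in check_run(limits, weekday=0, start_min=121, duration_min=90):
    print("ALARM:", message)
```

Keeping separate bounds per weekday means a job that normally runs longer on, say, Mondays does not trigger false alarms, while the same duration on another day still would.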

As a result of the newly implemented data quality program, our client was able to save almost half a million dollars in processing time.

Of course, processing time depends on the complexity of the environment, the volume of data, and the overall amount of processing the data requires. Depending on these factors, a data quality program can save anywhere from a few hundred thousand dollars to millions.

In any case, data quality analysis is crucial to the success of any system of record used for official reporting.


Data Scientist, CoreValue




