Dirty data is a term used to describe data that is incomplete, incorrect, or inconsistent. It can lead to bad business decisions, missed opportunities, lost revenue, and dissatisfied customers.
Data cleansing is an essential process for any organization that uses data to make decisions. It helps to ensure that the data is accurate and up-to-date, and reduces the risk of errors in decision-making.
The need for timely, trusted data in a centralized location for reporting and analytics is more important now than ever before. The relentless move to AI and the petabytes of data feeding into machine learning models have put a spotlight on data quality. This is especially true as unstructured data is added to the mix.
Proper data integration is fundamental to enterprise information management, data warehousing, and lakehouse initiatives. It must continue throughout the system lifecycle. The increased use of AI and machine learning raises the expectation of seamless visibility into data insights. This is why Destiny focuses on data quality, as it is the cornerstone of any AI or machine learning initiative.
Organizations need to ask the question, ‘Is the information correct?’
Do legacy IT processes continue to produce accurate results?
Many clients have programs written for a set of quick reporting numbers, without full validation of data or logic. System documentation is out of date and required updates are difficult to implement.
Even if the data is correct, results produced by unvetted algorithms and reports are unreliable, even though they may look correct.
6 Steps to Cleaner Systems
There are several standard practices that are used in industry to assist with data cleansing.
Step 1: Remove Duplicate or Unneeded Rows
When data sets are combined from multiple sources, scraped, or received from a client or from multiple individuals in the organization, there is a high probability of data duplication. The deduplication pass may also remove rows that fall outside the desired range of values.
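As a minimal sketch of this step, the Python below keeps the first row seen for each key and drops rows whose value falls outside a chosen range. The function name, the field names (`customer_id`, `balance`), and the range limits are hypothetical and chosen only for illustration; a real system would pick keys and limits from its own business rules.

```python
def remove_duplicates_and_out_of_range(rows, key_fields, value_field, low, high):
    """Keep the first in-range row per key; drop duplicates and out-of-range rows."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue  # duplicate of a row that was already kept
        if not (low <= row[value_field] <= high):
            continue  # value is outside the desired range
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample data combined from two sources.
rows = [
    {"customer_id": 1, "source": "crm", "balance": 250.0},
    {"customer_id": 1, "source": "web", "balance": 250.0},  # same customer again
    {"customer_id": 2, "source": "crm", "balance": -40.0},  # outside 0..1000
    {"customer_id": 3, "source": "web", "balance": 900.0},
]

cleaned = remove_duplicates_and_out_of_range(
    rows, key_fields=["customer_id"], value_field="balance", low=0.0, high=1000.0
)
print([row["customer_id"] for row in cleaned])
```

Keeping the first occurrence is only one possible policy; keeping the most recent row or merging fields from both copies may suit some data sets better.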
Step 2: Fix Data Errors and Anomalies
Data errors show up as misspellings, word fragments, and inconsistent capitalization or abbreviation. These values usually mean the data was read incorrectly or was not stored uniformly. For example, one address has “Lane” while another has “Ln”.
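The “Lane” versus “Ln” case can be handled with a small normalization pass. The sketch below is illustrative only: the `SUFFIXES` table and the `normalize_address` function are hypothetical, the table is far from complete, and simple capitalization will mishandle names such as “McDonald”.

```python
import re

# Hypothetical, deliberately tiny mapping of abbreviations to full words.
SUFFIXES = {"ln": "Lane", "st": "Street", "rd": "Road", "ave": "Avenue"}


def normalize_address(address):
    """Collapse whitespace, expand known abbreviations, and capitalize words."""
    words = re.sub(r"\s+", " ", address.strip()).split(" ")
    normalized = []
    for word in words:
        bare = word.rstrip(".").lower()  # "LN." and "ln" both become "ln"
        normalized.append(SUFFIXES.get(bare, word.capitalize()))
    return " ".join(normalized)


print(normalize_address("12 oak LN."))
print(normalize_address("  12  Oak lane "))
```

Both calls produce the same string, so the two spellings can then be compared or deduplicated as one value.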
Step 3: Filter Outliers
Collecting a specific subset of data may pull in extra rows, such as undesired endpoints and unexpected rows that a frequency analysis of the source data would have revealed. An outlier may not be incorrect, but it can be excluded if it does not belong in the analysis.
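One common way to flag outliers, shown here as an assumption rather than as the method used in any particular project, is the interquartile range rule: values more than 1.5 times the interquartile range below the first quartile or above the third quartile are set aside for review. The function name and sample values are hypothetical.

```python
import statistics


def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the interquartile range rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not (low <= v <= high)]
    return kept, outliers


# Hypothetical processing times in seconds; one value stands far from the rest.
values = [10, 12, 11, 13, 12, 11, 95]
kept, outliers = split_outliers(values)
print(kept, outliers)
```

The outliers are returned rather than silently discarded, since an outlier may be a valid value that simply does not belong in this analysis.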
Step 4: Ensure Join Accuracy
A logical row of data is often created by combining data elements from multiple tables. It is possible to create problem data if many-to-many keys are the basis of a join. Care must be taken to correctly select the intersection or union of rows. Without correct logic, the results will be unusable.
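A simple guard against many-to-many joins is to check key uniqueness before joining. The sketch below, with hypothetical function and field names, performs an inner join (the intersection of rows) and refuses to run when the same key value repeats on both sides.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the set of key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {value for value, n in counts.items() if n > 1}


def safe_inner_join(left, right, key):
    """Inner join that raises if the key is many-to-many."""
    many_to_many = duplicate_keys(left, key) & duplicate_keys(right, key)
    if many_to_many:
        raise ValueError(f"many-to-many join on {key!r}: {sorted(many_to_many)}")
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Only keys present on both sides survive (intersection).
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical tables: one row per customer, several accounts per customer.
customers = [{"customer_id": 1, "name": "Ann"}, {"customer_id": 2, "name": "Bo"}]
accounts = [
    {"customer_id": 1, "account": "A-1"},
    {"customer_id": 1, "account": "A-2"},
    {"customer_id": 3, "account": "A-3"},
]

joined = safe_inner_join(customers, accounts, "customer_id")
print(len(joined))
```

A one-to-many join such as this one is allowed. If the customer table also contained a second row for customer 1, the row count would multiply silently, which is the situation the check is meant to catch.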
Step 5: Resolve Missing Data
Many algorithms will not accept missing values. There are several ways to deal with missing data.
1. Remove observations that have missing values, although this results in a loss of information.
2. Substitute average or assumed values for the missing ones, although these assumptions reduce data integrity.
3. Keep the data as it is and use logic that handles null values explicitly.
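The three options can be compared side by side in a short sketch. The function names, the `income` field, and the `imputed` flag are hypothetical; the mean is used as the substituted value only as an example of an assumed value.

```python
import statistics


def drop_missing(rows, field):
    """Option 1: remove observations where the field is missing."""
    return [row for row in rows if row[field] is not None]


def impute_mean(rows, field):
    """Option 2: fill missing values with the mean and flag the filled rows.

    Assumes at least one row has a value for the field.
    """
    mean = statistics.fmean(row[field] for row in rows if row[field] is not None)
    return [
        {**row, field: mean if row[field] is None else row[field],
         "imputed": row[field] is None}
        for row in rows
    ]


def mean_ignoring_missing(rows, field):
    """Option 3: keep all rows and skip nulls inside the calculation."""
    present = [row[field] for row in rows if row[field] is not None]
    return statistics.fmean(present) if present else None


rows = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000},
]

print(len(drop_missing(rows, "income")))
print(impute_mean(rows, "income")[1])
print(mean_ignoring_missing(rows, "income"))
```

Flagging imputed rows keeps the assumption visible downstream, which limits the loss of data integrity described in option 2.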
Step 6: Validation and Quality Assurance
Once the data and logic improvements are complete, the results should be validated by comparing them against comparable sources within the systems and against objective, proven values. This should be an ongoing effort rather than a one-time exercise.
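One way to make this check repeatable is to reconcile the cleaned data against control figures from a trusted source, such as a row count and a control total. The sketch below assumes such figures exist; the function name, the `amount` field, and the control values are hypothetical.

```python
import math


def reconcile(rows, field, control_row_count, control_total, rel_tol=1e-9):
    """Compare a row count and a column total against trusted control values."""
    actual_total = math.fsum(row[field] for row in rows)
    return {
        "row_count_ok": len(rows) == control_row_count,
        "total_ok": math.isclose(actual_total, control_total, rel_tol=rel_tol),
        "actual_total": actual_total,
    }


# Hypothetical cleaned rows and control figures from a trusted ledger.
rows = [{"amount": 100.10}, {"amount": 200.20}, {"amount": 300.30}]
report = reconcile(rows, "amount", control_row_count=3, control_total=600.60)
print(report["row_count_ok"], report["total_ok"])
```

Running a check like this on a schedule, rather than once, fits the ongoing nature of validation.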
Bank Success Story
With more than 210 billion pounds in assets, this bank suffered frequent server crashes, lacked standard processes and logic, and ran many long-running jobs (hours or days) that often failed. The problems in its data led to problems in its reporting, which prompted a ‘Dear CEO’ letter from the regulators. Although the organization had struggled with these issues for years, it was the threat of fines and loss of employment at the executive level that got it to engage in a proper solution.
Destiny designed and implemented superior infrastructure as a starting point. The second phase examined the application logic, which uncovered many issues. The code was incorrect and there were data problems. Instead of bringing in a large outsourced team for manual code review, which is costly and time-consuming, Destiny used a small team for automated analysis and process change.
In total, 22 million lines of code were found, analyzed, and corrected based on client-provided business rules. All automatically modified code was tested for accuracy and quality. Very little business user intervention was required. Jobs ran five times faster on a single, large server instead of several that constantly crashed. The proper infrastructure design yielded increased throughput of job processing. Job completion times fell from hours and days to a few minutes. Job crashes ceased. The project was completed within a year with minimal consulting hours.