What is data merging?
Data merging is the process of combining two or more similar records into a single one. Merging is done to add variables to a dataset, append or add cases or observations to a dataset, or remove duplicates and other incorrect information.
Done correctly, this process makes it easier and faster to analyze data stored in multiple locations, worksheets, or data tables. Merging data into a single point is necessary in certain situations, especially when an organization needs to add new cases, variables, or data based on the lookup values. However, data merging needs to be performed with caution; otherwise, it can lead to duplication, inaccuracy, or inconsistency issues.
Data from multiple sources is merged in a number of scenarios:
- Digital transformation initiatives
- Driving business intelligence
- Integration after mergers and acquisitions, when data from different organizations is merged into one dataset
- Combining data from different applications, including customer relationship management, marketing automation, and website analytics tools, for analysis, processing, and predictions
Text files, comma-separated value files, Excel sheets, structured query language datasets, and other digital file formats are merged into a single system for hosting and management. This is done to enable automated workflows, enhance search capability, and control information access. The merged system must also be made compatible with existing structures, newly merged projects, legacy systems, and internal workflows.
Stages of data merging
The process of merging can be categorized into three stages: pre-merging, merging, and post-merging.
1. Pre-merging process
Data Profiling: Before merging, it is crucial to profile the data by analyzing the structure and contents of each data source. This step helps an organization understand the outcomes of merging and prevent potential errors. Data profiling consists of two important steps:
- Analyzing the list of attributes that each data source possesses. This step helps an organization understand how the merged data will scale, what attributes are intended to be merged, and what may need to be supplemented.
- Analyzing the data values in each attribute of a source to assess the completeness, uniqueness, and distribution of those attributes. In short, a data profile validates attributes against a predefined pattern and helps to identify invalid values. A minimal profiling sketch follows this list.
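As an illustration of these two profiling steps, the sketch below compares attribute lists and summarizes completeness and uniqueness using Python and pandas. The file and column names are hypothetical stand-ins for whatever the actual sources are.

```python
import pandas as pd

# Hypothetical source files; substitute the real sources being profiled.
source_a = pd.read_csv("customers_a.csv")
source_b = pd.read_csv("customers_b.csv")

# Step 1: compare the attribute lists to see which columns line up
# after merging and which will need to be supplemented.
print("Only in A:", set(source_a.columns) - set(source_b.columns))
print("Only in B:", set(source_b.columns) - set(source_a.columns))

# Step 2: profile completeness, uniqueness, and distribution per attribute.
for name, df in [("A", source_a), ("B", source_b)]:
    profile = pd.DataFrame({
        "non_null": df.notna().sum(),
        "null_pct": df.isna().mean().round(3),
        "unique": df.nunique(),
    })
    print(f"Profile of source {name}:\n{profile}\n")
```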
Standardize and Transform Data: Data sources may contain incomplete and invalid values, and such datasets cannot be merged before they are standardized. In addition to errors, data attributes from different sources may contain the same information, such as customer names, yet the format of these data values can be entirely different. Due to these lexical and structural differences, data loss and errors may occur. In order to standardize data, certain factors need to be managed, as sketched in the example after this list.
- Non-printable characters, void values, and trailing spaces should be replaced with valid values. An example is not allowing more than one consecutive space when data is entered, or reducing all multiple spaces to one when transforming data.
- To standardize long fields of data, records should be parsed into smaller parts across the different source files. This helps to ensure that data accuracy is maintained even after the data sources are merged.
- Constraints for integration should be defined. For example, the maximum or minimum number of characters in a certain field should be defined, or a hyphenated surname should contain no spaces.
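One possible way to apply these standardization rules with pandas is sketched below; the customer_name column, the surname rule, and the 30-character limit are illustrative assumptions rather than fixed requirements.

```python
import pandas as pd

df = pd.read_csv("customers_a.csv")  # hypothetical source file

# Collapse runs of whitespace to a single space, strip leading/trailing spaces,
# and treat empty strings as proper nulls.
df["customer_name"] = (
    df["customer_name"]
      .str.replace(r"\s+", " ", regex=True)
      .str.strip()
      .replace("", pd.NA)
)

# Parse a long field into smaller parts (split the name once, on the first space).
parts = df["customer_name"].str.split(" ", n=1, expand=True).reindex(columns=[0, 1])
df["first_name"] = parts[0]
df["last_name"] = parts[1]

# Enforce simple integration constraints: hyphenated surnames contain no spaces,
# and the field is capped at an assumed maximum of 30 characters.
df["last_name"] = (
    df["last_name"]
      .str.replace(r"\s*-\s*", "-", regex=True)
      .str.slice(0, 30)
)
```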
Data Filtering: A part or subset of the original data sources can be merged instead of an entire data source. Such horizontal slicing of the data is done when only data from a constrained period of time needs to be merged, or when only a subgroup of rows meets the conditional criteria. Vertical slicing, which removes attributes (columns), is done when a data source contains attributes that do not hold any valuable information.
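A brief pandas sketch of both kinds of filtering, assuming an orders file with an order_date column and a couple of low-value columns; all names are placeholders.

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical file

# Horizontal slicing: keep only the rows that fall in a constrained time period.
recent = orders[orders["order_date"].between("2023-01-01", "2023-12-31")]

# Vertical slicing: drop attributes that hold no valuable information.
trimmed = recent.drop(columns=["internal_notes", "legacy_flag"], errors="ignore")
```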
Data Uniqueness: Often, the information about a single entity is stored across a number of sources. Merging data becomes more complex if the datasets contain duplicates. Therefore, before beginning any merging process, it is important to run data matching algorithms. These apply conditional rules to identify and delete duplicates, ensuring that records are unique across all sources.
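Real data matching algorithms are often fuzzy, but a minimal rule-based sketch with pandas might look like the following; the email and updated_at columns are assumptions used to define the conditional rule.

```python
import pandas as pd

customers = pd.read_csv("customers_a.csv")  # hypothetical source

# Conditional rule: records with the same normalized e-mail address are treated
# as duplicates of one entity (production matching is usually more sophisticated).
customers["email_key"] = customers["email"].str.strip().str.lower()

# Keep the most recently updated record per entity and delete the rest.
unique_customers = (
    customers.sort_values("updated_at", ascending=False)
             .drop_duplicates(subset="email_key", keep="first")
)
```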
2. Merging: Integration and aggregation of data
The process of merging can either be an integration or an aggregation. Once all the previous steps have been completed, the data is ready for merging. There are a number of ways this process can be achieved.
Depending on the specific use case, rows, columns, or both can be appended. This can be quite simple if the datasets are reasonably complete and do not contain many null values, but problems arise when there are gaps in the datasets that need to be looked up and filled; in those cases, conditional merging techniques are often used to bring the data together. It is also possible to perform a conditional merge first, and then finish the merge by appending columns and rows.
Append Rows: Appending rows is done when records sourced from different datasets need to be combined in one place. The data sources to be joined need to have an identical structure, and the data types, pattern validations, and integrity constraints of corresponding columns need to be the same to avoid invalid formatting problems. If data about the same entity comes from different sources, data matching should be performed before or during the merge.
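For example, appending rows from two identically structured sources can be sketched with pandas as follows; the file names are hypothetical.

```python
import pandas as pd

crm = pd.read_csv("crm_contacts.csv")    # hypothetical source
web = pd.read_csv("web_signups.csv")     # hypothetical source

# The sources must share an identical structure; check before appending.
assert list(crm.columns) == list(web.columns), "column structures differ"

# Stack the records from both sources into a single dataset.
combined = pd.concat([crm, web], ignore_index=True)
```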
Appending Columns: This process is done when more dimensions need to be added to an existing record. In such scenarios, all columns from different sources must be made unique. Every single record should be uniquely identifiable across all sets of data, making it easier for records with the same identifier to be merged. If the merging column does not contain any data, then null values should be specified for all records from that dataset. However, if many datasets contain the same or highly similar dimension information, then these dimensions can be merged together in one field.
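A minimal sketch of appending columns with pandas, assuming customer_id is the unique identifier shared by both sources; validate="one_to_one" simply raises an error if either side turns out not to be unique.

```python
import pandas as pd

profiles = pd.read_csv("customer_profiles.csv")  # hypothetical source
spend = pd.read_csv("customer_spend.csv")        # hypothetical source

# Append new dimensions keyed on the unique identifier; records without a match
# in the second source receive null values for the new columns.
enriched = profiles.merge(
    spend, on="customer_id", how="left", validate="one_to_one"
)
```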
Conditional Merging: Conditional merging is used when incomplete datasets need to be united. In this type of merge, values from one dataset are looked up and used to fill in the other datasets accordingly. The source dataset from which values are looked up should contain only unique records, whereas the target dataset, to which data is being appended, does not need to have unique values.
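One way to express a conditional merge in pandas is to look values up from the unique source and fill only the gaps in the target; the region and customer_id columns are assumed for illustration.

```python
import pandas as pd

target = pd.read_csv("orders.csv")        # hypothetical target with missing regions
reference = pd.read_csv("regions.csv")    # hypothetical lookup source

# The lookup source must hold one row per key; the target may repeat keys.
lookup = (
    reference.drop_duplicates(subset="customer_id")
             .set_index("customer_id")["region"]
)

# Fill only the missing region values in the target by looking them up by key.
target["region"] = target["region"].fillna(target["customer_id"].map(lookup))
```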
3. Post-merging process
Once the merge process is finished, it is important to do a final profile audit of the merged source, as is done at the start of the merge process. This will help find errors that may have occurred during the merge. Any inaccurate, incomplete, or invalid values can also be spotted.
Challenges during the data merging process
There are several challenges an organization may face during the data merge process.
The lexical and structural differences across datasets can make it difficult to merge without error.
- Structural heterogeneity occurs when datasets do not contain the same type or number of attributes (columns).
- Lexical heterogeneity happens when fields from different datasets are structurally the same but represent the same information in different ways, for example, different formats for names or dates.
Another major issue is scalability. Data merges are usually planned and carried out around the current number and types of sources rather than the ability to scale up. Over time, systems will be required to integrate more data sources with a wider range of structures and storage mechanisms. To overcome this, an organization must design a system that is scalable in size, structures, and mechanisms. Instead of hardcoding the integration as a fixed process, the data integration system needs to be reusable, with a scalable architecture.
There is also the challenge of data duplication. There are different ways in which data duplication can happen in the dataset. To start with, there may be multiple records of the same entity. Further, there may be many attributes storing exactly the same information about an individual entity. These duplicate attributes or records can be found in the same dataset or across multiple datasets. The solution to this problem is using data matching algorithms and conditional rules.
Lengthy merging processes are another common issue. Data integration projects often take far longer than anticipated. This can be prevented through more realistic planning, involving data merging experts in the planning stage, and avoiding last-minute changes and amendments. Scope creep should also be avoided.
Types of merging
There is a range of merge options, depending on the organization's legacy datasets and software options.
One-to-one merge
One-to-one is the most basic and simple type of merge. Both the master dataset and the dataset being merged should have a common key variable to enable a smooth merge. Variables in both datasets will be merged into the same file, which may include missing values for observations that do not match in both datasets.
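A one-to-one merge sketch in pandas, assuming customer_id is the common key variable; an outer join keeps observations that appear in only one dataset, with missing values for the other side.

```python
import pandas as pd

master = pd.read_csv("master.csv")   # hypothetical master dataset
using = pd.read_csv("using.csv")     # hypothetical dataset being merged

# Each key appears at most once in both datasets; validate="one_to_one"
# raises an error if that assumption is violated.
merged = master.merge(using, on="customer_id", how="outer", validate="one_to_one")
```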
Many-to-one merge
Many-to-one merges contain duplicate entries in one of the two key columns: each unique identifier corresponds to exactly one row in the master dataset, but may correspond to multiple rows in the dataset being merged.
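A many-to-one sketch with pandas, assuming transactions repeat the key while the customer master holds one row per key; validate="many_to_one" enforces that assumption.

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")   # many rows per customer (hypothetical)
master = pd.read_csv("customer_master.csv")      # one row per customer (hypothetical)

# Each identifier may repeat in the transactions but must be unique in the master.
merged = transactions.merge(
    master, on="customer_id", how="left", validate="many_to_one"
)
```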
Many-to-many merge
Many-to-many joins are complex compared to the other merges. They are done when the key column contains duplicates in both the left and right datasets, so every combination of matching rows is produced. How the dataset values will be used should be mapped out carefully in this case.
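The tiny pandas example below shows why this mapping matters: with duplicates on both sides, every matching pair of rows is produced, so the result can grow quickly. The data is invented purely to show the row counts.

```python
import pandas as pd

# Invented datasets where course_id repeats on both sides.
students = pd.DataFrame({"course_id": [1, 1, 2], "student": ["Ann", "Bo", "Cy"]})
teachers = pd.DataFrame({"course_id": [1, 1, 2], "teacher": ["Dee", "Eli", "Fay"]})

# Every combination of matching rows is produced: course 1 yields 2 x 2 = 4 rows.
merged = students.merge(teachers, on="course_id", how="inner")
print(len(merged))  # 5 rows: 4 for course 1, 1 for course 2
```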
New case merging
The merging of new cases is done by appending data in different ways. It can be achieved by adding more rows of data in every column. This is possible when the two files' variables are the same: if a variable is numeric in one file, it needs to be numeric in the other, not numeric in one and a string in the other. In an automated merge, the tool matches the data on the name of the variable, which makes it important to use the same variable names in the two files. If one file has a variable that does not have a match in the other, missing or blank values will be inserted accordingly.
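The small invented example below illustrates that variable-name matching behaviour: the second file lacks the age variable, so blank values are inserted for its cases.

```python
import pandas as pd

# Invented files: wave2 lacks the "age" variable present in wave1.
wave1 = pd.DataFrame({"id": [1, 2], "score": [80, 75], "age": [34, 29]})
wave2 = pd.DataFrame({"id": [3, 4], "score": [90, 65]})

# Columns are matched by variable name; unmatched variables are filled with
# blank (NaN) values for the appended cases.
combined = pd.concat([wave1, wave2], ignore_index=True)
print(combined["age"].tolist())  # [34.0, 29.0, nan, nan]
```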
Merging new variables
When merging new variables, while the identifiers for each case in both files are required to be the same, the variable names need to be different. This process is also called augmenting the data. In structured query language, it is done using the join function. When merging data column by column or adding more columns of data to each row, the user adds new variables for each current case in the data file.
When new variables are merged and not all of the variables are present, the missing cases should be filled with blank values, as when merging new cases. If there are new files with both new variables and new cases, the merge depends on which software is being used; some tools cannot handle merging cases and variables simultaneously. In such scenarios, first augment the data by merging in only the new variables; then, the new cases can be appended across all variables.
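A sketch of that two-step order in pandas, with hypothetical file and column names: the new variables are merged in first by the shared identifier, and the new cases are appended afterwards.

```python
import pandas as pd

existing = pd.read_csv("survey_wave1.csv")      # hypothetical current data file
new_vars = pd.read_csv("followup_scores.csv")   # same cases, new variables
new_cases = pd.read_csv("survey_wave2.csv")     # new cases, same variables

# Step 1: augment the data, adding the new variables by the shared identifier.
augmented = existing.merge(new_vars, on="respondent_id", how="left")

# Step 2: append the new cases; variables missing from either file become blanks.
combined = pd.concat([augmented, new_cases], ignore_index=True)
```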
Merging data using lookup values
Merging works best when there are complete, whole datasets to be combined. However, when data must be augmented with information from other datasets or sources, certain factors must be considered. Organizations must match the data in one file with the corresponding values in the other. The lookup code is used as an identifier, and values are added as new variables in the data file. The data is paired up for each case using the lookup code, and the data from the original file is then augmented with the merging data for the matching lookup code.
This lookup code should be unique in the file containing the additional data, but the same value can appear many times in the file that needs to be augmented.
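A lookup-based merge can be sketched in pandas as follows; product_code is an assumed lookup code that is unique in the lookup file but may repeat in the file being augmented.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")             # file to be augmented (codes may repeat)
products = pd.read_csv("product_lookup.csv")   # lookup file (one row per code)

# Pair each order with the product attributes that share its lookup code,
# adding them as new variables in the data file.
augmented = orders.merge(
    products, on="product_code", how="left", validate="many_to_one"
)
```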
Practices that ensure smooth data merging
Merging and appending can be quite a complex process. However, by following certain practices, the process can be less prone to error.
- Before initiating a data integration or merge, the organization must invest time in evaluating the data sources and deciding where to host the merged data. This can make the process faster, more accurate, and more efficient.
- Make sure the variables are unique in both files when appending and augmenting data.
- Automated software and integration tools are available that make the process of merging and integration much easier and more accurate.
Data merging is an essential but complex process. However, when done correctly, with planning, care, and the right solution, it results in datasets that are cleansed of duplication, errors, and other problems, providing a single source of truth and enabling a data-driven analytics business. These are typically the results that can be expected from implementing a master data initiative.