What is data masking?
Data masking is a data security technique that scrambles data to create an inauthentic copy for various non-production purposes. Data masking retains the characteristics and integrity of the original production data and helps organizations minimize data security issues while utilizing data in a non-production environment. This masked data can be used for analytics, training, or testing.
A simple example of data masking is hiding personally identifiable information. Assume that an organization has an employee table in its database that stores the employee ID and full name of each employee. Through data masking, the organization might create a replica of the database in which each real name is replaced with a common first and last name.
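As a minimal illustration, the Python sketch below masks such an employee table. The records and the replacement name pool are hypothetical; any masking tool would do something similar under the hood:

```python
# Minimal sketch: replace each employee's real name with a generic one.
# The employee records and the replacement name pool are hypothetical.
import random

employees = [
    {"employee_id": 101, "full_name": "Alice Johnson"},
    {"employee_id": 102, "full_name": "Rahul Mehta"},
]

COMMON_NAMES = ["Jordan Smith", "Taylor Brown", "Casey Miller"]

masked = [
    {"employee_id": e["employee_id"], "full_name": random.choice(COMMON_NAMES)}
    for e in employees
]
print(masked)  # same structure as the original table, inauthentic names
```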
Why do organizations need data masking?
In recent years, data security regulations have become very strict. The introduction of regulations like the General Data Protection Regulation (GDPR) has forced organizations to protect their data fiercely. This has significantly restricted how organizations can use their own data for testing or analysis.
Assume that a healthcare company wants to study its customers' behavior. It might want to outsource the analytics job to a third-party vendor. If it passes the authentic health information of its customers to the vendor, there is a chance of a data breach. Data masking helps in such scenarios.
Data is one of the most significant assets of an organization. Data masking helps organizations extract the maximum benefits of data without compromising data security.
What are the common data masking methods?
Substitution
In the substitution method, the original value in a data record is replaced with an inauthentic one. For example, in a customer database, every male name might be replaced with one standard value and every female name with another. Substitution ensures that the format of the inauthentic data is precisely the same as that of the original data. In this example, the masking system also maintains the male-female customer ratio by substituting male and female names separately.
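A minimal Python sketch of gender-aware substitution, using hypothetical customer records and made-up name pools (a slight variation that draws from small pools rather than a single standard value, to keep the names varied):

```python
# Sketch of gender-aware substitution: male names are drawn from a male
# pool and female names from a female pool, so the male-female ratio of
# the original table is preserved. All data here is hypothetical.
import random

MALE_POOL = ["James Walker", "Daniel Reed"]
FEMALE_POOL = ["Emma Clarke", "Olivia Hayes"]

customers = [
    {"name": "John Doe", "gender": "M"},
    {"name": "Jane Roe", "gender": "F"},
]

for row in customers:
    pool = MALE_POOL if row["gender"] == "M" else FEMALE_POOL
    row["name"] = random.choice(pool)

print(customers)  # gender distribution is unchanged
```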
Shuffling
This is a common data masking technique in which the values in a column of a database table are shuffled vertically. To mask a table that stores the balance of each bank account, we randomly shuffle the account balance column; each account number then carries a random balance rather than its authentic one. An advantage of shuffling is that the aggregate value of the column remains the same even after data masking.
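A minimal Python sketch of shuffling, with hypothetical account data, showing that the column aggregate survives the mask:

```python
# Sketch of shuffling: the balance column is shuffled vertically, so each
# account ends up with some other account's balance. The column total
# (aggregate) is unchanged. Account data is hypothetical.
import random

accounts = ["ACC-1", "ACC-2", "ACC-3", "ACC-4"]
balances = [120050, 9875, 5600000, 43010]  # balances in cents

shuffled = balances[:]        # work on a copy of the column
random.shuffle(shuffled)      # vertical shuffle

masked_table = list(zip(accounts, shuffled))
print(masked_table)
print(sum(shuffled) == sum(balances))  # True: the aggregate is preserved
```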
Averaging
Averaging replaces all the numerical values in a table column with an average value. In the account balance example above, each account balance is replaced by the average of all balances. This makes it impossible to find out the balance of individual accounts. This process also maintains the aggregate value.
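A short Python sketch of averaging, reusing the hypothetical balances from the shuffling example:

```python
# Sketch of averaging: every balance is replaced by the column mean,
# hiding individual values while keeping the aggregate (sum) intact.
balances = [120050, 9875, 5600000, 43010]  # hypothetical balances, in cents

average = sum(balances) / len(balances)
masked = [average] * len(balances)

print(masked)                        # every account shows the same value
print(sum(masked) == sum(balances))  # True: the aggregate is preserved
```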
Redaction and nulling
Redaction is the most straightforward data masking method: the sensitive data is replaced with a generic value like “X.” It is commonly used to mask phone numbers or credit card numbers. Nulling is a similar process, but instead of a generic value, a NULL is placed in the data field. These methods have drawbacks: nulling may lead to data inconsistencies, and both make it obvious that the data has been masked.
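A minimal Python sketch of both techniques, applied to a hypothetical card number:

```python
# Sketch of redaction: all but the last four digits of a card number are
# replaced with "X". Nulling stores None instead. The value is hypothetical.
def redact_card(card_number: str) -> str:
    return "X" * (len(card_number) - 4) + card_number[-4:]

def null_field(_value):
    return None  # nulling: the field is simply emptied

print(redact_card("4111111111111111"))  # XXXXXXXXXXXX1111
print(null_field("4111111111111111"))   # None
```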
Format preserving encryption
Encryption turns data into an unreadable array of symbols. Standard encryption methods typically turn a data point into a string whose length and format differ from the original. For data masking, the encryption should retain the length and format of the original data to preserve data integrity; hence, a format-preserving encryption (FPE) method is used. Unlike the methods above, encrypted data can be reversed if the encryption key is available, which can be a security risk. Still, many organizations use encryption for data masking.
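The Python sketch below is only a toy illustration of the format-preserving idea: each digit is shifted by a key-derived offset, so the output keeps the original length and digit-only format, and the process is reversible with the key. It is not a secure FPE algorithm (production systems use schemes such as NIST FF1); the key and value are hypothetical:

```python
# Toy illustration of format-preserving encryption over digit strings.
# NOT a secure FPE scheme; it only demonstrates the format property.
import hmac
import hashlib

def _digit_offsets(key: bytes, length: int) -> list[int]:
    # Derive a per-position digit offset (0-9) from the key.
    digest = hmac.new(key, str(length).encode(), hashlib.sha256).digest()
    return [digest[i % len(digest)] % 10 for i in range(length)]

def fpe_encrypt(digits: str, key: bytes) -> str:
    offs = _digit_offsets(key, len(digits))
    return "".join(str((int(d) + o) % 10) for d, o in zip(digits, offs))

def fpe_decrypt(digits: str, key: bytes) -> str:
    offs = _digit_offsets(key, len(digits))
    return "".join(str((int(d) - o) % 10) for d, o in zip(digits, offs))

key = b"hypothetical-secret-key"
masked = fpe_encrypt("4111111111111111", key)
print(masked)                    # same length, still all digits
print(fpe_decrypt(masked, key))  # original value is recoverable with the key
```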
What are the general rules of data masking?
Data masking techniques need to follow a few rules so that the transformed data remains useful.
Data masking must be non-reversible
Once the data masking technique transforms the authentic data, it should be impossible to retrieve the original data from the masked data. If the masking is reversible, it poses a severe security risk.
The data must be representative
The data masking technique should not alter the nature of the data. Masking should apply transformations in such a way that the geographic distribution, gender distribution, readability, and numeric distributions of the original data are preserved.
Integrity should not be compromised
Data masking should not affect the integrity of the database. For example, if a credit card number is the primary key of a table and is scrambled for masking, every occurrence of that credit card number in the database should be scrambled to the same value. In short, data masking should not break referential integrity.
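One common way to achieve this consistency is deterministic masking, sketched below in Python with a hypothetical salt: the same input always yields the same masked output, so references stay aligned across tables:

```python
# Sketch of consistent (deterministic) masking: the same input value
# always maps to the same masked value, so foreign keys referencing a
# masked primary key stay consistent. The salt is hypothetical.
import hashlib

SALT = b"hypothetical-per-project-salt"

def mask_key(value: str, length: int = 16) -> str:
    digest = hashlib.sha256(SALT + value.encode()).hexdigest()
    return digest[:length]

# The same card number masks identically wherever it appears:
print(mask_key("4111111111111111"))
print(mask_key("4111111111111111"))  # identical output
```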
Non-sensitive data should be masked if it can expose sensitive data
Data masking does not necessarily mask every field in a data record. For example, in a customer record, it may not be necessary to mask the customer's gender if all sensitive information is already masked. However, if non-sensitive fields can be used to reconstruct sensitive data, they also need to be masked for security.
Data masking should be automated
Data masking is not a one-time process. Because production data changes frequently, the data masking system should create a masked replica of the new data automatically. If data masking is not automated, it can be expensive, inefficient, and ineffective.
Data masking workflow options
Static data masking
In a static data masking workflow, a copy of the original data is made, and masking is applied to this copy. There are two popular static data masking methods.
Extract – transform – load (ETL)
ETL is a commonly used data masking workflow. The first step in this workflow is to extract data from a production database. This step may create an exact copy of the production database or extract only a subset of the data using SELECT queries. In the transformation step, a data masking system applies one of the above-discussed data masking methods. In the last step, the masked data is loaded into a test database.
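A minimal Python sketch of this workflow, using two in-memory SQLite databases as hypothetical stand-ins for the production and test databases:

```python
# Sketch of an ETL masking workflow: extract rows from "production",
# transform (mask) the name column, and load into "test".
import sqlite3

# Two in-memory databases stand in for the production and test databases.
prod = sqlite3.connect(":memory:")
test = sqlite3.connect(":memory:")

prod.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
prod.execute("INSERT INTO customers VALUES (1, 'Alice Johnson')")
test.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

# Extract: pull (a subset of) rows from production with a SELECT query.
rows = prod.execute("SELECT id, name FROM customers").fetchall()

# Transform: apply a masking method (here, simple substitution).
masked_rows = [(row_id, f"Customer {row_id}") for row_id, _name in rows]

# Load: write the masked rows into the test database.
test.executemany("INSERT INTO customers VALUES (?, ?)", masked_rows)
test.commit()
print(test.execute("SELECT * FROM customers").fetchall())
```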
In-place masking
In this workflow, the data is masked within the production/original database. The masking system works on a “copy” of the data held within the same database, which eliminates the extract and load steps of the ETL workflow and lets the masking run on the production database's own infrastructure. One disadvantage of this method is the computational overhead it places on the production database. Also, keeping a masked copy inside the production database, with users accessing it there, may create security threats.
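A minimal Python/SQLite sketch of in-place masking, with an in-memory database standing in for production and a hypothetical schema:

```python
# Sketch of in-place masking: a copy is created inside the same database
# with CREATE TABLE ... AS SELECT, then masked with an UPDATE. There is
# no extract or load step.
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the production database
db.execute("CREATE TABLE customers (id INTEGER, card_number TEXT)")
db.execute("INSERT INTO customers VALUES (1, '4111111111111111')")

# Copy the data within the production database ...
db.execute("CREATE TABLE customers_masked AS SELECT * FROM customers")

# ... then mask it in place: keep only the last four digits.
db.execute(
    "UPDATE customers_masked "
    "SET card_number = 'XXXXXXXXXXXX' || substr(card_number, -4)"
)
print(db.execute("SELECT * FROM customers_masked").fetchall())
```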
Dynamic data masking
In dynamic data masking, the mask is applied on a copy of the data whenever the system receives a user request.
View-based data masking
In this data masking technique, when a user requests data, based on the access rights of the user, a mask is applied and the user gets a “masked view” of the original data. The masked view is a virtual table. View-based dynamic masking is suitable in test environments where every test user may not have the same data privileges.
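A minimal Python/SQLite sketch of a masked view over a hypothetical customer table; test users would query the view rather than the base table:

```python
# Sketch of view-based dynamic masking: a SQL view exposes masked
# columns, and the mask is applied at query time.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, card_number TEXT)")
db.execute("INSERT INTO customers VALUES (1, '4111111111111111')")

# The view is a virtual table: no masked copy is stored.
db.execute(
    "CREATE VIEW customers_masked_view AS "
    "SELECT id, 'XXXXXXXXXXXX' || substr(card_number, -4) AS card_number "
    "FROM customers"
)
print(db.execute("SELECT * FROM customers_masked_view").fetchall())
```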
Proxy-based data masking
Proxy-based data masking is a newer method of dynamic data masking. In this model, all data requests go through a proxy system, which runs data masking as a service. An example of proxy-based masking is the data exchange between an application and a database: if the application issues suspicious or excessive queries for sensitive data like credit card numbers, the proxy system can mask the data, protecting it in the event of hacking or other unauthorized access. In one implementation, the query result is substituted with the masked data. In another, the query itself is rewritten to run against a masked copy of the data, so the results are selected from the masked columns of the database.
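A minimal Python sketch of the proxy idea, with a hypothetical wrapper that masks a sensitive column in every query result before it reaches the application:

```python
# Sketch of proxy-based masking: a wrapper sits between the application
# and the database and masks sensitive columns in the results. The
# wrapper, column names, and masking rule are all hypothetical.
import sqlite3

SENSITIVE_COLUMNS = {"card_number"}

class MaskingProxy:
    def __init__(self, conn):
        self._conn = conn

    def execute(self, query):
        cursor = self._conn.execute(query)
        columns = [desc[0] for desc in cursor.description]
        for row in cursor.fetchall():
            # Substitute masked data into the query result.
            yield tuple(
                "****" if col in SENSITIVE_COLUMNS else value
                for col, value in zip(columns, row)
            )

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, card_number TEXT)")
db.execute("INSERT INTO customers VALUES (1, '4111111111111111')")

proxy = MaskingProxy(db)
print(list(proxy.execute("SELECT id, card_number FROM customers")))
```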
What are the challenges of data masking?
While the process of masking seems simple, a data masking system faces many challenges in making a meaningful, masked copy of production data.
Format preservation
The data masking system should understand what the data represents. When replacing with inauthentic data, the masking system should preserve the format. This is particularly important for dates and strings of data where the order and format are essential.
Referential integrity
In a relational database, tables are interconnected through primary and foreign keys. When the masking system scrambles or replaces the values of a table's primary key, the same values must be changed consistently throughout the database.
Gender preservation
While replacing people’s names in a database, the masking system should be aware of male and female names. If the masking system randomly changes the name, the gender distribution in the table will be affected.
Semantic integrity
Most databases enforce rules about the range of permitted values. For example, a salary column might only permit values within a realistic range. The masked data should fall within this range to preserve the meaning (semantics) of the data.
Uniqueness
If the original data in a table is unique, the masking system should supply unique values for each data element. For example, if a table stores the SSNs of employees, after masking, each employee should still have a unique SSN.
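A minimal Python sketch of uniqueness-preserving masking, using hypothetical SSNs and a lookup table to guarantee both consistency (same input, same output) and uniqueness (no two inputs share a masked value):

```python
# Sketch of uniqueness-preserving masking for SSNs. The lookup table
# keeps the mapping consistent; the "used" set prevents collisions.
# All SSNs shown are hypothetical.
import random

_assigned: dict[str, str] = {}   # original SSN -> masked SSN
_used: set[str] = set()          # masked SSNs already handed out

def mask_ssn(ssn: str) -> str:
    if ssn in _assigned:         # consistent: same input, same output
        return _assigned[ssn]
    while True:
        fake = (f"{random.randint(0, 999):03d}-"
                f"{random.randint(0, 99):02d}-"
                f"{random.randint(0, 9999):04d}")
        if fake not in _used:    # unique: never reuse a masked value
            _used.add(fake)
            _assigned[ssn] = fake
            return fake

print(mask_ssn("123-45-6789"))
print(mask_ssn("123-45-6789"))   # identical masked value
print(mask_ssn("987-65-4321"))   # different, still unique
```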
Frequency distribution
The masked data must retain any meaningful frequency distribution, for example, geographic distribution. The average value of numeric columns in the masked data should also be close to that of the original data.
What are the benefits of data masking?
Protects against data security threats
Data masking is an effective defense against various data security threats, such as data leaks, hacking, insecure data interfaces, and intentional data misuse.
Allows business data to be used for testing
Data masking lets companies use valuable business data for testing and training purposes, without having to worry about leaking original data.
Allows information sharing
Organizations can outsource their data-related tasks and provide masked production data to third-party vendors.
Preserves data format and structure
Data masking preserves the structure and format of the original data, which makes it an ideal technique to assist non-production procedures and research.