Novel algorithm for efficient privacy preservation in data analytics

Objective: To address modern privacy threats in data analytics by designing an efficient privacy-preserving data analytics technique. Methods: The proposed method is non-anonymized: it synthesizes quasi identifiers and then applies differential privacy. It was applied to three data sets, viz. the Adult, Statlog, and Indian Liver Patient data sets, all freely available in the UCI repository. Findings: The study presents “Synthesize Quasi Identifiers and apply Differential Privacy” (SQIDP), which is shown to be a more efficient and scalable algorithm. Compared to anonymity-based algorithms, SQIDP is not prone to similarity attacks, background knowledge attacks, attribute disclosure, or inference attacks. Anonymization, cryptographic, swarm, and randomization methods reduce data utility, whereas SQIDP offers 100% data utility; hence it is more efficient than these techniques. SQIDP was applied to three data sets with 270, 583, and 48842 records, and the execution time of the algorithm remained the same for all three. Because it does not anonymize the data, SQIDP retains 100% data utility and abides by the recommendations of privacy legislation such as the GDPR (General Data Protection Regulation) of the European Union and the PDP (Personal Data Protection) bill of India.


Introduction
The majority of privacy preservation methods developed in the past were based on anonymization techniques, which reduce data utility (1). Cryptographic techniques have also proved inadequate, especially when the data is voluminous. Swarm-based anonymization is a recent development in the field, but privacy legislation recommends non-anonymization-based solutions to ensure data utility (2). The application of k-anonymity together with perturbation techniques has also been studied, but it suffers from the data utility problem (3). Data mining techniques are also used for privacy preservation to overcome similarity and inference attacks, with an improved trade-off between data utility and data anonymization (4). However, to achieve maximum data utility, a non-anonymized solution is preferred. In this paper, we examine the key aspects of privacy legislation and modern privacy threats, and propose a privacy preservation algorithm called SQIDP for data analytics. The key features of the algorithm and the main contributions of the study are listed below.
1. A non-anonymization-based solution to the privacy preservation problem, as recommended in GDPR.
2. Identity disclosure and attribute disclosure are not possible, because the quasi identifiers are synthesized and cannot be mapped to external data sources.
3. Sensitive data is tokenized before analytics, and differential privacy is applied to the synthetic data, which prevents background knowledge and homogeneity attacks.
4. Strong and coherent privacy protection is guaranteed because the original data set is not involved in data analytics. Instead, a synthetic data set is used that is statistically similar to the original, so that the analytical results on the synthetic data can be related or mapped back to the original data set.
https://www.indjst.org/

Modern privacy threats and concerns
The nature of privacy threats has changed with the emergence of applications such as recommendation systems and e-commerce. Conventional data analysis consisted of statistical analysis, especially aggregate queries in which the data was analyzed as a whole (5,6). Applications like recommendation systems analyze personal data such as buying habits and social media posts to predict suitable recommendations, which is possible only through constant surveillance. Recommendation systems may disclose sensitive data, leading to personal embarrassment and inference attacks. Another important source of privacy breaches is smartphone usage (7). Most smartphone apps demand permission to access the network, location, contacts, and storage; the app developer can share this data with third parties and adversaries, causing serious privacy breaches and threats to sensitive data. The percentage of users aware of the privacy threats of smartphone apps is very low (8). The nature of privacy threats changes every day; some of the modern privacy threats include:
1. Digital profiling
2. Social media privacy and cyberstalking
3. Image analytics and privacy hazards

Digital profiling
Digital profiling is the automated processing of person-specific data to evaluate certain attributes of a person, particularly to analyze and predict an individual's economic situation, buying habits, health, preferences, interests, behaviour, etc. Digital profiling also affects group privacy, since an individual may be a member of one or more groups (9). Digital profiling is widely used in direct digital marketing. Profiling without the consent of the individual is a privacy breach. Google has recently announced that it will end support for third-party cookies in its Chrome browser, which will make it very difficult for digital marketing companies to build user profiles (10). Article 22 of GDPR gives individuals the right not to be subjected to automated data processing, including profiling, without their consent.

Social media privacy and cyber stalking
Social media platforms are highly vulnerable to stalking attacks. One common stalking technique involves online mobs of anonymous, self-organized groups that target individuals, causing defamation, threats of violence, and technology-based attacks. Social media are also used to build trust between the perpetrator and the victim. When the victim transmits confidential data, including pictures and videos, the perpetrator abuses them for blackmail (11). Social media firms are also responsible for identifying users with malicious intentions.

Image data and privacy hazards
Image data analytics is widely used in health care, social media, and e-commerce applications. In social media applications like Facebook and Instagram, users upload a large number of images every day. An image is worth more than a thousand words and hence may reveal the emotional state of a person (12). Some of the key privacy hazards in image data analytics include:
1. Attempts to analyse the emotional state of people and exploit them. Facebook and WhatsApp status updates can be studied using machine learning models, and sentiment analysis can reveal the social and emotional wellbeing of a person, who can in turn be exploited.
2. Disclosure of medication that a person is taking in private, by virtue of promotional offers on medicine.
3. Identity theft, because copies of permanent account number (PAN) cards, passports, and driving licenses are kept in digital form and shared. Insurance and banking firms and third parties extract a lot of sensitive data from them, which is a serious privacy hazard (13,14).
4. Medical imaging deals with the visual representation of the internal structure of organs and tissues, and may lead to leakage of a person's personal and sensitive medical data (15).

Methods

SQIDP algorithm (Synthesize Quasi Identifier and apply Differential Privacy)
Data privacy has gained paramount importance in recent times, as is evident from the privacy legislation passed in more than 100 countries. Firms dealing with data-sensitive applications need to abide by the privacy legislation of their respective regions. In the recent past, a lot of promising work has been done in privacy-preserving data analytics. Swarm-based algorithms have been applied to data sets alongside perturbation techniques; the swarm-based algorithm developed for privacy preservation uses k-anonymity as its building block. Even though swarm algorithms are promising, they suffer from the traditional flaws of anonymization (2). Researchers have also tried to apply a MapReduce framework to process data sets using a perturbation mechanism along with probabilistic anonymity (3). However, anonymization is not recommended in privacy legislation since it reduces data utility, and hence a non-anonymized solution has to be designed. SQIDP is a non-anonymized technique in which the original data set D is transformed into D' without any anonymization, and the new data set D' is used for analytics.
The step-by-step procedure to generate D' from D is described in sections 3.2, 3.3, and 3.4.

Synthetic data in data analytics
Synthetic data is one of the data sanitization methods, in which original data is replaced with synthetic data to enable privacy-preserving data analytics. Data can be fully or partially synthetic, and various synthetic data generation methods have been studied and compared in previous literature (16). Synthetic data can be produced by generative and additive approaches that yield a near replica of the original data, but care must be taken to ensure that the synthetic data is reliable and that data utility is not reduced. Quasi identifiers are attributes that can be linked with external data sources to reveal sensitive data; they are therefore synthesized to prevent linkage attacks. Statistical similarity between the original and synthetic quasi identifiers is ensured by generating synthetic data whose statistical properties are close to those of the original quasi identifiers. As part of our research, we employed a novel algorithm called SQIDP in which quasi identifiers (QI) are synthesized, sensitive attribute(s) are tokenized, and finally differential privacy is applied to generate a new data set from the original data set.

Application of SQIDP algorithm
The algorithm was initially applied to the Adult data set, downloaded from the University of California, Irvine (UCI) machine learning repository (17). The quasi identifiers are the fnlwgt, age, capital-gain, and capital-loss attributes. The data set size was 32561 records. Initially, we found the mean and standard deviation of the fnlwgt attribute to be 189778.4 and 105550 respectively. "rnorm" is a function in R that generates normally distributed random values with a given mean and standard deviation. Using rnorm, we generated synthetic data for the fnlwgt attribute, referred to as fnlwgt_syn. The mean and standard deviation of fnlwgt_syn are 189777 and 105551 respectively, which are very close to those of the original attribute. Figure 1 shows how synthetic data for the attributes fnlwgt and capital-gain are generated. The remaining quasi identifiers are synthesized similarly.
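The synthesis step can be sketched in a few lines. The paper used R's rnorm; the version below is a Python analogue using only the standard library, and it is an illustration under assumptions rather than the authors' code. The function name `synthesize_normal` is hypothetical, and the "original" column here is a generated stand-in, because the real fnlwgt values would have to be read from the Adult data set.

```python
import random
import statistics


def synthesize_normal(values, seed=None):
    """Draw len(values) samples from a normal distribution whose mean and
    standard deviation are estimated from `values` (the role rnorm plays in R)."""
    rng = random.Random(seed)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in values]


# Assumption: a generated stand-in for the fnlwgt column. The parameters are
# the mean and standard deviation reported in the text, not real records.
source_rng = random.Random(0)
fnlwgt = [source_rng.gauss(189778.4, 105550) for _ in range(32561)]

fnlwgt_syn = synthesize_normal(fnlwgt, seed=1)

# Compare the summary statistics of the original and synthetic columns.
print("mean:", round(statistics.mean(fnlwgt)), round(statistics.mean(fnlwgt_syn)))
print("sd:  ", round(statistics.stdev(fnlwgt)), round(statistics.stdev(fnlwgt_syn)))
```

Note that a normal draw can produce negative or fractional values, so a count-like attribute may need clipping or rounding afterwards; the text does not say whether such a step was applied.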
The attribute marital-status is the sensitive attribute (SA), which is tokenized. Marital-status takes the values {Never-married, Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Married-AF-spouse, Widowed}. The SA is tokenized using a numerical vector. The new data set D' is created by combining the non quasi identifiers of D, the synthesized quasi identifiers (SQID), and the tokenized sensitive attribute (SA). D' is released instead of D for data analytics. D' contains synthetic data that closely resembles the original data but is not the original data. The mathematical transformations applied to the quasi identifiers (QI) ensure that the analytical results of D' can be applied to D without releasing the original data set. The comparison of synthesized attributes in D' with the original values in D is shown in Figure 2. Linkage attacks were possible in most anonymized methods, where the quasi identifiers can be mapped to external data sets. In the SQIDP algorithm, D' contains synthesized values of quasi identifiers that cannot be mapped to external data sets, so linkage attacks are prevented. Since D' and D have close statistical properties such as mean and standard deviation, the results of data analytics on D' can be applied back to the original data set D. The same process was repeated to create two more partially synthetic data sets, viz. the Indian Liver Patient data set and the Statlog heart data set, which are described in Table 1.
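The construction of D' described above can be illustrated with a minimal Python sketch. It assumes records are held as dictionaries and that the synthetic quasi-identifier columns have already been generated. The helper names `build_token_map` and `build_release`, the token numbering, and the two sample records are all made up for illustration and are not taken from the paper.

```python
MARITAL_STATUS = [
    "Never-married", "Married-civ-spouse", "Divorced", "Married-spouse-absent",
    "Separated", "Married-AF-spouse", "Widowed",
]


def build_token_map(categories):
    """Assign each category an integer token (1, 2, 3, ...).
    Assumption: the numbering is arbitrary; the paper only says a numerical vector is used."""
    return {category: token for token, category in enumerate(categories, start=1)}


def build_release(records, synthetic_qi, sensitive_attr, token_map):
    """Return D': non quasi identifiers copied from D, quasi identifiers replaced
    by synthetic columns, and the sensitive attribute replaced by its token."""
    released = []
    for i, row in enumerate(records):
        new_row = dict(row)  # copy, so the original data set D is left untouched
        for name, column in synthetic_qi.items():
            new_row[name] = column[i]
        new_row[sensitive_attr] = token_map[row[sensitive_attr]]
        released.append(new_row)
    return released


# Made-up records and synthetic values, for illustration only.
D = [
    {"age": 39, "fnlwgt": 77516, "education": "Bachelors", "marital-status": "Never-married"},
    {"age": 50, "fnlwgt": 83311, "education": "Bachelors", "marital-status": "Married-civ-spouse"},
]
synthetic_qi = {"age": [41.2, 47.9], "fnlwgt": [80102.5, 79954.1]}

token_map = build_token_map(MARITAL_STATUS)
D_prime = build_release(D, synthetic_qi, "marital-status", token_map)
print(D_prime)
```

In this sketch the token map stays with the data owner, so that analytical results on D' can be related back to the original categories.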

Application of differential privacy
In Section 3 we demonstrated the generation of partially synthetic data with strategic changes made to the quasi identifiers. The data set thus generated (D') can be released for analytics and the results can be applied back to the original data set. However, to make the data set more robust to privacy attacks, a differential privacy algorithm is additionally applied to D'. The Laplace mechanism of differential privacy is applied to D' to generate a differentially private data set, which makes it very difficult to predict whether an individual record was present in the data set or not. The R package diffpriv (18) implements several mechanisms of differential privacy. The Laplace mechanism is one of them: it adds noise drawn from the Laplace distribution to the data.
Figure 3 analysis: After synthesizing the quasi identifiers and tokenizing the sensitive attribute, the data set adult_synthetic is passed to a function that applies the Laplace mechanism of differential privacy to the data set D'. The idea is to add enough noise to hide the contribution of any individual, irrespective of the data set. If the data set is differentially private, it is difficult to predict whether a single person is in it or not.
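To make the Laplace mechanism concrete, here is a small Python sketch that releases a differentially private mean of one numeric column. It is not the diffpriv API and not the authors' function. The names `laplace_noise` and `private_mean`, the privacy budget epsilon = 1.0, the bounds 17 to 90, and the generated column are all assumptions chosen for illustration. The noise scale follows the standard rule, sensitivity divided by epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise. The difference of two independent
    exponential variates with mean `scale` follows this distribution."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with epsilon-differential privacy.
    Values are clipped to [lower, upper] so that changing one record can
    move the mean by at most (upper - lower) / n, the sensitivity."""
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)

# Assumption: a generated stand-in for a synthesized age column of D'.
ages_syn = [rng.uniform(17, 90) for _ in range(1000)]

true_mean = sum(ages_syn) / len(ages_syn)
noisy_mean = private_mean(ages_syn, lower=17, upper=90, epsilon=1.0, rng=rng)
print("true mean:", round(true_mean, 3), "released mean:", round(noisy_mean, 3))
```

A smaller epsilon gives stronger privacy but a larger noise scale, which is the privacy and utility trade-off the surrounding text refers to.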

Results
In SQIDP, the quasi identifiers were replaced with synthetic data generated as random variates from a normal distribution with specified parameters. The mean and standard deviation of the synthetic data are very close to the mean and standard deviation of the original quasi identifiers. This ensures that the results of data analytics on synthetic data can be mapped to the original data sets.
Advantages of SQIDP:
1. The execution time of SQIDP was the same on all three data sets despite their different sizes, and hence it is scalable.
2. SQIDP addresses the background knowledge attack and the homogeneity attack because the quasi identifiers are synthesized and cannot be mapped to any external data source.
3. SQIDP is a non-anonymized method and offers 100% data utility.

Discussions
SQIDP is an innovative method of privacy preservation in which quasi identifiers are synthesized to eliminate the scope for linkage attacks, which were a common problem in previous privacy preservation techniques. SQIDP is found to be more efficient than existing techniques; a detailed comparison is given in Table 2, and the performance metrics of SQIDP are listed in Table 3. Figure 4 gives a comparative graphical analysis of the three data sets along with the execution time of SQIDP, which is found to be uniform across all data sets.
Robustness: SQIDP is robust because it is not vulnerable to
- linkage attacks, because of differential privacy
- background knowledge attacks, because of synthetic quasi identifiers
- attribute and identity disclosure, because of tokenization of the sensitive attribute
Compliance: SQIDP complies with privacy regulations and does not use anonymization, as recommended in GDPR.
Execution time: All the privacy-preserving techniques, including SQIDP, have O(n) time complexity. However, SQIDP can be executed on a distributed computing platform to gain better performance.
Accuracy: Anonymization leads to data loss, which in turn affects the results of the analytics. SQIDP is non-anonymized and hence offers accurate analytics.

Even though differential privacy is employed in the 2020 US Census (19), Google Chrome, iOS 11, etc., differential privacy alone cannot guarantee privacy preservation because it has its limitations. Differential privacy fails for some aggregate queries performed on the data. For example, suppose we want to find the average salary of all women employees of an organization, and one employee has a very high salary whose presence or absence in the data changes the average significantly. In such cases, a huge amount of noise has to be added to ensure privacy, but excess noise may affect data utility. Another important observation is that differential privacy cannot handle background knowledge attacks. If the adversary already knows certain information about a person, hiding that person's presence or absence in the data set offers little protection. SQIDP is a more efficient privacy preservation technique than conventional anonymization, randomization, and cryptographic techniques. Anonymization techniques reduce data utility, whereas SQIDP does not suffer from loss of data utility and is also in line with GDPR.
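The average-salary limitation discussed above can be worked through with a short Python example. The salary figures and epsilon = 1.0 are invented for illustration, and the computed shift is the effect of one record in this particular data set, not a formal global sensitivity bound.

```python
# Assumption: 99 employees earn 50,000 and one earns 5,000,000 (made-up values).
salaries = [50_000] * 99 + [5_000_000]

mean_with_outlier = sum(salaries) / len(salaries)

remaining = salaries[:-1]  # the same data set without the high earner
mean_without_outlier = sum(remaining) / len(remaining)

# How far a single record moves the answer to the average query.
shift = abs(mean_with_outlier - mean_without_outlier)

# To hide that record, Laplace noise needs a scale on the order of shift / epsilon.
epsilon = 1.0
noise_scale = shift / epsilon

print("mean with outlier:   ", mean_with_outlier)
print("mean without outlier:", mean_without_outlier)
print("shift from one record:", shift)
print("noise scale needed:   ", noise_scale)
```

Here a single record moves the average by almost as much as the typical salary itself, so noise large enough to mask it would also swamp the useful signal in the released average.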
SQIDP is more efficient than cryptographic techniques because cryptographic techniques add processing overhead and also reduce data utility. Differential privacy has its own limitations and cannot, on its own, address the privacy threats involved in data analytics. SQIDP has proved to be more efficient than differential privacy alone because synthesizing quasi identifiers and tokenizing the sensitive attribute prevent background knowledge attacks. Regardless of the number of queries, SQIDP leaves no chance of the privacy breaches that were noticed when differential privacy was applied alone. Hence SQIDP is an efficient mechanism of privacy preservation and a useful contribution to the field of privacy-preserving data analytics.

Conclusions
The SQIDP algorithm can be applied to text data, ensuring privacy preservation and protection from background knowledge and linkage attacks. SQIDP is a useful contribution to the field of privacy-preserving data analytics that ensures data utility along with privacy. However, SQIDP is limited to text data and cannot be applied to image or video data. Extensive usage of social media has led to the creation of a huge amount of image and video data that is prone to various cyber security vulnerabilities, which leaves ample scope for further research.