Competitive equilibrium based personal data market model

Background/Objectives: In a digital economy, personal data has increasing commercial value and is viewed as a commodity to be bought and sold. Data owners expect appropriate compensation for trading off their privacy, depending on how they value it. Simultaneously, data consumers want to maximize their utility, which depends on the value derived from the data. Consequently, a data market model that optimally compensates data owners and maximizes the profits of data consumers is required. Methods: In this study, a data market model and pricing mechanism based on the Fisher market model and competitive equilibrium are presented, where the value derived from the data is calculated from information entropy. The proposed data model and pricing mechanism jointly and simultaneously maximize the profit of data owners and the utility of data buyers. Findings: Experiments are conducted on the Adult data set to validate the efficacy of the proposed approach. Data owners are classified as risk averse, risk neutral, risk taking and privacy regarders. Subsequently, prices of data samples are adjusted to reach equilibrium as defined by the Fisher market model, simultaneously and jointly maximizing the profit of data owners and the utility of data buyers. Applications: The proposed competitive equilibrium based personal data market model can be used to find equilibrium prices and, at these prices, the bundle of data samples for each data buyer that maximizes the buyer's utility subject to his budgetary constraints and data requirements and to the data owners' privacy preferences.


Introduction
Personal data is continually being generated, gathered and shared at an unprecedented level, and it is useful in improving products and services. For instance, personal information is useful in a variety of domains, including medical research, crime analysis, web usage analysis, customer behavior analysis and risk analysis. However, raw personal data may contain sensitive information about individuals. Consequently, the use of personal data is constrained owing to privacy concerns. Due to this mismatch between demand and supply, the concept of a personal data market has emerged, in which personal data is viewed as a commodity. Numerous studies have emphasized that monetary compensation ought to be awarded to data owners for trading off their privacy (1)(2)(3)(4). However, evaluating the costs and benefits of preserving or disseminating personal data and quantifying the worth of personal data is notably challenging.
There are three stakeholders involved in the personal data market:
1. The data owners, who make their personal information available to the data market along with their privacy preferences, such as when, how and for what purpose their personal data can be collected and shared with others. In return, a data owner obtains appropriate monetary compensation for trading off his privacy.
2. The data buyers, who seek data from the data market for either research or commercial purposes. A data buyer makes a payment to the data market in proportion to the amount of information acquired and/or the value derived from the information obtained.
3. The data market, which is a trusted intermediary between data owners and data buyers. The data market gathers the personal data of data owners together with their privacy preferences and disseminates data to data buyers while meeting the obligation of preserving the owners' privacy choices.
Numerous studies exist on personal data markets from the perspective of data privacy; they can be classified into information technology-based and economics-based approaches. Technology-based approaches protect privacy while sharing personal data by using masking, generalization and suppression techniques to anonymize the personally identifiable attributes in the data (5)(6)(7)(8).
While technology-based approaches can be used in data mining applications to find aggregate or statistical information or to obtain interesting patterns in the data, they are not suitable in applications such as crime analysis and marketing where sensitive information and individual identities are required.
Economics-based approaches rely on a pricing mechanism for privacy compensation. In these approaches, personal data is considered a commodity and individuals trade off their privacy for monetary gain (9,10). The worth of the personal data is determined using economic tools. Laudon (11) advocates a market-based approach where individuals have control over the information about them and thus receive fair compensation for trading off their privacy. Gkatzelis et al. introduce a market that makes payments to individuals according to their privacy choices (12). At the same time, Gkatzelis et al. introduce a mechanism that induces individuals to truthfully report their privacy preferences. Similarly, (13) proposes a fair privacy compensation mechanism that depends on the privacy attitude of the individuals and the sensitivity of the information provided by each individual. In another study (14), a big data market model is introduced and an optimal pricing scheme is formulated as a Stackelberg game to maximize the profit of the data source. Yet another study (15) presents an optimal pricing scheme based on the quality of the provided data. However, studies (14) and (15) do not consider the privacy preferences of the data providers.
Furthermore, there are studies that integrate technology-based approaches with economics-based models to simultaneously preserve individuals' privacy and increase the value of personal information. For instance, in (16), a mechanism is proposed that integrates a technology-based security mechanism with a market model that compensates individuals according to their privacy attributes. However, the mechanism proposed in (16) has limited applicability since the economic model employed depends on the specific security mechanism.
The key determinants in pricing personal data include the privacy attitudes of the data owners and the value derived by the data buyers from the information obtained. Privacy is the right of an individual, and privacy attitudes differ from one individual to another. Consequently, each person may be willing to sell his personal data at a different price. For instance, individuals less concerned about privacy may be willing to sell their personal data for a small payment, while others who are more concerned about privacy may consent to sell their personal data only for higher payments. Furthermore, some individuals may pretend to regard privacy as more valuable and misreport the cost of their personal data in the expectation of receiving higher payments. On the other hand, data buyers desire to obtain maximum utility from the information acquired at minimum subscription cost. For instance, a data buyer may desire to obtain those samples of data that yield more information content. Thus both data owners and data buyers are selfish, and each intends to maximize his own objective.
In this study, a pricing mechanism based on competitive equilibrium theory is presented that jointly optimizes the privacy loss of data owners and the utility of data buyers. A data buyer is assumed to desire the samples that maximize the information entropy.
The remainder of the study is organized as follows: Section 2 presents an overview of competitive equilibrium theory, and Section 3 describes the calculation of information entropy. Section 4 describes the proposed pricing model, and Section 5 presents experimental results. Finally, conclusions are drawn in Section 6.

Competitive Equilibrium
A competitive equilibrium has the following properties:
1. Producers/sellers and consumers/buyers arrive at an equilibrium price for a product.
2. At the equilibrium price, the market clears; in other words, demand equals supply for a product.
3. At the equilibrium price, sellers maximize their profits and buyers maximize their utility subject to resource and technological constraints.
In this study, we discuss the competitive equilibrium of a Fisher market model that contains $m$ commodities and $n$ buyers. Each buyer $i$ is endowed with revenue $r_i > 0$ and a utility function $u_i : [0, 1]^m \to \mathbb{R}$ that determines the buyer's preferences for consuming the various commodities. Similarly, each commodity $j$ has a pre-specified quantity $c_j$. Each buyer spends his revenue to consume a bundle of goods such that his utility function is maximized. Let $X_i = (x_{i1}, \ldots, x_{im}) \in [0, 1]^m$ represent the bundle of goods consumed by buyer $i$, where $x_{ij} = 1$ implies that good $j$ is consumed and $x_{ij} = 0$ implies that good $j$ is not consumed by buyer $i$. Each commodity $j$ has a price $p_j$ determined by an auction that is described below. A competitive equilibrium is a vector of prices $P = (p_1, \ldots, p_m) \in \mathbb{R}^m_+$ and bundles of goods $X_i$ for $1 \le i \le n$ such that the market clears and the utility of each buyer is maximized (17). Formally, the conditions for the market can be stated as follows:
1. For all $1 \le i \le n$, the vector $X_i$ is the maximizer of the utility of buyer $i$, that is, $\max u_i(X_i)$ subject to $\sum_{j=1}^{m} p_j x_{ij} \le r_i$.
2. The market clears, that is, $\sum_{i=1}^{n} x_{ij} = c_j$ for all $1 \le j \le m$.
Walras introduced a trial-and-error process called the tâtonnement process for determining equilibrium prices (18). Each buyer reports his demands for goods that maximize his utility at the prices announced by an auctioneer. The market is in disequilibrium if there is either excess demand or excess supply. Buying and selling do not take place at disequilibrium; instead, the auctioneer either raises or lowers the price of each good depending on its demand and supply. The process terminates when the market reaches an equilibrium, that is, when the market clears. In other words, equilibrium is reached when the demand and supply of goods are equal. According to (19), equilibrium is guaranteed if each commodity is consumed by at least one consumer and each consumer buys at least one item. More formally, Arrow and Debreu prove that the market converges to equilibrium if the utility functions of the agents are concave (20).
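The tâtonnement process described above can be illustrated with a minimal Python sketch. It is not the procedure of (18) in any particular form: it assumes Cobb-Douglas utilities (chosen only because demand then has a closed form), a fixed additive step size, and the hypothetical names `tatonnement`, `budgets`, `weights` and `supply`.

```python
def tatonnement(budgets, weights, supply, step=0.1, tol=1e-9, max_iter=10_000):
    """Adjust prices until excess demand vanishes (hypothetical helper).

    budgets[i]    : revenue r_i of buyer i
    weights[i][j] : Cobb-Douglas weight of buyer i on good j (rows sum to 1);
                    this utility form is an assumption, not taken from the paper
    supply[j]     : available quantity c_j of good j
    """
    m = len(supply)
    prices = [1.0] * m  # arbitrary starting prices announced by the auctioneer
    for _ in range(max_iter):
        # Cobb-Douglas demand: buyer i spends the share weights[i][j] of r_i on good j.
        demand = [
            sum(r * w[j] for r, w in zip(budgets, weights)) / prices[j]
            for j in range(m)
        ]
        excess = [demand[j] - supply[j] for j in range(m)]
        if max(abs(e) for e in excess) < tol:
            break  # market clears: demand equals supply for every good
        # Raise the price of over-demanded goods, lower it for under-demanded ones.
        prices = [max(1e-12, prices[j] + step * excess[j]) for j in range(m)]
    return prices


budgets = [1.0, 1.0]
weights = [[0.5, 0.5], [0.25, 0.75]]
supply = [1.0, 1.0]
equilibrium_prices = tatonnement(budgets, weights, supply)
```

For this toy market the clearing prices have the closed form $p_j = \sum_i w_{ij} r_i / c_j$, so the loop should approach prices of about 0.75 and 1.25. No trade happens inside the loop; only the announced prices change, as in the auctioneer description.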

Information Entropy
Information entropy, introduced by Claude Shannon, is a measure of the 'uncertainty' or 'surprise' associated with a random variable. Let $X$ be a discrete random variable whose possible outcomes are $x_1, \ldots, x_n$; then the entropy of $X$ is defined as

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{1}$$

where $p(x_i) = \Pr(X = x_i)$ is the probability of the $i$-th outcome of $X$. Suppose a dataset contains $n$ samples, each with $k$ attributes; then the entropy of the $i$-th sample can be calculated as follows:

$$H(x_i) = -\sum_{j=1}^{k} p(x_{ij}) \log p(x_{ij}) \tag{2}$$

where $x_{ij}$ denotes the value of the $j$-th attribute of the $i$-th sample. The total entropy of all samples in a dataset $D$ is computed as follows:

$$H(D) = \sum_{i=1}^{n} H(x_i) \tag{3}$$

A skewed probability distribution is unsurprising and has low entropy and less information content. On the other hand, a balanced probability distribution is surprising and has high entropy and more information content. The contribution of the information content of the $i$-th sample to the whole dataset can be calculated according to (21) as follows:

$$\frac{H(x_i)}{H(D)} \tag{4}$$

It can be noted that $\sum_{i=1}^{n} \frac{H(x_i)}{H(D)} = 1$. According to (22), the information entropy is a concave, continuous and continuously differentiable function of $x$, where $x$ ranges over a compact convex set.
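The per-sample entropy and its contribution to the dataset can be computed along the following lines. This is a sketch under stated assumptions: the probability of an attribute value is estimated as its relative frequency within that attribute's column, the logarithm is taken to base 2, and the names `sample_entropies`, `contributions` and `toy` are hypothetical.

```python
import math
from collections import Counter


def sample_entropies(dataset):
    """Return H(x_i) for every sample, summing -p log p over its k attribute values."""
    n = len(dataset)
    k = len(dataset[0])
    # Assumption: p(x_ij) is the relative frequency of the value within attribute j.
    counts = [Counter(row[j] for row in dataset) for j in range(k)]
    entropies = []
    for row in dataset:
        h = 0.0
        for j in range(k):
            p = counts[j][row[j]] / n
            h -= p * math.log2(p)  # p > 0 because the value occurs in the data
        entropies.append(h)
    return entropies


def contributions(entropies):
    """Return H(x_i) / H(D), where H(D) is the sum of the sample entropies."""
    total = sum(entropies)
    if total == 0:
        raise ValueError("total entropy is zero; contributions are undefined")
    return [h / total for h in entropies]


toy = [("a", "x"), ("a", "y"), ("b", "y"), ("a", "y")]
h = sample_entropies(toy)
shares = contributions(h)
```

By construction the entries of `shares` sum to 1, which matches the normalization noted in the text. Duplicate samples, such as the second and fourth rows of `toy`, receive identical entropies.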

Proposed Pricing Model
We consider a data market $D$ consisting of $n$ data owners and $m$ data buyers. Let $x_i$ denote the sample corresponding to data owner $i$ and $X_j \subseteq D$ denote the bundle of samples that data buyer $j$ has access to. The utility of a data buyer $j$ is derived from the information entropy of the bundle $X_j$ and is computed as follows:

$$u_j(X_j) = \sum_{x_i \in X_j} H(x_i)$$

where $H(x_i)$ is the information entropy of data sample $x_i$, defined over the sample set $X_j \subseteq D$ and calculated according to Equation (2). Each data owner $i$ reports the initial price $p_i$ at which he is willing to sell his personal data, along with his privacy preferences, such as when, how and for what purpose his personal data can be shared with others. Each data buyer $j$ reports his budget $b_j$ and data requirements to the data market. Data requirements may specify the number of data samples and/or the specific kind of data samples required. The data market bundles the samples for each data buyer at the given prices, maximizing the utility of the buyer subject to his budget constraints and data requirements and in accordance with the privacy preferences of each data owner.
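The bundling step can be sketched as follows. Choosing a bundle that maximizes summed entropy under a budget is a knapsack-type problem, and the source does not show how the market solves it here, so this sketch swaps in a simple greedy heuristic that ranks samples by entropy per unit price. The function names, the `allowed` set standing in for privacy preferences, and the toy numbers are all hypothetical.

```python
def bundle_utility(bundle, entropies):
    """Utility of a buyer: the sum of the entropies of the samples in the bundle."""
    return sum(entropies[i] for i in bundle)


def select_bundle(entropies, prices, budget, max_samples, allowed=None):
    """Greedy heuristic: pick samples by entropy per unit price (hypothetical helper).

    allowed is an optional set of sample indices whose owners' privacy
    preferences permit sharing with this buyer.
    """
    candidates = range(len(entropies)) if allowed is None else sorted(allowed)
    ranked = sorted(
        candidates,
        key=lambda i: entropies[i] / prices[i] if prices[i] > 0 else float("inf"),
        reverse=True,
    )
    bundle, spent = [], 0.0
    for i in ranked:
        if len(bundle) == max_samples:
            break  # the buyer's data requirement is met
        if spent + prices[i] <= budget:
            bundle.append(i)
            spent += prices[i]
    return bundle, spent


entropies = [0.8, 0.6, 0.9, 0.3]
prices = [0.4, 0.2, 0.9, 0.1]
bundle, spent = select_bundle(entropies, prices, budget=1.0, max_samples=3)
```

A greedy ranking does not guarantee the optimal bundle; it only illustrates how the budget, the sample count and the owners' permissions constrain the buyer's choice.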
In order to obtain market equilibrium, the data market adjusts the price of each data owner's sample depending on demand and supply and on the information entropy of the sample. If $p_i$ is too high, $x_i$ may not be sampled and bought by any data buyer. On the other hand, if $p_i$ is too low, the owner of $x_i$ receives less payment for trading off his privacy. Therefore, in addition to the initial price $p_i$, the data market elicits the maximum price $p_i^{\max}$ and the minimum price $p_i^{\min}$ at which each data owner is willing to sell his personal data. A risk taker reports maximum and minimum prices such that $p_i^{\min} = p_i < p_i^{\max}$; a risk-averse person reports them such that $p_i^{\min} < p_i = p_i^{\max}$; and a risk-neutral person reports them such that $p_i^{\min} < p_i < p_i^{\max}$. In addition, there are privacy regarders, who value privacy more than incentives; such a person reports maximum and minimum prices such that $p_i^{\min} = p_i = p_i^{\max}$, where the price $p_i$ is generally very high. A sample has a high probability of being sampled if it has more information entropy and/or a lower price, and vice versa. Therefore, the data market increases the price of a sample that has more information entropy and a lower price in order to increase the profit of the corresponding owner. On the other hand, the price of a sample is lowered if it has less information entropy and a higher price, in order to increase its probability of being sampled. In all cases, the data market adjusts the price of a sample only within the range of maximum and minimum prices specified by its owner.
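The four owner types and the bounded price adjustment can be expressed compactly in code. The classification follows the inequalities stated above; the adjustment rule, which compares a sample's share of total entropy with its share of total price, is only an assumed stand-in for the market's actual update, and all names (`classify_owner`, `adjust_price`, `step`) are hypothetical.

```python
import math


def classify_owner(p_min, p, p_max):
    """Map reported prices to the four owner types described in the text."""
    if not (p_min <= p <= p_max):
        raise ValueError("expected p_min <= p <= p_max")
    at_min = math.isclose(p, p_min)
    at_max = math.isclose(p, p_max)
    if at_min and at_max:
        return "privacy regarder"  # p_min = p = p_max
    if at_min:
        return "risk taker"        # p_min = p < p_max
    if at_max:
        return "risk averse"       # p_min < p = p_max
    return "risk neutral"          # p_min < p < p_max


def adjust_price(p, p_min, p_max, entropy_share, price_share, step=0.5):
    """One assumed adjustment step, clamped to the owner's price range.

    A sample whose share of total entropy exceeds its share of total price is
    treated as under-priced and raised; the opposite case is lowered.
    """
    proposed = p * (1.0 + step * (entropy_share - price_share))
    return min(p_max, max(p_min, proposed))
```

The clamp in `adjust_price` means a privacy regarder's price never moves, while a risk taker's price can only rise and a risk-averse owner's price can only fall, which is consistent with the ranges each type reports.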
Equilibrium is reached if a vector of prices $P^* = (p_1^*, \ldots, p_n^*) \in \mathbb{R}^n_+$ and bundles of samples $X_j^*$ are determined such that the market clears and the utility of each data buyer is maximized subject to his budget and data constraints. Formally, the conditions for equilibrium in the context of the data market can be defined as follows:
1. For each data buyer $j$, $X_j^*$ is the maximizer of the utility function, that is, $\max u_j(X_j)$ at the given prices $P^*$ and budget $b_j$.
2. The market clears, that is, demand equals supply for every data sample.
Details of computing the competitive equilibrium prices and, at these prices, the bundle of samples for each buyer that maximizes his utility subject to his budget constraints, his data requirements and the privacy preferences of data owners are presented in Algorithm 1. Since information entropy is a concave, strictly continuous and continuously differentiable function of $x$, where $x$ ranges over a compact convex set, the utility functions of the data buyers $u_j(X_j)$ for $1 \le j \le m$ are concave, strictly continuous and continuously differentiable functions of $X_j$, where $X_j$ ranges over a compact convex set. According to (20), a competitive equilibrium exists for the data market and pricing model defined above, since the necessary and sufficient conditions for the existence of competitive equilibrium are fulfilled.

Experiments and Results
In order to substantiate the efficacy of the pricing mechanism presented in this study, experiments are conducted on the Adult dataset of the UCI machine learning repository. The dataset consists of 48,842 instances described by 14 attributes. For the experiments, a random sample of 20,000 instances is drawn from the 48,842 instances and 7 attributes are considered. The description of the attributes is presented in Table 1.
In the experiments, 120 buyers are considered, each requesting 8000 samples. For simplicity, the budget of every buyer is set to 1. The initial price $p_i$, maximum price $p_i^{\max}$ and minimum price $p_i^{\min}$ are set in the range [0, 1] for each sample such that the numbers of risk takers, risk-averse persons, risk-neutral persons and privacy regarders are 5000 each.
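A synthetic price setup of this shape could be generated as sketched below. The source does not state how the prices were drawn, so the uniform draws, the random seed, the 0.8 threshold used for the "very high" prices of privacy regarders, and the names `make_owners`, `N_BUYERS`, `SAMPLES_PER_BUYER` and `BUDGET` are all assumptions made for illustration.

```python
import random

N_BUYERS = 120            # number of data buyers in the experiments
SAMPLES_PER_BUYER = 8000  # samples requested by each buyer
BUDGET = 1.0              # identical budget for every buyer


def make_owners(per_type=5000, seed=0):
    """Generate (type, p_min, p, p_max) tuples with prices in [0, 1]."""
    rng = random.Random(seed)
    owners = []
    for _ in range(per_type):
        lo, mid, hi = sorted(rng.uniform(0.0, 1.0) for _ in range(3))
        owners.append(("risk taker", lo, lo, hi))     # p_min = p < p_max
        owners.append(("risk averse", lo, hi, hi))    # p_min < p = p_max
        owners.append(("risk neutral", lo, mid, hi))  # p_min < p < p_max
        # Assumption: a "very high" price is modeled as a draw from [0.8, 1.0].
        high = rng.uniform(0.8, 1.0)
        owners.append(("privacy regarder", high, high, high))
    return owners


owners = make_owners()
```

With the default arguments this yields 20,000 owners, 5000 of each type, matching the sample size used in the experiments.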
In Figure 1, the average number of records sampled at the minimum price, initial price and maximum price is plotted for samples whose information entropy is 1, 0.5 and 0.2. It can be observed that the average number of records sampled is decreasing

Figure 2 presents the average number of records sampled at equilibrium prices for persons who are risk averse, risk neutral, risk taking and privacy regarders. As expected, the average number of records sampled is lowest for privacy regarders and highest for risk-averse persons. Furthermore, risk-neutral persons are sampled less than risk-averse persons but more than risk takers, who in turn are sampled more than privacy regarders.

Figure 3 illustrates the average profits at equilibrium of persons who are risk averse, risk neutral, risk taking and privacy regarders. The average profits are almost the same for persons who are risk averse, risk neutral and risk taking, but much lower for privacy regarders because of the rigid restrictions on their privacy.

Figure 5 shows the variation of the average equilibrium prices for persons who are risk averse, risk neutral and risk taking. The average equilibrium price is highest for risk takers and lowest for risk-averse persons.

Figure 6 plots utility against buyers at equilibrium prices. For illustration, this experiment is conducted with 10 buyers instead of 120. Because all buyers have the same budget and competitive equilibrium maximizes the objectives of all agents simultaneously, the utility is almost the same for all buyers, with little variation.

Conclusions
In a data market, the objective of the data owners is to secure appropriate compensation in return for trading off their privacy. At the same time, the objective of the data buyers is to derive maximum utility from the data obtained. The value derived from the data depends primarily on the information entropy of the data: the higher the information entropy, the higher the utility derived, and vice versa. Hence, in this study, a personal data market model and pricing mechanism based on competitive equilibrium is presented that simultaneously and jointly optimizes the profit of data owners and the utility of data buyers. Experiments validate the propositions and the effectiveness of the proposed approach. The proposed approach is applicable to a static data market. In the future, a solution for a dynamic market based on competitive equilibrium should be studied.