Dixie State University :: Computing and Design



Examining Database Security Methods

Perturbation Based Inference Control Methods

Dissertation

Submitted in Partial Fulfillment of the Requirements for

Doctor of Computer Science

By

Bob Nielson, BS, MS

Colorado Springs, CO

March 2009

Examining Database Security Methods

Strategies to Prevent Inference-Based Snooping in Database Tables

By

Bob Nielson, BS, MS

This Dissertation Proposal is approved by:

David Green, PhD

Cynthia Calongne, DCS

Elaine Waybright, PhD

Date Approved

Abstract

One of the primary purposes of a database is to capture data for future statistical analysis. But savvy users can use such analysis to deduce the confidential information contained in the database. A real need exists to allow general statistical analysis of a database while still preventing the disclosure of detailed confidential information. This paper discusses several methods that attempt to allow safe partial disclosure. These methods include query restriction, data perturbation, output perturbation, conceptual, partitioning, data logging, anonymization, and hybrid methods.

Keywords:

Database security, inference-based snooping, query restrictions, output perturbation, partitioning

Table of Contents

Title Page
Signature Page
Abstract
Keywords
Table of Contents
Table of Tables
Table of Figures
I. Introduction
II. Related Work
Output Perturbation
Data Perturbation
Suppression
Anonymization
Partitioning
Data Logging
Conceptual
Hybrid
Security Evaluation
Methods for compromise
The Impact of current research
Remaining Questions from current research
Future research from Current Research
Conclusions from current research
III. Hypothesis and Proof
IV. Methodology
Output Perturbation
Data Perturbation
Liew Perturbation
Nielson Perturbation
Comparing the methods
No perturbation
Uniform Random Perturbation
Liew Perturbation
Nielson Perturbation
Simulation of databases with arrays
Exploration of the solution space
Evolutionary algorithm
A Closer Look at Nielson Perturbation
The Experiment
Calculating the average fitness score
Monte Carlo Simulation
Calculating the t scores
Appendix 1
References

Table of Tables

4.1 Example of Small Table with Output Perturbation
4.2 Example of Small Table with 20% Data Perturbation
4.3 Example of Small Table with Liew perturbation
4.4 Example of Small Table with Nielson perturbation
4.5 Results of 10 evolutionary runs
4.6 Variables used to calculate the fitness score
4.7 Calculating the mean and standard deviation
4.8 Conclusions to be drawn

Table of Figures

4.1 Fitness Graph with No Perturbation
4.2 Fitness Graph with Uniform Random Perturbation
4.3 Fitness Graph with Liew Perturbation
4.4 Fitness Graph with Nielson Perturbation

4.5 Fitness Exploration with alpha=0, beta=0 and gamma changing

4.6 Fitness Exploration with beta=0, gamma=310 and alpha changing
4.7 Fitness Exploration with alpha=2, Gamma=310 and beta changing
4.8 Fitness of a Nielson Perturbation
4.9 Optimized Nielson Perturbation
4.10 Optimized Nielson Perturbations with a Small Query Size

Examining Database Security Methods 

Strategies to Prevent Inference-Based Snooping in Database Tables

I. Introduction

Today’s businesses rely more and more on the data they have collected. In the past, larger profit margins were common; today profit margins are shrinking. This requires businesses to be more careful when choosing which products to carry. Maintaining the security of those detailed databases is a real challenge. For example, businesses want to be able to access summary information about the salary budget of a department, but they do not wish to disclose the specific salary of an individual. There are several methods that attempt to protect against what may be considered snooping. These methods include query restriction, data perturbation, output perturbation, conceptual, partitioning, data logging, anonymization, and hybrid methods.

If employees can simply query the database to obtain their co-workers’ salary information, then that information can be used to demand a higher wage. The payroll department needs access to this table to run and manage the company payroll. They do not, however, need to access it to look up a co-worker’s salary in order to demand a higher wage. Does a method exist that allows access to the database for normal department work, yet blocks access to co-workers’ salary information?

In the health care field, there is a real need to protect a patient’s privacy. Yet there is also a need to allow for researchers to get summary information. It is acceptable to release that 47 cases of a disease have been reported. But it is not allowed to say that patient X has that disease. In order to allow the first query, a user must have access to the disease field. But in order to stop the second query a user must not have access to that disease field. So there is a real need to allow a user partial access to a field of a database.

An employee who has database access to do his or her job, yet tries to obtain information that he or she should not have, is called a ‘snooper’. There is a real need to stop these snoopers to ensure the confidentiality of the database.

A database table is said to be compromised if a snooper can infer data from that table. The table does not have to be breached to be compromised; if a method exists to infer the data, then the table is compromised. It is then not a matter of how the table will be compromised but a question of when.

The most common solution to the problem of inferring data is to ensure that all SELECTs return more than one record. A SELECT returns an error if only one specific record would be returned. For example, asking the database to return the current salary of employee # 555-55-5555 would return an error message.

Another method is to not respond to queries that involve only one record in a table. This is easily overcome by inferring data in the table. If the snooper knows that there are only 2 men in the payroll department and one of them is himself, then the snooper can query the database to return the average salary of all male employees in the payroll department. Since the snooper knows his own salary, he can mathematically calculate the other employee’s salary: average_salary = (unknown_salary + own_salary) / 2, so unknown_salary = 2 * average_salary - own_salary. This technique can be used to infer the whole table with little effort. It can be shown that an entire table can be inferred given knowledge of only one record in it. This paper presents strategies that address inference-based loopholes in database tables in an attempt to prevent snooping and improve database security.
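As a simple illustration of this arithmetic, the following sketch (with hypothetical names and salary values) shows how one permitted aggregate plus one known record reveals a protected record:

# Inferring a co-worker's salary from one permitted aggregate query.
# The names and salary values are hypothetical illustration data.
salaries = {"snooper": 52000.0, "coworker": 61000.0}

# The DBMS refuses single-record queries but answers this two-record average.
average_salary = sum(salaries.values()) / len(salaries)

# The snooper knows his own salary, so the other record falls out directly:
# average = (unknown + own) / 2  =>  unknown = 2 * average - own
own_salary = salaries["snooper"]
unknown_salary = 2 * average_salary - own_salary

print(unknown_salary)  # 61000.0 -- the co-worker's "protected" salary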

II. Related Work

Several methods to address the security problem have been proposed. They can be broken down into eight categories: output perturbation, data perturbation, suppression, partitioning, data logging, conceptual, anonymization, and hybrid methods.

  Output Perturbation

Output perturbation is modifying the output of a query to stop a snooper.  There are several methods for modifying the output of a query.

One of the output perturbation methods is called the random sample method. In the random sample method the output is based on a random sample of the data, not the entire table. An entity in the table is included with a set probability. This probability can be variable or fixed. When the probability approaches 0 the query set size approaches 1. This approach can be compromised with enough queries. Also, this approach can produce results based on zero or one record (Denning, 1980).

Next are the rounding methods. In these methods the result is rounded. Systematic rounding, random rounding, and controlled rounding have all been investigated. Random rounding can produce the exact result with a small probability. It can also be compromised by averaging a large number of queries. Systematic rounding introduces a non-zero bias to the data. Generally rounding is not considered an effective security-control method (Adam & Worthmann, 1989).

Another method is to insert into the data a random set of records with the same statistics. This maintains all the statistical properties of the table while increasing its size, thereby increasing the security of the table.

Another technique is to add a random value to each output.  The protection level can be adjusted by adding a larger or smaller random value. If the value of a query is 40,000.00 and a random number between -10,000 and 10,000 is added then the value is returned between 30,000 and 50,000.  This provides better protection. But this method can be easily compromised by repeating the query multiple times and taking the average. (Beck, 1980)
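A minimal sketch of this additive scheme and of the averaging attack against it (the protected value and the noise range are illustrative):

import random

TRUE_VALUE = 40000.0  # the protected query answer (illustrative)

def noisy_answer(value, spread=10000.0):
    # Add fresh uniform noise on every call, as the naive scheme does.
    return value + random.uniform(-spread, spread)

# One answer can be off by as much as 10,000 ...
single = noisy_answer(TRUE_VALUE)

# ... but repeating the identical query and averaging washes the noise out.
repeats = [noisy_answer(TRUE_VALUE) for _ in range(100000)]
estimate = sum(repeats) / len(repeats)

print(single)    # a single perturbed answer, e.g. 47312.9
print(estimate)  # very close to 40000.0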

Multiplying the output by a random value protects the data further. Instead of adding a random value to the output of the query, the output can be multiplied by a random value. But this method too can be easily compromised by repeating the query and taking an average: a specific record would have a random error introduced, but a large number of records should have the error corrected.

Another technique is to lie about the value: with a certain probability, simply make up a number. Again, by repeating the query and taking an average this method can be compromised. Beck proposed a scheme in which a query is answered but with a value from the wrong record (Beck, 1980). This method fails because there is no way of knowing which values can be swapped, and the values make no sense after being swapped.

A method proposed by Beck (1980) says that the error must be the same for every query of the same parameters. Otherwise the snooper could just repeat the query several times and compute the average. He proposes to use a hash of the query to adjust the error function. The effect of the error can be amplified or dampened using a constant multiplier. This can be used to tune the accuracy vs. the security of the data.
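One way to realize this idea is to derive the error deterministically from the query itself; in the sketch below the query text is hashed and used to seed the random number generator (the hashing scheme and the scale parameter are illustrative assumptions, not Beck's exact construction):

import hashlib
import random

def perturbed_answer(query_text, true_value, scale=0.05):
    # Derive a deterministic seed from the query text itself, so the
    # same query always receives the same error and averaging gains nothing.
    seed = int(hashlib.sha1(query_text.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    error = rng.uniform(-scale, scale) * true_value  # 'scale' tunes accuracy vs. security
    return true_value + error

q = "SELECT AVG(salary) FROM employees WHERE dept = 'payroll'"
print(perturbed_answer(q, 48000.0))
print(perturbed_answer(q, 48000.0))  # identical to the first answer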

It has been shown that output perturbation produces limited security. It has also been shown that these perturbations modify statistical calculations (Reiss, 1980).

It is difficult to perturb Boolean values. If a field is a Y/N type field then how can you change that field to N and a half?  Output perturbations are relatively easy to do and can be done on the fly (Adam & Worthmann, 1989).  It is also easy to support additions and deletions for the table.

Data Perturbation  

In the data perturbation methods the database is transformed into another database. The transformed data is then used for statistical analysis. This transformed table needs to be maintained by the system or the database manager.

In data perturbation methods, an error is introduced into the data. With larger data sets the data error approaches zero. In smaller data sets the error is larger. In order to introduce an error to the data a bias factor is introduced. This factor can skew the summary statistics of the data. Both the data error and the bias factor need to be reduced while at the same time they must secure the data.

In the first method, the data is exchanged with another set of data with the same probability distribution. In this method the data that is to be protected is replaced by randomly generated data with the same probability distribution. There are several problems that need to be addressed with this approach:

1.      The data swapping is computationally intensive and is only suited for static data.

2.      If the data is changed by adding or deleting a record, consideration needs to be made with its relation to the rest of the data.

3.      There needs to be a one to one relationship between the original data and the perturbed data.

4.      The precision of the resulting data may be unacceptable.

5.      Small query sizes of 1 to 2 records need to be considered.

An alternative to this approach is proposed by Liew (1985). He proposes three steps: Step 1 is to identify the underlying density function and estimate the size, mean and standard deviation of the data. Step 2 is to generate a sample set of data with the same size, mean, and standard deviation.

Step 3 is to substitute the real data with the generated data, using the same rank order. The largest value should be swapped with the largest value of the generated data.
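A minimal sketch of these three steps, assuming a normal underlying density (the method as described first identifies the actual density function; the normal distribution, the sample values, and the liew_perturb name are illustrative assumptions):

import random

def liew_perturb(values):
    # Step 1: estimate size, mean, and standard deviation of the real data.
    n = len(values)
    mean = sum(values) / float(n)
    std = (sum((v - mean) ** 2 for v in values) / float(n)) ** 0.5

    # Step 2: generate a synthetic sample with the same size, mean, and
    # standard deviation (here drawn from a normal distribution).
    synthetic = [random.gauss(mean, std) for _ in range(n)]

    # Step 3: substitute rank for rank -- the largest real value is replaced
    # by the largest synthetic value, the second largest by the second, and so on.
    ranked_real = sorted(range(n), key=lambda i: values[i])
    ranked_syn = sorted(synthetic)
    perturbed = [0.0] * n
    for rank, idx in enumerate(ranked_real):
        perturbed[idx] = ranked_syn[rank]
    return perturbed

print(liew_perturb([81, 91, 92, 94, 100, 101, 103, 103, 113, 122]))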

This method, however, introduces a bias that is not present in the original data and is therefore not suitable for detailed statistical analysis. The method shows promise and is commonly used. However, the data is not really perturbed very much, since the perturbed data has the same mean, standard deviation, and rank order as the original data. Therefore it might not be appropriate for certain applications.

Tendick and Matloff (1984) suggest a matrix-based approach. They suggest creating matrices of the data and multiplying them by a key matrix of constants to further secure the data. The parameters of the constant matrix can be fine-tuned to balance the accuracy and the security of the data. That matrix can be multiplied by another matrix of noisy values; changing those values can add a varying degree of error (Ketel, 2005) (Schl, 1981). Matrix methods are more difficult to break since they depend on more keys to decode the data.

A random factor can be added or multiplied to each value in the table.  This random factor can be used to scale the error up or down.  A higher multiplication factor would result in a higher error rate and a more secure perturbation.  This approach adds noise that has to be corrected to gain statistical information. This method suffers in terms of scale.

Data perturbation methods have several weaknesses. First, this approach is suitable only for off-line databases such as data warehouses and data marts. It also does little or nothing to prevent small query sizes. The addition and deletion of records needs to be considered. Sometimes the precision level is unacceptable (Adam & Worthmann, 1989). Tuab suggests that perturbation suffers in terms of scale (Palley, 1987) (Liew, 1985).

Suppression

Suppression works by suppressing some of the data values in a database table.  For example, in a table of employees and their salaries, the name of each employee would be suppressed. Several methods exist in order to protect sensitive data.

One method is to answer only N queries per day on a table. This type of protection is subject to a timed attack over several days. It will slow down the attack, but it will not prevent the attack from happening.

A second method is called cell suppression. In this method the basic idea is to suppress the cells that might cause the database to be compromised. If the salary of an employee is sensitive then do not allow query responses on that data. This method might ‘suppress’ the very information that a valid user may require.

Query size control works by not responding to queries with the number of records less than a predefined minimum limit.  Some have proposed that queries must include ALL records in a table to be secure (Yu, 1977).  It has been shown that for a table with L records, restricting queries to include at least L/2 records is NOT secure. (Adam & Worthmann, 1989) (Beck, 1980)  It has been shown that as the query set size increases the accuracy of the statistics decreases. (Palley, 1987)

Random sampling works by sampling the records in the query instead of using all the matching records to answer the query.  There is no way to detect if a specific record is in the query. (Palley, 1987)  (Liew, 1985)  A large number of samples have to be used to ensure accuracy. (Palley, 1987)  Random sampling is a poor way to estimate extremes. (Rowe, 1983)  Some feel that this research area shows promise. (Beck, 1980)  Random sampling can be compromised by repeating the query multiple times and averaging the response.

Suppression methods can increase the security of a database; however, they are not considered a foolproof way of protecting data. These methods are commonly combined with other methods to increase security, and they are relatively easy to implement.

Anonymization

This method works by seeking out common field values in the records of a table and replacing those values with a common special symbol (Kifer and Gehrke, 2006). A quasi-identifier is a set of field values that can be used to identify a specific individual. A table is said to be k-anonymous if each combination of quasi-identifier values appears in at least k records. The process of optimal k-anonymization is known to be NP-hard, which means there is no known efficient method for choosing which values to replace. However, there are methods that give a good solution, if not the best solution (Zhong and Yang, 2005).
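A small sketch of checking the k-anonymity property over a chosen quasi-identifier (the patient records, field names, and the is_k_anonymous helper are hypothetical):

def is_k_anonymous(rows, quasi_identifier, k):
    # Count how many records share each combination of quasi-identifier values.
    counts = {}
    for row in rows:
        key = tuple(row[field] for field in quasi_identifier)
        counts[key] = counts.get(key, 0) + 1
    # Every combination must occur at least k times.
    return all(count >= k for count in counts.values())

# Hypothetical records; zip code and age group form the quasi-identifier.
patients = [
    {"zip": "84770", "age_group": "30-40", "diagnosis": "flu"},
    {"zip": "84770", "age_group": "30-40", "diagnosis": "asthma"},
    {"zip": "84790", "age_group": "50-60", "diagnosis": "flu"},
]

print(is_k_anonymous(patients, ("zip", "age_group"), 2))  # False: one group has only 1 record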

Cell suppression suppresses a cell in a table that is deemed sensitive (Dinur and Nissim, 2003), for example a cell that contains an employee name in a table of employees and their salaries. The original table can be perturbed to contain a '?' in all the cells of the employee name column (Verykios, 2004). This method can be compromised because most suppressed cells can be inferred from the other data in the record; would it be hard to guess the name of an employee who has a specific phone number or address? A method to suppress sensitive cells already exists (Adam & Worthmann, 1989).

A DBA can also use the concept of a view to accomplish this. A view can hide the existence of a given field or record of a database table. A view can also be used to make visible summary information only. But if a snooper has knowledge of the parameters of a view, it is easier to launch an inference attack on the data.

 Partitioning

Partitioning is based on separating the entities in a database into mutually exclusive subsets. As long as the atomic population of each of the subsets is not one, a high level of security exists. But if the snooper has knowledge of insert, update, and delete methods, a whole new array of attacks is opened up.

One of the most common methods is to ensure that all query results use several records in the database. As the number of records required to compute the result increases, so does the security of the database. So just how many records should be required? If a table has L records and a query requires K records to produce an output, then even as K approaches L/2 the scheme has been shown to be insecure. If R denotes the maximum number of overlapping pairs of queries, then the number of queries needed is 1+(K-1)/R (Adam & Worthmann, 1989). A subset with a cardinality of 2 is not secure if the snooper knows one of the records. Care needs to be exercised to ensure that all subsets are mutually exclusive. If a specific record appears in more than one subset, a snooper can use the common records to derive the other records (Reiss, 1984).

Large numbers of databases have been studied, and it has been shown that partitions of size one are frequent. Sometimes it is difficult to achieve a correctly partitioned database (Adam & Worthmann, 1989).

In the partitioning method proposed by McLeish (1989) the system follows three basic rules for queries. The first is to allow sum and count queries only if they cover an entire partitioned set. The second is to allow deletions and insertions in pairs only; if single additions or deletions were allowed, the snooper could obtain a sum, delete one record, and subtract the two results to deduce the protected value. And finally, if a record is deleted then it cannot be reinserted into the database; otherwise a snooper could delete all the records he is not interested in, use the remaining records to obtain protected information, and then reinsert the deleted records to cover his tracks.

McLeish (1989) proves that, “A generalized dynamic statistical database remains secure for sum and count queries made before and after each update if and only if the global effective change is not equal to one.”

Data can be split over several tables.  Each partition can be in a separate table or in the same table (Verykios, 2004).

Each table can be broken into several partitions.  Each query needs to use the records from each partition to be serviced  (Chin, 1978).

Dummy entries can be added to each partition to further protect the data from a snooper.  This method can lead to statistical errors. (McLeish, 1989)

Partitioning data is computationally complex.  It takes a lot of CPU time to accomplish this. Advances in Multi-Processing can speed this process up by 2-3 times. (Baru  and Su, 1984) This method is difficult to implement in non-numeric fields. (Beck, 1980)  It also leads to severe data distortion. (Beck, 1980)  Records can be safely added or deleted as long as the cardinality of each partition is not equal to 1. (McLeish, 1989)  But it is common to have instances of databases that have several partitions with a cardinality of 1.  (Adam & Worthmann, 1989)   

Data Logging

Data logging works by keeping a log of all queries and answering only those that will not compromise the data. This method, however, is very CPU and disk intensive, and it slows down query response significantly. If this method is used over a long period of time, all queries will eventually be blocked, because a query run a week ago, combined with the current query, could result in a breach.

In a data logging system all queries are logged.  If a query is proposed, the log is checked for query overlapping.  This method works by making sure that no two queries have an overlap of more than a predetermined number of records. (Chin, 1978)

As the number of queries logged goes up, the number of comparisons required to check whether a query is permitted goes up as well. Checking whether the results of query 2 can be released, given that query 1 has already been released, is NP-complete (Machanavajjhala and Gehrke, 2006). All overlaps of queries need to be logged and checked for compromise (Liew and Liew, 1985).

There has been some research into creating a set of rules that, if followed, limit but do not eliminate compromises (Verykios and Bertino et al., 2004).

It has been shown that logging methods can work on smaller databases (Palley and Simonoff, 1987) but larger databases pose complexity problems.  It has also been shown that NOT answering a query can lead to a database compromise (Sicherman and Jonge  et al., 1983).

Conceptual

In the conceptual approach a framework of data is described from creation to implementation. It is based on a collection of entities that have common attributes. There is no data manipulation language (such as relational algebra) allowed to merge or intersect the entities. The only data that the snooper has access to are the precompiled statistics of the common attributes.

The framework is described by the following rules:

1.      Each subpopulation has a definition construct that contains allowed query types and histories of allowed queries.

2.      A construct of user knowledge is maintained. The creation of the partitions in the table is based on what a user knows outside the framework of the data.

3.      A constraint enforcer and checker are implemented to make sure that no data is compromised.

4.      Before any modification to the rules is permitted a test set of data is constructed and tested.

5.      A question-and-answer system is the only method by which the users may communicate with the DBMS (Ozsoyoglu, 1985).

To better manage the security of a database, formal rules need to be created and maintained. One such system of rules has been proposed. It consists of several parts. The first part is the PDC (population definition control), which is concerned with the population definition. The second is the UKC (user knowledge construct), which is concerned with logging all knowledge that each user has. The third is the CEC (constraint enforcer and checker), a list of rules or constraints that must be followed to ensure database safety (Chin and Ozsoyoglu, 1981).

Other protection specification languages have been proposed. (Hartson  and Hsiao, 1976) This is a control of who has access in the population to the database.  All users with access are listed and well defined.

Next, a list of all information that is known by each of the users of the database is compiled. All possible inferences should be analyzed (Adam & Worthmann, 1989).

Protecting data completely is of great complexity (Hanson, 1995).  But we can better protect the data by listing and defining access and associated access rules.

This conceptual model is an ambitious endeavor and has not been completely implemented yet. It is, however, a great foundation on which to base further research.

Hybrid

In creating a security model for a database, the speed at which the data can be transformed and output is a major consideration. Some methods require an inordinate amount of time to complete, which is not acceptable to the users of the database. It has been shown that data-perturbation methods take a significant amount of time to transform the data, and there are still problems with inserting and deleting records, so they are only appropriate for static data tables. A whole class of methods that can be done in memory, called memoryless inference controls, emerges as a possibility worth investigating, but these controls are known to be insecure. The most common security control is data tracking, in which a log of all previous queries is maintained and queries that can infer the protected data are blocked. This is known to be exponential in complexity; therefore it will take too long to complete. Maybe one could try a combination of several types of controls at the same time. Controls that are memoryless can be combined (Hanson, 1995).

Some of these techniques are:

1.      Maximum Order Control – This is based on the assumption that tables based on fewer attributes tend to release fewer sensitive statistics. Allowing statistical values to be computed only for tables with a few attributes increases the security of the whole database.

2.      Maximum Relative Table Size – This is based on the assumption that the larger the table, the more secure it is. It has been shown that restricting queries in this way alone can be overly restrictive.

3.      Partial Table Suppression – Partial table suppression selects a value of k for the expected density for a given partition within the table and does not release queries that fall below that level. K is assumed to have a value greater than 2.

4.      Query Set Size Control – This seeks to limit the queries based on the relation between the query size and the table size. This control has very little value if used alone (Hanson, 1995).

Since these methods are computationally quick and easy to implement, combining all of them makes sense. First, one would run the query through the maximum order control. Second, one would try the maximum relative table size control. If the query passes the first two, one would try the query set size control. Finally, one would try the partial table suppression control. If the query passes all four of these controls, then the results of the query are released (Hanson, 1995).
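A sketch of chaining these four controls in the order just described (the function name, the thresholds, and the particular reading of each control are illustrative assumptions, not Hanson's specification):

def release_query(num_attributes, query_size, table_size, partition_density,
                  max_order=3, min_size=5, max_fraction=0.9, min_density=3):
    # 1. Maximum order control: reject queries built over too many attributes.
    if num_attributes > max_order:
        return False
    # 2. Maximum relative table size: reject queries whose result set covers
    #    too large a share of the table (one reading of this control).
    if table_size == 0 or float(query_size) / table_size > max_fraction:
        return False
    # 3. Query set size control: the result set must contain a minimum number of records.
    if query_size < min_size:
        return False
    # 4. Partial table suppression: the partition the query falls in must have
    #    an expected density of at least k (greater than 2 in the text).
    if partition_density < min_density:
        return False
    return True

print(release_query(num_attributes=2, query_size=300, table_size=1000, partition_density=5))  # True
print(release_query(num_attributes=5, query_size=300, table_size=1000, partition_density=5))  # False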

Since each of the approaches in general cannot guarantee protection, several researchers are trying to combine several different approaches together to see if that increases security.

Hybrid approaches do give more security but even this security is not foolproof.  The entire security scheme can be compromised by compromising each of the sub-approaches.  This type of security approach also has a processing power cost.  The processing power of the system needs to be addressed as more and more security schemes are implemented (Hanson, 1995).  

Security Evaluation

In order to evaluate whether a security method is effective, criteria for the measurement of security need to be developed. The following areas need to be evaluated: performance, access ease, uncertainty, and resistance to mining (Verykios and Bertino, et al. 2004). Others have suggested that effectiveness, feasibility, and efficiency also need to be evaluated (Yu and Chin, 1977).

Also, the evaluation of a security method needs to be defined in terms of a scale. In a table of salaries, a $4,000.00 error is not much in a $120,000.00 salary, but a 4,000-point error on a test percentage ranging from 1-100 is a very significant error (Palley and Simonoff, 1987).

Based on this review of the literature, a security method should pass the following tests:

1.      The same query must produce the same answer every time it is executed. If this is not the case, then a user could run the query multiple times and average the answers, which would lead to a compromise.

2.      The query must produce a more accurate answer as more records are included, and it must distort the answer more as fewer records are included.

3.      All queries must produce results. Information can be inferred if the system does not respond to a query.

  Methods for compromise

A database can be compromised by asking several overlapping queries. Suppose a table has 3 records of interest in it, query one returns the total of records 1 and 2, query two returns the total of records 2 and 3, and query three returns the total of records 1 and 3. The snooper can then use simultaneous equations to compromise all three records (Denning and Denning, 1979). Overlapping queries in this manner can lead to data compromise (Schwartz and Denning, et al., 1979). These overlaps can be logged and managed to further increase the protection of the data. If the snooper knows just one record in a database, then with a series of overlapping queries the entire table can be deduced (Chin, 1978).
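A small worked example of this kind of attack, assuming all three pairwise sums are answered (the totals below are hypothetical values):

# Three pairwise-sum queries over three protected records (hypothetical totals).
q1 = 95000.0   # record1 + record2
q2 = 101000.0  # record2 + record3
q3 = 98000.0   # record1 + record3

# Adding all three counts every record twice, so the grand total is (q1+q2+q3)/2,
# and subtracting each pairwise sum isolates the remaining record.
total = (q1 + q2 + q3) / 2.0
record3 = total - q1
record1 = total - q2
record2 = total - q3

print(record1)  # 46000.0
print(record2)  # 49000.0
print(record3)  # 52000.0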

The Impact of the current research

The methods described have been shown to increase the security of a database system. But none have been shown to ensure the complete safety of a system. Different methods work best in different situations; none work best in all situations.

There is a trade-off between the security of a database and the needs of the users of the database. Every query allowed is also one more security threat. Time needs to be spent balancing the needs of the users and the security needs of the company in any DBMS.

There is also a trade-off between response time of a query and the security needs of the company. Most security methods take time to execute, and therefore time also needs to be spent balancing these needs too.

Remaining Questions from current research

After a careful review of the literature, some questions can be answered. Methods do exist that will guarantee the security of a database from snoopers. However, these methods exact a high price: they make extracting summary information either impossible or impractical.

A balance needs to be struck between a company's need to secure its data from snoopers and the need to allow limited statistical analysis. The balance point is different for each business.

A balance also needs to be struck between the need for quick responses to queries and the need to provide some security. Users demand speedy access to the data, and most security protocols slow down the response to users’ queries. The balance point for this trade-off depends on how much data needs to be protected and secured.

Future Research from Current Research

Although a lot of research has been conducted, primarily in the 1980’s, there still exists room for other areas of research. Today’s computers are bigger and faster, so more can be done to secure them. Additional security algorithms exist and need to be implemented and tested to see how much they can secure a database from snoopers.

Here are some areas for further research:

1.      Testing the security of different methods – A small database can be constructed with a small number of records. All the different methods could be implemented to see if the database could become compromised. If a small database can be compromised then a larger one can become compromised with a little more effort.

2.      Testing the encoding time of databases - A larger database can be constructed with millions of records. All the different methods could be tested to see how much time is required to encode the new database

3.      Testing the response time of queries – Several databases could be constructed to measure the response time of sample queries. This could be compared to the response time of the same queries but in the ‘secure’ database.

4.      Testing the accuracy of different methods – Some methods perturb the results. A test could be performed to test the accuracy of responses to the perturbed results.

5.      Hybrid methods – Several methods could be combined to produce results. Query response times, encoding times, security scores, and accuracy of results need to be measured and compared.

6.      Other methods – Other methods are on the horizon. These methods need to be described, documented, and tested.

Conclusions from current research

In summary, the following seven points proposed by Adam and Wortmann (1989) need to be considered before implementing any method for securing sensitive information in a database:

1.      No single method exists that will work for all tables. There is a trade-off between high levels of security and a high richness of data analysis.

2.      Most security methods treat the database as a file. Inter-relations between the elements of the database need to be considered.

3.      For dynamic online databases, data perturbation methods tend to add a higher level of security. Careful consideration needs to be made concerning the addition and deletion of records.

4.      For a static online database, data swapping seems promising although the bias errors need to be eliminated.

5.      Cell suppression is commonly used for static offline databases but it suffers from security concerns. Random sample methods seem promising in these applications.

6.      New types of attacks are being used every day. New defenses need to be researched to combat these threats.

7.      Finally, there is no single security control method that prevents both exact and partial disclosures.

  The only truly secure database is one where no one can access it. This is impractical so a compromise needs to be reached.

III. Hypothesis and Proof

The Nielson perturbation method is significantly better at not disclosing the actual value, at the 95% confidence level, than random output perturbation, data perturbation, and Liew perturbation. It is easier to establish this by attempting to disprove the corresponding null hypothesis, namely that the Nielson method is equivalent to or worse than the other methods. If the null hypothesis is rejected, the Nielson method is better.

The hypotheses that will be tested include:

H1a: The Nielson perturbation method results in a significantly better fitness score than the output perturbation method.

H1b: The Nielson perturbation method results in a significantly better fitness score than the data perturbation method.

H1c: The Nielson perturbation method results in a significantly better fitness score than the Liew perturbation method.

To test this, a fitness score will be assigned to 10,000 random queries on 10,000 random databases. A smaller fitness score is better because the fitness score is calculated by taking the desired error level and subtracting the actual error level.

A Student’s t-test will be conducted on the average fitness scores. The test will be performed at the 95% confidence level.

IV. Methodology

Several methods of random perturbation exist to protect against the inference attacks of snoopers: output perturbation, data perturbation, Liew perturbation, and Nielson perturbation. Each will be described, implemented, and evaluated.

Output Perturbation

Perturbation is a method to prevent inference attacks. Perturbation works by changing the data in a database table. There are two types of perturbation. The first changes the actual data in the table. The second just changes the output of the queries posed. Here is an example of a small database table with output perturbation:

| # |Data |Output Perturbed Data |
|1 |101 | |
|2 |92 | |
|3 |103 | |
|4 |91 | |
|5 |100 | |
|6 |81 | |
|7 |122 | |
|8 |113 | |
|9 |103 | |
|10 |94 | |
|mean |100 |85 |
|s |11.6 | |

Table 4.1 Example of Small Table with Output Perturbation

Data Perturbation

Data Perturbation creates a second table that is perturbed or changed. For example a table of employees and their salaries can have all the salaries changed by adding or multiplying a random number. If that number is big, then there is a lot of distortion in the table. If that number is small, then the distortion is small. If a query accesses a small number of records then the distortion should be big.

There have been several methods proposed to perturb the data. The common methods multiply or add a random number to each data field. Here is an example of data perturbation with a 20% random factor added in:

| # |Data |20% Perturbed Data |Error |
|6 |81 |96 |15 |
|4 |91 |95 |4 |
|2 |92 |79 |13 |
|10 |94 |89 |5 |
|5 |100 |114 |14 |
|1 |101 |105 |4 |
|3 |103 |120 |17 |
|9 |103 |106 |3 |
|8 |113 |129 |16 |
|7 |122 |120 |2 |
|mean |100 |105.3 |9.3 |
|s |11.6 |15.7 |6.1 |

Table 4.2 Example of Small Table with 20% Data Perturbation

Liew Perturbation

A more comprehensive method exists. It is called the Liew method. The first step in the Liew method is to calculate the number, mean, and standard deviation of the original data. Random data is generated with the same number, mean, and standard deviation as the original data. Both copies of the data are then sorted and substituted for each other. Here is an example of Liew perturbation:

| # |Data |Liew Perturbed Data |Error |
|6 |81 |87 |6 |
|4 |91 |93 |2 |
|2 |92 |97 |5 |
|10 |94 |99 |5 |
|5 |100 |99 |1 |
|1 |101 |101 |0 |
|3 |103 |105 |2 |
|9 |103 |115 |12 |
|8 |113 |118 |5 |
|7 |122 |121 |1 |
|mean |100 |103.5 |3.9 |
|s |11.6 |11.1 |3.5 |

Table 4.3 Example of Small Table with Liew perturbation

Nielson Perturbation

This paper compares the methods of data perturbation and introduces a new method, referred to here as the Nielson method. The Nielson method works by generating a random number between two values called alpha and beta and randomly negating it. That number is then multiplied by the original value, and the product is added to the original value to produce the perturbed value. This is done only for the first n records of a query; the value of n is called gamma. This ensures maximum perturbation on smaller data sets and minimum perturbation on larger data sets, and gamma controls the cutoff between smaller and larger data sets. Here is an example of some data with Nielson perturbation (a code sketch of the perturbation step follows Table 4.4):

| # |Data |Nielson Perturbed Data |Error |
|6 |81 |72 |9 |
|4 |91 |81 |10 |
|2 |92 |81 |11 |
|10 |94 |76 |18 |
|5 |100 |116 |16 |
|1 |101 |86 |15 |
|3 |103 |83 |20 |
|9 |103 |89 |14 |
|8 |113 |98 |15 |
|7 |122 |106 |16 |
|mean |100 |88.8 |14.4 |
|s |11.6 |13.8 |3.5 |

Table 4.4 Example of Small Table with Nielson perturbation
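The perturbation step itself is short. The sketch below mirrors the nielson() routine in Appendix 1, with a small wrapper (nielson_table, an illustrative helper) that perturbs only the first gamma values of a result set:

import random

def nielson_perturb(value, alpha, beta):
    # Draw a multiplier between beta and alpha and negate it half the time,
    # then scale the original value by (1 + multiplier).
    factor = random.uniform(beta, alpha)
    if random.uniform(-1, 1) < 0:
        factor = -factor
    return value + factor * value

def nielson_table(values, alpha, beta, gamma):
    # Only the first gamma records are perturbed; later records pass through.
    return [nielson_perturb(v, alpha, beta) if i < gamma else v
            for i, v in enumerate(values)]

print(nielson_table([81, 91, 92, 94, 100, 101, 103, 103, 113, 122],
                    alpha=2.09, beta=1.18, gamma=5))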

Comparing the methods

In order to compare the methods, a way is needed to measure the ‘fitness’ of each method. The measure used is based on the average absolute error. For smaller queries a large error is desirable; for larger queries a small error is desirable. For queries covering less than half the database, the absolute error is subtracted from 100; for queries covering half the database or more, the absolute error itself is used as the score. In both cases, a fitness score closer to zero is better.
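A sketch of this scoring rule for a single query, consistent with the per-query accumulation used in Appendix 1 (there the simulated database holds 1,000 records, so 500 is the cutoff):

def fitness(true_answer, perturbed_answer, query_size, db_size=1000):
    error = abs(true_answer - perturbed_answer)
    if query_size < db_size / 2:
        # Small queries should be heavily distorted, so a large error
        # brings the score toward zero.
        return 100 - error
    else:
        # Large queries should be accurate, so the error itself is the score.
        return error

print(fitness(100.0, 62.0, query_size=10))    # error 38 -> score 62
print(fitness(100.0, 99.2, query_size=800))   # error 0.8 -> score 0.8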

Obviously one cannot just test one database and expect reliable results, so to test the fitness of the methods a Monte Carlo simulation will be run. In a Monte Carlo simulation a very large number of random tests are run; this is done to eliminate the effects of the randomness of the individual test cases. The central limit theorem states that if samples are repeatedly drawn from any distribution and their averages are taken, the distribution of those averages approaches a normal distribution as the sample size goes up. Since millions of databases will be used, the averages should be normally distributed. Two averages can then be compared, to see if they come from the same distribution, with a simple Student’s t-test.
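A sketch of comparing two average fitness scores from summary statistics (this is a standard two-sample t statistic; the numbers passed in below are placeholders, not experimental results):

import math

def two_sample_t(mean1, std1, n1, mean2, std2, n2):
    # Welch-style t statistic computed from summary statistics only.
    se = math.sqrt(std1 ** 2 / n1 + std2 ** 2 / n2)
    return (mean1 - mean2) / se

# Placeholder summary values for two perturbation methods.
t = two_sample_t(mean1=50.0, std1=5.0, n1=100000,
                 mean2=48.8, std2=4.9, n2=100000)
print(t)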

No perturbation

Figure 4.1 is a graph of fitness scores for 10,000 random queries with no data perturbation. The optimal perturbation method will show up as a row of zeros. A large absolute error for the first ½ of the data is desirable. A small absolute error is desirable for the second ½ of the query sizes. Again the closer to zero the fitness score is the better it is. Notice that the data is grouped close together. This is true since there is no random perturbation in the data.


Figure 4.1 Fitness Graph with No Perturbation

Uniform Random Perturbation

Next here is a graph of 1,000 random queries with 20% uniform random perturbation (Figure 4.2). The data is grouped close together but has some spread.


Figure 4.2 Fitness Graph with Uniform Random Perturbation

Liew Perturbation

Figure 4.3 is a graph of 1,000 random queries with Liew perturbation applied. The data is grouped together but has a bit more spread.


Figure 4.3 Fitness Graph with Liew Perturbation

Nielson Perturbation

Figure 4.4 is a graph of data with Nielson perturbation applied. Values of alpha=2.09, beta=1.18, and gamma=66.87 were used. Notice how there is a large random spread with smaller queries and a closely grouped spread with larger queries. This is desirable since the goal is to perturb smaller data sets while preserving larger data sets. It is with smaller data sets that inference attacks occur.


Figure 4.4 Fitness Graph with Nielson Perturbation

Simulation of databases with arrays

Databases use hard disks to store the data values. Hard disks are notably slow in access times. Because of this an array is used to simulate the database table. Each row in the array represents a record in the database table. The speed of access is not being compared. Rather the relative error is being compared. Because of this an array can be used to minimize the access time of the hard disk drive. Remember millions of database tables need to be simulated.
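A sketch of this array-backed simulation, in the spirit of the createdb() helper in Appendix 1 (1,000 uniformly distributed values per simulated table; the average_query helper is illustrative):

import random

def create_table(num_records=1000, low=50, high=150):
    # Each list element stands in for one record's protected numeric field.
    return [random.uniform(low, high) for _ in range(num_records)]

def average_query(table, query_size):
    # Answer an average over the first query_size simulated records.
    selected = table[:query_size]
    return sum(selected) / len(selected)

table = create_table()
print(average_query(table, 25))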

Exploration of the solution space

Next let’s look at some of the possible solutions and evaluate whether they are better or worse than other solutions. The conventional approach is to set alpha=0 and beta=0 and then find the best gamma. Step two involves setting gamma to that best value, keeping beta=0, and finding the best alpha. The final step is to set alpha and gamma to their best values and find the best beta.

First let’s set alpha=0 and beta=0 and explore the range of gamma:


Figure 4.5 Fitness Exploration with alpha=0, beta=0 and Gamma changing

Next let’s set gamma to its optimal value = 310 in the graph and set beta=0 and look at alpha:


Figure 4.6 Fitness Exploration with beta=0, gamma=310 and alpha changing

Next let’s set gamma=310 and alpha to its optimal value=2 in the graph and explore beta values:


Figure 4.7 Fitness Exploration with alpha=2, Gamma=310 and beta changing

Figure 4.8 includes fitness results of 1,000 random databases with Nielson perturbation applied:


Figure 4.8 Fitness of a Nielson Perturbation

As the query size goes up the randomness of the values decreases, which is the desired outcome.

Evolutionary algorithm

But what is the optimal value of alpha, beta, and gamma? An evolutionary algorithm is next employed to find the best values of these parameters. It works by starting at a random point and then adding a random factor to alpha, beta, and gamma. It then runs a simulation to see the average fitness score. If the result is better it then uses the new values and starts over again. If the result is worse than the original then it uses the old values and starts over. In order to explore the other values of the parameters there is a 5% probability of a random restart. This process continues for 1000 generations. The best scores are then displayed. Because each generation requires a simulation of millions of databases this process takes a significant amount of CPU power and time to complete.
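A simplified sketch of this search loop (the evaluate argument stands in for the Monte Carlo fitness simulation performed by the ai() routine in Appendix 1; the starting point and step sizes are illustrative assumptions):

import random

def evolve(evaluate, generations=1000, restart_prob=0.05):
    # Start from an arbitrary point in (alpha, beta, gamma) space.
    best = (1.0, 0.5, 10.0)
    best_score = evaluate(*best)
    current = best
    for _ in range(generations):
        if random.random() < restart_prob:
            # Occasional random restart so the search can leave a local optimum.
            candidate = (random.uniform(0, 3), random.uniform(0, 3), random.uniform(1, 1000))
        else:
            # Otherwise take a small random step from the last accepted point.
            a, b, g = current
            candidate = (a + random.uniform(-0.1, 0.1),
                         b + random.uniform(-0.1, 0.1),
                         g + random.uniform(-10, 10))
        score = evaluate(*candidate)
        if score < best_score:        # smaller average fitness is better
            best, best_score = candidate, score
            current = candidate       # keep the improvement
        # otherwise keep searching from the previous point
    return best, best_score

# Example call with a dummy objective (illustration only):
# evolve(lambda a, b, g: abs(a - 2) + abs(b - 1) + abs(g - 60) / 100, generations=200)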

Results were computed for ten different runs. Since each run simulated 100,000,000 database queries, each run took about 48 hours to compute. Here are those results:

|Alpha |Beta |Gamma |Fitness |
|1.50 |1.11 |335.38 |46.30 |
|1.75 |0.75 |269.71 |46.49 |
|1.33 |1.27 |233.69 |46.41 |
|1.40 |1.33 |382.70 |46.43 |
|2.09 |1.18 | 66.87 |46.20 |
|1.34 |1.24 |154.18 |46.55 |
|1.33 |1.16 | 60.31 |47.26 |
|1.48 |0.93 |193.00 |46.90 |
|1.63 |1.09 |105.35 |46.70 |
|1.29 |0.97 |106.76 |46.92 |

Table 4.5 Results of 10 evolutionary runs

From the data it looks like there is a relatively flat optimization front, but there is a definite best run at alpha=2.09, beta=1.18, and gamma=66.87. The optimal value of the gamma parameter is not as clear.

A Closer Look at Nielson Perturbation

Figure 4.9 is a graph of 1,000 random queries by size and fitness function with these parameters. Note that data close to zero is desirable.


Figure 4.9 Optimized Nielson Perturbations

It is in data with smaller query sizes that a real danger of inference attacks is present. In Figure 4.10 note that there is a large randomness factor at query sizes less than 40 records. Smaller fitness scores are best.


Figure 4.10 Optimized Nielson Perturbations with a Small Query Size

The results shown in Figure 4.10 are desirable. Also note that with real small queries in the range of 1-3 records major distortion is shown.

The experiment

Now all that is left is to conduct the actual experiment and answer the question: “Is the Nielson perturbation method better than the output perturbation method, data perturbation, and Liew perturbation methods?” The preliminary items have been conducted. All that is left is to simulate with arrays 1,000,000 random queries for each of the types of perturbation and test using a Student’s t-test with a confidence factor of 95% for each perturbation type. The parameters to be used for the Nielson method are alpha=2.09, beta=1.18 and gamma= 66.87. The Liew method does not have any parameters. That is one of the reasons that it is so popular. The output perturbation method has a fixed distortion value. A value of 20% will be used here. Also the data perturbation method has a distortion value. A value of 20% will also be used here.

Calculating the average fitness score

Table 4.6 includes the variables that will be used in the calculation of the fitness score symbolized by f:

|Variable |Description |
|D |The actual data value |
|P |The perturbed data value |
|q |Query Size |
|b |Database Size |
|f |Fitness: if q < b/2 then f = 100 - abs(D - P), otherwise f = abs(D - P) |

Table 4.6 Variables used to calculate the fitness score

V. Results

|method |count |ave |s |t |df |p |
|no perturbation |100,000 | | | | |>99% |
|Liew perturbation |100,000 |50.11 |4.91 |132.92 |199,984 |>99% |
|Nielson perturbation |100,000 |48.79 |4.95 | | | |
|data perturbation |100,000 |49.86 |4.96 |107.37 |199,997 |>99% |

Table 5.1 Results First Run

To be certain, the experiment was repeated two more times, for a total of three runs. The results of the two repeated experiments are shown in Table 5.2 and Table 5.3.

|method |count |ave |s |t |df |p |
|no perturbation |100,000 |49.93 |4.99 |117.56 |199,983 |>99% |
|Liew perturbation |100,000 |50.08 |4.90 |132.76 |199,979 |>99% |
|Nielson perturbation |100,000 |48.76 |4.95 | | | |
|data perturbation |100,000 |49.95 |4.95 |107.31 |199,997 |>99% |

Table 5.2 Results Second Run

|method |count |ave |s |t |df |p |
|no perturbation |100,000 |49.96 |4.99 |117.78 |199,981 |>99% |
|Liew perturbation |100,000 |50.10 |4.90 |132.95 |199,982 |>99% |
|Nielson perturbation |100,000 |48.79 |4.94 | | | |
|data perturbation |100,000 |49.95 |4.95 |107.52 |199,997 |>99% |

Table 5.3 Results Third Run

The results were significant at the 99% confidence level. The null hypothesis is rejected and the alternative hypothesis is accepted: the Nielson perturbation method is significantly better than no perturbation, Liew perturbation, and data perturbation at hiding and protecting data from inference attacks.

VI. Further Studies

One item that was not tested is the time that it takes to perturb the data into the various perturbed forms. The Nielson method would normally be run at night, when the data processing requirements of the server are low. The Nielson method makes a perturbed copy of the data to be disclosed. The Liew method does a similar perturbation of the data, but it requires several passes through the database: a first pass to calculate the mean and standard deviation, a pass to generate data with the same mean and standard deviation, a sort of both files, and finally the swapping of the data. This requires a lot of CPU time. The Nielson method should produce faster results, and the simple multiplication method should be less CPU intensive than the Nielson method.

Another item for research would be to modify the Nielson method to perturb the data as it is extracted from the database. Care must be taken to prevent a user from executing the same query multiple times and taking the average. Several techniques could guard against this. The first would be to log all queries and make sure that a query is not duplicated; however, in normal situations the same query is executed several times in the course of doing business. Another technique would be to use a hash of the user’s name to seed the random number generator. This way each user would get the same sequence of random numbers, so the user would get the same results for the same query. However, this may be compromised if the user figures out the sequence of random numbers and thereby obtains the random multipliers.
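A sketch of the user-name-as-seed idea (combining the user name with the query text, and the 10% scale, are illustrative choices, not a worked-out design):

import hashlib
import random

def per_user_perturbed_answer(user_name, query_text, true_value, scale=0.1):
    # Seed the generator from the user and the query so that the same user
    # re-running the same query always sees the same perturbed answer.
    digest = hashlib.sha1((user_name + "|" + query_text).encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    return true_value + rng.uniform(-scale, scale) * true_value

q = "SELECT AVG(salary) FROM employees WHERE dept = 'payroll'"
print(per_user_perturbed_answer("bob", q, 48000.0))
print(per_user_perturbed_answer("bob", q, 48000.0))    # same answer as above
print(per_user_perturbed_answer("alice", q, 48000.0))  # different, but stable for alice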

Appendix 1 (Source Code in Python for AI Experiment)

import random

## nielson perturbation
## alpha (outer range 0-1)
## beta (inner range 0-alpha)
## gamma (Records to change 0-# of records)
def nielson(alpha,beta,number):
    tryNumb=random.uniform(beta,alpha)
    if (random.uniform(-1,1) < 0):
        tryNumb = -tryNumb
    return number+(tryNumb*number)

# uniform perturbation
def uniform(factor,data):
    trynumb=random.uniform(-factor,factor)
    return data+(data*trynumb)

# liew perturbation
def liew(data):
    return data

# randomly reorder list
def scrambledbl(db,liew):
    db.sort()
    liew.sort()
    for i in range(1,int(random.uniform(1,100))):
        pos1=int(random.uniform(0,len(db)))
        pos2=int(random.uniform(0,len(db)))
        temp=db[pos1]
        db[pos1]=db[pos2]
        db[pos2]=temp
        temp=liew[pos1]
        liew[pos1]=liew[pos2]
        liew[pos2]=temp
    return db

# randomly reorder list
def scrambledb(db):
    for i in range(1,int(random.uniform(1,100))):
        pos1=int(random.uniform(0,len(db)))
        pos2=int(random.uniform(0,len(db)))
        temp=db[pos1]
        db[pos1]=db[pos2]
        db[pos2]=temp
    return db

# create new db
def createdb():
    db=list([])
    for i in range(0,1000):
        db.append(random.uniform(50,150))
    return db

# create new liew
def createdbl():
    liew=list([])
    for i in range(0,1000):
        liew.append(random.uniform(50,150))
    return liew

# execute query
def querydb(db,size,alpha,beta,gamma):
    scrambledb(db)
    result=list([0,0,0,0,0,0,0,0])
    for tryrec in range(1,size+1):
        data=db[tryrec]
        if tryrec < gamma:
            nielsondata=nielson(alpha,beta,data)
        else:
            nielsondata=data
        uniformdata=uniform(.1,data)
        liewdata=liew(data)
        result[0]=result[0]+1
        result[1]=result[1]+data
        result[2]=result[2]+nielsondata
        result[3]=result[3]+uniformdata
        result[4]=result[4]+liewdata
    for i in range(1,5):
        result[i]=result[i]/result[0]
    result[5]=alpha
    result[6]=beta
    result[7]=gamma
    return result

# execute query (liew)
def querydbl(db,liew,size,alpha,beta,gamma):
    scrambledbl(db,liew)
    result=list([0,0,0,0,0,0,0,0])
    for tryrec in range(1,size+1):
        data=db[tryrec]
        liewdata=liew[tryrec]
        if tryrec < gamma:
            nielsondata=nielson(alpha,beta,data)
        else:
            nielsondata=data
        uniformdata=uniform(.1,data)
        result[0]=result[0]+1
        result[1]=result[1]+data
        result[2]=result[2]+nielsondata
        result[3]=result[3]+uniformdata
        result[4]=result[4]+liewdata
    for i in range(1,5):
        result[i]=result[i]/result[0]
    result[5]=alpha
    result[6]=beta
    result[7]=gamma
    return result

## calc alpha beta and gamma
def ai():
    maxIterate=int(raw_input("# of AI iterations "))
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    db=createdb()
    max=maxIterate*maxDb*maxQueries
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=1
    beta=.5
    gamma=10
    for interate in range(1,maxIterate):
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        # NOTE: the next block was garbled in the source document; it is
        # reconstructed from the text, which describes a 5% chance of a
        # random restart and otherwise a small random step in alpha,
        # beta, and gamma.
        if random.uniform(1,100) < 5:
            alpha=random.uniform(0,3)
            beta=random.uniform(0,alpha)
            gamma=random.uniform(1,1000)
        else:
            alpha=alpha+random.uniform(-.1,.1)
            beta=beta+random.uniform(-.1,.1)
            gamma=gamma+random.uniform(-10,10)
        if beta > alpha:
            beta=alpha-.01
        if gamma > size:
            gamma=size
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=bestresult[2]/bestresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % percentDone,
        print "% = ",
        print " %.2f" % bestresult[5],
        print " %.2f" % bestresult[6],
        print " %.2f" % bestresult[7],
        print " - ",
        print " %.2f" % results
    print "Done . . . ."

## graph gammas
def graphg():
    file = open("gamma.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=51
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    for iterate in range(0,maxIterate):
        gamma=float(iterate*10)
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        alpha=1.5
        beta=1.0
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % percentDone,
        print "% = ",
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " - ",
        print " %.2f" % results
        print >> file, queryresult[7],results
    print "Done . . . ."
    file.close()

## graph alpha
def grapha():
    file = open("alpha.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=21
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=1
    beta=0
    gamma=230
    for iterate in range(0,maxIterate):
        alpha=float(iterate)/10
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % percentDone,
        print "% = ",
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " - ",
        print " %.2f" % results
        print >> file, queryresult[5],results
    print "Done . . . ."
    file.close()

## graph beta
def graphb():
    file = open("beta.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=21
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=2
    beta=0
    gamma=230
    for iterate in range(0,maxIterate):
        beta=float(iterate)/10
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % percentDone,
        print "% = ",
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " - ",
        print " %.2f" % results
        print >> file, queryresult[6],results
    print "Done . . . ."
    file.close()

## graph size
def graphs():
    file = open("size.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=1
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=2
    beta=2
    gamma=230
    for iterate in range(0,maxIterate):
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
                print >> file, size, result[2]
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % percentDone,
        print "% = ",
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " - ",
        print " %.2f" % results
    print "Done . . . ."
    file.close()

## graph size optimal
def graphso():
    file = open("sizeo.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=1
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=2.09
    beta=1.18
    gamma=66.87
    for iterate in range(0,maxIterate):
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,100))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
                if size < 500:
                    queryerror=100-abs(result[2]-result[1])
                else:
                    queryerror=abs(result[2]-result[1])
                print >> file, size, queryerror
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " - ",
        print " %.2f" % results
    print "Done . . . ."
    file.close()

## graph size data
def graphsu():
    file = open("sizeu.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=1
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=2.09
    beta=1.18
    gamma=66.87
    for iterate in range(0,maxIterate):
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
                if size < 500:
                    queryerror=100-abs(result[3]-result[1])
                else:
                    queryerror=abs(result[3]-result[1])
                print >> file, size, queryerror
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " - ",
        print " %.2f" % results
    print "Done . . . ."
    file.close()

## graph size liew
def graphsl():
    file = open("sizel.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=1
    db=createdb()
    liew=createdbl()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=2.09
    beta=1.18
    gamma=66.87
    for iterate in range(0,maxIterate):
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            liew=createdbl()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydbl(db,liew,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
                if size < 500:
                    queryerror=100-abs(result[4]-result[1])
                else:
                    queryerror=abs(result[4]-result[1])
                print >> file, size, queryerror
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
        print " %.2f" % results
    print "Done . . . ."
    file.close()

## graph size no randomness
def graphsd():
    file = open("sized.txt","w")
    maxDb=int(raw_input("# of random DB's "))
    maxQueries=int(raw_input("# of random queries "))
    maxIterate=1
    db=createdb()
    max=(maxIterate)*(maxDb)*(maxQueries)
    print max*.0015/60, " minutes."
    print max*.0015/(60*60), " Hours."
    loops=0
    print "Executing . . . "
    a=9999999
    bestresult= list([1,a,a,a,a,a,a,a])
    alpha=2.09
    beta=1.18
    gamma=66.87
    for iterate in range(0,maxIterate):
        queryresult=list([0,0,0,0,0,0,0,0])
        size=int(random.uniform(1,1000))
        for trydb in range(0,maxDb):
            db=createdb()
            for tryquery in range(0,maxQueries):
                loops=loops+1
                size=int(random.uniform(1,1000))
                result=querydb(db,size,alpha,beta,gamma)
                for i in range(2,5):
                    if size < 500:
                        queryresult[i]=queryresult[i]+100-abs(result[i]-result[1])
                    else:
                        queryresult[i]=queryresult[i]+abs(result[i]-result[1])
                queryresult[0]=queryresult[0]+1
                queryresult[5]=result[5]
                queryresult[6]=result[6]
                queryresult[7]=result[7]
                if size < 500:
                    queryerror=100-abs(result[1]-result[1])
                else:
                    queryerror=abs(result[1]-result[1])
                print >> file, size, queryerror
        if queryresult[2]/queryresult[0] < bestresult[2]/bestresult[0]:
            for i in range(0,8):
                bestresult[i]=queryresult[i]
        results=queryresult[2]/queryresult[0]
        percentDone=(float(loops)/float(max))*100
        print " %.2f" % queryresult[5],
        print " %.2f" % queryresult[6],
        print " %.2f" % queryresult[7],
    print "Done . . . ."
    file.close()

## main program

graphsl()

Appendix 2 (Source Code in Python for Main Experiment)

# experiment.py - Conduct experiment to prove that Nielson perturbation is
# significantly better than the other known perturbation methods
# 06/01/2009 bn Programming started

import random
import math

# global parameters
DB_SIZE=1000
ALPHA=2.09
BETA=1.18
GAMMA=66.87
REPS=100000
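# (annotation) Outline of the experiment: for each of REPS repetitions a fresh random
# database of DB_SIZE values is built, three perturbed copies are derived from it
# (Nielson, Liew, and 20% data perturbation), 100 random queries are scored per method,
# and the resulting per-method mean fitness scores are compared against the Nielson
# mean using the t scores computed in main() below.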

def main():
    # init total variables
    # no perturbation
    sum=0.0
    sum_sqr=0.0
    count=0
    # nielson perturbation
    nielson_sum=0.0
    nielson_sum_sqr=0.0
    nielson_count=0
    # liew perturbation
    liew_sum=0.0
    liew_sum_sqr=0.0
    liew_count=0
    # data perturbation (20%)
    data_sum=0.0
    data_sum_sqr=0.0
    data_count=0
    last=-999
    for i in range(REPS):
        # print progress
        if int(i/(REPS/100)) > last:
            print int(i/(REPS/100)),
            last= int(i/(REPS/100))
        # create databases
        db=[]
        nielson_db=[]
        liew_db=[]
        data_db=[]
        create_db(db,nielson_db,liew_db,data_db,DB_SIZE)
        # 100 random queries
        no_error,nielson_error,liew_error,data_error=query_db(db,nielson_db,liew_db,data_db,100)
        # prepare for stats
        sum=sum+no_error
        sum_sqr=sum_sqr+no_error*no_error
        count=count+1
        liew_sum=liew_sum + liew_error
        liew_sum_sqr=liew_sum_sqr + liew_error*liew_error
        liew_count = liew_count + 1
        nielson_sum=nielson_sum + nielson_error
        nielson_sum_sqr=nielson_sum_sqr + nielson_error*nielson_error
        nielson_count=nielson_count + 1
        data_sum=data_sum + data_error
        data_sum_sqr = data_sum_sqr + data_error*data_error
        data_count=data_count+1
    # calc statistics
    mean=sum/count
    var=(sum_sqr - sum*mean) / (count - 1)
    liew_mean=liew_sum/liew_count
    liew_var=(liew_sum_sqr - liew_sum*liew_mean) / (liew_count - 1)
    nielson_mean=nielson_sum/nielson_count
    nielson_var=(nielson_sum_sqr - nielson_sum*nielson_mean) / (nielson_count - 1)
    data_mean=data_sum/data_count
    data_var=(data_sum_sqr - data_sum*data_mean) / (data_count - 1)
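    # (annotation) The statistics below compare each method's mean fitness against the
    # Nielson mean: the t values divide the difference in means by a combined
    # standard-error term built from the two methods' standard-deviation estimates and
    # sample counts, and the df values follow the Welch-Satterthwaite form for samples
    # with unequal variances.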

    no_t=(mean-nielson_mean)/math.sqrt((math.sqrt(var)/count)+(math.sqrt(nielson_var)/nielson_count))
    liew_t=(liew_mean-nielson_mean)/math.sqrt((math.sqrt(liew_var)/count)+(math.sqrt(nielson_var)/nielson_count))
    data_t=(data_mean-nielson_mean)/math.sqrt((math.sqrt(data_var)/count)+(math.sqrt(nielson_var)/nielson_count))
    no_df=(var/count+nielson_var/count)*(var/count+nielson_var/nielson_count)
    no_df=no_df/(((var/count)*(var/count))/(count-1)+((nielson_var/nielson_count)*(nielson_var/nielson_count)/(nielson_count-1)))
    liew_df=(liew_var/liew_count+nielson_var/count)*(liew_var/liew_count+nielson_var/nielson_count)
    liew_df=liew_df/(((liew_var/liew_count)*(liew_var/liew_count))/(liew_count-1)+((nielson_var/nielson_count)*(nielson_var/nielson_count)/(nielson_count-1)))
    data_df=(data_var/data_count+nielson_var/nielson_count)*(data_var/data_count+nielson_var/nielson_count)
    data_df=data_df/(((data_var/data_count)*(data_var/data_count))/(data_count-1)+((nielson_var/nielson_count)*(nielson_var/nielson_count)/(nielson_count-1)))
    # print results
    print
    print
    print "NO PERTURBATION"
    print " count = ", count
    print " ave = ", mean
    print " s = ", math.sqrt(var)
    print " t = ", no_t
    print " df = ", no_df
    print "LIEW PERTURBATION"
    print " count = ", liew_count
    print " ave = ", liew_mean
    print " s = ", math.sqrt(liew_var)
    print " t = ", liew_t
    print " df = ", liew_df
    print "NIELSON PERTURBATION"
    print " count = ", nielson_count
    print " ave = ", nielson_mean
    print " s = ", math.sqrt(nielson_var)
    print "DATA PERTURBATION"
    print " count = ", data_count
    print " ave = ", data_mean
    print " s = ", math.sqrt(data_var)
    print " t = ", data_t
    print " df = ", data_df

# create the random database uniformly distributed between 50-150
# then perform the perturbation to create the other databases
def create_db(db,nielson_db,liew_db,data_db,size):
    for i in range(size):
        numb=random.uniform(50,150)
        db.append(numb)
    liew_pert(db,liew_db)
    nielson_pert(db,nielson_db)
    data_pert(db,data_db)

# perform liew perturbation
def liew_pert(db,liew_db):
    for i in range(DB_SIZE):
        numb=random.uniform(50,150)
        liew_db.append(numb)
    liew_db.sort()
    db.sort()
    for i in range(random.randint(10,100)):
        point1=random.randint(0,DB_SIZE-1)
        point2=random.randint(0,DB_SIZE-1)
        liew_db[point1],liew_db[point2] = liew_db[point2],liew_db[point1]
        db[point1],db[point2]=db[point2],db[point1]
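# (annotation) nielson_pert below implements the Nielson perturbation used in this
# experiment: each value x is replaced by x*(1 + numb), where |numb| is drawn uniformly
# between BETA and ALPHA and its sign is chosen at random. The perturbed copy is only
# consulted for queries of fewer than GAMMA rows (see query_db), so larger statistical
# queries are answered from the unperturbed data.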

# perform nielson perturbation
def nielson_pert(db,nielson_db):
    for i in range(DB_SIZE):
        numb=random.uniform(ALPHA,BETA)
        if random.uniform(0,1) > .50:
            numb=-numb
        nielson_db.append(db[i]+db[i]*numb)

# perform 20% data perturbation
def data_pert(db,data_db):
    for i in range(DB_SIZE):
        data_db.append(db[i]+db[i]*random.uniform(-.20,.20))

# perform random queries on database and report ave fitness scores
def query_db(db,nielson_db,liew_db,data_db,number):
    mean_no=0.0
    mean_nielson=0.0
    mean_liew=0.0
    mean_data=0.0
    for j in range(number):
        query_size=random.randint(1,DB_SIZE-1)
        sum=0
        nielson_sum=0
        liew_sum=0
        data_sum=0
        for i in range(query_size):
            sum=sum+db[i]
            if query_size < GAMMA:
                nielson_sum=nielson_sum+nielson_db[i]
            else:
                nielson_sum=nielson_sum+db[i]
            liew_sum=liew_sum+liew_db[i]
            data_sum=data_sum+data_db[i]
        result=sum/query_size
        nielson_result=nielson_sum/query_size
        liew_result=liew_sum/query_size
        data_result=data_sum/query_size
        no_error=0
        nielson_error=nielson_result-result
        liew_error=liew_result-result
        data_error=data_result-result
        if nielson_error < 0:
            nielson_error=-nielson_error
        if liew_error < 0:
            liew_error=-liew_error
        if data_error < 0:
            data_error=-data_error
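        # (annotation) Fitness transform: for queries covering fewer than half the rows
        # the score is 100 minus the absolute error, so heavier perturbation of small,
        # privacy-sensitive queries lowers the score; for larger queries the score is the
        # absolute error itself, so statistical accuracy lowers the score. A lower average
        # score therefore indicates a better privacy/accuracy trade-off, matching the
        # minimization used by the search routines in Appendix 1.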

        if query_size < DB_SIZE/2:
            nielson_error=100-nielson_error
            no_error=100-no_error
            liew_error=100-liew_error
            data_error=100-data_error
        mean_no=mean_no+no_error
        mean_nielson=mean_nielson+nielson_error
        mean_liew=mean_liew+liew_error
        mean_data=mean_data+data_error
    no_error=mean_no/number
    nielson_error=mean_nielson/number
    liew_error=mean_liew/number
    data_error=mean_data/number
    return no_error,nielson_error,liew_error,data_error

main()

References

Adam, Nabil R. & Wortmann, John C. (1989, December). Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, Vol. 21, No. 4.

Baru, C. K. and S. Y. W. Su (1984). Performance evaluation of the statistical aggregation by categorization in the SM3 system. Proceedings of the 1984 ACM SIGMOD international conference on Management of data. Boston, Massachusetts, ACM Press.

Beck, Leland L. (1980, September). A Security Mechanism for Statistical Databases. ACM Transactions on Database Systems, Vol. 5, No. 3, Pages 316-338.

Chin, Francis (1978, March). Security in Statistical Databases for Queries with Small Counts. ACM Transactions on Database Systems, Vol. 3, No. 1, Pages 92-104.

Chin, F. Y. and G. Ozsoyoglu (1981). "Statistical database design." ACM Trans. Database Syst. 6(1): 113-139.

Denning, D. E. and P. J. Denning (1979). "The tracker: a threat to statistical database security." ACM Trans. Database Syst. 4(1): 76-96.

Denning, D. E. (1980). Secure Statistical Databases with Random Sample Queries. ACM Transactions on Database Systems, Vol 5, No. 3, Pages 291-315.

Dinur, I. and K. Nissim (2003). Revealing information while preserving privacy. Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. San Diego, California, ACM Press.

Hanson, Steven C. (1995, June). Hybrid Inferential Security Methods for Statistical Databases. ACM SIGAPP Applied Computing Review, Vol. 3, Issue 1.

Hartson, H. R. and D. K. Hsiao (1976). Full protection specifications in the semantic model for database protection languages. Proceedings of the annual conference. Houston, Texas, United States, ACM Press.

Ketel, M. and A. Homaifar (2005). Privacy-preserving mining by rotational data transformation. Proceedings of the 43rd annual southeast regional conference - Volume 1. Kennesaw, Georgia, ACM Press.

Kifer, D. and J. Gehrke (2006). Injecting utility into anonymized datasets. Proceedings of the 2006 ACM SIGMOD international conference on Management of data. Chicago, IL, USA, ACM Press.

Liew, C.K., Choi, W.J., and Liew, C.J. (1985). A data distortion by probability distribution. ACM Transactions on Database Systems, Vol. 10, No. 3, Pages 395-411.

Machanavajjhala, A. and J. Gehrke (2006). On the efficiency of checking perfect privacy. Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. Chicago, IL, USA, ACM Press.

McLeish, Mary (1989, March). Further Results on the Security of Partitioned Dynamic Statistical Databases. ACM Transactions on Database Systems, Vol. 14, No. 1, Pages 98-113.

Ozsoyoglu, G., Su, T. A. (1985). Rounding and Inference Control in Conceptual Models for Statistical Databases. Proceedings of IEEE Symposium on Security and Privacy, Pages 160-173.

Palley, M. A. and J. S. Simonoff (1987). "The use of regression methodology for the compromise of confidential information in statistical databases." ACM Trans. Database Syst. 12(4): 593-608.

Reiss, S. P. (1984). "Practical data-swapping: the first steps." ACM Trans. Database Syst. 9(1): 20-37.

Rowe, N. C. (1983). Top-down statistical estimation on a database. Proceedings of the 1983 ACM SIGMOD international conference on Management of data. San Jose, California, ACM Press.

Sicherman, G. L., W. D. Jonge, et al. (1983). "Answering queries without revealing secrets." ACM Trans. Database Syst. 8(1): 41-59.

Schlörer, J. (1981). "Security of statistical databases: multidimensional transformation." ACM Trans. Database Syst. 6(1): 95-112.

Schwartz, M. D., D. E. Denning, et al. (1979). "Linear queries in statistical databases." ACM Trans. Database Syst. 4(2): 156-167.

Tendick, Patrick & Matloff, Norman (1994, March). A Modified Random-Perturbation Method for Database Security. ACM Transactions on Database Systems, Vol. 19, No. 1, Pages 47-63.

Verykios, V. S., E. Bertino, et al. (2004). "State-of-the-art in privacy preserving data mining." SIGMOD Rec. 33(1): 50-57.

Yu, C. T. and F. Y. Chin (1977). A study on the protection of statistical data bases. Proceedings of the 1977 ACM SIGMOD international conference on Management of data. Toronto, Ontario, Canada, ACM Press.

Zhong, S., Z. Yang, et al. (2005). Privacy-enhancing k-anonymization of customer data. Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. Baltimore, Maryland, ACM Press.

 

 

 

 
