THESIS
2015
xii, 104 pages : illustrations (some color) ; 30 cm
Abstract
In this thesis we focus on publishing statistics on a private database with ϵ-differential privacy.
We target at three scenarios; (i) when the statistics are computed over a static database with high
sensitivity, (ii) when we wish to publish 1-dimensional histograms, and (iii) when the statistics are
published continuously over data that are updated by a stream.
For the first scenario, we address one-time publishing of non-overlapping counts computed
over highly sensitive data. These statistics are useful in a wide and important range of applications,
including transactional, traffic and medical data analysis. Prior work on ϵ-differential privacy
publishes such statistics with prohibitively low utility in several practical scenarios. Towards this
end, we present GS, a method tha...[
Read more ]
In this thesis we focus on publishing statistics on a private database with ϵ-differential privacy.
We target at three scenarios; (i) when the statistics are computed over a static database with high
sensitivity, (ii) when we wish to publish 1-dimensional histograms, and (iii) when the statistics are
published continuously over data that are updated by a stream.
For the first scenario, we address one-time publishing of non-overlapping counts computed
over highly sensitive data. These statistics are useful in a wide and important range of applications,
including transactional, traffic and medical data analysis. Prior work on ϵ-differential privacy
publishes such statistics with prohibitively low utility in several practical scenarios. Towards this
end, we present GS, a method that pre-processes the counts by elaborately g̲rouping and s̲moothing
them via averaging. This step acts as a form of preliminary perturbation that diminishes sensitivity,
and enables GS to achieve ϵ-differential privacy through low Laplace noise injection. The grouping
strategy is dictated by a sampling mechanism, which minimizes the smoothing perturbation. We
demonstrate the superiority of GS over its competitors, and confirm its practicality, via extensive
experiments on real datasets.
For the second scenario, we focus on the problem of differentially private histogram publication,
for range-sum query answering. Specifically, we derive an 1-dimensional histogram from a given
dataset, such that (i) it satisfies ϵ-differential privacy, and (ii) it achieves high utility for queries
that request the sum of contiguous histogram bins. Existing schemes are distinguished into two
categories: “data-aware” and “data-oblivious”. Data-aware methods exploit data characteristic to
increase utility. However, they have superquadratic running time, and their error increases with
the query range. On the other hand, data-oblivious methods are fast but may yield lower utility for
datasets commonly found in practice, especially for short ranges. In this paper, we propose schemes
that combine and improve characteristics of both approaches, with emphasis on both efficiency
and utility. Towards this goal, we formulate a principled approach, which defines a small set of
simple modules, based on which we can devise a variety of more complex schemes. We first
express the state-of-the-art methods in terms of these modules, which allows us to identify the
performance bottlenecks. Next, we design novel efficient and effective schemes based on non-trivial
module combinations. We experimentally evaluate all mechanisms on three real datasets
with diverse characteristics, and demonstrate the benefits of our proposals over previous work.
For the third scenario, we address continuously publishing of non-overlapping counts. Numerous
applications require continuous publication of statistics for monitoring purposes, such as
real-time traffic analysis, timely disease outbreak discovery, and social trends observation. These
statistics may be derived from sensitive user data and, hence, necessitate privacy preservation. Although
ϵ-differential privacy is a notable paradigm for offering strong privacy guarantees in statistics
publishing, there is limited literature that adapts this concept to settings where the statistics are
computed over an infinite stream of “events” (i.e., data items generated by the users), and published
periodically. These works aim at hiding a single event over the entire stream. We argue that, in
most practical scenarios, sensitive information is revealed from multiple events occurring at contiguous
time instances. Towards this end, we put forth the novel notion of w-event privacy over
infinite streams, which protects an event sequence occurring in w successive time instants. We first
formulate our privacy concept, motivate its importance, and introduce a methodology for achieving
it. We next design two instantiations, whose utility is independent of the stream length. Finally, we
confirm the practicality of our solutions experimenting with real data.
Post a Comment