An Optimal Algorithm for the Distinct Elements Problem
An Optimal Algorithm for the Distinct Elements Problem
Daniel M. Kane
Department of Mathematics Stanford University
aladkeenin@
February 21st, 2014
Joint work with Jelani Neslon (Harvard) and David Woodruff (IBM)
D. Kane (Stanford)
Distinct Elements
February 2014 1 / 30
Streaming Algorithms
Algorithm to analyze Data passing through a server OR Data in a large database
In either case, there may be too much data to store a full (separate) copy of everything. Want an algorithm that:
Uses little space Little processing time per item Makes only one pass through the data
D. Kane (Stanford)
Distinct Elements
February 2014 2 / 30
Distinct Elements Problem
Distinct elements problem: Given a stream S of elements of [n] := {1, 2, . . . , n}, return the number of distinct elements in S. Example:
S = 1, 7, 2, 7, 23, 1, 7
Distinct elements are 1, 2, 7, 23. Return 4. Applications:
Detecting Denial of Service Attacks [Akella, Bharambe, Reiter, Seshan]
Query planning
[Selinger, Astrahan, Chamberlin, Lorie]
Data Integration [Brown, Haas, Myllymaki, Pirahesh, Reinwald, Sismanis]
D. Kane (Stanford)
Distinct Elements
February 2014 3 / 30
Approximation
Returning |S| exactly is too much to hope for. Compare |S|,|S {x}| to determine if x S Would essentially need to store all elements of the stream Requires (n) space
Instead return answer F so that F = |S|(1 ? ) for some small > 0.
D. Kane (Stanford)
Distinct Elements
February 2014 4 / 30
Randomization
Deterministic algorithms also do not work: Let Si be subsets that overlap at most 2/3rds of their elements. Need to distinguish stream Si Si from Si Sj . Need to store enough information to tell which Si you have. Requires (n) space.
Thus, to obtain reasonable space, we will need to use a randomized algorithm that we will only require to be correct with probability 2/3 (this can be improved to 1 - by running O(log(-1)) copies of the algorithm in parallel).
D. Kane (Stanford)
Distinct Elements
February 2014 5 / 30
Problem Statement
Problem
Find a randomized algorithm that given a stream S of elements in [n]: Makes only a single pass over the stream Uses s space Uses t time per update
Returns a value in (|S|(1 - ), |S|(1 + )) with probability at least 2/3rds
D. Kane (Stanford)
Distinct Elements
February 2014 6 / 30
Past Work
Paper [FM85] [AMS99] [GT01] [BJKST02] [BJKST02] [KNW10]
Space
O (log(n))
O (log(n)) O( -2 log(n)) O( -2 log(n)) O~( -2 + log(n)) O( -2 + log(n))
Time
-
O (log(n)) O( -2) O(log( -1)) O~( -2 log log(n))
O (1)
Notes Const. , rand. oracle Const.
Optimal, This talk
D. Kane (Stanford)
Distinct Elements
February 2014 7 / 30
Lower Bounds
Space lower bounds of ( -2 + log(n)) were proven by Indyk and Woodruff (later extended to cover all > n-1/2 by Woodruff). Idea: Given a stream ST need to distinguish the cases:
S and T overlap on a 1/2 + -fraction of their entries S and T overlap on a 1/2 - -fraction of their entries This reduces it to the one-way communication complexity of the Gap-Hamming Problem.
D. Kane (Stanford)
Distinct Elements
February 2014 8 / 30
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- solution of final exam 10 701 15 781 machine learning
- pandas
- cheat sheet pyspark sql python lei mao
- 1 apl6 common substrings of more than two strings
- exploring data using python 3 charles r severance
- spark cheat sheet stanford university
- sorting codility
- with pandas f m a f ma vectorized a f operations cheat
- an optimal algorithm for the distinct elements problem
- data wrangling tidy data pandas python data
Related searches
- the five elements of literature
- optimal temperature for photosynthesis
- the 7 elements of culture
- the five elements of life
- choose the nonmetallic elements from the list
- the basic economic problem is
- find an equation for the line below
- the mind body problem philosophy
- write an inequality for the graph calculator
- write an equation for the function calculator
- the 6 elements of plot
- what are the 5 elements of plot