PDF of Eric’s handwritten notes are here.
Today we’ll discuss hashing and see a widely-used scheme known as a Bloom filter. Before diving into hashing let’s look at a toy problem which will be helpful for our intuition.
Randomly throwing balls into bins
Given balls and
bins, each ball is thrown into a random bin. Load of bin
denoted by
# of balls assigned to bin
.
Max load is the maximum # of balls in a bin . It can be shown that the max load is
with high probability (with high prob. means with probability
for constant
).
A proof for : Pr(bin
gets load
) =
.
Pr(some bin gets load )
.
Pr(no bin gets load )
.
(In fact, the bound can be refined to ).
Power of 2 choices
Assign balls sequentially into bins
For to
:
- Choose 2 random bins,
and
- If
: then assign ball
to bin
- else: assign ball
to bin
With this change, the max load is with high probability.
If random bins are used (instead of 2), the max load is
with high probability.
Application to Hashing
Universe of possible passwords, database
of unacceptable passwords. Query: Is
?
Hash table:
Map elements of into bins
using hash function
.
Chain hashing: is a linked list of elements
of
where
To add into
:
- Compute
- add
onto linked list at
To query :
- Compute
- Check if
is in the linked list at
Query time = load at bin
Let ,
. This is equivalent to putting
balls into
bins.
If is a random function, then
is a random bin in
. If
, the query time is
.
How to get a which is random?
Choose . For
,
A better choice is to choose ,
.
Better Approach
Hash table: . Hash function:
.
To add into
:
- Compute
- add
onto linked list at
which is less loaded
To query :
- Compute
- Check if
is in the linked list at
For random , if
, the max load is
.
Bloom Filters
- Uses less space
- Simpler
- Faster:
- False positives: If
, sometimes, we say YES
is a 0-1 array of size
, initialized to all 0’s
Add into
:
- Compute
- Set
Query :
- Check if
, then YES else NO
False Positive problem: It might happen that but we added
into
and
Better Approach
Instead, use hash functions
For each ,
Initialize to all 0’s
Add into
:
- For all
- Set
- Set
Query :
- if for all
, output YES
- else (if
where
), output NO
Guarantees:
- If
, then for query
, we always say YES
- If
, then for query
, we might say YES. This is a false positive.
Probability of a false positive
If , then
To find the best that minimizes
, we take the derivative of
and find the solution for
.
The best is
which makes
.