Class BloomFilter
When querying a Bloom filter, there are no false positives. Specifically: When querying an item that has already been inserted to the filter, the filter will always indicate that the item is present. There is a chance of false positives, where querying an item that has never been presented to the filter will indicate that the item has already been seen. Consequently, any query should be interpreted as "might have seen."
A standard Bloom filter is unlike typical sketches in that it is not sub-linear in size and does not resize itself. A Bloom filter will work up to a target number of distinct items, beyond which it will saturate and the false positive rate will start to increase. The size of a Bloom filter will be linear in the expected number of distinct items.
See the BloomFilterBuilder class for methods to create a filter, especially one sized correctly for a target number of distinct elements and a target false positive probability.
This implementation uses xxHash64 and follows the approach in Kirsch and Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter," Wiley Interscience, 2008, pp. 187-218.
-
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final longThe maximum size of a bloom filter in bits. -
Method Summary
Modifier and TypeMethodDescriptionlongReturns the number of bits in the BloomFilter that are set to 1.longReturns the total number of bits in the BloomFilter.doubleReturns the percentage of all bits in the BloomFilter set to 1.shortReturns the configured number of hash functions for this BloomFilterlonggetSeed()Returns the hash seed for this BloomFilter.static longgetSerializedSize(long numBits) Returns the serialized length of a non-empty BloomFilter of the given size, in byteslongReturns the length of this BloomFilter when serialized, in bytesbooleanReturns whether the filter has a backing Memory objectstatic BloomFilterheapify(org.apache.datasketches.memory.Memory mem) Reads a serialized image of a BloomFilter from the provided Memoryvoidintersect(BloomFilter other) Intersects two BloomFilters by applying a logical AND.voidinvert()Inverts all the bits of the BloomFilter.booleanisCompatible(BloomFilter other) Helps identify if two BloomFilters may be unioned or intersected.booleanisDirect()Returns whether the filter is a direct (off-heap) or on-heap object.booleanisEmpty()Checks if the BloomFilter has processed any itemsbooleanReturns whether the filter is in read-only mode.booleanquery(byte[] data) Queries the filter with the provided byte[] and returns whether the array might have been seen previously.booleanquery(char[] data) Queries the filter with the provided char[] and returns whether the array might have been seen previously.booleanquery(double item) Queries the filter with the provided double and returns whether the value might have been seen previously.booleanquery(int[] data) Queries the filter with the provided int[] and returns whether the array might have been seen previously.booleanquery(long item) Queries the filter with the provided long and returns whether the value might have been seen previously.booleanquery(long[] data) Queries the filter with the provided long[] and returns whether the array might have been seen previously.booleanquery(short[] data) Queries the filter with the provided short[] and returns whether the array might have been seen previously.booleanQueries the filter with the provided double and returns whether the value might have been seen previously.booleanquery(org.apache.datasketches.memory.Memory mem) Queries the filter with the provided Memory and returns whether the data might have been seen previously.booleanqueryAndUpdate(byte[] data) Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.booleanqueryAndUpdate(char[] data) Updates the filter with the provided char[] and returns the result from quering that array prior to the update.booleanqueryAndUpdate(double item) Updates the filter with the provided double and returns the result from quering that value prior to the update.booleanqueryAndUpdate(int[] data) Updates the filter with the provided int[] and returns the result from quering that array prior to the update.booleanqueryAndUpdate(long item) Updates the filter with the provided long and returns the result from quering that value prior to the update.booleanqueryAndUpdate(long[] data) Updates the filter with the provided long[] and returns the result from quering that array prior to the update.booleanqueryAndUpdate(short[] data) Updates the filter with the provided short[] and returns the result from quering that array prior to the update.booleanqueryAndUpdate(String item) Updates the filter with the provided String and returns the result from quering that value prior to the update.booleanqueryAndUpdate(org.apache.datasketches.memory.Memory mem) Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.voidreset()Resets the BloomFilter to an empty statebyte[]Serializes the current BloomFilter to an array of bytes.long[]Serializes the current BloomFilter to an array of longs.toString()voidunion(BloomFilter other) Unions two BloomFilters by applying a logical OR.voidupdate(byte[] data) Updates the filter with the provided byte[].voidupdate(char[] data) Updates the filter with the provided char[].voidupdate(double item) Updates the filter with the provided double value.voidupdate(int[] data) Updates the filter with the provided int[].voidupdate(long item) Updates the filter with the provided long value.voidupdate(long[] data) Updates the filter with the provided long[].voidupdate(short[] data) Updates the filter with the provided short[].voidUpdates the filter with the provided String.voidupdate(org.apache.datasketches.memory.Memory mem) Updates the filter with the data in the provided Memory.static BloomFilterwrap(org.apache.datasketches.memory.Memory mem) Wraps the given Memory into this filter class.static BloomFilterwritableWrap(org.apache.datasketches.memory.WritableMemory wmem) Wraps the given WritableMemory into this filter class.
-
Field Details
-
MAX_SIZE_BITS
public static final long MAX_SIZE_BITSThe maximum size of a bloom filter in bits.
-
-
Method Details
-
heapify
Reads a serialized image of a BloomFilter from the provided Memory- Parameters:
mem- Memory containing a previously serialized BloomFilter- Returns:
- a BloomFilter object
-
wrap
Wraps the given Memory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.- Parameters:
mem- the given Memory object- Returns:
- the wrapping BloomFilter class.
-
writableWrap
Wraps the given WritableMemory into this filter class. The class itself only contains a few metadata items and holds a reference to the Memory object, which contains all the data.- Parameters:
wmem- the given WritableMemory object- Returns:
- the wrapping BloomFilter class.
-
reset
public void reset()Resets the BloomFilter to an empty state -
isEmpty
public boolean isEmpty()Checks if the BloomFilter has processed any items- Returns:
- True if the BloomFilter is empty, otherwise False
-
getBitsUsed
public long getBitsUsed()Returns the number of bits in the BloomFilter that are set to 1.- Returns:
- The number of bits in use in this filter
-
getCapacity
public long getCapacity()Returns the total number of bits in the BloomFilter.- Returns:
- The total size of the BloomFilter
-
getNumHashes
public short getNumHashes()Returns the configured number of hash functions for this BloomFilter- Returns:
- The number of hash functions to apply to inputs
-
getSeed
public long getSeed()Returns the hash seed for this BloomFilter.- Returns:
- The hash seed for this filter
-
hasMemory
public boolean hasMemory()Returns whether the filter has a backing Memory object- Returns:
- true if backed by Memory, otherwise false
-
isReadOnly
public boolean isReadOnly()Returns whether the filter is in read-only mode. That is possible only if there is a backing Memory in read-only mode.- Returns:
- true if read-only, otherwise false
-
isDirect
public boolean isDirect()Returns whether the filter is a direct (off-heap) or on-heap object. That is possible only if there is a backing Memory.- Returns:
- true if using direct memory access, otherwise false
-
getFillPercentage
public double getFillPercentage()Returns the percentage of all bits in the BloomFilter set to 1.- Returns:
- the percentage of bits in the filter set to 1
-
update
public void update(long item) Updates the filter with the provided long value.- Parameters:
item- an item with which to update the filter
-
update
public void update(double item) Updates the filter with the provided double value. The value is canonicalized (NaN and infinities) prior to updating.- Parameters:
item- an item with which to update the filter
-
update
Updates the filter with the provided String. The string is converted to a byte array using UTF8 encoding.Note: this will not produce the same output hash values as the
update(char[])method and will generally be a little slower depending on the complexity of the UTF8 encoding.- Parameters:
item- an item with which to update the filter
-
update
public void update(byte[] data) Updates the filter with the provided byte[].- Parameters:
data- an array with which to update the filter
-
update
public void update(char[] data) Updates the filter with the provided char[].- Parameters:
data- an array with which to update the filter
-
update
public void update(short[] data) Updates the filter with the provided short[].- Parameters:
data- an array with which to update the filter
-
update
public void update(int[] data) Updates the filter with the provided int[].- Parameters:
data- an array with which to update the filter
-
update
public void update(long[] data) Updates the filter with the provided long[].- Parameters:
data- an array with which to update the filter
-
update
public void update(org.apache.datasketches.memory.Memory mem) Updates the filter with the data in the provided Memory.- Parameters:
mem- a Memory object with which to update the filter
-
queryAndUpdate
public boolean queryAndUpdate(long item) Updates the filter with the provided long and returns the result from quering that value prior to the update.- Parameters:
item- an item with which to update the filter- Returns:
- The query result prior to applying the update
-
queryAndUpdate
public boolean queryAndUpdate(double item) Updates the filter with the provided double and returns the result from quering that value prior to the update. The double is canonicalized (NaN and +/- infinity) in the call.- Parameters:
item- an item with which to update the filter- Returns:
- The query result prior to applying the update
-
queryAndUpdate
Updates the filter with the provided String and returns the result from quering that value prior to the update. The string is converted to a byte array using UTF8 encoding.Note: this will not produce the same output hash values as the
queryAndUpdate(char[])method and will generally be a little slower depending on the complexity of the UTF8 encoding.- Parameters:
item- an item with which to update the filter- Returns:
- The query result prior to applying the update, or false if item is null
-
queryAndUpdate
public boolean queryAndUpdate(byte[] data) Updates the filter with the provided byte[] and returns the result from quering that array prior to the update.- Parameters:
data- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(char[] data) Updates the filter with the provided char[] and returns the result from quering that array prior to the update.- Parameters:
data- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(short[] data) Updates the filter with the provided short[] and returns the result from quering that array prior to the update.- Parameters:
data- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(int[] data) Updates the filter with the provided int[] and returns the result from quering that array prior to the update.- Parameters:
data- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(long[] data) Updates the filter with the provided long[] and returns the result from quering that array prior to the update.- Parameters:
data- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if data is null
-
queryAndUpdate
public boolean queryAndUpdate(org.apache.datasketches.memory.Memory mem) Updates the filter with the provided Memory and returns the result from quering that Memory prior to the update.- Parameters:
mem- an array with which to update the filter- Returns:
- The query result prior to applying the update, or false if mem is null
-
query
public boolean query(long item) Queries the filter with the provided long and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
item- an item with which to query the filter- Returns:
- The result of querying the filter with the given item
-
query
public boolean query(double item) Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. Double values are canonicalized (NaN and +/- infinity) before querying.- Parameters:
item- an item with which to query the filter- Returns:
- The result of querying the filter with the given item
-
query
Queries the filter with the provided double and returns whether the value might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible. The string is converted to a byte array using UTF8 encoding.Note: this will not produce the same output hash values as the
update(char[])method and will generally be a little slower depending on the complexity of the UTF8 encoding.- Parameters:
item- an item with which to query the filter- Returns:
- The result of querying the filter with the given item, or false if item is null
-
query
public boolean query(byte[] data) Queries the filter with the provided byte[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(char[] data) Queries the filter with the provided char[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(short[] data) Queries the filter with the provided short[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(int[] data) Queries the filter with the provided int[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(long[] data) Queries the filter with the provided long[] and returns whether the array might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
data- an array with which to query the filter- Returns:
- The result of querying the filter with the given data, or false if data is null
-
query
public boolean query(org.apache.datasketches.memory.Memory mem) Queries the filter with the provided Memory and returns whether the data might have been seen previously. The filter's expected False Positive Probability determines the chances of a true result being a false positive. False negatives are never possible.- Parameters:
mem- a Memory array with which to query the filter- Returns:
- The result of querying the filter with the given Memory, or false if data is null
-
union
Unions two BloomFilters by applying a logical OR. The result will recognized any values seen by either filter (as well as false positives).- Parameters:
other- A BloomFilter to union with this one
-
intersect
Intersects two BloomFilters by applying a logical AND. The result will recognize only values seen by both filters (as well as false positives).- Parameters:
other- A BloomFilter to union with this one
-
invert
public void invert()Inverts all the bits of the BloomFilter. Approximately inverts the notion of set-membership. -
isCompatible
Helps identify if two BloomFilters may be unioned or intersected.- Parameters:
other- A BloomFilter to check for compatibility with this one- Returns:
- True if the filters are compatible, otherwise false
-
getSerializedSizeBytes
public long getSerializedSizeBytes()Returns the length of this BloomFilter when serialized, in bytes- Returns:
- The length of this BloomFilter when serialized, in bytes
-
getSerializedSize
public static long getSerializedSize(long numBits) Returns the serialized length of a non-empty BloomFilter of the given size, in bytes- Parameters:
numBits- The number of bits of to use for size computation- Returns:
- The serialized length of a non-empty BloomFilter of the given size, in bytes
-
toByteArray
public byte[] toByteArray()Serializes the current BloomFilter to an array of bytes.Note: Method throws if the serialized size exceeds
Integer.MAX_VALUE.- Returns:
- A serialized image of the current BloomFilter as byte[]
-
toLongArray
public long[] toLongArray()Serializes the current BloomFilter to an array of longs. UnliketoByteArray(), this method can handle any size filter.- Returns:
- A serialized image of the current BloomFilter as long[]
-
toString
-