Improving checks for whether an object is in a Set/HashMap data structure in Ruby by using better comparisons, memoization, and object-oriented practices


I was investigating the performance of an API endpoint using a flamegraph when I noticed that checking whether an object is in a Set or Hash data structure was consuming a lot of time. This post explains the problem and its solution.
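If you want to reproduce this kind of investigation, below is a minimal sketch of one way to generate a flamegraph, assuming the stackprof gem; the output path and the profiled block are placeholders, not the code from my API:

require "stackprof"

# Profile the suspicious code path; raw: true keeps the data needed for flamegraphs.
StackProf.run(mode: :wall, raw: true, out: "tmp/stackprof.dump") do
  # call the endpoint or code path under investigation here
end

# Then, from the shell:
#   stackprof --flamegraph tmp/stackprof.dump > tmp/flamegraph
#   stackprof --flamegraph-viewer=tmp/flamegraph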

First of all, checking whether an object is in a Set or HashMap is O(1) if we ignore collisions. Even with collisions, it should cost far less than O(n), where n is the number of items in the set; otherwise, we are effectively traversing a linked list.

In my case, checking whether an object was in a Set of 1000 objects was consuming 1.62 seconds, or 37% of my request time. For comparison, checking whether an integer is in a Set of 1000 integers takes less than 1 millisecond.
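As an illustrative baseline (not the benchmark from my production case), probing a Set of integers looks like this:

require "set"

integers = Set.new(1..1000)

# Membership is resolved through Integer#hash, so the cost of the lookup
# does not grow with the size of the set.
integers.include?(500)   # => true
integers.include?(5000)  # => false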

So, what is the problem? To dive into it, below is the class I was working with. I removed everything related to the business domain and changed the class name so we can concentrate on the problem and its solution.

class CustomID
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def ==(other)
    other.class == self.class && other.id == id
  end

  def hash
    self.class.hash | id.hash
  end
end
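To make the semantics concrete, here is a small, hypothetical usage example: two instances built with the same id compare as equal and report the same hash.

a = CustomID.new(42)
b = CustomID.new(42)

a == b            # => true: same class and equal ids
a.hash == b.hash  # => true: equal objects report the same hash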

There is nothing wrong with this code. However, if we use it frequently, it could be a bottleneck.

The method == performs a class comparison and, when that matches, an integer comparison of the ids. I improved it to perform a single, faster integer comparison instead. Integer comparison is O(b), where b is the number of bits used to represent the integer, and for machine-sized integers it is effectively a single instruction.

The method hash is also expensive: every call recomputes the hash of the class and the hash of the id and combines them with a bitwise OR.

These two methods are where our performance problem starts. To check whether an object is in a Set of 1000 elements, we calculate the object's hash and then run an equality check against any candidate that lands in the same bucket. Again, this is fine if we do it a few times. Doing it a few million times a day, however, becomes a significant bottleneck.
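To make that visible, here is an illustrative sketch (the counting subclass is hypothetical, not part of the original code) that counts how many times hash runs during repeated Set lookups:

require "set"

class CountingCustomID < CustomID
  def self.hash_calls
    @hash_calls ||= 0
  end

  def self.record_hash_call
    @hash_calls = hash_calls + 1
  end

  def hash
    self.class.record_hash_call
    super
  end
end

objects = (1..1000).map { |i| CountingCustomID.new(i * 1000) }
ids = Set.new(objects) # computes each element's hash once while building the set
probe = objects[499]

before = CountingCustomID.hash_calls
1_000.times { ids.include?(probe) }
CountingCustomID.hash_calls - before # => 1000: the probe's hash is recomputed on every lookup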

The solution is to replace the two comparisons in == with a single comparison of the hashes and to memoize the expensive hash computation. The new code is below:

class ImprovedCustomID
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def ==(other)
    other.hash == hash
  end

  # Memoize the hash so the expensive computation runs at most once per instance.
  def hash
    @hash ||= self.class.hash | id.hash
  end

  # Compute the hash before freezing, because memoizing it sets @hash.
  def freeze
    hash
    super
  end
end

The new code overrides the method freeze to call hash and only then call super. The override is necessary because the memoized version of hash sets a new instance variable, @hash, the first time it is called; doing that on an already-frozen object would raise a FrozenError.
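A quick, hypothetical check of that behavior:

# With the override, freezing precomputes and caches the hash.
frozen_id = ImprovedCustomID.new(42).freeze
frozen_id.hash # fine: @hash was set before the object was frozen

# Without the override, the first hash call on a frozen instance would try
# to set @hash and raise a FrozenError.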

To compare both versions, we can use the code below, which relies on the benchmark-ips library:

require "benchmark/ips"
custom_id = CustomID.new(61000)
improved_custom_id =ImprovedCustomID.new(61000)
custom_ids = (1..1000).map {|i| CustomID.new(i*1000) }
improved_custom_ids = (1..1000).map {|i| ImprovedCustomID.new(i*1000) }
Benchmark.ips do |x|
  x.report("include? before improvements") { custom_ids.include?(custom_id) }
  x.report("include? after improvements") { improved_custom_ids.include?(improved_custom_id) }
  x.compare!
end

With this benchmark, we got the results:

Warming up --------------------------------------
include? before improvements
                        16.564k i/100ms
include? after improvements
                        23.623k i/100ms
Calculating -------------------------------------
include? before improvements
                        165.641k (± 0.3%) i/s -    828.200k in   5.000032s
include? after improvements
                        235.784k (± 0.3%) i/s -      1.181M in   5.009491s

Comparison:
include? after improvements:   235784.5 i/s
include? before improvements:   165640.9 i/s - 1.42x  slower

Therefore, in this simple case, the new code is 42% faster. In my production case, it was roughly 2x faster, going from 1.62 seconds to 866.16 milliseconds.
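One related practice, not part of the benchmark above: if the objects are built once and reused across requests, freezing them up front precomputes the memoized hash, so no lookup pays for the first hash computation. A hypothetical sketch:

frozen_ids = (1..1000).map { |i| ImprovedCustomID.new(i * 1000).freeze }

# Each element already carries its memoized @hash, so the == calls inside
# include? reduce to integer comparisons (plus one hash computation for the probe).
frozen_ids.include?(ImprovedCustomID.new(61_000)) # => true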