Using data sets in Ruby: performance of Hash vs Set

A data set is a collection of objects that ensures each object is present only once in the collection.

In Ruby, there are 2 simple ways to define data sets:

  1. use the Set Ruby standard library, or
  2. use the Hash standard class.

Of course using Set is the preferred way, as it has been designed specifically for this purpose:

require 'set'
my_set = Set.new
my_set << 5
my_set << 9
my_set << 5
puts my_set.to_a.join(', ') # 5, 9

Using Hash is relatively easy too: just use the Hash’s keys to represent the data set, and associate a useless dummy value (nil) for each key:

my_set = {}
my_set[5] = nil
my_set[9] = nil
my_set[5] = nil
puts my_set.keys.join(', ') # 5, 9

So what about comparing them performance based ?
Here is the benchmark code:

require 'set'
require 'profile'

nbr_inserts = 100000
nbr_reads = 1000
size_set = 1000

Profiler__::start_profile
my_set = Set.new
nbr_inserts.times do |i|
  my_set << rand(size_set)
end
Profiler__::stop_profile
puts "===================== Set INSERT ==================="
Profiler__::print_profile($stdout)

Profiler__::start_profile
my_hash = {}
nbr_inserts.times do |i|
  my_hash[rand(size_set)] = nil
end
Profiler__::stop_profile
puts "===================== Hash INSERT ==================="
Profiler__::print_profile($stdout)

Profiler__::start_profile
nbr_reads.times do |i|
  c = 0
  my_set.each do |i|
    c += i
  end
end
Profiler__::stop_profile
puts "===================== Set READ ==================="
Profiler__::print_profile($stdout)

Profiler__::start_profile
nbr_reads.times do |i|
  d = 0
  my_hash.each do |k, v|
    d += k
  end
end
Profiler__::stop_profile
puts "===================== Hash READ ==================="
Profiler__::print_profile($stdout)

And here is the result:

===================== Set INSERT ===================
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 47.60     2.07      2.07        1  2072.00  4353.00  Integer#times
 31.22     3.43      1.36   100000     0.01     0.02  Set#add
 10.77     3.90      0.47   100000     0.00     0.00  Hash#[]=
 10.41     4.35      0.45   100000     0.00     0.00  Kernel.rand
  0.00     4.35      0.00        1     0.00     0.00  Set#initialize
  0.00     4.35      0.00        1     0.00     0.00  NilClass#nil?
  0.00     4.35      0.00        2     0.00     0.00  Class#new
  0.00     4.35      0.00        1     0.00     0.00  Hash#initialize
  0.00     4.35      0.00        1     0.00     0.00  Kernel.set_trace_func
  0.00     4.35      0.00        1     0.00  4353.00  #toplevel
===================== Hash INSERT ===================
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 72.25     1.92      1.92        1  1916.00  2652.00  Integer#times
 14.18     2.29      0.38   100000     0.00     0.00  Kernel.rand
 13.57     2.65      0.36   100000     0.00     0.00  Hash#[]=
  0.00     2.65      0.00        1     0.00  2652.00  #toplevel
===================== Set READ ===================
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.48     3.09      3.09     1000     3.09     3.09  Hash#each_key
  0.52     3.10      0.02        1    16.00  3104.00  Integer#times
  0.00     3.10      0.00     1000     0.00     3.09  Set#each
  0.00     3.10      0.00     1000     0.00     0.00  Kernel.block_given?
  0.00     3.10      0.00        1     0.00  3104.00  #toplevel
===================== Hash READ ===================
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.00     1.70      1.70     1000     1.70     1.70  Hash#each
  0.00     1.70      0.00        1     0.00  1700.00  Integer#times
  0.00     1.70      0.00        1     0.00  1700.00  #toplevel

Nice chart:

Conclusion

Using Hash is around 73% faster than using Set.
This does not mean we should always use Hash instead of Set. Hash’s syntax is not as clear as Set’s one when dealing with data sets. Nil value objects make the code look a bit glitchy.
However if performance is a concern to you, as sometimes data sets can be used intensively, you may consider using Hash before thinking of implementing your algorithms in a C extension.

About Muriel Salvan

I am a freelance project manager and polyglot developer, expert in Ruby and Rails. I created X-Aeon Solutions and rivierarb Ruby meetups. I also give trainings and conferences on technical topics. My core development principles: Plugins-oriented architectures, simple components, Open Source power, clever automation, constant technology watch, quality and optimized code. My experience includes big and small companies. I embrace agile methodologies and test driven development, without giving up on planning and risks containment methods as well. I love Open Source and became a big advocate.
Benchmark, Ruby , , , , ,

1 comment


Leave a Reply

Your email address will not be published.