A data set is a collection of objects that ensures each object is present only once in the collection.
In Ruby, there are 2 simple ways to define data sets:
Of course using Set is the preferred way, as it has been designed specifically for this purpose:
require 'set'
my_set = Set.new
my_set << 5
my_set << 9
my_set << 5
puts my_set.to_a.join(', ') # 5, 9
Using Hash is relatively easy too: just use the Hash’s keys to represent the data set, and associate a useless dummy value (nil) for each key:
my_set = {}
my_set[5] = nil
my_set[9] = nil
my_set[5] = nil
puts my_set.keys.join(', ') # 5, 9
So what about comparing them performance based ?
Here is the benchmark code:
require 'set'
require 'profile'
nbr_inserts = 100000
nbr_reads = 1000
size_set = 1000
Profiler__::start_profile
my_set = Set.new
nbr_inserts.times do |i|
my_set << rand(size_set)
end
Profiler__::stop_profile
puts "===================== Set INSERT ==================="
Profiler__::print_profile($stdout)
Profiler__::start_profile
my_hash = {}
nbr_inserts.times do |i|
my_hash[rand(size_set)] = nil
end
Profiler__::stop_profile
puts "===================== Hash INSERT ==================="
Profiler__::print_profile($stdout)
Profiler__::start_profile
nbr_reads.times do |i|
c = 0
my_set.each do |i|
c += i
end
end
Profiler__::stop_profile
puts "===================== Set READ ==================="
Profiler__::print_profile($stdout)
Profiler__::start_profile
nbr_reads.times do |i|
d = 0
my_hash.each do |k, v|
d += k
end
end
Profiler__::stop_profile
puts "===================== Hash READ ==================="
Profiler__::print_profile($stdout)
And here is the result:
===================== Set INSERT =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 47.60 2.07 2.07 1 2072.00 4353.00 Integer#times 31.22 3.43 1.36 100000 0.01 0.02 Set#add 10.77 3.90 0.47 100000 0.00 0.00 Hash#[]= 10.41 4.35 0.45 100000 0.00 0.00 Kernel.rand 0.00 4.35 0.00 1 0.00 0.00 Set#initialize 0.00 4.35 0.00 1 0.00 0.00 NilClass#nil? 0.00 4.35 0.00 2 0.00 0.00 Class#new 0.00 4.35 0.00 1 0.00 0.00 Hash#initialize 0.00 4.35 0.00 1 0.00 0.00 Kernel.set_trace_func 0.00 4.35 0.00 1 0.00 4353.00 #toplevel ===================== Hash INSERT =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 72.25 1.92 1.92 1 1916.00 2652.00 Integer#times 14.18 2.29 0.38 100000 0.00 0.00 Kernel.rand 13.57 2.65 0.36 100000 0.00 0.00 Hash#[]= 0.00 2.65 0.00 1 0.00 2652.00 #toplevel ===================== Set READ =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 99.48 3.09 3.09 1000 3.09 3.09 Hash#each_key 0.52 3.10 0.02 1 16.00 3104.00 Integer#times 0.00 3.10 0.00 1000 0.00 3.09 Set#each 0.00 3.10 0.00 1000 0.00 0.00 Kernel.block_given? 0.00 3.10 0.00 1 0.00 3104.00 #toplevel ===================== Hash READ =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 100.00 1.70 1.70 1000 1.70 1.70 Hash#each 0.00 1.70 0.00 1 0.00 1700.00 Integer#times 0.00 1.70 0.00 1 0.00 1700.00 #toplevel
Conclusion
Using Hash is around 73% faster than using Set.
This does not mean we should always use Hash instead of Set. Hash’s syntax is not as clear as Set’s one when dealing with data sets. Nil value objects make the code look a bit glitchy.
However if performance is a concern to you, as sometimes data sets can be used intensively, you may consider using Hash before thinking of implementing your algorithms in a C extension.

1 comment