A data set is a collection of objects that ensures each object is present only once in the collection.
In Ruby, there are 2 simple ways to define data sets:
Of course using Set
is the preferred way, as it has been designed specifically for this purpose:
require 'set' my_set = Set.new my_set << 5 my_set << 9 my_set << 5 puts my_set.to_a.join(', ') # 5, 9
Using Hash
is relatively easy too: just use the Hash’s keys to represent the data set, and associate a useless dummy value (nil) for each key:
my_set = {} my_set[5] = nil my_set[9] = nil my_set[5] = nil puts my_set.keys.join(', ') # 5, 9
So what about comparing them performance based ?
Here is the benchmark code:
require 'set' require 'profile' nbr_inserts = 100000 nbr_reads = 1000 size_set = 1000 Profiler__::start_profile my_set = Set.new nbr_inserts.times do |i| my_set << rand(size_set) end Profiler__::stop_profile puts "===================== Set INSERT ===================" Profiler__::print_profile($stdout) Profiler__::start_profile my_hash = {} nbr_inserts.times do |i| my_hash[rand(size_set)] = nil end Profiler__::stop_profile puts "===================== Hash INSERT ===================" Profiler__::print_profile($stdout) Profiler__::start_profile nbr_reads.times do |i| c = 0 my_set.each do |i| c += i end end Profiler__::stop_profile puts "===================== Set READ ===================" Profiler__::print_profile($stdout) Profiler__::start_profile nbr_reads.times do |i| d = 0 my_hash.each do |k, v| d += k end end Profiler__::stop_profile puts "===================== Hash READ ===================" Profiler__::print_profile($stdout)
And here is the result:
===================== Set INSERT =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 47.60 2.07 2.07 1 2072.00 4353.00 Integer#times 31.22 3.43 1.36 100000 0.01 0.02 Set#add 10.77 3.90 0.47 100000 0.00 0.00 Hash#[]= 10.41 4.35 0.45 100000 0.00 0.00 Kernel.rand 0.00 4.35 0.00 1 0.00 0.00 Set#initialize 0.00 4.35 0.00 1 0.00 0.00 NilClass#nil? 0.00 4.35 0.00 2 0.00 0.00 Class#new 0.00 4.35 0.00 1 0.00 0.00 Hash#initialize 0.00 4.35 0.00 1 0.00 0.00 Kernel.set_trace_func 0.00 4.35 0.00 1 0.00 4353.00 #toplevel ===================== Hash INSERT =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 72.25 1.92 1.92 1 1916.00 2652.00 Integer#times 14.18 2.29 0.38 100000 0.00 0.00 Kernel.rand 13.57 2.65 0.36 100000 0.00 0.00 Hash#[]= 0.00 2.65 0.00 1 0.00 2652.00 #toplevel ===================== Set READ =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 99.48 3.09 3.09 1000 3.09 3.09 Hash#each_key 0.52 3.10 0.02 1 16.00 3104.00 Integer#times 0.00 3.10 0.00 1000 0.00 3.09 Set#each 0.00 3.10 0.00 1000 0.00 0.00 Kernel.block_given? 0.00 3.10 0.00 1 0.00 3104.00 #toplevel ===================== Hash READ =================== % cumulative self self total time seconds seconds calls ms/call ms/call name 100.00 1.70 1.70 1000 1.70 1.70 Hash#each 0.00 1.70 0.00 1 0.00 1700.00 Integer#times 0.00 1.70 0.00 1 0.00 1700.00 #toplevel
Conclusion
Using Hash is around 73% faster than using Set.
This does not mean we should always use Hash instead of Set. Hash’s syntax is not as clear as Set’s one when dealing with data sets. Nil value objects make the code look a bit glitchy.
However if performance is a concern to you, as sometimes data sets can be used intensively, you may consider using Hash before thinking of implementing your algorithms in a C extension.
1 comment