====== Hash Function Efficiency ======
(This article is also available [[hash:efficiency | in Russian]])
In article of Arash Partow
[[http://www.partow.net/programming/hashfunctions/ |
"General Purpose Hash Function Algorithms"]]
several 32-bit algorithms are reviewed:
* rs --- simple hash function from Robert Sedgwicks book [[http://www.amazon.com/gp/product/0201514257/ | 'Algorithms in C']]
* js --- bitwise hash function by Justin Sobel
* pjw --- algorithm based on work by Peter J. Weinberger
* bkdr --- hash function from Brian Kernighan and Dennis Ritchie's book [[http://www.amazon.com/gp/product/0131103628/ | 'The C Programming Language']]
* sdbm --- algorithm of choice used in SDBM project
* djb --- algorithm produced by Professor Daniel J. Bernstein
* dek --- algorithm proposed by Donald E. Knuth in [[http://www.amazon.com/gp/product/0201896850/ | 'The Art Of Computer Programming']]
* ap --- algorithm produced by Arash Partow
Another five variants:
* faq6 --- number 6 from [[http://burtleburtle.net/bob/hash/hashfaq.html | FAQ by Bob Jenkins]]
* lookup3 --- author [[http://burtleburtle.net/bob/hash/ | Bob Jenkins]]
* ly --- proposed by [[http://leo.yuriev.ru/random | Leonid Yuriev]] (congruential generator)
* rot13 --- simple algorithm with circular shift, by Serge Vakulenko
* crc32 --- [[http://www.w3.org/TR/PNG-CRCAppendix.html | standard checksum]] with polynom x32+x26+x23+x22+x16+x12+x11+x10+x8+x7+x5+x4+x2+x+1
Here are [[sources | the C sources]].
===== Test 1 =====
To measure the efficiency of hash functions I prepared the following test data:
* {{usdict.gz | American dictionary}} from Ispell project, 62075 words.
* {{rudict.gz | Russian dictionary}} from Ispell project, 128900 words.
* {{symbols.gz | A list of symbols}}, extracted from all libs on my linux workstation (libc.a and others), 136073 words.
Total volume after merging is **326797** different words.
For every word a 32-bit hash value was computed, and counted a number
of collisions.
^ Algorithm ^ Collisions ^
^ rs | 9 |
^ js | 98 |
^ pjw | 1315 |
^ bkdr | 11 |
^ sdbm | 14 |
^ djb | 83 |
^ dek | 308 |
^ ap | 16 |
^ ly | 9 |
^ rot13 | 7 |
^ faq6 | 14 |
^ lookup3 | 9 |
^ crc32 | 13 |
A list of collisions for rs, bkdr, sdbm, ap, ly, rot13, faq6, lookup3 and crc32
is available [[conflicts | here]].
Algorithms with minimal collisions:
* [[sources#ROT13 Hash Function| rot13]] --- one circular shift (rotation) and addition
* [[sources#LOOKUP3 Hash Function| lookup3]] --- one shift and addition (or more)
* [[sources#LY Hash Function| ly]] --- one multiplication and addition
* [[sources#RS Hash Function| rs]] --- two multiplications
Results are presented on picture.
Two outsiders --- pjw and dek --- exceed the limits of Y axis.
{{hash-eff.png}}
----
===== Test 2 =====
In previous test, all data had MSB unchanged. For the second test another data set was selected:
* {{dedict.gz | German dictionary}} from Ispell project, 39612 words.
* {{hudict.gz | Hungarian dictionary}} from Ispell project, 211880 words.
* {{itdict.gz | Italian dictionary}} from Ispell project, 37268 words.
* {{sedict.gz | Swedish dictionary}} from Ispell project, 24019 words.
Total volume after merging is **310595** different words.
Algorithms js, pjw, djb и dek were excluded from testing.
^ Algorithm ^ Collisions ^
^ rs | 5 |
^ bkdr | 14 |
^ sdbm | 14 |
^ ap | 32 |
^ ly | 8 |
^ rot13 | 10 |
^ faq6 | 13 |
^ lookup3 | 14 |
^ crc32 | 15 |
Algorithms with minimal collisions:
* [[sources#RS Hash Function| rs]] --- two multiplications
* [[sources#LY Hash Function| ly]] --- one multiplication and addition
* [[sources#ROT13 Hash Function| rot13]] --- one circular shift (rotation) and addition
Results are presented on picture.
{{de-hu-se-it.png}}