====== Hash Function Efficiency ====== (This article is also available [[hash:efficiency | in Russian]]) In article of Arash Partow [[http://www.partow.net/programming/hashfunctions/ | "General Purpose Hash Function Algorithms"]] several 32-bit algorithms are reviewed: * rs --- simple hash function from Robert Sedgwicks book [[http://www.amazon.com/gp/product/0201514257/ | 'Algorithms in C']] * js --- bitwise hash function by Justin Sobel * pjw --- algorithm based on work by Peter J. Weinberger * bkdr --- hash function from Brian Kernighan and Dennis Ritchie's book [[http://www.amazon.com/gp/product/0131103628/ | 'The C Programming Language']] * sdbm --- algorithm of choice used in SDBM project * djb --- algorithm produced by Professor Daniel J. Bernstein * dek --- algorithm proposed by Donald E. Knuth in [[http://www.amazon.com/gp/product/0201896850/ | 'The Art Of Computer Programming']] * ap --- algorithm produced by Arash Partow Another five variants: * faq6 --- number 6 from [[http://burtleburtle.net/bob/hash/hashfaq.html | FAQ by Bob Jenkins]] * lookup3 --- author [[http://burtleburtle.net/bob/hash/ | Bob Jenkins]] * ly --- proposed by [[http://leo.yuriev.ru/random | Leonid Yuriev]] (congruential generator) * rot13 --- simple algorithm with circular shift, by Serge Vakulenko * crc32 --- [[http://www.w3.org/TR/PNG-CRCAppendix.html | standard checksum]] with polynom x32+x26+x23+x22+x16+x12+x11+x10+x8+x7+x5+x4+x2+x+1 Here are [[sources | the C sources]]. ===== Test 1 ===== To measure the efficiency of hash functions I prepared the following test data: * {{usdict.gz | American dictionary}} from Ispell project, 62075 words. * {{rudict.gz | Russian dictionary}} from Ispell project, 128900 words. * {{symbols.gz | A list of symbols}}, extracted from all libs on my linux workstation (libc.a and others), 136073 words. Total volume after merging is **326797** different words. For every word a 32-bit hash value was computed, and counted a number of collisions. ^ Algorithm ^ Collisions ^ ^ rs | 9 | ^ js | 98 | ^ pjw | 1315 | ^ bkdr | 11 | ^ sdbm | 14 | ^ djb | 83 | ^ dek | 308 | ^ ap | 16 | ^ ly | 9 | ^ rot13 | 7 | ^ faq6 | 14 | ^ lookup3 | 9 | ^ crc32 | 13 | A list of collisions for rs, bkdr, sdbm, ap, ly, rot13, faq6, lookup3 and crc32 is available [[conflicts | here]]. Algorithms with minimal collisions: * [[sources#ROT13 Hash Function| rot13]] --- one circular shift (rotation) and addition * [[sources#LOOKUP3 Hash Function| lookup3]] --- one shift and addition (or more) * [[sources#LY Hash Function| ly]] --- one multiplication and addition * [[sources#RS Hash Function| rs]] --- two multiplications Results are presented on picture. Two outsiders --- pjw and dek --- exceed the limits of Y axis. {{hash-eff.png}} ---- ===== Test 2 ===== In previous test, all data had MSB unchanged. For the second test another data set was selected: * {{dedict.gz | German dictionary}} from Ispell project, 39612 words. * {{hudict.gz | Hungarian dictionary}} from Ispell project, 211880 words. * {{itdict.gz | Italian dictionary}} from Ispell project, 37268 words. * {{sedict.gz | Swedish dictionary}} from Ispell project, 24019 words. Total volume after merging is **310595** different words. Algorithms js, pjw, djb и dek were excluded from testing. ^ Algorithm ^ Collisions ^ ^ rs | 5 | ^ bkdr | 14 | ^ sdbm | 14 | ^ ap | 32 | ^ ly | 8 | ^ rot13 | 10 | ^ faq6 | 13 | ^ lookup3 | 14 | ^ crc32 | 15 | Algorithms with minimal collisions: * [[sources#RS Hash Function| rs]] --- two multiplications * [[sources#LY Hash Function| ly]] --- one multiplication and addition * [[sources#ROT13 Hash Function| rot13]] --- one circular shift (rotation) and addition Results are presented on picture. {{de-hu-se-it.png}}