String Hashing
String Hashing technique for fast string comparison
String Hashing
String Hashing is a technique that converts strings to numbers (hash values) enabling fast string comparison in .
Main Concept
Convert string to a number using the formula:
Where:
- = base (commonly 31 or 37)
- = modulo (commonly a large prime)
- = ASCII value of character at position
Polynomial Rolling Hash
To compute hash of any substring in , we precompute prefix hash:
Hash of substring :
Implementation
const long long MOD = 1e9 + 7;
const long long BASE = 31;
struct StringHash {
vector<long long> h, p;
int n;
StringHash(const string& s) {
n = s.size();
h.resize(n + 1);
p.resize(n + 1);
p[0] = 1;
h[0] = 0;
for (int i = 0; i < n; i++) {
p[i + 1] = (p[i] * BASE) % MOD;
h[i + 1] = (h[i] * BASE + (s[i] - 'a' + 1)) % MOD;
}
}
// Get hash of s[l..r] (0-indexed, inclusive)
long long getHash(int l, int r) {
long long res = (h[r + 1] - h[l] * p[r - l + 1] % MOD + MOD) % MOD;
return res;
}
// Compare s[l1..r1] with s[l2..r2]
bool compare(int l1, int r1, int l2, int r2) {
return getHash(l1, r1) == getHash(l2, r2);
}
};
Double Hashing
To reduce collision probability, use 2 hash functions simultaneously:
struct DoubleHash {
StringHash h1, h2;
DoubleHash(const string& s) : h1(s, 31, 1e9+7), h2(s, 37, 1e9+9) {}
pair<long long, long long> getHash(int l, int r) {
return {h1.getHash(l, r), h2.getHash(l, r)};
}
};
Usage Examples
1. Count Distinct Substrings of Length K
int countDistinct(const string& s, int k) {
StringHash sh(s);
set<long long> seen;
for (int i = 0; i + k <= s.size(); i++) {
seen.insert(sh.getHash(i, i + k - 1));
}
return seen.size();
}
2. Pattern Matching (Rabin-Karp)
vector<int> rabinKarp(const string& text, const string& pattern) {
StringHash textHash(text);
StringHash patternHash(pattern);
vector<int> matches;
long long patHash = patternHash.getHash(0, pattern.size() - 1);
for (int i = 0; i + pattern.size() <= text.size(); i++) {
if (textHash.getHash(i, i + pattern.size() - 1) == patHash) {
matches.push_back(i);
}
}
return matches;
}
Complexity
| Operation | Time | Space |
|---|---|---|
| Preprocessing | ||
| Query hash | - | |
| Compare substrings | - |
Tips
- Use Double Hashing to reduce collisions
- Choose appropriate base and mod - base should be larger than alphabet size, mod should be a large prime
- Watch out for overflow - use unsigned long long or __int128
- Birthday Paradox - collision probability increases with more hash values