Hashing

Hashing is a data structure

Its goal is to make the time complexity of inserting, deleting, looking up O(1)

Concept: If we know the exact location of the data, with an array structure, then we can directly access its location via index

Terms:

Key: unique info for different datas

hash function: convert a key to its hash index

hash index: data’s location in the hash table

hash table: array that stores references to (K,V) pairs (data)

table size: the length of hash table

load factor: number of elements / table size

Java’s hashCode() method: 1. in the object class

2. return a 32-bit int

3. some classes overwrite it

Trivial hash function: key => minor changes => valid index

Perfect hash function: one key <=> only one index

void insert(K key, V value) { table [ hashFunction ( key ) ] = data ; }

D lookup (K key) {return table [ hashFunction ( key ) ] ; }

void delete (K key) { table [ hashFunction ( key ) ] = null ; }

Collisions : different keys map into same index

Key Issues: 1. Good hash function
2. good table size
3. handle collisions

Good Hash Functions:
1. must be deterministic
2. uniform distribution outputs
3. minimize collisions
4. easily computable for computers O(1)

Techniques for Generating Hash Codes

Integer Key

ex: 90123456789

hash code = 123 * 11 + 456 * 121 + 789 * 1

Extraction: divide the useful part of the key

Weighting: modify each part by a weight

Folding: combine weighted parts together into an int

Handling String Keys:

let C_i be the ASCII table code for the charAt(i)
let N = length of String

for example “CAT”

hashCode = C_0 * 31^(N-1) + C_1 * 31^(N-2) +… + C_N-1 * 1

CAT = 67 * 31^2 + 65 * 31^1 + 84 * 31^0 = 66486

(NOTE: the code can be negative when overflow occurs)

Why we choose power 31?
1. It is a prime number
2. one less than the power of 2 (then it can be quickly computed by binary shift) —-> 31*i = (i <<5) – i

Handling Double Keys

Store a 64 bit floating point

1. Splitting it into two 32-bit parts, named upper part and lower part
2. Perform a bit exclusive or ( XOR )
3. calculate the hash code based on the result

e.g K = 011011011
upper = 0110; lower = 1011
0110
1011 XOR
————
1101

Then the hashCode is decimal representing of 1101, that is #13

pseudo code:

long bits = Double.doubleToLongBits (key)
int lower = (int) bits
int upper = (int) bits >> 32
int result = upper XOR lower

Choosing the Table Size

Suppose we have 100 items

When the table size is 10000 —– Load factor 0.01, expected collisions: 0-1

When the table size is 1000 —– Load factor 0.1, expected collisions: 3-7

When the table size is 100 —– Load factor 1, expected collisions: 30-40

When the table size is 10 —– Load factor 10, expected collisions: 90

So, smaller load factor leads to fewer collisions. Yet, it costs extra space at the same time

The optimal load factor is between 0.7 and 0.8

The most ideal situation of table size is a prime number, but sometimes people use an odd number for convenience.

Prime number maximum the number of spots available in HashTable

Resizing the Hash Table

when the current load factor is bigger than the threshold load factor, then we need to rehash the table

Wrong solution: Making table bigger then copying items to new table
It leads to chaos index

Good Strategy:
1. Double the table size to the nearest prime number ( conveniently * 2 + 1)
2. rehash each item into a new table

Time Complexity of this strategy seems O(N) but it is only called when we insert around N elements, thus O(N) / N still be O(1), which is called “amortization”

Collision Handling using Open Addressing

Some element is already occupying target spot

Open addressing: check the table for an open spot and base on a rule of searching

1. Linear Probing: step by step until finding an open address
i.e index, index+1. index+2
must wrap around table

Con: { primary clustering, not uniformly distributed } leads to O(N)
pro: { can always find an open address if one exists }

Deleting:

If we delete 337, then we may not are able to find 359 ( since 337 % 11 = 359% 11 )
Hence we may need to a “dummy value(-99999) ” to mark this spot that it is different from pure null spot.

2. Quadratic Probing: +1, +4, +9, +16 …

Con: {may not find an open address or existing elements, can lead to secondary clustering}
Pro: {avoids primary clustering}

3. Double Hashing: define a second hash function to compute step size
index, index+ hash2(key), index + 2hash2(key), index + 3hash2(key)
hash2(key) is called step size

Collision Handling using Buckets

Buckets: each element stores a bucket, that can store multiple items in one spot

1. Array Buckets – “fixed size”

Cons {can have long chains, can overfill}
Pros {easy to implement, no waste space}

2. Chained Buckets

Cons { long chains leads to poor look up time complicity }
Pros {no waste space, does not over fill}

3. Tree Buckets

Cons {lots of extra work, not much benefit}
Pros {guards against O(N) => O(log2(N) }

Notes from CS 400, UW-Madison

LeetCode4

https://blog.csdn.net/hk2291976/article/details/51107778

比特币黄金首遭“51%攻击”，可能动摇数字货币世界的根基

2018年5月中，“51%攻击”在区块链世界中爆发，比特币的分叉币之一BTG(比特币黄金)不幸沦为受害者。如果“51%攻击”可以成功篡改区块链，甚至可能动摇整个数字货币世界的根基。此次攻击中，比特币黄金开发团队公告显示攻击者掌控了BTG网络的大比例算力，从而针对交易所发动“51%攻击”，成功实施“双花交易”。攻击者向自己发送了超过38万个BTG。如果所有资金被盗，攻击者从交易所的损失中将获利超过1800万美元。

“51%攻击”的原理在于数字货币采用的分布式记账机制。以比特币为例，比特币网络是一个去中心化的分布式账本，每记一笔账都需要“全民投票”确认。当你掌握全网算力超过50%时，就拥有了对比特币网络操纵和篡改的权力。