哈希应用(详解)

1.位图

2.位图的应用

3.布隆过滤器

1.位图

1.1 位图的概念:

在海量数据中,并且数据都不重复, 那么要查找某一个数据使用其他的方法, 不仅会消耗很大的内存, 并且效率非常之低下. 那么就是可以采用每个比特位进行存放数据的信息(是否存在).

1.2 举个小小的实际栗子:

问题介绍: 给40 亿个不重复的无符号整数，没排过序。给一个无符号整数，如何快速判断一个数是否在这40 亿个数中.

(1)我们可能会思考到遍历;二分查找那么你要考虑到如果是遍历那么时间复杂度为O(n); 二分查找不仅要求数据是升序排放而且时间复杂度为O(log2^n).

(2)如果采用位图的方法:

那么需要多少个比特位捏? 2^32 = 4,294,967,296(42亿个数据);

那么需要多少内存捏? 2^32 / 8 / 1024 / 1024 / 1024 = 0.5G

1.3位图的实现

(1) bitset构造函数:

_bitset是数组类型为int, 分成4byte, 那么就是4 * 8= 32bit; 然后全部初始化为0.

        bitset()
		{
			//_bits.resize(N / 32 + 1, 0);
			_bits.resize((N >> 5) + 1, 0);
		}

(2) set位图操作:

因为一个数据我们不知道它是在哪个块中(这里简单理解32bit为一个块, 然后还有在块的第几个位置.那么刚好需要找到.

然后还要将j位数据置1,其他位不变捏?

那就是采用这一位的数据按位或上1.如果为0,那么也是1, 为1更加是1.

        void set(size_t x)
		{
			size_t i = x / 32;
			size_t j = x % 32;
			_bits[i] |= (1 << j);
		}

(3) reset操作:

和上面set找到数据一样; 处理数据就是将j位置置0, 其他位置不变.

如何保证j位置置0, 其他位置不变?

那就是将j位置按位与1的取反,那么无论j为是0还是1都被置0; 1的其他位取反之后是1和其他位按位与还是本身.

        void reset(size_t x)
		{
			size_t i = x / 32;
			size_t j = x % 32;
			_bits[i] &= ~(1 << j);
		}

(4)测试test:

测试数据的j位是否为1还是为0.

        void test(size_t x)
		{
			size_t i = x / 32;
			size_t j = x % 32;
			return _bits[i] & (1 << j);
		}

(5)主函数main:

#define _CRT_SECURE_NO_WARNINGS 1
#include"bitset.h"
#include<iostream>
#include<vector>
using namespace std;

int main()
{
	study::bitset<100> bs;
	bs.set(40);
	bs.set(39);

	cout << bs.test(38) << endl;
	cout << bs.test(39) << endl;
	cout << bs.test(40) << endl;
	cout << bs.test(41) << endl;

	bs.reset(40);
	cout << bs.test(38) << endl;
	cout << bs.test(39) << endl;
	cout << bs.test(40) << endl;
	cout << bs.test(41) << endl;

	return 0;
}

2.位图的应用

(1) 给定100亿个整数，设计算法找到只出现一次的整数？

分析: 那么可以将一个位图的状态分为三种: 出现0次, 出现1次, 出现两次及以上.

    template<size_t N>
	class twobitset
	{
	public:
		void set(size_t x)
		{
			//0->1 00 -> 01
			if (_bs1.test(x) == false && _bs2.test(x) == false)
			{
				_bs2.set(x);
			}
			//1->2 01 -> 10
			else if (_bs1.test(x) == false && _bs2.test(x) == true)
			{
				_bs1.set(x);
				_bs2.reset(x);
			}
			//2->3 10 -> 11
			else if (_bs1.test(x) == true && _bs2.test(x) == false)
			{
				_bs1.set(x);
				_bs2.set(x);
			}
			//3->..不变
		}

		void print()
		{
			for (size_t i = 0; i < N; i++)
			{
				//01
				if (_bs1.test(i) == false && _bs2.test(i) == true)
				{
					cout << "1->" <<  i << endl;
				}
				//10
				else if (_bs1.test(i) == true && _bs2.test(i) == false)
				{
					cout << "2->" << i << endl;
				}
			}
			cout << endl;
		}
	private:
		bitset<N> _bs1;
		bitset<N> _bs2;
	};

(2)给两个文件，分别有100亿个整数，我们只有1G内存，如何找到两个文件交集？

可以这样分析: 将两个文件的数据映射两个位图, 如果一个值在两张位图里面都存在就是交集.

    template<size_t N>
	class intersection
	{
	public:
		void set(size_t x)
		{
			//出现0次 0 -> 1
			if (_bits1.test(x) == false)
			{
				_bits1.set(x);
			}
			//出现两次及其以上.

			if (_bits2.test(x) == false)
			{
				_bits2.set(x);
			}
		}

		void check(size_t x)
		{
			for (size_t i = 0; i < N; i++)
			{
				//相交就返回.
				if (_bits1.test(i) == true && _bits2.test(i) == true)
				{
					cout << "相交" << i << endl;
				}
			}
			cout << endl;
		}

	private:
		bitset<N> _bits1;
		bitset<N> _bits2;
	};

(3) 1个文件有100亿个int，1G内存，设计算法找到出现次数不超过2次的所有整数?

是不是和第一个实例优点相同; 那你猜对了,一个位图的状态分为种情况; 分别是出现次数为0; 出现为1次; 出现次数为2; 出现次数2次以上.

	template<size_t N>
	class threebitset
	{
	public:
		void set(size_t x)
		{
			//0次 000 -> 001
			if (_bs1.test(x) == false && _bs2.test(x) == false && _bs3.test(x) == false)
			{
				_bs3.set(x);
			}
			//1次 001 -> 010
			else if (_bs1.test(x) == false && _bs2.test(x) == false && _bs3.test(x) == true)
			{
				_bs2.set(x);
				_bs3.reset(x);
			}
			//2次 010 -> 011
			else if (_bs1.test(x) == false && _bs2.test(x) == true && _bs3.test(x) == false)
			{
				_bs3.set(x);
			}
			//3次 011 -> 100
			else if (_bs1.test(x) == false && _bs2.test(x) == true && _bs3.test(x) == true)
			{
				_bs1.set(x);
				_bs2.reset(x);
				_bs3.reset(x);
			}
			//3次及其以上 100;不管
		}

		void print()
		{
			for (size_t i = 0; i < N; i++)
			{
				//出现0次
				if (_bs1.test(i) == false && _bs2.test(i) == false && _bs3.test(i) == false)
				{
					cout << "0->" << i << endl;
				}
				//出现1次
				else if (_bs1.test(i) == false && _bs2.test(i) == false && _bs3.test(i) == true)
				{
					cout << "1->" << i << endl;
				}
				//出现2次
				else if (_bs1.test(i) == false && _bs2.test(i) == true && _bs3.test(i) == false)
				{
					cout << "2->" << i << endl;
				}
			}
			cout << endl;
		}

	private:
		bitset<N> _bs1;
		bitset<N> _bs2;
		bitset<N> _bs3;
	};

3.布隆过滤器

考虑一个小问题:

如果是整形考虑在不在以及子类问题, 那就是位图可以解决;

那如果是其他类型考虑在不在捏? 那就要上布隆过滤器.

3.1代码编写

(1) BKDRHash和APHash以及DJBHash都是用来映射的, 由于一个值如果映射一个位置就很容易导致冲突, 造成误判, 然而上面的三种方法就是一个值映射到多个位置, 降低误判的概率,但是无法避免.

(2)一般布隆过滤器是不允许删除的,因为映射的值如果删除很可能会影响到其他的值的查找.

#pragma once
#include<iostream>
#include<vector>
#include<string>
#include"Bitset.h"
using namespace std;

namespace Study
{
	struct BKDRHash
	{
		size_t operator()(const string& key)
		{
			// BKDR
			size_t hash = 0;
			for (auto e : key)
			{
				hash *= 31;
				hash += e;
			}

			return hash;
		}
	};

	struct APHash
	{
		size_t operator()(const string& key)
		{
			size_t hash = 0;
			for (size_t i = 0; i < key.size(); i++)
			{
				char ch = key[i];
				if ((i & 1) == 0)
				{
					hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
				}
				else
				{
					hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
				}
			}
			return hash;
		}
	};

	struct DJBHash
	{
		size_t operator()(const string& key)
		{
			size_t hash = 5381;
			for (auto ch : key)
			{
				hash += (hash << 5) + ch;
			}
			return hash;
		}
	};

	template<size_t N,
		class K = string,
		class HashFunc1 = BKDRHash,
		class HashFunc2 = APHash,
		class HashFunc3 = DJBHash>
class BoolFilter
	{
	public:
		void set(const K& key)
		{
			size_t hash1 = HashFunc1()(key) % N;
			size_t hash2 = HashFunc2()(key) % N;
			size_t hash3 = HashFunc3()(key) % N;

			_bs.set(hash1);
			_bs.set(hash2);
			_bs.set(hash3);
		}

		//一般不允许删除滴,因为会影响到其他的值.
		void Reset(const K& key);

		bool Test(const K& key)
		{
			size_t hash1 = HashFunc1()(key) % N;
			if (_bs.test(hash1) == false)
				return false;

			size_t hash2 = HashFunc2()(key) % N;
			if (_bs.test(hash2) == false)
				return false;

			size_t hash3 = HashFunc3()(key) % N;
			if (_bs.test(hash3) == false)
				return false;

			//存在误判
			return true;
		}
	private:
		study::bitset<N> _bs;
	};

	void TestBF1()
	{
		BoolFilter<100> bf;
		bf.set("桃子");
		bf.set("香蕉");
		bf.set("哈密瓜");
		bf.set("西瓜");

		cout << bf.Test("桃子") << endl;
		cout << bf.Test("香蕉") << endl;
		cout << bf.Test("哈密瓜") << endl;
		cout << bf.Test("葡萄") << endl;
		cout << bf.Test("西瓜") << endl;
		cout << bf.Test("火龙果") << endl;
		cout << bf.Test("香瓜 ") << endl;
		cout << bf.Test("猕猴桃") << endl;
	}

	void TestBF2()
	{
		srand(time(0));
		const size_t N = 10000;
		BoolFilter<N * 10> bf;

		vector<string> v1;
		string url = "西瓜";

		for (size_t i = 0; i < N; i++)
		{
			v1.push_back(url + to_string(i));
		}

		for (auto& str : v1)
		{
			bf.set(str);
		}

		vector<string> v2;
		for (size_t i = 0; i < N; i++)
		{
			string urlstr = url;
			urlstr += to_string(999999 + i);
			v2.push_back(urlstr);
		}

		size_t n2 = 0;
		for (auto& str : v2)
		{
			if (bf.Test(str))//误判
			{
				n2++;
			}
		}
		cout << "相似字符的误判率为:" << (double)n2 / (double)N << endl;
	
		vector<string> v3;
		for (size_t i = 0; i < N; i++)
		{
			string url = "桃子";
			url += to_string(i + rand());
			v3.push_back(url);
		}

		size_t n3 = 0;
		for (auto& str : v3)
		{
			if (bf.Test(str))//误判
			{
				n3++;
			}
		}
		cout << "相似字符的误判率为:" << (double)n3 / (double)N << endl;
	}
}

3.2布隆过滤器的应用

(1)给两个文件，分别有100亿个query，我们只有1G内存，如何找到两个文件交集？分别给出精确算法和近似算法.

分析: 假如一个query是50byte, 那么100亿query就是5000亿byte也就是约等于500G,那么绝对会超出内存, 更加找不出来, 那么可以考虑一下哈希分割的思想.

首先就是A和B都将每个query的数据进行映射(i = Hahs(query) % 500); 然后映射到A0-A499/ B0-B499的小文件里面.最后Ai和BI找交集即可(Ai和Bi分别都插入setA和SetB,找交集).

那么又有一个小问题: 如果一个小文件的大小是10G内存还是不够捏?

可能有下面两种情况: 首先就是小文件里大多数是一个query, 还有就是小文件里面很多不同的query. 面对这两种情况我们都可以采用set里面解决, 为啥捏? 如果是情况一那么一定插入成功的只有第一次出现的,后面失败. 情况二的话不断插入set就会内存不足, 抛异常, 那么我们可以采取二次分割, 再找交集的方法.