| 聊一聊双十一背后的技术 毫秒分词算啥, 试试正则和相似度 原作者:digoal/德哥 创作时间:2016-11-21 17:36:26+08 |
doudou586 发布于2016-11-21 18:36:26
评论: 1
浏览: 11997
|
看刑侦剧经常有看到人物拼图,然后到图库搜索的,以前可能靠的是人肉,今天,使用PG,可以靠数据库的图形近似度搜索功能。 《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL (Haar wavelet)》
但是我今天不是写图形搜索,我要写的是文本搜索,文本搜索,大家一定会想到分词。千万不要以为分词可以搞定一切需求,比如这样的需求就搞不定。
hello world打成了hello word或者hello w0rld,你要让数据库匹配出来,怎么搞?
又或者你的业务需要写正则进行匹配,怎么搞?比如一些域名的查询,错误! 超链接引用无效。
甚至更变态的 fi[a-z]{1}e.*?.?? ,这样的查询。
数据量小,并发小时,这种查询是可以忍受全表扫描和CPU处理过滤的。
但是想想一下,你是一个日请求过亿的业务,或者数据量亿级别的,全表扫描和CPU的开销会让你疯掉的。
那么来看看PostgreSQL是如何做到的吧。PG就像机器猫的百宝箱,里面什么都能找到,一起来召唤。
在pg_trgm的代码注释中有详细的介绍,分为4个步骤。
1. 使用PostgreSQL regexp库,将正则转换为NFA样式(图形化词组)。
2. 将NFA样式再进行转换,转换为扩展的图形样式(trigrams),包括拆分后的查询词组与NOT词组。
3. 简化,过滤不必要的trigrams。
4. 打包为TrgmPackedGraph结构,支持GIN,GIST索引的检索。
新建一张测试表,其中包含一个字符串类型字段,插入5000万随机字符串,要来就来大数据量,数据库可不是玩具。
postgres=# create extension pg_trgm; postgres=# create table tbl(id int primary key, info text, crt_time timestamp); postgres=# insert into tbl select id,md5(random()::text), now() from generate_series(1,50000000) t(id);
创建支持正则的索引
postgres=# CREATE INDEX trgm_idx_tbl_gin ON tbl USING GIN (info gin_trgm_ops);
正则查询例子,刚给一位搞其他数据库产品的同学演示过,被查询速度惊吓到,直接给跪了。
postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10;
id | info | crt_time
----------+----------------------------------+----------------------------
4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919
7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919
11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919
(3 rows)
Time: 142.087 ms
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from tbl where info ~ '53?6b.*8823a' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=527.75..537.82 rows=10 width=45) (actual time=140.867..140.902 rows=3 loops=1)
Output: id, info, crt_time
Buffers: shared hit=911
-> Bitmap Heap Scan on public.tbl (cost=527.75..5564.25 rows=5000 width=45)
(actual time=140.847..140.881 rows=3 loops=1)
Output: id, info, crt_time
Recheck Cond: (tbl.info ~ '53?6b.*8823a'::text)
Rows Removed by Index Recheck: 5
Heap Blocks: exact=8
Buffers: shared hit=911
-> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..526.50 rows=5000 width=0)
(actual time=140.832..140.832 rows=8 loops=1)
Index Cond: (tbl.info ~ '53?6b.*8823a'::text)
Buffers: shared hit=903
Planning time: 0.389 ms
Execution time: 140.928 ms
(14 rows)
postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10;
id | info | crt_time
----+------+----------
(0 rows)
Time: 0.974 ms
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=852.75..862.82 rows=10 width=45) (actual time=0.451..0.451 rows=0 loops=1)
Output: id, info, crt_time
Buffers: shared hit=35
-> Bitmap Heap Scan on public.tbl (cost=852.75..5889.25 rows=5000 width=45)
(actual time=0.449..0.449 rows=0 loops=1)
Output: id, info, crt_time
Recheck Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text)
Buffers: shared hit=35
-> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..851.50 rows=5000 width=0)
(actual time=0.447..0.447 rows=0 loops=1)
Index Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text)
Buffers: shared hit=35
Planning time: 0.366 ms
Execution time: 0.479 ms
(12 rows)
模糊查询例子(其实正则已经包含了模糊查询,这里只是再展示一下)
postgres=# select * from tbl where info ~ '821b8b92' limit 10;
id | info | crt_time
----+----------------------------------+----------------------------
4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919
(1 row)
Time: 141.353 ms
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from tbl where info ~ '821b8b92' limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=527.75..537.82 rows=10 width=45) (actual time=140.293..140.299 rows=1 loops=1)
Output: id, info, crt_time
Buffers: shared hit=904
-> Bitmap Heap Scan on public.tbl
(cost=527.75..5564.25 rows=5000 width=45) (actual time=140.291..140.297 rows=1 loops=1)
Output: id, info, crt_time
Recheck Cond: (tbl.info ~ '821b8b92'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=2
Buffers: shared hit=904
-> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..526.50 rows=5000 width=0)
(actual time=140.279..140.279 rows=2 loops=1)
Index Cond: (tbl.info ~ '821b8b92'::text)
Buffers: shared hit=902
Planning time: 0.331 ms
Execution time: 140.323 ms
(14 rows)
以上例子若全表扫描的耗时(本例未使用PG支持的CPU并行计算)
postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10;
id | info | crt_time
----------+----------------------------------+----------------------------
4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919
7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919
11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919
(3 rows)
Time: 93489.353 ms
postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10;
id | info | crt_time
----+------+----------
(0 rows)
Time: 78246.122 ms
postgres=# select * from tbl where info ~ '821b8b92' limit 10;
id | info | crt_time
----+----------------------------------+----------------------------
4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919
(1 row)
Time: 82218.218 ms

