PG全文检索


背景:大规模的全文检索通常使用专用的引擎,如ES,或基于SQL的Sphix,PG支持全文检索,对于规模不是很大的全文检查通常能满足

  • tsvector

postgres@findb:select 'Hello,leadx,how are you? I am missing you!'::tsvector;
                         tsvector                         
----------------------------------------------------------
 'Hello,leadx,how' 'I' 'am' 'are' 'missing' 'you!' 'you?'
(1 row)

postgres@findb:select to_tsvector('english','Hello,leadx,how are you? I am missing you!');
         to_tsvector          
------------------------------
 'hello':1 'leadx':2 'miss':8

可以用::tsvector来转换字符串为tsvector类型,to_tsvector来做规范化处理去掉“non-normalized”的部分

  • tsquery

postgres@findb:select 'hello&leadx'::tsquery;
      tsquery      
-------------------
 'hello' & 'leadx'
(1 row)
postgres@findb:select to_tsquery('hello&leadx');
    to_tsquery     
-------------------
 'hello' & 'leadx'
(1 row)
postgres@findb:select to_tsquery('hello|leadx');
    to_tsquery     
-------------------
 'hello' | 'leadx'
(1 row)
postgres@findb:select to_tsvector('english','how are you?,Hello,leadx,how are you? I am missing you!')@@to_tsquery('hello&you');
 ?column? 
----------
 t
(1 row)

postgres@findb:select to_tsvector('english','how are you?,Hello,leadx,how are you? I am missing you!')@@to_tsquery('hello&cat');
 ?column? 
----------
 f
(1 row)
postgres@findb:select to_tsvector('english','how are you?,Hello,leadx,how are you? I am missing you!')@@to_tsquery('hello|cat');
 ?column? 
----------
 t
(1 row)

tsquery为全文检索搜索条件,&表示and,|表示or

  • 实战例子1

postgres@findb:create table test_search(id int4,name text);
CREATE TABLE
postgres@findb:insert into test_search(id,name) select n,n||'_leadx' from generate_series(1,2000000) n;
INSERT 0 2000000
postgres@findb:\timing on
Timing is on.
postgres@findb:explain analyze select * from test_search where name like '1_leadx';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..28368.88 rows=200 width=17) (actual time=0.225..91.600 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Parallel Seq Scan on test_search  (cost=0.00..27348.88 rows=118 width=17) (actual time=43.506..88.621 rows=0 loops=2)
         Filter: (name ~~ '1_leadx'::text)
         Rows Removed by Filter: 1000000
 Planning Time: 0.114 ms
 Execution Time: 91.619 ms
(8 rows)

Time: 92.388 ms

postgres@findb:create index idx_test_search on test_search using gin (to_tsvector('english',name));
CREATE INDEX
Time: 11564.912 ms (00:11.565)

postgres@findb:\di+
                                     List of relations
 Schema |         Name          | Type  |  Owner   |     Table      |  Size  | Description 
--------+-----------------------+-------+----------+----------------+--------+-------------
 public | idx_test_search       | index | postgres | test_search    | 104 MB | 

postgres@findb:\dt+
                           List of relations
 Schema |      Name      | Type  |  Owner   |    Size    | Description 
--------+----------------+-------+----------+------------+-------------
 public | test_search    | table | postgres | 99 MB      | 

postgres@findb:select * from test_search where to_tsvector('english',name)@@to_tsquery('english','1_leadx');
 id |  name   
----+---------
  1 | 1_leadx
(1 row)

Time: 0.873 ms
postgres@findb:explain analyze select * from test_search where to_tsvector('english',name)@@to_tsquery('english','1_leadx');
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_search  (cost=36.39..240.08 rows=50 width=17) (actual time=0.090..0.090 rows=1 loops=1)
   Recheck Cond: (to_tsvector('english'::regconfig, name) @@ '''1'' & ''leadx'''::tsquery)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_test_search  (cost=0.00..36.38 rows=50 width=0) (actual time=0.085..0.085 rows=1 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, name) @@ '''1'' & ''leadx'''::tsquery)
 Planning Time: 0.225 ms
 Execution Time: 0.125 ms
(7 rows)

Time: 1.197 ms
#no-prefix testcase
postgres@findb:insert into test_search values(2000001,'noneprefix testcase');
INSERT 0 1
Time: 4.371 ms
postgres@findb:explain analyze select * from test_search where to_tsvector('english',name)@@to_tsquery('english','testcase');
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test_search  (cost=113.50..15218.16 rows=10000 width=17) (actual time=0.052..0.052 rows=1 loops=1)
   Recheck Cond: (to_tsvector('english'::regconfig, name) @@ '''testcas'''::tsquery)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_test_search  (cost=0.00..111.00 rows=10000 width=0) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, name) @@ '''testcas'''::tsquery)
 Planning Time: 0.086 ms
 Execution Time: 0.071 ms
(7 rows)

Time: 0.557 ms
#not like 'LIKE'
postgres@findb:select * from test_search where to_tsvector('english',name)@@to_tsquery('english','testcas');
 id | name 
----+------
(0 rows)

Time: 0.484 ms

通过全文检索走索引大幅缩短了查询时间,但是全文检索与like不一样,like是字符匹配,全文检索是单词匹配

  • json/jsonb的全文检索

postgres@findb:\df *to_tsvector*
                                 List of functions
   Schema   |       Name        | Result data type |   Argument data types   | Type 
------------+-------------------+------------------+-------------------------+------
 pg_catalog | array_to_tsvector | tsvector         | text[]                  | func
 pg_catalog | json_to_tsvector  | tsvector         | json, jsonb             | func
 pg_catalog | json_to_tsvector  | tsvector         | regconfig, json, jsonb  | func
 pg_catalog | jsonb_to_tsvector | tsvector         | jsonb, jsonb            | func
 pg_catalog | jsonb_to_tsvector | tsvector         | regconfig, jsonb, jsonb | func
 pg_catalog | to_tsvector       | tsvector         | json                    | func
 pg_catalog | to_tsvector       | tsvector         | jsonb                   | func
 pg_catalog | to_tsvector       | tsvector         | regconfig, json         | func
 pg_catalog | to_tsvector       | tsvector         | regconfig, jsonb        | func
 pg_catalog | to_tsvector       | tsvector         | regconfig, text         | func
 pg_catalog | to_tsvector       | tsvector         | text                    | func
(11 rows)

postgres@findb:create or replace function random_range(int4,int4)
findb-# returns int4
findb-# language SQL
findb-# as $$
findb$# select ($1+floor(($2-$1+1)*random()))::int4;
findb$# $$;
CREATE FUNCTION

postgres@findb:create or replace function random_text_simple(length int4)
findb-# returns text
findb-# language PLPGSQL
findb-# as $$
findb$# DECLARE
findb$# possible_chars text:='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
findb$# output text :='';
findb$# i int4;
findb$# pos int4;
findb$# begin
findb$# for i in 1..length loop
findb$# pos :=random_range(1,length(possible_chars));
findb$# output:=output||substr(possible_chars,pos,1);
findb$# end loop;
findb$# return output;
findb$# end;
findb$# $$;
CREATE FUNCTION
postgres@findb:\df
                              List of functions
 Schema |        Name        | Result data type | Argument data types | Type 
--------+--------------------+------------------+---------------------+------
 public | random_range       | integer          | integer, integer    | func
 public | random_text_simple | text             | length integer      | func
(2 rows)
postgres@findb:select random_text_simple(5);
 random_text_simple 
--------------------
 GRWZ6
(1 row)
postgres@findb:truncate table user_ini;
TRUNCATE TABLE
postgres@findb:
postgres@findb:
postgres@findb:insert into user_ini(id,user_id,user_name) select r,round(random()*1000000),random_text_simple(6) from generate_series(1,1000000) as r ;
INSERT 0 1000000
postgres@findb:create table tbl_user_search_json(id serial,user_info json);
CREATE TABLE
postgres@findb:insert into tbl_user_search_json(user_info) select row_to_json(user_ini) from user_ini;
INSERT 0 1000000
postgres@findb:\timing on
Timing is on.
postgres@findb:select * from tbl_user_search_json where to_tsvector('english',user_info)@@to_tsquery('ENGLISH','GUNTVU');
 id |                                         user_info                                         
----+-------------------------------------------------------------------------------------------
  7 | {"id":7,"user_id":550209,"user_name":"GUNTVU","create_time":"2020-12-19T12:11:55.070116"}
(1 row)

Time: 2371.057 ms (00:02.371)

postgres@findb:create index idx_tbl_user_search_json on tbl_user_search_json using gin(to_tsvector('english',user_info));
CREATE INDEX
Time: 15754.502 ms (00:15.755)
postgres@findb:select * from tbl_user_search_json where to_tsvector('english',user_info)@@to_tsquery('ENGLISH','GUNTVU');
 id |                                         user_info                                         
----+-------------------------------------------------------------------------------------------
  7 | {"id":7,"user_id":550209,"user_name":"GUNTVU","create_time":"2020-12-19T12:11:55.070116"}
(1 row)

Time: 5.685 ms

全文检查支持json/jsonb,并支持索引加速