KNN across categories in PostGIS using indexing

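I have a dataset of points of different types. For each point in the dataset, I want to find the nearest point in each category. I can do this, but the computation time is very long, and I'm struggling to make use of a spatial index while doing so.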

Sample data generation:

CREATE TYPE point_type AS ENUM ('1', '2', '3', '4', '5');

CREATE TABLE points AS
  SELECT ST_MakePoint(
    1000*random(),
    1000*random()
    )::geometry(Point) AS geom,
     ((random()*3)::int+1)::text::point_type  point_type,
         pk
  FROM generate_series(1,6000) pk;
update points
set point_type = '5' where pk = 999;

Indexes:

create index points_geom_idx
    on points using gist (geom);

CREATE INDEX points_dual ON points (point_type, geom);

Query that works and gives correct results, but is very slow:

Is this because the KNN search by distance is performed first, and the type filter is only applied afterwards?

explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'), ('5')
       ) v
)
SELECT c1.point_type,
       c1.pk AS main_id,
       b.pk  AS secondary_id,
       c1.secondary_point_type,
       b.secondary_point_type,
       b.distance
FROM (SELECT c.point_type,
             c.pk,
             c.geom,
             types.point_type secondary_point_type
      FROM  points c
          join types on true
          ) c1

         LEFT JOIN LATERAL ( SELECT c2.point_type,
                                    c2.geom,
                                    c2.pk,
                                    c2.point_type secondary_point_type,
                                    c1.geom <->c2.geom AS distance
                             FROM points c2

         where c1.pk <> c2.pk          and c1.secondary_point_type=c2.point_type

                             ORDER BY distance
                             LIMIT 1)  b on true;

Query that is very fast but doesn't provide correct results. I believe this is because it just fetches the single closest point, and if that point isn't of the correct type, the join condition fails, so nothing is joined and most results are left as nulls:

explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'), ('5')
       ) v
)
SELECT c1.point_type,
       c1.pk AS main_id,
       b.pk  AS secondary_id,
       c1.secondary_point_type,
       b.secondary_point_type,
       b.distance
FROM (SELECT c.point_type,
             c.pk,
             c.geom,
             types.point_type secondary_point_type
      FROM  points c
          join types on true
          ) c1

         LEFT JOIN LATERAL ( SELECT c2.point_type,
                                    c2.geom,
                                    c2.pk,
                                    c2.point_type secondary_point_type,
                                    c1.geom <->c2.geom AS distance
                             FROM points c2

         where c1.pk <> c2.pk
                             ORDER BY distance
                             LIMIT 1)  b on c1.secondary_point_type=b.secondary_point_type ;

I'm trying to make this query fast, using the spatial index for the KNN lookups across all types. Thanks!

EXPLAIN ANALYZE output for the first query:

Sort  (cost=29155.39..29230.39 rows=30000 width=28) (actual time=24533.167..24543.539 rows=30000 loops=1)
  Output: c.point_type, c.pk, c2.pk, (("*VALUES*".column1)::point_type), c2.point_type, ((c.geom <-> c2.geom))
  Sort Key: c2.point_type DESC
  Sort Method: quicksort  Memory: 2409kB
  Buffers: shared hit=180999
  ->  Nested Loop Left Join  (cost=0.15..26924.49 rows=30000 width=28) (actual time=5.024..24430.122 rows=30000 loops=1)
        Output: c.point_type, c.pk, c2.pk, ("*VALUES*".column1)::point_type, c2.point_type, ((c.geom <-> c2.geom))
        Buffers: shared hit=180999
        ->  Nested Loop  (cost=0.00..499.07 rows=30000 width=72) (actual time=0.546..105.076 rows=30000 loops=1)
              Output: c.point_type, c.pk, c.geom, "*VALUES*".column1
              Buffers: shared hit=64
              ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.341..12.850 rows=6000 loops=1)
                    Output: c.geom, c.point_type, c.pk
                    Buffers: shared hit=64
              ->  Materialize  (cost=0.00..0.09 rows=5 width=32) (actual time=0.001..0.006 rows=5 loops=6000)
                    Output: "*VALUES*".column1
                    ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=32) (actual time=0.034..0.141 rows=5 loops=1)
                          Output: "*VALUES*".column1
        ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.802..0.803 rows=1 loops=30000)
              Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
              Buffers: shared hit=180935
              ->  Result  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.800..0.800 rows=1 loops=30000)
                    Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                    One-Time Filter: (("*VALUES*".column1)::point_type = ("*VALUES*".column1)::point_type)
                    Buffers: shared hit=180935
                    ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..500.15 rows=5999 width=40) (actual time=0.787..0.787 rows=1 loops=30000)
                          Output: c2.geom, c2.point_type, c2.pk
                          Order By: (c2.geom <-> c.geom)
                          Filter: (c.pk <> c2.pk)
                          Rows Removed by Filter: 1
                          Buffers: shared hit=180935
Settings: search_path =  public, topology, tiger 
Planning Time: 4.964 ms
Execution Time: 24553.107 ms

EXPLAIN ANALYZE output for the second query:

QUERY PLAN
Nested Loop  (cost=0.88..1197.38 rows=30000 width=28) (actual time=3.535..4538.832 rows=30000 loops=1)
  Output: c.point_type, c.pk, b.pk, ("*VALUES*".column1)::point_type, b.secondary_point_type, b.distance
  Buffers: shared hit=36251
  ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.095..4.897 rows=6000 loops=1)
        Output: c.geom, c.point_type, c.pk
        Buffers: shared hit=64
  ->  Hash Left Join  (cost=0.88..0.98 rows=5 width=48) (actual time=0.726..0.743 rows=5 loops=6000)
        Output: "*VALUES*".column1, b.pk, b.secondary_point_type, b.distance
        Hash Cond: (("*VALUES*".column1)::point_type = b.secondary_point_type)
        Buffers: shared hit=36187
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=32) (actual time=0.001..0.008 rows=5 loops=6000)
              Output: "*VALUES*".column1
        ->  Hash  (cost=0.87..0.87 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=6000)
              Output: b.pk, b.secondary_point_type, b.distance
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              Buffers: shared hit=36187
              ->  Subquery Scan on b  (cost=0.15..0.87 rows=1 width=16) (actual time=0.701..0.703 rows=1 loops=6000)
                    Output: b.pk, b.secondary_point_type, b.distance
                    Buffers: shared hit=36187
                    ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.700..0.700 rows=1 loops=6000)
                          Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
                          Buffers: shared hit=36187
                          ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.695..0.695 rows=1 loops=6000)
                                Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                                Order By: (c2.geom <-> c.geom)
                                Filter: (c.pk <> c2.pk)
                                Rows Removed by Filter: 1
                                Buffers: shared hit=36187
Settings: search_path =  public, topology, tiger 
Planning Time: 3.206 ms
Execution Time: 4549.481 ms

Answer

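Your second index needs to be a GiST index, like your first one. For that, you need the btree_gist extension, which lets the point_type column be part of a GiST index.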

CREATE EXTENSION btree_gist;
CREATE INDEX points_dual ON points using gist (point_type, geom);
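
As a rough sketch of how the query from the question might look once this index exists: the filter on point_type and the distance ordering both sit inside the lateral subquery, so the planner has the chance to serve each lookup from points_dual. Note that the question already created a B-tree index named points_dual, so that one would have to be dropped first. Using enum_range with unnest instead of the VALUES list, and the aliases t and b, are my own choices for illustration; whether the index is actually used should be checked with EXPLAIN ANALYZE.

```sql
-- Sketch only: nearest neighbour per point and per type, intended to be
-- answered by the composite GiST index points_dual (point_type, geom).
SELECT c.pk          AS main_id,
       c.point_type,
       t.point_type  AS secondary_point_type,
       b.pk          AS secondary_id,
       b.distance
FROM points c
-- one row per enum value, taken from the type itself (assumed replacement
-- for the hand-written VALUES list)
CROSS JOIN unnest(enum_range(NULL::point_type)) AS t(point_type)
LEFT JOIN LATERAL (
    SELECT c2.pk,
           c.geom <-> c2.geom AS distance
    FROM points c2
    WHERE c2.point_type = t.point_type   -- equality on the first index column
      AND c2.pk <> c.pk                  -- skip the point itself
    ORDER BY c.geom <-> c2.geom          -- KNN ordering on the second index column
    LIMIT 1
) b ON true;
```

With the sample data this should still return 30000 rows (6000 points times 5 types), with secondary_id left NULL only where a type has no other point, such as type 5 for the single type-5 point.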



