Applied Spatial Statistics（三）点模式分析 - 最近邻分析

点模式分析是一种用于识别数据集中的异常或异常模式的统计方法。在点模式分析中，最常用的一种分析方法是最近邻分析（Nearest Neighbor Analysis）。这种方法主要关注数据点之间的距离，通过比较数据点与其最近邻点的距离来识别异常值或异常模式。

最近邻分析的基本原理：

距离度量：首先，需要定义一个距离度量，如欧几里得距离、曼哈顿距离或闵可夫斯基距离等，来衡量数据点之间的距离。
距离计算：计算每个数据点到其他所有点的距离，并找出每个点的最近邻点。
距离分布：统计所有数据点到其最近邻点的距离，并分析这些距离的分布。
异常值识别：如果某个数据点到其最近邻点的距离显著大于其他点的平均距离，那么这个数据点可能被认为是异常的。

最近邻分析的应用：

异常检测：在许多领域，如金融欺诈检测、网络安全和医疗诊断中，最近邻分析被用来识别异常行为或异常事件。
聚类分析：最近邻分析也常用于聚类分析中，通过分析点之间的距离来确定数据点的相似性，从而将相似的数据点聚类在一起。
推荐系统：在推荐系统中，最近邻分析可以用来找到与用户兴趣相似的其他用户，从而提供个性化的推荐。
图像处理：在图像处理中，最近邻分析可以用于图像压缩、图像分割和特征提取等任务。

本笔记本将演示最近邻统计量的抽样分布，以及我们如何使用完全空间随机性来进行假设检验。

我们将使用一个名为“pointpats”的 python 包，它是 Python 空间分析库 (PySAL) 的子模块。

import pointpats
from pointpats import PoissonPointProcess, PoissonClusterPointProcess, Window, poly_from_bbox, PointPattern
import libpysal as ps
import matplotlib.pyplot as plt
import numpy as np

考虑这是我们观察到的事件及其位置。

observed = np.array([[66.22, 32.54], [22.52, 22.39], [31.01, 81.21],
          [9.47, 31.02],  [30.78, 60.10], [75.21, 58.93],
          [79.26,  7.68], [8.23, 39.93],  [98.73, 77.17],
          [89.78, 42.53], [65.19, 92.08], [54.46, 8.48]])

plt.scatter(observed[:,0],observed[:,1],s=40,color='red',)

<matplotlib.collections.PathCollection at 0x177f08450>

在这里插入图片描述

在“pointpats”中，您可以非常轻松地计算最近邻距离。下面的结果与我们在幻灯片上的结果相同。第一个元素是最近邻索引数组。请注意，在 python 中，索引从零开始，因此与幻灯片上的索引相比，索引偏移了 1。第二个元素是每个事件与其最近事件之间距离的数组。

pp = PointPattern(observed)
pp.knn()

(array([[ 9],
        [ 3],
        [ 4],
        [ 7],
        [ 2],
        [ 9],
        [11],
        [ 3],
        [ 5],
        [ 5],
        [ 5],
        [ 6]]),
 array([[25.59050019],
        [15.64542745],
        [21.11125292],
        [ 8.99587128],
        [21.11125292],
        [21.93729473],
        [24.81289987],
        [ 8.99587128],
        [29.76387072],
        [21.93729473],
        [34.63124168],
        [24.81289987]]))

我们可以计算平均最近邻距离：

observed_ann = pp.knn()[1].mean()
observed_ann

21.612139802089246

然后我们尝试模拟一些随机模式作为比较的基线。

首先我们创建一个窗口

ext_square = [(0,0), (0,100), (100,100), (100, 0)]
window_square = Window(ext_square)

然后我们按照泊松点过程生成随机点。

# n is the number of points
# sample is number of random patterns we generate, we just do 1 pattern.
samples = PoissonPointProcess(window_square, n=12, samples=1)

#get the coordinates from the PoissonPointProcess function.
pp_pcp = samples.realizations[0]
plt.scatter(pp_pcp[:,0],pp_pcp[:,1],s=15,color='red')

<matplotlib.collections.PathCollection at 0x291ee9710>

在这里插入图片描述

为每个随机模式计算 ANND

pp = PointPattern(pp_pcp)
pp.knn()

(array([[ 9],
        [ 9],
        [ 6],
        [10],
        [ 9],
        [ 6],
        [ 5],
        [ 6],
        [ 3],
        [ 1],
        [ 3],
        [ 1]]),
 array([[26.79532428],
        [ 9.11686077],
        [27.75869435],
        [ 5.62917903],
        [23.94820159],
        [20.25354371],
        [20.25354371],
        [28.34052548],
        [27.80993129],
        [ 9.11686077],
        [ 5.62917903],
        [17.91492018]]))

pp.knn()[1].mean()

18.547230348943653

接下来，还记得我们谈到的排列测试吗？我们将生成随机模式，作为我们的零采样分布。在这里，我们只是重复前面的步骤 1,000 次。

ann_list = []
np.random.seed(222)
for i in range(1000):
    
    samples = PoissonPointProcess(window_square, 12, 1)
    pp_pcp = samples.realizations[0]
    pp = PointPattern(pp_pcp)
    pp.knn()
    
    ann = pp.knn()[1].mean()
    print(ann)
    
    ann_list.append(pp.knn()[1].mean())

绘制 CSR 假设下 ANND 的抽样分布。

plt.hist(ann_list)
plt.axvline(observed_ann,color='red')

<matplotlib.lines.Line2D at 0x176cc0250>

在这里插入图片描述

观察到ANND = 21.6，处于垂直线位置。

我们可以计算在 1,000 个随机模式中我们得到比观察到的 ANND 更极端的 ANND 值的次数。

p_sim = np.sum(ann_list >= observed_ann)/1000
p_sim

0.039

39 次，因此我们可以根据排列测试得到 p 值为 0.039。然后你就可以得出你的结论了。较大的 ANND 表示规则/分散，而较小的 ANND 表示簇状模式。

因此，让我们对聚类模式重复此操作。我们可以使用名为“PoissonClusterPointProcess”的函数来生成一些假的聚类模式。您可以更改多个参数以允许集群的数量（“parents”）和集群的大小（“radius”）。

help(PoissonClusterPointProcess)

Help on class PoissonClusterPointProcess in module pointpats.process:

class PoissonClusterPointProcess(PointProcess)
 |  PoissonClusterPointProcess(window, n, parents, radius, samples, keep=False, asPP=False, conditioning=False)
 |  
 |  Poisson cluster point process (Neyman Scott).
 |  Two stages:
 |  1. parent CSR process: :math:`N`-conditioned or
 |  :math:`\lambda`-conditioned. If parent events follow a
 |  :math:`\lambda`-conditioned CSR process,
 |  the number of parent events varies across realizations.
 |  2. child process: fixed number of points in circle centered
 |  on each parent.
 |  
 |  Parameters
 |  ----------
 |  window        : :py:class:`~.window.Window`
 |                  Bounding geometric object to contain point process
 |                  realizations.
 |  n             : int
 |                  Size of each realization.
 |  parents       : int
 |                  Number of parents.
 |  radius        : float
 |                  Radius of the circle centered on each parent.
 |  samples       : list
 |                  Number of realizations.
 |  asPP          : bool
 |                  Control the data type of value in the "realizations"
 |                  dictionary. If True, the data type is point
 |                  pattern as defined in pointpattern.py; if False,
 |                  the data type is an two-dimensional array.
 |  conditioning  : bool
 |                  If True, use the :math:`lambda`-conditioned CSR process
 |                  for parent events, leading to varied number of
 |                  parent events across realizations;
 |                  if False, use the :math:`N`-conditioned CSR process.
 |  
 |  Attributes
 |  ----------
 |  children      : int
 |                  Number of childrens centered on each parent. Can
 |                  be considered as local intensity.
 |  num_parents   : dictionary
 |                  The key is the index of each realization. The
 |                  value is the number of parent events for each
 |                  realization.
 |  realizations  : dictionary
 |                  The key is the index of each realization, and the
 |                  value is simulated event points for each
 |                  realization. The data type of the value is
 |                  controlled by the parameter "asPP".
 |  parameters    : dictionary
 |                  Dictionary of a dictionary.
 |                  The key is the index of each realization, and the
 |                  value is a dictionary with the key 'n' and the
 |                  value always equal to the parameter n in the
 |                  case of
 |                  N-conditioned process.
 |                  For example, {0:{'n':100},1:{'n':100},2:{'n':100}}
 |                  2. randomly generated from a Possion process in
 |                  the case of lambda-conditioned process.
 |                  For example, {0:{'n':97},1:{'n':100},2:{'n':98}}
 |  
 |  Examples
 |  --------
 |  >>> import libpysal as ps
 |  >>> import numpy as np
 |  >>> from pointpats import Window
 |  >>> from libpysal.cg import shapely_ext
 |  
 |  Open the virginia polygon shapefile
 |  
 |  >>> va = ps.io.open(ps.examples.get_path("virginia.shp"))
 |  
 |  Create the exterior polygons for VA from the union of the county shapes
 |  
 |  >>> polys = [shp for shp in va]
 |  >>> state = shapely_ext.cascaded_union(polys)
 |  
 |  Create window from virginia state boundary
 |  
 |  >>> window = Window(state.parts)
 |  
 |  1. Simulate a Poisson cluster process of size 200 with 10 parents
 |  and 20 children within 0.5 units of each parent
 |  (parent events:  :math:`N`-conditioned CSR)
 |  
 |  >>> np.random.seed(10)
 |  >>> samples1 = PoissonClusterPointProcess(window, 200, 10, 0.5, 1, asPP=True, conditioning=False)
 |  >>> samples1.parameters # number of events for the realization
 |  {0: {'n': 200}}
 |  >>> samples1.num_parents #number of parent events for each realization
 |  {0: 10}
 |  >>> samples1.children # number of children events centered on each parent event
 |  20
 |  
 |  2. Simulate a Poisson cluster process of size 200 with 10 parents
 |  and 20 children within 0.5 units of each parent
 |  (parent events:  :math:`\lambda`-conditioned CSR)
 |  
 |  >>> np.random.seed(10)
 |  >>> samples2 = PoissonClusterPointProcess(window, 200, 10, 0.5, 1, asPP=True, conditioning=True)
 |  >>> samples2.parameters # number of events for the realization might not be equal to 200
 |  {0: {'n': 260}}
 |  >>> samples2.num_parents #number of parent events for each realization
 |  {0: 13}
 |  >>> samples2.children # number of children events centered on each parent event
 |  20
 |  
 |  Method resolution order:
 |      PoissonClusterPointProcess
 |      PointProcess
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, window, n, parents, radius, samples, keep=False, asPP=False, conditioning=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  realize(self, n)
 |      Generate n points which are distributed in a clustered
 |      fashion in the minimum bounding box of "window".
 |      
 |      Parameters
 |      ----------
 |      n             : int
 |                      Number of point events.
 |      
 |      Returns
 |      -------
 |      res           : array
 |                      (n,2), n point coordinates.
 |  
 |  setup(self)
 |      Generate the number of events for each realization. If
 |      "conditioning" is False, all the event numbers are the same;
 |      if it is True, the number of parents is a random variable
 |      following a Poisson distribution, resulting in varied number
 |      of events.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from PointProcess:
 |  
 |  draw(self, parameter)
 |      Generate a series of point coordinates within the given window.
 |      
 |      Parameters
 |      ----------
 |      parameter  : dictionary
 |                   Key: 'n'.
 |                   Value: size of the realization.
 |      
 |      Returns
 |      -------
 |                 : array
 |                   A series of point coordinates.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from PointProcess:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

np.random.seed(5)

cluster_samples = PoissonClusterPointProcess(window_square, n=12, parents=4, radius=10, samples=1)
cluster_pp_pcp = cluster_samples.realizations[0]
plt.scatter(cluster_pp_pcp[:,0],cluster_pp_pcp[:,1],s=15,color='red')

plt.xlim(0,100)
plt.ylim(0,100)

(0.0, 100.0)

在这里插入图片描述

cluster_pp = PointPattern(cluster_pp_pcp)
cluster_pp.knn()

plt.hist(ann_list)
plt.axvline(observed_ann,color='red')

<matplotlib.lines.Line2D at 0x292f3f350>

在这里插入图片描述

p_sim = np.sum(ann_list <= observed_ann)/1000
p_sim

0.0

在这种情况下，我们实际上有一个非常小的 p 值，我们观察到的 ANND 太小，以至于没有比我们的模拟中更极端的情况。

Applied Spatial Statistics（三）点模式分析 - 最近邻分析