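Elasticsearch (Practice 1): Similarity Methods L1, L2, and cos

This post represents text as three-dimensional vectors and compares three similarity measures on them. It assumes Elasticsearch and Kibana are already set up.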

1. Create the index

PUT my-index-000002
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "status" : {
        "type" : "keyword"
      }
    }
  }
}

Index created successfully:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index-000002"
}

2. Insert data

PUT my-index-000002/_doc/1
{
  "my_dense_vector": [1, 0,0],
  "status" : "published"
}
PUT my-index-000002/_doc/2
{
  "my_dense_vector": [0,1,0],
  "status" : "published"
}
PUT my-index-000002/_doc/3
{
  "my_dense_vector": [0,0,1],
  "status" : "published"
}

Each request returns a response similar to the following:

{
  "_index": "my-index-000002",
  "_id": "3",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

3. View all data

GET my-index-000002/_search

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            1,
            0,
            0
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            0,
            1,
            0
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            0,
            0,
            1
          ],
          "status": "published"
        }
      }
    ]
  }
}

4. Query with the L1 method

GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l1norm(params.queryVector, 'my_dense_vector'))",
        "params": {
          "queryVector": [0, 0, 1]
        }
      }
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            0,
            0,
            1
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 0.33333334,
        "_source": {
          "my_dense_vector": [
            1,
            0,
            0
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 0.33333334,
        "_source": {
          "my_dense_vector": [
            0,
            1,
            0
          ],
          "status": "published"
        }
      }
    ]
  }
}

In the results, documents 1 and 2 receive the same score even though they are different points in the text vector space.
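
The scores above can be reproduced by hand. Below is a minimal sketch in plain Python, assuming `l1norm` returns the sum of absolute coordinate differences (Manhattan distance) between the query vector and the stored vector. The helper names `l1_distance` and `l1_score` are made up for illustration and are not part of Elasticsearch.

```python
def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l1_score(query, doc):
    """Mirrors the script above: 1 / (1 + l1norm(query, doc))."""
    return 1.0 / (1.0 + l1_distance(query, doc))


# The three documents indexed in step 2, keyed by _id.
docs = {"1": [1, 0, 0], "2": [0, 1, 0], "3": [0, 0, 1]}
query = [0, 0, 1]

l1_scores = {doc_id: l1_score(query, vec) for doc_id, vec in docs.items()}
print(l1_scores)
```

Documents 1 and 2 both lie at L1 distance 2 from the query, so both score 1/3, which Elasticsearch reports as 0.33333334 because scores are single-precision floats.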

5. Query with the L2 method

GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "1 / (1 + l2norm(params.queryVector, 'my_dense_vector'))",
        "params": {
          "queryVector": [0, 0, 1]
        }
      }
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            0,
            0,
            1
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 0.41421357,
        "_source": {
          "my_dense_vector": [
            1,
            0,
            0
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 0.41421357,
        "_source": {
          "my_dense_vector": [
            0,
            1,
            0
          ],
          "status": "published"
        }
      }
    ]
  }
}

The same thing happens here: with L2 distance, as with L1, documents 1 and 2 receive identical scores.
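
The L2 score can be checked in the same way. This is a rough sketch, assuming `l2norm` returns the Euclidean distance between the two vectors; `l2_distance` and `l2_score` are hypothetical helper names used only for this illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_score(query, doc):
    """Mirrors the script above: 1 / (1 + l2norm(query, doc))."""
    return 1.0 / (1.0 + l2_distance(query, doc))


# The three documents indexed in step 2, keyed by _id.
docs = {"1": [1, 0, 0], "2": [0, 1, 0], "3": [0, 0, 1]}
query = [0, 0, 1]

l2_scores = {doc_id: l2_score(query, vec) for doc_id, vec in docs.items()}
print(l2_scores)
```

Documents 1 and 2 are both at distance √2 from the query, so both score 1 / (1 + √2) = √2 - 1 ≈ 0.41421357, matching the response above.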

6. Query with cosine similarity

GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"       
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",    
        "params": {
          "query_vector": [0, 0, 1]      
        }
      }
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 2,
        "_source": {
          "my_dense_vector": [
            0,
            0,
            1
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            1,
            0,
            0
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            0,
            1,
            0
          ],
          "status": "published"
        }
      }
    ]
  }
}
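
The cosine scores can be verified by hand as well. Here is a minimal sketch, assuming `cosineSimilarity` is the dot product divided by the product of the two vector lengths; `cosine_similarity` and `cosine_score` are illustrative helper names, not Elasticsearch functions.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_score(query, doc):
    """Mirrors the script above: cosineSimilarity(query, doc) + 1.0."""
    return cosine_similarity(query, doc) + 1.0


# The three documents indexed in step 2, keyed by _id.
docs = {"1": [1, 0, 0], "2": [0, 1, 0], "3": [0, 0, 1]}
query = [0, 0, 1]

cos_scores = {doc_id: cosine_score(query, vec) for doc_id, vec in docs.items()}
print(cos_scores)
```

Documents 1 and 2 are both orthogonal to the query (cosine 0), so both score 1, while document 3 points in the same direction (cosine 1) and scores 2. The `+ 1.0` in the script keeps the score non-negative, which `script_score` requires.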

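All three methods can therefore give different vectors the same score. The next query repeats the cosine query with the query vector scaled to [0, 0, 100]: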

GET my-index-000002/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"       
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",    
        "params": {
          "query_vector": [0, 0, 100]      
        }
      }
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2,
    "hits": [
      {
        "_index": "my-index-000002",
        "_id": "3",
        "_score": 2,
        "_source": {
          "my_dense_vector": [
            0,
            0,
            1
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "1",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            1,
            0,
            0
          ],
          "status": "published"
        }
      },
      {
        "_index": "my-index-000002",
        "_id": "2",
        "_score": 1,
        "_source": {
          "my_dense_vector": [
            0,
            1,
            0
          ],
          "status": "published"
        }
      }
    ]
  }
}

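With all three methods, vectors at different positions in space can end up at the same distance from the query vector. Scaling the query to [0, 0, 100] also leaves the cosine scores unchanged, because cosine similarity depends only on the direction of the vectors, not on their length.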
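
The effect of scaling the query can be sketched in a few lines of Python. The cosine part mirrors the two cosine queries above; the L2 comparison is not run in this post and is included only for contrast, under the assumption that `l2norm` is the plain Euclidean distance. All helper names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


doc = [0, 0, 1]             # document 3 from step 2
unit_query = [0, 0, 1]      # query used in the first cosine request
scaled_query = [0, 0, 100]  # query used in the second cosine request

# Cosine similarity ignores vector length, so scaling changes nothing.
cos_unit = cosine_similarity(unit_query, doc)
cos_scaled = cosine_similarity(scaled_query, doc)

# Euclidean distance does depend on length, so it grows with the scaling.
l2_unit = l2_distance(unit_query, doc)
l2_scaled = l2_distance(scaled_query, doc)

print(cos_unit, cos_scaled)
print(l2_unit, l2_scaled)
```

Under these assumptions the cosine similarity to document 3 stays at 1 for both queries, while the Euclidean distance grows from 0 to 99, so a distance-based score would change with the scaling even though the cosine score does not.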
