Elasticsearch Index Mapping Definitions

Introduction

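Mapping is one of the most important configurations of an ES index, similar to a schema in a relational database. A mapping determines the data type of each document field along with other attributes, such as whether the field is indexed and how its text is analyzed. Beyond that, the mapping also determines how documents are stored and searched; a poorly designed mapping can degrade index performance and produce inaccurate search results.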

The following example shows a minimal mapping configuration:

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "gender": {
        "type": "keyword"
      }
    }
  }
}

Mapping Definition

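An index is a collection of documents, a document is a collection of fields, and each field has its own data type. When mapping your data, you create a mapping definition that contains the list of fields relevant to the documents. A mapping definition also includes metadata fields, such as the _source field, which customize how a document's associated metadata is handled.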

Mapping Approaches

ES index mappings come in two kinds: dynamic mapping and explicit mapping.

Dynamic Mapping

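The advantage of dynamic mapping is that developers can index documents right away, without defining a mapping or specifying field names and types, which makes it easy to get started. The drawback is that the mapping ES infers may not be ideal, although this can be addressed with dynamic templates.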

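When ES detects a new field in a document, by default it dynamically adds the field to the mapping. Setting the mapping's dynamic parameter to false disables this behavior. ES infers the field type from the JSON value: true or false becomes boolean, a floating-point number becomes float, an integer becomes long, a JSON object becomes object, an array takes its type from its first non-null element, and a string becomes date if it passes date detection and otherwise text with a keyword sub-field. Other field types, such as ip or geo_point, are not inferred by default and have to be mapped explicitly.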

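In the following example, we create a users index and index a document without defining a mapping; ES creates the mapping dynamically based on the document's field values: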

// Create the index
PUT users

// Index a document
POST users/_doc
{
  "name":"Lisa"
}

// View the mapping
GET users/_mapping

// Response
{
  "users": {
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
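
The dynamic templates mentioned above can be sketched as follows. This is a minimal example, assuming you would rather have new string fields mapped as keyword instead of the default text with a keyword sub-field; the template name strings_as_keywords is made up for illustration.

```
// Create the index with a dynamic template for string values
PUT users
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}
```

With a template like this in place, indexing the same document should map name as keyword rather than text.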

Explicit Mapping

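Although dynamic mapping is convenient, explicit mapping is recommended in most cases; after all, we know our data better than ES does. With explicit mapping, the developer defines the field mappings when creating the index, much like a schema in a relational database: the data is modeled before any document is indexed.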

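For safety, you can set the mapping's dynamic parameter to strict, in which case ES rejects documents containing unmapped fields with the error shown below. If the parameter is set to false, new fields are still accepted and kept in _source, but they are not indexed and therefore cannot be searched.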

{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "[3:9] mapping set to strict, dynamic introduction of [age] within [_doc] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "[3:9] mapping set to strict, dynamic introduction of [age] within [_doc] is not allowed"
  },
  "status": 400
}
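
A request sequence along these lines would be expected to trigger the error above. It is only a sketch: it assumes the index is created with dynamic set to strict at the top level of the mapping, and the age value is an arbitrary illustration.

```
// Create the index; unmapped fields are rejected
PUT users
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "keyword"
      }
    }
  }
}

// age is not in the mapping, so this request should fail with status 400
POST users/_doc
{
  "name": "Lisa",
  "age": 18
}
```

Changing "strict" to false in the same position should let the document through, with age kept in _source but left unindexed.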

Runtime Fields

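ES index mappings also support runtime fields. As the name suggests, a runtime field is evaluated at query time; it can be defined either in the mapping or in the search request.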

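Runtime fields have two main advantages: first, because they are not indexed, they take up no additional storage; second, they can be used like any other field, for example in sorting and aggregations. They also have drawbacks: because they are not indexed, their values must be computed from the original document at query time, which takes time by itself, and searching on them is slower still. So keep performance in mind when using runtime fields, and balance search performance against flexibility.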

In the following example, the users index stores only first_name and last_name, while full_name is implemented as a runtime field and needs no extra storage.

PUT users
{
  "mappings": {
    "runtime": {
      "full_name": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['first_name'].value+' '+doc['last_name'].value)"
        }
      }
    },
    "properties": {
      "first_name": {
        "type": "keyword"
      },
      "last_name": {
        "type": "keyword"
      }
    }
  }
}

Next, index a document and run a search that returns full_name:

POST users/_doc
{
  "first_name": "Michael",
  "last_name": "Jordan"
}

GET users/_search
{
  "fields": [
    "full_name"
  ]
}

// Response
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "users",
        "_id": "4Os9mY4BODFb3LbQ3HQN",
        "_score": 1,
        "_source": {
          "first_name": "Michael",
          "last_name": "Jordan"
        },
        "fields": {
          "full_name": [
            "Michael Jordan"
          ]
        }
      }
    ]
  }
}

Alternatively, you can define the runtime field in the search request, with the same effect:

GET users/_search
{
  "fields": [
    "full_name_v2"
  ],
  "runtime_mappings": {
    "full_name_v2": {
      "type": "keyword",
      "script": {
        "source": "emit(doc['first_name'].value+' '+doc['last_name'].value)"
      }
    }
  }
}
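
Since a runtime field can be used like any other field, it can also drive an aggregation. The sketch below assumes the users index was created with the full_name runtime field in its mapping, as in the earlier example; the aggregation name by_full_name is arbitrary.

```
// Count documents per computed full name
GET users/_search
{
  "size": 0,
  "aggs": {
    "by_full_name": {
      "terms": {
        "field": "full_name"
      }
    }
  }
}
```

Because the value is computed for every matching document at query time, an aggregation like this is likely to be noticeably slower on a large index than one on an indexed keyword field.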

Data Types

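ES supports a wide range of data types. Field types are grouped by family; types in the same family have exactly the same search behavior but may differ in space usage or performance characteristics.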

Commonly used data types supported by ES:

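binary: binary value encoded as a Base64 string
boolean: boolean type, accepts true and false values
Keywords: keyword family, used for exact matching; includes keyword, constant_keyword, and wildcard
Dates: date family; includes date and date_nanos
object: JSON object type
flattened: flattened object type; maps an entire JSON object as a single field value to avoid field explosion
nested: nested object type
join: defines parent/child relationships between documents in the same index
Range: range types, such as long_range, double_range, date_range, and ip_range
ip: IP address type
text: text type, used for full-text search
geo_point: geographic coordinate type
Multi-fields: index the same field in different ways for different purposes; for example, a field that needs both full-text search and aggregations can be mapped as both text and keyword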
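
To make the list more concrete, here is a sketch of a mapping that combines several of these types. All field names other than name are hypothetical and chosen only for illustration.

```
PUT users
{
  "mappings": {
    "properties": {
      // Multi-field: full-text search on name, exact match on name.keyword
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "active": {
        "type": "boolean"
      },
      "birthday": {
        "type": "date"
      },
      "last_login_ip": {
        "type": "ip"
      },
      "location": {
        "type": "geo_point"
      },
      // Nested: each address object is indexed as its own hidden document
      "addresses": {
        "type": "nested",
        "properties": {
          "city": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
```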

doc_values

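Setting doc_values to true builds a columnar, on-disk forward index for the field. Inverted indexes are well suited to full-text search, but they are a poor fit for sorting and aggregations, which is where a forward index shines. For this reason ES enables doc_values by default on every field type that supports it (text is a notable exception). This takes extra storage but speeds up sorting and aggregations on the field. If you are sure a field will never be used for sorting, aggregations, scripting, or geo filtering, you can disable doc_values to save storage.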

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}

fielddata

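By default, a text field can be searched but cannot be used for sorting, aggregations, or scripting, because text values are analyzed into tokens before being indexed, and the text type does not support doc_values. Forcing an aggregation on a text field results in an exception: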

// Create the index
PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "gender":{
        "type": "keyword"
      }
    }
  }
}

// Aggregate on the text field
GET users/_search
{
  "size": 0, 
  "aggs": {
    "name_count": {
      "terms": {
        "field": "name"
      }
    }
  }
}

// Result (excerpt)
"error": {
  "root_cause": [
    {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on [name] in [users]. Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [name] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
    }
  ]
}

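What if you really need to aggregate on a text field? You can enable the field's fielddata parameter.
fielddata is an in-memory data structure: ES reads the field's entire inverted index from disk, inverts the term-to-document relationship, and builds fielddata in memory for sorting, aggregations, and similar operations. Building fielddata is therefore expensive; it is disabled by default and generally not recommended.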

In the following example, fielddata is enabled on a text field so that it can be used in aggregations:

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fielddata": true
      },
      "gender":{
        "type": "keyword"
      }
    }
  }
}

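Because building fielddata in memory is so expensive, if you really need both full-text search and sorting or aggregations on the same text field, use multi-fields instead and map the field as both text and keyword: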

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword"
          }
        }
      },
      "gender":{
        "type": "keyword"
      }
    }
  }
}

// Aggregate on name.keyword
GET users/_search
{
  "size": 0, 
  "aggs": {
    "name_count": {
      "terms": {
        "field": "name.keyword"
      }
    }
  }
}

_source

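By default, every document has a _source field that stores the original document as it was indexed. The _source field itself is stored but not indexed, which means it cannot be searched, but it is returned along with the document in search results.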

In the following example, we index a user, and the search returns the original document:

POST users/_doc
{
 "name":"张三",
 "gender":"男"
}

GET users/_search

// Response
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "users",
        "_id": "6-uhnI4BODFb3LbQGHSD",
        "_score": 1,
        "_source": {
          "name": "张三",
          "gender": "男"
        }
      }
    ]
  }
}

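The _source field takes up extra storage. If you only need to search documents and never retrieve the originals, you can consider disabling it to save space.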

PUT users
{
  "mappings": {
    "_source": {
      "enabled": false
    }, 
    "properties": {
      "name": {
        "type": "keyword"
      },
      "gender":{
        "type": "keyword"
      }
    }
  }
}

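Once _source is disabled, the update, update_by_query, and reindex APIs, as well as highlighting, are no longer available, because ES no longer has the original document.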

store

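By default, field values are indexed but not stored. This means you can search on a field but cannot retrieve its original value from the field itself. That is usually not a problem, because the _source field already contains every field value.
However, if _source has been disabled, or if you want to extract just a few specific fields rather than the whole original document, you can enable store on individual fields to store them separately.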

In the following example, store is enabled on the name field, so a search can return only the name value:

PUT users
{
  "mappings": {
    "_source": {
      "enabled": false
    }, 
    "properties": {
      "name": {
        "type": "keyword",
        "store": true
      },
      "gender":{
        "type": "keyword"
      }
    }
  }
}

GET users/_search
{
  "stored_fields": ["name"]
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "users",
        "_id": "7Ou4nI4BODFb3LbQ93Sw",
        "_score": 1,
        "fields": {
          "name": [
            "张三"
          ]
        }
      }
    ]
  }
}
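
If _source is left enabled, source filtering in the search request is another way to return only selected fields without storing them separately. A minimal sketch, assuming an index whose mapping does not disable _source:

```
// Return only the name field from each hit's _source
GET users/_search
{
  "_source": ["name"]
}
```

Each hit should then carry a _source object containing only name. ES still has to load and parse the whole _source to do this, which is the main reason to prefer store when _source is very large.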

null_value

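By default, null values are neither indexed nor searchable: when a document field is null, ES treats the field as having no value. Business requirements, however, may call for searching on null values.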

The following example attempts to search for users whose gender is null:

// Create the index
PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "gender":{
        "type": "keyword"
      }
    }
  }
}

// Index a document
POST users/_doc
{
 "name":"张三",
 "gender":null
}

// Search for users whose gender is null
GET users/_search
{
  "query": {
    "term": {
      "gender": {
        "value": null
      }
    }
  }
}

This fails with an exception, because the search value cannot be null:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "value cannot be null"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "value cannot be null"
  },
  "status": 400
}

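In this case, we can use the null_value parameter to replace explicit null values with a given value, so that they can be indexed and searched.
In the following example, we use the string "NULL" in place of null: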

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "gender":{
        "type": "keyword",
        "null_value": "NULL"
      }
    }
  }
}

After indexing the document again, the search retrieves the document whose gender is null:

GET users/_search
{
  "query": {
    "term": {
      "gender": {
        "value": "NULL"
      }
    }
  }
}

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "users",
        "_id": "7uvEnI4BODFb3LbQB3QL",
        "_score": 0.2876821,
        "_source": {
          "name": "张三",
          "gender": null
        }
      }
    ]
  }
}
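
As an aside, when the goal is only to find documents that have no value for a field, and the mapping does not define null_value, an exists query wrapped in must_not is a commonly used alternative. A sketch under that assumption:

```
// Match documents where gender is missing or null
GET users/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "gender"
        }
      }
    }
  }
}
```

Note that once null_value is configured, the replacement value is indexed, so documents with an explicit null should count as having a value and would no longer match this query.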