
Concepts

This page defines key terms and concepts related to OpenSearch.

Basic concepts

  • Document: The basic unit of information in OpenSearch, stored in JSON format.
  • Index: A collection of related documents.
  • JSON (JavaScript Object Notation): A text format used to store data in OpenSearch, representing information as key-value pairs.
  • Mapping: The schema definition for an index that specifies how documents and their fields should be stored and indexed.
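
As a minimal sketch of how these terms fit together, the following Python snippet builds a document and a mapping that describes its fields, then serializes both as JSON. The field names (`title`, `release_year`) and their values are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical document: the field names and values are illustrative only.
document = {"title": "The Wind Rises", "release_year": 2013}

# A mapping that describes how each field of such a document is stored and indexed.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "release_year": {"type": "integer"},
        }
    }
}

# Documents are stored in JSON format, so the document must serialize cleanly.
serialized = json.dumps(document)
print(serialized)
print(json.dumps(mapping, indent=2))
```

An index would then hold many documents of this shape, all described by the same mapping.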

Cluster architecture

  • Node: A single server that is part of an OpenSearch cluster.
  • Cluster: A collection of OpenSearch nodes working together.
  • Cluster manager: The node responsible for managing cluster-wide operations.
  • Shard: A subset of an index’s data; indexes are split into shards for distribution across nodes.
  • Primary shard: The original shard containing index data.
  • Replica shard: A copy of a primary shard for redundancy and search performance.
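
The idea of splitting an index into shards can be sketched with a toy routing function that assigns each document ID to one of a fixed number of primary shards. This is an illustration only: the shard count is an assumed value, and `zlib.crc32` stands in for whatever hash function OpenSearch actually uses.

```python
import zlib
from collections import defaultdict

NUM_PRIMARY_SHARDS = 3  # assumed value, for illustration only


def shard_for(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a primary shard for a document ID (toy hash, not OpenSearch's own)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Group ten hypothetical document IDs by the shard they are routed to.
shards = defaultdict(list)
for doc_id in (f"doc-{i}" for i in range(10)):
    shards[shard_for(doc_id)].append(doc_id)

for shard_number in sorted(shards):
    print(shard_number, shards[shard_number])
```

Because the assignment depends only on the document ID and the shard count, the same document is always routed to the same shard in this sketch.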

Data structures and storage

  • Doc values: An on-disk data structure for efficient sorting and aggregating of field values.
  • Inverted index: A data structure that maps words to the documents containing them.
  • Lucene: The underlying search library that OpenSearch uses to index and search data.
  • Segment: An immutable unit of data storage within a shard.
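
To make the inverted index definition concrete, here is a toy version in Python that maps each word to the set of document IDs containing it. The sample documents are made up, and splitting on whitespace is a simplification; Lucene's actual structures are far more compact and carry more information.

```python
from collections import defaultdict

# Made-up sample documents, keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}


def build_inverted_index(documents):
    """Map each word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


inverted = build_inverted_index(docs)
print(sorted(inverted["quick"]))  # [1, 3]
```

Looking up a word is then a single dictionary access rather than a scan of every document.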

Data operations

  • Ingestion: The process of adding data to OpenSearch.
  • Indexing: The process of storing and organizing data in OpenSearch to make it searchable.
  • Bulk indexing: The process of indexing multiple documents in a single request.
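
As a sketch of bulk indexing, the following helper builds a newline-delimited JSON request body in which each document is preceded by an action line, which is the general shape the `_bulk` API expects. The index name `movies`, the document IDs, and the helper name `build_bulk_body` are hypothetical; sending the request is left out.

```python
import json


def build_bulk_body(index_name, docs):
    """Build a newline-delimited bulk body from (doc_id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line: says what to do with the document on the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "movies",
    [("1", {"title": "First"}), ("2", {"title": "Second"})],
)
print(body)
```

Two documents produce four lines, so a single request carries what would otherwise take two separate indexing calls.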

Text analysis

  • Text analysis: The process of splitting the unstructured, free-text content of a document into a sequence of terms, which are then stored in an inverted index.
  • Analyzer: A component that processes text to prepare it for search. Analyzers convert text into terms that are stored in the inverted index.
  • Tokenizer: The component of an analyzer that splits text into individual tokens (usually words) and records metadata about their positions.
  • Token filter: The final component of an analyzer, which modifies, adds, or removes tokens after tokenization. Examples include lowercase conversion, stopword removal, and synonym addition.
  • Token: A unit of text created by a tokenizer during text analysis. Tokens can be modified by token filters and contain metadata used in the text analysis process.
  • Term: A data value that is directly stored in the inverted index and used for matching during search operations. Terms have minimal associated metadata.
  • Character filter: The first component of an analyzer, which processes raw text by adding, removing, or modifying characters before tokenization.
  • Normalizer: A special type of analyzer that processes text without tokenization. It can only perform character-level operations and cannot modify whole tokens.
  • Stemming: The process of reducing words to their root or base form, known as the stem.
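
The order of the analyzer components described above (character filter, then tokenizer, then token filters) can be sketched as a small pipeline. Everything here is a simplified stand-in: the tag-stripping regular expression, the word tokenizer, and the stopword list are assumptions for illustration, not OpenSearch's built-in components.

```python
import re

STOPWORDS = {"the", "a", "an", "and"}  # assumed, tiny stopword list


def char_filter(text):
    """Character filter: replace HTML-like tags with spaces before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenizer(text):
    """Tokenizer: split text into (token, position) pairs."""
    return [(m.group(0), pos) for pos, m in enumerate(re.finditer(r"\w+", text))]


def lowercase_filter(tokens):
    """Token filter: convert every token to lowercase."""
    return [(token.lower(), pos) for token, pos in tokens]


def stop_filter(tokens):
    """Token filter: remove stopwords, keeping the original positions."""
    return [(token, pos) for token, pos in tokens if token not in STOPWORDS]


def analyze(text):
    """Run the components in analyzer order."""
    tokens = tokenizer(char_filter(text))
    return stop_filter(lowercase_filter(tokens))


result = analyze("<p>The Quick brown Fox</p>")
print(result)  # [('quick', 1), ('brown', 2), ('fox', 3)]
```

The removed stopword leaves a gap at position 0, which shows how tokens carry positional metadata through the pipeline.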

Search and query concepts

  • Query: A request to OpenSearch that describes what you’re searching for in your data.
  • Query clause: A single condition within a query that specifies criteria for matching documents.
  • Filter: A query component that finds exact matches without scoring.
  • Filter context: The context in which a query clause asks the question “Does the document match the query clause?” The answer is yes or no, and no relevance score is calculated.
  • Query context: The context in which a query clause asks the question “How well does the document match the query clause?” The answer is expressed as a relevance score.
  • Full-text search: Search that analyzes and matches text fields, considering variations in word forms.
  • Keyword search: Search that requires exact text matches.
  • Query domain-specific language (DSL): OpenSearch’s primary query language for creating complex, customizable searches.
  • Query string query language: A simplified query syntax that can be used in URL parameters.
  • Dashboards Query Language (DQL): A simple text-based query language used specifically for filtering data in OpenSearch Dashboards.
  • Piped Processing Language (PPL): A query language that uses pipe syntax (|) to chain commands for data processing and analysis. It is primarily used for observability use cases in OpenSearch.
  • Relevance score: A number indicating how well a document matches a query.
  • Aggregation: A way to analyze and summarize data based on a search query.
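
The difference between query context and filter context is easiest to see in a query DSL request body. The sketch below builds one as a Python dictionary: the `must` clause runs in query context and contributes to the relevance score, while the `filter` clauses only include or exclude documents. The field names (`title`, `genre`, `release_year`) and values are hypothetical.

```python
import json

query = {
    "query": {
        "bool": {
            # Query context: how well does the document match?
            "must": [
                {"match": {"title": "wind rises"}},
            ],
            # Filter context: does the document match, yes or no?
            "filter": [
                {"term": {"genre": "animation"}},
                {"range": {"release_year": {"gte": 2010}}},
            ],
        }
    }
}

print(json.dumps(query, indent=2))
```

Here the `match` clause is a full-text search on an analyzed field, and the `term` clause is an exact match of the kind used in keyword search.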

Vector search concepts

See Vector search concepts.

Advanced concepts

The following sections describe more advanced OpenSearch concepts.

Update lifecycle

The lifecycle of an update operation consists of the following steps:

  1. An update is received by a primary shard and is written to the shard’s transaction log (translog). The translog is flushed to disk (followed by an fsync) before the update is acknowledged. This guarantees durability.
  2. The update is also passed to the Lucene index writer, which adds it to an in-memory buffer.
  3. On a refresh operation, the Lucene index writer flushes the in-memory buffers to disk (with each buffer becoming a new Lucene segment), and a new index reader is opened over the resulting segment files. The updates are now visible for search.
  4. On a flush operation, the shard fsyncs the Lucene segments. Because the segment files are a durable representation of the updates, the translog is no longer needed to provide durability, so the updates can be purged from the translog.
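
The four steps above can be modeled with a toy in-memory shard. This is a teaching sketch, not how OpenSearch is implemented: the class name `ToyShard` and its attributes are hypothetical, nothing is actually written to disk, and as a simplification `flush` first writes any buffered documents to a segment so that purging the translog cannot lose them.

```python
class ToyShard:
    """Toy model of a primary shard's update lifecycle (no real I/O)."""

    def __init__(self):
        self.translog = []         # operations not yet covered by a flush
        self.buffer = []           # in-memory indexing buffer
        self.segments = []         # searchable segments (tuples are immutable)
        self.durable_segments = 0  # number of segments counted as fsynced

    def index(self, doc):
        self.translog.append(doc)  # step 1: record in the translog
        self.buffer.append(doc)    # step 2: add to the in-memory buffer
        return "acknowledged"

    def refresh(self):
        # Step 3: the buffer becomes a new segment and is visible to search.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Step 4: segments become durable, so the translog can be purged.
        self.refresh()  # simplification of this sketch, see the note above
        self.durable_segments = len(self.segments)
        self.translog.clear()

    def search(self):
        return [doc for segment in self.segments for doc in segment]


shard = ToyShard()
shard.index({"id": 1})
print(shard.search())    # [] because the update is not yet refreshed
shard.refresh()
print(shard.search())    # [{'id': 1}]
shard.flush()
print(shard.translog)    # []
```

The gap between `index` and `refresh` is why an acknowledged update is durable but not immediately searchable.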

Translog

An indexing or bulk call responds when the documents have been written to the translog and the translog is flushed to disk, so the updates are durable. The updates will not be visible to search requests until after a refresh operation.

Refresh

Periodically, OpenSearch performs a refresh operation, which writes the documents from the in-memory Lucene index to files. These files are not guaranteed to be durable because an fsync is not performed. A refresh makes documents available for search.

Flush

A flush operation persists the files to disk using fsync, ensuring durability. Flushing ensures that the data stored only in the translog is recorded in the Lucene index. OpenSearch performs a flush as needed to ensure that the translog does not grow too large.
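
Translog, refresh, and flush behavior can be tuned through index settings. The sketch below assembles a settings body as a Python dictionary using the `index.refresh_interval`, `index.translog.durability`, and `index.translog.flush_threshold_size` settings. The values shown are illustrative assumptions, not recommendations, and how the body is sent to the cluster is not shown.

```python
import json

# Illustrative values only; choose settings based on your own workload.
index_settings = {
    "index": {
        # How often a refresh makes recent updates visible to search.
        "refresh_interval": "5s",
        "translog": {
            # "request" fsyncs the translog before acknowledging each request.
            "durability": "request",
            # Translog size at which a flush is triggered.
            "flush_threshold_size": "512mb",
        },
    }
}

print(json.dumps(index_settings, indent=2))
```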

Merge

In OpenSearch, a shard is a Lucene index, which consists of segments (or segment files). Segments store the indexed data and are immutable. Periodically, smaller segments are merged into larger ones. Merging reduces the overall number of segments on each shard, frees up disk space, and improves search performance. Eventually, segments reach a maximum size specified in the merge policy and are no longer merged into larger segments. The merge policy also specifies how often merges are performed.
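
The effect of a maximum segment size can be sketched with a toy merge loop that repeatedly combines the two smallest segments until the next merge would exceed the limit. This greedy rule and the function name `merge_segments` are assumptions made for illustration; the actual merge policy is more sophisticated, and the sketch does not model the disk space that merging frees.

```python
import heapq


def merge_segments(sizes, max_size):
    """Greedily merge the two smallest segments while the result fits max_size."""
    heap = list(sizes)
    heapq.heapify(heap)
    while len(heap) >= 2:
        smallest = heapq.heappop(heap)
        second = heapq.heappop(heap)
        if smallest + second > max_size:
            # If the two smallest do not fit together, no other pair will.
            heapq.heappush(heap, smallest)
            heapq.heappush(heap, second)
            break
        heapq.heappush(heap, smallest + second)
    return sorted(heap)


# Five segments (sizes in arbitrary units) with a maximum merged size of 100.
print(merge_segments([1, 2, 3, 50, 100], max_size=100))  # [56, 100]
```

In this example, the segment of size 100 is already at the maximum and is never merged again, while the smaller segments collapse into one.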