BUSINESS !NTELLIGENCE .: Indexing (Storing Data) in Elasticsearch

Theory

An index in Elasticsearch is comparable to a database in a relational database system. It has a mapping, which defines one or more types.
An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

Comparison with a relational database:

Oracle / SQL Server ⏩ Databases ⏩ Tables ⏩ Rows

Elasticsearch ⏩ Indices ⏩ Types ⏩ Documents

Pushing data into Elasticsearch is called indexing, but before we can index a document, we need to decide where to store it. An Elasticsearch cluster can contain multiple indices, which in turn contain multiple types. These types hold multiple documents, and each document has multiple fields.

The diagram below gives an overview of how data is pushed to and pulled from Elasticsearch.

Common terms we are going to use in Elasticsearch

1. Fields

Fields are the smallest individual unit of data in Elasticsearch. Each field has a defined type and contains a single piece of data, for example a boolean, a string, or an array. Together, a collection of fields makes up a single Elasticsearch document.

2. Documents

Documents are JSON objects that are stored within an Elasticsearch index and are considered the base unit of storage. In the world of relational databases, a document can be compared to a row in a table.

For example, let’s assume that you are running an e-commerce application. You could have one document per product or one document per order. There is no limit to how many documents you can store in a particular index.

Data in documents is defined with fields comprised of keys and values. A key is the name of the field, and a value can be an item of many different types such as a string, a number, a boolean expression, another object, or an array of values.
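As a small illustration of the key and value structure described above, here is a sketch of one product document for the e-commerce example. The field names and values are invented for illustration and are not taken from any real index.

```python
import json

# Hypothetical product document; every field name and value is made up.
product = {
    "name": "Trail running shoe",                        # string
    "price": 89.99,                                      # number
    "in_stock": True,                                    # boolean
    "manufacturer": {"name": "Acme", "country": "DE"},   # another object
    "tags": ["shoes", "running"],                        # array of values
}

# Documents are exchanged with Elasticsearch as JSON text.
body = json.dumps(product, indent=2)
print(body)

# Round-tripping through JSON keeps the value types intact.
decoded = json.loads(body)
for key, value in decoded.items():
    print(f"{key}: {type(value).__name__}")
```

Each key in the dictionary plays the role of a field name, and each value is one of the value kinds listed above.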

Documents also contain reserved fields that constitute the document metadata such as:

_index – The index where the document resides.
_type – The type that the document represents.
_id – The unique identifier for the document.
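The sketch below shows how these metadata fields might wrap a document. The helper names, the "shop" index and the "product" type are hypothetical, and the `_source` key (which holds the original JSON body) is an additional metadata field not listed above.

```python
def wrap_document(index, doc_type, doc_id, source):
    """Return a dict shaped like the metadata envelope around a stored document."""
    return {
        "_index": index,     # index where the document resides
        "_type": doc_type,   # type the document represents
        "_id": doc_id,       # unique identifier of the document
        "_source": source,   # the original JSON body (assumed extra field)
    }


def document_path(doc):
    """Build an index/type/id style path that addresses this document."""
    return f"/{doc['_index']}/{doc['_type']}/{doc['_id']}"


stored = wrap_document("shop", "product", "1", {"name": "Trail running shoe"})
print(stored)
print(document_path(stored))
```

The index/type/id path layout assumes a release that still supports types.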

3. Types

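Elasticsearch types are used within an index to subdivide similar kinds of data, with each type representing a unique class of documents. A type consists of a name and a mapping (see below) and is recorded on each document in the _type field. This field can then be used for filtering when querying a specific type. (Mapping types were limited to one per index in Elasticsearch 6.0 and removed in later releases, so this description applies to earlier versions.)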

An index can have any number of types, and you can store documents belonging to these types in the same index.

4. Mapping

Like a schema in the world of relational databases, mapping defines the different types that reside within an index. It defines the fields for documents of a specific type — the data type (such as string and integer) and how the fields should be indexed and stored in Elasticsearch.

A mapping can be defined explicitly or generated automatically when a document is indexed using templates. (Templates include settings and mappings that can be applied automatically to a new index.)
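To make the two options concrete, the sketch below shows an explicit mapping body next to a toy version of automatic mapping. The "product" type and its fields are hypothetical, field type names differ between Elasticsearch versions, and `infer_mapping` is only a rough approximation of what automatic mapping does, not Elasticsearch's actual logic.

```python
# Explicit mapping for a hypothetical "product" type, written in the
# per-type layout used by releases that support types.
explicit_mapping = {
    "mappings": {
        "product": {
            "properties": {
                "name": {"type": "text"},
                "price": {"type": "float"},
                "in_stock": {"type": "boolean"},
            }
        }
    }
}


def infer_field_type(value):
    """Toy stand-in for automatic mapping: guess a field type from a JSON value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(document):
    """Build a properties block by inspecting one sample document."""
    return {
        "properties": {
            field: {"type": infer_field_type(value)}
            for field, value in document.items()
        }
    }


sample = {"name": "Trail running shoe", "price": 89.99, "in_stock": True, "stock": 12}
print(infer_mapping(sample))
```

Defining the mapping explicitly gives control over each field, while the inferred version depends entirely on the first values it happens to see.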

5. Index

Indices, the largest unit of data in Elasticsearch, are logical partitions of documents and can be compared to a database in the world of relational databases.

Continuing our e-commerce app example, you could have one index containing all of the data related to the products and another with all of the data related to the customers.

You can have as many indices defined in Elasticsearch as you want. These in turn will hold documents that are unique to each index.

Indices are identified by lowercase names, which are used to refer to an index when performing actions (such as searching and deleting) against the documents inside it.

6. Shards

Put simply, a shard is a single Lucene index. Shards are the building blocks of Elasticsearch and are what facilitate its scalability.

Index size is a common cause of Elasticsearch crashes. Since there is no limit to how many documents you can store on each index, an index may take up an amount of disk space that exceeds the limits of the hosting server. As soon as an index approaches this limit, indexing will begin to fail.

One way to counter this problem is to split up indices horizontally into pieces called shards. This allows you to distribute operations across shards and nodes to improve performance.
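A rough sketch of how documents could be spread across shards is shown below. Elasticsearch's documented routing rule has the form hash(routing value) modulo the number of primary shards, where the routing value defaults to the document id. Here `zlib.crc32` is only a stand-in for the real hash function, and the shard count and ids are hypothetical.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document id) to a shard number.

    zlib.crc32 is a stand-in hash, not the function Elasticsearch uses.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


shards = 5  # hypothetical number of primary shards
placement = {doc_id: pick_shard(doc_id, shards) for doc_id in ["1", "2", "3", "42", "abc"]}
print(placement)
```

Because the shard number depends on the shard count, changing that count would send existing ids to different shards, which is one way to see why the number of primary shards is fixed when the index is created.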

When you create an index, you can define how many shards you want. Each shard is an independent Lucene index that can be hosted anywhere in your cluster:
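A minimal sketch of the settings body for such a request follows. The `number_of_shards` and `number_of_replicas` keys are real index settings, while the helper function and the values 3 and 1 are only illustrative. The resulting JSON would be sent as the body of the create-index request for the chosen index name.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Build a create-index body with the given shard and replica counts."""
    if number_of_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if number_of_replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


print(json.dumps(index_settings(3, 1), indent=2))
```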

7. Replicas

Replicas, as the name implies, are Elasticsearch fail-safe mechanisms and are basically copies of your index’s shards. This is a useful backup system for a rainy day — or, in other words, when a node crashes. Replicas also serve read requests, so adding replicas can help to increase search performance.

To ensure high availability, a replica is never placed on the same node as the original shard (called the “primary” shard) from which it was replicated.
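The placement rule can be illustrated with a toy allocator. This is a simplified round-robin sketch with hypothetical node names. The real allocation logic in Elasticsearch is far more involved, and where this sketch raises an error the real cluster would instead leave the extra replicas unassigned.

```python
def allocate(nodes, number_of_shards, number_of_replicas):
    """Toy round-robin allocation that keeps every copy of a shard on a different node."""
    if number_of_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    layout = {}
    for shard in range(number_of_shards):
        # The primary goes to one node, each replica to the following nodes in the ring.
        copies = [
            nodes[(shard + offset) % len(nodes)]
            for offset in range(number_of_replicas + 1)
        ]
        layout[shard] = {"primary": copies[0], "replicas": copies[1:]}
    return layout


layout = allocate(["node-a", "node-b", "node-c"], number_of_shards=3, number_of_replicas=1)
for shard, copies in layout.items():
    print(shard, copies)
```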

As with shards, the number of replicas can be defined per index when the index is created. Unlike shards, however, you may change the number of replicas at any time after the index is created.

See the example in the “Shards” section above.
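The difference between the two settings can be modelled in a few lines. This is a sketch of the rule stated above rather than of Elasticsearch's own validation; the function name and the starting values are hypothetical.

```python
def apply_settings_update(current, update):
    """Model the rule from the text: replicas may change, the shard count may not."""
    new_shards = update.get("number_of_shards", current["number_of_shards"])
    if new_shards != current["number_of_shards"]:
        raise ValueError("number_of_shards is fixed once the index is created")
    return {**current, **update}


current = {"number_of_shards": 3, "number_of_replicas": 1}

# Raising the replica count on a live index is allowed.
updated = apply_settings_update(current, {"number_of_replicas": 2})
print(updated)
```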

8. Analyzers

Analyzers are used during indexing to break down phrases or expressions into terms. Defined within an index, an analyzer consists of optional character filters, a single tokenizer, and any number of token filters.

For example, a tokenizer could split a string into specifically defined terms when encountering a specific expression.

By default, Elasticsearch will apply the “standard” analyzer, which contains a grammar-based tokenizer and lowercases the resulting terms. It can also be configured to remove common English stop words, although stop-word removal is disabled by default. Elasticsearch comes bundled with a series of built-in tokenizers as well, and you can also use a custom tokenizer.
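The tokenizer-plus-filters structure can be sketched in plain Python. This is a toy pipeline, not the standard analyzer: the regular expression, the filter functions, and the small stop-word list are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Simplified tokenizer: split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def lowercase_filter(tokens):
    """Token filter that lowercases every term."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "an", "and", "of", "in"})):
    """Token filter that drops a small, made-up list of stop words."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, token_filters):
    """Run one tokenizer followed by any number of token filters, in order."""
    tokens = tokenize(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Quick brown fox", [lowercase_filter, stop_filter]))
# → ['quick', 'brown', 'fox']
```

The order of the filters matters: lowercasing first lets the stop filter match "The" as well as "the".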

9. Nodes

The heart of any ELK setup is the Elasticsearch instance, which has the crucial task of storing and indexing data.

In a cluster, different responsibilities are assigned to the various node types:

Data nodes: store data and execute data-related operations such as search and aggregation
Master nodes: are in charge of cluster-wide management and configuration actions such as adding and removing nodes
Client nodes: forward cluster requests to the master node and data-related requests to data nodes
Tribe nodes: act as client nodes that can connect to several clusters, performing read and write operations against the nodes of all of them
Ingest nodes (new in Elasticsearch 5.0): pre-process documents before indexing

By default, each node is automatically assigned a unique identifier, or name, that is used for management purposes and becomes even more important in a multi-node, or clustered, environment.
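As a simplified sketch of how roles combine, the function below classifies a node from three boolean flags. The flags are modelled on the `node.master`, `node.data` and `node.ingest` settings of Elasticsearch 5.x, which is an assumption about the version in use. Tribe nodes are configured differently and are left out, and the function name and role labels are hypothetical.

```python
def describe_node(master=True, data=True, ingest=True):
    """Return the roles a node would take on for the given flags.

    A node with every role switched off only routes requests, which
    corresponds to the client node described above.
    """
    roles = []
    if master:
        roles.append("master-eligible")
    if data:
        roles.append("data")
    if ingest:
        roles.append("ingest")
    return roles or ["client"]


print(describe_node())                                         # all roles enabled
print(describe_node(master=False, ingest=False))               # dedicated data node
print(describe_node(master=False, data=False, ingest=False))   # client node
```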

When installed, a single node will form a new single-node cluster named “elasticsearch,” but it can also be configured to join an existing cluster (see below) using the cluster name. Needless to say, these nodes need to be able to identify each other to be able to connect.

In a development or testing environment, you can set up multiple nodes on a single server. In production, however, due to the amount of resources that an Elasticsearch node consumes, it is recommended to have each Elasticsearch instance run on a separate server.

10. Cluster

An Elasticsearch cluster is comprised of one or more Elasticsearch nodes. As with nodes, each cluster has a unique identifier that must be used by any node attempting to join the cluster. By default, the cluster name is “elasticsearch,” but this name can be changed, of course.

One node in the cluster is the “master” node, which is in charge of cluster-wide management and configuration actions (such as adding and removing nodes). This node is elected automatically by the cluster, and a new master is elected if it fails. (See above for the other types of nodes in a cluster.)

Any node in the cluster can be queried, including the “master” node. But nodes also forward queries to the node that contains the data being queried.

As a cluster grows, it will reorganize itself to spread the data.

There are a number of useful cluster APIs that can query the general status of the cluster.

For example, the cluster health API returns a health status of either “green” (all primary and replica shards are allocated), “yellow” (all primary shards are allocated but one or more replicas are not), or “red” (at least one primary shard is not allocated in the cluster). More about the cluster APIs can be found in the Elasticsearch documentation.
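The three statuses can be derived from the allocation state of the shards, as in the sketch below. The shard records are a hypothetical, simplified representation and not the format returned by the cluster health API.

```python
def cluster_health(shards):
    """Derive a health colour from a list of shard records.

    Each record is a dict with two booleans: "primary" and "assigned".
    """
    if any(shard["primary"] and not shard["assigned"] for shard in shards):
        return "red"       # at least one primary shard is unallocated
    if any(not shard["assigned"] for shard in shards):
        return "yellow"    # all primaries are allocated, some replica is not
    return "green"         # every shard copy is allocated


healthy = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": True},
]
missing_replica = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
missing_primary = [
    {"primary": True, "assigned": False},
    {"primary": False, "assigned": False},
]

for state in (healthy, missing_replica, missing_primary):
    print(cluster_health(state))
```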

Inverted Index Concept

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
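The structure itself can be sketched in a few lines. This is a minimal illustration using two short, made-up documents and plain whitespace splitting. A real index would run each field through an analyzer and store more than document ids.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every unique term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort the terms, and the ids per term, to get a stable listing.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {1: "red apple", 2: "green apple pie"}  # hypothetical documents
inverted = build_inverted_index(docs)
for term, ids in inverted.items():
    print(term, ids)


def search(term):
    """Look a single term up in the inverted index."""
    return inverted.get(term.lower(), [])


print(search("Apple"))
```

A lookup is a single dictionary access on the term, which is what makes this layout fast for full-text search.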

For example, let’s say we have two documents, each with a content field containing the following:

The quick brown fox jumped over the lazy dog
Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the content field of each document into separate words (which we call terms, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this: