Indexing

When working with large volumes of data, you need to fetch it deterministically, using a language that both the user and the machine can understand. For structured data (data that can be represented as columns), that language is SQL, or Structured Query Language. SQL falls short for unstructured data (such as a big chunk of text) because it requires the underlying data to be laid out as a set of columns. This is where indexing comes in: it enables fast search over unstructured data. The only perceptible difference for a user is how they query the data, since indexes usually don’t work over SQL and instead introduce their own query language.

Basic Overview of Indexing

One of the many kinds of indexes that can be built over unstructured data is an Inverted Index. This kind of index enables Full Text Search over chunks of unstructured text. An Inverted Index can be thought of as a hash map that enables fast lookup of the documents matching a given query.

While indexing a set of documents, an indexer splits the text into terms. For instance, the sentence "Parseable is awesome" can be split at spaces into three terms: Parseable, is, and awesome. An indexer takes in large chunks of text, splits them into terms based on predefined logic, and stores important information about each term, such as which documents contain the term and at what positions it occurs in them. All this information is used to answer a user’s query and return the relevant documents.
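To make the idea concrete, here is a minimal sketch of an inverted index in Python. It only illustrates the general technique described above and is not Parseable's implementation; the function names and the whitespace-only tokenizer are assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Split text into terms at whitespace (the simplest possible rule)."""
    return text.split()

def build_inverted_index(documents):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, term):
    """Return the sorted IDs of the documents that contain the term."""
    return sorted(index.get(term, {}))

documents = {
    1: "Parseable is awesome",
    2: "indexing is fast",
}
index = build_inverted_index(documents)
print(search(index, "is"))   # [1, 2]
print(index["awesome"])      # {1: [2]}
```

A lookup is a single hash-map access on the term, which is why this structure answers "which documents contain this word" quickly even when the documents themselves are large.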

Parseable Methodology

Since the data ingested through Parseable is structured, indexing in Parseable can work with that structure and support queries that target specific columns. To get indexing to work in Parseable, spin up an index server with the following environment variables:

P_INDEXER_ENDPOINT=ADDR:PORT
P_MODE=index

The index server expects a query server to have started first and will throw an error in case a query server is not found.

Query Language

API Reference

Parseable exposes the basic functionality that any indexing service is expected to support. (Elasticsearch-compatible APIs will be made available soon.)

Create Index

POST /api/search/v1/index

Parseable supports creating two types of indexes over a dataset. The first is a Rolling Window index. This type of index starts indexing live data as soon as a request is sent and automatically deletes data older than a given range.

The second is a static date-range index which indexes data for a given date range only and retains the indexed data for only a given number of days.

Within a single dataset, multiple indexes cannot overlap on a given time range.
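The overlap rule for static date-range indexes can be checked on the client before a request is sent. The sketch below is one way to do that, assuming ranges are half-open (an index may start exactly where another ends); how the server treats boundaries and rolling windows is not specified here, so treat this as an illustration only. The helper names are hypothetical.

```python
from datetime import datetime

def parse_rfc3339(timestamp):
    """Parse an RFC 3339 timestamp such as 2021-01-01T00:00:00Z."""
    # Older Python versions do not accept a trailing "Z", so spell out the UTC offset.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

def ranges_overlap(first, second):
    """Return True if two (startTime, endTime) pairs share any instant.

    Assumption: ranges are half-open, so one range may start exactly
    where another ends without counting as an overlap.
    """
    first_start, first_end = map(parse_rfc3339, first)
    second_start, second_end = map(parse_rfc3339, second)
    return first_start < second_end and second_start < first_end

existing = ("2021-01-01T00:00:00Z", "2021-01-02T00:00:00Z")
print(ranges_overlap(existing, ("2021-01-01T12:00:00Z", "2021-01-03T00:00:00Z")))  # True
print(ranges_overlap(existing, ("2021-01-02T00:00:00Z", "2021-01-03T00:00:00Z")))  # False
```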

Payload

// Payload for Rolling Window type index
{
    "streamName": "stream_name",
    "startTime": "3d",
    "endTime": "now",
    "indexName": "index_name",
    "columnsToIndex": null
}

// Payload for static date-range index
{
    "streamName": "stream_name",
    "startTime": "2021-01-01T00:00:00Z",
    "endTime": "2021-01-02T00:00:00Z",
    "indexName": "index_name",
    "retention": 30,
    "columnsToIndex": ["column1", "column2"]
}

streamName : String - Name of the dataset to create index over

startTime : String - For a rolling window index, a human-readable duration (3d = 3 days). For a static date-range index, an RFC 3339 timestamp

endTime : String - For a rolling window index, it must be now. For a static date-range index, an RFC 3339 timestamp

indexName : String - Unique name of the index, matching the regex ^[a-zA-Z0-9_]{1,30}$

columnsToIndex : Option<Vec<String>> - An optional vector of strings which will contain the columns to be indexed. If null, all columns will be indexed. (regardless of its value, all columns will still be stored and returned upon querying)

retention : Option<u64> - Only required for static date-range index stating the number of days for which the indexed data needs to be retained. (The cleanup task runs at 00:00 every day and will start counting from the time when the indexing task finishes)
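As a rough illustration of how these fields fit together, the following Python sketch builds a Create Index request using only the standard library. The base URL http://localhost:8000 and the helper name are assumptions, authentication headers are omitted (add whatever your deployment requires), and the index-name pattern is read as ^[a-zA-Z0-9_]{1,30}$.

```python
import json
import re
import urllib.request

# Assumed reading of the index-name rule: 1 to 30 letters, digits, or underscores.
INDEX_NAME_RE = re.compile(r"[a-zA-Z0-9_]{1,30}")

def create_index_request(base_url, stream_name, index_name, start_time,
                         end_time="now", columns=None, retention=None):
    """Build (but do not send) a POST request for /api/search/v1/index."""
    if not INDEX_NAME_RE.fullmatch(index_name):
        raise ValueError(f"invalid index name: {index_name!r}")
    payload = {
        "streamName": stream_name,
        "startTime": start_time,
        "endTime": end_time,
        "indexName": index_name,
        "columnsToIndex": columns,  # None is serialized as null: index all columns
    }
    if retention is not None:
        payload["retention"] = retention  # days; used by static date-range indexes
    return urllib.request.Request(
        url=f"{base_url}/api/search/v1/index",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Rolling window index over the last 3 days (hypothetical host and port).
request = create_index_request("http://localhost:8000", "stream_name", "index_name", "3d")
print(request.get_method(), request.full_url)
# POST http://localhost:8000/api/search/v1/index
```

Passing the request to urllib.request.urlopen would send it. Validating the name locally gives a clearer error than waiting for the server to reject the payload.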

Response

Started indexing job

Search Index

POST /api/search/v1/index/{index}

API to perform a search on the given index.

// Payload for search index API
{
    "query":"status_code:200",
    "startTime":"20h",
    "endTime":"now",
    "limit":1
}

query : String - The query to run against the index

startTime : String - Either a human-readable duration or an RFC 3339 timestamp. Together with endTime, it restricts the search to the given time range

endTime : String - Either now or an RFC 3339 timestamp

limit : u64 - Number of documents to be sent back

Response

{
    "records": [
        {
            "device_id": 3509.0,
            "host": "192.168.1.100",
            "level": "warn",
            "message": "Application is failing",
            "meta-containerimage": "ghcr.io/parseablehq/quest",
            "meta-containername": "log-generator",
            "meta-host": "10.116.0.3",
            "meta-namespace": "go-apasdp",
            "meta-podlabels": "app=go-app,pod-template-hash=6c87bc9cc9",
            "meta-source": "quest-test",
            "os": "Windows",
            "p_src_ip": "127.0.0.1",
            "p_timestamp": "2025-03-23T11:45:59.987Z",
            "p_user_agent": "k6/0.57.0 (https://k6.io/)",
            "process_id": 627.0,
            "response_time": 112.0,
            "runtime": "lmj",
            "session_id": "abc",
            "source_time": "2025-03-23T11:45:59.976Z",
            "status_code": 200.0,
            "user_id": 67637.0,
            "uuid": "ad8a2643-232c-47cf-8b77-d403e3ed7afb",
            "version": "1.0.0"
        }
    ],
    "totalRecords": 285170,
    "totalTime": "17.30946ms"
}

records : Vec<Value> - An array of the JSON records found. Empty if no records were found

totalRecords : u64 - Total number of relevant records found by the indexer (at most limit records are sent back)

totalTime : String - Time taken for the backend to get the records
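The sketch below shows one way to build a search request and read the fields of the response in Python. The host, the helper names, and the trimmed sample body are assumptions for illustration; authentication is omitted.

```python
import json
import urllib.request
from urllib.parse import quote

def search_request(base_url, index_name, query, start_time, end_time="now", limit=10):
    """Build (but do not send) a POST request for /api/search/v1/index/{index}."""
    payload = {
        "query": query,
        "startTime": start_time,
        "endTime": end_time,
        "limit": limit,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/search/v1/index/{quote(index_name, safe='')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def summarize(response_body):
    """Pull the headline numbers out of a search response body."""
    body = json.loads(response_body)
    return {
        "returned": len(body["records"]),  # capped by limit
        "matched": body["totalRecords"],   # all matches found by the indexer
        "took": body["totalTime"],
    }

request = search_request("http://localhost:8000", "index_name", "status_code:200", "20h", limit=1)
print(request.full_url)
# http://localhost:8000/api/search/v1/index/index_name

# Trimmed version of the sample response shown above.
sample = '{"records": [{"status_code": 200.0, "level": "warn"}], "totalRecords": 285170, "totalTime": "17.30946ms"}'
print(summarize(sample))
# {'returned': 1, 'matched': 285170, 'took': '17.30946ms'}
```

Comparing returned with matched tells a client whether the limit cut off the result set.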

Delete Index

DELETE /api/search/v1/index/{index}

API to delete the given index. This stops any indexing jobs that may be running for the index and deletes the index from the in-memory map and from the disk/object store.

Response

Index {index} deleted

List Indexes for a Dataset

GET /api/search/v1/index/list/{dataset}

API to retrieve the indexes for a given dataset

Response

[
    {
        "indexStatus": {
            "taskCompletionTime": null,
            "taskStartTime": "2025-03-22T04:55:14.565141577Z",
            "taskState": "Running"
        },
        "indexMetadata": {
            "indexName": "ts_rolling",
            "streamName": "teststream",
            "startTime": "2d",
            "endTime": "now",
            "retention": 2,
            "indexedColumns": null
        }
    },
    {
        "indexStatus": {
            "taskCompletionTime": "2025-03-22T04:59:08.880628628Z",
            "taskStartTime": "2025-03-22T04:55:52.132062456Z",
            "taskState": "Completed"
        },
        "indexMetadata": {
            "indexName": "static2",
            "streamName": "teststream",
            "startTime": "2025-03-08T02:00:00.000Z",
            "endTime": "2025-03-08T04:00:00.000Z",
            "retention": 20,
            "indexedColumns": null
        }
    }
]
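A client can use this response to see which indexing jobs are still in progress. The Python sketch below filters a trimmed copy of the sample response by taskState; the helper name is an assumption, and only the two states that appear in the sample (Running and Completed) are taken as known.

```python
import json

def indexes_in_state(list_response, state):
    """Return the names of the indexes whose task is in the given state."""
    return [
        entry["indexMetadata"]["indexName"]
        for entry in json.loads(list_response)
        if entry["indexStatus"]["taskState"] == state
    ]

# Trimmed version of the sample response shown above.
sample = """
[
    {
        "indexStatus": {"taskCompletionTime": null, "taskState": "Running"},
        "indexMetadata": {"indexName": "ts_rolling", "streamName": "teststream"}
    },
    {
        "indexStatus": {"taskCompletionTime": "2025-03-22T04:59:08.880628628Z", "taskState": "Completed"},
        "indexMetadata": {"indexName": "static2", "streamName": "teststream"}
    }
]
"""
print(indexes_in_state(sample, "Running"))    # ['ts_rolling']
print(indexes_in_state(sample, "Completed"))  # ['static2']
```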