When working with large volumes of data, you need to fetch it deterministically, using a language that both people and machines can understand. For structured data (data that can be represented as columns), that language is SQL, the Structured Query Language. For unstructured data (such as a large chunk of text), SQL falls short because it requires the underlying data to be organized into columns. This is where indexing comes in. Indexing enables fast search over unstructured data. The only perceptible difference for a user is how they query the data, since indexes usually don’t work over SQL and introduce their own query language.
Basic Overview of Indexing
One of the many kinds of index that can be built over unstructured data is an inverted index. It enables full-text search over chunks of unstructured text. An inverted index can be thought of as a hash map from terms to the documents that contain them, which enables fast lookup of documents matching a given query.
While indexing a set of documents, an indexer splits the text into terms. For instance, the sentence "Parseable is awesome" can be split at spaces into three terms: Parseable, is, and awesome. An indexer takes in large chunks of text, splits them into terms using predefined logic, and stores key information about each term, such as which documents contain it and at what positions it occurs. This information is then used to answer a user’s query and return the relevant documents.
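The term-splitting idea above can be illustrated with a minimal sketch. It is a toy in-memory structure that splits on whitespace and lowercases terms, not a description of how Parseable's indexer works internally; the function names `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to {doc_id: [positions]} by splitting text on whitespace."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


docs = {
    1: "Parseable is awesome",
    2: "Indexing is fast",
}
index = build_inverted_index(docs)
print(search(index, "is"))   # [1, 2]
print(index["awesome"])      # {1: [2]}
```

Looking up a term is a single dictionary access, which is why this kind of index answers "which documents contain this word" quickly regardless of how long the documents are.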
Parseable Methodology
Since the data ingested through Parseable is structured, indexing in Parseable can take advantage of that and support queries scoped to specific columns. To enable indexing in Parseable, spin up an index server with the following environment variables:
P_INDEXER_ENDPOINT=ADDR:PORT
P_MODE=index
The index server expects a query server to have started first and will throw an error in case a query server is not found.
Query Language
API Reference
Parseable exposes the basic operations expected of an indexing service: creating, searching, deleting, and listing indexes. (Elasticsearch-compatible APIs will be made available soon.)
Create Index
POST /api/search/v1/index
Parseable supports creating two types of indexes over a dataset. The first is a Rolling Window index. This type of index starts indexing live data as soon as a request is sent and automatically deletes data older than a given range.
The second is a static date-range index which indexes data for a given date range only and retains the indexed data for only a given number of days.
Indexes on the same dataset cannot cover overlapping time ranges.
Payload
// Payload for Rolling Window type index
{
"streamName": "stream_name",
"startTime": "3d",
"endTime": "now",
"indexName": "index_name",
"columnsToIndex": null
}
// Payload for static date-range index
{
"streamName": "stream_name",
"startTime": "2021-01-01T00:00:00Z",
"endTime": "2021-01-02T00:00:00Z",
"indexName": "index_name",
"retention": 30,
"columnsToIndex": ["column1", "column2"]
}
streamName : String
- Name of the dataset to create index over
startTime : String
- For a rolling window index, it must be a human-readable duration (3d = 3 days). For a static date-range index, it must be an RFC 3339 timestamp
endTime : String
- For a rolling window index, it must be now. For a static date-range index, it must be an RFC 3339 timestamp
indexName : String
- Unique name of the index, matching the regex ^[a-zA-Z0-9_]{1,30}$
columnsToIndex : Option<Vec<String>>
- An optional vector of strings which will contain the columns to be indexed. If null, all columns will be indexed. (regardless of its value, all columns will still be stored and returned upon querying)
retention : Option<u64>
- Only required for static date-range index stating the number of days for which the indexed data needs to be retained. (The cleanup task runs at 00:00 every day and will start counting from the time when the indexing task finishes)
Response
Started indexing job
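The two payload shapes above can be assembled programmatically. The following is a minimal sketch using only the Python standard library. `BASE_URL` and the helper names are hypothetical. Authentication headers are omitted. The index-name check assumes the intended pattern is 1 to 30 letters, digits, or underscores. The request is built but not sent.

```python
import json
import re
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your server address
INDEX_NAME_RE = re.compile(r"^[a-zA-Z0-9_]{1,30}$")  # assumed index-name rule


def _check_name(index_name):
    if not INDEX_NAME_RE.fullmatch(index_name):
        raise ValueError(f"invalid index name: {index_name!r}")


def rolling_index_payload(stream_name, index_name, window="3d", columns=None):
    """Payload for a rolling window index: a duration for startTime, 'now' for endTime."""
    _check_name(index_name)
    return {
        "streamName": stream_name,
        "startTime": window,
        "endTime": "now",
        "indexName": index_name,
        "columnsToIndex": columns,  # None serializes to null: index all columns
    }


def static_index_payload(stream_name, index_name, start, end, retention_days, columns=None):
    """Payload for a static date-range index: RFC 3339 bounds plus retention in days."""
    _check_name(index_name)
    return {
        "streamName": stream_name,
        "startTime": start,
        "endTime": end,
        "indexName": index_name,
        "retention": retention_days,
        "columnsToIndex": columns,
    }


def create_index_request(payload):
    """Build (but do not send) the POST request for the Create Index API."""
    return urllib.request.Request(
        f"{BASE_URL}/api/search/v1/index",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = create_index_request(rolling_index_payload("stream_name", "index_name"))
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Validating the index name on the client side is optional; it only avoids a round trip for names the server would presumably reject.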
Search Index
POST /api/search/v1/index/{index}
API to perform a search on the given index.
// Payload for search index API
{
"query":"status_code:200",
"startTime":"20h",
"endTime":"now",
"limit":1
}
query : String
- The query to run against the index
startTime : String
- Either a human readable duration or an rfc 3339 timestamp. This is used to search from within a given time-range
endTime : String
- Either now or an RFC 3339 timestamp
limit : u64
- Number of documents to be sent back
Response
{
"records": [
{
"device_id": 3509.0,
"host": "192.168.1.100",
"level": "warn",
"message": "Application is failing",
"meta-containerimage": "ghcr.io/parseablehq/quest",
"meta-containername": "log-generator",
"meta-host": "10.116.0.3",
"meta-namespace": "go-apasdp",
"meta-podlabels": "app=go-app,pod-template-hash=6c87bc9cc9",
"meta-source": "quest-test",
"os": "Windows",
"p_src_ip": "127.0.0.1",
"p_timestamp": "2025-03-23T11:45:59.987Z",
"p_user_agent": "k6/0.57.0 (https://k6.io/)",
"process_id": 627.0,
"response_time": 112.0,
"runtime": "lmj",
"session_id": "abc",
"source_time": "2025-03-23T11:45:59.976Z",
"status_code": 200.0,
"user_id": 67637.0,
"uuid": "ad8a2643-232c-47cf-8b77-d403e3ed7afb",
"version": "1.0.0"
}
],
"totalRecords": 285170,
"totalTime": "17.30946ms"
}
records : Vec<Value>
- An array of JSON records found. Empty array in case no records found
totalRecords : u64
- Total number of relevant records found by the indexer (only limit records are sent back)
totalTime : String
- Time taken for the backend to get the records
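A search call and its response can be handled along the following lines. This is a sketch under assumptions: `BASE_URL` and the function names are hypothetical, authentication is omitted, and the request is built but not sent. The sample body is a trimmed copy of the response shown above.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # assumption: replace with your server address


def build_search_request(index_name, query, start_time, end_time="now", limit=10):
    """Build (but do not send) a POST request for the Search Index API."""
    payload = {
        "query": query,
        "startTime": start_time,
        "endTime": end_time,
        "limit": limit,
    }
    url = f"{BASE_URL}/api/search/v1/index/{quote(index_name, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_search_response(body):
    """Return (records returned, total matches, backend time) from a response body."""
    parsed = json.loads(body)
    return len(parsed["records"]), parsed["totalRecords"], parsed["totalTime"]


request = build_search_request("index_name", "status_code:200", "20h", limit=1)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).

sample_body = (
    '{"records": [{"status_code": 200.0, "level": "warn"}],'
    ' "totalRecords": 285170, "totalTime": "17.30946ms"}'
)
print(summarize_search_response(sample_body))  # (1, 285170, '17.30946ms')
```

Comparing the number of returned records with `totalRecords` shows how many matches were left out by `limit`.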
Delete Index
DELETE /api/search/v1/index/{index}
API to delete the given index. This stops any indexing jobs that may be running for the index and removes it from the in-memory map and from the disk/object store.
Response
Index {index} deleted
List Indexes for a Given Dataset
GET /api/search/v1/index/list/{dataset}
API to retrieve the indexes for a given dataset
Response
[
{
"indexStatus": {
"taskCompletionTime": null,
"taskStartTime": "2025-03-22T04:55:14.565141577Z",
"taskState": "Running"
},
"indexMetadata": {
"indexName": "ts_rolling",
"streamName": "teststream",
"startTime": "2d",
"endTime": "now",
"retention": 2,
"indexedColumns": null
}
},
{
"indexStatus": {
"taskCompletionTime": "2025-03-22T04:59:08.880628628Z",
"taskStartTime": "2025-03-22T04:55:52.132062456Z",
"taskState": "Completed"
},
"indexMetadata": {
"indexName": "static2",
"streamName": "teststream",
"startTime": "2025-03-08T02:00:00.000Z",
"endTime": "2025-03-08T04:00:00.000Z",
"retention": 20,
"indexedColumns": null
}
}
]
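A client might use the list response to check which indexing jobs are still in progress. The sketch below parses a trimmed copy of the response above. The helper names are hypothetical, and treating `endTime == "now"` as the marker of a rolling window index is an assumption based on the Create Index payloads.

```python
import json


def running_indexes(body):
    """Return the names of indexes whose indexing task is still running."""
    entries = json.loads(body)
    return [
        entry["indexMetadata"]["indexName"]
        for entry in entries
        if entry["indexStatus"]["taskState"] == "Running"
    ]


def is_rolling(entry):
    """Assumption: a rolling window index is the one whose endTime is 'now'."""
    return entry["indexMetadata"]["endTime"] == "now"


sample_body = """
[
  {"indexStatus": {"taskCompletionTime": null, "taskState": "Running"},
   "indexMetadata": {"indexName": "ts_rolling", "startTime": "2d", "endTime": "now"}},
  {"indexStatus": {"taskCompletionTime": "2025-03-22T04:59:08.880628628Z", "taskState": "Completed"},
   "indexMetadata": {"indexName": "static2", "startTime": "2025-03-08T02:00:00.000Z",
                     "endTime": "2025-03-08T04:00:00.000Z"}}
]
"""
print(running_indexes(sample_body))                        # ['ts_rolling']
print([is_rolling(e) for e in json.loads(sample_body)])    # [True, False]
```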