Storage

This document covers the available storage options and their respective configurations.

Overview

Parseable is a disk-less system by design. Parseable uses object storage (AWS S3, GCS, Azure, or others) as the primary storage system, allowing clean separation of compute and storage.

Decoupling storage and compute allows independent scaling of compute (ingest & query) and storage. This is even more important in the context of log and observability data, given the sheer volume of this data.

Independent scaling of storage means you don’t have to run (and pay for) a fixed number of compute nodes just to store high volumes of data.

Supported storage systems

Parseable supports all the major object storage systems out of the box.

For local deployments, Parseable also has a local-store mode that can store data on locally mounted volumes.

Please note that the local-store mode is only supported when Parseable is set up in a single node configuration.

For production deployments we recommend using an object storage system.
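For example, an S3-backed deployment is typically configured through environment variables before starting the server. The variable names below follow Parseable's S3 configuration convention, but treat this as an illustration and verify the exact names against the documentation for your release.

```shell
# Assumed S3 configuration via environment variables
# (verify names against the Parseable docs for your version).
export P_S3_URL="https://s3.us-east-1.amazonaws.com"
export P_S3_ACCESS_KEY="<access-key>"
export P_S3_SECRET_KEY="<secret-key>"
export P_S3_REGION="us-east-1"
export P_S3_BUCKET="parseable-data"

# Start the server in S3 mode.
parseable s3-store
```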

Storage structure

Parseable stores data on object storage in a set of prefixes. This section explains the prefix layout under different settings.

Default format

By default, data in a Parseable log-stream is structured as explained below.

  • Within the Parseable bucket, there are prefixes for every log stream on this cluster.

  • Within the log stream prefix, there are prefixes for the date, hour and minute of ingestion. These timestamps are in UTC, and by default they reflect the time of ingestion.

  • Finally, inside the minute prefix, you’ll find Parquet files with metadata embedded in the file name to identify the ingestor that wrote the data.

For example, if the ingestion time was 24th January 2024, 13:05 UTC, the prefix structure will look like this.

{bucket-name}
| - {stream-name}/
| -- date=2024-01-24/
| --- hour=13/
| ---- minute=05/
| ----- {ingestor1-hostname-hash}.data.{hash}.parquet
| ----- {ingestor2-hostname-hash}.data.{hash}.parquet
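The default layout above can be sketched as a small helper that maps an ingestion timestamp to its prefix. This is an illustration of the naming scheme, not Parseable's actual implementation; the stream name is made up.

```python
from datetime import datetime, timezone

def ingestion_prefix(stream: str, ts: datetime) -> str:
    """Map a UTC ingestion timestamp to its object-store prefix."""
    ts = ts.astimezone(timezone.utc)
    return f"{stream}/date={ts:%Y-%m-%d}/hour={ts:%H}/minute={ts:%M}/"

# Ingestion at 24th January 2024, 13:05 UTC:
prefix = ingestion_prefix("app-logs", datetime(2024, 1, 24, 13, 5, tzinfo=timezone.utc))
print(prefix)  # app-logs/date=2024-01-24/hour=13/minute=05/
```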

Time partition enabled

In case you’ve enabled time partition for a stream, the format changes slightly. The date, hour and minute prefixes are now based on the value of the selected time column (instead of ingestion time).

Refer to the time partition section to learn more about it and how to set it up.

Using the example from the previous section, the ingestion time was 24th January 2024, 13:05 UTC. Suppose the event has a column called timestamp (the selected time partition column) whose value is 20th January 2024, 14:20 UTC. The prefixes will then look like this.

{bucket-name}
| - {stream-name}/
| -- date=2024-01-20/
| --- hour=14/
| ---- minute=20/
| ----- {ingestor1-hostname-hash}.data.{hash}.parquet
| ----- {ingestor2-hostname-hash}.data.{hash}.parquet
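With a time partition enabled, the same prefix mapping is applied to the event's time column value rather than the ingestion clock. A hypothetical sketch (stream and column values are made up):

```python
from datetime import datetime, timezone

def partition_prefix(stream: str, event_time: datetime) -> str:
    """Prefix derived from the event's time-partition column value (UTC)."""
    event_time = event_time.astimezone(timezone.utc)
    return f"{stream}/date={event_time:%Y-%m-%d}/hour={event_time:%H}/minute={event_time:%M}/"

# Event ingested at 13:05 UTC on 24 Jan 2024, but its `timestamp`
# column holds 20th January 2024, 14:20 UTC:
print(partition_prefix("app-logs", datetime(2024, 1, 20, 14, 20, tzinfo=timezone.utc)))
# app-logs/date=2024-01-20/hour=14/minute=20/
```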

Custom partition enabled

In case you’ve enabled custom partitions for a stream, additional prefixes are added for the columns selected for partitioning.

Refer to the custom partition section to learn more about it and how to set it up.

Using the same example as earlier, the path remains the same up to the minute prefix. After that, an additional key=value prefix is added for each partition column.

{bucket-name}
| - {stream-name}/
| -- date=2024-01-24/
| --- hour=13/
| ---- minute=05/
| ----- status=200/
| ------ {ingestor1-hostname-hash}.data.{hash}.parquet
| ------ {ingestor2-hostname-hash}.data.{hash}.parquet
| ----- status=404/
| ------ {ingestor1-hostname-hash}.data.{hash}.parquet
| ------ {ingestor2-hostname-hash}.data.{hash}.parquet
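The extra key=value levels can be sketched as an extension of the time-based prefix. This is an illustration only; the stream name, column name, and values are made up.

```python
from datetime import datetime, timezone

def custom_partition_prefix(stream: str, ts: datetime, partitions: dict) -> str:
    """Time-based prefix followed by one key=value level per partition column."""
    ts = ts.astimezone(timezone.utc)
    base = f"{stream}/date={ts:%Y-%m-%d}/hour={ts:%H}/minute={ts:%M}/"
    return base + "".join(f"{key}={value}/" for key, value in partitions.items())

ts = datetime(2024, 1, 24, 13, 5, tzinfo=timezone.utc)
print(custom_partition_prefix("app-logs", ts, {"status": 200}))
# app-logs/date=2024-01-24/hour=13/minute=05/status=200/
```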