Optimize Data Transfer from Parseable with Apache Arrow Flight

Nikhil Sinha
Head of Engineering @ Parseable

Written in Rust, Parseable uses Apache Arrow and Parquet as its underlying data formats, offering high throughput and low latency without the overhead of traditional indexing methods. This makes it a good fit for environments that require efficient log management, whether deployed on public or private clouds, containers, VMs, or bare metal. This guide covers the integration of Arrow Flight with Parseable and walks through setting up a client.

Understanding Apache Arrow Flight for data transfer

What is Apache Arrow Flight?

Apache Arrow Flight is a general-purpose client-server framework that simplifies the high-performance transport of Apache Arrow data (as Arrow record batches) over gRPC.

One of Apache Arrow Flight's most important features is the streaming response, which allows data to be streamed from multiple servers in a cluster in parallel. This enables better query performance than other transport options such as plain HTTP/S.

Key Features of Apache Arrow Flight

  • High-Performance Data Transfer: Apache Arrow Flight uses the Apache Arrow format for rapid data serialization and deserialization, minimizing overhead and enhancing throughput.
  • Built-in Parallelism: The framework supports streaming responses from multiple servers in parallel, optimizing query performance.
  • Standardized Data Formats: Apache Arrow Flight ensures compatibility and ease of integration with various data processing tools by standardizing the format.

Use Cases of Apache Arrow Flight

Apache Arrow Flight is ideal for a range of applications including:

  • Big Data Analytics: Fast transfer of large datasets for real-time analytics.
  • Machine Learning: Efficient movement of training data between storage and processing units.
  • Financial Services: Rapid data exchanges for trading systems and market analysis.

Why Use Apache Arrow Flight with Parseable

Integration Benefits

The Parseable server supports the DoGet method, sending a data stream as record batches to a client. Integrating Arrow Flight with Parseable offers several advantages:

  • Enhanced Performance: Parseable's log observability capabilities and Apache Arrow Flight's data transport efficiency significantly improve streaming performance.
  • Seamless Integration: Both technologies utilize Apache Arrow, ensuring smooth interoperability and reducing the complexity of data transformations.
  • Scalability: Apache Arrow Flight's support for parallelism aligns well with Parseable's high-throughput design, making it easier to scale operations.
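To make the DoGet streaming model a little more concrete, here is a minimal sketch of a client consuming a stream one record batch at a time instead of collecting everything first. The helper names `iter_batches` and `count_rows` are hypothetical, and the stream is simulated with stand-in objects so the sketch runs without a server. With pyarrow, the reader returned by `FlightClient.do_get` can be iterated in the same way, and each chunk exposes its record batch as `.data`.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_batches(chunks: Iterable) -> Iterator:
    """Yield the record batch carried by each streamed chunk (hypothetical helper)."""
    for chunk in chunks:
        # A chunk may carry only metadata, in which case there is no batch to yield.
        if chunk.data is not None:
            yield chunk.data


def count_rows(chunks: Iterable) -> int:
    """Sum the rows across all batches without holding the whole result in memory."""
    total = 0
    for batch in iter_batches(chunks):
        total += batch.num_rows
    return total


# Stand-in chunks that mimic the shape of streamed Flight chunks (an assumption
# made so the example is runnable offline): each has .data with .num_rows.
fake_stream = [SimpleNamespace(data=SimpleNamespace(num_rows=n)) for n in (3, 5, 2)]
print(count_rows(fake_stream))  # prints 10
```

Processing batches as they arrive keeps client memory bounded by the batch size, which matters for large query results.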

Performance Improvements

A comparative analysis reveals that integrating Apache Arrow Flight with Parseable leads to notable performance gains. Data transfer times are reduced, and optimized system resource utilization allows faster query responses and more efficient data processing.

Cost Efficiency

By minimizing data transfer overhead and maximizing throughput, Apache Arrow Flight helps reduce the costs associated with data transfer. This includes lower network bandwidth usage and decreased computational expenses, making it a cost-effective solution for high-volume data operations.

Prerequisites for Setting Up Apache Arrow Flight Client

Technical Requirements

Before setting up your Apache Arrow Flight client, ensure you meet the following technical requirements:

  • Operating System: Compatible with major operating systems (Linux, macOS, Windows).
  • Python: Python 3.12 or earlier installed.
  • Hardware: Sufficient memory and CPU resources to handle data processing tasks.

Required Libraries and Tools

Install the following libraries and tools to set up your Apache Arrow Flight client:

  • Apache Arrow: For handling Arrow formatted data.
  • PyArrow: Python bindings for Apache Arrow.
  • Pandas: For data manipulation and analysis.

Step-by-Step Guide to Setting Up an Apache Arrow Flight Client

Server Side Configuration

Ensure you're running a recent Parseable server (v1.3.0 or later). To change the Flight port, set the environment variable P_FLIGHT_PORT to the port you want (the default is 8002). Also make sure you have event logs ingested into Parseable so there is data to query. Refer to the Parseable documentation for more details.
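If the client is configured from the same environment as the server, it can resolve the port in the same way. The sketch below is only an illustration: P_FLIGHT_PORT is a server-side setting, so reading it on the client is an assumption of this example, and `resolve_flight_port` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional

DEFAULT_FLIGHT_PORT = 8002  # default Flight port mentioned above


def resolve_flight_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the port from P_FLIGHT_PORT, or the default when it is unset."""
    source = os.environ if env is None else env
    raw = source.get("P_FLIGHT_PORT", "").strip()
    if not raw:
        return DEFAULT_FLIGHT_PORT
    port = int(raw)  # raises ValueError if the value is not a number
    if not 1 <= port <= 65535:
        raise ValueError(f"P_FLIGHT_PORT out of range: {port}")
    return port


print(resolve_flight_port({"P_FLIGHT_PORT": "9000"}))  # prints 9000
```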

Installation of Required Libraries

First, set up a virtual environment and install the necessary libraries:

python3 -m venv venv
source venv/bin/activate
pip install pyarrow pandas

Writing the Client Code

Create a file named main.py with the following content:

import pyarrow.flight as flight

if __name__ == "__main__":
    # The ticket is a JSON document holding the SQL query and the time range to search.
    ticket_data = b'{"query": "select count(*) from backend limit 10", "startTime": "2024-07-15T00:00:00.000Z", "endTime": "2024-07-15T05:00:00.000Z"}'

    # To use gRPC with TLS, set the location as follows:
    location = flight.Location.for_grpc_tls("demo.parseable.com", 8002)

    # To use gRPC without TLS, set the location as follows instead:
    # location = flight.Location.for_grpc_tcp("demo.parseable.com", 8002)

    # disable_server_verification=True skips TLS certificate checks; avoid it in production.
    client = flight.FlightClient(location, disable_server_verification=True)

    # Send HTTP Basic credentials with the call.
    call_options = flight.FlightCallOptions(
        headers=[(b"authorization", b"Basic YWRtaW46YWRtaW4=")],
    )

    # DoGet returns a stream of record batches; read_all() collects them into one Arrow table.
    reader = client.do_get(flight.Ticket(ticket_data), options=call_options)
    data = reader.read_all()
    df = data.to_pandas()
    json_data = df.to_json(orient="records")
    print(json_data)
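The ticket and the authorization header in main.py are hard-coded byte strings. As a rough sketch, they can also be built from plain values, which avoids hand-editing JSON and Base64. The helper names below (`to_rfc3339_millis`, `build_ticket`, `basic_auth_header`) are hypothetical; the ticket fields and timestamp format simply mirror the example above. The header value used in main.py is the Base64 encoding of admin:admin.

```python
import base64
import json
from datetime import datetime, timezone


def to_rfc3339_millis(moment: datetime) -> str:
    """Format an aware datetime like 2024-07-15T00:00:00.000Z (UTC, milliseconds)."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{utc.microsecond // 1000:03d}Z"


def build_ticket(query: str, start: datetime, end: datetime) -> bytes:
    """Serialize the query and time range into the JSON ticket body used in main.py."""
    if start >= end:
        raise ValueError("startTime must be earlier than endTime")
    payload = {
        "query": query,
        "startTime": to_rfc3339_millis(start),
        "endTime": to_rfc3339_millis(end),
    }
    return json.dumps(payload).encode("utf-8")


def basic_auth_header(username: str, password: str) -> tuple[bytes, bytes]:
    """Return a (name, value) pair suitable for FlightCallOptions(headers=[...])."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return (b"authorization", b"Basic " + token)


ticket = build_ticket(
    "select count(*) from backend limit 10",
    datetime(2024, 7, 15, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 7, 15, 5, 0, tzinfo=timezone.utc),
)
print(ticket.decode("utf-8"))
print(basic_auth_header("admin", "admin"))
```

The resulting bytes can be passed to `flight.Ticket(...)`, and the header tuple can be placed in the `headers` list of `flight.FlightCallOptions`.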

Description of the Client Code

The client code demonstrates how to send a query to the Parseable server and receive data as Arrow record batches, which are then converted to a pandas DataFrame for further processing. If the Parseable server has TLS enabled, set the location as follows:

# gRPC with TLS
location = flight.Location.for_grpc_tls("localhost", 8002)

If TLS is not enabled, set the location as follows instead:

# gRPC without TLS
location = flight.Location.for_grpc_tcp("localhost", 8002)
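Rather than commenting lines in and out, the choice between the two can be driven by a flag. The sketch below builds the equivalent URI string; `flight_uri` and `connect` are hypothetical helper names. It relies on `FlightClient` also accepting a URI string as its location, with the `grpc+tls` and `grpc+tcp` schemes corresponding to `for_grpc_tls` and `for_grpc_tcp`. pyarrow is imported lazily so the URI helper can be used on its own.

```python
def flight_uri(host: str, port: int = 8002, tls: bool = False) -> str:
    """Build a Flight location URI for a TLS or plain-TCP gRPC endpoint."""
    scheme = "grpc+tls" if tls else "grpc+tcp"
    return f"{scheme}://{host}:{port}"


def connect(host: str, port: int = 8002, tls: bool = False):
    """Create a FlightClient for the given endpoint (requires pyarrow)."""
    import pyarrow.flight as flight  # lazy import: only needed when connecting

    return flight.FlightClient(flight_uri(host, port, tls))


print(flight_uri("localhost", 8002, tls=True))  # prints grpc+tls://localhost:8002
print(flight_uri("localhost"))                  # prints grpc+tcp://localhost:8002
```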

Testing Data Query

Run the client script and verify the output:

python3 main.py

Common Issues and Solutions

  • Connection Issues: Ensure the server address and port are correct.
  • Authorization Errors: Verify the authorization headers and credentials.
  • Data Format Errors: Check that the query and data formats are correct.
  • Log Detailed Errors: Use logging to capture detailed error messages.
  • Validate Environment Configuration: Double-check all environment settings and variables.
  • Test with Sample Data: Start with simple queries to ensure basic functionality.
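For connection issues in particular, it can help to confirm that the Flight port is reachable at all before involving gRPC or credentials. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and the host and port in the usage line are placeholders for your own server.

```python
import logging
import socket

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("flight_client")


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        logger.warning("cannot reach %s:%d: %s", host, port, exc)
        return False


if __name__ == "__main__":
    # Placeholder endpoint: replace with your Parseable host and Flight port.
    print(port_is_open("localhost", 8002))
```

A successful TCP connection only shows that something is listening on the port; authorization and query errors still need to be checked through the Flight call itself.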

Conclusion

We hope this guide will help you delve deeper into Apache Arrow Flight and its applications. The combination of Apache Arrow Flight and Parseable opens up new opportunities for efficient data transfer and processing in your projects.
