Hardwood, an open-source library for the Java Virtual Machine (JVM), has reached version 1.0. It offers a high-performance, near zero-dependency alternative for reading Apache Parquet files. Initiated by Gunnar Morling, the project uses multi-threaded page decoding to keep all CPU cores busy, with a reported throughput of 16.5 million rows per second on 8 vCPUs.

How Hardwood Improves Parquet Performance

Traditional Apache Parquet implementations for Java often rely on single-threaded core readers and carry significant dependency overhead. According to project documentation, Hardwood bypasses these limitations by employing a multi-threaded approach that distributes page decoding across all available CPU cores. This architecture reduces the latency inherent in sequential processing, allowing the library to better saturate system I/O and CPU bandwidth.
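
To make the idea of distributing page decoding across cores concrete, here is a minimal sketch in plain Java. It is not Hardwood's code or API: the class name, the `decode` step (a simple delta decoding over `int[]` pages), and the thread pool setup are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPageDecodeSketch {

    // Hypothetical decode step: turns delta-encoded ints into absolute values.
    // A real Parquet page decoder is far more involved.
    static long[] decode(int[] encodedPage) {
        long[] out = new long[encodedPage.length];
        long running = 0;
        for (int i = 0; i < out.length; i++) {
            running += encodedPage[i];
            out[i] = running;
        }
        return out;
    }

    // Decodes every page on a pool sized to the available cores.
    static List<long[]> decodeAll(List<int[]> pages) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Pages are independent, so each one becomes its own task.
            List<Future<long[]>> futures = new ArrayList<>();
            for (int[] page : pages) {
                futures.add(pool.submit(() -> decode(page)));
            }
            // Collecting in submission order keeps the pages in file order.
            List<long[]> decoded = new ArrayList<>();
            for (Future<long[]> future : futures) {
                decoded.add(future.get());
            }
            return decoded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<int[]> pages = List.of(new int[] {1, 1, 1}, new int[] {10, -2, 5});
        List<long[]> result = decodeAll(pages);
        System.out.println(Arrays.toString(result.get(1))); // prints [10, 8, 13]
    }
}
```

The sketch only shows the fan-out and ordered collection pattern; how Hardwood actually schedules pages, overlaps I/O with decoding, or sizes its pools is not described in the source.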

The library provides two APIs: a structured row reader API for general record access and a batch-oriented column reader API for high-throughput analytical tasks. During filtered scans, it evaluates predicates branchlessly, one batch at a time, which minimizes CPU branch mispredictions, a common performance bottleneck in analytical data processing.
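
As a rough illustration of what branchless, batch-at-a-time predicate evaluation can look like, the sketch below filters a batch of `int` values against an inclusive range without an `if` in the loop body. The class and method names are hypothetical, and this is one common formulation of the technique, not Hardwood's implementation.

```java
import java.util.Arrays;

public class BranchlessFilterSketch {

    // Writes the indices of values within [lo, hi] into `selected` and returns
    // how many matched. `selected` must be at least as long as `values`,
    // and lo must not exceed hi.
    static int selectInRange(int[] values, int lo, int hi, int[] selected) {
        int count = 0;
        for (int i = 0; i < values.length; i++) {
            long v = values[i];
            // Sign bit of (v - lo) is 1 exactly when v < lo; widening to long avoids overflow.
            int below = (int) ((v - lo) >>> 63);
            // Sign bit of (hi - v) is 1 exactly when v > hi.
            int above = (int) ((hi - v) >>> 63);
            int match = (below | above) ^ 1; // 1 if lo <= v <= hi, else 0
            // Always write the index, but only advance the cursor on a match,
            // so a non-matching slot is overwritten by the next candidate.
            selected[count] = i;
            count += match;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] batch = {5, 12, 7, 30, 10};
        int[] selected = new int[batch.length];
        int n = selectInRange(batch, 7, 12, selected);
        System.out.println(Arrays.toString(Arrays.copyOf(selected, n))); // prints [1, 2, 4]
    }
}
```

The JIT compiler may already turn simple comparisons into conditional moves, so the benefit of writing the arithmetic by hand depends on the workload and should be measured rather than assumed.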

Pro Tip: For high-throughput analytical workloads, use the batch-oriented column reader API to minimize per-row overhead and make the most of your CPU resources.

Why Zero-Dependency Design Matters

Hardwood has no mandatory dependencies, which mitigates the risks of supply chain attacks and classpath conflicts. For logging, it relies on the minimal logging abstraction built into Java since version 9 (System.Logger), so it avoids external logging dependencies entirely. Developers can opt into additional functionality, such as LZ4 or GZip compression and S3 object storage support, by pulling in specific optional dependencies only when necessary.
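
For readers unfamiliar with the platform logging facade, the following sketch shows how a library can log through `System.Logger` without any third-party logging dependency. The class name, logger name, and message are hypothetical and are not taken from Hardwood.

```java
import java.lang.System.Logger;
import java.lang.System.Logger.Level;

public class PlatformLoggingSketch {

    // Hypothetical logger name, chosen for illustration only.
    private static final Logger LOG = System.getLogger("example.parquet.reader");

    static Logger logger() {
        return LOG;
    }

    // Logs through the JDK's built-in facade; {0} and {1} use MessageFormat syntax.
    static void reportRead(String file, long rows) {
        LOG.log(Level.INFO, "Read {0} rows from {1}", rows, file);
    }

    public static void main(String[] args) {
        reportRead("example.parquet", 1000L);
    }
}
```

With no other backend configured, the JDK routes these messages to java.util.logging when that module is present. An application that prefers another logging framework can redirect them by supplying a `System.LoggerFinder` implementation, while the library itself stays dependency-free.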

This modular approach contrasts with older, monolithic Java data libraries that often force developers to include large, unnecessary dependency trees. The inclusion of a command-line interface (CLI) with a text-based user interface (TUI) further reduces the need for heavy frameworks, allowing engineers to inspect file schemas and verify data integrity directly from the terminal.

What Lies Ahead for the Project

Since its inception in early 2026, Hardwood has grown to include 20 contributors, including Andres Almiray and Bruno Borges. While version 1.0 currently focuses on read capabilities, the project roadmap explicitly lists write support as a future priority. This addition is highly anticipated by the community, as indicated by feedback from early users.

Did you know? Hardwood was developed with the help of AI-assisted coding, though the critical design and code review processes remained under human ownership.

Frequently Asked Questions

What is the primary benefit of using Hardwood over standard Parquet implementations?
Hardwood provides significantly higher throughput by utilizing multi-threaded page decoding and eliminates heavy dependency overhead, reducing both runtime latency and the risk of classpath conflicts.

Does Hardwood support writing Parquet files?
Not yet. Version 1.0 is limited to read capabilities, but the project roadmap confirms that write support is planned for future releases.

Can I use Hardwood with AWS S3?
Yes, Hardwood supports object storage services like S3 through optional dependencies that can be added to your project as needed.

Is the Hardwood CLI suitable for production environments?
The CLI is primarily designed as a diagnostic tool for developers and data engineers to verify file structure and inspect metadata without the need for heavy processing frameworks.