Hoodie manages storage of large analytical datasets on HDFS and serves them out via two types of tables:
- Read Optimized Table - Provides excellent query performance via purely columnar storage (e.g. Parquet)
- Near-Real-Time Table - Provides queries on real-time data, using a combination of columnar & row-based storage (e.g. Parquet + Avro)
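
The difference between the two tables can be pictured with a toy model. This is only a conceptual sketch, not Hoodie's API: `Record`, `read_optimized_view`, and `near_real_time_view` are hypothetical names, and it assumes the row-based part holds recent writes that have not yet been folded into the columnar files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: int


def read_optimized_view(base):
    """Serve only the columnar base data (Parquet in the text above)."""
    return {r.key: r for r in base}


def near_real_time_view(base, log):
    """Overlay row-based log records (Avro in the text above) on the base.

    Assumption: log entries are ordered oldest to newest, so later entries win.
    """
    merged = {r.key: r for r in base}
    for r in log:
        merged[r.key] = r
    return merged


# Hypothetical data: "b" was updated and "c" was added after the base was written.
base = [Record("a", 1), Record("b", 2)]
log = [Record("b", 20), Record("c", 3)]

ro = read_optimized_view(base)
rt = near_real_time_view(base, log)
```

In this model the read-optimized view is cheaper to scan but lags behind, while the near-real-time view pays a merge cost to reflect the latest writes.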
By carefully managing how data is laid out in storage & how it’s exposed to queries, Hoodie is able to power a rich data ecosystem where external sources can be ingested into Hadoop in near real-time. The ingested data is then available to interactive SQL engines like Presto & Spark, and can at the same time be consumed incrementally by processing/ETL frameworks like Hive & Spark to build derived (Hoodie) datasets.
Hoodie broadly consists of a self-contained Spark library for building datasets, and integrations with existing query engines for data access.