Commit 306c533

feat: replace SSTables with Parquet, add predicate partitioning and tiered cache (Phase 2)
Replace custom SSTable binary format with Apache Parquet columnar storage, introduce vertical partitioning by predicate, and add a three-tier cache (Caffeine heap -> local disk LRU -> S3).

Storage redesign:
- Parquet files on S3 with ZSTD compression and dictionary encoding
- Predicate-based partitioning (data/predicates/{id}/) eliminates the predicate column from files, tightening column statistics
- Three sort orders per partition (SOC, OSC, CSO) for optimal query performance regardless of access pattern
- Single MemTable in SPOC order, partitioned on flush
- JSON catalog with per-file column statistics for catalog-level pruning

Cache system:
- L1: Caffeine heap cache (configurable, default 256 MB)
- L2: Local disk LRU cache (configurable, default 10 GB)
- L3: S3 source of truth
- Write-through on flush avoids cold reads

Compaction:
- L0->L1 merge when epoch count >= 8 per predicate
- L1->L2 merge when epoch count >= 4 per predicate
- Tombstone suppression at the highest level

Hadoop dependency elimination:
- Zero Hadoop JARs in the dependency tree
- PlainParquetConfiguration + custom SimpleCodecFactory bypass all Hadoop runtime paths
- 14 minimal stub classes in org.apache.hadoop.* satisfy parquet-hadoop JVM class-loading requirements

Deleted: SSTable, SSTableWriter, Manifest (replaced by Parquet + Catalog)

All 529 tests pass.
1 parent e5379d5 commit 306c533
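The three-tier read path described in the commit message can be sketched as follows. This is an illustrative model only: the real implementation uses a Caffeine heap cache for L1 and an on-disk LRU for L2, while here both tiers are modeled as size-bounded LRU maps, and all names (`TieredCache`, `putOnFlush`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of the L1 -> L2 -> L3 read path.
 * Real code: L1 = Caffeine heap cache (default 256 MB),
 * L2 = local disk LRU (default 10 GB), L3 = S3 source of truth.
 */
class TieredCache {
	private final Map<String, byte[]> l1;          // stands in for the Caffeine heap cache
	private final Map<String, byte[]> l2;          // stands in for the local disk LRU cache
	private final Function<String, byte[]> s3;     // stands in for the S3 fetch

	TieredCache(int l1Entries, int l2Entries, Function<String, byte[]> s3) {
		this.l1 = lru(l1Entries);
		this.l2 = lru(l2Entries);
		this.s3 = s3;
	}

	private static Map<String, byte[]> lru(int max) {
		// Access-ordered LinkedHashMap evicting the least recently used entry.
		return new LinkedHashMap<>(16, 0.75f, true) {
			@Override
			protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
				return size() > max;
			}
		};
	}

	/** Read-through: check L1, fall back to L2, finally fetch from S3, filling tiers on the way. */
	byte[] get(String key) {
		byte[] v = l1.get(key);
		if (v != null) {
			return v;
		}
		v = l2.get(key);
		if (v == null) {
			v = s3.apply(key);   // cold read from the source of truth
			l2.put(key, v);
		}
		l1.put(key, v);          // promote into the heap tier
		return v;
	}

	/** Write-through on flush: a newly written file lands in both cache tiers, so its first read is warm. */
	void putOnFlush(String key, byte[] value) {
		l1.put(key, value);
		l2.put(key, value);
	}
}
```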

39 files changed

Lines changed: 3036 additions & 1178 deletions
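The compaction rules in the commit message (L0->L1 at 8 epochs, L1->L2 at 4, tombstones dropped only at the highest level) can be condensed into a small policy sketch. The class and method names are hypothetical, not the actual code in this commit.

```java
/**
 * Hypothetical sketch of the per-predicate compaction trigger:
 * L0 files merge into L1 once 8 epochs accumulate, L1 into L2 once 4 do.
 */
class CompactionPolicy {
	static final int L0_TO_L1_THRESHOLD = 8;
	static final int L1_TO_L2_THRESHOLD = 4;
	static final int HIGHEST_LEVEL = 2;

	/** Decide the next merge for one predicate partition, given its epoch counts per level. */
	static String nextAction(int l0Epochs, int l1Epochs) {
		if (l0Epochs >= L0_TO_L1_THRESHOLD) {
			return "MERGE_L0_TO_L1";
		}
		if (l1Epochs >= L1_TO_L2_THRESHOLD) {
			return "MERGE_L1_TO_L2";
		}
		return "NONE";
	}

	/** Tombstones can only be suppressed when merging into the highest level. */
	static boolean dropTombstones(int targetLevel) {
		return targetLevel == HIGHEST_LEVEL;
	}
}
```

Suppressing tombstones only at the highest level is the standard LSM safety rule: a deletion marker must outlive every older version of the triple it shadows, and only the final level is guaranteed to hold none.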

core/sail/s3/pom.xml

Lines changed: 20 additions & 0 deletions
@@ -52,6 +52,26 @@
 		<groupId>com.fasterxml.jackson.core</groupId>
 		<artifactId>jackson-databind</artifactId>
 	</dependency>
+	<dependency>
+		<groupId>org.apache.parquet</groupId>
+		<artifactId>parquet-hadoop</artifactId>
+		<version>1.15.2</version>
+		<exclusions>
+			<exclusion>
+				<groupId>javax.annotation</groupId>
+				<artifactId>javax.annotation-api</artifactId>
+			</exclusion>
+		</exclusions>
+	</dependency>
+	<!-- No Hadoop jars needed. Parquet-hadoop references a few Hadoop classes
+	     in method signatures and superclass declarations, but we bypass all
+	     Hadoop runtime paths via PlainParquetConfiguration + SimpleCodecFactory.
+	     Minimal stub classes in org.apache.hadoop.* satisfy JVM class loading. -->
+	<dependency>
+		<groupId>com.github.ben-manes.caffeine</groupId>
+		<artifactId>caffeine</artifactId>
+		<version>3.1.8</version>
+	</dependency>
 	<dependency>
 		<groupId>${project.groupId}</groupId>
 		<artifactId>rdf4j-sail-testsuite</artifactId>
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+/*
+ * Minimal stub for org.apache.hadoop.conf.Configuration.
+ *
+ * Parquet-hadoop references this class in abstract method signatures
+ * (WriteSupport.init, ParquetWriter.Builder.getWriteSupport). Our code
+ * overrides the ParquetConfiguration variants instead, so this class is
+ * never instantiated or used at runtime. It exists only to satisfy the
+ * JVM class loader.
+ */
+package org.apache.hadoop.conf;
+
+public class Configuration {
+	public Configuration() {
+	}
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.fs;
+
+public class FileStatus {
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.fs;
+
+public abstract class FileSystem {
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.fs;
+
+public class Path {
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.fs;
+
+public interface PathFilter {
+}
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.mapred;
+
+import org.apache.hadoop.conf.Configuration;
+
+public class JobConf extends Configuration {
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.mapreduce;
+
+public abstract class InputFormat<K, V> {
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.mapreduce;
+
+public abstract class InputSplit {
+}
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+/*
+ * Minimal stub — satisfies JVM class loading for parquet-hadoop.
+ * Never instantiated at runtime.
+ */
+package org.apache.hadoop.mapreduce;
+
+public class Job {
+}
