Use Iceberg with UltiHash to power large-scale analytics

See how to combine UltiHash with Iceberg's structured metadata, boosting data query speed and management at scale.

Working with large-scale datasets is about efficient access, updates, and governance. And yet, this is where traditional table formats (e.g. CSV and Hive tables) struggle:

  • Querying isn’t efficient: Pulling specific data (like “all events from the last 24 hours”) often means scanning an entire dataset, slowing everything down.
  • Schema evolution is rigid: Adding a new column or changing data types usually requires costly migrations or workarounds.
  • Version control & rollback are complex: If something goes wrong, reverting to a previous state is complicated and not always reliable.
  • Partitioning overhead: Traditional formats rely on manual partitioning, which can lead to inefficient query planning or excessive small files.

Open Table Formats (like Iceberg) are the solution

Large datasets are rarely static: they change over time as new records are added, existing ones are updated, and outdated data gets deleted.

Open Table Formats, such as Iceberg, solve this by providing a structured table layer that keeps large datasets efficient and manageable. They allow data to be stored in a way that supports ACID (atomicity, consistency, isolation, durability) transactions (ensuring safe modifications without corruption) and time travel (the ability to query previous versions of the dataset). This makes handling real-world data changes far more reliable.

But Iceberg (and other Open Table Formats) isn’t a database or a storage system: it’s a framework that organizes and tracks changes in datasets. The actual data files still need to be stored somewhere, typically in object storage, which is designed for durability, scalability, and cost-efficiency, making it the most practical option for managing large volumes of data files across distributed environments. Unlike traditional file systems, object storage can handle petabytes of data without running into performance bottlenecks.

This separation makes Iceberg powerful because:

  • ACID transactions ensure that updates, inserts, and deletes happen correctly, even when multiple processes modify the same dataset.
  • Time travel lets you access historical versions of your data, which is useful for audits, debugging, or comparing past and present records (see the short sketch after this list).
  • Flexible storage options mean that data is stored externally (usually in object storage), keeping costs down while maintaining scalability.
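
To make time travel concrete, here is a minimal sketch of what it looks like in Spark SQL, using the demo table we create later in this post. The timestamp literal is just a placeholder, and the syntax assumes a recent Spark/Iceberg combination like the one used below:

# Query the table as it existed at a given point in time (placeholder timestamp)
spark.sql("""
    SELECT * FROM iceberg.ulti.test_iceberg_table TIMESTAMP AS OF '2024-01-02 00:00:00'
""").show()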

Think of Iceberg as a smart metadata layer: it doesn’t store data itself but keeps track of where files are, how they’re structured, and how changes should be handled. Without a scalable storage backend, Iceberg wouldn’t have anywhere to put the Parquet, ORC, or Avro files it organizes. That’s why object storage is essential to making Iceberg work at scale.

Note: Open Table Formats do not replace file formats like CSV, JSON, or Parquet. They work on top of them: when data is written into an Iceberg table, for example, it’s stored in object storage as Parquet (or ORC/Avro), with Iceberg managing the metadata, schema, and versioning. In other words, users can still interact with data in the formats they’re used to, while benefiting from Iceberg’s structured querying, schema evolution, and time travel capabilities.

Iceberg + UltiHash: Handling Large-Scale Data Efficiently

Iceberg relies on object storage to hold the actual data files. UltiHash provides that storage layer, offering an S3-compatible backend that’s optimized for large-scale AI and analytics workloads.

With UltiHash as the object store for Iceberg, you get:

  • Faster Queries: Query performance in Iceberg depends on how quickly it can read Parquet files from storage. UltiHash is designed for high-throughput access, making analytics and ML pipelines more efficient.
  • Storage Efficiency: Iceberg’s versioning and time-travel capabilities mean it can keep a lot of historical data. UltiHash reduces storage overhead by handling deduplication automatically, helping to manage costs while keeping full query access.
  • Simple Setup: UltiHash exposes an S3-compatible API, which means no extra configuration is needed: just point Iceberg’s warehouse at an UltiHash bucket.

By using Iceberg for structured query management and UltiHash for scalable storage, you keep flexibility and performance without storage becoming a bottleneck. This setup is what forms the foundation of a lakehouse, combining the best of data lakes and data warehouses.

Writing data to UltiHash while converting to Iceberg

Let’s walk through setting up an Iceberg table and writing data to UltiHash as its storage backend.

1. Start an UltiHash cluster + integrate data

First things first! We need to set up an UltiHash cluster (guide here). Next, we use S3 API commands to create a bucket called iceberg and put our raw data inside (guide here).
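
If you prefer to do this step from Python, here is a minimal sketch with boto3 (an illustration on our side, not part of the official guides): the endpoint and test credentials match the ones used in the Spark configuration below, and the local file path is a placeholder.

import boto3

# Point boto3 at the local UltiHash endpoint (same test credentials as the Spark config below)
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:8080",
    aws_access_key_id="TEST-USER",
    aws_secret_access_key="SECRET",
)

# Create the bucket that will serve as the Iceberg warehouse
s3.create_bucket(Bucket="iceberg")

# Optionally upload raw source data (placeholder local path and object key)
s3.upload_file("test_data.csv", "iceberg", "raw/test_data.csv")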

2. Start a PySpark session

We need to configure Spark to use Iceberg and connect to UltiHash. Note that the catalog uses type=hadoop, so Iceberg keeps its metadata directly in the warehouse bucket and no separate catalog service is required:

pyspark \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hadoop \
  --conf spark.sql.catalog.iceberg.warehouse=s3a://iceberg \
  --conf spark.hadoop.fs.s3a.endpoint=http://127.0.0.1:8080 \
  --conf spark.hadoop.fs.s3a.access.key=TEST-USER \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --conf spark.driver.bindAddress=127.0.0.1 \
  --conf spark.driver.host=127.0.0.1


3. Create an Iceberg table

First, define a namespace and table:

spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg.ulti")

spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.ulti.test_iceberg_table (
        id INT,
        name STRING,
        price DOUBLE,
        category STRING,
        ts TIMESTAMP
    )
    USING iceberg
    TBLPROPERTIES (
        'format-version'='2',
        'write.metadata.previous-versions-max'='5'
    )
""")

Check if the table was created:

spark.sql("SHOW TABLES IN iceberg.ulti").show()


4. Load data from a CSV

Now, let’s load a sample CSV into a DataFrame:

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/Users/ultihash/Downloads/iceberg/test_data.csv")

df.show()
df.printSchema()

Example CSV data:

id,name,price,category,timestamp
1,Item_1,1.1,A,2024-01-01
2,Item_2,2.2,B,2024-01-02
3,Item_3,3.3,C,2024-01-03

5. Ensure timestamp format matches

Before writing, convert the timestamp column to match Iceberg’s schema:

from pyspark.sql.functions import col, to_timestamp

df = df.withColumnRenamed("timestamp", "ts") \
       .withColumn("ts", to_timestamp(col("ts")))

df.printSchema()

Now, the schema should match Iceberg’s table definition:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- price: double (nullable = true)
 |-- category: string (nullable = true)
 |-- ts: timestamp (nullable = true)


6. Write data to Iceberg (stored in UltiHash)

Now, append the DataFrame to the Iceberg table:

df.write.format("iceberg") \
    .mode("append") \
    .save("iceberg.ulti.test_iceberg_table")


7. Read data from Iceberg (stored in UltiHash)

After writing, query the table to confirm the data is stored:

spark.sql("SELECT * FROM iceberg.ulti.test_iceberg_table").show()

Expected output:

+---+-------+-----+--------+-------------------+
| id|   name|price|category|                 ts|
+---+-------+-----+--------+-------------------+
|  1| Item_1|  1.1|       A|2024-01-01 00:00:00|
|  2| Item_2|  2.2|       B|2024-01-02 00:00:00|
|  3| Item_3|  3.3|       C|2024-01-03 00:00:00|
+---+-------+-----+--------+-------------------+
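
Because every write is recorded as a snapshot, you can also inspect the table’s history and query an earlier version. Here is a minimal sketch; the snapshot ID is a placeholder you would copy from the snapshots output:

# List the snapshots Iceberg has recorded for this table
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM iceberg.ulti.test_iceberg_table.snapshots
""").show()

# Query the table as of a specific snapshot (placeholder snapshot ID)
spark.sql("""
    SELECT * FROM iceberg.ulti.test_iceberg_table VERSION AS OF 1234567890123456789
""").show()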

With this setup, Iceberg manages structured metadata while UltiHash stores the actual files. This means you get the flexibility of SQL-like queries combined with the scalability of UltiHash, without having to rethink your entire data pipeline.

Whether you're working with massive analytics datasets, AI training pipelines, or real-time updates, this combination keeps data storage efficient and queries fast.

To test this setup yourself, start an UltiHash cluster here.
