CONTENTS
What is a GAN?
What are GANs used for?
How does a GAN function?
A data-architecture-first strategy for successful GAN operations
GAN with an UltiHash Data Lake Architecture
Any questions?
USE CASE

Fast, efficient Object Storage for Generative AI Infrastructures

The case of GANs

What is a GAN?

A generative adversarial network (GAN) is a deep learning model that aims to produce realistic data. One of a GAN’s defining features is its unique architecture, which consists of two neural networks engaged in a zero-sum game: one network’s gain is the other’s loss. The first network, referred to as the generator, manufactures data (fake data), while the second network, referred to as the discriminator, distinguishes between fake and real data. The generator aims to fool the discriminator, a goal achieved once the discriminator cannot differentiate between the generator’s output and real data. In the business world, GANs add value by generating realistic datasets. These datasets can be leveraged for further machine learning operations or other research purposes, where generating artificial data is faster than waiting for an event to occur naturally.
Did you know?

Back in 2014, Ian Goodfellow and a group of friends met at Les 3 Brasseurs, a bar in Montreal, to celebrate a friend’s graduation. During that evening, the group discussed a project: how to create a computer that can generate photos by itself. Ian suggested using two neural networks that would learn from each other—one generating images and the second assessing the realism of the first's output. Later that evening, he went home and programmed the first GAN.

What are GANs used for?

GANs are designed to generate images for various purposes: creating artificial (training) datasets, completing missing information, and transforming image styles.
Generating datasets

Creating artificial data to train new machine learning models is called data augmentation. Astronomy is a field where data augmentation is a real asset: certain celestial events happen only every few years, and waiting for them is costly for research. Astronomers can train GANs to produce artificial data points for rare celestial events, thereby reducing the cost of waiting.
Completing missing information

GANs aim to create new artificial data by understanding the patterns underlying real data to complete missing information in an image. For example, a satellite image can be impaired by clouds covering half of it, hindering the view and resulting in an image that is only 50% exploitable. However, thanks to its training, a GAN can complete the missing information hidden by the clouds.
Converting an image’s style to another

Transforming data’s style allows users to observe a preview of the transformation output with one click. This is particularly beneficial for users who often follow a draft-to-maquette workflow, such as architects quickly turning building drawings into models, potentially avoiding a lengthy and costly process. Similarly, GAN-powered data transformation can produce maps from satellite images, facilitating and accelerating this process, increasing accuracy, and significantly saving resources.
Whether it is about creating satellite images, completing parts of them, or even transforming satellite images into maps, GANs can do it.

Another example of GAN use is in high-energy particle physics research, where CERN is evaluating GANs as a replacement for compute-intensive simulations. This would allow CERN to produce data points without actually running simulations, which would save time, computational power, and costs.

Limitations of GANs

It is important to note that GANs are still a major area of research. Many GAN variants have been introduced since 2014 to address the limits of classic GANs.

Classic GANs have several limitations due to their unique architecture. The zero-sum game between the generator and discriminator can lead to mode collapse, where the generator produces only similar data, reducing output variety and defeating the GAN's purpose. Another issue is vanishing gradients, where an overly effective discriminator leaves the generator with gradients close to zero, so that it stops improving and becomes stuck. Additionally, GANs often face non-convergence problems, where neither network reaches an optimal state.
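
To make the vanishing-gradient issue concrete, here is a minimal sketch in Python. It assumes a discriminator whose output is sigmoid(logit), and compares the gradient the generator receives under the minimax loss log(1 - D(G(z))) with the gradient under the alternative loss -log D(G(z)) that is often used in practice. The function names are illustrative and not taken from any library, and the alternative loss goes slightly beyond what this page describes.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic function, used here as the discriminator's output activation."""
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(logit: float) -> float:
    """Derivative of log(1 - sigmoid(logit)) with respect to the logit."""
    return -sigmoid(logit)


def non_saturating_grad(logit: float) -> float:
    """Derivative of -log(sigmoid(logit)) with respect to the logit."""
    return -(1.0 - sigmoid(logit))


# A very negative logit means the discriminator is confident the sample is fake.
for logit in (0.0, -5.0, -10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

When the discriminator confidently rejects a generated sample (a very negative logit), the first gradient shrinks toward zero while the second stays close to -1, which is why the second form is a common remedy.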

While these are major challenges, the potential benefits of balanced GAN training are significant, and a great deal of work is going into solving them.

How does a GAN function?

THE MODEL ARCHITECTURE
Let’s dive deeper into a GAN’s architecture and explore how the generator and discriminator work both independently and together, playing a zero-sum game, to improve their respective performances and create realistic synthetic data.

Training a GAN therefore means training two neural networks together. The discriminator learns how to distinguish real data from the generator’s output, and the generator learns how to produce realistic data. Each has a goal that supports the shared purpose of creating artificial data. The generator attempts to create data that looks realistic and sends it to the discriminator, which then labels the data as artificial or real. After the discriminator labels the data, the system evaluates the discriminator’s output to provide feedback to both neural networks on their actions: did the generator fool the discriminator or not?

This feedback is looped back to both neural networks, helping them perform better in the next epoch (training session). This highlights the uniqueness of a GAN: even though they are separate neural networks, their training unifies them. Ideally, the training stops when the discriminator can no longer distinguish between the generator’s output and data from the training dataset. Post-training, metrics such as the Fréchet Inception Distance and the Inception Score allow users to quantify the realism and diversity of images generated by the GAN.
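
As a rough illustration of the idea behind the Fréchet Inception Distance, the sketch below computes the Fréchet distance between two one-dimensional Gaussians fitted to real and generated values. This is a deliberate simplification: the actual metric compares multivariate statistics of image features extracted by an Inception network, which is not reproduced here. The function name `frechet_distance_1d` is hypothetical.

```python
import statistics
from typing import Sequence


def frechet_distance_1d(real: Sequence[float], generated: Sequence[float]) -> float:
    """Squared Fréchet distance between Gaussians fitted to two 1-D samples.

    For univariate Gaussians this reduces to
    (mean_r - mean_g)^2 + (std_r - std_g)^2.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Identical samples give a distance of zero; a shifted sample does not.
print(frechet_distance_1d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(frechet_distance_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

A lower value means the generated distribution is closer to the real one, which is also how the full metric is read.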
More about GANs

When discussing a GAN’s training, we refer to the training of the whole entity; however, a GAN is composed of two neural networks that are trained alternately at different paces, in a setup where each neural network leverages the other’s feedback. Usually, the discriminator is trained for several training steps before the generator completes one. This keeps the discriminator a good classifier, which facilitates the generator’s training.

In a GAN, the generator never actually sees the training data; it merely attempts to imitate the training data distribution.

First step: Set a loop where the discriminator is trained for several iterations (k > 1) to become a strong classifier. This helps ensure the discriminator can effectively guide the generator. A sample from the training dataset is fed to the discriminator, which learns to classify this sample as real.

Second step: A random vector drawn from a normal distribution is input into the generator, which outputs a sample. Initially, this output looks like noise because the generator has not yet learned to produce realistic data.

Third step: The generator's output is sent to the discriminator, which assesses whether the generated sample is real or fake. The discriminator outputs a probability value between 0 and 1, indicating how likely it believes the sample is real (closer to 1) or fake (closer to 0).

The generator's learning process depends on the discriminator's assessment of the samples it receives. The discriminator evaluates individual samples against the training data it has seen. If the generated sample is classified as fake, the generator is penalized; if classified as real, the generator is rewarded. Each network has its own loss function, which it aims to minimize; equivalently, in the original minimax formulation, the discriminator maximizes a shared value function while the generator minimizes it. Gradients are computed via backpropagation and the parameters are updated via gradient descent, applied separately to each network. This feedback helps the generator improve, enabling it to produce more realistic samples.
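
The opposing goals of the two networks are usually summarised, in the original GAN formulation, by a single value function. The notation below is introduced here for illustration and does not appear elsewhere on this page: D(x) is the discriminator's probability that x is real, G(z) is the generator's output for a noise vector z, p_data is the training data distribution, and p_z is the noise distribution.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator pushes V up by assigning high probabilities to real samples and low probabilities to generated ones, while the generator pushes V down by making D(G(z)) as large as possible.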

For each training step, the discriminator is trained first for several iterations before training the generator. This alternating training process ensures that each network improves iteratively while the other remains constant during its respective update phase.
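
The alternating schedule can be sketched as follows. This is a minimal illustration that only shows the order of updates: `train_discriminator` and `train_generator` are hypothetical callbacks standing in for the real update routines, and `k` is the number of discriminator iterations per generator update.

```python
from typing import Callable, List


def alternate_training(
    num_steps: int,
    k: int,
    train_discriminator: Callable[[int], None],
    train_generator: Callable[[int], None],
) -> None:
    """Run k discriminator updates, then one generator update, per training step."""
    for step in range(num_steps):
        for _ in range(k):
            train_discriminator(step)  # generator is held constant here
        train_generator(step)  # discriminator is held constant here


# Record the order of updates instead of training real networks.
log: List[str] = []
alternate_training(
    num_steps=2,
    k=3,
    train_discriminator=lambda step: log.append(f"D@{step}"),
    train_generator=lambda step: log.append(f"G@{step}"),
)
print(log)
```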

Here’s a recap per neural network once the first three steps have been completed.

Complete loop with discriminator training

The generator training is paused.

Data from the training dataset or produced by the generator is passed to the discriminator, which assesses the probability of it being real. Then, the discriminator’s loss function is computed, and its gradient is backpropagated so that gradient descent can update the discriminator's parameters and improve its classification ability: if the discriminator labeled the data correctly, its correct classification is reinforced, and if it labeled the data incorrectly, it is updated to better distinguish between real and artificial data, making it a better classifier for the next iteration.

Focus on what’s happening in the discriminator

  1. Data is received by the discriminator.
  2. The discriminator examines the data and outputs a probability indicating whether the data is artificial (closer to 0) or real (closer to 1).
  3. The loss function is computed based on the discriminator’s output versus the true label of the data (artificial or real).
  4. Compute the gradient of the loss function with respect to the discriminator parameters and update the weights in the discriminator. Back to step 1.
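
The four steps above can be sketched for a deliberately tiny case. This is an assumption-laden illustration, not a production model: the data is one-dimensional, the discriminator is a logistic classifier D(x) = sigmoid(w * x + c), the loss is binary cross-entropy, and the sample values are made up.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def discriminator_step(
    w: float, c: float, samples: Sequence[float], labels: Sequence[float], lr: float = 0.1
) -> Tuple[float, float, float]:
    """One gradient-descent update; returns new (w, c) and the loss before the update."""
    grad_w = grad_c = loss = 0.0
    for x, y in zip(samples, labels):
        p = sigmoid(w * x + c)  # step 2: probability that x is real
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))  # step 3
        grad_w += (p - y) * x  # step 4: gradient of the loss w.r.t. w
        grad_c += p - y  # step 4: gradient of the loss w.r.t. c
    n = len(samples)
    return w - lr * grad_w / n, c - lr * grad_c / n, loss / n


# Hypothetical data: real samples near 4.0, generated samples near 0.0.
samples: List[float] = [3.8, 4.1, 4.0, 0.1, -0.2, 0.0]
labels: List[float] = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
w, c = 0.0, 0.0
losses: List[float] = []
for _ in range(50):
    w, c, loss = discriminator_step(w, c, samples, labels)
    losses.append(loss)
print(round(losses[0], 3), round(losses[-1], 3))
```

The loss falls across iterations as the classifier learns to assign higher probabilities to the real samples than to the generated ones.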

Complete loop with generator training

The discriminator training is paused.

The generator produces artificial data that is passed to the discriminator. The discriminator classifies the data received as real or artificial. The loss function of the generator is calculated based on the discriminator’s classification: if the discriminator identified the data as artificial, the generator's loss increases; if the discriminator was fooled and identified the data as real, the generator's loss decreases. The gradient of this loss is then backpropagated through the paused discriminator to the generator, and gradient descent updates the generator's parameters, enabling it to produce more realistic data that is harder for the discriminator to distinguish from real data in the next iteration.

Focus on what’s happening in the generator

  1. The generator has its weights adjusted via gradient descent based on the discriminator's feedback.
  2. It generates new data guided by this adjustment to better imitate the training data distribution.
  3. This newly generated data is passed to the discriminator for classification.
  4. Post-discriminator classification, the generator’s loss function is computed and its gradient is backpropagated to the generator. Back to step 1.
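
The generator loop can be sketched in the same toy setting. The assumptions are loud ones: the generator is a linear map G(z) = a * z + b, the paused discriminator is a fixed logistic classifier D(x) = sigmoid(w * x + c) with made-up parameters, and the generator loss is -log D(G(z)), which grows when the discriminator calls a sample artificial.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def generator_step(
    a: float, b: float, w: float, c: float, noise: Sequence[float], lr: float = 0.1
) -> Tuple[float, float, float]:
    """One generator update against a frozen discriminator (w and c are not changed)."""
    grad_a = grad_b = loss = 0.0
    for z in noise:
        x_fake = a * z + b  # generate a sample from noise
        p = sigmoid(w * x_fake + c)  # discriminator's belief that it is real
        loss += -math.log(p)  # high when the discriminator is not fooled
        grad_a += -(1.0 - p) * w * z  # chain rule through the discriminator
        grad_b += -(1.0 - p) * w
    n = len(noise)
    return a - lr * grad_a / n, b - lr * grad_b / n, loss / n


rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(64)]
a, b = 1.0, 0.0
w_frozen, c_frozen = 1.0, -2.0  # hypothetical discriminator: larger x looks more real
gen_losses: List[float] = []
for _ in range(100):
    a, b, loss = generator_step(a, b, w_frozen, c_frozen, noise)
    gen_losses.append(loss)
print(round(b, 3), round(gen_losses[0], 3), round(gen_losses[-1], 3))
```

Only the generator's parameters move: the offset b drifts toward the region the frozen discriminator considers real, and the generator's loss falls accordingly.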

A data-architecture-first strategy for successful GAN operations

Let’s have a look at how to implement a GAN and then check out the implications it has on your data infrastructure.

GANs are machine learning models where training is based on images. Users therefore need a training dataset of images whose resolution matches the desired resolution of the output. We’ve looked for best practices to create an optimal data corpus for your GAN, and unsurprisingly, the more images the GAN is trained on, the better the output.

GANs are "data gluttons," requiring large volumes of unstructured data at rest for three key occasions:


1. Training Datasets

GANs need large training datasets, often consisting of several hundreds of thousands of raw images. These images need to be stored in their raw form and then pre-processed before being put into batches to feed the model.

2. Checkpointing

During training, GANs generate additional data, because the state of both neural networks is stored after each epoch. This frequent checkpointing increases storage demands, as it requires saving numerous intermediate states for documentation and recovery purposes.
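
A minimal sketch of per-epoch checkpointing is shown below. It assumes the network states can be represented as small dictionaries and written as JSON to a local directory; real frameworks use their own binary formats, and the file naming scheme and function names here are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(directory: Path, epoch: int, generator_state: dict, discriminator_state: dict) -> Path:
    """Write the state of both networks for one epoch to its own file."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"checkpoint_epoch_{epoch:04d}.json"
    payload = {"epoch": epoch, "generator": generator_state, "discriminator": discriminator_state}
    path.write_text(json.dumps(payload))
    return path


def load_latest_checkpoint(directory: Path) -> dict:
    """Return the checkpoint with the highest epoch (names are zero-padded, so they sort)."""
    latest = sorted(directory.glob("checkpoint_epoch_*.json"))[-1]
    return json.loads(latest.read_text())


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "checkpoints"
    for epoch in range(3):
        # Placeholder states; a real run would store the actual network weights.
        save_checkpoint(ckpt_dir, epoch, {"a": 1.0 + epoch, "b": 0.5}, {"w": 0.1 * epoch, "c": 0.0})
    total_bytes = sum(p.stat().st_size for p in ckpt_dir.iterdir())
    resumed_epoch = load_latest_checkpoint(ckpt_dir)["epoch"]
print(resumed_epoch, total_bytes)
```

Because one file is written per epoch, the storage used by checkpoints grows with the number of epochs, which is the effect described above.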

3. Post-Training Data Generation

Even after training, GANs continue to generate images, which must be stored for their entire lifetime. This ongoing data generation adds continuous pressure to scale, relying on existing storage systems to accommodate the growing volume of generated images.

GAN operations require a robust and scalable data infrastructure to manage the increasing volumes of data throughout their lifecycle. The foundation storage layer must not only accommodate large capacity but also support high IOPS (Input/Output Operations Per Second) due to the frequent read and write operations involved. During training, datasets are read in small batches and sent to the model training layer, checkpointing files are written to the storage layer after each epoch, and post-training images generated by the GAN are also stored. This high level of IOPS is critical for achieving good time-to-value, ensuring the model produces the desired outcomes as quickly as possible. As data volumes grow, the data infrastructure must expand accordingly, handling open formats while maintaining high-performance operations. The lakehouse architecture fits this description perfectly, facilitating GAN operations while relying on object storage as the foundation storage layer to ensure robustness and speed.

In the current market, users seeking storage solutions face a binary situation: expensive storage that allows for fast data retrieval or affordable storage with longer retrieval times. Additionally, increasing data volumes lead to higher hardware and cloud costs, increased maintenance, and overall surging resource consumption.
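
To get a feel for how the three storage occasions add up, here is a back-of-the-envelope sketch. Every number in it is hypothetical and chosen only for illustration (image count, average image size, number of epochs, checkpoint size, and number of generated images); it does not describe any measured workload.

```python
from typing import Dict


def storage_footprint_gb(
    num_training_images: int,
    avg_image_mb: float,
    epochs: int,
    checkpoint_mb: float,
    generated_images: int,
) -> Dict[str, float]:
    """Estimate raw storage (in GB, 1 GB = 1024 MB) before any deduplication."""
    training = num_training_images * avg_image_mb / 1024
    checkpoints = epochs * checkpoint_mb / 1024  # one checkpoint per epoch
    generated = generated_images * avg_image_mb / 1024
    return {
        "training_gb": training,
        "checkpoints_gb": checkpoints,
        "generated_gb": generated,
        "total_gb": training + checkpoints + generated,
    }


# Hypothetical inputs: 300,000 images of 2 MB, 200 epochs with 512 MB
# checkpoints, and 1,000,000 generated images of the same size.
estimate = storage_footprint_gb(300_000, 2.0, 200, 512.0, 1_000_000)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```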

UltiHash is a new generation of object storage designed for AI/ML infrastructure, enabling users to regain control over resource consumption. It is high-performing and resource-efficient, creating space savings through byte-level data deduplication.

The end result: storage growth is no longer exponential, and data can be accessed in record time, improving the ROI on your data infrastructure investments and avoiding the cycle of endless stop-gap infrastructure upgrades.

GAN with an UltiHash Data Lake Architecture

What does a GAN data infrastructure look like? It should involve UltiHash object storage as the storage layer of your lakehouse architecture, potentially integrated with ML tools like SageMaker or compute instances like EC2.

UltiHash is the primary underlying storage foundation for AIRAs (AI-ready architectures), and is container-native for cloud and on-premises applications. It is designed to handle peta- to exabyte-scale data volumes while maintaining high speed and being resource-efficient.

Resource-efficient scalable storage

UltiHash provides resource-efficient, scalable storage with fine-grained sub-object deduplication across the entire storage pool. This technology allows for high scalability. The result? Storage volumes do not grow linearly with your total data. In the context of GANs, this has the most significant impact during training, on the size of the training dataset and checkpoint files, and after training, on the generated images.

Lightweight, CPU-optimized deduplication

UltiHash achieves performant and efficient operations thanks to an architecture designed to handle high IOPS and a lightweight deduplication algorithm that keeps CPU time to a minimum.

Flexible + interoperable via S3 API

UltiHash offers high interoperability through its native S3-compatible API. It integrates with processing engines (Flink, PySpark), ETL tools (Airflow, AWS Glue), open table formats (Delta Lake, Iceberg) and ML tools (SageMaker). If you’re using a tool we are not supporting yet, let us know and we’ll look into it!
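
Because the API is S3-compatible, a generic S3 client should in principle be able to talk to it. The sketch below uses boto3 as an example client. The endpoint URL, credentials, bucket, and key are placeholders invented for illustration, not documented UltiHash values, and `InMemoryS3` is a hypothetical stand-in so the example can run without a server or boto3 installed.

```python
import io
from typing import Dict, Tuple


def make_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    """Create a boto3 S3 client pointed at an S3-compatible endpoint."""
    import boto3  # third-party; imported lazily so the dry run below works without it

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def upload_bytes(client, bucket: str, key: str, data: bytes) -> None:
    """Store one object, for example a generated image or a checkpoint."""
    client.put_object(Bucket=bucket, Key=key, Body=data)


def download_bytes(client, bucket: str, key: str) -> bytes:
    """Read one object back."""
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()


class InMemoryS3:
    """Hypothetical stand-in that mimics the two client calls used above."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, Bucket: str, Key: str, Body: bytes) -> None:
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket: str, Key: str) -> dict:
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


# Dry run against the stand-in; swap in make_s3_client(...) for a real endpoint,
# e.g. make_s3_client("http://localhost:8080", "ACCESS_KEY", "SECRET_KEY").
client = InMemoryS3()
upload_bytes(client, "gan-data", "generated/img_0001.png", b"fake image bytes")
round_trip = download_bytes(client, "gan-data", "generated/img_0001.png")
print(round_trip)
```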

Any questions?

What is UltiHash?

UltiHash is the neat foundation for data-intensive applications. It is powered by deduplication algorithms and streamlined storage techniques. It leverages past data integrations to generate significant space savings while delivering high-speed access. UltiHash enhances your data management by making large datasets and data growth have a synergistic effect on your infrastructure.

What does UltiHash offer?

UltiHash facilitates data growth within the same existing storage capacity. UltiHash deduplicates per and across datasets from terabytes to exabytes: users store only what they truly need. It’s fast, efficient, and works at a byte level, making it agnostic to data format or type. With UltiHash, the trade-off between high costs and low performance is a thing of the past.

What is object storage?

Object storage is a data storage solution that is suitable for storing all data types (structured, semi-structured and unstructured) as objects. Each object includes the data itself, its metadata, and a unique identifier, allowing for easy retrieval and management. Unlike traditional file or block storage, object storage is highly scalable, making it ideal for managing large amounts of unstructured data.

How does data deduplication work in UltiHash?


Data is analysed on a byte level and dynamically split into fragments, which allows the system to separate fragments that are unique from those that contain duplicates. UltiHash matches duplicates per and across datasets, leveraging the entirety of the data. Fragments that are unique and were not matched across the dataset or past integrations are then added to UltiHash, while matches are added to an existing fragment. This is our process to keep your storage footprint growth sustainable.
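
The general idea of fragment-level deduplication can be sketched as follows. This is not UltiHash's algorithm: it is a simplified, generic illustration that assumes fixed-size fragments and SHA-256 digests as fragment identifiers, whereas the text above describes dynamic splitting. The class and method names are hypothetical.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy store that keeps each unique fragment once and objects as lists of digests."""

    def __init__(self, fragment_size: int = 4) -> None:
        self.fragment_size = fragment_size
        self.fragments: Dict[str, bytes] = {}  # digest -> fragment bytes
        self.objects: Dict[str, List[str]] = {}  # object name -> ordered digests

    def put(self, name: str, data: bytes) -> None:
        recipe: List[str] = []
        for start in range(0, len(data), self.fragment_size):
            fragment = data[start:start + self.fragment_size]
            digest = hashlib.sha256(fragment).hexdigest()
            # Only fragments not seen before consume new storage.
            self.fragments.setdefault(digest, fragment)
            recipe.append(digest)
        self.objects[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.fragments[d] for d in self.objects[name])

    def stored_bytes(self) -> int:
        return sum(len(f) for f in self.fragments.values())


store = DedupStore(fragment_size=4)
store.put("image_a", b"AAAABBBBCCCC")
store.put("image_b", b"AAAABBBBDDDD")  # shares two fragments with image_a
logical_bytes = 12 + 12
print(logical_bytes, store.stored_bytes())
```

The two objects hold 24 bytes of logical data but only 16 bytes of unique fragments are kept, because the shared fragments are stored once and referenced twice.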

What is unique about UltiHash?

UltiHash efficiently stores your desired data volume, providing significant space savings, high speed and the flexibility to scale up seamlessly. Increase your data volumes within the existing storage capacity, without compromising on speed.

Can UltiHash be integrated in existing cloud environments?

Absolutely. UltiHash can be integrated into existing cloud environments, such as those that leverage EBS. UltiHash was designed to be deployed in the cloud, and we can suggest specific machine configurations for optimal performance. The cloud environment remains in the hands of the administrator, who can configure it as preferred.

What API does UltiHash provide, and how does it connect to my other applications?

UltiHash provides an S3-compatible API. The decision for our API to be S3 compatible was made with its utility in mind - any S3 compatible application qualifies as a native integration. We want our users to have smooth and seamless integration.

How does UltiHash ensure data security and privacy?

The user is in full control of the data. UltiHash is a foundation layer that slides into an existing IT system. The infrastructure, and data stored, are the sole property of the user: UltiHash merely configures the infrastructure as code.

Is UltiHash suitable for both large and small scale enterprises?

UltiHash was designed to empower small and large data-driven organisations to pursue their thirst to innovate at a high pace.

What type of data can be stored in UltiHash?

The data integrated through UltiHash is read on a byte level; in other words, UltiHash processes are not impacted by the type or format of the data integrated, and work with structured, semi-structured and unstructured data.

What are the pricing models for UltiHash services?

UltiHash currently charges a fixed fee of $6 per TB per month - whether on-premises or in the cloud.

Need more answers?