Mar 8, 2024

The dangers of data swamps full of dark data

How a data lake's strength - all that data - can also be its greatest weakness

At the start, your data lake was great: a vast repository full of useful insights.

But over time, the clear water grew murky. Everybody dumped data without a thought for quality or usability. With no clear catalog, searching for real insights was tough. Everyone had the vague feeling that there was a treasure trove under heaps of useless data - if only they could get at it!

So a data lake's great strength can also be its greatest weakness. As volume and variety grow, you risk a 'data swamp' - full of disorganized, low-quality data. On the flip side, there's 'dark data': high-value, but unused or even unknown.

Luckily, none of this is inevitable. You can shield your lake from this fate with data governance: set standards for data quality, security, and compliance - and enforce them - ensuring consistency across the board:

  • Establish a data catalog. Look for tools to add descriptions, tags, and categories, like Apache Atlas, Collibra, or Alation. On GCP, use Google's Dataplex catalog; on AWS, the Glue Data Catalog is a great bet.
  • Perform regular audits for obsolete or redundant data. Visibility into access patterns also helps general security, so look into behavior analytics tools like Varonis DatAdvantage. Plus, all major cloud platforms let you set up retention rules that relocate data to cooler storage, or delete it permanently, as time passes.
  • Use roles to specify access rights and manage compliance. You can use built-in platform tools, or third-party identity tools like SailPoint and Okta.
  • Add ingestion checks to prevent low-quality data from entering the lake. DataCleaner is an excellent off-the-shelf tool for tidying up data, but you can also build custom scripts of your own.
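To make the last point concrete, here's a minimal sketch of a custom ingestion gate in Python. It assumes records arrive as dicts; the required field names and the quarantine split are hypothetical examples, not a prescription:

```python
# A minimal ingestion quality gate: reject records that are missing
# required fields or contain empty values, and quarantine them for review.
# Field names below are illustrative - adapt them to your own schema.

REQUIRED_FIELDS = {"id", "timestamp", "source"}

def validate_record(record: dict) -> list:
    """Return a list of quality problems; an empty list means the record passes."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append("missing fields: %s" % sorted(missing))
    for key, value in record.items():
        if value is None or value == "":
            problems.append("empty value for '%s'" % key)
    return problems

def filter_batch(batch: list) -> tuple:
    """Split a batch into records fit for the lake and rejects to quarantine."""
    accepted, rejected = [], []
    for record in batch:
        (rejected if validate_record(record) else accepted).append(record)
    return accepted, rejected
```

In practice you'd wire a check like this into whatever ingestion pipeline feeds the lake, and route the rejects to a quarantine location instead of silently dropping them - that audit trail is itself governance data.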
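And for the retention rules mentioned above, here's a sketch of what such a policy looks like on AWS: an S3 lifecycle configuration that tiers data to colder storage and later expires it. The bucket prefix and day counts are hypothetical; tune them to what your audits reveal:

```python
# Sketch of an S3 lifecycle policy: tier raw data to Glacier after 90 days,
# then delete it permanently after a year. Prefix and ages are examples only.

lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-then-expire-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            # Move objects to colder, cheaper storage after 90 days...
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            # ...and remove them entirely after 365 days.
            "Expiration": {"Days": 365},
        }
    ]
}

# With boto3, this dict would be applied via:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="your-lake-bucket", LifecycleConfiguration=lifecycle_config)
```

GCP and Azure offer equivalent lifecycle-management features for Cloud Storage and Blob Storage, so the same tier-then-expire pattern applies regardless of platform.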

Seems like a lot? For small- to medium-size companies new to data lakes, start with an out-of-the-box solution with built-in governance features - like Databricks with Delta Lake. Once you start to find the limits of these solutions and know your needs, explore building your own custom lake - and then apply these techniques.

Remember: strategies like these need a dedicated team (or person) responsible for data governance. Make sure all this aligns with company objectives: without buy-in from internal stakeholders, the data lake might go underused anyway, wasting all the work you put into making data readily available. With everyone on the same page, your data lake can remain a valuable asset for insights - instead of a swamp.

Posted by
Simon Camp
Product Designer
