You know that feeling when a concept everyone's discussing on Hacker News just… doesn't quite click? Like, you get the vibe, but the practical application feels miles away. That was me with the manifold concept for a while. It kept popping up – 292 points, 90 comments, clearly people were engaged – and I initially thought, is this some deep mathematical thing that doesn't apply to my everyday data problems? Turns out it absolutely applies, especially when you're wrestling with a messy data lake and trying to make sense of different data sources.
Executive Summary
We faced a pretty common headache: an analytics platform drowning in fragmented data. Our analysts needed to combine data from wildly different sources: user behaviour from clickstream logs, order history from Postgres, and inventory details from an ancient ERP system. All of it sat in different corners of our infrastructure. This wasn't just slow; it was a constant source of data inconsistencies and frustration. We realised we needed one clear way to access and query this data without building a huge data warehouse or copying everything. The solution, which I now think of as defining data manifolds, came through using pg_lake to create a single, logical access layer over our Postgres and Iceberg data. This approach cut query times from 45 seconds to under 8 seconds, reduced data prep time by 70%, and made our data far more reliable. It was a proper game-changer; it showed us that sometimes a new way of thinking – a manifold for your data – is exactly what you need.
Company/Project Background
I was leading the backend team for "Project Horizon," an e-commerce platform that had grown organically over five years. We'd started small: a single, big Postgres database for everything. But as we scaled, we spread things out. Clickstream data went into S3, then became Iceberg tables for better schema evolution and performance. Order data stayed in a heavily tuned Postgres cluster. Customer profiles lived in another, slightly older Postgres instance. Then we had product catalogues and inventory in a totally separate, vendor-managed system that you could only reach through a REST API or scheduled CSV dumps. It was classic data sprawl.
Our analytics team, bless their hearts, were trying to make sense of user journeys. They needed to know: "What did a user browse (Iceberg), what did they add to cart (Postgres), and what was their overall purchase history (Postgres), factoring in product availability (external API)?" This meant complex joins across all sorts of different data sources. Our initial setup involved a bunch of Python scripts pulling from S3, pushing into a separate Postgres data warehouse (which was basically just a big staging area), and then joining it with live production Postgres data. It was brittle, slow, and a nightmare to maintain. We tried building something similar to My Claude Code Infrastructure Journey for our ETL, but the data itself was the problem, not just the pipes.
Challenge Description
The data fragmentation wasn't just a technical challenge; it was a business blocker. Our analysts were spending about 15-20 hours a week just on data wrangling. They'd pull clickstream logs, change them around, load them into a Postgres table, then query that with our main Postgres order data. If a schema changed in the Iceberg tables, their scripts would break, which meant our data didn't match up. We were seeing 2-3 major data consistency issues flagged weekly in reports, which made people stop trusting our analytics. Queries for cross-source analysis often took 30-45 seconds, sometimes even timing out, so you couldn't really explore data on the fly.
We tried putting everything into a single, huge Postgres instance. But that came with its own issues: managing semi-structured data in Postgres was awkward, and the sheer volume of Iceberg data meant either massive storage costs or complex ETL pipelines that copied data and added latency. Plus, keeping data in sync across multiple systems for real-time access felt like chasing a moving target.
This is where the idea of a "data manifold" started to take shape. How could we offer one clear place – a single, logical view – through which analysts could access all this data as if it were one coherent dataset, even though it physically lived across S3/Iceberg and multiple Postgres instances? We needed to hide where the data was actually stored and how each source was shaped, so analysts could query the natural relationships in the data without constantly moving and transforming it. Our existing access patterns were just too chaotic.
Solution Implementation Details
The manifold concept, for us, became about creating a flexible, unified logical layer over our heterogeneous data. It wasn't about moving all the data into one place, but about giving Postgres the power to query it right where it was.
Enter pg_lake and Iceberg: This combination was the main building block. pg_lake is a Postgres extension that allows Postgres to query data lake formats like Iceberg directly from S3. This meant Postgres could act as our data manifold engine without needing to pull all the data in. We could keep our huge clickstream data in Iceberg on S3, getting the good stuff from its schema evolution and partitioning, and still query it alongside our day-to-day Postgres data.
pg_lake: We installed pg_lake on our main analytics Postgres instance. It was pretty simple:

CREATE EXTENSION pg_lake;

Then we configured a foreign data wrapper server for S3:
CREATE SERVER pg_lake_server
  FOREIGN DATA WRAPPER pg_lake
  OPTIONS (
    type 's3',
    access_key_id 'YOUR_ACCESS_KEY',
    secret_access_key 'YOUR_SECRET_KEY',
    region 'eu-west-2'  -- Our AWS region
  );
Gotcha: Initially I forgot to set the region option, which led to baffling S3 "connection refused" errors. It took me a frustrating 3 hours of staring at the stack trace to realise my mistake. Small detail, big impact.
Iceberg Foreign Tables: Next, we defined foreign tables in Postgres that told Postgres where to find our Iceberg datasets on S3. For example, our clickstream_data:
CREATE FOREIGN TABLE iceberg_clickstream (
  user_id    text,
  event_type text,
  event_time timestamptz,
  page_url   text,
  session_id text
)
SERVER pg_lake_server
OPTIONS (
  table_format 'iceberg',
  location 's3://my-prod-data-lake/clickstream_data/'
);
This made our Iceberg data queryable as if it were a native Postgres table. Iceberg's schema evolution capabilities were a massive help here. When our marketing team added a new referrer_url column to the clickstream, Iceberg dealt with it smoothly
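To make concrete what this unlocked, here's a sketch of the kind of cross-source query our analysts could now run in a single Postgres session. Everything except iceberg_clickstream is illustrative: "orders" stands in for our native Postgres order table, and its column names (order_id, user_id, created_at) are hypothetical.

-- Sketch only: joins the Iceberg-backed foreign table against a
-- hypothetical native Postgres "orders" table in one query.
SELECT
  c.user_id,
  count(*) FILTER (WHERE c.event_type = 'page_view') AS page_views,
  count(DISTINCT o.order_id)                         AS orders_placed
FROM iceberg_clickstream c
LEFT JOIN orders o
  ON o.user_id = c.user_id
 AND o.created_at >= c.event_time  -- only orders after the browse event
WHERE c.event_time >= now() - interval '7 days'
GROUP BY c.user_id;

No staging tables, no Python glue: the clickstream rows are read from S3/Iceberg by pg_lake at query time, and Postgres handles the join and aggregation exactly as it would for two local tables.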