Technical Discussion: Distributed SQL Engine Requirements for Ultra-Wide Tables in ML and Multi-Omics Data
By synsqlbythesea
Summary
A technical discussion of how current SQL databases and data processing systems struggle with ultra-wide tables of thousands to tens of thousands of columns, particularly in ML feature engineering and multi-omics work. The author lists three practical limits: standard SQL databases usually cap out at around 1,000–1,600 columns, columnar formats like Parquet can hold the width but typically need Spark or Python pipelines to query it, and OLAP engines are fast but tend to assume a bounded column count. The post asks for recommendations for distributed SQL engines that handle such wide tables effectively.
Key quotes
At some point, the problem stops being 'how many rows' and becomes 'how many columns'. Thousands, then tens of thousands, sometimes more.
Standard SQL databases usually cap out around ~1,000–1,600 columns.
Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
OLAP engines are fast, but tend to assume a certain column count range.
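To make the column-cap problem concrete, here is a minimal sketch of one common workaround, vertical partitioning: the logical wide table is split into several physical tables that share a row key, and a query joins only the partitions holding the columns it asks for. This is an illustration under stated assumptions, not something the post proposes. The names `plan_partitions`, `select_columns`, `part_N`, and `row_id` are hypothetical, the cap of 4 columns is deliberately tiny so the example runs anywhere, and SQLite (from the Python standard library) stands in for whatever engine is actually in use.

```python
import sqlite3

MAX_COLS = 4  # illustrative cap; the post cites roughly 1,000-1,600 for real databases


def plan_partitions(columns, max_cols=MAX_COLS):
    """Split logical columns into chunks that each fit in one physical table.

    One slot per table is reserved for the shared row key.
    """
    width = max_cols - 1
    return [columns[i:i + width] for i in range(0, len(columns), width)]


def create_partitions(conn, partitions):
    """Create one physical table per chunk, all keyed by row_id."""
    for idx, cols in enumerate(partitions):
        col_defs = ", ".join(f'"{c}" REAL' for c in cols)
        conn.execute(
            f"CREATE TABLE part_{idx} (row_id INTEGER PRIMARY KEY, {col_defs})"
        )


def insert_row(conn, partitions, row_id, values):
    """Write one logical row by writing its slice into every partition."""
    for idx, cols in enumerate(partitions):
        placeholders = ", ".join("?" for _ in range(len(cols) + 1))
        conn.execute(
            f"INSERT INTO part_{idx} VALUES ({placeholders})",
            [row_id] + [values[c] for c in cols],
        )


def select_columns(conn, partitions, wanted):
    """Read the requested columns, joining only the partitions that hold them."""
    owner = {c: idx for idx, cols in enumerate(partitions) for c in cols}
    needed = sorted({owner[c] for c in wanted})
    base = needed[0]
    select_list = ", ".join(f'part_{owner[c]}."{c}"' for c in wanted)
    joins = " ".join(
        f"JOIN part_{i} ON part_{i}.row_id = part_{base}.row_id"
        for i in needed[1:]
    )
    sql = (
        f"SELECT part_{base}.row_id, {select_list} "
        f"FROM part_{base} {joins} ORDER BY part_{base}.row_id"
    )
    return conn.execute(sql).fetchall()


# Ten logical feature columns spread over tables of at most MAX_COLS columns.
columns = [f"f{i}" for i in range(10)]
partitions = plan_partitions(columns)

conn = sqlite3.connect(":memory:")
create_partitions(conn, partitions)
for row_id in (1, 2):
    insert_row(conn, partitions, row_id,
               {f"f{i}": row_id * 100 + i for i in range(10)})

# Only the partitions holding f1 and f9 are touched by this query.
rows = select_columns(conn, partitions, ["f1", "f9"])
print(len(partitions), rows)
```

The trade-off is that every cross-partition read becomes a join on the row key and every logical insert becomes several physical inserts, which is part of why the post asks for an engine that handles the width natively.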
