
Cedana: AI/HPC GPU Checkpointing Startup Seeks Forward Deployed Engineer

By neelm · 2d ago · 4 min read

Summary

Cedana is a Y Combinator-backed startup that provides automated GPU checkpointing infrastructure to maximize AI and HPC cluster utilization and reliability. Its system migrates GPU workloads across instances transparently and without losing work, operating at the kernel/OS level with no code changes required. The company is hiring a Forward Deployed Engineer to lead customer integrations, deploy into SLURM, Kubernetes, and NVIDIA Dynamo environments, and drive product innovation from the field. The role requires 3-10 years of software engineering experience, SLURM deployment expertise, strong Linux fundamentals, and Kubernetes operations knowledge. The position is remote and US-based with ~25% travel, and it offers a $140K-$180K base salary plus equity.

Key quotes

Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure.
We enable transparent and fast migration of GPU workloads across instances, without losing work.
Our system is at the kernel/OS level, requiring no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo.
This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world's leading research and commercial customers to deliver a breakthrough solution.
Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI.
Snippet from the RSS feed
Introducing Cedana. The Problem: AI and HPC infrastructure suffers from scarcity and high costs, so when failures happen they are costly in terms of time and money. Cluster productivity directly determines research output and revenue. Achieving high util
