PyTorch Monarch: A New Framework for Complex, Dynamic Machine Learning Workflows

By jarbus

7mo ago · 17 min read · en · News

Summary

PyTorch Monarch is a new framework for modern ML workflows, which are heterogeneous, asynchronous, and dynamic. It moves away from PyTorch's traditional HPC-style multi-controller model (SPMD), in which every machine runs its own copy of the same script, to a single-controller architecture that provides a global view of workflow state. That global view makes it easier to handle complex scenarios such as pre-training that combines advanced parallelism with partial failures, and RL post-training that requires dynamic feedback loops. The framework aims to simplify workflows that are hard to implement in a multi-controller system, where each node must act on only a local view of the workflow's state.
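To make the single-controller idea concrete, here is a minimal sketch in plain Python. It does not use Monarch's API: `Worker`, `Controller`, `train_step`, and `run_step` are hypothetical names invented for illustration, and threads stand in for remote machines. It only shows why a global view helps with partial failure: one process sees which workers failed and reassigns their work.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a remote process (not a Monarch type)."""
    rank: int
    healthy: bool = True

    def train_step(self, shard: list[int]) -> int:
        if not self.healthy:
            raise RuntimeError(f"worker {self.rank} failed")
        return sum(shard)  # placeholder for real training work


@dataclass
class Controller:
    """Single controller: the one place that holds global workflow state."""
    workers: list[Worker]
    failed: set[int] = field(default_factory=set)

    def run_step(self, shards: list[list[int]]) -> int:
        pending = list(shards)
        total = 0
        while pending:
            # Global view: the controller knows every worker's status.
            active = [w for w in self.workers if w.rank not in self.failed]
            if not active:
                raise RuntimeError("no healthy workers left")
            # Spread the outstanding shards round-robin over healthy workers.
            assignments = [
                (active[i % len(active)], shard)
                for i, shard in enumerate(pending)
            ]
            pending = []
            with ThreadPoolExecutor(max_workers=len(active)) as pool:
                futures = [
                    (pool.submit(w.train_step, shard), w, shard)
                    for w, shard in assignments
                ]
                for future, worker, shard in futures:
                    try:
                        total += future.result()
                    except RuntimeError:
                        # Record the failure centrally and retry the shard.
                        self.failed.add(worker.rank)
                        pending.append(shard)
        return total
```

In an SPMD launch, each copy of the script would have to detect the failed peer and agree on a recovery plan using only its local state. Here that decision sits in one loop inside `run_step`.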

Key quotes

We now live in a world where ML workflows (pre-training, post training, etc) are heterogeneous, must contend with hardware failures, are increasingly asynchronous and highly dynamic.
Traditionally, PyTorch has relied on an HPC-style multi-controller model, where multiple copies of the same script are launched across different machines, each running its own instance of the application (often referred to as SPMD).
ML workflows are becoming more complex: pre-training might combine advanced parallelism with asynchrony and partial failure; while RL models used in post-training require a high degree of dynamism with complex feedback loops.
While the logic of these workflows may be relatively straightforward, they are notoriously difficult to implement well in a multi-controller system, where each node must decide how to act based on only a local view of the workflow's state.
