Artwork

Contenido proporcionado por The New Stack Podcast and The New Stack. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente The New Stack Podcast and The New Stack o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.
Player FM : aplicación de podcast
¡Desconecta con la aplicación Player FM !

Keeping GPUs Ticking Like Clockwork

27:08
 
Compartir
 

Manage episode 519876205 series 2574278
Contenido proporcionado por The New Stack Podcast and The New Stack. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente The New Stack Podcast and The New Stack o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.

Clockwork began with a narrow goal—keeping clocks synchronized across servers—but soon realized that its precise latency measurements could reveal deeper data center networking issues. This insight led the company to build a hardware-agnostic monitoring and remediation platform capable of automatically routing around faults. Today, Clockwork’s technology is especially valuable for large GPU clusters used in training LLMs, where communication efficiency and reliability are critical. CEO Suresh Vasudevan explains that AI workloads are among the most demanding distributed applications ever, and Clockwork provides building blocks that improve visibility, performance and fault tolerance. Its flagship feature, FleetIQ, can reroute traffic around failing switches, preventing costly interruptions that might otherwise force teams to restart training from hours-old checkpoints. Although the company originated from Stanford research focused on clock synchronization for financial institutions, the team eventually recognized that packet-timing data could underpin powerful network telemetry and dynamic traffic control. By integrating with NVIDIA NCCL, TCP and RDMA libraries, Clockwork can not only measure congestion but also actively manage GPU communication to enhance both uptime and training efficiency.

Learn more from The New Stack about the latest in Clockwork:

Clockwork’s FleetIQ Aims To Fix AI’s Costly Network Bottleneck

What Happens When 116 Makers Reimagine the Clock?

Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

  continue reading

305 episodios

Artwork
iconCompartir
 
Manage episode 519876205 series 2574278
Contenido proporcionado por The New Stack Podcast and The New Stack. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente The New Stack Podcast and The New Stack o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.

Clockwork began with a narrow goal—keeping clocks synchronized across servers—but soon realized that its precise latency measurements could reveal deeper data center networking issues. This insight led the company to build a hardware-agnostic monitoring and remediation platform capable of automatically routing around faults. Today, Clockwork’s technology is especially valuable for large GPU clusters used in training LLMs, where communication efficiency and reliability are critical. CEO Suresh Vasudevan explains that AI workloads are among the most demanding distributed applications ever, and Clockwork provides building blocks that improve visibility, performance and fault tolerance. Its flagship feature, FleetIQ, can reroute traffic around failing switches, preventing costly interruptions that might otherwise force teams to restart training from hours-old checkpoints. Although the company originated from Stanford research focused on clock synchronization for financial institutions, the team eventually recognized that packet-timing data could underpin powerful network telemetry and dynamic traffic control. By integrating with NVIDIA NCCL, TCP and RDMA libraries, Clockwork can not only measure congestion but also actively manage GPU communication to enhance both uptime and training efficiency.

Learn more from The New Stack about the latest in Clockwork:

Clockwork’s FleetIQ Aims To Fix AI’s Costly Network Bottleneck

What Happens When 116 Makers Reimagine the Clock?

Join our community of newsletter subscribers to stay on top of the news and at the top of your game.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

  continue reading

305 episodios

Todos los episodios

×
 
Loading …

Bienvenido a Player FM!

Player FM está escaneando la web en busca de podcasts de alta calidad para que los disfrutes en este momento. Es la mejor aplicación de podcast y funciona en Android, iPhone y la web. Regístrate para sincronizar suscripciones a través de dispositivos.

 

Guia de referencia rapida

Escucha este programa mientras exploras
Reproducir