Boosting LLM/RAG Workflows & Scheduling W/ Composable Memory And Checkpointing // Bernie Wu // #270 MLOps.community podcast

Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // Bernie Wu // #270

1y ago 55:18

Contenido proporcionado por Demetrios. Todo el contenido del podcast, incluidos episodios, gráficos y descripciones de podcast, lo carga y proporciona directamente Demetrios o su socio de plataforma de podcast. Si cree que alguien está utilizando su trabajo protegido por derechos de autor sin su permiso, puede seguir el proceso descrito aquí https://es.player.fm/legal.

Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft.

Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge.

// Abstract

Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint.

We will highlight some recent work we’ve done in integrating this novel class of memory into LLM/RAG/vector database frameworks and workflows. Disaggregated shared memory is envisioned to offer high-performance, low-latency caches for model/pipeline checkpoints of LLM models, KV caches during distributed inferencing, LORA adaptors, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months.

// Bio

Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been on building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA.

// MLOps Swag/Merch

https://mlops-community.myshopify.com/

// Related Links

Website: www.memverge.com

Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/

Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg

Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.htmlUltra Ethernet Consortium: https://ultraethernet.org/

Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html

RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/

--------------- ✌️Connect With Us ✌️ -------------

Join our Slack community: https://go.mlops.community/slack

Catch all episodes, blogs, newsletters, and more: https://mlops.community/

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/

Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/

Timestamps:

[00:00] Bernie's preferred coffee

[00:11] Takeaways

[01:37] First principles thinking focus

[05:02] Memory Abundance Concept Discussion

[06:45] Managing load spikes

[09:38] GPU checkpointing challenges

[16:29] Distributed memory problem solving

[18:27] Composable and Virtual Memory

[21:49] Interactive chat annotation

[23:46] Memory elasticity in AI

[27:33] GPU networking tests

[29:12] GPU Scheduling workflow optimization

[32:18] Kubernetes Extensions and Tools

[37:14] GPU bottleneck analysis

[42:04] Economical memory strategies

[45:14] Elastic memory management strategies

[47:57] Problem-solving approach

[50:15] AI infrastructure elasticity evolution

[52:33] RDMA and RoCE explained

[54:14] Wrap up

490 episodios