51 subscribers
¡Desconecta con la aplicación Player FM !
Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // Bernie Wu // #270
Manage episode 446382145 series 3241972
Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge. // Abstract Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint. We will highlight some recent work we’ve done in integrating of this novel class of memory into LLM/RAG/vector database frameworks and workflows. Disaggregated shared memory is envisioned to offer high performance, low latency caches for model/pipeline checkpoints of LLM models, KV caches during distributed inferencing, LORA adaptors, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months. // Bio Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA. // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Website: www.memverge.com Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/ Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg
Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.htmlUltra Ethernet Consortium: https://ultraethernet.org/
Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html
RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/
Timestamps: [00:00] Bernie's preferred coffee [00:11] Takeaways [01:37] First principles thinking focus [05:02] Memory Abundance Concept Discussion [06:45] Managing load spikes [09:38] GPU checkpointing challenges [16:29] Distributed memory problem solving [18:27] Composable and Virtual Memory [21:49] Interactive chat annotation [23:46] Memory elasticity in AI [27:33] GPU networking tests [29:12] GPU Scheduling workflow optimization [32:18] Kubernetes Extensions and Tools [37:14] GPU bottleneck analysis [42:04] Economical memory strategies [45:14] Elastic memory management strategies [47:57] Problem solving approach [50:15] AI infrastructure elasticity evolution [52:33] RDMA and RoCE explained [54:14] Wrap up
426 episodios
Manage episode 446382145 series 3241972
Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge. // Abstract Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint. We will highlight some recent work we’ve done in integrating of this novel class of memory into LLM/RAG/vector database frameworks and workflows. Disaggregated shared memory is envisioned to offer high performance, low latency caches for model/pipeline checkpoints of LLM models, KV caches during distributed inferencing, LORA adaptors, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months. // Bio Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA. // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Website: www.memverge.com Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/ Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg
Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.htmlUltra Ethernet Consortium: https://ultraethernet.org/
Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html
RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/
Timestamps: [00:00] Bernie's preferred coffee [00:11] Takeaways [01:37] First principles thinking focus [05:02] Memory Abundance Concept Discussion [06:45] Managing load spikes [09:38] GPU checkpointing challenges [16:29] Distributed memory problem solving [18:27] Composable and Virtual Memory [21:49] Interactive chat annotation [23:46] Memory elasticity in AI [27:33] GPU networking tests [29:12] GPU Scheduling workflow optimization [32:18] Kubernetes Extensions and Tools [37:14] GPU bottleneck analysis [42:04] Economical memory strategies [45:14] Elastic memory management strategies [47:57] Problem solving approach [50:15] AI infrastructure elasticity evolution [52:33] RDMA and RoCE explained [54:14] Wrap up
426 episodios
Все серии
×
1 Agents of Innovation: AI-Powered Product Ideation with Synthetic Consumer Testing // Luca Fiaschi // #306 1:02:23

1 We're All Finetuning Incorrectly // Tanmay Chopra // #304 1:00:30

1 From Rules to Reasoning Engines // George Mathew // #296 1:05:26

1 GenAI Traffic: Why API Infrastructure Must Evolve... Again // Erica Hughberg // #296 1:06:24

1 Future of Software, Agents in the Enterprise, and Inception Stage Company Building // Eliot Durbin // #293 54:26
Bienvenido a Player FM!
Player FM está escaneando la web en busca de podcasts de alta calidad para que los disfrutes en este momento. Es la mejor aplicación de podcast y funciona en Android, iPhone y la web. Regístrate para sincronizar suscripciones a través de dispositivos.