
Content provided by Nicolay Gerold. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Nicolay Gerold or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://es.player.fm/legal.

Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage | ep 10

45:33
 

In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.

Summary by Section

Introduction

  • Anjan Banerjee, a data architect, discusses building complex AI and data systems
  • Explains the basics of data architecture using Lego and chat app examples

Sources and Tools

  • Identifying data sources is the first step in designing a data architecture
  • Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
  • Use one tool for most activities if possible, but specialized tools offer benefits
  • Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)

Airflow and Orchestration

  • Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
  • For less technical orgs, GUI-based tools like Talend, Alteryx may be better
  • AWS Step Functions and managed Airflow are improving native orchestration capabilities
  • For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte
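The core idea behind orchestrators such as Airflow is that tasks run in dependency order. The sketch below illustrates that idea with Python's standard library only; it is not Airflow code, and the task names and the `run_pipeline` helper are hypothetical, chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "clean": {"extract_orders", "extract_customers"},
    "load_warehouse": {"clean"},
}


def run_pipeline(dependencies, actions):
    """Run every task once, only after all of its upstream tasks have run."""
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # call the function registered for this task
        executed.append(task)
    return executed


# Stand-in task bodies: each one just records its own name.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
order = run_pipeline(PIPELINE, actions)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is where much of the learning curve comes from.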

AI and Data Processing

  • ML at the point of data acquisition is key for data-intensive use cases, because it avoids storing and processing petabytes in the cloud
  • TinyML and edge computing enable ML inference on the device (drones, manufacturing)
  • Cloud batch processing still dominates for user targeting and recommendations
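To make the edge-filtering idea concrete, here is a minimal sketch of a device that only forwards unusual readings to the cloud. A simple z-score test stands in for a real on-device model, and the function name, threshold, and sample data are assumptions made for this example.

```python
from statistics import mean, pstdev


def select_for_upload(readings, z_threshold=3.0):
    """Return only the readings that deviate strongly from the batch mean.

    A z-score test stands in for a real on-device ML model here.
    """
    if len(readings) < 2:
        # Too little data to judge what is unusual, so send everything.
        return list(readings)
    mu = mean(readings)
    sigma = pstdev(readings)
    if sigma == 0:
        return []  # all readings identical: nothing unusual to send
    return [r for r in readings if abs(r - mu) / sigma > z_threshold]


# Hypothetical vibration readings from one machine with a single spike.
batch = [1.0] * 20 + [9.0]
to_upload = select_for_upload(batch)
print(f"uploading {len(to_upload)} of {len(batch)} readings")
```

The trade-off is that anything filtered out on the device is gone, so the threshold decides both bandwidth savings and what later analysis can see.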

Data Lakes and Storage

  • Storage choice depends on data types, use cases, cloud ecosystem
  • Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
  • Pulling data into a separate system is often needed for advanced analytics that go beyond what the source system supports

Data Quality and Standardization

  • "Poka-yoke" error-proofing of input screens is vital for downstream data quality
  • Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion
  • Complexity arises with multi-region compliance (GDPR, CCPA), which requires encryption and sanitization
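As an illustration of imposing rules at ingestion, the sketch below rejects incomplete records and converts timestamps to UTC. The field names in `REQUIRED_FIELDS` and the `normalize_record` helper are hypothetical; a real pipeline would likely express these rules in its own schema or validation tooling.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("user_id", "event", "timestamp")  # hypothetical schema


def normalize_record(record):
    """Validate one incoming record and return a copy with a UTC timestamp.

    Raises ValueError when a rule is violated, so bad rows are rejected
    at ingestion instead of surfacing downstream.
    """
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    ts = datetime.fromisoformat(record["timestamp"])
    if ts.tzinfo is None:
        # Refuse to guess the time zone of a naive timestamp.
        raise ValueError("timestamp has no UTC offset")
    clean = dict(record)
    clean["timestamp"] = ts.astimezone(timezone.utc).isoformat()
    return clean


row = {"user_id": "u1", "event": "login", "timestamp": "2024-07-01T09:30:00+02:00"}
normalized = normalize_record(row)
print(normalized["timestamp"])  # 2024-07-01T07:30:00+00:00
```

Rejecting naive timestamps rather than assuming a zone is a deliberate choice in this sketch: a silent guess is exactly the kind of error that is hard to find later.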

Hot Takes and Wishes

  • Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
  • Automated data set joining and entity resolution across systems would be a game-changer
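To show why entity resolution across systems is hard to automate, here is a deliberately naive sketch that links records by string similarity. It is not a method discussed in the episode; the helper names, the 0.85 threshold, and the sample company names are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase a name and strip commas, periods, and extra spaces."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def match_entities(left, right, threshold=0.85):
    """Pair each name in `left` with its most similar name in `right`.

    A pair is kept only if the similarity ratio reaches `threshold`.
    """
    matches = {}
    for a in left:
        best, best_score = None, 0.0
        for b in right:
            score = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        if best_score >= threshold:
            matches[a] = best
    return matches


# Hypothetical customer names from two systems.
crm = ["Acme Corp.", "Globex Inc"]
billing = ["ACME Corp", "Initech LLC"]
links = match_entities(crm, billing)
print(links)  # {'Acme Corp.': 'ACME Corp'}
```

Name similarity alone breaks down quickly with abbreviations, typos, and renamed entities, which is part of why a dependable automated solution would be so valuable.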

Anjan Banerjee:

Nicolay Gerold:

00:00 Understanding Data Architecture

12:36 Choosing the Right Tools

20:36 The Benefits of Serverless Functions

21:34 Integrating AI in Data Acquisition

24:31 The Trend Towards Single Node Engines

26:51 Choosing the Right Database Management System and Storage

29:45 Adding Additional Storage Components

32:35 Reducing Human Errors for Better Data Quality

39:07 Overhyped and Underutilized Tools

Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution


30 episodes
