


KServe: GenAI Inference on Kubernetes Using vLLM
This episode of the Exploring Modern AI in Tamil podcast provides a step-by-step narrative for deploying Llama3 using the vLLM backend; a few illustrative configuration sketches follow the topic list below.
- Breaks down the core concepts of KServe components for a beginner audience.
- Provides simple explanations for why specific inference runtimes are selected for Llama3.
- Explains how to configure KEDA for event-driven autoscaling of this deployment (see the ScaledObject sketch after this list).
- Details how to use node selectors and tolerations to pin pods to GPU nodes.
- Describes how to configure a Persistent Volume Claim for model storage.
- Includes best practices for securing deployments with custom CA certificates and secrets.
- Explains how to configure worker specifications for multi-node and multi-GPU inference setups.
- Describes using multiple storage URIs for fetching artifacts from various locations.
- Compares Single versus Multiple Storage URI configurations for model data.
- Outlines methods for authenticating cloud storage with KServe secrets.
- Details how to monitor inference performance using OpenTelemetry and custom Prometheus metrics.
- Describes configuring push-based metrics with OpenTelemetry to improve autoscaling responsiveness.
- Contrasts the KEDA polling mechanism with real-time push metrics from OpenTelemetry.
- Explains trade-offs when selecting between HPA and KEDA for model autoscaling.
- Describes how to correctly calculate tensor-parallel and pipeline-parallel size configurations (a worked example follows this list).
- Explains how to manage Hugging Face token secrets for secure model downloads.
- Outlines the full production lifecycle from model storage configuration to performance monitoring.
- Details advanced optimization strategies like KV cache offloading and streaming API usage.
- Highlights techniques to reduce model latency during high concurrency inference scenarios.
- Compares the vLLM backend performance benefits against the standard Hugging Face backend.
- Simplifies the core concepts of InferenceServices for developers new to KServe.
- Analyzes the infrastructure trade-offs between serverless and raw deployment modes.
- Provides a developer roadmap for setting up authentication and testing inference endpoints.
- Summarizes architectural best practices for scaling production model serving environments.
- Compares the trade-offs between using Knative serverless versus raw deployment modes.
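
For readers who want to follow along, here is a minimal sketch of the kind of InferenceService manifest the episode walks through. It is not the episode's exact manifest: the model URI, namespace, node labels, taints, GPU counts, and secret name are illustrative assumptions; the `huggingface` model format with a vLLM backend is the pattern KServe documents for LLM serving.

```python
# A minimal sketch (not the episode's exact manifest): build a KServe
# InferenceService for Llama3 on the Hugging Face runtime with the vLLM
# backend, pinned to GPU nodes. Names, labels, and sizes are assumptions.
import yaml  # pip install pyyaml

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama3-vllm", "namespace": "llm"},  # hypothetical
    "spec": {
        "predictor": {
            # Pin pods to GPU nodes (labels and taints are cluster-specific).
            "nodeSelector": {"nvidia.com/gpu.present": "true"},
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
            "model": {
                "modelFormat": {"name": "huggingface"},
                # Could also be a pvc:// or s3:// URI, as the episode notes.
                "storageUri": "hf://meta-llama/Meta-Llama-3-8B-Instruct",
                "resources": {
                    "limits": {"nvidia.com/gpu": "1", "memory": "24Gi"},
                    "requests": {"nvidia.com/gpu": "1", "memory": "24Gi"},
                },
                "env": [{
                    # Hugging Face token pulled from a Kubernetes Secret.
                    "name": "HF_TOKEN",
                    "valueFrom": {"secretKeyRef": {
                        "name": "hf-token", "key": "token"}},  # hypothetical
                }],
            },
        }
    },
}

# Render to YAML and apply with: kubectl apply -f llama3-isvc.yaml
print(yaml.safe_dump(inference_service, sort_keys=False))
```

In serverless mode KServe routes this spec through Knative; in raw deployment mode the same spec produces a plain Deployment and Service, which is the distinction the serverless-versus-raw bullets above refer to.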
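The KEDA discussion can likewise be pictured with a ScaledObject. This is a sketch under assumptions: the target Deployment name, Prometheus address, query, and threshold are placeholders, and a raw-deployment KServe install is assumed so that KEDA can scale the underlying Deployment directly.

```python
# Sketch of a KEDA ScaledObject that scales the Llama3 predictor on a
# Prometheus metric. All names, addresses, and thresholds are assumptions.
import yaml

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llama3-vllm-scaler", "namespace": "llm"},
    "spec": {
        # Assumes KServe raw deployment mode, where the predictor is a plain
        # Deployment that KEDA can target by name (hypothetical name).
        "scaleTargetRef": {"name": "llama3-vllm-predictor"},
        "minReplicaCount": 1,
        "maxReplicaCount": 4,
        "pollingInterval": 15,  # KEDA polls; contrast with OTel push metrics
        "triggers": [{
            "type": "prometheus",
            "metadata": {
                "serverAddress": "http://prometheus.monitoring:9090",
                "query": "sum(rate(vllm:num_requests_running[1m]))",
                "threshold": "4",
            },
        }],
    },
}

print(yaml.safe_dump(scaled_object, sort_keys=False))
```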
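The tensor-parallel and pipeline-parallel sizing rule mentioned in the list reduces to simple arithmetic: the product of the two sizes must equal the total number of GPUs across the nodes serving the model. The node and GPU counts below are assumptions for illustration, using the common starting point of tensor parallelism within a node and pipeline parallelism across nodes.

```python
# Worked example of the sizing rule: tensor_parallel_size multiplied by
# pipeline_parallel_size must equal the total GPU count. The node and GPU
# counts are illustrative assumptions.
num_nodes = 2          # head node plus one worker (workerSpec replicas)
gpus_per_node = 4

# Common starting point: tensor parallelism inside a node (fast interconnect),
# pipeline parallelism across nodes (slower network links).
tensor_parallel_size = gpus_per_node      # 4
pipeline_parallel_size = num_nodes        # 2

total_gpus = num_nodes * gpus_per_node    # 8
assert tensor_parallel_size * pipeline_parallel_size == total_gpus

# These values would be passed to vLLM, e.g.
# --tensor-parallel-size 4 --pipeline-parallel-size 2
print(tensor_parallel_size, pipeline_parallel_size, total_gpus)
```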
By Sivakumar Viyalan