Kaltura reduces observability operational costs by 60% with Amazon OpenSearch Service

This post is co-written with Ido Ziv from Kaltura.

As organizations grow, managing observability across multiple teams and applications becomes increasingly complex. Logs, metrics, and traces generate vast amounts of data, making it challenging to maintain performance, reliability, and cost-efficiency.

At Kaltura, an AI-infused video-first company serving millions of users across hundreds of applications, observability is mission-critical. Understanding system behavior at scale isn’t just about troubleshooting—it’s about providing seamless experiences for customers and employees alike. But achieving effective observability at this scale comes with challenges: managing spans; correlating logs, traces, and events across distributed systems; and maintaining visibility without overwhelming teams with noise. Balancing granularity, cost, and actionable insights requires constant tuning and thoughtful architecture.

In this post, we share how Kaltura transformed its observability strategy and technology stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service—achieving longer log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.

Observability challenges at scale

Kaltura ingests over 8 TB of logs and traces daily, processing more than 20 billion events across 6 production AWS Regions and over 200 applications, with log spikes reaching up to 6 GB per second. This immense data volume, combined with a highly distributed architecture, created significant challenges in observability.

Historically, Kaltura relied on a SaaS-based observability solution that met initial requirements but became increasingly difficult to scale. As the platform evolved, teams generated disparate log formats, applied retention policies that no longer reflected data value, and operated more than 10 organically grown observability sources. The lack of standardization and visibility required extensive manual effort to correlate data, maintain pipelines, and troubleshoot issues, leading to rising operational complexity and fixed costs that didn’t scale efficiently with usage.

Kaltura’s DevOps team recognized the need to reassess their observability solution and began exploring a variety of options, from self-managed platforms to fully managed SaaS offerings. After a comprehensive evaluation, they made the strategic decision to migrate to OpenSearch Service, using its advanced features such as Amazon OpenSearch Ingestion, the Observability plugin, UltraWarm storage, and Index State Management.

Solution overview

Kaltura created a new AWS account dedicated to observability, where OpenSearch Service was deployed. Logs and traces were collected from different accounts and producers, such as microservices on Amazon Elastic Kubernetes Service (Amazon EKS) and services running on Amazon Elastic Compute Cloud (Amazon EC2). By using AWS services such as AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and Amazon CloudWatch, Kaltura was able to meet the standards required for a production-grade system while keeping security and reliability in mind. The following figure shows a high-level design of the environment setup.

Ingestion

As seen in the following diagram, logs are shipped using log shippers, also known as collectors. In Kaltura’s case, they used Fluent Bit.
A log shipper is a tool designed to collect, process, and transport log data from various sources to a centralized location, such as a log analytics platform, a log management system, or an aggregator. Fluent Bit was used for all sources and also provided lightweight processing capabilities. Fluent Bit was deployed as a DaemonSet in Kubernetes. The application development teams didn’t change their code, because the Fluent Bit pods read the stdout of the application pods. The following code is an example of the Fluent Bit configuration for Amazon EKS:

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kube.*
        Skip_Long_Lines   On
        multiline.parser  docker, cri

    [FILTER]
        alias             k8s
        # kubernetes filter to parse all logs
        Name              kubernetes
        Match             kube.*
        Kube_Tag_Prefix   kube.var.log.containers.
        Annotations       On
        Labels            Off
        Merge_Log         On
        Keep_Log          Off
        Kube_URL          https://kubernetes.default.svc.cluster.local:443

    [FILTER]
        alias             apps
        Name              rewrite_tag
        Match             kube.*
        Rule              $kubernetes['annotations']['kaltura.com/observability'] ^apps$

    [OUTPUT]
        Name              http
        Match             apps.*
        Alias             apps
        Host              xxxxx.us-east-1.osis.amazonaws.com
        Port              443
        URI               /log/apps
        Format            json
        aws_auth          true
        aws_region        us-east-1
        aws_service       osis
        aws_role_arn      arn:aws:iam::xxxxx:role/osis-ingestion-role
        Log_Level         trace
        tls               On

Spans and traces were collected directly from the application layer using a seamless integration approach. To facilitate this, Kaltura deployed an OpenTelemetry (OTel) Collector using the OpenTelemetry Operator for Kubernetes. Additionally, the team developed a custom OTel code library, which was incorporated into the application code to efficiently capture and log traces and spans, providing comprehensive observability across their system.

Data from Fluent Bit and the OpenTelemetry Collector was sent to OpenSearch Ingestion, a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and Amazon OpenSearch Serverless collections. Each producer sent data to a specific pipeline, one for logs and one for traces, where data was transformed, aggregated, enriched, and normalized before being sent to OpenSearch Service. The trace pipeline used the otel_trace and service_map processors, following the OpenSearch Ingestion OpenTelemetry trace analytics blueprint.
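The trace pipeline itself isn’t shown in this post. The following is a minimal sketch of what such a pipeline can look like, based on the publicly documented OpenTelemetry trace analytics blueprint rather than Kaltura’s actual configuration. The ingest path, host, and role variables are placeholders, and processor names vary slightly between Data Prepper versions (for example, otel_trace_raw versus otel_traces, service_map_stateful versus service_map):

    version: "2"
    otel-trace-pipeline:
      source:
        otel_trace_source:
          path: "/trace/apps"                # placeholder path the OTel Collector exports to
      processor:
        - trace_peer_forwarder:              # groups spans of the same trace on one worker
      sink:
        - pipeline:
            name: "raw-trace-pipeline"
        - pipeline:
            name: "service-map-pipeline"
    raw-trace-pipeline:
      source:
        pipeline:
          name: "otel-trace-pipeline"
      processor:
        - otel_trace_raw:                    # converts OTLP spans into trace analytics documents
      sink:
        - opensearch:
            hosts: ["${opensearch_host}"]
            index_type: trace-analytics-raw  # raw span indexes used by Trace Analytics
            aws:
              sts_role_arn: "${sts_role_arn}"
              region: "${region}"
    service-map-pipeline:
      source:
        pipeline:
          name: "otel-trace-pipeline"
      processor:
        - service_map:                       # derives service relationships from spans
      sink:
        - opensearch:
            hosts: ["${opensearch_host}"]
            index_type: trace-analytics-service-map
            aws:
              sts_role_arn: "${sts_role_arn}"
              region: "${region}"

Splitting the stream into a raw span sub-pipeline and a service map sub-pipeline is what lets Trace Analytics show both individual spans and the service map from the same OpenTelemetry data.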
The following code is an example of the OpenSearch Ingestion pipeline for logs:

    version: "2"
    entry-pipeline:
      source:
        http:
          path: "/log/apps"
      processor:
        - add_entries:
            entries:
              - key: "log_type"
                value: "default"
              - key: "log_type"
                value: "api"
                add_when: 'contains(/filename, "api.log")'
                overwrite_if_key_exists: true
              - key: "log_type"
                value: "stats"
                add_when: 'contains(/filename, "stats.log")'
                overwrite_if_key_exists: true
              - key: "log_type"
                value: "event"
                add_when: 'contains(/filename, "event.log")'
                overwrite_if_key_exists: true
              - key: "log_type"
                value: "login"
                add_when: 'contains(/filename, "login.log")'
                overwrite_if_key_exists: true
        - grok:
            grok_when: '/log_type == "api"'
            match:
              log: ['^\[%%{DATA:timestamp}] \[%%{DATA:logIp}\] \[%%{DATA:host}\] \[%%{WORD:id}\] %%{WORD:priorityName}\(%%{NUMBER:priority}\): \[memory: %%{DATA:memory} MB, real: %%{DATA:real}MB\] %%{GREEDYDATA:message}']
        - date:
            match:
              - key: timestamp
                patterns: ["dd-MMM-yyyy HH:mm:ss", "dd/MMM/yyyy:HH:mm:ss Z", "EEE MMM dd HH:mm:ss.SSSSSS yyyy"]
            destination: "@timestamp"
            output_format: "yyyy-MM-dd'T'HH:mm:ss"
        - rename_keys:
            entries:
              - from_key: "timestamp"
                to_key: "@timestamp"
                overwrite_if_to_key_exists: false
              - from_key: "date"
                to_key: "@timestamp"
                overwrite_if_to_key_exists: false
        - drop_events:
            drop_when: 'contains(/filename, "simplesamlphp.log")'
      sink:
        - opensearch:
            hosts: ["${opensearch_host}"]
            index: '$${/env}-api-$${/log_type}-app-logs'
            index_type: custom
            action: create
            bulk_size: 20
            aws:
              sts_role_arn: ${sts_role_arn}
              region: ${region}
            dlq:
              s3:
                bucket: "${bucket}"
                key_path_prefix: 'my-app-dlq-files'
                region: "${region}"
                sts_role_arn: "${sts_role_arn}"

The preceding example shows the use of processors such as grok, date, add_entries, rename_keys, and drop_events:

- add_entries: Adds a new field, log_type, based on the filename. The default value is "default". If the filename contains specific substrings (such as api.log or stats.log), it assigns a more specific type.
- grok: Applies Grok parsing to logs of type "api". Extracts fields such as timestamp, logIp, host, priorityName, priority, memory, real, and message using a custom pattern.
- date: Parses timestamp strings into a standard datetime format. Stores the result in a field called @timestamp in ISO 8601 format. Handles multiple timestamp patterns.
- rename_keys: Renames timestamp or date to @timestamp. Does not overwrite @timestamp if it already exists.
- drop_events: Drops logs whose filename contains simplesamlphp.log. This is a filtering rule to ignore noisy or irrelevant logs.

The following is an example of an input log line:

    "log": "[25-Mar-2025 18:23:18] [127.0.0.1] [the-most-awesome-server-in-kaltura] [67e2f496cc321] INFO(6): [memory: 4.51 MB, real: 6MB] [request: 1] [time: 0.0263s / total: 0.0263s]",

After processing, we get the following output:

    "log_type": "api",
    "priorityName": "INFO",
    "memory": "4.51",
    "host": "the-most-awesome-server-in-kaltura",
    "real": "6",
    "priority": "6",
    "message": "[request: 1] [time: 0.0263s / total: 0.0263s]",
    "logIp": "127.0.0.1",
    "id": "67e2f496cc321",
    "@timestamp": "2025-03-25T18:23:18"

Kaltura followed several OpenSearch Ingestion best practices, such as:

- Including a dead-letter queue (DLQ) in the pipeline configuration. This can significantly help troubleshoot pipeline issues.
- Starting and stopping pipelines to optimize cost-efficiency, when possible (see the example after this list).
- During the proof of concept stage:
  - Installing Data Prepper locally for faster development iterations.
  - Disabling persistent buffering to expedite blue-green deployments.
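As a small illustration of the start/stop practice, OpenSearch Ingestion pipelines can be stopped and started with the AWS CLI; the pipeline name here is a placeholder, not Kaltura’s actual pipeline:

    # Stop a non-production pipeline outside working hours to save cost
    aws osis stop-pipeline --pipeline-name apps-log-pipeline

    # Start it again when ingestion should resume
    aws osis start-pipeline --pipeline-name apps-log-pipeline

    # Verify the current pipeline status
    aws osis get-pipeline --pipeline-name apps-log-pipeline --query 'Pipeline.Status'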
Achieving operational excellence with efficient log and trace management

Logs and traces play a vital role in identifying operational issues, but they come with unique challenges. First, they are time series data, which inherently evolves over time. Second, their value typically diminishes as time passes, making efficient management crucial. Third, they are append-only in nature.

With OpenSearch, Kaltura faced distinct trade-offs between cost, data retention, and latency. The goal was to make sure valuable data remained accessible to engineering teams with minimal latency, but the solution also needed to be cost-effective. Balancing these factors required thoughtful planning and optimization.

Data was ingested into OpenSearch data streams, which simplify the process of ingesting append-only time series data. Several Index State Management (ISM) policies were applied to different data streams, depending on log retention requirements. ISM policies handled moving indexes from hot storage to UltraWarm, and eventually deleting the indexes. This allowed a customizable and cost-effective solution, with low latency for querying new data and reasonable latency for querying historical data.

The following example ISM policy makes sure indexes are managed efficiently: rolled over, moved to different storage tiers based on their age and size, and eventually deleted after 60 days. If an action fails, it is retried with an exponential backoff strategy. In case of failures, notifications are sent to relevant teams to keep them informed.

    {
      "id": "retention",
      "policy": {
        "description": "production ISM",
        "default_state": "hot",
        "states": [
          {
            "name": "hot",
            "actions": [
              {
                "retry": {
                  "count": 5,
                  "backoff": "exponential",
                  "delay": "1h"
                },
                "rollover": {
                  "min_primary_shard_size": "30gb",
                  "copy_alias": false
                }
              }
            ],
            "transitions": [
              {
                "state_name": "warm",
                "conditions": {
                  "min_index_age": "2d"
                }
              }
            ]
          },
          {
            "name": "warm",
            "actions": [
              {
                "retry": {
                  "count": 5,
                  "backoff": "exponential",
                  "delay": "1h"
                },
                "warm_migration": {}
              }
            ],
            "transitions": [
              {
                "state_name": "cold",
                "conditions": {
                  "min_index_age": "14d"
                }
              }
            ]
          },
          {
            "name": "cold",
            "actions": [
              {
                "retry": {
                  "count": 5,
                  "backoff": "exponential",
                  "delay": "1h"
                },
                "cold_migration": {
                  "start_time": null,
                  "end_time": null,
                  "timestamp_field": "@timestamp",
                  "ignore": "none"
                }
              }
            ],
            "transitions": [
              {
                "state_name": "delete",
                "conditions": {
                  "min_index_age": "60d"
                }
              }
            ]
          },
          {
            "name": "delete",
            "actions": [
              {
                "retry": {
                  "count": 3,
                  "backoff": "exponential",
                  "delay": "1m"
                },
                "cold_delete": {}
              }
            ],
            "transitions": []
          }
        ],
        "ism_template": [
          {
            "index_patterns": [
              "*-logs"
            ],
            "priority": 50
          }
        ]
      }
    }
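To give a sense of how such a policy is operated, the following sketch uses the ISM REST API. The policy ID matches the example above, while the backing index name is a hypothetical data stream backing index, not one of Kaltura’s:

    # Register (or update) the policy shown above under the ID "retention"
    # (request body: the policy JSON above); its ism_template block then
    # attaches the policy automatically to new indexes matching *-logs
    PUT _plugins/_ism/policies/retention

    # Check which ISM state a data stream backing index is currently in
    GET _plugins/_ism/explain/.ds-prod-api-default-app-logs-000001

    # Manually retry a failed ISM action on that index, if needed
    POST _plugins/_ism/retry/.ds-prod-api-default-app-logs-000001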
{ "index_patterns": [ "*my-app-logs" ], "template": { "settings": { "index.number_of_shards": "32", "index.number_of_replicas": "0", "index.refresh_interval": "60s" }, "mappings": { "properties": { "priorityName": { "type": "keyword" }, "log_type": { "type": "keyword" }, "@timestamp": { "type": "date" }, "memory": { "type": "float" }, "host": { "type": "keyword" }, "pid": { "type": "keyword" }, "real": { "type": "float" }, "env": { "type": "keyword" }, "message": { "type": "text" }, "priority": { "type": "integer" }, "logIp": { "type": "ip" } } } }, "composed_of": [], "priority": "100", "_meta": { "flow": "simple" }, "data_stream": { "timestamp_field": { "name": "@timestamp" } }, "name": "my-app-logs" } Implementing role-based access control and user access The new observability platform is accessed by many types of users; internal users log in to OpenSearch Dashboards using SAML-based federation with Okta. The following diagram illustrates the user flow. Each user accesses the dashboards to view observability items relevant to their role. Fine-grained access control (FGAC) is enforced in OpenSearch using built-in IAM role and SAML group mappings to implement role-based access control (RBAC).When users log in to the OpenSearch domain, they are automatically routed to the appropriate tenant based on their assigned role. This setup makes sure developers can create dashboards tailored to debugging within development environments, and support teams can build dashboards focused on identifying and troubleshooting production issues. The SAML integration alleviates the need to manage internal OpenSearch users entirely. For each role in Kaltura, a corresponding OpenSearch role was created with only the necessary permissions. For instance, support engineers are granted access to the monitoring plugin to create alerts based on logs, whereas QA engineers, who don’t require this functionality, are not granted that access. The following screenshot shows the role of the DevOps engineers defined with cluster permissions. These users are routed to their own dedicated DevOps tenant, to which they only have write access. This makes it possible for different users from different roles in Kaltura to create the dashboard items that focus on their priorities and needs. OpenSearch supports backend role mapping; Kaltura mapped the Okta group to the role so when a user logs in from Okta, they automatically get assigned based on their role. This also works with IAM roles to facilitate automations in the cluster using external services, such as OpenSearch Ingestion pipelines, as can be seen in the following screenshot. Using observability features and service mapping for enhanced trace and log correlation After a user is logged in, they can use the Observability plugins, view surrounding events in logs, correlate logs and traces, and use the Trace Analytics plugin. Users can inspect traces and spans, and group traces with latency information using built-in dashboards. Users can also drill down to a specific trace or span and correlate it back to log events. The service_map processor used in OpenSearch Ingestion sends OpenTelemetry data to create a distributed service map for visualization in OpenSearch Dashboards. Using the combined signals of traces and spans, OpenSearch discovers the application connectivity and maps them to a service map. After OpenSearch ingests the traces and spans from Otel, they are aggregated to groups according to paths and trends. 
Post-migration enhancements

After the migration, a strong developer community emerged within Kaltura that embraced the new observability solution. As adoption grew, so did requests for new features and enhancements aimed at improving the overall developer experience.

One key improvement was extending log retention. Kaltura achieved this by re-ingesting historical logs from Amazon Simple Storage Service (Amazon S3) using a dedicated OpenSearch Ingestion pipeline with Amazon S3 read permissions. With this enhancement, teams can access and analyze logs from up to a year ago using the same familiar dashboards and filters.

In addition to monitoring EKS clusters and EC2 instances, Kaltura expanded its observability stack by integrating more AWS services. Amazon API Gateway and AWS Lambda were introduced to support log ingestion from external vendors, allowing for seamless correlation with existing data and broader visibility across systems.

Finally, to empower teams and promote autonomy, data stream templates and ISM policies are managed directly by developers within their own repositories. By using infrastructure as code tools like Terraform, developers can define index mappings, alerts, and dashboards as code—versioned in Git and deployed consistently across environments.

Conclusion

Kaltura successfully implemented a smart log retention strategy, extending real-time retention from 5 days for all log types to 30 days for critical logs, while maintaining cost-efficiency through the use of UltraWarm nodes. This approach led to a 60% reduction in costs compared to their previous solution. Additionally, Kaltura consolidated their observability platform, streamlining operations by merging 10 separate systems into a unified, all-in-one solution. This consolidation not only improved operational efficiency but also sparked increased engagement from developer teams, driving feature requests, fostering internal design collaborations, and attracting early adopters for new enhancements.

If Kaltura’s journey has inspired you and you’re thinking about implementing a similar solution in your organization, consider these steps:

- Start by understanding the requirements and setting expectations with the engineering teams in your organization.
- Start with a quick proof of concept to get hands-on experience.

Refer to the following resources to help you get started:

- Observability overview in OpenSearch
- Microservice Observability with Amazon OpenSearch Service Workshop
- Key concepts for Amazon OpenSearch Ingestion
- Amazon OpenSearch Service Migrations

About the authors

Ido Ziv is a DevOps team leader at Kaltura with over 6 years of experience. His hobbies include sailing and Kubernetes (but not at the same time).

Roi Gamliel is a Senior Solutions Architect helping startups build on AWS. He is passionate about the OpenSearch Project, helping customers fine-tune their workloads and maximize results.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to use data, gain insights, and derive value.