Monitoring and Observability Resources for Engineers

DZone's Featured Monitoring and Observability Resources

Observations on Cloud-Native Observability: A Journey From the Foundations of Observability to Surviving Its Challenges at Scale

By Eric D. Schabell

CORE

Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Cloud native and observability are an integral part of developer lives. Understanding their responsibilities within observability at scale helps developers tackle the challenges they are facing on a daily basis. There is more to observability than just collecting and storing data, and developers are essential to surviving these challenges. Observability Foundations Gone are the days of monitoring a known application environment, debugging services within our development tooling, and waiting for new resources to deploy our code to. This has become dynamic, agile, and quickly available with auto-scaling infrastructure in the final production deployment environments. Developers are now striving to observe everything they are creating, from development to production, often owning their code for the entire lifecycle. The tooling from days of old, such as Nagios and HP OpenView, can't keep up with constantly changing cloud environments that contain thousands of microservices. The infrastructure for cloud-native deployments is designed to dynamically scale as needed, making it even more essential for observability platforms to help condense all that data noise to detect trends leading to downtime before they happen. Splintering of Responsibilities in Observability Cloud-native complexity not only changed the developer world but also impacted how organizations are structured. The responsibilities of creating, deploying, and managing cloud-native infrastructure have split into a series of new organizational teams. Developers are being tasked with more than just code creation and are expected to adopt more hybrid roles within some of these new teams. Observability teams have been created to focus on a specific aspect of the cloud-native ecosystem to provide their organization a service within the cloud infrastructure. In Table 1, we can see the splintering of traditional roles in organizations into these teams with specific focuses. Table 1. Who's who in the observability game Team Focus maturity goals DevOps Automation and optimization of the app development lifecycle, including post-launch fixes and updates Early stages: developer productivity Platform engineering Designing and building toolchains and workflows that enable self-service capabilities for developers Early stages: developer maturity and productivity boost CloudOps Provides organizations proper (cloud) resource management, using DevOps principles and IT operations applied to cloud-based architectures to speed up business processes Later stages: cloud resource management, costs, and business agility SRE All-purpose role aiming to manage reliability for any type of environment; a full-time job avoiding downtime and optimizing performance of all apps and supporting infrastructure, regardless of whether it's cloud native Early to late stages: on-call engineers trying to reduce downtime Central observability team Responsible for defining observability standards and practices, delivering key data to engineering teams, and managing tooling and observability data storage Later stages, owning: Define monitoring standards and practices Deliver monitoring data to engineering teams Measure reliability and stability of monitoring solutions Manage tooling and storage of metrics data To understand how these teams work together, imagine a large, mature, cloud native organization that has all the teams featured in Table 1: The DevOps team is the first line for standardizing how code is created, managed, tested, updated, and deployed. They work with toolchains and workflow provided by the platform engineering team. DevOps advises on new tooling and/or workflows, creating continuous improvements to both. A CloudOps team focuses on cloud resource management and getting the most out of the budgets spent on the cloud by the other teams. An SRE team is on call to manage reliability, avoiding downtime for all supporting infrastructure in the organization. They provide feedback for all the teams to improve tools, processes, and platforms. The overarching central observability team sets the observability standards for all teams to adhere to, delivering the right observability data to the right teams and managing tooling and data storage. Why Observability Is Important to Cloud Native Today, cloud native usage has seen such growth that developers are overwhelmed by their vast responsibilities that go beyond just coding. The complexity introduced by cloud-native environments means that observability is becoming essential to solving many of the challenges developers are facing. Challenges Increasing cloud native complexity means that developers are providing more code faster and passing more rigorous testing to ensure that their applications work at cloud native scale. These challenges expanded the need for observability within what was traditionally the developers' coding environment. Not only do they need to provide code and testing infrastructure for their applications, they are also required to instrument that code so that business metrics can be monitored. Over time, developers learned that fully automating metrics was overkill, with much of that data being unnecessary. This led developers to fine tune their instrumentation methods and turn to manual instrumentation, where only the metrics they needed were collected. Another challenge arises when decisions are made to integrate existing application landscapes with new observability practices in an organization. The time developers spend manually instrumenting existing applications so that they provide the needed data to an observability platform is an often overlooked burden. New observability tools designed to help with metrics, logs, and traces are introduced to the development teams — leading to more challenges for developers. Often, these tools are mastered by few, leading to siloed knowledge, which results in organizations paying premium prices for advanced observability tools only to have them used as if one is engaging in observability as a toy. Finally, when exploring the ingested data from our cloud infrastructure, the first thing that becomes obvious is that we don't need to keep everything that is being ingested. We need the ability to have control over our telemetry data and find out what is unused by our observability teams. There are some questions we need to answer about how we can: Identify ingested data not used in dashboards, alerting rules, nor touched in ad hoc queries by our observability teams Control telemetry data with aggregation and rules before we put it into expensive, longer-term storage Use only telemetry data needed to support the monitoring of our application landscape Tackling the flood of cloud data in such a way as to filter out the unused telemetry data, keeping only that which is applied for our observability needs, is crucial to making this data valuable to the organization. Cloud Native at Scale The use of cloud-native infrastructure brings with it a lot of flexibility, but when done at scale, the small complexities can become overwhelming. This is due to the premise of cloud native where we describe how our infrastructure should be set up, how our applications and microservices should be deployed, and finally, how it automatically scales when needed. This approach reduces our control over how our production infrastructure reacts to surges in customer usage of an organization's services. Empowering Developers Empowering developers starts with platform engineering teams that focus on developer experiences. We create developer experiences in our organization that treat observability as a priority, dedicating resources for creating a telemetry strategy from day one. In this culture, we're setting up development teams for success with cloud infrastructure, using observability alongside testing, continuous integration, and continuous deployment. Developers are not only owning the code they deliver but are now encouraged and empowered to create, test, and own the telemetry data from their applications and microservices. This is a brave new world where they are the owners of their work, providing agility and consensus within the various teams working on cloud solutions. Rising to the challenges of observability in a cloud native world is a success metric for any organization, and they can't afford to get it wrong. Observability needs to be front of mind with developers, considered a first-class citizen in their daily workflows, and consistently helping them with challenges they face. Artificial Intelligence and Observability Artificial intelligence (AI) has risen in popularity within not only developer tooling but also in the observability domain. The application of AI in observability falls within one of two use cases: Monitoring machine learning (ML) solutions or large language model (LLM) systems Embedding AI into observability tooling itself as an assistant The first case is when you want to monitor specific AI workloads, such as ML or LLMs. They can be further split into two situations that you might want to monitor, the training platform and the production platform. Training infrastructure and the process involved can be approached just like any other workload: easy-to-achieve monitoring using instrumentation and existing methods, such as observing specific traces through a solution. This is not the complete monitoring process that goes with these solutions, but out-of-the-box observability solutions are quite capable of supporting infrastructure and application monitoring of these workloads. The second case is when AI assistants, such as chatbots, are included in the observability tooling that developers are exposed to. This is often in the form of a code assistant, such as one that helps fine tune a dashboard or query our time series data ad hoc. While these are nice to have, organizations are very mindful of developer usage when inputting queries that include proprietary or sensitive data. It's important to understand that training these tools might include using proprietary data in their training sets, or even the data developers input, to further train the agents for future query assistance. Predicting the future of AI-assisted observability is not going to be easy as organizations consider their data one of their top valued assets and will continue to protect its usage outside of their control to help improve tooling. To that end, one direction that might help adoption is to have agents trained only on in-house data, but that means the training data is smaller than publicly available agents. Cloud-Native Observability: The Developer Survival Pattern While we spend a lot of time on tooling as developers, we all understand that tooling is not always the fix for the complex problems we face. Observability is no different, and while developers are often exposed to the mantra of metrics, logs, and traces for solving their observability challenges, this is not the path to follow without considering the big picture. The amount of data generated in cloud-native environments, especially at scale, makes it impossible to continue collecting all data. This flood of data, the challenges that arise, and the inability to sift through the information to find the root causes of issues becomes detrimental to the success of development teams. It would be more helpful if developers were supported with just the right amount of data, in just the right forms, and at the right time to solve issues. One does not mind observability if the solution to problems are found quickly, situations are remediated faster, and developers are satisfied with the results. If this is done with one log line, two spans from a trace, and three metric labels, then that's all we want to see. To do this, developers need to know when issues arise with their applications or services, preferably before it happens. They start troubleshooting with data that has been determined by their instrumented applications to succinctly point to areas within the offending application. Any tooling allows the developer who's investigating to see dashboards reporting visual information that directs them to the problem and potential moment it started. It is crucial for developers to be able to remediate the problem, maybe by rolling back a code change or deployment, so the application can continue to support customer interactions. Figure 1 illustrates the path taken by cloud native developers when solving observability problems. The last step for any developer is to determine how issues encountered can be prevented going forward. Figure 1. Observability pattern Conclusion Observability is essential for organizations to succeed in a cloud native world. The splintering of responsibilities in observability, along with the challenges that cloud-native environments bring at scale, cannot be ignored. Understanding the challenges that developers face in cloud native organizations is crucial to achieving observability happiness. Empowering developers, providing ways to tackle observability challenges, and understanding how the future of observability might look are the keys to handling observability in modern cloud environments. DZone Refcard resources: Full-Stack Observability Essentials by Joana Carvalho Getting Started With OpenTelemetry by Joana Carvalho Getting Started With Prometheus by Colin Domoney Getting Started With Log Management by John Vester Monitoring and the ELK Stack by John Vester This is an excerpt from DZone's 2024 Trend Report,Cloud Native: Championing Cloud Development Across the SDLC.Read the Free Report More

Implementing SLAs, SLOs, and SLIs: A Practical Guide for SREs

By Karthigayan Devan

In today’s Information Technology (IT) digital transformation world, many applications are getting hosted in cloud environments every day. Monitoring and maintaining these applications daily is very challenging and we need proper metrics in place to measure and take action. This is where the importance of implementing SLAs, SLOs, and SLIs comes into the picture and it helps in effective monitoring and maintaining the system performance. Defining SLA, SLO, SLI, and SRE What Is an SLA? (Commitment) A Service Level Agreement is an agreement that exists between the cloud provider and client/user about measurable metrics; for example, uptime check, etc. This is normally handled by the company's legal department as per business and legal terms. It includes all the factors to be considered as part of the agreement and the consequences if it fails; for example, credits, penalties, etc. It is mostly applicable for paid services and not for free services. What Is an SLO? (Objective) A Service Level Objective is an objective the cloud provider must meet to satisfy the agreement made with the client. It is used to mention specific individual metric expectations that cloud providers must meet to satisfy a client’s expectation (i.e., availability, etc). This will help clients to improve overall service quality and reliability. What Is an SLI? (How Did We Do?) A Service Level Indicator measures compliance with an SLO and actual measurement of SLI. It gives a quantified view of the service's performance (i.e., 99.92% of latency, etc.). Who Is an SRE? A Site Reliability Engineer is an engineer who always thinks about minimizing gaps between software development and operations. This term is slightly related to DevOps, which focuses on identifying the gaps. An SRE creates and uses automation tools to monitor and observe software reliability in production environments. In this article, we will discuss the importance of SLOs/SLIs/SLAs and how to implement them into production applications by a Site Reliability Engineer (SRE). Implementation of SLOs and SLIs Let’s assume we have an application service that is up and running in a production environment. The first step is to determine what an SLO should be and what it should cover. Example of SLOs SLO = Target Above this target, GOOD Below this target, BAD: Needs an action item While setting up a Target, please do not consider it 100% reliable. It is practically not possible and it fails most of the items due to patches, deployments, downtime, etc. This is where Error Budget (EB) comes into the picture. EB is the maximum amount of time that a service can fail without contractual consequences. For example: SLA = 99.99% uptime EB = 55 mins and 35 secs per year, or 4 mins and 23 secs per month, the system can go down without consequences. A step is how to measure this SLO, and it is where SLI comes into the picture, which is an indicator of the level of service that you are providing. Example of SLIs HTTP reqs = No. of success/total requests Common SLI Metrics Durability Response time Latency Availability Error rate Throughput Leverage automation of deployment monitoring and reporting tools to check SLIs and detect deviations from SLOs in real-time (i.e., Prometheus, Grafana, etc.). Category SLO SLI Availability 99.92% uptime/month X % of the time app is available Latency 92% of reqs with response time under 240 ms X average resp time for user reqs Error rate Less than 0.8% of requests result in errors X % of reqs that fail Challenges SLA: Normally, SLAs are written by business or legal teams with no input from technical teams, which results in missing key aspects to measure. SLO: Not able to measure or too broad to calculate SLI: There are too many metrics and differences in capturing and calculating the measures. It leads to lots of effort for the SREs and gives less beneficial results. Best Practices SLA: Involve the technical team when SLAs are written by the company's business/legal team and the provider. This will help to reflect exact tech scenarios into the agreement. SLO: This should be simple, and easily measurable to check, whether we are in line with objectives or not. SLI: Define all standard metrics to monitor and measure. It will help SREs to check the reliability and performance of the services. Conclusion Implementation of SLAs, SLOs, and SLIs should be included as part of the system requirements and design and it should be in continuous improvement mode. SREs need to understand and take responsibility for how the systems serve the business needs and take necessary measures to minimize the impact. More

Deep Work for Site Reliability Engineers

By Krishna Vinnakota

Parsing Structured Environment Variables in Rust

By Nicolas Fränkel

CORE

Beyond Sessions: Centering Users in Mobile App Observability

By Hanson Ho

Telemetry Pipelines Workshop: Avoiding Telemetry Data Loss With Fluent Bit

Are you ready to get started with cloud-native observability with telemetry pipelines? This article is part of a series exploring a workshop guiding you through the open source project Fluent Bit, what it is, a basic installation, and setting up the first telemetry pipeline project. Learn how to manage your cloud-native data from source to destination using the telemetry pipeline phases covering collection, aggregation, transformation, and forwarding from any source to any destination. In the previous article in this series, we explored what backpressure was, how it manifests in telemetry pipelines, and took the first steps to mitigate this with Fluent Bit. In this article, we look at how to enable Fluent Bit features that will help with avoiding telemetry data loss as we saw in the previous article. You can find more details in the accompanying workshop lab. Before we get started it's important to review the phases of a telemetry pipeline. In the diagram below we see them laid out again. Each incoming event goes from input to parser to filter to buffer to routing before they are sent to its final output destination(s). For clarity in this article, we'll split up the configuration into files that are imported into a main fluent bit configuration file we'll name workshop-fb.conf. Tackling Data Loss Previously, we explored how input plugins can hit their ingestion limits when our telemetry pipelines scale beyond memory limits when using default in-memory buffering of our events. We also saw that we can limit the size of our input plugin buffers to prevent our pipeline from failing on out-of-memory errors, but that the pausing of the ingestion can also lead to data loss if the clearing of the input buffers takes too long. To rectify this problem, we'll explore another buffering solution that Fluent Bit offers, ensuring data and memory safety at scale by configuring filesystem buffering. To that end, let's explore how the Fluent Bit engine processes data that input plugins emit. When an input plugin emits events, the engine groups them into a Chunk. The chunk size is around 2MB. The default is for the engine to place this Chunk only in memory. We saw that limiting in-memory buffer size did not solve the problem, so we are looking at modifying this default behavior of only placing chunks into memory. This is done by changing the property storage.type from the default Memory to Filesystem. It's important to understand that memory and filesystem buffering mechanisms are not mutually exclusive. By enabling filesystem buffering for our input plugin we automatically get performance and data safety Filesystem Buffering Tips When changing our buffering from memory to filesystem with the property storage.type filesystem, the settings for mem_buf_limit are ignored. Instead, we need to use the property storage.max_chunks_up for controlling the size of our memory buffer. Shockingly, when using the default settings the property storage.pause_on_chunks_overlimit is set to off, causing the input plugins not to pause. Instead, input plugins will switch to buffering only in the filesystem. We can control the amount of disk space used with storage.total_limit_size. If the property storage.pause_on_chunks_overlimit is set to on, then the buffering mechanism to the filesystem behaves just like our mem_buf_limit scenario demonstrated previously. Configuring Stressed Telemetry Pipeline In this example, we are going to use the same stressed Fluent Bit pipeline to simulate a need for enabling filesystem buffering. All examples are going to be shown using containers (Podman) and it's assumed you are familiar with container tooling such as Podman or Docker. We begin the configuration of our telemetry pipeline in the INPUT phase with a simple dummy plugin generating a large number of entries to flood our pipeline with as follows in our configuration file inputs.conf (note that the mem_buf_limit fix is commented out): # This entry generates a large amount of success messages for the workshop. [INPUT] Name dummy Tag big.data Copies 15000 Dummy {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah"} #Mem_Buf_Limit 2MB Now ensure the output configuration file outputs.conf has the following configuration: # This entry directs all tags (it matches any we encounter) # to print to standard output, which is our console. [OUTPUT] Name stdout Match * With our inputs and outputs configured, we can now bring them together in a single main configuration file. Using a file called workshop-fb.conf in our favorite editor, ensure the following configuration is created. For now, just import two files: # Fluent Bit main configuration file. # # Imports section. @INCLUDE inputs.conf @INCLUDE outputs.conf Let's now try testing our configuration by running it using a container image. The first thing that is needed is to ensure a file called Buildfile is created. This is going to be used to build a new container image and insert our configuration files. Note this file needs to be in the same directory as our configuration files, otherwise adjust the file path names: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4 COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf COPY ./inputs.conf /fluent-bit/etc/inputs.conf COPY ./outputs.conf /fluent-bit/etc/outputs.conf Now we'll build a new container image, naming it with a version tag as follows using the Buildfile and assuming you are in the same directory: $ podman build -t workshop-fb:v8 -f Buildfile STEP 1/4: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4 STEP 2/4: COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf --> a379e7611210 STEP 3/4: COPY ./inputs.conf /fluent-bit/etc/inputs.conf --> f39b10d3d6d0 STEP 4/4: COPY ./outputs.conf /fluent-bit/etc/outputs.conf COMMIT workshop-fb:v6 --> e74b2f228729 Successfully tagged localhost/workshop-fb:v8 e74b2f22872958a79c0e056efce66a811c93f43da641a2efaa30cacceb94a195 If we run our pipeline in a container configured with constricted memory, in our case, we need to give it around a 6.5MB limit, then we'll see the pipeline run for a bit and then fail due to overloading (OOM): $ podman run --memory 6.5MB --name fbv8 workshop-fb:v8 The console output shows that the pipeline ran for a bit; in our case, below to event number 862 before it hit the OOM limits of our container environment (6.5MB): ... [860] big.data: [[1716551898.202389716, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}] [861] big.data: [[1716551898.202389925, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}] [862] big.data: [[1716551898.202390133, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah blah"}] [863] big.data: [[1 <<<< CONTAINER KILLED WITH OOM HERE We can validate that the stressed telemetry pipeline actually failed on an OOM status by viewing our container, and inspecting it for an OOM failure to validate our backpressure worked: # Use the container name to inspect for reason it failed $ podman inspect fbv8 | grep OOM "OOMKilled": true, Already having tried in a previous lab to manage this with mem_buf_limit settings, we've seen that this also is not the real fix. To prevent data loss we need to enable filesystem buffering so that overloading the memory buffer means that events will be buffered in the filesystem until there is memory free to process them. Using Filesystem Buffering The configuration of our telemetry pipeline in the INPUT phase needs a slight adjustment by adding storage.type to as shown, set to filesystem to enable it. Note that mem_buf_limit has been removed: # This entry generates a large amount of success messages for the workshop. [INPUT] Name dummy Tag big.data Copies 15000 Dummy {"message":"true 200 success", "big_data": "blah blah blah blah blah blah blah blah blah"} storage.type filesystem We can now bring it all together in the main configuration file. Using a file called the following workshop-fb.conf in our favorite editor, update the file to include SERVICE configuration is added with settings for managing the filesystem buffering: # Fluent Bit main configuration file. [SERVICE] flush 1 log_Level info storage.path /tmp/fluentbit-storage storage.sync normal storage.checksum off storage.max_chunks_up 5 # Imports section @INCLUDE inputs.conf @INCLUDE outputs.conf A few words on the SERVICE section properties might be needed to explain their function: storage.path - Putting filesystem buffering in the tmp filesystem storage.sync- Using normal and turning off checksum processing storage.max_chunks_up - Set to ~10MB, amount of allowed memory for events Now it's time for testing our configuration by running it using a container image. The first thing that is needed is to ensure a file called Buildfile is created. This is going to be used to build a new container image and insert our configuration files. Note this file needs to be in the same directory as our configuration files, otherwise adjust the file path names: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4 COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf COPY ./inputs.conf /fluent-bit/etc/inputs.conf COPY ./outputs.conf /fluent-bit/etc/outputs.conf Now we'll build a new container image, naming it with a version tag, as follows using the Buildfile and assuming you are in the same directory: $ podman build -t workshop-fb:v9 -f Buildfile STEP 1/4: FROM cr.fluentbit.io/fluent/fluent-bit:3.0.4 STEP 2/4: COPY ./workshop-fb.conf /fluent-bit/etc/fluent-bit.conf --> a379e7611210 STEP 3/4: COPY ./inputs.conf /fluent-bit/etc/inputs.conf --> f39b10d3d6d0 STEP 4/4: COPY ./outputs.conf /fluent-bit/etc/outputs.conf COMMIT workshop-fb:v6 --> e74b2f228729 Successfully tagged localhost/workshop-fb:v9 e74b2f22872958a79c0e056efce66a811c93f43da641a2efaa30cacceb94a195 If we run our pipeline in a container configured with constricted memory (slightly larger value due to memory needed for mounting the filesystem) - in our case, we need to give it around a 9MB limit - then we'll see the pipeline running without failure: $ podman run -v ./:/tmp --memory 9MB --name fbv9 workshop-fb:v9 The console output shows that the pipeline runs until we stop it with CTRL-C, with events rolling by as shown below. ... [14991] big.data: [[1716559655.213181639, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah"}] [14992] big.data: [[1716559655.213182181, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah"}] [14993] big.data: [[1716559655.213182681, {}], {"message"=>"true 200 success", "big_data"=>"blah blah blah blah blah blah blah"}] ... We can now validate the filesystem buffering by looking at the filesystem storage. Check the filesystem from the directory where you started your container. While the pipeline is running with memory restrictions, it will be using the filesystem to store events until the memory is free to process them. If you view the contents of the file before stopping your pipeline, you'll see a messy message format stored inside (cleaned up for you here): $ ls -l ./fluentbit-storage/dummy.0/1-1716558042.211576161.flb -rw------- 1 username groupname 1.4M May 24 15:40 1-1716558042.211576161.flb $ cat fluentbit-storage/dummy.0/1-1716558042.211576161.flb ??wbig.data???fP?? ?????message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ?p???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ߲???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ?F???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ?d???message?true 200 success?big_data?'blah blah blah blah blah blah blah blah???fP?? ... Last Thoughts on Filesystem Buffering This solution is the way to deal with backpressure and other issues that might flood your telemetry pipeline and cause it to crash. It's worth noting that using a filesystem to buffer the events also introduces the limits of the filesystem being used. It's important to understand that just as memory can run out, so too can the filesystem storage reach its limits. It's best to have a plan to address any possible filesystem challenges when using this solution, but this is outside the scope of this article. This completes our use cases for this article. Be sure to explore this hands-on experience with the accompanying workshop lab. What's Next? This article walked us through how Fluent Bit filesystem buffering provides a data- and memory-safe solution to the problems of backpressure and data loss. Stay tuned for more hands-on material to help you with your cloud-native observability journey.

By Eric D. Schabell

CORE

Revisiting Observability: A Deep Dive Into the State of Monitoring, Costs, and Data Ownership in Today’s Market

Hey internet humans! I’ve recently re-entered the world of observability and monitoring realm after a short detour in the Internal Developer Portal space. Since my return, I have felt a strong urge to discuss the general sad state of observability in the market today. I still have a strong memory of myself, knee-deep in Kubernetes configs, drowning in a sea of technical jargon, not clearly knowing if I’ve actually monitored everything in my stack, deploying heavy agents, and fighting with engineering managers and devs just to get their code instrumented only to find out I don’t have half the stuff I thought I did. Sound familiar? Most of us have been there. It's like trying to find your way out of a maze with a blindfold on while someone repeatedly spins you around and gives you the wrong directions to the way out. Not exactly a walk in the park, right? The three pain points that are top-of-mind for me these days are: The state of instrumentation for observability The horrible surprise bills vendors are springing on customers and the insanely confusing pricing models that can’t even be calculated Ownership and storage of data - data residency issues, compliance, and control Instrumentation The monitoring community has got a fantastic new tool at its disposal: eBPF. Ever heard of it? It's a game-changing tech (a cheat code, if you will, to get around that horrible manual instrumentation) that allows us to trace what's going on in our systems without all the usual headaches. No complex setups, no intrusive instrumentation – just clear, detailed insights into our app's performance. With eBPF, we can dive deep into the inner workings of applications and infrastructure, capturing data at the kernel level with minimal overhead. It's like having X-ray vision for our software stack without the pain of having to corral all of the engineers to instrument the code manually. I’ve had first-hand experience in deploying monitoring solutions at scale during my tenure at companies like Datadog, Splunk, and, before microservices were cool, CA Technologies. I’ve seen the patchwork of APM, infrastructure, logs, OpenTelemetry, custom instrumentation, open-source, etc. that is often patched together (usually poorly) to just try and get at the basics. Each one of these comes usually at a high technical maintenance cost and requires SREs, platform engineers, developers, DevOps, etc. to all coordinate (also usually ineffectively) to instrument code, deploy everywhere they’re aware of, and cross their fingers just hoping they’re going to get most of what should be monitored. At this point, there are two things that happen: Not everything is monitored because we have no idea where everything is. We end up with far less than 100% coverage. We start having those cringe-worthy discussions on “should we monitor this thing” due to the sheer cost of monitoring, often costing more than the infrastructure our applications and microservices are running on. Let’s be clear: this isn’t a conversation we should be having. Indeed, OpenTelemetry is fantastic for a number of things: It solves vendor lock-in and has a much larger community working on it, but I must be brutally honest here: it takes A LOT OF WORK. It takes real collaboration between all of the teams to make sure everyone is instrumenting manually and that every single library we use is well supported, assuming we can properly validate that the legacy code we’re trying to instrument has what you think it does in it. From my observations, this generally results in an incomplete patchwork of things giving us a very incomplete picture 95% of the time. Circling back to eBPF technology: With proper deployment and some secret sauce, these are two core concerns we simply don’t have to worry about as long as there’s a simplified pricing model in place. We can get full-on 360-degree visibility in our environments with tracing, metrics, and logs without the hassle and without wondering if we can really afford to see everything. The Elephant in the Room: Cost and the Awful State of Pricing in the Observability Market Today If I’d have a penny for every time I’ve heard the saying, “I need an observability tool to monitor the cost of my observability tool." Traditional monitoring tools often come with a hefty price tag attached, and often one that’s a big fat surprise when we add a metric or a log line… especially when it’s at scale! It's not just about the initial investment – it's the unexpected overage bills that really sting. You see, these tools typically charge based on the volume of data ingested, and it's easy to underestimate just how quickly those costs can add up. We’ve all been there before - monitoring a Kubernetes cluster with hundreds of pods, each generating logs, traces, and metrics. Before we know it, we're facing a mountain of data and a surprise sky-high bill to match. Or perhaps we’ve decided we need a new facet to that metric and we got an unexpected massive charge for metric cardinality. Or maybe a dev decides it’s a great idea to add that additional log line to our high-volume application and our log bill grows exponentially overnight. It's a tough pill to swallow, especially when we're trying to balance the need for comprehensive and complete monitoring with budget constraints. I’ve seen customers receive multiple tens of thousands of dollars (sometimes multiple hundreds of thousands) in “overage” bills because some developer added a few extra log lines or because someone needed some additional cardinality in a metric. Those costs are very real for those very simple mistakes (when often there are no controls in place to keep them from happening). From my personal experience: I wish you the best of luck in trying to negotiate those bills down. You’re stuck now, as these companies have no interest in customers paying less when they get hit with those bills. As a customer-facing architect, I’ve had customers see red, and boy, that sucks. The ethics behind surprise pricing is dubious at best. That's when a modern solution should step in to save the day. By flipping the script on traditional pricing models, offering transparent pricing that's based on usage, not volume, ingest, egress, or some unknown metric that you have no idea how to calculate, we should be able to get specific about the cost of monitoring and set clear expectations knowing we can see everything end-to-end without sacrificing because the cost may be too high. With eBPF and a bit of secret sauce, we'll never have to worry about surprise overage charges again. We can know exactly what we are paying for upfront, giving us peace of mind and control over our monitoring costs. It's not just about cost – it's about value. We don’t just want a monitoring tool; we want a partner in our quest for observability. We want a team and community that is dedicated to helping us get the most out of our monitoring setup, providing guidance and support every step of the way. It must change from the impersonal, transactional approach of legacy vendors. Ownership and Storage of Data The next topic I'd like to touch upon is the importance of data residency, compliance, and security in the realm of observability solutions. In today's business landscape, maintaining control over where and how data is stored and accessed is crucial. Various regulations, such as GDPR (General Data Protection Regulation), require organizations to adhere to strict guidelines regarding data storage and privacy. Traditional cloud-based observability solutions may present challenges in meeting these compliance requirements, as they often store data on third-party servers dispersed across different regions. I’ve seen this happen and I’ve seen customers take extraordinary steps to avoid going to the cloud while employing massive teams of in-house developers just to keep their data within their walls. Opting for an observability solution that allows for on-premises data storage addresses these concerns effectively. By keeping monitoring data within the organization's data center, businesses gain greater control over its security and compliance. This approach minimizes the risk of unauthorized access or data breaches, thereby enhancing data security and simplifying compliance efforts. Additionally, it aligns with data residency requirements and regulations, providing assurance to stakeholders regarding data sovereignty and privacy. Moreover, choosing an observability solution with on-premises data storage can yield significant cost savings in the long term. By leveraging existing infrastructure and eliminating the need for costly cloud storage and data transfer fees, organizations can optimize their operational expenses. Transparent pricing models further enhance cost efficiency by providing clarity and predictability, ensuring that organizations can budget effectively without encountering unexpected expenses. On the other hand, relying on a Software-as-a-Service (SaaS) based observability provider can introduce complexities, security risks, and issues. With SaaS solutions, organizations relinquish control over data storage and management, placing sensitive information in the hands of third-party vendors. This increases the potential for security breaches and data privacy violations, especially when dealing with regulations like GDPR. Additionally, dependence on external service providers can lead to vendor lock-in, making it challenging to migrate data or switch providers in the future. Moreover, fluctuations in pricing and service disruptions can disrupt operations and strain budgets, further complicating the observability landscape for organizations. For organizations seeking to ensure compliance, enhance data security, and optimize costs, an observability solution that facilitates on-premises data storage offers a compelling solution. By maintaining control over data residency and security while achieving cost efficiencies, businesses can focus on their core competencies and revenue-generating activities with confidence.

By Mike Moore

Reliability Models and Metrics for Test Engineering

Tech teams do their best to develop amazing software products. They spent countless hours coding, testing, and refining every little detail. However, even the most carefully crafted systems may encounter issues along the way. That's where reliability models and metrics come into play. They help us identify potential weak spots, anticipate failures, and build better products. The reliability of a system is a multidimensional concept that encompasses various aspects, including, but not limited to: Availability: The system is available and accessible to users whenever needed, without excessive downtime or interruptions. It includes considerations for system uptime, fault tolerance, and recovery mechanisms. Performance: The system should function within acceptable speed and resource usage parameters. It scales efficiently to meet growing demands (increasing loads, users, or data volumes). This ensures a smooth user experience and responsiveness to user actions. Stability: The software system operates consistently over time and maintains its performance levels without degradation or instability. It avoids unexpected crashes, freezes, or unpredictable behavior. Robustness: The system can gracefully handle unexpected inputs, invalid user interactions, and adverse conditions without crashing or compromising its functionality. It exhibits resilience to errors and exceptions. Recoverability: The system can recover from failures, errors, or disruptions and restore normal operation with minimal data loss or impact on users. It includes mechanisms for data backup, recovery, and rollback. Maintainability: The system should be easy to understand, modify, and fix when necessary. This allows for efficient bug fixes, updates, and future enhancements. This article starts by analyzing mean time metrics. Basic probability distribution models for reliability are then highlighted with their pros and cons. A distinction between software and hardware failure models follows. Finally, reliability growth models are explored including a list of factors for how to choose the right model. Mean Time Metrics Some of the most commonly tracked metrics in the industry are MTTA (mean time to acknowledge), MTBF (mean time before failure), MTTR (mean time to recovery, repair, respond or resolve), and MTTF (mean time to failure). They help tech teams understand how often incidents occur and how quickly the team bounces back from those incidents. The acronym MTTR can be misleading. When discussing MTTR, it might seem like a singular metric with a clear definition. However, it actually encompasses four distinct measurements. The 'R' in MTTR can signify repair, recovery, response, or resolution. While these four metrics share similarities, each carries its own significance and subtleties. Mean Time To Repair: This focuses on the time it takes to fix a failed component. Mean Time To Recovery: This considers the time to restore full functionality after a failure. Mean Time To Respond: This emphasizes the initial response time to acknowledge and investigate an incident. Mean Time To Resolve: This encompasses the entire incident resolution process, including diagnosis, repair, and recovery. While these metrics overlap, they provide a distinct perspective on how quickly a team resolves incidents. MTTA, or Mean Time To Acknowledge, measures how quickly your team reacts to alerts by tracking the average time from alert trigger to initial investigation. It helps assess both team responsiveness and alert system effectiveness. MTBF or Mean Time Between Failures, represents the average time a repairable system operates between unscheduled failures. It considers both the operating time and the repair time. MTBF helps estimate how often a system is likely to experience a failure and require repair. It's valuable for planning maintenance schedules, resource allocation, and predicting system uptime. For a system that cannot or should not be repaired, MTTF, or Mean Time To Failure, represents the average time that the system operates before experiencing its first failure. Unlike MTBF, it doesn't consider repair times. MTTF is used to estimate the lifespan of products that are not designed to be repaired after failing. This makes MTTF particularly relevant for components or systems where repair is either impossible or not economically viable. It's useful for comparing the reliability of different systems or components and informing design decisions for improved longevity. An analogy to illustrate the difference between MTBF and MTTF could be a fleet of delivery vans. MTBF: This would represent the average time between breakdowns for each van, considering both the driving time and the repair time it takes to get the van back on the road. MTTF: This would represent the average lifespan of each van before it experiences its first breakdown, regardless of whether it's repairable or not. Key Differentiators Feature MTBF MTTF Repairable System Yes No Repair Time Considered in the calculation Not considered in the calculation Failure Focus Time between subsequent failures Time to the first failure Application Planning maintenance, resource allocation Assessing inherent system reliability The Bigger Picture MTTR, MTTA, MTTF, and MTBF can also be used all together to provide a comprehensive picture of your team's effectiveness and areas for improvement. Mean time to recovery indicates how quickly you get systems operational again. Incorporating mean time to respond allows you to differentiate between team response time and alert system efficiency. Adding mean time to repair further breaks down how much time is spent on repairs versus troubleshooting. Mean time to resolve incorporates the entire incident lifecycle, encompassing the impact beyond downtime. But the story doesn't end there. Mean time between failures reveals your team's success in preventing or reducing future issues. Finally, incorporating mean time to failure provides insights into the overall lifespan and inherent reliability of your product or system. Probability Distributions for Reliability The following probability distributions are commonly used in reliability engineering to model the time until the failure of systems or components. They are often employed in reliability analysis to characterize the failure behavior of systems over time. Exponential Distribution Model This model assumes a constant failure rate over time. This means that the probability of a component failing is independent of its age or how long it has been operating. Applications: This model is suitable for analyzing components with random failures, such as memory chips, transistors, or hard drives. It's particularly useful in the early stages of a product's life cycle when failure data might be limited. Limitations: The constant failure rate assumption might not always hold true. As hardware components age, they might become more susceptible to failures (wear-out failures), which the Exponential Distribution Model wouldn't capture. Weibull Distribution Model This model offers more flexibility by allowing dynamic failure rates. It can model situations where the probability of failure increases over time at an early stage (infant mortality failures) or at a later stage (wear-out failures). Infant mortality failures: This could represent new components with manufacturing defects that are more likely to fail early on. Wear-out failures: This could represent components like mechanical parts that degrade with use and become more likely to fail as they age. Applications: The Weibull Distribution Model is more versatile than the Exponential Distribution Model. It's a good choice for analyzing a wider range of hardware components with varying failure patterns. Limitations: The Weibull Distribution Model requires more data to determine the shape parameter that defines the failure rate behavior (increasing, decreasing, or constant). Additionally, it might be too complex for situations where a simpler model like the Exponential Distribution would suffice. The Software vs Hardware Distinction The nature of software failures is different from that of hardware failures. Although both software and hardware may experience deterministic as well as random failures, their failures have different root causes, different failure patterns, and different prediction, prevention, and repair mechanisms. Depending on the level of interdependence between software and hardware and how it affects our systems, it may be beneficial to consider the following factors: 1. Root Cause of Failures Hardware: Hardware failures are physical in nature, caused by degradation of components, manufacturing defects, or environmental factors. These failures are often random and unpredictable. Consequently, hardware reliability models focus on physical failure mechanisms like fatigue, corrosion, and material defects. Software: Software failures usually stem from logical errors, code defects, or unforeseen interactions with the environment. These failures may be systematic and can be traced back to specific lines of code or design flaws. Consequently, software reliability models do not account for physical degradation over time. 2. Failure Patterns Hardware: Hardware failures often exhibit time-dependent behavior. Components might be more susceptible to failures early in their lifespan (infant mortality) or later as they wear out. Software: The behavior of software failures in time can be very tricky and usually depends on the evolution of our code, among others. A bug in the code will remain a bug until it's fixed, regardless of how long the software has been running. 3. Failure Prediction, Prevention, Repairs Hardware: Hardware reliability models that use MTBF often focus on predicting average times between failures and planning preventive maintenance schedules. Such models analyze historical failure data from identical components. Repairs often involve the physical replacement of components. Software: Software reliability models like Musa-Okumoto and Jelinski-Moranda focus on predicting the number of remaining defects based on testing data. These models consider code complexity and defect discovery rates to guide testing efforts and identify areas with potential bugs. Repair usually involves debugging and patching, not physical replacement. 4. Interdependence and Interaction Failures The level of interdependence between software and hardware varies for different systems, domains, and applications. Tight coupling between software and hardware may cause interaction failures. There can be software failures due to hardware and vice-versa. Here's a table summarizing the key differences: Feature Hardware Reliability Models Software Reliability Models Root Cause of Failures Physical Degradation, Defects, Environmental Factors Code Defects, Design Flaws, External Dependencies Failure Patterns Time-Dependent (Infant Mortality, Wear-Out) Non-Time Dependent (Bugs Remain Until Fixed) Prediction Focus Average Times Between Failures (MTBF, MTTF) Number of Remaining Defects Prevention Strategies Preventive Maintenance Schedules Code Review, Testing, Bug Fixes By understanding the distinct characteristics of hardware and software failures, we may be able to leverage tailored reliability models, whenever necessary, to gain in-depth knowledge of our system's behavior. This way we can implement targeted strategies for prevention and mitigation in order to build more reliable systems. Code Complexity Code complexity assesses how difficult a codebase is to understand and maintain. Higher complexity often correlates with an increased likelihood of hidden bugs. By measuring code complexity, developers can prioritize testing efforts and focus on areas with potentially higher defect density. The following tools can automate the analysis of code structure and identify potential issues like code duplication, long functions, and high cyclomatic complexity: SonarQube: A comprehensive platform offering code quality analysis, including code complexity metrics Fortify: Provides static code analysis for security vulnerabilities and code complexity CppDepend (for C++): Analyzes code dependencies and metrics for C++ codebases PMD: An open-source tool for identifying common coding flaws and complexity metrics Defect Density Defect density illuminates the prevalence of bugs within our code. It's calculated as the number of defects discovered per unit of code, typically lines of code (LOC). A lower defect density signifies a more robust and reliable software product. Reliability Growth Models Reliability growth models help development teams estimate the testing effort required to achieve desired reliability levels and ensure a smooth launch of their software. These models predict software reliability improvements as testing progresses, offering insights into the effectiveness of testing strategies and guiding resource allocation. They are mathematical models used to predict and improve the reliability of systems over time by analyzing historical data on defects or failures and their removal. Some models exhibit characteristics of exponential growth. Other models exhibit characteristics of power law growth while there exist models that exhibit both exponential and power law growth. The distinction is primarily based on the underlying assumptions about how the fault detection rate changes over time in relation to the number of remaining faults. While a detailed analysis of reliability growth models is beyond the scope of this article, I will provide a categorization that may help for further study. Traditional growth models encompass the commonly used and foundational models, while the Bayesian approach represents a distinct methodology. The advanced growth models encompass more complex models that incorporate additional factors or assumptions. Please note that the list is indicative and not exhaustive. Traditional Growth Models Musa-Okumoto Model It assumes a logarithmic Poisson process for fault detection and removal, where the number of failures observed over time follows a logarithmic function of the number of initial faults. Jelinski-Moranda Model It assumes a constant failure intensity over time and is based on the concept of error seeding. It postulates that software failures occur at a rate proportional to the number of remaining faults in the system. Goel-Okumoto Model It incorporates the assumption that the fault detection rate decreases exponentially as faults are detected and fixed. It also assumes a non-homogeneous Poisson process for fault detection. Non-Homogeneous Poisson Process (NHPP) Models They assume the fault detection rate is time-dependent and follows a non-homogeneous Poisson process. These models allow for more flexibility in capturing variations in the fault detection rate over time. Bayesian Approach Wall and Ferguson Model It combines historical data with expert judgment to update reliability estimates over time. This model considers the impact of both defect discovery and defect correction efforts on reliability growth. Advanced Growth Models Duane Model This model assumes that the cumulative MTBF of a system increases as a power-law function of the cumulative test time. This is known as the Duane postulate and it reflects how quickly the reliability of the system is improving as testing and debugging occur. Coutinho Model Based on the Duane model, it extends to the idea of an instantaneous failure rate. This rate involves the number of defects found and the number of corrective actions made during testing time. This model provides a more dynamic representation of reliability growth. Gooitzen Model It incorporates the concept of imperfect debugging, where not all faults are detected and fixed during testing. This model provides a more realistic representation of the fault detection and removal process by accounting for imperfect debugging. Littlewood Model It acknowledges that as system failures are discovered during testing, the underlying faults causing these failures are repaired. Consequently, the reliability of the system should improve over time. This model also considers the possibility of negative reliability growth when a software repair introduces further errors. Rayleigh Model The Rayleigh probability distribution is a special case of the Weibull distribution. This model considers changes in defect rates over time, especially during the development phase. It provides an estimation of the number of defects that will occur in the future based on the observed data. Choosing the Right Model There's no single "best" reliability growth model. The ideal choice depends on the specific project characteristics and available data. Here are some factors to consider. Specific objectives: Determine the specific objectives and goals of reliability growth analysis. Whether the goal is to optimize testing strategies, allocate resources effectively, or improve overall system reliability, choose a model that aligns with the desired outcomes. Nature of the system: Understand the characteristics of the system being analyzed, including its complexity, components, and failure mechanisms. Certain models may be better suited for specific types of systems, such as software, hardware, or complex systems with multiple subsystems. Development stage: Consider the stage of development the system is in. Early-stage development may benefit from simpler models that provide basic insights, while later stages may require more sophisticated models to capture complex reliability growth behaviors. Available data: Assess the availability and quality of data on past failures, fault detection, and removal. Models that require extensive historical data may not be suitable if data is limited or unreliable. Complexity tolerance: Evaluate the complexity tolerance of the stakeholders involved. Some models may require advanced statistical knowledge or computational resources, which may not be feasible or practical for all stakeholders. Assumptions and limitations: Understand the underlying assumptions and limitations of each reliability growth model. Choose a model whose assumptions align with the characteristics of the system and the available data. Predictive capability: Assess the predictive capability of the model in accurately forecasting future reliability levels based on past data. Flexibility and adaptability: Consider the flexibility and adaptability of the model to different growth patterns and scenarios. Models that can accommodate variations in fault detection rates, growth behaviors, and system complexities are more versatile and applicable in diverse contexts. Resource requirements: Evaluate the resource requirements associated with implementing and using the model, including computational resources, time, and expertise. Choose a model that aligns with the available resources and capabilities of the organization. Validation and verification: Verify the validity and reliability of the model through validation against empirical data or comparison with other established models. Models that have been validated and verified against real-world data are more trustworthy and reliable. Regulatory requirements: Consider any regulatory requirements or industry standards that may influence the choice of reliability growth model. Certain industries may have specific guidelines or recommendations for reliability analysis that need to be adhered to. Stakeholder input: Seek input and feedback from relevant stakeholders, including engineers, managers, and domain experts, to ensure that the chosen model meets the needs and expectations of all parties involved. Wrapping Up Throughout this article, we explored a plethora of reliability models and metrics. From the simple elegance of MTTR to the nuanced insights of NHPP models, each instrument offers a unique perspective on system health. The key takeaway? There's no single "rockstar" metric or model that guarantees system reliability. Instead, we should carefully select and combine the right tools for the specific system at hand. By understanding the strengths and limitations of various models and metrics, and aligning them with your system's characteristics, you can create a comprehensive reliability assessment plan. This tailored approach may allow us to identify potential weaknesses and prioritize improvement efforts.

By Stelios Manioudakis, PhD

CORE

How You Can Use Logs To Feed Security

If your system is facing an imminent security threat—or worse, you’ve just suffered a breach—then logs are your go-to. If you’re a security engineer working closely with developers and the DevOps team, you already know that you depend on logs for threat investigation and incident response. Logs offer a detailed account of system activities. Analyzing those logs helps you fortify your digital defenses against emerging risks before they escalate into full-blown incidents. At the same time, your logs are your digital footprints, vital for compliance and auditing. Your logs contain a massive amount of data about your systems (and hence your security), and that leads to some serious questions: How do you handle the complexity of standardizing and analyzing such large volumes of data? How do you get the most out of your log data so that you can strengthen your security? How do you know what to log? How much is too much? Recently, I’ve been trying to use tools and services to get a handle on my logs. In this post, I’ll look at some best practices for using these tools—how they can help with security and identifying threats. And finally, I’ll look at how artificial intelligence may play a role in your log analysis. How To Identify Security Threats Through Logs Logs are essential for the early identification of security threats. Here’s how: Identifying and Mitigating Threats Logs are a gold mine of streaming, real-time analytics, and crucial information that your team can use to its advantage. With dashboards, visualizations, metrics, and alerts set up to monitor your logs you can effectively identify and mitigate threats. In practice, I’ve used both Sumo Logic and the ELK stack (a combination of Elasticsearch, Kibana, Beats, and Logstash). These tools can help your security practice by allowing you to: Establish a baseline of behavior and quickly identify anomalies in service or application behavior. Look for things like unusual access times, spikes in data access, or logins from unexpected areas of the world. Monitor access to your systems for unexpected connections. Watch for frequent and unusual access to critical resources. Watch for unusual outbound traffic that might signal data exfiltration. Watch for specific types of attacks, such as SQL injection or DDoS. For example, I monitor how rate-limiting deals with a burst of requests from the same device or IP using Sumo Logic’s Cloud Infrastructure Security. Watch for changes to highly critical files. Is someone tampering with config files? Create and monitor audit trails of user activity. This forensic information can help you to trace what happened with suspicious—or malicious—activities. Closely monitor authentication/authorization logs for frequent failed attempts. Cross-reference logs to watch for complex, cross-system attacks, such as supply chain attacks or man-in-the-middle (MiTM) attacks. Using a Sumo Logic dashboard of logs, metrics, and traces to track down security threats It’s also best practice to set up alerts to see issues early, giving you the lead time needed to deal with any threat. The best tools are also infrastructure agnostic and can be run on any number of hosting environments. Insights for Future Security Measures Logs help you with more than just looking into the past to figure out what happened. They also help you prepare for the future. Insights from log data can help your team craft its security strategies for the future. Benchmark your logs against your industry to help identify gaps that may cause issues in the future. Hunt through your logs for signs of subtle IOCs (indicators of compromise). Identify rules and behaviors that you can use against your logs to respond in real-time to any new threats. Use predictive modeling to anticipate future attack vectors based on current trends. Detect outliers in your datasets to surface suspicious activities What to Log. . . And How Much to Log So we know we need to use logs to identify threats both present and future. But to be the most effective, what should we log? The short answer is—everything! You want to capture everything you can, all the time. When you’re first getting started, it may be tempting to try to triage logs, guessing as to what is important to keep and what isn’t. But logging all events as they happen and putting them in the right repository for analysis later is often your best bet. In terms of log data, more is almost always better. But of course, this presents challenges. Who’s Going To Pay for All These Logs? When you retain all those logs, it can be very expensive. And it’s stressful to think about how much money it will cost to store all of this data when you just throw it in an S3 bucket for review later. For example, on AWS a daily log data ingest of 100GB/day with the ELK stack could create an annual cost of hundreds of thousands of dollars. This often leads to developers “self-selecting” what they think is — and isn’t — important to log. Your first option is to be smart and proactive in managing your logs. This can work for tools such as the ELK stack, as long as you follow some basic rules: Prioritize logs by classification: Figure out which logs are the most important, classify them as such, and then be more verbose with those logs. Rotate logs: Figure out how long you typically need logs and then rotate them off servers. You probably only need debug logs for a matter of weeks, but access logs for much longer. Log sampling: Only log a sampling of high-volume services. For example, log just a percentage of access requests but log all error messages. Filter logs: Pre-process all logs to remove unnecessary information, condensing their size before storing them. Alert-based logging: Configure alerts based on triggers or events that subsequently turn logging on or make your logging more verbose. Use tier-based storage: Store more recent logs on faster, more expensive storage. Move older logs to cheaper, slow storage. For example, you can archive old logs to Amazon S3. These are great steps, but unfortunately, they can involve a lot of work and a lot of guesswork. You often don’t know what you need from the logs until after the fact. A second option is to use a tool or service that offers flat-rate pricing; for example, Sumo Logic’s $0 ingest. With this type of service, you can stream all of your logs without worrying about overwhelming ingest costs. Instead of a per-GB-ingested type of billing, this plan bills based on the valuable analytics and insights you derive from that data. You can log everything and pay just for what you need to get out of your logs. In other words, you are free to log it all! Looking Forward: The Role of AI in Automating Log Analysis The right tool or service, of course, can help you make sense of all this data. And the best of these tools work pretty well. The obvious new tool to help you make sense of all this data is AI. With data that is formatted predictably, we can apply classification algorithms and other machine-learning techniques to find out exactly what we want to know about our application. AI can: Automate repetitive tasks like data cleaning and pre-processing Perform automated anomaly detection to alert on abnormal behaviors Automatically identify issues and anomalies faster and more consistently by learning from historical log data Identify complex patterns quickly Use large amounts of historical data to more accurately predict future security breaches Reduce alert fatigue by reducing false positives and false negatives Use natural language processing (NLP) to parse and understand logs Quickly integrate and parse logs from multiple, disparate systems for a more holistic view of potential attack vectors AI probably isn’t coming for your job, but it will probably make your job a whole lot easier. Conclusion Log data is one of the most valuable and available means to ensure your applications’ security and operations. It can help guard against both current and future attacks. And for log data to be of the most use, you should log as much information as you can. The last problem you want during a security crisis is to find out you didn’t log the information you need.

By Alvin Lee

CORE

Operation and Network Administration Management of Telecom 5G Network Functions Using Openshift Kubernetes Tools

The world of Telecom is evolving at a rapid pace, and it is not just important, but crucial for operators to stay ahead of the game. As 5G technology becomes the norm, it is not just essential, but a strategic imperative to transition seamlessly from 4G technology (which operates on OpenStack cloud) to 5G technology (which uses Kubernetes). In the current scenario, operators invest in multiple vendor-specific monitoring tools, leading to higher costs and less efficient operations. However, with the upcoming 5G world, operators can adopt a unified monitoring and alert system for all their products. This single system, with its ability to monitor network equipment, customer devices, and service platforms, offers a reassuringly holistic view of the entire system, thereby reducing complexity and enhancing efficiency. By adopting a Prometheus-based monitoring and alert system, operators can streamline operations, reduce costs, and enhance customer experience. With a single monitoring system, operators can monitor their entire 5G system seamlessly, ensuring optimal performance and avoiding disruptions. This practical solution eliminates the need for a complete overhaul and offers a cost-effective transition. Let's dive deep. Prometheus, Grafana, and Alert Manager Prometheus is a tool for monitoring and alerting systems, utilizing a pull-based monitoring system. It scrapes, collects, and stores Key Performance Indicators (KPI) with labels and timestamps, enabling it to collect metrics from targets, which are the Network Functions' namespaces in the 5G telecom world. Grafana is a dynamic web application that offers a wide range of functionalities. It visualizes data, allowing the building of charts, graphs, and dashboards that the 5G Telecom operator wants to visualize. Its primary feature is the display of multiple graphing and dashboarding support modes using GUI (Graphical user interface). Grafana can seamlessly integrate data collected by Prometheus, making it an indispensable tool for telecom operators. It is a powerful web application that supports the integration of different data sources into one dashboard, enabling continuous monitoring. This versatility improves response rates by alerting the telecom operator's team when an incident emerges, ensuring a minimum 5G network function downtime. The Alert Manager is a crucial component that manages alerts from the Prometheus server via alerting rules. It manages the received alerts, including silencing and inhibiting them and sending out notifications via email or chat. The Alert Manager also removes duplications, grouping, and routing them to the centralized webhook receiver, making it a must-have tool for any telecom operator. Architectural Diagram Prometheus Components of Prometheus (Specific to a 5G Telecom Operator) Core component: Prometheus server scrapes HTTP endpoints and stores data (time series). The Prometheus server, a crucial component in the 5G telecom world, collects metrics from the Prometheus targets. In our context, these targets are the Kubernetes cluster that houses the 5G network functions. Time series database (TSDB): Prometheus stores telecom Metrics as time series data. HTTP Server: API to query data stored in TSDB; The Grafana dashboard can query this data for visualization. Telecom operator-specific libraries (5G) for instrumenting application code. Push gateway (scrape target for short-lived jobs) Service Discovery: In the world of 5G, network function pods are constantly being added or deleted by Telecom operators to scale up or down. Prometheus's adaptable service discovery component monitors the ever-changing list of pods. The Prometheus Web UI, accessible through port 9090, is a data visualization tool. It allows users to view and analyze Prometheus data in a user-friendly and interactive manner, enhancing the monitoring capabilities of the 5G telecom operators. The Alert Manager, a key component of Prometheus, is responsible for handling alerts. It is designed to notify users if something goes wrong, triggering notifications when certain conditions are met. When alerting triggers are met, Prometheus alerts the Alert Manager, which sends alerts through various channels such as email or messenger, ensuring timely and effective communication of critical issues. Grafana for dashboard visualization (actual graphs) With Prometheus's robust components, your Telecom operator's 5G network functions are monitored with diligence, ensuring reliable resource utilization, tracking performance, detection of errors in availability, and more. Prometheus can provide you with the necessary tools to keep your network running smoothly and efficiently. Prometheus Features The multi-dimensional data model identified by metric details uses PromQL (Prometheus Querying Language) as the query language and the HTTP Pull model. Telecom operators can now discover 5G network functions with service discovery and static configuration. The multiple modes of dashboard and GUI support provide a comprehensive and customizable experience for users. Prometheus Remote Write to Central Prometheus from Network Functions 5G Operators will have multiple network functions from various vendors, such as SMF (Session Management Function), UPF (User Plane Function), AMF (Access and Mobility Management Function), PCF (Policy Control Function), and UDM (Unified Data Management). Using multiple Prometheus/Grafana dashboards for each network function can lead to a complex and inefficient 5G network operator monitoring process. To address this, it is highly recommended that all data/metrics from individual Prometheus be consolidated into a single Central Prometheus, simplifying the monitoring process and enhancing efficiency. The 5G network operator can now confidently monitor all the data at the Central Prometheus's centralized location. This user-friendly interface provides a comprehensive view of the network's performance, empowering the operator with the necessary tools for efficient monitoring. Grafana Grafana Features Panels: This powerful feature empowers operators to visualize Telecom 5G data in many ways, including histograms, graphs, maps, and KPIs. It offers a versatile and adaptable interface for data representation, enhancing the efficiency and effectiveness of your data analysis. Plugins: This feature efficiently renders Telecom 5G data in real-time on a user-friendly API (Application Programming Interface), ensuring operators always have the most accurate and up-to-date data at their fingertips. It also enables operators to create data source plugins and retrieve metrics from any API. Transformations: This feature allows you to flexibly adapt, summarize, combine, and perform KPI metrics query/calculations across 5G network functions data sources, providing the tools to effectively manipulate and analyze your data. Annotations: Rich events from different Telecom 5G network functions data sources are used to annotate metrics-based graphs. Panel editor: Reliable and consistent graphical user interface for configuring and customizing 5G telecom metrics panels Grafana Sample Dashboard GUI for 5G Alert Manager Alert Manager Components The Ingester swiftly ingests all alerts, while the Grouper groups them into categories. The De-duplicator prevents repetitive alerts, ensuring you're not bombarded with notifications. The Silencer is there to mute alerts based on a label, and the Throttler regulates the frequency of alerts. Finally, the Notifier will ensure that third parties are notified promptly. Alert Manager Functionalities Grouping: Grouping categorizes similar alerts into a single notification system. This is helpful during more extensive outages when many 5G network functions fail simultaneously and when all the alerts need to fire simultaneously. The telecom operator will expect only to get a single page while still being able to visualize the exact service instances affected. Inhibition: Inhibition suppresses the notification for specific low-priority alerts if certain major/critical alerts are already firing. For example, when a critical alert fires, indicating that an entire 5G SMF (Session Management Function) cluster is not reachable, AlertManager can mute all other minor/warning alerts concerning this cluster. Silences: Silences are simply mute alerts for a given time. Incoming alerts are checked to match the regular expression matches of an active silence. If they match, no notifications will be sent out for that alert. High availability: Telecom operators will not load balance traffic between Prometheus and all its Alert Managers; instead, they will point Prometheus to a list of all Alert Managers. Dashboard Visualization Grafana dashboard visualizes the Alert Manager webhook traffic notifications as shown below: Configuration YAMLs (Yet Another Markup Language) Telecom Operators can install and run Prometheus using the configuration below: YAML prometheus: enabled: true route: enabled: {} nameOverride: Prometheus tls: enabled: true certificatesSecret: backstage-prometheus-certs certFilename: tls.crt certKeyFilename: tls.key volumePermissions: enabled: true initdbScriptsSecret: backstage-prometheus-initdb prometheusSpec: retention: 3d replicas: 2 prometheusExternalLabelName: prometheus_cluster image: repository: <5G operator image repository for Prometheus> tag: <Version example v2.39.1> sha: "" podAntiAffinity: "hard" securityContext: null resources: limits: cpu: 1 memory: 2Gi requests: cpu: 500m memory: 1Gi serviceMonitorNamespaceSelector: matchExpressions: - {key: namespace, operator: In, values: [<Network function 1 namespace>, <Network function 2 namespace>]} serviceMonitorSelectorNilUsesHelmValues: false podMonitorSelectorNilUsesHelmValues: false ruleSelectorNilUsesHelmValues: false Configuration to route scrape data segregated based on the namespace and route to Central Prometheus. Note: The below configuration can be appended to the Prometheus mentioned in the above installation YAML. YAML remoteWrite: - url: <Central Prometheus URL for namespace 1 by 5G operator> basicAuth: username: name: <secret username for namespace 1> key: username password: name: <secret password for namespace 1> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 1> action: keep - url: <Central Prometheus URL for namespace 2 by 5G operator> basicAuth: username: name: <secret username for namespace 2> key: username password: name: <secret password for namespace 2> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 2> action: keep Telecom Operators can install and run Grafana using the configuration below. YAML grafana: replicas: 2 affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: "app.kubernetes.io/name" operator: In values: - Grafana topologyKey: "kubernetes.io/hostname" securityContext: false rbac: pspEnabled: false # Must be disabled due to tenant permissions namespaced: true adminPassword: admin image: repository: <artifactory>/Grafana tag: <version> sha: "" pullPolicy: IfNotPresent persistence: enabled: false initChownData: enabled: false sidecar: image: repository: <artifactory>/k8s-sidecar tag: <version> sha: "" imagePullPolicy: IfNotPresent resources: limits: cpu: 100m memory: 100Mi requests: cpu: 50m memory: 50Mi dashboards: enabled: true label: grafana_dashboard labelValue: "Vendor name" datasources: enabled: true defaultDatasourceEnabled: false additionalDataSources: - name: Prometheus type: Prometheus url: http://<prometheus-operated>:9090 access: proxy isDefault: true jsonData: timeInterval: 30s resources: limits: cpu: 400m memory: 512Mi requests: cpu: 50m memory: 206Mi extraContainers: - name: oauth-proxy image: <artifactory>/origin-oauth-proxy:<version> imagePullPolicy: IfNotPresent ports: - name: proxy-web containerPort: 4181 args: - --https-address=:4181 - --provider=openshift # Service account name here must be "<Helm Release name>-grafana" - --openshift-service-account=monitoring-grafana - --upstream=http://localhost:3000 - --tls-cert=/etc/tls/private/tls.crt - --tls-key=/etc/tls/private/tls.key - --cookie-secret=SECRET - --pass-basic-auth=false resources: limits: cpu: 100m memory: 256Mi requests: cpu: 50m memory: 128Mi volumeMounts: - mountPath: /etc/tls/private name: grafana-tls extraContainerVolumes: - name: grafana-tls secret: secretName: grafana-tls serviceAccount: annotations: "serviceaccounts.openshift.io/oauth-redirecturi.first": https://[SPK exposed IP for Grafana] service: targetPort: 4181 annotations: service.alpha.openshift.io/serving-cert-secret-name: <secret> Telecom Operators can install and run Alert Manager using the configuration below. YAML alertmanager: enabled: true alertmanagerSpec: image: repository: prometheus/alertmanager tag: <version> replicas: 2 podAntiAffinity: hard securityContext: null resources: requests: cpu: 25m memory: 200Mi limits: cpu: 100m memory: 400Mi containers: - name: config-reloader resources: requests: cpu: 10m memory: 10Mi limits: cpu: 25m memory: 50Mi Configuration to route Prometheus Alert Manager data to the Operator's centralized webhook receiver. Note: The below configuration can be appended to the Alert Manager mentioned in the above installation YAML. YAML config: global: resolve_timeout: 5m route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 12h receiver: 'null' routes: - receiver: '<Network function 1>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 1>" - receiver: '<Network function 2>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 2>" Conclusion The open-source OAM (Operation and Maintenance) tools Prometheus, Grafana, and Alert Manager can benefit 5G Telecom operators. Prometheus periodically captures all the status of monitored 5G Telecom network functions through the HTTP protocol, and any component can be connected to the monitoring as long as the 5G Telecom operator provides the corresponding HTTP interface. Prometheus and Grafana Agent gives the 5G Telecom operator control over the metrics the operator wants to report; once the data is in Grafana, it can be stored in a Grafana database as extra data redundancy. In conclusion, Prometheus allows 5G Telecom operators to improve their operations and offer better customer service. Adopting a unified monitoring and alert system like Prometheus is one way to achieve this.

By BINU SUDHAKARAN PILLAI

OpenTelemetry's Impact on Observability: Insights From Grafana Labs' Juraci Paixão Kröhling

OpenTelemetry has taken cloud-native by storm since its introduction in 2019. As one of the CNCF’s most active projects, “OTel” is providing much-needed standardization for telemetry data. By giving logs, traces, metrics, and profiling standard protocols and consistent attributes—OTel is reshaping the observability domain by making telemetry data a vastly more open, interoperable, and consistent experience for platform teams and developers. As OTel approaches graduation from the CNCF, DZone spoke with Juraci Paixão Kröhling — Principal Engineer at Grafana Labs and OpenTelemetry Governing Board Member — to learn more about the exciting progress of this standard, and the implications of the “own your own telemetry data” movement it is driving in cloud native. Q: What’s the general problem that OTel sought to solve when it was originally released, and why do you think this is a “right place, right time” type of situation for the project? How would you explain its popularity? A: The origin of OpenTelemetry really began with the success of two different open-source projects that were on a collision course. OpenTracing was trying to make distributed tracing more popular by standardizing instrumentation APIs and semantic conventions, as well as best practices for distributed tracing, while OpenCensus was tackling a similar opportunity by having a “batteries-included” solution for distributed tracing and metrics instrumentation, without specifically aiming to be a standard. While the projects mostly complemented each other, there were parts that were competing, causing fragmentation in the wider distributed tracing (and observability) communities. OpenTelemetry joined these two different OpenTracing and OpenCensus libraries, bridged the fragmentation of their communities, so there was a clear winner, and then gave a recipe for common semantic conventions and a common mental model not just for tracing, but for all telemetry types. I think the reason why OTel has become so popular, so fast, is that distributed systems have such a deep requirement for open standards and interoperability. The industry’s been racing forward in cloud-native with a heavy investment in Kubernetes and microservices—basically, massive aggregations of infrastructure and polyglot applications—and platform and development teams need consistent signals they can trust for observing these systems in real-time. OTel isn’t just giving a common and standardized schema for how to work with telemetry data—it’s also been a force multiplier in all the databases, frameworks, and programming language libraries conforming to a standard approach to telemetry data. Before OTel, observability vendors were creating monetization strategies around proprietary instrumentation and data formats, and creating mazes that made it difficult for enterprises to switch to other providers, and OTel has opened that all up and made observability vendors compete instead on the strength of their platforms, while the underlying telemetry is agnostic. Q: As telemetry data today is spanning not just infrastructure, but applications, what are you seeing in the evolution of the signals that platform teams and engineers are working with, and why is polyglot support so important to OTel? A: Things have really gotten much finer-grained. In the past, we would look at a specific application and we’d measure things like how long a specific HTTP call was taking--very coarse-grained metrics like that. Today we can go much deeper into the business transaction touching several microservices and understand what is happening. Why is it taking that long? Is it a specific algorithm? Is it a downstream service? We have the ability to not only know that an HTTP call is slow, but why. As a natural extension of that telemetry data evolution to finer-grained, OTel libraries can be integrated with many of the most popular libraries and frameworks, so that we can get deeper instrumentation data to see the details of the run-time of our applications. We are also seeing OTel being added natively as part of programming languages and frameworks. This is really interesting to watch in terms of its evolution because the instrumentation can more intelligently appreciate the primitives of the languages themselves, and the expected performance time across language-specific conventions. When we think about languages like Python for AI, or Java for concurrency—each language has its own native capabilities, and so this standardization on OTel is pushing a lot more intelligence not only into how infrastructure and applications can be observed side-by-side but also deeper drills into how applications written in specific languages are behaving. Q: Given all the activity around the project, can you summarize where the most active contributions have been in recent years and the main areas the community is evolving its capabilities? A: The Collector is our biggest SIG (Special Interest Group) at the moment, but we have many contributions as well around our semantic conventions and specification repositories. SIGs related to popular programming languages, like the Java SIG, are also very active. We are seeing continued progress both on new fronts, like the new Profiling signal, as well as on stabilization of our current projects, like the Collector or specific semantic conventions. I’m also happy to see movement on important topics for our future like establishing a new working group focused on standards for metrics and semantic conventions for environmental sustainability. We have also a growing End-User community, where our users share their experiences with other community members, including the maintainers of the code they use. If you use OTel, you are invited not only to join our Monthly Discussion Groups, but also to regularly take our surveys, and, why not, start contributing to the project: it’s likely that the SIG producing the code you are running can use your help. Q: What has been the disruptive impact of OTel on the observability vendor ecosystem? A: In the past, we’d SSH into a server to get to the origin of a problem. Those days are long gone. Today, hundreds of pods are running a distributed service, and it would be infeasible to log into all those services. So with distributed computing, we started to collect and ship telemetry data to central locations. But the way that was done before OTel, that data wasn’t aware of which machines that data was coming from, and there wasn’t much cross-coordination between the telemetry data types, or even between the same telemetry (logs, for instance) across programming languages or frameworks. Sometimes, we’d record the URI of a request as “request.uri”, but sometimes it would be “URL”. OTel came in with a very clear way to name and label telemetry. It also provides its own transport protocol to be optionally used, so all signals can be transmitted using the same basic mechanism to different backends. Now the specification makes it possible to tie the layers together, hop between infrastructure observability to application observability, and draw correlations that were very difficult before. Q: What do you see as the big new frontiers for OTel, beyond where it’s already thriving today? What’s around the corner? A: We made progress in many areas and are stabilizing others. While we have new “core” features being proposed and developed within the OTel community, I believe that what we have right now will enable us and other communities to go wider, expanding on domains that might not necessarily be our main focus. For instance, we are seeing new communities forming around CI/CD, environmental sustainability, cost tracking, and LLM, among others. Stabilization also opens the door for a much-needed time for reflection. What would we do differently with the knowledge we have today? The newly formed Entities specification SIG comes to mind in that context. Similarly, I can’t wait to see what’s next after we have a Collector v1. Our profiling story is also just at the beginning: we have the specification for OTLP Profiles, and while we know that we need to integrate that with our current projects (SDKs, Collector, …), I’m eager to see what the community will come up with next. What else can we do now that we have the ability to do a deeper correlation between profiles and the other signals? While we have Android and Swift SIGs already, I believe we’ll see more movement around mobile observability in the future as well. I hear quite frequently from developers working at retailers and FinTechs that while their backend is observable nowadays, their mobile applications still need some OTel love, given how important their apps are for their businesses today. Of course, we can’t talk about the future without mentioning GenAI. To me, we have a vast exploration area for GenAI, starting with the obvious ones: does it make sense to create tooling that generates “manual” instrumentation for existing code? Can we use GenAI to improve existing instrumentation by ensuring it adheres to semantic conventions?

By Tom Smith

CORE

How Enabling Slow Query Log Enhances Postgres Observability

In PostgreSQL, the slow query log is a feature that allows you to log queries that take longer than a specified threshold to execute. This log helps you identify and optimize queries that may be causing performance issues in your database. Let’s see how we can use it. Why Do We Need Observability? Database observability is a crucial component for the maintenance and development of the database. It helps with identifying and solving issues. Observability is much more than just monitoring, though. However, to build successful observability, we need to introduce proper telemetry and monitoring everywhere in our production environment. One of the things we start with is logging of the queries. We want to capture details of the SQL statement, metrics around execution time and consumed memory, and statistics of the tables we use. Unfortunately, many default settings in our database systems result in missing pieces of information that are crucial for debugging. One such piece is details of queries that are slow and are the most probable causes of the issues. What Is a Slow Query Log? The database executes many queries during the day. Some of them are very fast, and some of them may slow the database down and cause issues with other processes using the database. Ideally, we would like to identify these slow queries and examine them more to understand why they are slow. There are many reasons why queries may be slow and many techniques to optimize them. Most of these techniques focus on using the execution plan to understand what happened. The execution plan explains what the database engine performs when executing the query. This can involve many operations like joining many tables, using indexes, sorting data, or saving it to disk temporarily. Such a plan provides all the details, however, these plans may consume a lot of space. Therefore, we don’t store them for every single query as probably most of the queries are fast and don’t need any investigation. The slow query log is a mechanism for capturing details of queries that take too long to execute. This helps in the investigation as we capture the details at the moment when the query runs. The slow query log can be useful for identifying performance bottlenecks and optimizing slow queries to improve the overall performance of your PostgreSQL database. How to Configure the Slow Query Log To enable the slow query log in PostgreSQL, we need to set a couple of parameters. Let’s see them one by one. First, you need to enable logging with the following: log_statement = 'all' This instructs PostgreSQL to log all the syntactically correct statements. Other options are none (log nothing), ddl (log only Data Definition Language queries, i.e., queries that modify schema), mod (DDL queries and queries that modify the data, but not things like VACUUM). It’s also worth mentioning that log_statement will not log syntactically incorrect things. We need to use log_min_error_statement to do that. Also, log_statement may log confidential information. Another parameter logs the duration of all the completed statements: log_duration = on This will log the duration of all the statements. However, not all statements will have the query text (so the actual statement that was executed). To do that, we need to use another parameter: log_min_duration_statement = 100ms This causes logging of the duration of the statement if it ran for at least one hundred milliseconds. However, this will report the query text of the statement that was slow. After making these changes, you need to restart PostgreSQL for the configuration to take effect. There are additional parameters that you may configure. For instance: log_destination = 'csvlog' This causes the logging to a CSV file. You may want to log using different file formats. log_filename = 'postgresql.log.%Y-%m-%d-%H' This configures the name of the log file. This makes it easier to process the logs in an automated manner. log_rotation_age = 60 This causes creating a new log file every sixty minutes. compute_query_id = 'on' This enables in-core computation of a query identifier. We can use this identifier to find identical queries in a best-effort manner. This works starting with PostgreSQL 13. Once we log the queries, we need to get their execution plans. We can use pg_store_plans for that. pg_store_plans.plan_format = 'json' This controls what format to use when logging the execution plan. pg_store_plans.max_plan_length = 1048576 This controls the length of the plan to store. If the plan is too long, it will get truncated. It is important to set this value high enough to store the whole execution plan. We can also configure what is logged exactly: pg_store_plans.log_buffers = true pg_store_plans.log_analyze = true pg_store_plans.log_buffers = true pg_store_plans.log_timing = true This should give you enough details of what happened. What About Ephemeral Databases? Configuring your PostgreSQL is simple if your database lives for a long time. This is typically the case when you host your database in the cloud (or generally as a hosted database), or if you run it in a Docker container that is running as a service. However, if you run PostgreSQL only for a very short period, for instance during your automated tests, then you may have no technical way of reconfiguring it. This may be the case with Testcontainers. Typically, you may run some initialization code just before your actual test suite to initialize the dependencies like storage emulators or database servers. Testcontainers takes care of running them as Docker containers. However, there is no straightforward way of restarting the container. In some languages, you may have an actual API that will handle this quite nicely, though. An ephemeral database strategy allows for separating high-throughput, frequently changing data from the main database to enhance efficiency and mitigate operational risks. This approach addresses issues like query costs and system strain, with the ephemeral DB holding disposable data, thereby ensuring system stability and performance. Similar issues may happen if you host your PostgreSQL for tests as a service in GitHub Actions. You cannot easily control the containers and restart them after applying the configuration changes. The solution is to use a custom Docker image. Just prepare your image with the configuration file that enables the slow query log. You can then just run the container once and the configuration will be as expected. Summary The slow query log is a feature that allows you to log queries that take longer than a specified threshold to execute. This can significantly ease the investigation of slow queries as all the important details of the queries are already available.

By Adam Furmanek

Advanced Linux Troubleshooting Techniques for Site Reliability Engineers

In Site Reliability Engineering (SRE), the ability to quickly and effectively troubleshoot issues within Linux systems is crucial. This article explores advanced troubleshooting techniques beyond basic tools and commands, focusing on kernel debugging, system call tracing, performance analysis, and using the Extended Berkeley Packet Filter (eBPF) for real-time data gathering. Kernel Debugging Kernel debugging is a fundamental skill for any SRE working with Linux. It allows for deep inspection of the kernel's behavior, which is critical when diagnosing system crashes or performance bottlenecks. Tools and Techniques GDB (GNU Debugger) GDB can debug kernel modules and the Linux kernel. It allows setting breakpoints, stepping through the code, and inspecting variables. GNU Debugger Official Documentation: This is the official documentation for GNU Debugger, providing a comprehensive overview of its features. KGDB The kernel debugger allows the kernel to be debugged using GDB over a serial connection or a network. Using kgdb, kdb, and the kernel debugger internals provides a detailed explanation of how kgdb can be enabled and configured. Dynamic Debugging (dyndbg) Linux's dynamic debug feature enables real-time debugging messages that help trace kernel operations without rebooting the system. The official Dynamic Debug page describes how to use the dynamic debug (dyndbg) feature. Tracing System Calls With strace strace is a powerful diagnostic tool that monitors the system calls used by a program and the signals received by a program. It is instrumental in understanding the interaction between applications and the Linux kernel. Usage To trace system calls, strace can be attached to a running process or start a new process under strace. It logs all system calls, which can be analyzed to find faults in system operations. Example: Shell root@ubuntu:~# strace -p 2009 strace: Process 2009 attached munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 mmap(NULL, 134221824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xe02057400000 munmap(0xe02057400000, 134221824) = 0 In the above example, the -p flag is the process, and 2009 is the pid. Similarly, you can use the -o flag to log the output to a file instead of dumping everything on the screen. You can review the following article to understand system calls on Linux with strace. Performance Analysis With perf perf is a versatile tool used for system performance analysis. It provides a rich set of commands to collect, analyze, and report on hardware and software events. Key Features perf record: Gathers performance data into a file, perf.data, which can be further analyzed using perf report to identify hotspots perf report: This report analyzes the data collected by perf record and displays where most of the time was spent, helping identify performance bottlenecks. Event-based sampling: perf can record data based on specific events, such as cache misses or CPU cycles, which helps pinpoint performance issues more accurately. Example: Shell root@ubuntu:/tmp# perf record ^C[ perf record: Woken up 17 times to write data ] [ perf record: Captured and wrote 4.619 MB perf.data (83123 samples) ] root@ubuntu:/tmp# root@ubuntu:/tmp# perf report Samples: 83K of event 'cpu-clock:ppp', Event count (approx.): 20780750000 Overhead Command Shared Object Symbol 17.74% swapper [kernel.kallsyms] [k] cpuidle_idle_call 8.36% stress [kernel.kallsyms] [k] __do_softirq 7.17% stress [kernel.kallsyms] [k] finish_task_switch.isra.0 6.90% stress [kernel.kallsyms] [k] el0_da 5.73% stress libc.so.6 [.] random_r 3.92% stress [kernel.kallsyms] [k] flush_end_io 3.87% stress libc.so.6 [.] random 3.71% stress libc.so.6 [.] 0x00000000001405bc 2.71% kworker/0:2H-kb [kernel.kallsyms] [k] ata_scsi_queuecmd 2.58% stress libm.so.6 [.] __sqrt_finite 2.45% stress stress [.] 0x0000000000000f14 1.62% stress stress [.] 0x000000000000168c 1.46% stress [kernel.kallsyms] [k] __pi_clear_page 1.37% stress libc.so.6 [.] rand 1.34% stress libc.so.6 [.] 0x00000000001405c4 1.22% stress stress [.] 0x0000000000000e94 1.20% stress [kernel.kallsyms] [k] folio_batch_move_lru 1.20% stress stress [.] 0x0000000000000f10 1.16% stress libc.so.6 [.] 0x00000000001408d4 0.84% stress [kernel.kallsyms] [k] handle_mm_fault 0.77% stress [kernel.kallsyms] [k] release_pages 0.65% stress [kernel.kallsyms] [k] super_lock 0.62% stress [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.61% stress [kernel.kallsyms] [k] blk_done_softirq 0.61% stress [kernel.kallsyms] [k] _raw_spin_lock 0.60% stress [kernel.kallsyms] [k] folio_add_lru 0.58% kworker/0:2H-kb [kernel.kallsyms] [k] finish_task_switch.isra.0 0.55% stress [kernel.kallsyms] [k] __rcu_read_lock 0.52% stress [kernel.kallsyms] [k] percpu_ref_put_many.constprop.0 0.46% stress stress [.] 0x00000000000016e0 0.45% stress [kernel.kallsyms] [k] __rcu_read_unlock 0.45% stress [kernel.kallsyms] [k] dynamic_might_resched 0.42% stress [kernel.kallsyms] [k] _raw_spin_unlock 0.41% stress [kernel.kallsyms] [k] __mod_memcg_lruvec_state 0.40% stress [kernel.kallsyms] [k] mas_walk 0.39% stress [kernel.kallsyms] [k] arch_counter_get_cntvct 0.39% stress [kernel.kallsyms] [k] rwsem_read_trylock 0.39% stress [kernel.kallsyms] [k] up_read 0.38% stress [kernel.kallsyms] [k] down_read 0.37% stress [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.36% stress [kernel.kallsyms] [k] free_unref_page_commit 0.34% stress [kernel.kallsyms] [k] memset 0.32% stress libc.so.6 [.] 0x00000000001408c8 0.30% stress [kernel.kallsyms] [k] sync_inodes_sb 0.29% stress [kernel.kallsyms] [k] iterate_supers 0.29% stress [kernel.kallsyms] [k] percpu_counter_add_batch Real-Time Data Gathering With eBPF eBPF allows for creating small programs that run on the Linux kernel in a sandboxed environment. These programs can track system calls and network messages, providing real-time insights into system behavior. Applications Network monitoring: eBPF can monitor network traffic in real-time, providing insights into packet flow and protocol usage without significant performance overhead. Security: eBPF helps implement security policies by monitoring system calls and network activity to detect and prevent malicious activities. Performance monitoring: It can track application performance by monitoring function calls and system resource usage, helping SREs optimize performance. Conclusion Advanced troubleshooting in Linux involves a combination of tools and techniques that provide deep insights into system operations. Tools like GDB, strace, perf, and eBPF are essential for any SRE looking to enhance their troubleshooting capabilities. By leveraging these tools, SREs can ensure the high reliability and performance of Linux systems in production environments.

By Prashanth Ravula

CORE

Analyze Your ALB/NLB Logs With ClickHouse

In the dynamic world of cloud computing, data engineers are constantly challenged with managing and analyzing vast amounts of data. A critical aspect of this challenge is effectively handling AWS Load Balancer Logs. This article examines the integration of AWS Load Balancer Logs with ClickHouse for efficient log analysis. We start by exploring AWS’s method of storing these logs in S3 and its queuing system for data management. The focus then shifts to setting up a log analysis framework using S3 and ClickHouse, highlighting the process with Terraform. The goal is to provide a clear and practical guide for implementing a scalable solution for analyzing AWS NLB or ALB access logs in real time. To understand the application of this process, consider a standard application using an AWS Load Balancer. Load Balancers, as integral components of AWS services, direct logs to an S3 bucket. This article will guide you through each step of the process, demonstrating how to make these crucial load-balancer logs available for real-time analysis in ClickHouse, facilitated by Terraform. However, before delving into the specifics of Terraform’s capabilities, it’s important to first comprehend the existing infrastructure and the critical Terraform configurations that enable the interaction between S3 and SQS for the ALB. Setting Up the S3 Log Storage Begin by establishing an S3 bucket for ALB log storage. This initial step is vital and involves linking an S3 bucket to your ALB. The process starts with creating an S3 Bucket, as demonstrated in the provided code snippet (see /example_projects/transfer/nlb_observability_stack/s3.tf#L1-L3). ProtoBuf resource "aws_s3_bucket" "nlb_logs" { bucket = var.bucket_name } The code snippet demonstrates the initial step of establishing an S3 bucket. This bucket is specifically configured for storing AWS ALB logs, serving as the primary repository for these logs. ProtoBuf resource "aws_lb" "alb" { /* your config */ dynamic "access_logs" { for_each = var.access_logs_bucket != null ? { enabled = true } : {} content { enabled = true bucket = var.bucket_name prefix = var.access_logs_bucket_prefix } } } Next, we configure an SQS queue that works in tandem with the S3 bucket. The configuration details for the SQS queue are outlined here. ProtoBuf resource "aws_sqs_queue" "nlb_logs_queue" { name = var.sqs_name policy = <<POLICY { "Version": "2012-10-17", "Id": "sqspolicy", "Statement": [ { "Effect": "Allow", "Principal": "*", "Action": "sqs:SendMessage", "Resource": "arn:aws:sqs:*:*:${var.sqs_name}", "Condition": { "ArnEquals": { "aws:SourceArn": "${aws_s3_bucket.nlb_logs.arn}" } } } ] } POLICY } This code initiates the creation of an SQS queue, facilitating the seamless delivery of ALB logs to the designated S3 bucket. As logs are delivered, they are automatically organized within a dedicated folder: Regularly generated new log files demand a streamlined approach for notification and processing. To establish a seamless notification channel, we'll configure an optimal push notification system via SQS. Referencing the guidelines outlined in Amazon S3's notification configuration documentation, our next step involves the creation of an SQS queue. This queue will serve as the conduit for receiving timely notifications, ensuring prompt handling and processing of newly generated log files within our S3 bucket. This linkage is solidified through the creation of the SQS queue (see /example_projects/transfer/nlb_observability_stack/s3.tf#L54-L61). ProtoBuf resource "aws_s3_bucket_notification" "nlb_logs_bucket_notification" { bucket = aws_s3_bucket.nlb_logs.id queue { queue_arn = aws_sqs_queue.nlb_logs_queue.arn events = ["s3:ObjectCreated:*"] } } The configurations established thus far form the core infrastructure for our log storage system. We have methodically set up the S3 bucket, configured the SQS queue, and carefully linked them. This systematic approach lays the groundwork for efficient log management and processing, ensuring that each component functions cohesively in the following orchestrated setup: The illustration above showcases the composed architecture, where the S3 bucket, SQS queue, and their interconnection stand as pivotal components for storing and managing logs effectively within the AWS environment. Logs are now in your S3 bucket, but reading these logs may be challenging. Let’s take a look at a data sample: Plain Text tls 2.0 2024-01-02T23:58:58 net/preprod-public-api-dt-tls/9f8794be28ab2534 4d9af2ddde90eb82 84.247.112.144:33342 10.0.223.207:443 244 121 0 15 - arn:aws:acm:eu-central-1:840525340941:certificate/5240a1e4-c7fe-44c1-9d89-c256213c5d23 - ECDHE-RSA-AES128-GCM-SHA256 tlsv12 - 18.193.17.109 - - "%ef%b5%bd%8" 2024-01-02T23:58:58 The snippet above represents a sample of the log data residing within the S3 bucket. Understanding this data's format and content will help us to build an efficient strategy to parse and store it. Let’s move this data to DoubleCloud Managed Clickhouse. Configuring VPC and ClickHouse With DoubleCloud The next step involves adding a Virtual Private Cloud (VPC) and a managed ClickHouse instance. These will act as the primary storage systems for our logs, ensuring secure and efficient log management (see /example_projects/transfer/nlb_observability_stack/network.tf#L1-L7). ProtoBuf resource "doublecloud_network" "nlb-network" { project_id = var.project_id name = var.network_name region_id = var.region cloud_type = var.cloud_type ipv4_cidr_block = var.ipv4_cidr } Next, we’ll demonstrate how to integrate a VPC and ClickHouse into our log storage setup. The following step is to establish a ClickHouse instance within this VPC, ensuring a seamless and secure storage solution for our logs (see /example_projects/transfer/nlb_observability_stack/ch.tf#L1-L35). ProtoBuf resource "doublecloud_clickhouse_cluster" "nlb-logs-clickhouse-cluster" { project_id = var.project_id name = var.clickhouse_cluster_name region_id = var.region cloud_type = var.cloud_type network_id = resource.doublecloud_network.nlb-network.id resources { clickhouse { resource_preset_id = var.clickhouse_cluster_resource_preset disk_size = 34359738368 replica_count = 1 } } config { log_level = "LOG_LEVEL_INFORMATION" max_connections = 120 } access { data_services = ["transfer"] ipv4_cidr_blocks = [ { value = var.ipv4_cidr description = "VPC CIDR" } ] } } data "doublecloud_clickhouse" "nlb-logs-clickhouse" { project_id = var.project_id id = doublecloud_clickhouse_cluster.nlb-logs-clickhouse-cluster.id } Integrating S3 Logs With ClickHouse To link S3 and ClickHouse, we utilize DoubleCloud Transfer, an ELT (Extract, Load, Transform) tool. The setup for DoubleCloud Transfer includes configuring both the source and target endpoints. Below is the Terraform code outlining the setup for the source endpoint (see /example_projects/transfer/nlb_observability_stack/transfer.tf#L1-L197). ProtoBuf resource "doublecloud_transfer_endpoint" "nlb-s3-s32ch-source" { name = var.transfer_source_name project_id = var.project_id settings { object_storage_source { provider { bucket = var.bucket_name path_prefix = var.bucket_prefix aws_access_key_id = var.aws_access_key_id aws_secret_access_key = var.aws_access_key_secret region = var.region endpoint = var.endpoint use_ssl = true verify_ssl_cert = true } format { csv { delimiter = " " // space as delimiter advanced_options { } additional_options { } } } event_source { sqs { queue_name = var.sqs_name } } result_table { add_system_cols = true table_name = var.transfer_source_table_name table_namespace = var.transfer_source_table_namespace } result_schema { data_schema { fields { field { name = "type" type = "string" required = false key = false path = "0" } field { name = "version" type = "string" required = false key = false path = "1" } /* Rest of Fields */ field { name = "tls_connection_creation_time" type = "datetime" required = false key = false path = "21" } } } } } } } This Terraform snippet details the setup of the source endpoint, including S3 connection specifications, data format, SQS queue for event notifications, and the schema for data in the S3 bucket. Next, we focus on establishing the target endpoint, which is straightforward with ClickHouse (see /example_projects/transfer/nlb_observability_stack/transfer.tf#L199-L215). ProtoBuf resource "doublecloud_transfer_endpoint" "nlb-ch-s32ch-target" { name = var.transfer_target_name project_id = var.project_id settings { clickhouse_target { clickhouse_cleanup_policy = "DROP" connection { address { cluster_id = doublecloud_clickhouse_cluster.nlb-logs-clickhouse-cluster.id } database = "default" password = data.doublecloud_clickhouse.nlb-logs-clickhouse.connection_info.password user = data.doublecloud_clickhouse.nlb-logs-clickhouse.connection_info.user } } } } The preceding code snippets for the source and target endpoints can now be combined to create a complete transfer configuration, as demonstrated in the following Terraform snippet (see /example_projects/transfer/nlb_observability_stack/transfer.tf#L217-L224). ProtoBuf resource "doublecloud_transfer" "nlb-logs-s32ch" { name = var.transfer_name project_id = var.project_id source = doublecloud_transfer_endpoint.nlb-s3-s32ch-source.id target = doublecloud_transfer_endpoint.nlb-ch-s32ch-target.id type = "INCREMENT_ONLY" activated = false } With the establishment of this transfer, a comprehensive delivery pipeline takes shape: The illustration above represents the culmination of our efforts — a complete delivery pipeline primed for seamless data flow. This integrated system, incorporating S3, SQS, VPC, ClickHouse, and the orchestrated configurations, stands ready to handle, process, and analyze log data efficiently and effectively at any scale. Exploring Logs in ClickHouse With ClickHouse set up, we now turn our attention to analyzing the data. This section guides you through querying your structured logs to extract valuable insights from the well-organized dataset. To begin interacting with your newly created database, the ClickHouse-client tool can be utilized: Shell clickhouse-client \ --host $CH_HOST \ --port 9440 \ --secure \ --user admin \ --password $CH_PASSWORD Begin by assessing the overall log count in your dataset. A straightforward query in ClickHouse will help you understand the scope of data you’re dealing with, providing a baseline for further analysis. Shell SELECT count(*) FROM logs_alb Query id: 6cf59405-2a61-451b-9579-a7d340c8fd5c ┌──count()─┐ │ 15935887 │ └──────────┘ 1 row in set. Elapsed: 0.457 sec. Now, we'll focus on retrieving a specific row from our dataset. Executing this targeted query allows us to inspect the contents of an individual log entry in detail. Shell SELECT * FROM logs_alb LIMIT 1 FORMAT Vertical Query id: 44fc6045-a5be-47e2-8482-3033efb58206 Row 1: ────── type: tls version: 2.0 time: 2023-11-20 21:05:01 elb: net/*****/***** listener: 92143215dc51bb35 client_port: 10.0.246.57:55534 destination_port: 10.0.39.32:443 connection_time: 1 tls_handshake_time: - received_bytes: 0 sent_bytes: 0 incoming_tls_alert: - chosen_cert_arn: - chosen_cert_serial: - tls_cipher: - tls_protocol_version: - tls_named_group: - domain_name: - alpn_fe_protocol: - alpn_be_protocol: - alpn_client_preference_list: - tls_connection_creation_time: 2023-11-20 21:05:01 __file_name: api/AWSLogs/******/elasticloadbalancing/eu-central-1/2023/11/20/****-central-1_net.****.log.gz __row_index: 1 __data_transfer_commit_time: 1700514476000000000 __data_transfer_delete_time: 0 1 row in set. Elapsed: 0.598 sec. Next, we'll conduct a simple yet revealing analysis. By running a “group by” query, we aim to identify the most frequently accessed destination ports in our dataset. Shell SELECT destination_port, count(*) FROM logs_alb GROUP BY destination_port Query id: a4ab55db-9208-484f-b019-a5c13d779063 ┌─destination_port──┬─count()─┐ │ 10.0.234.156:443 │ 10148 │ │ 10.0.205.254:443 │ 12639 │ │ 10.0.209.51:443 │ 13586 │ │ 10.0.223.207:443 │ 10125 │ │ 10.0.39.32:443 │ 4860701 │ │ 10.0.198.39:443 │ 13837 │ │ 10.0.224.240:443 │ 9546 │ │ 10.10.162.244:443 │ 416893 │ │ 10.0.212.130:443 │ 9955 │ │ 10.0.106.172:443 │ 4860359 │ │ 10.10.111.92:443 │ 416908 │ │ 10.0.204.18:443 │ 9789 │ │ 10.10.24.126:443 │ 416881 │ │ 10.0.232.19:443 │ 13603 │ │ 10.0.146.100:443 │ 4862200 │ └───────────────────┴─────────┘ 15 rows in set. Elapsed: 1.101 sec. Processed 15.94 million rows, 405.01 MB (14.48 million rows/s., 368.01 MB/s.) Conclusion This article has outlined a comprehensive approach to analyzing AWS Load Balancer Logs using ClickHouse, facilitated by DoubleCloud Transfer and Terraform. We began with the fundamental setup of S3 and SQS for log storage and notification, before integrating a VPC and ClickHouse for efficient log management. Through practical examples and code snippets, we demonstrated how to configure and utilize these tools for real-time log analysis. The seamless integration of these technologies not only simplifies the log analysis process but also enhances its efficiency, offering insights that are crucial for optimizing cloud operations. Explore the complete example in our Terraform project here for a hands-on experience with log querying in ClickHouse. The power of ClickHouse in processing large datasets, coupled with the flexibility of AWS services, forms a robust solution for modern cloud computing challenges. As cloud technologies continue to evolve, the techniques and methods discussed in this article remain pertinent for IT professionals seeking efficient and scalable solutions for log analysis.

By Andrei Tserakhau

8 Ways Mobile Observability Is the Opposite of Backend Observability

Using a hammer to pound a screw into the wall will work, but it's not really a great way to get the job done and probably will lead you with damage you wish hadn't occurred. Similarly, using monitoring tools initially designed for the observability of backend applications to monitor your mobile applications will leave you wishing you had reached for the screwdriver instead of the proverbial hammer. Often the observability challenges for mobile applications are pretty much the opposite of what they are for backend monitoring. Let's take a look at 8 examples where that is the case. To make this more concrete, we will use a typical e-commerce mobile application and the backend application that handles its requests as an example to illustrate these differences. However, the comparisons in these examples are broadly applicable to other types of mobile applications and backend systems that you are running. Duration of Interactions For high-traffic services in your backend application, you are looking to have requests that take milliseconds to run on average, and you want to scale to handle thousands of requests per second. You don't maintain a state between requests, and it's uncommon for prior requests to cause bugs in the current request. Some examples of service calls here include: Getting a list of specific products. Completing a purchase. Fetching a list of alternative products for a given product. The data needed to troubleshoot issues here is likely within the request or the supporting infrastructure. You can trace the individual calls and connected service calls, and then inspect them to look for failure points. However, in your e-commerce mobile app, a single session lasts from multiple seconds to minutes, or even hours. If you want to understand why purchases are failing, the problem could stem from many application and device factors: Did the user background the app between adding items to the cart and attempting to complete the purchase? Certain data might be lost during such app state transitions. Did the app run out of memory in a product list view due to excessive loading and retention of product images not appropriately scaled for the device? Did the app not complete the payment processing in a timely manner, so the user force quit the app? Did the device lose network connectivity and fail to gracefully recover? Did the app crash during the purchase flow, and the user decided to purchase elsewhere? Since mobile is such a dynamic environment, tracking down the root cause of a drop in purchases could have many root causes that fall outside of the actual service calls. The span over which errors can be introduced is far greater than in backend interactions. How you visualize and interpret data becomes very different when your expectation is that issues can evolve over minutes and not over milliseconds. Session Complexity When you envision what a complete session is for your backend application, it frequently boils down to responding to a request from the client. The external variables at play are mostly your infrastructure’s health and capacity. In your e-commerce mobile app, a complete user experience can span multiple sessions across varying lengths of time. The user could launch, background, and then launch the app again over multiple days to complete a single purchase. Key functionality can also take place while the app is backgrounded, such as sending push notifications and pulling fresh products and deals so that the user is always getting the most up-to-date data whenever they launch the app. Some challenges when troubleshooting apps with complex interactions include: Stitching together multiple app sessions to get the complete user experience context. Understanding how app performance is impacted by different launch conditions like cold starts versus reused app processes. Tracking problems with failed or outdated app states that were loaded far earlier than when the resulting error happened. App sessions also cannot be easily modeled as a series of traces, so there are data and visualization challenges when dealing with longer, more complex experiences. Uncontrolled Devices You control the infrastructure that your backend applications run on. As such, it would be a rookie mistake for a DevOps team to, for example, not be aware of servers that are about to run out of disk space, and most people would forgive the backend monitoring agent for not working as expected if a server ran out of disk space. That is not the case for your e-commerce mobile app that runs on devices that you have no control over. People buy devices with the least amount of storage they think they can get away with and promptly fill them up with apps and media. You have to build resilient SDKs that can gracefully handle these situations and still report as complete a picture as possible. You have to find the right balance between retaining relevant information on the device – you may not have network connectivity to send it right away – and making the lack of disk space worse by excessively adding to the data stored on the device. Heterogeneous Devices Not only do you have no control over devices, but also they are far from homogeneous. In a backend environment, you are likely to have a small set of different machine types. For an Android app, you will have it run on tens of thousands of device models, running a variety of OS versions, so you end up with more complicated variables when analyzing the collected data. Cardinality for certain dimensions will grow in ways that just would not be seen in backend applications. Some examples of device-specific issues include: Your developers and QA team have modern devices for testing, which can handle the size of the product images in a list view. However, many customer devices have less RAM and end up with an out-of-memory crash. A manufacturer introduced a bug in their custom Android version, so customers encounter a crash that only affects your app on specific OS version/manufacturer combinations. The UI stutters on some devices because they have old CPU and GPU chipsets that cannot handle the complexity of your application. With so many combinations of device variables, your engineering team needs deep insights into affected user segments to avoid costly issue investigations. Otherwise, they will spend time looking for root causes in code when seeing the holistic picture of impact would streamline their resolution efforts. Network Connectivity Your e-commerce backend application operates with the explicit assumption of constant connectivity. Failures frequently are a capacity problem, which can be alleviated by sizing your infrastructure to handle traffic spikes. Outright losing connectivity occurs mostly during cloud provider outages, which are exceedingly rare. However, constant network connectivity in mobile is never guaranteed. Maybe your app has a great network connection when it starts, then completely drops the network connection, and then gets it back but experiences significant lag and packet loss. Your mobile app observability solution needs to provide insight into how your app deals with these challenging conditions. Some examples of network connectivity issues include: The app cannot launch without connectivity because the download of critical data is required to enter the main application flow. The device loses connection as the user tries to make a purchase, but the user is not greeted with a prompt about the issue. To the user, it still looks like the app is attempting to complete the purchase. They get frustrated and force quit because they don’t know the source of the issue. The app does not effectively cache images, so customers in locales where bandwidth is a scarce resource stop using your application. Since problems can occur during connectivity switches, you need visibility into entire user experiences to track down problems. A great example is content refreshes or data synchronizations that are scheduled as background tasks. Understanding where failures happen under specific network conditions allows your engineering team to understand the root cause of network-related problems and introduce fixes that gracefully handle the loss of network connectivity. Data Delays Many backend observability tools will only accept data that is delayed by minutes or at most a few hours. This works fine for backend applications since the expectation is for the servers to not lose connectivity. The opposite is true in mobile, where the expectation is for connectivity to be lost intermittently and for a significant percentage of data to be delayed. As an example, your engineering team notices a spike in crashes, then launches an investigation and puts out a fix in a new version. You notice the crash rate go down, and everyone is happy. However, users on the previous version that crashed, who were too frustrated to immediately relaunch your app after it crashed, have decided to give it another go a day or two later. They launch the app, which sends a crash report from the device. If your observability tool marks those crashes as having just occurred, you might think the issue is still ongoing, even though you released a fix for it. Ecosystem Limitations When you build a backend application, you get to choose the environment that it runs in. The limitations on what you monitor and how you monitor it are largely dictated by the overhead it introduces and the time it takes to implement it. On mobile, you are operating in ecosystems defined by the device manufacturers or maintainers of the ecosystem, and there are restrictions that you need to find creative solutions to in order to get the data that you need. Certain metrics are forced upon you, such as the crash and Application Not Responding (ANR) rates that, in Android, impact your ranking and discoverability on the Google Play Store. The tricky part here is that the ecosystems have the ability to collect data from a system perspective, while you only have the ability to collect data from the perspective of your application. That means you have to get pretty inventive to find ways to collect the data that helps you solve certain problems, such as ANRs on Android. To provide a bit more color here, ANRs occur when an Android app has a prolonged app freeze that causes the OS to display a prompt that asks the user if they want to terminate your app. Effectively, the app freezes for so long that the user is forced to crash their current app session. From a data collection perspective, the Google Play Console treats ANRs exactly like a crash, capturing a single stack trace at the end of the session. However, app freezes are not deterministic and can stem from endless combinations of initial conditions that led to the main thread being blocked, including: Third-party SDKs (like ad vendors) conflicting with each other Loading heavy resources like large images or unoptimized ads Data synchronizations hitting slow backend service calls Heavy animations or UI work Slow responses to system broadcasts With so many variables at play, your best bet is to capture data as soon as the app freezes and then examine these code patterns across your users to find the most common causes. Backend observability solutions are simply not built for these types of nuanced mobile data capture. Deploying New Code If you discover an issue in your backend application, code can be consistently redeployed with all instances running new code. That means, if you spot an issue that’s preventing the system from completing purchases, the biggest delay is in tracking down the root cause and writing the code to fix it. In mobile, you can’t control when people upgrade their app version. There will be a long tail of old versions out in the wild. It is not unusual for a large, established application to have over a hundred different versions used in a single day. As such, it’s vital that you minimize the number of users who download bad app versions. Slow rollouts and real-time visibility into user experiences can help you proactively address issues before they become widespread. Your mobile observability solution should surface signals that allow for early issue detection for every type of broken experience, including: Performance issues like slow startups or purchase flows Stability issues like crash, error, or ANR spikes User frustration issues like abandons and force quits Device resource issues like excessive memory, CPU, and battery consumption Network issues like failing first and third-party endpoints Mobile is so complex that engineering teams frequently must add logs and release new versions to build enough context to uncover root causes. This approach is riddled with guesswork, resulting in additional releases – some of them which will introduce, rather than solve, problems – out in the wild. You want your mobile observability solution to provide complete visibility so that your engineering team can get to solutions faster and without sacrificing feature velocity. Closing Thoughts At first glance, the challenges of achieving observability in a mobile application may not seem all that different than doing so for a backend application – collect some data, store it in a database, visualize it in a dashboard – but, on closer inspection, the nuances of each of those steps is quite different for the two domains. Trying to use the same tool for monitoring your mobile application as you do for your backend application is better than having no visibility, but it will ultimately deprive you of the full clarity of what is happening in your mobile app. Your developers will take longer to figure out how to solve problems that exist if they can even detect that they are occurring. If you rely on a backend observability approach for your mobile applications, there are mobile-first approaches that can eliminate toil and guesswork while integrating across your existing tech stack for full stack visibility. In addition, given the different challenges in collecting data for mobile apps versus backend systems, open-source communities, and governing groups are actively working on what mobile telemetry standards should be in order to power the future of mobile observability.

By Fredric Newberg

Monitoring and Observability

DZone's Featured Monitoring and Observability Resources

Top Monitoring and Observability Experts

The Latest Monitoring and Observability Topics