The final step in the SDLC, and arguably the most crucial, is the testing, deployment, and maintenance of development environments and applications. DZone's category for these SDLC stages serves as the pinnacle of application planning, design, and coding. The Zones in this category offer invaluable insights to help developers test, observe, deliver, deploy, and maintain their development and production environments.
In the SDLC, deployment is the final lever that must be pulled to make an application or system ready for use. Whether it's a bug fix or new release, the deployment phase is the culminating event to see how something works in production. This Zone covers resources on all developers’ deployment necessities, including configuration management, pull requests, version control, package managers, and more.
The cultural movement that is DevOps — which, in short, encourages close collaboration among developers, IT operations, and system admins — also encompasses a set of tools, techniques, and practices. As part of DevOps, the CI/CD process incorporates automation into the SDLC, allowing teams to integrate and deliver incremental changes iteratively and at a quicker pace. Together, these human- and technology-oriented elements enable smooth, fast, and quality software releases. This Zone is your go-to source on all things DevOps and CI/CD (end to end!).
A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.
Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
The Testing, Tools, and Frameworks Zone encapsulates one of the final stages of the SDLC as it ensures that your application and/or environment is ready for deployment. From walking you through the tools and frameworks tailored to your specific development needs to leveraging testing practices to evaluate and verify that your product or application does what it is required to do, this Zone covers everything you need to set yourself up for success.
DevOps
The DevOps movement has paved the way for CI/CD and streamlined application delivery and release orchestration. These nuanced methodologies have not only increased the scale and speed at which we release software, but also redistributed responsibilities onto the developer and led to innovation and automation throughout the SDLC. DZone's 2023 DevOps: CI/CD, Application Delivery, and Release Orchestration Trend Report explores these derivatives of DevOps by diving into how AIOps and MLOps practices affect CI/CD, the proper way to build an effective CI/CD pipeline, strategies for source code management and branching for GitOps and CI/CD, and more. Our research builds on previous years with its focus on the challenges of CI/CD, a responsibility assessment, and the impact of release strategies, to name a few. The goal of this Trend Report is to provide developers with the information they need to further innovate on their integration and delivery pipelines.
Introduction to Secrets Management

In the world of DevSecOps, where speed, agility, and security are paramount, managing secrets effectively is crucial. Secrets, such as passwords, API keys, tokens, and certificates, are sensitive pieces of information that, if exposed, can lead to severe security breaches. To mitigate these risks, organizations are turning to secrets management solutions. These solutions help securely store, access, and manage secrets throughout the software development lifecycle, ensuring they are protected from unauthorized access and misuse. This article provides an in-depth overview of secrets management in DevSecOps, covering key concepts, common challenges, best practices, and available tools.

Security Risks in Secrets Management

The lack of secrets management poses several challenges. Primarily, your organization might already have numerous secrets stored across the codebase. Apart from the ongoing risk of exposure, keeping secrets within your code promotes other insecure practices such as reusing secrets, employing weak passwords, and neglecting to rotate or revoke secrets due to the extensive code modifications that would be needed. Below are some of the potential risks of improper secrets management.

Data Breaches

If secrets are not properly managed, they can be exposed, leading to unauthorized access and potential data breaches.

Example scenario: A Software-as-a-Service (SaaS) company uses a popular CI/CD platform to automate its software development and deployment processes. As part of their DevSecOps practices, they store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines.

Issue: Unfortunately, the CI/CD platform they use experiences a security vulnerability that allows attackers to gain unauthorized access to the secrets management tool's API. This vulnerability goes undetected by the company's security monitoring systems.

Consequence: Attackers exploit the vulnerability and gain access to the secrets stored in the management tool. With these credentials, they are able to access the company's production systems and databases. They exfiltrate sensitive customer data, including personally identifiable information (PII) and financial records.

Impact: The data breach leads to significant financial losses for the company due to regulatory fines, legal fees, and loss of customer trust. Additionally, the company's reputation is tarnished, leading to a decrease in customer retention and potential business partnerships.

Preventive measures: To prevent such data breaches, the company could have implemented the following:
- Regularly auditing and monitoring access to the secrets management tool to detect unauthorized access (see the sketch after this list)
- Implementing multi-factor authentication (MFA) for accessing the secrets management tool
- Ensuring that the secrets management tool is regularly patched and updated to address any security vulnerabilities
- Limiting access to secrets based on the principle of least privilege, ensuring that only authorized users and systems have access to sensitive credentials
- Implementing strong encryption for storing secrets to mitigate the impact of unauthorized access
- Conducting regular security assessments and penetration testing to identify and address potential security vulnerabilities in the CI/CD platform and associated tools
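As a concrete illustration of the first measure, here is a minimal sketch of enabling audit logging, assuming HashiCorp Vault is the secrets manager (the log path is illustrative):

```shell
# Record every request to the secrets manager in an append-only audit log
vault audit enable file file_path=/var/log/vault_audit.log

# Confirm which audit devices are enabled
vault audit list
```

Shipping this log to a monitoring system is what turns it into a detection control: unexpected reads of production secrets can then trigger alerts.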
Credential Theft

Attackers may steal secrets, such as API keys or passwords, to gain unauthorized access to systems or resources.

Example scenario: A fintech startup uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as database passwords and API keys, in a secrets management tool integrated with their pipelines.

Issue: An attacker gains access to the company's internal network by exploiting a vulnerability in an outdated web server. Once inside the network, the attacker uses a variety of techniques, such as phishing and social engineering, to gain access to a developer's workstation.

Consequence: The attacker discovers that the developer has stored plaintext files containing sensitive credentials, including database passwords and API keys, on their desktop. The developer had mistakenly saved these files for convenience instead of keeping them securely in the secrets management tool.

Impact: With access to the sensitive credentials, the attacker gains unauthorized access to the company's databases and other systems. They exfiltrate sensitive customer data, including financial records and personal information, leading to regulatory fines and damage to the company's reputation.

Preventive measures: To prevent such credential theft incidents, the fintech startup could have implemented the following:
- Educating developers and employees about the importance of securely storing credentials and the risks of leaving them in plaintext files
- Implementing strict access controls and auditing mechanisms for accessing and managing secrets in the secrets management tool
- Using encryption to store sensitive credentials in the secrets management tool, ensuring that even if credentials are stolen, they cannot be easily used without the decryption keys
- Regularly rotating credentials and monitoring for unusual or unauthorized access patterns to detect potential credential theft incidents early (see the sketch after this list)
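As a sketch of the rotation measure, assuming AWS Secrets Manager with a rotation Lambda already configured (the secret name is illustrative):

```shell
# Rotate the secret immediately and then automatically every 30 days
aws secretsmanager rotate-secret \
    --secret-id prod/myapp/db-credentials \
    --rotation-rules AutomaticallyAfterDays=30
```

Because rotation invalidates stolen credentials on a fixed schedule, it bounds the window in which a leaked password remains useful to an attacker.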
Misconfiguration

Improperly configured secrets management systems can lead to accidental exposure of secrets.

Example scenario: A healthcare organization uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as database passwords and API keys, in a secrets management tool integrated with their pipelines.

Issue: A developer inadvertently misconfigures the permissions on the secrets management tool, allowing unintended access to sensitive credentials. The misconfiguration occurs when the developer sets overly permissive access controls, granting access to a broader group of users than intended.

Consequence: An attacker discovers the misconfigured access controls and gains unauthorized access to the secrets management tool. With access to sensitive credentials, the attacker can now access the healthcare organization's databases and other systems, potentially leading to data breaches and privacy violations.

Impact: The healthcare organization suffers reputational damage and financial losses due to the data breach. They may also face regulatory fines for failing to protect sensitive information.

Preventive measures: To prevent such misconfiguration incidents, the healthcare organization could have implemented the following:
- Implementing least privilege access controls to ensure that only authorized users and systems have access to sensitive credentials
- Regularly auditing and monitoring access to the secrets management tool to detect and remediate misconfigurations
- Implementing automated checks and policies to enforce proper access controls and configurations for secrets management
- Providing training and guidance to developers and administrators on best practices for securely configuring and managing access to secrets

Compliance Violations

Failure to properly manage secrets can lead to violations of regulations such as GDPR, HIPAA, or PCI DSS.

Example scenario: A financial services company uses a popular CI/CD platform to automate their software development and deployment processes. They store sensitive credentials, such as encryption keys and API tokens, in a secrets management tool integrated with their pipelines.

Issue: The financial services company fails to adhere to regulatory requirements for managing and protecting sensitive information. Specifically, they do not implement proper encryption for storing sensitive credentials and do not maintain proper access controls for managing secrets.

Consequence: Regulatory authorities conduct an audit of the company's security practices and discover compliance violations related to secrets management. The company is found to be non-compliant with regulations such as PCI DSS (Payment Card Industry Data Security Standard) and GDPR (General Data Protection Regulation).

Impact: The financial services company faces significant financial penalties for non-compliance with regulatory requirements. Additionally, the company's reputation is damaged, leading to a loss of customer trust and potential legal consequences.

Preventive measures: To prevent such compliance violations, the financial services company could have implemented the following:
- Implementing encryption for storing sensitive credentials in the secrets management tool to ensure compliance with data protection regulations
- Implementing strict access controls and auditing mechanisms for managing and accessing secrets to prevent unauthorized access
- Conducting regular compliance audits and assessments to identify and address any non-compliance issues related to secrets management

Lack of Accountability

Without proper auditing and monitoring, it can be difficult to track who accessed or modified secrets, leading to a lack of accountability.

Example scenario: A technology company uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines.

Issue: The company does not establish clear ownership and accountability for managing and protecting secrets. There is no designated individual or team responsible for ensuring that proper security practices are followed when storing and accessing secrets.

Consequence: Due to the lack of accountability, there is no oversight or monitoring of access to sensitive credentials. As a result, developers and administrators have unrestricted access to secrets, increasing the risk of unauthorized access and data breaches.

Impact: The lack of accountability leads to a data breach in which sensitive credentials are exposed. The company faces financial losses due to regulatory fines, legal fees, and loss of customer trust. Additionally, the company's reputation is damaged, leading to a decrease in customer retention and potential business partnerships.
Preventive measures: To prevent such incidents, the technology company could have implemented the following:
- Designating a specific individual or team responsible for managing and protecting secrets, including implementing and enforcing security policies and procedures
- Implementing access controls and auditing mechanisms to monitor and track access to secrets, ensuring that only authorized users have access
- Providing regular training and awareness programs for employees on the importance of secrets management and security best practices
- Conducting regular security audits and assessments to identify and address any gaps in secrets management practices

Operational Disruption

If secrets are not available when needed, the operation of DevSecOps pipelines and applications can be disrupted.

Example scenario: A financial institution uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as encryption keys and API tokens, in a secrets management tool integrated with their pipelines.

Issue: During a routine update to the secrets management tool, a misconfiguration occurs that causes the tool to become unresponsive. As a result, developers are unable to access the sensitive credentials needed to deploy new applications and services.

Consequence: The operational disruption delays the deployment of critical updates and features, impacting the financial institution's ability to serve its customers effectively. The IT team is forced to troubleshoot the issue, leading to downtime and increased operational costs.

Impact: The operational disruption results in financial losses due to lost productivity and potential revenue. Additionally, the financial institution's reputation is damaged, leading to a loss of customer trust and potential business partnerships.

Preventive measures: To prevent such operational disruptions, the financial institution could have implemented the following:
- Implementing automated backups and disaster recovery procedures for the secrets management tool to quickly restore service in case of a failure (see the sketch after this list)
- Conducting regular testing and monitoring of the secrets management tool to identify and address any performance issues or misconfigurations
- Implementing a rollback plan to quickly revert to a previous version of the secrets management tool in case of a failed update or configuration change
- Establishing clear communication channels and escalation procedures to quickly notify stakeholders and IT teams in case of an operational disruption
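As a sketch of the backup measure, assuming HashiCorp Vault with integrated (Raft) storage; the file names are illustrative:

```shell
# Save an encrypted snapshot of the secrets manager's full state
vault operator raft snapshot save vault-backup-$(date +%F).snap

# In a disaster-recovery runbook, restore onto a rebuilt cluster
vault operator raft snapshot restore vault-backup-2024-05-01.snap
```

Scheduling the save command (e.g., via cron) and periodically testing the restore path is what makes the recovery objective credible.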
Dependency on Third-Party Services

Using third-party secrets management services can introduce dependencies and potential risks if the service becomes unavailable or compromised.

Example scenario: A software development company uses a popular CI/CD platform to automate its software development and deployment processes. They rely on a third-party secrets management tool to store sensitive credentials, such as API keys and database passwords, used in their pipelines.

Issue: The third-party secrets management tool experiences a service outage due to a cyber attack on the service provider's infrastructure. As a result, the software development company is unable to access the sensitive credentials needed to deploy new applications and services.

Consequence: The dependency on the third-party secrets management tool delays the deployment of critical updates and features, impacting the software development company's ability to deliver software on time. The IT team is forced to find alternative ways to manage and store sensitive credentials temporarily.

Impact: The dependency results in financial losses due to lost productivity and potential revenue. Additionally, the software development company's reputation is damaged, leading to a loss of customer trust and potential business partnerships.

Preventive measures: To reduce such dependencies on third-party services, the software development company could have implemented the following:
- Implementing a backup plan for storing and managing sensitive credentials locally in case of a service outage or disruption
- Diversifying the use of secrets management tools by using multiple tools or providers to reduce the impact of a single service outage
- Conducting regular reviews and assessments of third-party service providers to ensure they meet security and reliability requirements
- Implementing a contingency plan to quickly switch to an alternative secrets management tool or provider in case of a service outage or disruption

Insider Threats

Malicious insiders may abuse their access to secrets for personal gain or to harm the organization.

Example scenario: A technology company uses a popular CI/CD platform to automate their software development and deployment processes. They store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines.

Issue: An employee with privileged access to the secrets management tool decides to leave the company and maliciously steals sensitive credentials before leaving. The employee had legitimate access to the secrets management tool as part of their job responsibilities but chose to abuse that access for personal gain.

Consequence: The insider threat leads to the theft of sensitive credentials, which the former employee then uses to gain unauthorized access to the company's systems and data. This unauthorized access can lead to data breaches, financial losses, and damage to the company's reputation.

Impact: The insider threat results in financial losses due to potential data breaches and the need to mitigate the impact of the stolen credentials. Additionally, the company's reputation is damaged, leading to a loss of customer trust and potential legal consequences.

Preventive measures: To prevent insider threats involving secrets management, the technology company could have implemented the following:
- Implementing strict access controls and least privilege principles to limit employees' access to sensitive credentials based on their job responsibilities
- Conducting regular audits and monitoring of access to the secrets management tool to detect and prevent unauthorized access
- Providing regular training and awareness programs for employees on the importance of data security and the risks of insider threats
- Implementing behavioral analytics and anomaly detection mechanisms to identify and respond to suspicious behavior or activities involving sensitive credentials

Best Practices for Secrets Management

Here are some best practices for secrets management in DevSecOps pipelines:
- Use a dedicated secrets management tool: Utilize a specialized tool or service designed for securely storing and managing secrets.
- Encrypt secrets at rest and in transit: Ensure that secrets are encrypted both when stored and when transmitted over the network.
- Use strong access controls: Implement strict access controls to limit who can access secrets and what they can do with them.
- Regularly rotate secrets: Regularly rotate secrets (e.g., passwords, API keys) to minimize the impact of potential compromise.
- Avoid hardcoding secrets: Never hardcode secrets in your code or configuration files. Use environment variables or a secrets management tool instead.
- Use environment-specific secrets: Use different secrets for different environments (e.g., development, staging, production) to minimize the impact of a compromised secret.
- Monitor and audit access: Monitor and audit access to secrets to detect and respond to unauthorized access attempts.
- Automate secrets retrieval: Automate the retrieval of secrets in your CI/CD pipelines to reduce manual intervention and the risk of exposure.
- Regularly review and update policies: Regularly review and update your secrets management policies and procedures to ensure they are up to date and effective.
- Educate and train employees: Educate and train employees on the importance of secrets management and best practices for handling secrets securely.

Use Cases of Secrets Management for Different Tools

Here are the common use cases for different secrets management tools:

IBM Cloud Secrets Manager
- Securely storing and managing API keys
- Managing database credentials
- Storing encryption keys
- Managing certificates
- Integrating with CI/CD pipelines
- Meeting compliance and audit requirements by providing centralized management and auditing of secrets usage
- Dynamically generating and rotating secrets

HashiCorp Vault
- Centralized secrets management for distributed systems
- Dynamic secrets generation and management
- Encryption and access controls for secrets
- Secrets rotation for various types of secrets

AWS Secrets Manager
- Securely store and manage AWS credentials
- Securely store and manage other types of secrets used in AWS services
- Integration with AWS services for seamless access to secrets
- Automatic secrets rotation for supported AWS services

Azure Key Vault
- Centralized secrets management for Azure applications
- Securely store and manage secrets, keys, and certificates
- Encryption and access policies for secrets
- Automated secrets rotation for keys, secrets, and certificates

CyberArk Conjur
- Secrets management and privileged access management
- Secrets retrieval via REST API for integration with CI/CD pipelines
- Secrets versioning and access controls
- Automated secrets rotation using rotation policies and scheduled tasks

Google Cloud Secret Manager
- Centralized secrets management for Google Cloud applications
- Securely store and manage secrets, API keys, and certificates
- Encryption at rest and in transit for secrets
- Automated and manual secrets rotation with integration with Google Cloud Functions

These tools cater to different cloud environments and offer various features for securely managing and rotating secrets based on specific requirements and use cases.

Implement Secrets Management in DevSecOps Pipelines

Understanding CI/CD in DevSecOps

CI/CD in DevSecOps involves automating the build, test, and deployment processes while integrating security practices throughout the pipeline to deliver secure and high-quality software rapidly.

Continuous Integration (CI): CI is the practice of automatically building and testing code changes whenever a developer commits code to the version control system (e.g., Git).
The goal is to quickly detect and fix integration errors.

Continuous Delivery (CD): CD extends CI by automating the process of deploying code changes to testing, staging, and production environments. With CD, every code change that passes the automated tests can potentially be deployed to production.

Continuous Deployment (CD): Continuous deployment goes one step further than continuous delivery by automatically deploying every code change that passes the automated tests to production. This requires a high level of automation and confidence in the automated tests.

Continuous Compliance (CC): CC refers to the practice of integrating compliance checks and controls into the automated CI/CD pipeline. It ensures that software deployments comply with relevant regulations, standards, and internal policies throughout the development lifecycle.

DevSecOps: DevSecOps integrates security practices into the CI/CD pipeline, ensuring that security is built into the software development process from the beginning. This includes performing security testing (e.g., static code analysis, dynamic application security testing) as part of the pipeline and managing secrets securely.

[Figure: the DevSecOps lifecycle]

Implementing Secrets Management in DevSecOps Pipelines

Implementing secrets management in DevSecOps pipelines involves securely handling and storing sensitive information such as API keys, passwords, and certificates. Here's a step-by-step guide:

1. Select a secrets management solution: Choose a secrets management tool that aligns with your organization's security requirements and integrates well with your existing DevSecOps tools and workflows.
2. Identify secrets: Identify the secrets that need to be managed, such as database credentials, API keys, encryption keys, and certificates.
3. Store secrets securely: Use the selected secrets management tool to securely store secrets. Ensure that secrets are encrypted at rest and in transit and that access controls are in place to restrict who can access them.
4. Integrate secrets management into CI/CD pipelines: Update your CI/CD pipeline scripts and configurations to integrate with the secrets management tool. Use the tool's APIs or SDKs to retrieve secrets securely during pipeline execution (see the sketch after this list).
5. Implement access controls: Implement strict access controls to ensure that only authorized users and systems can access secrets. Use role-based access control (RBAC) to manage permissions.
6. Rotate secrets regularly: Regularly rotate secrets to minimize the impact of potential compromise. Automate the rotation process as much as possible to ensure consistency and security.
7. Monitor and audit access: Monitor and audit access to secrets to detect and respond to unauthorized access attempts. Use logging and monitoring tools to track access and usage.
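As a sketch of the integration step, here is how a pipeline job might pull secrets at runtime instead of storing them in the repository. This assumes HashiCorp Vault and the AWS CLI, respectively; the paths, secret names, and deploy script are illustrative:

```shell
# Fetch a database password from HashiCorp Vault into the job's environment
export DB_PASSWORD=$(vault kv get -field=password secret/myapp/db)

# Equivalent retrieval from AWS Secrets Manager
export API_KEY=$(aws secretsmanager get-secret-value \
    --secret-id myapp/api-key --query SecretString --output text)

# The secrets exist only in this job's environment, never in version control
./deploy.sh
```

The same pattern applies to any of the tools listed earlier: the pipeline authenticates to the secrets manager, retrieves what it needs for the current run, and discards it when the job ends.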
Best Practices for Secrets Management in DevSecOps Pipelines

Implementing secrets management in DevSecOps pipelines requires careful consideration to ensure security and efficiency. Here are some best practices:
- Use a secrets management tool: Utilize a dedicated secrets management tool to store and manage secrets securely.
- Encrypt secrets: Encrypt secrets both at rest and in transit to protect them from unauthorized access.
- Avoid hardcoding secrets: Never hardcode secrets in your code or configuration files. Use environment variables or secrets management tools to inject secrets into your CI/CD pipelines.
- Rotate secrets: Implement a secrets rotation policy to regularly rotate secrets, such as passwords and API keys. Automate the rotation process wherever possible to reduce the risk of human error.
- Implement access controls: Use role-based access controls (RBAC) to restrict access to secrets based on the principle of least privilege.
- Monitor and audit access: Enable logging and monitoring to track access to secrets and detect any unauthorized access attempts.
- Automate secrets retrieval: Automate the retrieval of secrets in your CI/CD pipelines to reduce manual intervention and improve security.
- Use secrets injection: Use tools or libraries that support secrets injection (e.g., Kubernetes secrets, Docker secrets) to securely inject secrets into your application during deployment.

Conclusion

Secrets management is a critical aspect of DevSecOps that cannot be overlooked. By implementing best practices such as using dedicated secrets management tools, encrypting secrets, and implementing access controls, organizations can significantly enhance the security of their software development and deployment pipelines. Effective secrets management not only protects sensitive information but also helps in maintaining compliance with regulatory requirements. As DevSecOps continues to evolve, it is essential for organizations to prioritize secrets management as a fundamental part of their security strategy.
As a Site Reliability Engineer, one of the key metrics I use to track the effectiveness of incident management is Mean Time To Recovery (MTTR). Wikipedia defines MTTR as the average time that a service or system takes to recover from any failure. Achieving a low MTTR is key to meeting the service level objectives, and in turn the service level agreements, of any critical production service.

10 Things That Can Help Reduce the Mean Time to Recovery (MTTR)

1. Clearly Defined SLIs

Service level indicators, or SLIs, are the key indicators that measure the health of your service. A few examples of SLIs are error rate, latency, and throughput.

2. Actionable Alerts Based on SLIs

The alert strategy should focus on improving the signal-to-noise ratio of the alerts. The goal is that every alert your team receives should be actionable. Sending too many alerts causes alert fatigue and risks the on-call person ignoring alerts that indicate real issues with the service.

3. Troubleshooting Guides Associated With Alerts

Every alert should have a clearly defined troubleshooting guide on how to triage and mitigate the issue the alert identifies. A good methodology to use while writing these troubleshooting guides is the USE methodology suggested by Brendan Gregg in his book, "Systems Performance." USE stands for Usage, Saturation, and Errors.

4. Practiced Troubleshooting Guides

Practicing troubleshooting guides periodically will help mitigate incidents when they occur. It will also help identify gaps in the TSGs, since services evolve over time. A good time to practice a troubleshooting guide is when a new team member joins the team: they bring a fresh perspective to the TSG, which reduces assumptions about knowledge of the system.

5. Usable Dashboards

The observability strategy should include creating easy-to-use dashboards. The dashboards should have panels covering the key metrics of the service and the health of dependent services, such as upstream and downstream services. Important metrics to include are the golden signals suggested by the Google SRE book: latency, throughput, error rate, and saturation.

6. Automated Actions To Mitigate Issues

Automating certain actions based on metrics and events is key to reducing MTTR. An example is taking servers out of rotation if packet loss is observed from them. This reduces the impact on user experience and reduces MTTR.

7. Failover Rehearsals

In multi-data center architectures, it is crucial to have failover plans defined so that you can recover quickly from the outage of a specific data center. Practicing these failover scenarios periodically helps execute them quickly during a real outage. It also helps identify gaps in the failover plans and provides the chance to update and fix them.

8. Automated Failovers

Once the failover plans are defined, implemented, and practiced, the next step is to automate these failover scenarios based on the health checks of the service in a given data center. This helps mitigate issues faster and thus reduces MTTR.

9. Change Management Process

Changes to production systems are a major cause of outages. It is important to have a well-thought-out change management process in place.
Key elements of the change management process should include clearly defined checklists, change review and approval procedures, automated deployment pipelines with built-in monitoring, and the ability to quickly roll back changes if any issues are observed.

10. Easy-To-Identify Change Lists and Automated Rollbacks

In distributed systems built as microservices, multiple changes can be rolled out continuously. A central system where one can easily see which changes were made during a given period helps identify whether a specific change caused an outage, making it easy to roll that change back.

Conclusion

In this article, I have discussed 10 things that can help reduce the Mean Time To Recovery of any critical production service. This is not an exhaustive list, but a set of best practices based on my years of experience working as a Site Reliability Engineer on services such as TikTok, Microsoft Teams, Xbox, and Microsoft Dynamics.
The acronym "Ops" has rapidly increased in IT operations in recent years. IT operations are turning towards the automation process to improve customer delivery. Traditional application development uses DevOps implementation for Continued Integration (CI) and Continued Deployment (CD). The exact delivery and deployment process may not be suitable for data-intensive Machine Learning and Artificial Intelligence (AI) applications. This article will define different "Ops" and explain their work for the following: DevOps, DataOps, MLOps, and AIOps. DevOps This practice automates the collaboration between Development (Dev) and Operations (Ops). The main goal is to deliver the software product more rapidly and reliably and continue delivery with software quality. DevOps complements the agile software development process/agile way of working. DataOps DataOps is a practice or technology that combines integrated and process-oriented data with automation to improve data quality, collaboration, and analytics. It mainly deals with the cooperation between data scientists, data engineers, and other data professionals. MLOps MLOps is a practice or technology that develops and deploys machine learning models reliably and efficiently. MLOps is the set of practices at the intersection of DevOps, ML, and Data Engineering. AIOps AIOps is the process of capabilities to automate and streamline operations workflows for natural language processing and machine learning models. Machine Learning and Big Data are major aspects of AIOps because AI needs data from different systems and processes using ML models. AI is driven by machine learning models to create, deploy, train, and analyze the data to get accurate results. As per the IBM Developer, below are the typical “Ops” work together: Image Source: IBM Collective Comparison The table below describes the comparison between DevOps, DataOps, MLOps, and AIOps: Aspect DevOps DataOps MLOps AIOps Focus on: IT operations and software development with Agile way of working Data quality, collaboration, and analytics Machine Learning models IT operations Key Technologies/Tools: Jenkins, JIRA, Slack, Ansible, Docker, Git, Kubernetes, and Chef Apache Airflow, Databricks, Data Kitchen, High Byte Python, TensorFlow, PyTorch, Jupyter, and Notebooks Machine learning, AI algorithms, Big Data, and monitoring tools Key Principles: IT process automation Team collaboration and communication Continuous integration and continuous delivery (CI/CD) Collaboration between data Data pipeline automation and optimization Version control for data artifacts Data scientists and operations teams collaborate. 
| Primary users | Software and DevOps engineers | Data and DataOps engineers | Data scientists and MLOps engineers | Data scientists, big data scientists, and AIOps engineers |
| Use cases | Microservices, containerization, CI/CD, and collaborative development | Ingesting data, processing and transforming data, and extracting data into other platforms | Machine learning and data science projects for predictive analytics and AI | AI-driven IT operations to enhance networks, systems, and infrastructure |

Summary

In summary, the era of managing a system from a single project team is ending: business processes are becoming more complex, and IT systems change dynamically with new technologies. Detailed implementation involves a combination of collaborative practices, automation, monitoring, and a focus on continuous improvement as part of DevOps, DataOps, MLOps, and AIOps processes. DevOps focuses primarily on IT processes and software development, while the DataOps and MLOps approaches focus on improving IT and business collaboration as well as overall data use in organizations. DataOps workflows leverage DevOps principles to manage data workflows. MLOps likewise leverages DevOps principles to manage applications built on machine learning.
While debugging in an IDE or using simple command line tools is relatively straightforward, the real challenge lies in production debugging. Modern production environments have enabled sophisticated self-healing deployments, yet they have also made troubleshooting more complex. Kubernetes (aka k8s) is probably the best-known production orchestration environment. To effectively teach debugging in Kubernetes, it's essential to first introduce its fundamental principles. This part of the debugging series is designed for developers looking to effectively tackle application issues within Kubernetes environments, without delving deeply into the complex DevOps aspects typically associated with its operations. Kubernetes is a big subject: it took me two videos just to explain the basic concepts and background.

Introduction to Kubernetes and Distributed Systems

Kubernetes, while often discussed in the context of cloud computing and large-scale operations, is not just a tool for managing containers. Its principles apply broadly to all large-scale distributed systems. In this post I want to explore Kubernetes from the ground up, emphasizing its role in solving real-world problems faced by developers in production environments.

The Evolution of Deployment Technologies

Before Kubernetes, the deployment landscape was markedly different. Understanding this evolution helps us appreciate the challenges Kubernetes aims to solve. The image below represents the road to Kubernetes and the technologies we passed along the way.

In the image, we can see that initially, applications were deployed directly onto physical servers. This process was manual, error-prone, and difficult to replicate across multiple environments. For instance, if a company needed to scale its application, it involved procuring new hardware, installing operating systems, and configuring the application from scratch. This could take weeks or even months, leading to significant downtime and operational inefficiencies.

Imagine a retail company preparing for the holiday season surge. Each time they needed to handle increased traffic, they would manually set up additional servers. This was not only time-consuming but also prone to human error. Scaling down after the peak period was equally cumbersome, leading to wasted resources.

Enter Virtualization

Virtualization technology introduced a layer that emulated the hardware, allowing for easier replication and migration of environments, but at the cost of performance. However, fast virtualization enabled the cloud revolution: it let companies like Amazon lease their servers at scale without compromising their own workloads.

Virtualization involves running multiple operating systems on a single physical host. Each virtual machine (VM) includes a full copy of an operating system, the application, and the necessary binaries and libraries, taking up tens of gigabytes. VMs are managed via a hypervisor, such as VMware's ESXi or Microsoft's Hyper-V, which sits between the hardware and the operating system and is responsible for distributing hardware resources among the VMs. This layer adds overhead and can decrease performance due to the need to emulate hardware.

Note that virtualization is often referred to as "virtual machines," but I chose to avoid that terminology due to the focus of this blog on Java and the JVM, where a virtual machine typically refers to the Java Virtual Machine (JVM).
Rise of Containers

Containers emerged as a lightweight alternative to full virtualization. Tools like Docker standardized container formats, making it easier to create and manage containers without the overhead associated with traditional virtual machines. Containers encapsulate an application's runtime environment, making them portable and efficient.

Unlike virtualization, containerization encapsulates an application in a container with its own operating environment, but it shares the host system's kernel with other containers. Containers are thus much more lightweight, as they do not require a full OS instance; instead, they include only the application and its dependencies, such as libraries and binaries. This setup reduces the size of each container and improves boot times and performance by removing the hypervisor layer.

Containers operate using several key Linux kernel features:
- Namespaces: Containers use namespaces to provide isolation for global system resources between independent containers. This includes aspects of the system like process IDs, networking interfaces, and file system mounts. Each container has its own isolated namespace, which gives it a private view of the operating system with access only to its resources.
- Control groups (cgroups): Cgroups further enhance the functionality of containers by limiting and prioritizing the hardware resources a container can use. This includes parameters such as CPU time, system memory, network bandwidth, or combinations of these resources. By controlling resource allocation, cgroups ensure that containers do not interfere with each other's performance and maintain the efficiency of the underlying server.
- Union file systems: Containers use union file systems, such as OverlayFS, to layer files and directories in a lightweight and efficient manner. This system allows containers to appear as though they are running on their own operating system and file system, while they are actually sharing the host system's kernel and base OS image.

Rise of Orchestration

As containers began to replace virtualization due to their efficiency and speed, developers and organizations rapidly adopted them for a wide range of applications. However, this surge in container usage brought with it a new set of challenges, primarily related to managing large numbers of containers at scale. While containers are incredibly efficient and portable, they introduce complexities when used extensively, particularly in large-scale, dynamic environments:
- Management overhead: Manually managing hundreds or even thousands of containers quickly becomes unfeasible. This includes deployment, networking, scaling, and ensuring availability and security.
- Resource allocation: Containers must be efficiently scheduled and managed to optimally use physical resources, avoiding underutilization or overloading of host machines.
- Service discovery and load balancing: As the number of containers grows, keeping track of which container offers which service and how to balance the load between them becomes critical.
- Updates and rollbacks: Implementing rolling updates, managing version control, and handling rollbacks in a containerized environment require robust automation tools.

To address these challenges, the concept of container orchestration was developed. Orchestration automates the scheduling, deployment, scaling, networking, and lifecycle management of containers, which are often organized into microservices. Efficient orchestration tools help ensure that the entire container ecosystem is healthy and that applications are running as expected.
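Before looking at how orchestration tools tackle these challenges, the kernel features described earlier are easy to observe from the command line. A minimal sketch, assuming Docker is installed locally:

```shell
# cgroups: cap the container at half a CPU core and 256 MB of memory
docker run --rm --cpus=0.5 --memory=256m alpine echo "resource-limited container"

# namespaces: the container has a private PID namespace, so `ps` shows
# only its own processes (starting at PID 1), not the host's process tree
docker run --rm alpine ps
```

Orchestrators build on exactly these primitives; they schedule and supervise containers rather than inventing a new isolation mechanism.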
Enter Kubernetes

Among the orchestration tools, Kubernetes emerged as a frontrunner due to its robust capabilities, flexibility, and strong community support. Kubernetes offers several features that address the core challenges of managing containers:
- Automated scheduling: Kubernetes intelligently schedules containers on the cluster's nodes, taking into account the resource requirements and other constraints, optimizing for efficiency and fault tolerance.
- Self-healing capabilities: It automatically replaces or restarts containers that fail, ensuring high availability of services.
- Horizontal scaling: Kubernetes can automatically scale applications up and down based on demand, which is essential for handling varying loads efficiently.
- Service discovery and load balancing: Kubernetes can expose a container using a DNS name or its own IP address. If traffic to a container is high, Kubernetes is able to load balance and distribute the network traffic so that the deployment is stable.
- Automated rollouts and rollbacks: Kubernetes allows you to describe the desired state for your deployed containers using declarative configuration, and it can change the actual state to the desired state at a controlled rate, for example to roll out a new version of an application.

Why Kubernetes Stands Out

Kubernetes not only solves practical, operational problems associated with running containers but also integrates with the broader technology ecosystem, supporting continuous integration and continuous deployment (CI/CD) practices. It is backed by the Cloud Native Computing Foundation (CNCF), ensuring it remains cutting-edge and community-focused.

There used to be a site called "doyouneedkubernetes.com," and when you visited that site, it said, "No." Most of us don't need Kubernetes, and adopting it is often a symptom of Resume Driven Design (RDD). However, even when we don't need its scaling capabilities, the advantages of its standardization are tremendous. Kubernetes became the de facto standard and created a cottage industry of tools around it. Features such as observability and security can be plugged in easily, and cloud migration becomes arguably easier. Kubernetes is now the "lingua franca" of production environments.

Kubernetes For Developers

Understanding Kubernetes architecture is crucial for debugging and troubleshooting. The following image shows the high-level view of a Kubernetes deployment. There are far more details in most tutorials geared towards DevOps engineers, but for a developer, the point that matters is just "Your Code" - that tiny corner at the edge.

In the image above we can see:
- Master node (represented by the blue Kubernetes logo on the left): The control plane of Kubernetes, responsible for managing the state of the cluster, scheduling applications, and handling replication
- Worker nodes: These nodes contain the pods that run the containerized applications. Each worker node is managed by the master.
- Pods: The smallest deployable units created and managed by Kubernetes, usually containing one or more containers that need to work together

These components work together to ensure that an application runs smoothly and efficiently across the cluster.

Kubernetes Basics In Practice

Up until now, this post has been theory-heavy. Let's now review some commands we can use to work with a Kubernetes cluster.
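Since a cluster is a set of nodes managed by the control plane, a natural first check is to list them. A minimal sketch; the node names, ages, and versions below are illustrative:

```shell
$ kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
control-plane   Ready    control-plane   10d   v1.29.0
worker-1        Ready    <none>          10d   v1.29.0
```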
Next, we can list the pods within the cluster using the get pods command:

```shell
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
my-first-pod-id-xxxx    1/1     Running   0          13s
my-second-pod-id-xxxx   1/1     Running   0          13s
```

A command such as kubectl describe pod returns a high-level description of the pod, such as its name, parent node, etc.

Many problems in production pods can be solved by looking at the system log. This can be accomplished by invoking the logs command:

```shell
$ kubectl logs -f <pod>
[2022-11-29 04:12:17,262] INFO log data
...
```

Most typical large-scale application logs are ingested by tools such as Elastic, Loki, etc. As such, the logs command isn't as useful in production except for debugging edge cases.

Final Word

This introduction to Kubernetes has set the stage for deeper exploration into specific debugging and troubleshooting techniques, which we will cover in the upcoming posts. The complexity of Kubernetes makes it much harder to debug, but there are facilities in place to work around some of that complexity. While this article (and its follow-ups) focus on Kubernetes, future posts will delve into observability and related tools, which are crucial for effective debugging in production environments.
Context

Do you crave hands-on experience with Redis clusters? Perhaps you're eager to learn the intricacies of Redis or conduct targeted testing and troubleshooting. A local Redis cluster gives you that control. By setting it up on your own machine, you gain the freedom to experiment, validate concepts, and delve deeper into its functionality. This guide will equip you with the knowledge to quickly create and manage a Redis cluster on your local machine, paving the way for a productive and insightful learning journey.

Install Redis

The first step is to install a Redis server locally. The cluster creation commands later use Redis instances as building blocks and combine them into a cluster.

Mac

The easiest way is to install with Homebrew:

```shell
brew install redis
```

Linux

Use the following commands to install:

```shell
sudo apt update
sudo apt install redis-server
```

From the Source

If you need a specific version, you can install from source:
1. Download the latest Redis source code from the official website.
2. Unpack the downloaded archive.
3. Navigate to the extracted directory in your terminal.
4. Run the following commands:

```shell
make
sudo make install
```

Create Cluster

One-Time Steps

Clone the git repository, go to the directory where you cloned it, and then change into the create-cluster utility directory:

```shell
cd <path to local redis repository>/redis/utils/create-cluster
```

Modify create-cluster with the path to your redis-server:

```shell
vi create-cluster
```

Replace BIN_PATH="$SCRIPT_DIR/../../src/" with BIN_PATH="/usr/local/bin/".

Steps to Create/Start/Stop/Clean the Cluster

These steps are used whenever you need a Redis cluster.

Start the Redis instances:

```shell
./create-cluster start
```

Create the cluster:

```shell
echo "yes" | ./create-cluster create
```

Tip: You can create an alias and add it to your shell configuration file (~/.bashrc or ~/.zshrc). For example:

```shell
open ~/.zshrc
```

Add the following to this file (note the single outer quotes, so the inner quotes survive):

```shell
alias cluster_start='./create-cluster start && echo "yes" | ./create-cluster create'
```

Open a new terminal and run:

```shell
source ~/.zshrc
```

Now you can use "cluster_start" on the command line, and it will start and create the cluster for you.

Stop the cluster:

```shell
./create-cluster stop
```

Clean up (clears previous cluster data for a fresh start):

```shell
./create-cluster clean
```

Tip: Similarly, you can create an alias to stop the cluster and clean the cluster data files:

```shell
alias cluster_stop='./create-cluster stop && ./create-cluster clean'
```

How To Create the Cluster With a Custom Number of Nodes

By default, the create-cluster script creates 6 nodes with 3 primaries and 3 replicas. For special testing or troubleshooting, if you need to change the number of nodes, you can modify the script instead of manually adding nodes:

```shell
vi create-cluster
```

Edit the following to the desired number of nodes for the cluster:

```shell
NODES=6
```

Also, by default, the script creates 1 replica per primary. You can change that as well by setting the value in the same script (create-cluster) to the desired number:

```shell
REPLICAS=1
```
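Once the cluster is up, it is worth verifying it before running tests against it. A minimal sketch, assuming the script's default ports (the first node typically listens on 30001):

```shell
# "cluster_state:ok" indicates a healthy cluster
redis-cli -p 30001 cluster info

# List primaries and replicas with their slot assignments
redis-cli -p 30001 cluster nodes

# -c enables cluster mode so the client follows slot redirections
redis-cli -c -p 30001 set greeting hello
redis-cli -c -p 30001 get greeting
```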
Create Cluster With Custom Configuration

Redis provides various options to customize the configuration and set up Redis servers the way you want. All of these live in the redis.conf file. To customize them with the desired options, follow these steps:

1. Edit redis.conf with the desired configurations:

```shell
cd <path to local redis repository>/redis
vi redis.conf
```

2. Edit the create-cluster script:

```shell
vi create-cluster
```

3. Modify the command in the start and restart options of the script to add ../../redis.conf.

Before modification:

```shell
$BIN_PATH/redis-server --port $PORT --protected-mode $PROTECTED_MODE --cluster-enabled yes --cluster-config-file nodes-${PORT}.conf --cluster-node-timeout $TIMEOUT --appendonly yes --appendfilename appendonly-${PORT}.aof --appenddirname appendonlydir-${PORT} --dbfilename dump-${PORT}.rdb --logfile ${PORT}.log --daemonize yes ${ADDITIONAL_OPTIONS}
```

After modification:

```shell
$BIN_PATH/redis-server ../../redis.conf --port $PORT --protected-mode $PROTECTED_MODE --cluster-enabled yes --cluster-config-file nodes-${PORT}.conf --cluster-node-timeout $TIMEOUT --appendonly yes --appendfilename appendonly-${PORT}.aof --appenddirname appendonlydir-${PORT} --dbfilename dump-${PORT}.rdb --logfile ${PORT}.log --daemonize yes ${ADDITIONAL_OPTIONS}
```

References

- Redis source repository on GitHub: https://github.com/redis/redis
Selecting the ideal testing tool for your project can seem like a difficult task. Two of the most popular choices in the field are Cypress and Playwright, and understanding their features and capabilities can help you make an informed decision.

Cypress is a JavaScript-based end-to-end (E2E) testing framework built for modern web applications. Cypress is a great option for novices because of its well-known simplicity and ease of use. Its unique design enables fast and reliable testing of web applications, and it is compatible with many other tools and frameworks, including Angular, Vue, React, and more. Its special features, such as automatic waiting and time travel, can greatly increase the effectiveness and dependability of your testing process.

On the other hand, Playwright offers a more comprehensive approach to testing. Developed by Microsoft, Playwright supports multiple programming languages, including JavaScript, TypeScript, Python, and C#. Its cross-browser testing capabilities enable developers to test applications across various browsers and devices seamlessly.

This blog will guide you through the criteria by which you can determine the most suitable tool for your project. Before digging into Cypress vs Playwright, the npm download trends for both tools offer some insight into their popularity and adoption rates.

About Cypress

Cypress is an open-source end-to-end testing framework designed specifically for modern web applications. It's known for its simplicity, speed, and ability to provide reliable testing results. Cypress operates directly within the browser and executes tests in the same run loop as the application being tested. This architecture allows for fast and consistent testing without the need for external drivers. Because the tests run within the browser, test execution time is minimized and network latency is eliminated.

Cypress Architecture

Cypress runs on top of Node.js, which acts as the central hub for managing and executing tests. Cypress architecture differs from most other test automation tools. Unlike Selenium, which runs outside the browser, Cypress executes tests directly inside the browser. This approach offers several advantages:
- Faster test execution: Because Cypress is running in the same environment as the application, there's no network overhead.
- More reliable tests: Cypress has better control over the browser and can wait for elements to load before interacting with them, reducing the likelihood of flaky tests.
- Easier debugging: Since the tests are running in the browser, you can use the browser's developer tools to debug any issues that arise.

Here's a breakdown of the key components of the Cypress testing architecture:
- Test runner: The test runner is a Node.js server that coordinates the execution of your tests. It communicates with the browser and the test files.
- Browser: Cypress tests are executed inside the browser. This gives Cypress full control over the browser environment and allows it to interact with the application directly.
- Dev tools: Cypress can leverage the browser's developer tools to inspect the DOM, network traffic, and other aspects of the application.
- Node.js server: The Node.js server runs behind the scenes and provides various functionalities such as file serving, test execution, and communication with the browser.
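To get a feel for the workflow, the standard Cypress CLI commands are shown below; this assumes an existing Node.js project:

```shell
# Add Cypress as a development dependency
npm install --save-dev cypress

# Launch the interactive test runner in a browser
npx cypress open

# Run all specs headlessly, e.g., in a CI pipeline
npx cypress run
```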
Why To Use Cypress
Cypress offers several key advantages and features that make it a popular choice for front-end testing. Here are some of the most notable ones:

All-in-One Testing Framework
Cypress combines several testing types and utilities into a single package, providing a comprehensive testing solution. It includes features for end-to-end testing, unit testing, integration testing, and even stubbing and mocking network requests (see the stubbing sketch at the end of this section).

Automatic Waiting
Cypress automatically waits for commands and assertions to pass before moving on to the next step in the test. This intelligent waiting behavior eliminates the need for explicit sleep statements or complex synchronization logic, making tests more reliable and easier to write.

Time Travel and Debugging
Cypress provides a powerful time-travel feature that allows you to step through your tests, pause execution, and inspect the state of the application at any point. This makes debugging tests much easier and more intuitive.

Real Browser Automation
Cypress runs tests in a real browser, providing a realistic testing environment that closely mimics user interactions. This is in contrast to tools that simulate browser behavior, which can miss edge cases or fail to accurately represent the user experience.

Parallelization and Recording
Cypress supports running tests in parallel across multiple machines or browsers (with additional setup), significantly speeding up the testing process.

Flake-Resistant Tests
Cypress is designed to minimize flaky tests (tests that pass or fail inconsistently) by providing a more reliable and deterministic testing environment.

Cross-Browser Support
Cypress supports the major desktop browsers, including Chrome, Firefox, Edge, and Electron.

Why Not To Use Cypress
So far, we have seen the advantages of using Cypress; however, it also has some drawbacks. Here are the most notable ones:

Language Limitation
Cypress is limited to JavaScript/TypeScript, which may be a disadvantage for teams using other programming languages. In contrast, tools like Selenium WebDriver support multiple programming languages.

Multi-Tab Testing and iframe Support
Testing scenarios involving multiple tabs or iframes are common in web applications. Cypress's limitations in this area can make such scenarios challenging to test effectively.

Learning Curve
While Cypress is generally user-friendly, beginners may still face a learning curve, especially if they are not familiar with JavaScript or modern web development practices.

Continuous Integration Configuration
Setting up Cypress for continuous integration (CI) can require some configuration and may not be as straightforward as with other testing tools.

No Native Mobile Support
Cypress is primarily designed for web application testing and has no built-in support for native mobile applications.

Parallel Test Execution
Cypress does not support parallel test execution out of the box; running tests in parallel across multiple browsers or machines requires additional setup and configuration.
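Before moving on to Playwright, here is the network-stubbing sketch promised above; the route, response body, and page content are hypothetical:

TypeScript
// Stub a backend call so the test runs deterministically without a live API.
it('shows products from a stubbed API', () => {
  cy.intercept('GET', '/api/products', {
    statusCode: 200,
    body: [{ id: 1, name: 'Widget' }], // canned response
  }).as('getProducts');

  cy.visit('/products');
  cy.wait('@getProducts'); // proceed once the stubbed call has completed
  cy.contains('Widget').should('be.visible');
});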
About Playwright
Playwright is a modern test automation tool created and maintained by Microsoft. It is designed to automate web applications across multiple browsers (Chromium, Firefox, and WebKit) with a single API.

Playwright Architecture
Playwright adopts a client-server architecture in which the client (typically your test code, written in one of several supported programming languages) communicates with the Playwright server. The server manages interactions with the browser engines (Chromium, Firefox, and WebKit) and executes commands received from the client. (Architecture diagram source: ProgramsBuzz)

Playwright leverages browser-specific protocols to communicate with the engines. For Chromium, it uses the Chrome DevTools Protocol (CDP); for Firefox and WebKit, Playwright implements its own protocols, similar in functionality to CDP but tailored to the respective browser engines. Playwright establishes a WebSocket connection between the client and the server to facilitate communication. WebSocket offers advantages such as low latency and full-duplex communication, enabling real-time data exchange between the client and server.

Why To Use Playwright
Here are some key reasons to use Playwright:

Cross-Browser Support
Playwright provides a unified API to automate Chromium, Firefox, and WebKit browsers, allowing you to run tests across different browsers with minimal code changes (see the sketch after this list).

Auto-Waiting and Intelligent Selectors
Playwright automatically waits for elements to be available and uses intelligent selectors that can reliably identify elements even if their attributes or positions change.

Parallelization and Sharding
Playwright supports running tests in parallel across multiple browsers and sharding tests across multiple machines or containers, enabling faster test execution.

Codegen and Trace Viewer
Playwright includes a codegen utility that can generate test code by recording user interactions, and a trace viewer that allows you to inspect and debug test execution.

Multiple Language Support
Playwright supports multiple programming languages, including JavaScript, TypeScript, Python, .NET, and Java.

Powerful API
Playwright provides a comprehensive API for interacting with web pages, including handling file uploads, emulating mobile devices, capturing screenshots and videos, and more.

Built-in Assertions
Playwright includes built-in assertions for common testing scenarios, reducing the need for external assertion libraries.

Docker Support
Playwright can run tests inside Docker containers, making it easier to set up and maintain a consistent testing environment.

Debugging and Inspection Tools
Playwright offers various debugging and inspection tools, such as the ability to pause test execution, take screenshots, and capture network requests and responses.

Active Development and Community
Playwright is actively developed and maintained by Microsoft, with a growing community of contributors and users.
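Here is a minimal sketch of driving all three engines with the same script; the target URL is a placeholder:

TypeScript
// cross-browser.ts: one script, three browser engines.
import { chromium, firefox, webkit } from 'playwright';

(async () => {
  for (const browserType of [chromium, firefox, webkit]) {
    const browser = await browserType.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com'); // placeholder URL
    // The same API works identically against every engine.
    console.log(browserType.name(), await page.title());
    await browser.close();
  }
})();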
Why Not To Use Playwright
Limited Mobile Support
While Playwright supports mobile browser automation, its capabilities for mobile app testing are not as extensive as its desktop browser testing features.

Limited Community Resources
Although Playwright is growing very fast, compared to longer-established browser automation tools like Selenium WebDriver it has a smaller community and fewer online resources such as tutorials, articles, and forums.

Comparison Between Cypress vs. Playwright
Here is a simplified comparison of Cypress vs. Playwright, based on the points above:

Language support: Cypress supports JavaScript/TypeScript only, while Playwright supports JavaScript, TypeScript, Python, Java, and .NET.
Browser support: Cypress covers Chrome, Firefox, Edge, and Electron; Playwright covers Chromium, Firefox, and WebKit.
Architecture: Cypress runs inside the browser, in the same run loop as the application; Playwright uses a client-server model that drives browser engines over browser-specific protocols.
Parallel execution: Cypress requires additional setup; Playwright supports it out of the box, including sharding.
Mobile testing: Neither tool tests native mobile apps; Playwright can emulate mobile devices.

Conclusion
Cypress excels in its simplicity, ease of use, and strong community support. On the other hand, Playwright's versatility, cross-browser support, and robust automation capabilities make it a better choice for complex web applications, especially those requiring multi-browser testing or scenarios involving interactions beyond typical user interactions. Ultimately, the choice between the two tools hinges on the specific needs of your team and project.
As part of our FinTech mobile application deployments, we often faced challenges coordinating releases across backend APIs and the mobile applications on iOS and Android whenever we shipped significant new features. Typically, we first deploy the backend APIs, as they are quick to deploy along with the database changes. Once the backend APIs are deployed, we publish the mobile applications for both iOS and Android. The publishing process often takes time: the mobile application may be approved within a few hours, but sometimes it takes days, and if we raise tickets with the stores, the SLA (Service Level Agreement) for those tickets can span multiple days. In the last year or so, the delays we saw were predominantly on Android.

Once the backend APIs are deployed, they initiate new workflows related to the new features, such as new screens or a new set of data. However, the mobile application version available at that time on both platforms is not ready to accept these new screens, because the newer app version has not been approved yet and is still in the store review process. This inconsistency can lead to a poor user experience, which can manifest in various ways, such as the app not functioning correctly, crashing, or displaying an oops page or internal errors. This can be avoided by implementing feature flags in the backend APIs.

Feature Flags
Feature flags are configurations, stored in the database, that let us turn specific features on or off in an application without requiring code changes. By wrapping new functionality behind feature flags, we can deploy the code to production environments while keeping the features hidden from end users until they are ready to be released. Once the newer versions of the mobile apps are available, we enable these new features from the database so that the backend APIs can orchestrate the new workflows or data for the new features.

Additionally, because the iOS and Android apps are published at different times, we need platform-specific feature flags. In our experience, iOS apps can be approved in minutes or hours, while Android apps sometimes take from a few hours to a day or more. In summary, the backend APIs should only orchestrate the new workflows and data once the corresponding platform's latest application version is available in its app store. For existing users who already have the app installed, we force a version upgrade at app launch.

To avoid version-discrepancy issues during a new feature rollout, we follow a coordinated release strategy using feature flags, as explained below.

Coordinated Release
Backend APIs Release With Feature Flags Off
We first deploy the backend APIs with the feature flags set to Off for all platforms. Typically, when we create feature flags, we keep the default value as Off (0).

Mobile Application Publishing
The mobile application teams for iOS and Android submit the latest validated version to the App Store and Play Store, respectively. The respective teams monitor the publishing process for rejections or requests for clarification during review.

Enable New Feature
Once the respective mobile application team confirms that the app has been published, we enable the new feature for that platform.

Monitoring
After the new feature has been enabled across the platforms, we monitor the production environment: the backend APIs for errors, and the mobile applications for crashes.
If any significant issue is identified, we turn the feature off entirely, across all platforms or on specific platforms, depending on the type of issue. This allows us to roll back new feature functionality instantly, minimizing the impact on the user experience.

Feature Flags Implementation in a Spring Boot Application
Feature Service
Below is an example of a FeatureServiceV1Impl Spring service that handles the feature flag configuration. We define the bean's scope as request scope, which ensures a new service instance is created for each HTTP request, so the updated configuration data is available to every new request. The initializeConfiguration method is annotated with @PostConstruct, meaning it is called after the bean's properties have been set; it fetches the configuration data from the database when the service is first instantiated for a request. With request scope, we fetch the feature flag configuration from the database only once per request: even if there are feature checks in multiple places while handling a request, there is only one database call to fetch the configuration.

The service's main job is to check whether a specific feature is available, based on the feature flag configuration values from the database. In the example below, the isCashFlowUWAvailable method checks whether the "Cash Flow Underwriting" feature is available for a given origin (iOS, Android, or mobile web app).

Java
@RequestScope
@Service
@Qualifier("featureServiceV1")
public class FeatureServiceV1Impl implements FeatureServiceV1 {

    private final Logger logger = LoggerFactory.getLogger(this.getClass());

    private List<Config> configs;

    @Autowired
    ConfigurationRepository configurationRepository;

    // Fetch the feature flag configuration once, right after the
    // request-scoped bean is constructed.
    @PostConstruct
    private void initializeConfiguration() {
        logger.info("FeatureService::initializeConfiguration - Initializing configuration");
        if (configs == null) {
            logger.info("FeatureService::initializeConfiguration - Fetching configuration");
            GlobalConfigListRequest globalConfigListRequest = new GlobalConfigListRequest("ICW_API");
            this.configs = this.configurationRepository.getConfigListNoError(globalConfigListRequest);
        }
    }

    @Override
    public boolean isCashFlowUWAvailable(String origin) {
        boolean result = false;
        try {
            if (configs != null && !configs.isEmpty()) {
                // Each platform has its own flag, so iOS, Android, and the
                // mobile web app can be enabled independently.
                if (origin.toLowerCase().contains("ios")) {
                    result = this.isFeatureAvailableBasedOnConfig("feature_cf_uw_ios");
                } else if (origin.toLowerCase().contains("android")) {
                    result = this.isFeatureAvailableBasedOnConfig("feature_cf_uw_android");
                } else if (origin.toLowerCase().contains("mobilewebapp")) {
                    result = this.isFeatureAvailableBasedOnConfig("feature_cf_uw_mobilewebapp");
                }
            }
        } catch (Exception ex) {
            logger.error("FeatureService::isCashFlowUWAvailable - An error occurred, detail error:", ex);
        }
        return result;
    }

    private boolean isFeatureAvailableBasedOnConfig(String configName) {
        boolean result = false;
        if (configs != null && !configs.isEmpty()) {
            Optional<Config> config = configs.stream()
                    .filter(o -> o.getConfigName().equals(configName))
                    .findFirst();
            if (config.isPresent()) {
                String configValue = config.get().getConfigValue();
                // A value of "1" means the feature is on for this platform.
                if (configValue.equalsIgnoreCase("1")) {
                    result = true;
                }
            }
        }
        return result;
    }
}

Consuming the Feature Service
We then reference and auto-wire FeatureServiceV1 in a controller or another service in the Spring Boot application, as shown below. We annotate the FeatureServiceV1 field with the @Lazy annotation.
The @Lazy annotation ensures that FeatureServiceV1 is instantiated only when one of its methods is invoked from a particular method of the controller or service. This prevents unnecessary loading of the feature-specific database configuration when a method that does not reference the feature service is invoked, and it helps improve application start-up time.

Java
@Autowired
@Lazy
private FeatureServiceV1 featureServiceV1;

We then use FeatureServiceV1 to check the availability of the feature and branch our code accordingly. Branching allows us to execute feature-specific code when the feature is available, or default to the normal path. Below is an example of how to use the feature availability check to branch the code:

Java
if (this.featureServiceV1.isCashFlowUWAvailable(context.origin)) {
    logger.info("Cashflow Underwriting Path");
    // Implement the logic for the Cash Flow Underwriting path
} else {
    logger.info("Earlier Normal Path");
    // Implement the logic for the normal path
}

Here is how we can implement this conditional logic in a controller or service method:

Java
@RestController
@RequestMapping("/api/v1/uw")
public class UnderwritingController {

    private final Logger logger = LoggerFactory.getLogger(this.getClass());

    @Autowired
    @Lazy
    private FeatureServiceV1 featureServiceV1;

    @RequestMapping("/loan")
    public void processLoanUnderwriting(RequestContext context) {
        if (this.featureServiceV1.isCashFlowUWAvailable(context.origin)) {
            logger.info("Cashflow Underwriting Path");
            // Implement the logic for the Cash Flow Underwriting path
        } else {
            logger.info("Earlier Normal Path");
            // Implement the logic for the normal path
        }
    }
}

Conclusion
Feature flags play an important role, particularly when coordinating releases across multiple platforms. In our case, we have four channels: two native mobile applications (iOS and Android), a mobile web application (browser-based), and an iPad application. Feature flags enable smooth, controlled rollouts, minimizing disruption to the user experience. They ensure that new features are only activated once the corresponding platform's latest application version is available in the app stores.
I'm in the process of adding more components to my OpenTelemetry demo (again!). The new design deploys several warehouse services behind the inventory service, so the latter can query the former for data via their respective HTTP interfaces. I implemented each warehouse on top of a different technology stack; this way, I can show OpenTelemetry traces across several stacks. Anyone should be able to add a warehouse in their favorite tech stack, as long as it returns the correct JSON payload to the inventory.

For this, I want to make the configuration of the inventory "easy": add a new warehouse with a simple pair of environment variables, i.e., the endpoint and its optional country. The main issue is that environment variables are not structured. I searched for a while and found a relevant post. Its idea is simple but efficient; here's a sample from the post:

Properties files
FOO__1__BAR=setting-1 #1
FOO__1__BAZ=setting-2 #1
FOO__2__BAR=setting-3 #1
FOO__2__QUE=setting-4 #1
FIZZ__1=setting-5 #2
FIZZ__2=setting-6 #2
BILL=setting-7 #3

1. Map-like structure
2. Table-like structure
3. Just a value

With this approach, I could configure the inventory like this:

YAML
services:
  inventory:
    image: otel-inventory:1.0
    environment:
      WAREHOUSE__0__ENDPOINT: http://apisix:9080/warehouse/us #1
      WAREHOUSE__0__COUNTRY: USA #2
      WAREHOUSE__1__ENDPOINT: http://apisix:9080/warehouse/eu #1
      WAREHOUSE__2__ENDPOINT: http://warehouse-jp:8080 #1
      WAREHOUSE__2__COUNTRY: Japan #2
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317
      OTEL_RESOURCE_ATTRIBUTES: service.name=inventory
      OTEL_METRICS_EXPORTER: none
      OTEL_LOGS_EXPORTER: none

1. Warehouse endpoint
2. Set country

You can see the three warehouses configured above. Each has an endpoint and an optional country. My first attempt looked like the following:

Rust
lazy_static::lazy_static! { //1
    static ref REGEXP_WAREHOUSE: Regex = Regex::new(r"^WAREHOUSE__(\d)__.*").unwrap();
}

std::env::vars()
    .filter(|(key, _)| REGEXP_WAREHOUSE.find(key.as_str()).is_some()) //2
    .group_by(|(key, _)| key.split("__").nth(1).unwrap().to_string()) //3
    .into_iter() //4
    .map(|(_, mut group)| { //5
        let some_endpoint = group.find(|item| item.0.ends_with("ENDPOINT")); //6
        let endpoint = some_endpoint.unwrap().1;
        let some_country = group //7
            .find(|item| item.0.ends_with("COUNTRY"))
            .map(|(_, country)| country);
        println!("Country pair is: {:?}", some_country);
        (endpoint, some_country).into() //8
    })
    .collect::<Vec<_>>()

1. For making constants out of code evaluated at runtime
2. Keep only the warehouse-related environment variables
3. Group by index
4. Back to an Iterator, with the help of itertools
5. A group consists of either just the endpoint, or the endpoint and the country
6. Get the endpoint
7. Get the country
8. Into a structure (irrelevant here)

I encountered issues several times when I started the demo: the code somehow didn't find the endpoint at all. I had chosen this approach because I've been taught that it's more performant to iterate through the key-value pairs of a map than to iterate through its keys and then look each value up. I tried to change to the latter.
Rust
lazy_static! {
    static ref REGEXP_WAREHOUSE_ENDPOINT: Regex =
        Regex::new(r"^WAREHOUSE__(?<index>\d)__ENDPOINT.*").unwrap(); //1
}

std::env::vars()
    .filter(|(key, _)| REGEXP_WAREHOUSE_ENDPOINT.find(key.as_str()).is_some()) //2
    .map(|(key, endpoint)| { //3
        let some_warehouse_index = REGEXP_WAREHOUSE_ENDPOINT.captures(key.as_str()).unwrap(); //4
        println!("some_warehouse_index: {:?}", some_warehouse_index);
        let index = some_warehouse_index.name("index").unwrap().as_str();
        let country_key = format!("WAREHOUSE__{}__COUNTRY", index); //5
        let some_country = var(country_key); //6
        println!("endpoint: {}", endpoint);
        (endpoint, some_country).into()
    })
    .collect::<Vec<_>>()

1. Change the regex to capture only the endpoint-related variables
2. Keep only the warehouse endpoint environment variables
3. I'm aware that the filter_map() function exists, but I think it's clearer to keep the two steps separate here
4. Capture the index
5. Build the country environment variable name from a known string and the index
6. Get the country

With this code, I didn't encounter any issues. Now that it works, I'm left with two questions:

Why doesn't the group_by()/find() version work in the deployed Docker Compose setup despite working in the tests? A likely culprit: itertools' group_by() only groups consecutive elements that share a key, so if the environment variables are not iterated in sorted order, a warehouse's ENDPOINT and COUNTRY can land in separate groups.
Is anyone interested in making a crate out of it?

To Go Further
Structured data in environment variables
lazy_static crate
envconfig crate
The world of telecom is evolving at a rapid pace, and it is crucial for operators to stay ahead. As 5G becomes the norm, operators must transition smoothly from 4G technology (which operates on the OpenStack cloud) to 5G technology (which runs on Kubernetes). Today, operators invest in multiple vendor-specific monitoring tools, leading to higher costs and less efficient operations. In the 5G world, however, operators can adopt a unified monitoring and alert system for all their products. A single system that can monitor network equipment, customer devices, and service platforms offers a holistic view of the entire system, reducing complexity and enhancing efficiency.

By adopting a Prometheus-based monitoring and alert system, operators can streamline operations, reduce costs, and improve the customer experience. With a single monitoring system, operators can monitor their entire 5G system seamlessly, ensuring optimal performance and avoiding disruptions. This practical solution eliminates the need for a complete overhaul and offers a cost-effective transition. Let's dive deep.

Prometheus, Grafana, and Alert Manager
Prometheus is a tool for monitoring and alerting that uses a pull-based model. It scrapes, collects, and stores Key Performance Indicators (KPIs) with labels and timestamps, collecting metrics from targets; in the 5G telecom world, these targets are the network functions' namespaces.

Grafana is a dynamic web application offering a wide range of functionality. It visualizes data, allowing the 5G telecom operator to build the charts, graphs, and dashboards they want to see; its primary feature is multiple modes of graphing and dashboarding support through a GUI (graphical user interface). Grafana seamlessly integrates the data collected by Prometheus, making it an indispensable tool for telecom operators, and it supports combining different data sources into one dashboard for continuous monitoring. This improves response rates by alerting the operator's team when an incident emerges, keeping 5G network function downtime to a minimum.

The Alert Manager is the component that manages alerts sent by the Prometheus server via alerting rules. It handles the received alerts, including silencing and inhibiting them and sending out notifications via email or chat; it also deduplicates and groups alerts and routes them to a centralized webhook receiver, making it a must-have tool for any telecom operator.

Architectural Diagram

Prometheus
Components of Prometheus (Specific to a 5G Telecom Operator)
Core component: The Prometheus server scrapes HTTP endpoints and stores the data as time series. It collects metrics from the Prometheus targets; in our context, these are the Kubernetes clusters that house the 5G network functions.
Time series database (TSDB): Prometheus stores telecom metrics as time series data.
HTTP server: The API used to query data stored in the TSDB; the Grafana dashboard can query this data for visualization.
Client libraries: Telecom operator-specific libraries (5G) for instrumenting application code.
Push gateway: A scrape target for short-lived jobs.
Service discovery: In the 5G world, network function pods are constantly added or deleted by telecom operators to scale up or down. Prometheus's service discovery component keeps track of this ever-changing list of pods.
Prometheus Web UI: Accessible through port 9090, the web UI allows users to view and analyze Prometheus data in a user-friendly, interactive manner, enhancing the monitoring capabilities of 5G telecom operators.
Alert Manager: A key companion component, responsible for handling alerts. When alerting conditions are met, Prometheus fires alerts to the Alert Manager, which sends them out through channels such as email or messenger, ensuring timely communication of critical issues.
Grafana: For dashboard visualization (the actual graphs).

With these components, a telecom operator's 5G network functions can be monitored diligently: tracking resource utilization and performance, detecting availability errors, and more. Prometheus provides the tools needed to keep the network running smoothly and efficiently.

Prometheus Features
A multi-dimensional data model, in which time series are identified by metric name and labels
PromQL (Prometheus Query Language) as the query language
An HTTP pull model for metric collection
Discovery of 5G network functions via service discovery or static configuration
Multiple modes of dashboard and GUI support

From the application side, a scrape target is simply an HTTP endpoint that serves current metric values, as the sketch below illustrates.
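Here is a minimal, illustrative sketch of an instrumented endpoint that Prometheus could scrape, using Node.js with the prom-client and express packages; the metric name, route, and port are hypothetical and not tied to any specific 5G product:

TypeScript
import express from 'express';
import client from 'prom-client';

const app = express();

// A counter a network function might expose; the name is hypothetical.
const sessions = new client.Counter({
  name: 'smf_sessions_total',
  help: 'Total number of SMF sessions handled',
});

// The business endpoint increments the counter as work happens.
app.get('/session', (_req, res) => {
  sessions.inc();
  res.send('session created');
});

// Prometheus pulls from this endpoint on every scrape interval.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);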
Prometheus Remote Write to Central Prometheus From Network Functions
5G operators will have multiple network functions from various vendors, such as the SMF (Session Management Function), UPF (User Plane Function), AMF (Access and Mobility Management Function), PCF (Policy Control Function), and UDM (Unified Data Management). Using a separate Prometheus/Grafana dashboard for each network function leads to a complex and inefficient monitoring process. To address this, it is highly recommended to consolidate the data/metrics from the individual Prometheus instances into a single Central Prometheus, simplifying monitoring and enhancing efficiency. The 5G network operator can then monitor all the data in one centralized location, with a comprehensive view of the network's performance.

Grafana
Grafana Features
Panels: Panels let operators visualize telecom 5G data in many ways, including histograms, graphs, maps, and KPIs, offering a versatile and adaptable interface for data representation.
Plugins: Plugins render telecom 5G data in real time via an API (application programming interface), ensuring operators always have accurate, up-to-date data at their fingertips. They also enable operators to create data source plugins and retrieve metrics from any API.
Transformations: Transformations allow you to adapt, summarize, combine, and perform KPI metric queries/calculations across 5G network function data sources.
Annotations: Rich events from different telecom 5G network function data sources can be used to annotate metrics-based graphs.
Panel editor: A consistent graphical user interface for configuring and customizing 5G telecom metrics panels.

Grafana Sample Dashboard GUI for 5G

Alert Manager
Alert Manager Components
The Ingester ingests all alerts, and the Grouper groups them into categories. The De-duplicator prevents repetitive alerts, ensuring you are not bombarded with notifications. The Silencer mutes alerts based on a label, and the Throttler regulates the frequency of alerts. Finally, the Notifier ensures that third parties are notified promptly.

Alert Manager Functionalities
Grouping: Grouping categorizes similar alerts into a single notification. This is helpful during larger outages, when many 5G network functions fail simultaneously and many alerts fire at once; the telecom operator gets only a single page while still being able to see exactly which service instances are affected.
Inhibition: Inhibition suppresses notifications for certain low-priority alerts if related major/critical alerts are already firing. For example, when a critical alert fires indicating that an entire 5G SMF (Session Management Function) cluster is unreachable, the Alert Manager can mute all other minor/warning alerts concerning that cluster.
Silences: Silences simply mute alerts for a given time. Incoming alerts are checked against the matchers of each active silence; if they match, no notifications are sent out for that alert.
High availability: Telecom operators should not load-balance traffic between Prometheus and its Alert Managers; instead, they should point Prometheus to a list of all the Alert Managers.

Dashboard Visualization
A Grafana dashboard visualizes the Alert Manager webhook traffic notifications, as shown below.

Configuration YAMLs (YAML Ain't Markup Language)
Telecom operators can install and run Prometheus using the configuration below:

YAML
prometheus:
  enabled: true
  route:
    enabled: {}
  nameOverride: Prometheus
  tls:
    enabled: true
    certificatesSecret: backstage-prometheus-certs
    certFilename: tls.crt
    certKeyFilename: tls.key
  volumePermissions:
    enabled: true
  initdbScriptsSecret: backstage-prometheus-initdb
  prometheusSpec:
    retention: 3d
    replicas: 2
    prometheusExternalLabelName: prometheus_cluster
    image:
      repository: <5G operator image repository for Prometheus>
      tag: <Version example v2.39.1>
      sha: ""
    podAntiAffinity: "hard"
    securityContext: null
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 500m
        memory: 1Gi
    serviceMonitorNamespaceSelector:
      matchExpressions:
        - {key: namespace, operator: In, values: [<Network function 1 namespace>, <Network function 2 namespace>]}
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

The following configuration routes scrape data, segregated by namespace, to the Central Prometheus. Note: The configuration below can be appended to the Prometheus installation YAML above.
YAML
remoteWrite:
  - url: <Central Prometheus URL for namespace 1 by 5G operator>
    basicAuth:
      username:
        name: <secret username for namespace 1>
        key: username
      password:
        name: <secret password for namespace 1>
        key: password
    tlsConfig:
      insecureSkipVerify: true
    writeRelabelConfigs:
      - sourceLabels:
          - namespace
        regex: <namespace 1>
        action: keep
  - url: <Central Prometheus URL for namespace 2 by 5G operator>
    basicAuth:
      username:
        name: <secret username for namespace 2>
        key: username
      password:
        name: <secret password for namespace 2>
        key: password
    tlsConfig:
      insecureSkipVerify: true
    writeRelabelConfigs:
      - sourceLabels:
          - namespace
        regex: <namespace 2>
        action: keep

Telecom operators can install and run Grafana using the configuration below.

YAML
grafana:
  replicas: 2
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app.kubernetes.io/name"
                operator: In
                values:
                  - Grafana
          topologyKey: "kubernetes.io/hostname"
  securityContext: false
  rbac:
    pspEnabled: false # Must be disabled due to tenant permissions
    namespaced: true
  adminPassword: admin
  image:
    repository: <artifactory>/Grafana
    tag: <version>
    sha: ""
    pullPolicy: IfNotPresent
  persistence:
    enabled: false
  initChownData:
    enabled: false
  sidecar:
    image:
      repository: <artifactory>/k8s-sidecar
      tag: <version>
      sha: ""
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
    dashboards:
      enabled: true
      label: grafana_dashboard
      labelValue: "Vendor name"
    datasources:
      enabled: true
      defaultDatasourceEnabled: false
  additionalDataSources:
    - name: Prometheus
      type: Prometheus
      url: http://<prometheus-operated>:9090
      access: proxy
      isDefault: true
      jsonData:
        timeInterval: 30s
  resources:
    limits:
      cpu: 400m
      memory: 512Mi
    requests:
      cpu: 50m
      memory: 206Mi
  extraContainers:
    - name: oauth-proxy
      image: <artifactory>/origin-oauth-proxy:<version>
      imagePullPolicy: IfNotPresent
      ports:
        - name: proxy-web
          containerPort: 4181
      args:
        - --https-address=:4181
        - --provider=openshift
        # Service account name here must be "<Helm Release name>-grafana"
        - --openshift-service-account=monitoring-grafana
        - --upstream=http://localhost:3000
        - --tls-cert=/etc/tls/private/tls.crt
        - --tls-key=/etc/tls/private/tls.key
        - --cookie-secret=SECRET
        - --pass-basic-auth=false
      resources:
        limits:
          cpu: 100m
          memory: 256Mi
        requests:
          cpu: 50m
          memory: 128Mi
      volumeMounts:
        - mountPath: /etc/tls/private
          name: grafana-tls
  extraContainerVolumes:
    - name: grafana-tls
      secret:
        secretName: grafana-tls
  serviceAccount:
    annotations:
      "serviceaccounts.openshift.io/oauth-redirecturi.first": https://[SPK exposed IP for Grafana]
  service:
    targetPort: 4181
    annotations:
      service.alpha.openshift.io/serving-cert-secret-name: <secret>

Telecom operators can install and run Alert Manager using the configuration below.

YAML
alertmanager:
  enabled: true
  alertmanagerSpec:
    image:
      repository: prometheus/alertmanager
      tag: <version>
    replicas: 2
    podAntiAffinity: hard
    securityContext: null
    resources:
      requests:
        cpu: 25m
        memory: 200Mi
      limits:
        cpu: 100m
        memory: 400Mi
    containers:
      - name: config-reloader
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
          limits:
            cpu: 25m
            memory: 50Mi

The following configuration routes Prometheus Alert Manager data to the operator's centralized webhook receiver. Note: The configuration below can be appended to the Alert Manager installation YAML above.
YAML
config:
  global:
    resolve_timeout: 5m
  route:
    group_by: ['alertname']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
    routes:
      - receiver: '<Network function 1>'
        group_wait: 10s
        group_interval: 10s
        group_by: ['alertname','oid','action','time','geid','ip']
        matchers:
          - namespace="<namespace 1>"
      - receiver: '<Network function 2>'
        group_wait: 10s
        group_interval: 10s
        group_by: ['alertname','oid','action','time','geid','ip']
        matchers:
          - namespace="<namespace 2>"

Conclusion
The open-source OAM (Operation and Maintenance) tools Prometheus, Grafana, and Alert Manager can benefit 5G telecom operators. Prometheus periodically captures the status of the monitored 5G network functions over HTTP, and any component can be connected to the monitoring as long as the operator provides the corresponding HTTP interface. Prometheus and the Grafana Agent give the operator control over which metrics to report; once the data is in Grafana, it can also be stored in a Grafana database for extra data redundancy. In conclusion, Prometheus allows 5G telecom operators to improve their operations and offer better customer service, and adopting a unified monitoring and alert system like Prometheus is one way to achieve this.
Deployment strategies provide a systematic approach to releasing software changes, minimizing risk, and maintaining consistency across projects and teams. Without a well-defined strategy and systematic approach, deployments can lead to downtime, data loss, or system failures, resulting in frustrated users and lost revenue. Before exploring the strategies in detail, here is a short overview of each one covered in this article:

All-at-once deployment: Updates all target environments at once, making it the fastest but riskiest approach.
In-place deployment: Stops the current application and replaces it with the new version, directly affecting availability.
Blue/Green deployment: A zero-downtime approach that runs two identical environments and switches traffic from old to new.
Canary deployment: Introduces new changes incrementally to a small subset of users before a full rollout.
Shadow deployment: Mirrors real traffic to a shadow environment where the new deployment is tested without affecting the live environment.

All-At-Once Deployment
The all-at-once deployment strategy, also known as the "Big Bang" strategy, releases the new version of your application to all servers or environments simultaneously. This method is straightforward and can be implemented quickly, as it requires no complex orchestration or additional infrastructure. Its primary benefit is its simplicity and the ability to move all users to the new version immediately.

However, the all-at-once method carries significant risks. Since all instances are updated together, any issue with the new release immediately impacts all users. There is no opportunity to mitigate risk by rolling the change out gradually or testing it with a subset of the user base first. Additionally, if something goes wrong, the rollback process can be just as disruptive as the initial deployment. Despite these risks, all-at-once deployment is used quite often and can be suitable for small applications or environments where downtime is more acceptable and the impact of potential issues is minimal. It is also useful for applications that are inherently simple or have been thoroughly tested for compatibility and stability before release.

In-Place (Recreate) Deployment
The in-place, or recreate, deployment strategy is another commonly used approach. It is the simplest and requires no additional infrastructure: to deploy a new version, we stop the application and start it again with the new changes. The disadvantage is that the service being updated experiences downtime that affects its users, and if problems surface in the new changes, rolling them back causes further downtime. To avoid downtime, and to be able to roll back changes without it, the industry uses the zero-downtime strategies described next.
Blue/Green Deployment
The first zero-downtime strategy we will discuss is Blue/Green deployment. Its main goal is to minimize downtime and risk while deploying new software versions, and it achieves this by running two identical environments of the service. One environment (the Blue environment) contains the original application and serves user requests; the other (the Green environment) is where the new changes are deployed. This allows us to verify and test the new changes with near-zero downtime for users and the service, and to roll back safely if problems arise, with some exceptions we will discuss shortly. Typically, the process works as follows: after verifying and testing the new changes in the Green environment, we reroute traffic from the Blue environment to the identical Green environment running the new changes.

Sounds easy, doesn't it? Well, it depends. We can easily reroute traffic between environments only when our services are stateless. If they interact with data sources, things get more complicated, and here's why: the identical Green and Blue environments share common data sources. Sharing data sources such as NoSQL databases or object stores (AWS S3, for example) between the identical environments is relatively easy to accomplish, but this is not at all true for relational databases, which require additional effort (NoSQL may also require some) to support Blue/Green deployments. Approaches for handling schema updates without downtime are out of the scope of this article; see the article "Upgrading database schema without downtime" to learn more (and if you have interesting resources on the topic, please share them in the comments).

A general recommendation: if your services are not stateless and use data sources with schemas, a Blue/Green deployment strategy is not always advisable, because the additional risks and failure points it introduces can outweigh its benefits. But if you have decided that you need a Blue/Green deployment strategy and your infrastructure runs on Amazon Web Services, you may find AWS's document on implementing Blue/Green deployments and the required infrastructure useful.

Canary Deployment
The idea of the Canary deployment strategy is to reduce the risk of deploying new software versions in production by rolling out changes to users slowly. As in the Blue/Green strategy, we deploy the new software version to an identical environment; but instead of completely rerouting traffic from one environment to the other, we use, for example, a load balancer to route only a portion of users to the environment with the new version. The size of that portion, and the criteria used to select those users, vary by company and project: some roll out new changes only to their internal staff first, some pick users randomly, and some use algorithms to match users against specific criteria. Pick whatever best suits your needs; a minimal sketch of such a routing decision follows below.
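As an illustration of the routing decision only, here is a sketch in TypeScript of a deterministic percentage-based split, the kind of logic a load balancer or routing layer might apply; the hash function and threshold are illustrative, not a production recipe:

TypeScript
// Deterministically route a fraction of users to the canary, so a given
// user always sees the same version for the duration of the rollout.
function hashUserId(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash;
}

function routeToCanary(userId: string, canaryPercent: number): boolean {
  // Map the user to a stable bucket in [0, 100).
  const bucket = hashUserId(userId) % 100;
  return bucket < canaryPercent;
}

// Example: send roughly 10% of users to the new version.
const target = routeToCanary('user-42', 10) ? 'v2 (canary)' : 'v1 (stable)';
console.log(target);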
Shadow Deployment
The Shadow deployment strategy is the next strategy that I personally find interesting. It also uses the concept of identical environments, just as the Blue/Green and Canary strategies do. The main difference is that instead of rerouting all traffic, or only a portion of real users, we duplicate the entire traffic stream to the second environment where the new changes are deployed. This way, we can test and verify our changes without negatively affecting our users, mitigating the risk of broken software updates or performance bottlenecks.

Conclusion
In this article, we walked through five deployment strategies, each with its own set of advantages and challenges. The all-at-once and in-place strategies stand out for their speed and the minimal effort required to deploy new software versions. While these two will be your go-to strategies in most cases, it is still useful to understand the more complex and resource-intensive ones. Ultimately, implementing any deployment strategy requires careful consideration of its potential impact on both the system and its users. The choice of deployment strategy should align with your project's needs, risk tolerance, and operational capabilities.