Software design and architecture focus on the development decisions made to improve a system's overall structure and behavior in order to achieve essential qualities such as modifiability, availability, and security. The Zones in this category are available to help developers stay up to date on the latest software design and architecture trends and techniques.
Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!
Containers allow applications to run quicker across many different development environments, and a single container encapsulates everything needed to run an application. Container technologies have exploded in popularity in recent years, leading to diverse use cases as well as new and unexpected challenges. This Zone offers insights into how teams can solve these challenges through its coverage of container performance, Kubernetes, testing, container orchestration, microservices usage to build and deploy containers, and more.
Integration refers to the process of combining software parts (or subsystems) into one system. An integration framework is a lightweight utility that provides libraries and standardized methods to coordinate messaging among different technologies. As software connects the world in increasingly complex ways, integration makes it all possible by facilitating app-to-app communication. Learn more about this necessity for modern software development by keeping a pulse on industry topics such as integrated development environments, API best practices, service-oriented architecture, enterprise service buses, communication architectures, integration testing, and more.
A microservices architecture is a development method for designing applications as modular services that seamlessly adapt to a highly scalable and dynamic environment. Microservices help solve complex issues such as speed and scalability, while also supporting continuous testing and delivery. This Zone will take you through breaking down the monolith step by step and designing a microservices architecture from scratch. Stay up to date on the industry's changes with topics such as container deployment, architectural design patterns, event-driven architecture, service meshes, and more.
Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.
The topic of security covers many different facets within the SDLC. From focusing on secure application design to designing systems to protect computers, data, and networks against potential attacks, it is clear that security should be top of mind for all developers. This Zone provides the latest information on application vulnerabilities, how to incorporate security earlier in your SDLC practices, data governance, and more.
Observability and Application Performance
Making data-driven decisions, as well as weighing business-critical and technical considerations, first comes down to the accuracy, depth, and usability of the data itself. To build the most performant and resilient applications, teams must stretch beyond monitoring into the world of data, telemetry, and observability. As a result, you'll gain a far deeper understanding of system performance, enabling you to tackle key challenges that arise from the distributed, modular, and complex nature of modern technical environments. Today, and moving into the future, it's no longer about monitoring logs, metrics, and traces alone; instead, it's more deeply rooted in a performance-centric team culture, end-to-end monitoring and observability, and the thoughtful usage of data analytics. In DZone's 2023 Observability and Application Performance Trend Report, we delve into emerging trends, covering everything from site reliability and app performance monitoring to observability maturity and AIOps, in our original research. Readers will also find insights from members of the DZone Community, who cover a selection of hand-picked topics, including the benefits and challenges of managing modern application performance, distributed cloud architecture considerations and design patterns for resiliency, observability vs. monitoring and how to practice both effectively, SRE team scalability, and more.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Cloud native and observability are an integral part of developer lives. Understanding their responsibilities within observability at scale helps developers tackle the challenges they are facing on a daily basis. There is more to observability than just collecting and storing data, and developers are essential to surviving these challenges. Observability Foundations Gone are the days of monitoring a known application environment, debugging services within our development tooling, and waiting for new resources to deploy our code to. This has become dynamic, agile, and quickly available with auto-scaling infrastructure in the final production deployment environments. Developers are now striving to observe everything they are creating, from development to production, often owning their code for the entire lifecycle. The tooling from days of old, such as Nagios and HP OpenView, can't keep up with constantly changing cloud environments that contain thousands of microservices. The infrastructure for cloud-native deployments is designed to dynamically scale as needed, making it even more essential for observability platforms to help condense all that data noise to detect trends leading to downtime before they happen. Splintering of Responsibilities in Observability Cloud-native complexity not only changed the developer world but also impacted how organizations are structured. The responsibilities of creating, deploying, and managing cloud-native infrastructure have split into a series of new organizational teams. Developers are being tasked with more than just code creation and are expected to adopt more hybrid roles within some of these new teams. Observability teams have been created to focus on a specific aspect of the cloud-native ecosystem to provide their organization a service within the cloud infrastructure. In Table 1, we can see the splintering of traditional roles in organizations into these teams with specific focuses. Table 1. 
Who's who in the observability game

| Team | Focus | Maturity goals |
|------|-------|----------------|
| DevOps | Automation and optimization of the app development lifecycle, including post-launch fixes and updates | Early stages: developer productivity |
| Platform engineering | Designing and building toolchains and workflows that enable self-service capabilities for developers | Early stages: developer maturity and productivity boost |
| CloudOps | Provides organizations proper (cloud) resource management, using DevOps principles and IT operations applied to cloud-based architectures to speed up business processes | Later stages: cloud resource management, costs, and business agility |
| SRE | All-purpose role aiming to manage reliability for any type of environment; a full-time job avoiding downtime and optimizing performance of all apps and supporting infrastructure, regardless of whether it's cloud native | Early to late stages: on-call engineers trying to reduce downtime |
| Central observability team | Responsible for defining observability standards and practices, delivering key data to engineering teams, and managing tooling and observability data storage | Later stages, owning: defining monitoring standards and practices; delivering monitoring data to engineering teams; measuring the reliability and stability of monitoring solutions; and managing tooling and storage of metrics data |

To understand how these teams work together, imagine a large, mature, cloud-native organization that has all the teams featured in Table 1: The DevOps team is the first line for standardizing how code is created, managed, tested, updated, and deployed. They work with the toolchains and workflows provided by the platform engineering team. DevOps advises on new tooling and/or workflows, creating continuous improvements to both. A CloudOps team focuses on cloud resource management and getting the most out of the budgets spent on the cloud by the other teams. An SRE team is on call to manage reliability, avoiding downtime for all supporting infrastructure in the organization. They provide feedback for all the teams to improve tools, processes, and platforms. The overarching central observability team sets the observability standards for all teams to adhere to, delivering the right observability data to the right teams and managing tooling and data storage. Why Observability Is Important to Cloud Native Today, cloud-native usage has seen such growth that developers are overwhelmed by their vast responsibilities that go beyond just coding. The complexity introduced by cloud-native environments means that observability is becoming essential to solving many of the challenges developers are facing. Challenges Increasing cloud-native complexity means that developers are providing more code faster and passing more rigorous testing to ensure that their applications work at cloud-native scale. These challenges have expanded the need for observability within what was traditionally the developers' coding environment. Not only do developers need to provide code and testing infrastructure for their applications, but they are also required to instrument that code so that business metrics can be monitored. Over time, developers learned that fully automating metrics was overkill, with much of that data being unnecessary. This led developers to fine-tune their instrumentation methods and turn to manual instrumentation, where only the metrics they needed were collected. Another challenge arises when decisions are made to integrate existing application landscapes with new observability practices in an organization.
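To make that manual instrumentation concrete, here is a minimal sketch using the Prometheus Python client, in which a developer exposes only a couple of hand-picked business metrics rather than everything the runtime could emit. The metric names, labels, and port are illustrative assumptions, not a prescribed convention:

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Only the metrics the team actually needs; nothing is auto-instrumented here.
ORDERS_PLACED = Counter(
    "shop_orders_placed_total",              # hypothetical metric name
    "Number of orders placed, by payment method",
    ["payment_method"],
)
CHECKOUT_LATENCY = Histogram(
    "shop_checkout_duration_seconds",
    "Time spent processing a checkout request",
)

@CHECKOUT_LATENCY.time()                     # records the duration of every checkout
def process_checkout(order: dict) -> None:
    # ... business logic would live here ...
    ORDERS_PLACED.labels(payment_method=order.get("payment", "card")).inc()

if __name__ == "__main__":
    start_http_server(8000)                  # exposes /metrics for a scraper
    process_checkout({"payment": "card"})
    time.sleep(60)                           # keep the exporter alive briefly for a scrape
```

The same principle applies with OpenTelemetry or any other metrics API: the developer, not an auto-instrumentation agent, decides which few signals carry business meaning.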
The time developers spend manually instrumenting existing applications so that they provide the needed data to an observability platform is an often overlooked burden. New observability tools designed to help with metrics, logs, and traces are introduced to the development teams — leading to more challenges for developers. Often, these tools are mastered by few, leading to siloed knowledge, which results in organizations paying premium prices for advanced observability tools only to have them used as if one is engaging in observability as a toy. Finally, when exploring the ingested data from our cloud infrastructure, the first thing that becomes obvious is that we don't need to keep everything that is being ingested. We need the ability to have control over our telemetry data and find out what is unused by our observability teams. There are some questions we need to answer about how we can: Identify ingested data not used in dashboards, alerting rules, nor touched in ad hoc queries by our observability teams Control telemetry data with aggregation and rules before we put it into expensive, longer-term storage Use only telemetry data needed to support the monitoring of our application landscape Tackling the flood of cloud data in such a way as to filter out the unused telemetry data, keeping only that which is applied for our observability needs, is crucial to making this data valuable to the organization. Cloud Native at Scale The use of cloud-native infrastructure brings with it a lot of flexibility, but when done at scale, the small complexities can become overwhelming. This is due to the premise of cloud native where we describe how our infrastructure should be set up, how our applications and microservices should be deployed, and finally, how it automatically scales when needed. This approach reduces our control over how our production infrastructure reacts to surges in customer usage of an organization's services. Empowering Developers Empowering developers starts with platform engineering teams that focus on developer experiences. We create developer experiences in our organization that treat observability as a priority, dedicating resources for creating a telemetry strategy from day one. In this culture, we're setting up development teams for success with cloud infrastructure, using observability alongside testing, continuous integration, and continuous deployment. Developers are not only owning the code they deliver but are now encouraged and empowered to create, test, and own the telemetry data from their applications and microservices. This is a brave new world where they are the owners of their work, providing agility and consensus within the various teams working on cloud solutions. Rising to the challenges of observability in a cloud native world is a success metric for any organization, and they can't afford to get it wrong. Observability needs to be front of mind with developers, considered a first-class citizen in their daily workflows, and consistently helping them with challenges they face. Artificial Intelligence and Observability Artificial intelligence (AI) has risen in popularity within not only developer tooling but also in the observability domain. The application of AI in observability falls within one of two use cases: Monitoring machine learning (ML) solutions or large language model (LLM) systems Embedding AI into observability tooling itself as an assistant The first case is when you want to monitor specific AI workloads, such as ML or LLMs. 
They can be further split into two situations that you might want to monitor, the training platform and the production platform. Training infrastructure and the process involved can be approached just like any other workload: easy-to-achieve monitoring using instrumentation and existing methods, such as observing specific traces through a solution. This is not the complete monitoring process that goes with these solutions, but out-of-the-box observability solutions are quite capable of supporting infrastructure and application monitoring of these workloads. The second case is when AI assistants, such as chatbots, are included in the observability tooling that developers are exposed to. This is often in the form of a code assistant, such as one that helps fine tune a dashboard or query our time series data ad hoc. While these are nice to have, organizations are very mindful of developer usage when inputting queries that include proprietary or sensitive data. It's important to understand that training these tools might include using proprietary data in their training sets, or even the data developers input, to further train the agents for future query assistance. Predicting the future of AI-assisted observability is not going to be easy as organizations consider their data one of their top valued assets and will continue to protect its usage outside of their control to help improve tooling. To that end, one direction that might help adoption is to have agents trained only on in-house data, but that means the training data is smaller than publicly available agents. Cloud-Native Observability: The Developer Survival Pattern While we spend a lot of time on tooling as developers, we all understand that tooling is not always the fix for the complex problems we face. Observability is no different, and while developers are often exposed to the mantra of metrics, logs, and traces for solving their observability challenges, this is not the path to follow without considering the big picture. The amount of data generated in cloud-native environments, especially at scale, makes it impossible to continue collecting all data. This flood of data, the challenges that arise, and the inability to sift through the information to find the root causes of issues becomes detrimental to the success of development teams. It would be more helpful if developers were supported with just the right amount of data, in just the right forms, and at the right time to solve issues. One does not mind observability if the solution to problems are found quickly, situations are remediated faster, and developers are satisfied with the results. If this is done with one log line, two spans from a trace, and three metric labels, then that's all we want to see. To do this, developers need to know when issues arise with their applications or services, preferably before it happens. They start troubleshooting with data that has been determined by their instrumented applications to succinctly point to areas within the offending application. Any tooling allows the developer who's investigating to see dashboards reporting visual information that directs them to the problem and potential moment it started. It is crucial for developers to be able to remediate the problem, maybe by rolling back a code change or deployment, so the application can continue to support customer interactions. Figure 1 illustrates the path taken by cloud native developers when solving observability problems. 
The last step for any developer is to determine how issues encountered can be prevented going forward. Figure 1. Observability pattern Conclusion Observability is essential for organizations to succeed in a cloud native world. The splintering of responsibilities in observability, along with the challenges that cloud-native environments bring at scale, cannot be ignored. Understanding the challenges that developers face in cloud-native organizations is crucial to achieving observability happiness. Empowering developers, providing ways to tackle observability challenges, and understanding how the future of observability might look are the keys to handling observability in modern cloud environments. DZone Refcard resources: Full-Stack Observability Essentials by Joana Carvalho; Getting Started With OpenTelemetry by Joana Carvalho; Getting Started With Prometheus by Colin Domoney; Getting Started With Log Management by John Vester; and Monitoring and the ELK Stack by John Vester. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. 2024 and the dawn of cloud-native AI technologies marked a significant jump in computational capabilities. We're experiencing a new era where artificial intelligence (AI) and platform engineering converge to transform cloud computing landscapes. AI is now merging with cloud computing, and we're experiencing an age where AI transcends traditional boundaries, offering scalable, efficient, and powerful solutions that learn and improve over time. Platform engineering is providing the backbone for these AI systems to operate within cloud environments seamlessly. This shift entails designing, implementing, and managing the software platforms that serve as the fertile ground for AI applications to flourish. Together, the integration of AI and platform engineering in cloud-native environments is not just an enhancement but a transformative force, redefining the very fabric of how services are now being delivered, consumed, and evolved in the digital cosmos. The Rise of AI in Cloud Computing Azure and Google Cloud are pivotal solutions in cloud computing technology, each offering a robust suite of AI capabilities that cater to a wide array of business needs. Azure brings to the table its AI Services and Azure Machine Learning, a collection of AI tools that enable developers to build, train, and deploy AI models rapidly, thus leveraging its vast cloud infrastructure. Google Cloud, on the other hand, shines with its AI Platform and AutoML, which simplify the creation and scaling of AI products, integrating seamlessly with Google's data analytics and storage services. These platforms empower organizations to integrate intelligent decision-making into their applications, optimize processes, and provide insights that were once beyond reach. A quintessential case study that illustrates the successful implementation of AI in the cloud is that of the Zoological Society of London (ZSL), which utilized Google Cloud's AI to tackle the biodiversity crisis. ZSL's "Instant Detect" system harnesses AI on Google Cloud to analyze vast amounts of images and sensor data from wildlife cameras across the globe in real time. This system enables rapid identification and categorization of species, transforming the way conservation efforts are conducted by providing precise, actionable data, leading to more effective protection of endangered species. Such implementations as ZSL's not only showcase the technical prowess of cloud AI capabilities but also underscore their potential to make a significant positive impact on critical global issues. Platform Engineering: The New Frontier in Cloud Development Platform engineering is a multifaceted discipline that refers to the strategic design, development, and maintenance of software platforms to support more efficient deployment and application operations. It involves creating a stable and scalable foundation that provides developers the tools and capabilities needed to develop, run, and manage applications without the complexity of maintaining the underlying infrastructure. The scope of platform engineering spans the creation of internal development platforms, automation of infrastructure provisioning, implementation of continuous integration and continuous deployment (CI/CD) pipelines, and the insurance of the platforms' reliability and security. In cloud-native ecosystems, platform engineers play a pivotal role. 
They are the architects of the digital landscape, responsible for constructing the robust frameworks upon which applications are built and delivered. Their work involves creating abstractions on top of cloud infrastructure to provide a seamless development experience and operational excellence. Figure 1. Platform engineering from the top down Platform engineers enable teams to focus on creating business value by abstracting away complexities related to environment configurations, along with resource scaling and service dependencies. They guarantee that the underlying systems are resilient, self-healing, and can be deployed consistently across various environments. The convergence of DevOps and platform engineering with AI tools is an evolution that is reshaping the future of cloud-native technologies. DevOps practices are enhanced by AI's ability to predict, automate, and optimize processes. AI tools can analyze data from development pipelines to predict potential issues, automate root cause analyses, and optimize resources, leading to improved efficiency and reduced downtime. Moreover, AI can drive intelligent automation in platform engineering, enabling proactive scaling and self-tuning of resources, and personalized developer experiences. This synergy creates a dynamic environment where the speed and quality of software delivery are continually advancing, setting the stage for more innovative and resilient cloud-native applications. Synergies Between AI and Platform Engineering AI-augmented platform engineering introduces a layer of intelligence to automate processes, streamline operations, and enhance decision-making. Machine learning (ML) models, for instance, can parse through massive datasets generated by cloud platforms to identify patterns and predict trends, allowing for real-time optimizations. AI can automate routine tasks such as network configurations, system updates, and security patches; these automations not only accelerate the workflow but also reduce human error, freeing up engineers to focus on more strategic initiatives. There are various examples of AI-driven automation in cloud environments, such as implementing intelligent systems to analyze application usage patterns and automatically adjust computing resources to meet demand without human intervention. The significant cost savings and performance improvements provide exceptional value to an organization. AI-operated security protocols can autonomously monitor and respond to threats more quickly than traditional methods, significantly enhancing the security posture of the cloud environment. Predictive analytics and ML are particularly transformative in platform optimization. They allow for anticipatory resource management, where systems can forecast loads and scale resources accordingly. ML algorithms can optimize data storage, intelligently archiving or retrieving data based on usage patterns and access frequencies. Figure 2. AI resource autoscaling Moreover, AI can oversee and adjust platform configurations, ensuring that the environment is continuously refined for optimal performance. These predictive capabilities are not limited to resource management; they also extend to predicting application failures, user behavior, and even market trends, providing insights that can inform strategic business decisions. 
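As a deliberately simplified illustration of anticipatory resource management, the sketch below fits a linear trend to recent request-rate samples, forecasts the next interval, and derives a replica count. The per-replica capacity, sample data, and function names are assumptions for illustration; a production platform would rely on richer models and the native autoscaling APIs of its orchestrator:

```python
import math
from dataclasses import dataclass

REQUESTS_PER_REPLICA = 50.0  # assumed capacity of a single replica (illustrative)

@dataclass
class ScalingDecision:
    forecast_rps: float
    desired_replicas: int

def forecast_next(samples: list[float]) -> float:
    """Extrapolate a least-squares linear trend one step beyond the samples."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n)) or 1.0
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return max(0.0, slope * n + intercept)

def plan_scaling(samples: list[float], min_replicas: int = 2, max_replicas: int = 20) -> ScalingDecision:
    rps = forecast_next(samples)
    needed = math.ceil(rps / REQUESTS_PER_REPLICA)
    return ScalingDecision(rps, max(min_replicas, min(max_replicas, needed)))

if __name__ == "__main__":
    recent_rps = [220, 260, 310, 390, 450]   # requests/second over the last five intervals
    print(plan_scaling(recent_rps))          # scales out ahead of the rising trend
```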
The proactive nature of predictive analytics means that platform engineers can move from reactive maintenance to a more visionary approach, crafting platforms that are not just robust and efficient but also self-improving and adaptive to future needs. Changing Landscapes: The New Cloud Native The landscape of cloud native and platform engineering is rapidly evolving, particularly with leading cloud service providers like Azure and Google Cloud. This evolution is largely driven by the growing demand for more scalable, reliable, and efficient IT infrastructure, enabling businesses to innovate faster and respond to market changes more effectively. In the context of Azure, Microsoft has been heavily investing in Azure Kubernetes Service (AKS) and serverless offerings, aiming to provide more flexibility and ease of management for cloud-native applications. Azure's emphasis on DevOps, through tools like Azure DevOps and Azure Pipelines, reflects a strong commitment to streamlining the development lifecycle and enhancing collaboration between development and operations teams. Azure's focus on hybrid cloud environments, with Azure Arc, allows businesses to extend Azure services and management to any infrastructure, fostering greater agility and consistency across different environments. In the world of Google Cloud, they've been leveraging expertise in containerization and data analytics to enhance cloud-native offerings. Google Kubernetes Engine (GKE) stands out as a robust, managed environment for deploying, managing, and scaling containerized applications using Google's infrastructure. Google Cloud's approach to serverless computing, with products like Cloud Run and Cloud Functions, offers developers the ability to build and deploy applications without worrying about the underlying infrastructure. Google's commitment to open-source technologies and its leading-edge work in AI and ML integrate seamlessly into its cloud-native services, providing businesses with powerful tools to drive innovation. Both Azure and Google Cloud are shaping the future of cloud-native and platform engineering by continuously adapting to technological advancements and changing market needs. Their focus on Kubernetes, serverless computing, and seamless integration between development and operations underlines a broader industry trend toward more agile, efficient, and scalable cloud environments. Implications for the Future of Cloud Computing AI is set to revolutionize cloud computing, making cloud-native technologies more self-sufficient and efficient. Advanced AI will oversee cloud operations, enhancing performance and cost effectiveness while enabling services to self-correct. Yet integrating AI presents ethical challenges, especially concerning data privacy and decision-making bias, and poses risks requiring solid safeguards. As AI reshapes cloud services, sustainability will be key; future AI must be energy efficient and environmentally friendly to ensure responsible growth. Kickstarting Your Platform Engineering and AI Journey To effectively adopt AI, organizations must nurture a culture oriented toward learning and prepare by auditing their IT setup, pinpointing AI opportunities, and establishing data management policies. Further: Upskilling in areas such as machine learning, analytics, and cloud architecture is crucial. Launching AI integration through targeted pilot projects can showcase the potential and inform broader strategies. 
Collaborating with cross-functional teams and selecting cloud providers with compatible AI tools can streamline the process. Balancing innovation with consistent operations is essential for embedding AI into cloud infrastructures. Conclusion Platform engineering with AI integration is revolutionizing cloud-native environments, enhancing their scalability, reliability, and efficiency. By enabling predictive analytics and automated optimization, AI ensures cloud resources are effectively utilized and services remain resilient. Adopting AI is crucial for future-proofing cloud applications, and it necessitates foundational adjustments and a commitment to upskilling. The advantages include staying competitive and quickly adapting to market shifts. As AI evolves, it will further automate and refine cloud services, making a continued investment in AI a strategic choice for forward-looking organizations. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. The cloud-native application protection platform (CNAPP) model is designed to secure applications that leverage cloud-native technologies. Applications not in its scope are typically legacy systems that were not designed to operate within modern cloud infrastructures. Therefore, in practice, CNAPP covers the security of containerized applications, serverless functions, and microservices architectures, possibly running across different cloud environments. Figure 1. CNAPP capabilities across different application areas A good way to understand the goal of the security practices in CNAPPs is to look at the threat model, i.e., the attack scenarios against which applications are protected. Understanding these scenarios helps practitioners grasp the aim of features in CNAPP suites. Note also that the threat model might vary according to the industry, the usage context of the application, etc. In general, the threat model is tied to the dynamic and distributed nature of cloud-native architectures. Such applications face a large attack surface and an intricate threat landscape, mainly because of the complexity of their execution environment. In short, the model typically accounts for unauthorized access, data breaches due to misconfigurations, inadequate identity and access management policies, or simply vulnerabilities in container images or third-party libraries. Also, due to the ephemeral and scalable characteristics of cloud-native applications, CNAPPs require real-time mechanisms to ensure consistent policy enforcement and threat detection. This is to protect applications from automated attacks and advanced persistent threats. Some common threats and occurrences are shown in Figure 2: Figure 2. Typical threats against cloud-native applications Overall, the scope of the CNAPP model is quite broad, and vendors in this space must cover a significant number of security domains to meet the needs of the entire model. Let's review the specific challenges that CNAPP vendors face and the opportunities to improve the breadth of the model to address an extended set of threats. Challenges and Opportunities When Evolving the CNAPP Model To keep up with the evolving threat landscape and the complexity of modern organizations, the evolution of the CNAPP model yields both significant challenges and opportunities. The challenges and opportunities discussed in the following sections are briefly summarized in Table 1:

Table 1. Challenges and opportunities with evolving the CNAPP model

| Challenges | Opportunities |
|------------|---------------|
| Integration complexity – connect tools, services, etc. | Automation – AI and orchestration |
| Technological changes – tools must continually evolve | Proactive security – predictive and prescriptive measures |
| Skill gaps – tools must be friendly and efficient | DevSecOps – integration with DevOps security practices |
| Performance – security has to scale with complexity | Observability – extend visibility to the SDLC's left and right |
| Compliance – region-dependent, evolving landscape | Edge security – control security beyond the cloud |

Challenges The integration challenges that vendors face due to the scope of the CNAPP model are compounded by quick technological changes: Cloud technologies are continuously evolving, and vendors need to design tools that are user friendly.
Managing the complexity of cloud technology via simple, yet powerful, user interfaces allows organizations to cope with the notorious skill gaps in teams resulting from rapid technology evolution. An important aspect of the security measures delivered by CNAPPs is that they must be efficient enough to not impact the performance of the applications. In particular, when scaling applications, security measures should continue to perform gracefully. This is a general struggle with security — it should be as transparent as possible yet responsive and effective. An often industry-rooted challenge is regulatory compliance. The expansion of data protection regulations globally requires organizations to comply with evolving regulation frameworks. For vendors, this requires maintaining a wide perspective on compliance and incorporating these requirements into their tool capabilities. Opportunities In parallel, there are significant opportunities for CNAPPs to evolve to address the challenges. Taming complexity is an important factor to tackle head first to expand the scope of the CNAPP model. For that purpose, automation is a key enabler. For example, there is a significant opportunity to leverage artificial intelligence (AI) to accelerate routine tasks, such as policy enforcement and anomaly detection. The implementation of AI for operation automation is particularly important to address the previously mentioned scalability challenges. This capability enhances analytics and threat intelligence, particularly to offer predictive and prescriptive security capabilities (e.g., to advise users for the necessary settings in a given scenario). With such new AI-enabled capabilities, organizations can effectively address the skill gap by offering guided remediation, automated policy recommendations, and comprehensive visibility. An interesting opportunity closer to the code stage is integrating DevSecOps practices. While a CNAPP aims to protect cloud-native applications across their lifecycle, in contrast, DevSecOps embeds security practices that liaise between development, operations, and security teams. Enabling DevSecOps in the context of the CNAPP model covers areas such as providing integration with source code management tools and CI/CD pipelines. This integration helps detect vulnerabilities early and ensure that security is baked into the product from the start. Also, providing developers with real-time feedback on the security implications of their activities helps educate them on security best practices and thus reduce the organization’s exposure to threats. The main goal here is to "shift left" the approach to improve observability and to help reduce the cost and complexity of fixing security issues later in the development cycle. A last and rather forward-thinking opportunity is to evolve the model so that it extends to securing an application on “the edge,” i.e., where it is executed and accessed. A common use case is the access of a web application from a user device via a browser. The current CNAPP model does not explicitly address security here, and this opportunity should be seen as an extension of the operation stage to further “shield right” the security model. Technology Trends That Can Reshape CNAPP The shift left and shield right opportunities (and the related challenges) that I reviewed in the last section can be addressed by the technologies exemplified here. 
Firstly, the enablement of DevSecOps practices is an opportunity to further shift the security model to the left of the SDLC, moving security earlier in the development process. Current CNAPP practices already include looking at source code and container vulnerabilities. More often than not, visibility over these development artifacts starts once they have been pushed from the development laptop to a cloud-based repository. By using a secure implementation of cloud development environments (CDEs), from a CNAPP perspective, observability across performance and security can start from the development environment, as opposed to the online DevOps tool suites such as CI/CD and code repositories. Secondly, enforcing security for web applications at the edge is an innovative concept when looking at it from the perspective of the CNAPP model. This can be realized by integrating an enterprise browser into the model. For example: Security measures that aim to protect against insider threats can be implemented on the client side with mechanisms very similar to how mobile applications are protected against tampering. Measures to protect web apps against data exfiltration and prevent display of sensitive information can be activated based on injecting a security policy into the browser. Automation of security steps allows organizations to extend their control over web apps (e.g., using robotic process automation). Figure 3. A control component (left) fetches policies to secure app access and browsing (right) Figure 4 shows the impact of secure implementation of a CDE and enterprise browser on CNAPP security practices. The use of both technologies enables security to become a boon for productivity as automation plays the dual role of simplifying user-facing processes around security to the benefit of increased productivity. Figure 4. CNAPP model and DevOps SDLC augmented with secure cloud development and browsing Conclusion The CNAPP model and the tools that implement it should be evolving their coverage in order to add resilience to new threats. The technologies discussed in this article are examples of how coverage can be improved to the left and further to the right of the SDLC. The goal of increasing coverage is to provide organizations more control over how they implement and deliver security in cloud-native applications across business scenarios. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. In today's cloud computing landscape, businesses are embracing the dynamic world of hybrid and multi-cloud environments and seamlessly integrating infrastructure and services from multiple cloud vendors. This shift from a single provider is driven by the need for greater flexibility, redundancy, and the freedom to leverage the best features from each provider and create tailored solutions. Furthermore, the rise of cloud-native technologies is reshaping how we interact with the cloud. Containerization, serverless, artificial intelligence (AI), and edge computing are pushing the boundaries of what's possible, unlocking a new era of innovation and efficiency. But with these newfound solutions comes a new responsibility: cost optimization. The complexities of hybrid and multi-cloud environments, coupled with the dynamic nature of cloud-native deployments, require a strategic approach to managing cloud costs. This article dives into the intricacies of cloud cost management in this new era, exploring strategies, best practices, and frameworks to get the most out of your cloud investments. The Role of Containers in Vendor Lock-In Vendor lock-in occurs when a company becomes overly reliant on a specific cloud provider's infrastructure, services, and tools. This can have a great impact on both agility and cost. Switching to a different cloud provider can be a complex and expensive process, especially as apps become tightly coupled with the vendor's proprietary offerings. Additionally, vendor lock-in can limit you from negotiating better pricing options or accessing the latest features offered by other cloud providers. Containers are recognized for their portability and ability to package applications for seamless deployment across different cloud environments by encapsulating an application's dependencies within a standardized container image (as seen in Figure 1). This means that you can theoretically move your containerized application from one cloud provider to another without significant code modifications. This flexibility affects greater cost control as you're able to leverage the competitive nature of the cloud landscape to negotiate the best deals for your business. Figure 1. Containerization explained With all that being said, complete freedom from vendor lock-in remains a myth with containers. While application code may be portable, configuration management tools, logging services, and other aspects of your infrastructure might still be tied up with the specific vendor's offerings. An approach that leverages open-source solutions whenever possible can maximize the portability effects of containers and minimize the risk of vendor lock-in. The Importance of Cloud Cost Management With evolving digital technologies, where startups and enterprises alike depend on cloud services for their daily operations, efficient cloud cost management is essential. To maximize the value of your cloud investment, understanding and controlling cloud costs not only prevents budget overruns but also ensures that resources are used optimally. The first step in effective cloud cost management is understanding your cloud bill. Most cloud providers now offer detailed billing reports that break down your spending by service, resource type, and region. Familiarize yourself with these reports and identify the primary cost drivers for your environment. 
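As a rough illustration of that first step, the sketch below aggregates a billing export by service to surface the top cost drivers. The CSV file name and its 'service' and 'cost' columns are assumptions; real provider exports use their own schemas:

```python
import csv
from collections import defaultdict

def top_cost_drivers(billing_csv: str, limit: int = 5) -> list[tuple[str, float]]:
    """Sum spend per service from a billing export (assumed 'service' and 'cost' columns)."""
    totals: defaultdict[str, float] = defaultdict(float)
    with open(billing_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["service"]] += float(row["cost"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]

if __name__ == "__main__":
    # Hypothetical export downloaded from the provider's cost reporting console
    for service, cost in top_cost_drivers("billing_export.csv"):
        print(f"{service:<25} ${cost:,.2f}")
```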
Common cost factors include: transfer rates, storage needs, and the compute cycles consumed by your services. Once you have an understanding of these drivers, the next step is to identify and eliminate any cloud waste. Wasteful cloud spending is often attributed to unused or underutilized resources, such as instances left running overnight or on weekends, and this can significantly inflate your cloud bill. You can eliminate this waste by leveraging tools like autoscaling to automatically adjust resources based on demand. Additionally, overprovisioning (allocating more resources than necessary) can be another major cost driver. Practices such as rightsizing, where you adjust the size of your cloud resources to match demand, can lead to cost savings. Continuous monitoring and analysis of resource utilization is necessary to ensure that each service is perfectly fitted to its needs, neither over- nor under-provisioned. Finally, most cloud providers now offer cost-saving programs that can help optimize your spending. These may include reserved instances, where you get discounts for committing to a specific resource for a fixed period, or Spot instances, which allow you to use unused capacity at a significantly lower price. Taking advantage of such programs requires a deep understanding of your current and projected usage to select the most beneficial option. Effective cloud cost management is not just about cutting costs but also about optimizing cloud usage in a way that aligns with organizational goals and strategies. Selecting the Best Cloud Options for Your Organization As a one-size-fits-all approach doesn't really exist when working with the cloud, choosing the best options for your specific needs is paramount. Below are some strategies that can help. Assessing Organizational Needs A thorough assessment of your organizational needs involves analyzing your workload characteristics, scalability, and performance requirements. For example, mission-critical applications with high resource demands might need different cloud configurations than static web pages. You can evaluate your current usage patterns and future project needs using machine learning and AI. Security and compliance needs are equally important considerations. Certain industries face regulatory requirements that can dictate data-handling and processing protocols. Identifying a cloud provider that meets these security and compliance standards is non-negotiable for protecting sensitive information. This initial assessment will help you identify which cloud services are suitable for your business needs and implement a proactive approach to cloud cost optimization. Evaluating Cloud Providers Once you have a clear understanding, the next step is to compare the offerings of different cloud providers. Evaluate their services based on key metrics, such as performance, cost efficiency, and the quality of customer support. Take advantage of free trials and demos to test-drive their services and better assess their suitability. The final decision often comes down to one question: adopt a single- or multi-cloud strategy? Each approach offers distinct advantages and disadvantages, so the optimal choice depends on your specific needs and priorities. The table below compares the key features of single-cloud and multi-cloud strategies to help you make an informed decision: Table 1. Single- vs. multi-cloud approaches
| Feature | Single-Cloud | Multi-Cloud |
|---------|--------------|-------------|
| Simplicity | Easier to manage; single point of contact | More complex to manage; requires expertise in multiple platforms |
| Cost | Potentially lower costs through volume discounts | May offer lower costs overall by leveraging the best pricing models from different providers |
| Vendor lock-in | High; limited flexibility to switch providers | Low; greater freedom to choose and switch providers |
| Performance | Consistent performance if the provider is chosen well | May require optimization for performance across different cloud environments |
| Security | Easier to implement and maintain consistent security policies | Requires stronger security governance to manage data across multiple environments |
| Compliance | Easier to comply with regulations if provider offerings align with needs | May require additional effort to ensure compliance across different providers |
| Scalability | Scalable within the chosen provider's ecosystem | Offers greater horizontal scaling potential by leveraging resources from multiple providers |
| Innovation | Limited to innovations offered by the chosen provider | Access to a wider range of innovations and features from multiple providers |

Modernizing Cloud Tools and Architectures Having selected the right cloud options and established a solid foundation for cloud cost management, you need to ensure your cloud environment is optimized for efficiency and cost control. This requires a proactive approach that continuously evaluates and modernizes your cloud tools and architectures. Here, we introduce a practical framework for cloud modernization and continuous optimization:
- Assessment – Analyze your current cloud usage using cost management platforms and identify inefficiencies and opportunities for cost reduction. Pinpoint idle or underutilized resources that can be scaled down or eliminated.
- Planning – Armed with these insights, define clear goals and objectives for your efforts. These goals might include reducing overall cloud costs by a specific percentage, optimizing resource utilization, or improving scalability. Once you establish your goals, choose the right optimization strategies that will help you achieve them.
- Implementation – Now it is time to put your plan into action. This can mean implementing cost-saving measures like autoscaling, which automatically adjusts your resources based on demand. Cloud cost management platforms can also play a crucial role in providing real-time visibility and automated optimization recommendations.
- Monitoring and optimization – Cloud modernization is an ongoing process that requires continuous monitoring and improvement. Regularly review your performance metrics, cloud costs, and resource utilization to adapt your strategies as needed.
Figure 2. A framework for modernizing cloud environments By following this framework, you can systematically improve your cloud environment and make sure it remains cost-effective. Conclusion Cloud technologies offer a lot of benefits for businesses of all sizes. However, without a strategic approach to cost management, these benefits can be overshadowed by unexpected expenses. By following the best practices in this article, from understanding your cloud requirements and selecting the best cloud option to adopting continuous optimization for your tools and architectures, you can ensure your cloud journey is under financial control.
Looking ahead, the future of cloud computing looks exciting as serverless, AI, and edge computing promise to unlock even greater agility, scalability, and efficiency. Staying informed about these advancements, new pricing models, and emerging tools will be really important to maximize the value of your cloud investment. Cost optimization is not a one-time endeavor but an ongoing process that requires continuous monitoring, adaptation, and a commitment to extract the most value out of your cloud resources. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.Read the Free Report
The monolithic architecture was historically used by developers for a long time — and for a long time, it worked. Unfortunately, these architectures use fewer parts that are larger, thus meaning they were more likely to fail in entirety if a single part failed. Often, these applications ran as a singular process, which only exacerbated the issue. Microservices solve these specific issues by having each microservice run as a separate process. If one cog goes down, it doesn’t necessarily mean the whole machine stops running. Plus, diagnosing and fixing defects in smaller, highly cohesive services is often easier than in larger monolithic ones. Microservices design patterns provide tried-and-true fundamental building blocks that can help write code for microservices. By utilizing patterns during the development process, you save time and ensure a higher level of accuracy versus writing code for your microservices app from scratch. In this article, we cover a comprehensive overview of microservices design patterns you need to know, as well as when to apply them. Key Benefits of Using Microservices Design Patterns Microservices design patterns offer several key benefits, including: Scalability: Microservices allow applications to be broken down into smaller, independent services, each responsible for a specific function or feature. This modular architecture enables individual services to be scaled independently based on demand, improving overall system scalability and resource utilization. Flexibility and agility: Microservices promote flexibility and agility by decoupling different parts of the application. Each service can be developed, deployed, and updated independently, allowing teams to work autonomously and release new features more frequently. This flexibility enables faster time-to-market and easier adaptation to changing business requirements. Resilience and fault isolation: Microservices improve system resilience and fault isolation by isolating failures to specific services. If one service experiences an issue or failure, it does not necessarily impact the entire application. This isolation minimizes downtime and improves system reliability, ensuring that the application remains available and responsive. Technology diversity: Microservices enable technology diversity by allowing each service to be built using the most suitable technology stack for its specific requirements. This flexibility enables teams to choose the right tools and technologies for each service, optimizing performance, development speed, and maintenance. Improved development and deployment processes: Microservices streamline development and deployment processes by breaking down complex applications into smaller, manageable components. This modular architecture simplifies testing, debugging, and maintenance tasks, making it easier for development teams to collaborate and iterate on software updates. Scalability and cost efficiency: Microservices enable organizations to scale their applications more efficiently by allocating resources only to the services that require them. This granular approach to resource allocation helps optimize costs and ensures that resources are used effectively, especially in cloud environments where resources are billed based on usage. Enhanced fault tolerance: Microservices architecture allows for better fault tolerance as services can be designed to gracefully degrade or fail independently without impacting the overall system. 
This ensures that critical functionalities remain available even in the event of failures or disruptions. Easier maintenance and updates: Microservices simplify maintenance and updates by allowing changes to be made to individual services without affecting the entire application. This reduces the risk of unintended side effects and makes it easier to roll back changes if necessary, improving overall system stability and reliability. Let's go ahead and look at the different microservices design patterns. Database per Service Pattern The database is one of the most important components of microservices architecture, but it isn't uncommon for developers to overlook the database per service pattern when building their services. Database organization will affect the efficiency and complexity of the application. The most common options that a developer can use when determining the organizational architecture of an application are: Dedicated Database for Each Service A database dedicated to one service can't be accessed by other services. This is one of the reasons it is much easier to scale and understand from an end-to-end business perspective. Picture a scenario where your databases have different needs or access requirements. The data owned by one service may be largely relational, while a second service might be better served by a NoSQL solution and a third service may require a vector database. In this scenario, using a dedicated database for each service could help you manage them more easily. This structure also reduces coupling, as one service can't tie itself to the tables of another. Services are forced to communicate via published interfaces. The downside is that dedicated databases require a failure protection mechanism for events where communication fails. Single Database Shared by All Services A single shared database isn't the standard for microservices architecture but bears mentioning as an alternative nonetheless. Here, the issue is that microservices using a single shared database lose many of the key benefits developers rely on, including scalability, robustness, and independence. Still, sharing a physical database may be appropriate in some situations. When a single database is shared by all services, though, it's very important to enforce logical boundaries within it. For example, each service should have its own schema, and read/write access should be restricted to ensure that services can't poke around where they don't belong. Saga Pattern A saga is a series of local transactions. In microservices applications, a saga pattern can help maintain data consistency during distributed transactions. The saga pattern is an alternative to other design patterns because it allows for multiple transactions while providing rollback opportunities. A common scenario is an e-commerce application that allows customers to purchase products using credit. Data may be stored in two different databases: one for orders and one for customers. The purchase amount can't exceed the credit limit. To implement the saga pattern, developers can choose between two common approaches. 1. Choreography Using the choreography approach, a service will perform a transaction and then publish an event. In some instances, other services will respond to those published events and perform tasks according to their coded instructions. These secondary tasks may or may not also publish events, according to presets.
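To ground choreography in the order-and-credit example above, here is a heavily simplified sketch that stands an in-memory event bus in for a real message broker. The event names, services, and compensating step are illustrative assumptions; a production system would use durable messaging and persistent state:

```python
from collections import defaultdict

# Minimal in-memory event bus standing in for a real message broker.
subscribers = defaultdict(list)

def publish(event: str, payload: dict) -> None:
    for handler in subscribers[event]:
        handler(payload)

def subscribe(event: str):
    def register(handler):
        subscribers[event].append(handler)
        return handler
    return register

orders = {}                      # order service's local data
credit = {"customer-1": 100.0}   # credit service's local data

def place_order(order_id: str, customer: str, amount: float) -> None:
    orders[order_id] = {"customer": customer, "amount": amount, "status": "PENDING"}
    publish("OrderPlaced", {"order_id": order_id, "customer": customer, "amount": amount})

@subscribe("OrderPlaced")
def reserve_credit(evt: dict) -> None:          # credit service's local transaction
    if credit.get(evt["customer"], 0.0) >= evt["amount"]:
        credit[evt["customer"]] -= evt["amount"]
        publish("CreditReserved", evt)
    else:
        publish("CreditRejected", evt)          # triggers the compensating step below

@subscribe("CreditReserved")
def approve_order(evt: dict) -> None:
    orders[evt["order_id"]]["status"] = "APPROVED"

@subscribe("CreditRejected")
def cancel_order(evt: dict) -> None:            # compensating action keeps data consistent
    orders[evt["order_id"]]["status"] = "CANCELLED"

place_order("o-1", "customer-1", 80.0)   # approved
place_order("o-2", "customer-1", 50.0)   # rejected: remaining credit is only 20
print(orders)
```

Notice that no central coordinator exists: each service reacts to the events it cares about and publishes its own, which is exactly what distinguishes choreography from orchestration.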
In the example above, you could use a choreography approach so that each local e-commerce transaction publishes an event that triggers a local transaction in the credit service. Benefits of Choreography After having explained the term itself let us take a closer look at the benefits of using a choreographed pattern for a microservice architecture. The most important ones are outlined in the bulleted list below: Loose coupling: Choreography allows microservices to be loosely coupled, which means they can operate independently and asynchronously without depending on a central coordinator. This can make the system more scalable and resilient, as the failure of one microservice will not necessarily affect the other microservices. Ease of maintenance: Choreography allows microservices to be developed and maintained independently, which can make it easier to update and evolve the system. Decentralized control: Choreography allows control to be decentralized, which can make the system more resilient and less prone to failure. Asynchronous communication: Choreography allows microservices to communicate asynchronously, which can be more efficient and scalable than synchronous communication. Overall, choreography can be a useful design pattern for building scalable, resilient, and maintainable microservice architectures. Though some of these benefits can actually turn into drawbacks. 2. Orchestration An orchestration approach will perform transactions and publish events using an object to orchestrate the events, triggering other services to respond by completing their tasks. The orchestrator tells the participants what local transactions to execute. Saga is a complex design pattern that requires a high level of skill to successfully implement. However, the benefit of proper implementation is maintained data consistency across multiple services without tight coupling. Benefits of Orchestration Orchestration in microservice architectures can lead to some nice benefits which compensate for the drawbacks of a choreographed system. A few of them are explained below: Simplicity: Orchestration can be simpler to implement and maintain than choreography, as it relies on a central coordinator to manage and coordinate the interactions between the microservices. Centralized control: With a central coordinator, it is easier to monitor and manage the interactions between the microservices in an orchestrated system. Visibility: Orchestration allows for a holistic view of the system, as the central coordinator has visibility into all of the interactions between the microservices. Ease of troubleshooting: With a central coordinator, it is easier to troubleshoot issues in an orchestrated system. When to use Orchestration vs Choreography Whether you want to use choreography or orchestration in your microservice architecture should always be a well-thought-out choice. Both approaches bring their advantages but also downsides. API Gateway Pattern For large applications with multiple clients, implementing an API gateway pattern is a compelling option One of the largest benefits is that it insulates the client from needing to know how services have been partitioned. However, different teams will value the API gateway pattern for different reasons. One of these possible reasons is that it grants a single entry point for a group of microservices by working as a reverse proxy between client apps and the services. 
Another is that clients don’t need to know how services are partitioned, and service boundaries can evolve independently since the client knows nothing about them. The client also doesn’t need to know how to find or communicate with a multitude of ever-changing services. You can also create a gateway for specific types of clients (for example, backends for frontends) which improves ergonomics and reduces the number of roundtrips needed to fetch data. Plus, an API gateway pattern can take care of crucial tasks like authentication, SSL termination, and caching, which makes your app more secure and user-friendly. Another advantage is that the pattern insulates the client from needing to know how services have been partitioned. Before moving on to the next pattern, there’s one more benefit to cover: Security. The primary way the pattern improves security is by reducing the attack surface area. By providing a single entry point, the API endpoints aren’t directly exposed to clients, and authorization and SSL can be efficiently implemented. Developers can use this design pattern to decouple internal microservices from client apps so a partially failed request can be utilized. This ensures a whole request won’t fail because a single microservice is unresponsive. To do this, the encoded API gateway utilizes the cache to provide an empty response or return a valid error code. Circuit Breaker Design Pattern This pattern is usually applied between services that are communicating synchronously. A developer might decide to utilize the circuit breaker when a service is exhibiting high latency or is completely unresponsive. The utility here is that failure across multiple systems is prevented when a single microservice is unresponsive. Therefore, calls won’t be piling up and using the system resources, which could cause significant delays within the app or even a string of service failures. Implementing this pattern as a function in a circuit breaker design requires an object to be called to monitor failure conditions. When a failure condition is detected, the circuit breaker will trip. Once this has been tripped, all calls to the circuit breaker will result in an error and be directed to a different service. Alternatively, calls can result in a default error message being retrieved. There are three states of the circuit breaker pattern functions that developers should be aware of. These are: Open: A circuit breaker pattern is open when the number of failures has exceeded the threshold. When in this state, the microservice gives errors for the calls without executing the desired function. Closed: When a circuit breaker is closed, it’s in the default state and all calls are responded to normally. This is the ideal state developers want a circuit breaker microservice to remain in — in a perfect world, of course. Half-open: When a circuit breaker is checking for underlying problems, it remains in a half-open state. Some calls may be responded to normally, but some may not be. It depends on why the circuit breaker switched to this state initially. Command Query Responsibility Segregation (CQRS) A developer might use a command query responsibility segregation (CQRS) design pattern if they want a solution to traditional database issues like data contention risk. CQRS can also be used for situations when app performance and security are complex and objects are exposed to both reading and writing transactions. 
The way this works is that CQRS is responsible for either changing the state of the entity or returning the result in a transaction. Multiple views can be provided for query purposes, and the read side of the system can be optimized separately from the write side. This shift allows for a reduction in the complexity of all apps by separately querying models and commands so: The write side of the model handles persistence events and acts as a data source for the read side The read side of the model generates projections of the data, which are highly denormalized views Asynchronous Messaging If a service doesn’t need to wait for a response and can continue running its code post-failure, asynchronous messaging can be used. Using this design pattern, microservices can communicate in a way that’s fast and responsive. Sometimes this pattern is referred to as event-driven communication. To achieve the fastest, most responsive app, developers can use a message queue to maximize efficiency while minimizing response delays. This pattern can help connect multiple microservices without creating dependencies or tightly coupling them. While there are tradeoffs one makes with async communication (such as eventual consistency), it’s still a flexible, scalable approach to designing a microservices architecture. Event Sourcing The event-sourcing design pattern is used in microservices when a developer wants to capture all changes in an entity’s state. Using event stores like Kafka or alternatives will help keep track of event changes and can even function as a message broker. A message broker helps with the communication between different microservices, monitoring messages and ensuring communication is reliable and stable. To facilitate this function, the event sourcing pattern stores a series of state-changing events and can reconstruct the current state by replaying the occurrences of an entity. Using event sourcing is a viable option in microservices when transactions are critical to the application. This also works well when changes to the existing data layer codebase need to be avoided. Strangler-Fig Pattern Developers mostly use the strangler design pattern to incrementally transform a monolith application to microservices. This is accomplished by replacing old functionality with a new service — and, consequently, this is how the pattern receives its name. Once the new service is ready to be executed, the old service is “strangled” so the new one can take over. To accomplish this successful transfer from monolith to microservices, a facade interface is used by developers that allows them to expose individual services and functions. The targeted functions are broken free from the monolith so they can be “strangled” and replaced. Utilizing Design Patterns To Make Organization More Manageable Setting up the proper architecture and process tooling will help you create a successful microservice workflow. Use the design patterns described above and learn more about microservices in my blog to create a robust, functional app.
The complexity of distributed systems is an important challenge for engineers and developers. Complexity tends to increase as the system evolves, and therefore it is important to be proactive. Let's talk about which types of complexity you may encounter and which tactics are effective for dealing with it in your work. Distributed Systems and Complexity In development, a distributed system is a network of computers that are connected to each other and work on a single task. Each computer or node has its own local memory and processor and runs its own processes. However, they use a common network for coordination and centralization. A distributed system can be made very reliable, as the failure of one component does not have to disrupt the entire network. In a centralized computing system, one computer with one processor and one memory works on solving problems. In a centralized system, there are nodes, but they access the central node, which can cause network congestion and slowness. A centralized system has a single point of failure — this is an important disadvantage. Complexity Complexity can be defined from different perspectives and aspects. There are two main definitions that are important. In systems theory, complexity describes how different independent parts of the system interact and communicate with each other: how they define interactions with each other, how they depend on each other, how many dependencies they have, and also how they interact within the whole. From a software and technology perspective, complexity refers to the details of the software architecture, such as the number of components. Monolithic Architecture Monolithic architecture is a great example of a centralized system. It is represented as a single deployable and single executable component. For instance, such a component may contain a user interface and different modules located in one place. Although this architecture is a traditional one for building software, it has several important drawbacks: Inability to scale modules independently Harder to control the growing complexity No independent deployment of modules Challenging to maintain a huge codebase Technology and vendor coupling Microservices Architecture Microservices architecture is an architectural style and a variant of service-oriented architecture that structures the system as a collection of loosely coupled services. For example, companies, accounts, customers, and the UI are represented as separate processes deployed on multiple nodes. Each of these services has its own database; from time to time services share a database, but that is generally a bad practice or even an antipattern. There are some advantages to such an architecture. Horizontal scalability is a game-changer! You can scale the database horizontally, and you can scale your services horizontally. Technically, any infrastructure component can be scaled horizontally by cloning, but many challenges must be solved. High availability and fault tolerance: Whenever you have several clones, you may apply techniques that will help you avoid downtime in case of crashes, memory leaks, or power outages. Geographic distribution: If we have customers in the USA, Europe, or Asia, and we want to bring the best experience to all of them, we need to distribute these services across the world and organize more complicated techniques for data replication. Technology choice: You are free to choose your solutions.
Quality Attributes There are three main quality attributes that any system has at some level or another: Reliability: Continuing to function properly despite challenges, meaning being fault-tolerant or resilient. Even if a system operates reliably now, it doesn't guarantee future reliability. Scalability: A frequent cause of performance degradation is increased load: for example, the system might have expanded from 10,000 to 100,000 concurrent users, or from 1 million to 10 million. Scalability is the term we use to describe a system's ability to handle increased load. It is important to note that the scalability of the whole system is limited by its weakest component. Maintainability: Maintainability is about making life better for the engineering and operations teams who need to work with the system. Good and stable abstractions can help reduce complexity and make the system easier to modify and adapt to new use cases. What Are the Main Issues? "Anything that can go wrong will go wrong and at the worst possible time." — Murphy's Law Unreliable Networks There are many reasons why networks are not reliable, such as: Your request may have been lost. Your request may be waiting in a queue and will be delivered later. The remote node may have failed (perhaps it crashed or was powered down). The remote node may have temporarily stopped responding. The remote node may have processed your request, but the response has been lost on the network. The remote node may have processed your request, but the response has been delayed and will be delivered later. Strategy: Timeout The simplest solution to the problem is to apply timeout logic on the caller's side. For example, if the caller doesn't receive a response after some timeout, it just throws an error and shows an error to the user. Strategy: Retry At scale, we can't just throw exceptions for every network problem and upset users or delay system execution. So, if a response indicates that something went wrong, just retry the request. But what if the request was processed by the server and only the response was lost? In this scenario, retries may lead to severe consequences like duplicate orders, payments, transactions, and so on. Strategy: Idempotency To avoid that, we can utilize a technique named idempotency. The concept of idempotency pertains to the notion that performing the same action multiple times has the same effect as performing it just once. To achieve the property of exactly-once semantics, a solution can be employed that attaches an idempotency key to the request. Upon retrying the same request with the identical idempotency key, the server will verify that a request with such a key has already been processed and will simply return the previous response. Consequently, any number of retries with the same key will not have a deleterious effect on the system's behavior. Strategy: Circuit Breaker Another pattern that might be useful in preventing overload and completely crashing the server in case of failure is the circuit breaker. A circuit breaker acts as a proxy for calls to a system that is under maintenance, likely to fail, or heavily failing right now. There are so many reasons why it can go wrong: a memory leak, a bug in the code, or faulty external dependencies. In such scenarios, it is better to fail fast rather than risk cascading failures. A minimal sketch that combines the timeout, retry, and idempotency strategies is shown below.
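Here is that sketch: a bounded number of retries, a per-attempt timeout, and an idempotency key reused across attempts. The endpoint, header name, and retry limits are illustrative assumptions, and it assumes a runtime where fetch and AbortSignal.timeout are available (e.g., a recent Node.js version).
TypeScript
// Minimal sketch: bounded retries, per-attempt timeout, and an idempotency key.
// The URL, header name, and limits are assumptions for illustration only.
import { randomUUID } from 'node:crypto';

async function createPayment(payload: { orderId: string; amount: number }): Promise<Response> {
  const idempotencyKey = randomUUID(); // same key reused across retries
  const maxAttempts = 3;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await fetch('https://payments.example.com/payments', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Idempotency-Key': idempotencyKey, // server deduplicates on this key
        },
        body: JSON.stringify(payload),
        signal: AbortSignal.timeout(2_000), // per-attempt timeout (Strategy: Timeout)
      });
      if (response.ok) return response;
      if (response.status < 500) return response; // non-5xx responses: retrying will not help
    } catch (error) {
      if (attempt === maxAttempts) throw error; // Strategy: Retry, but bounded
    }
    await new Promise((resolve) => setTimeout(resolve, 200 * attempt)); // simple backoff
  }
  throw new Error('payment request failed after retries');
}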
Concurrency and Lost Writes Concurrency represents one of the most intricate challenges in distributed systems. Concurrency implies the simultaneous occurrence of multiple computations. Consequently, what occurs when an attempt is made to update the account balance simultaneously from disparate operations? In the absence of a defensive mechanism, it is highly probable that race conditions will ensue, which will inevitably result in the loss of writes and data inconsistency. In this example, two operations are attempting to update the account balance concurrently. Since they are running in parallel, the last one to complete wins, resulting in a significant issue. To circumvent this problem, various techniques can be employed. Strategy: Snapshot Isolation The ACID acronym stands for Atomicity, Consistency, Isolation, and Durability. All of the popular SQL databases implement these properties. Atomicity specifies that an operation will either be completely executed or fail entirely, no matter at what stage the failure happens. It allows us to ensure that another thread cannot see the half-finished result of the operation. Consistency means that all invariants are defined and will be satisfied before successfully committing a transaction and changing the state. Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other. The serializable isolation level is the strictest one, processing all transactions as if they ran sequentially, but in popular databases another level, named snapshot isolation, is mainly used. The key idea of this level is that the database tracks record versions and fails to commit a transaction if the records it touches were already modified outside of the current transaction. Durability promises that once the transaction is committed, all the data is stored safely. Strategy: Compare and Set Most NoSQL databases do not provide ACID properties, choosing BASE instead; in such databases, compare-and-set is used. The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect and the read-modify-write cycle must be retried. For instance, Cassandra provides lightweight transactions that allow you to utilize various IF, IF NOT EXISTS, and IF EXISTS conditionals to prevent concurrency issues. Strategy: Lease Another potential solution is the lease pattern. To illustrate, consider a scenario where a resource must be updated exclusively. The lease pattern entails first obtaining a lease with an expiration period for the resource, then updating it, and finally returning the lease. In the event of failures, the lease will automatically expire, allowing another thread to access the resource. Although this technique is highly beneficial, there is a risk of process pauses and clock desynchronization, which may lead to issues with parallel resource access. Dual Write Problem The dual write problem is a challenge that arises in distributed systems, particularly when multiple data sources or databases must be kept in sync. To illustrate, consider a scenario in which new data must be stored in the database and messages are sent to Kafka. Since these two operations are not atomic, there is a possibility of failure during the publishing of new messages. Sending the messages while the database transaction is still open is even more problematic: in the event that the transaction fails to commit, external systems may have already been informed of changes that, in fact, did not occur.
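A minimal sketch of the dual-write hazard described above, using in-memory stand-ins for the database and the message broker; the point is only that the two writes happen outside any shared transaction.
TypeScript
// Sketch of the dual-write problem: two independent writes with no shared transaction.
// The repository and broker below are in-memory stand-ins used only for illustration.
type Order = { id: string; amount: number };

const orderRepository = {
  rows: [] as Order[],
  async save(order: Order) { this.rows.push(order); }, // stands in for an SQL INSERT
};

const eventBroker = {
  published: [] as string[],
  async publish(topic: string, message: string) { this.published.push(`${topic}:${message}`); }, // stands in for Kafka
};

async function placeOrder(order: Order): Promise<void> {
  await orderRepository.save(order); // write #1: database
  // If the process crashes here, the order exists but no event is ever published.
  await eventBroker.publish('orders', JSON.stringify(order)); // write #2: message broker
  // Publishing first and committing afterwards is no better: a failed commit would
  // leave external systems informed about an order that does not exist.
}

placeOrder({ id: 'order-1', amount: 99 });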
Strategy: Transactional Outbox One potential solution is the implementation of a transactional outbox. This involves the storage of events in an "OutboxEvents" table within the same transaction as the operation itself. Due to the atomicity of the process, no data will be stored in the event of a transaction failure. Another necessary component is a relay, which polls the OutboxEvents table at regular intervals and sends messages to destinations. This approach achieves an at-least-once delivery guarantee. Possible duplicates are not a concern, since all consumers must be idempotent anyway due to the unreliability of the network. Strategy: Log Tailing An alternative to building a custom transactional outbox is the utilization of the database transaction log and custom connectors to read directly from this log and send changes to destinations. This approach has its own advantages and disadvantages. For instance, it requires coupling to a specific database solution but allows for writing less code in the application. Unreliable Clocks Time tracking is a fundamental aspect of any software or infrastructure, as it enables the enforcement of timeouts, expirations, and the gathering of metrics. However, the reliability of clocks represents a significant challenge in distributed systems, as the accuracy of time is contingent upon the performance of individual computers, which may have clocks that are either faster or slower than others. There are two primary types of clocks utilized by computers: time-of-day and monotonic clocks. Time-of-day clocks return the date and time according to a specific calendar, and they are typically synchronized with the Network Time Protocol (NTP). However, delays and network issues may affect the synchronization process, leading to clock desynchronization. Monotonic clocks continuously advance, making them suitable for measuring durations. However, the monotonically increasing value is unique to each computer, limiting their use for multi-server date and time comparison. Achieving highly accurate clock synchronization is a challenging task. In the majority of cases, the necessity for such a solution is not apparent. However, in instances where compliance with regulations necessitates its use, the Precision Time Protocol can be employed, although this will entail a significant investment. Availability and Consistency The CAP theorem posits that any distributed data store can only satisfy two of the three guarantees: consistency, availability, and partition tolerance. However, since network unreliability is not a factor that can be significantly influenced, in the case of network partitions the only viable option is to choose between availability and consistency. Consider the scenario in which two clients read from different nodes: one from the leader (primary) node and another from a follower. Replication is configured to update the followers after the leader has been changed. However, what happens if, for some reason, the leader stops responding? This could be a crash, network partitioning, or another issue. In highly available systems, a new leader must be assigned, but how do we choose between the existing followers? To address this issue, a distributed consensus algorithm must be employed. However, before delving into the specifics of this algorithm, it is essential to gain a comprehensive understanding of the various types of consistency. Consistency Type There are two main classes of consistency used to describe guarantees.
Weak consistency, or eventual consistency, means that data will be synchronized on all followers after some time if you stop making changes to the leader. Strong consistency is a property that ensures that all nodes in the system see the same data at the same time, regardless of which node they are accessing. Strategy: Distributed Consensus Algorithm (e.g., Raft) Returning to the problem when the leader crashes, there is a need to elect a new leader. This problem, at first glance, looks easy, but in reality, there are many conditions and tradeoffs that have to be taken into account when selecting the appropriate approach. Per the Raft protocol, if followers do not receive data or a heartbeat from the leader for a specified period of time, then a new leader election process begins. Each replication unit (a monolithic write node or multiple shards) is associated with a set of Raft logs and OS processes that maintain the logs and replicate changes from the leader to followers. The Raft protocol guarantees that followers receive log records in the same order they are generated by the leader. A user transaction is committed on the leader as soon as a majority of the replicas acknowledge the receipt of the commit record and write it to their Raft logs. Strategy: Read From Leader One simple and effective strategy is to have the user who just saved new data read it back from the leader, which avoids the effects of replication lag. Instead of a Conclusion From monolithic architectures to microservices, each approach presents its own set of advantages and challenges. While monolithic architectures offer simplicity, they often struggle with scalability and maintainability, pushing developers towards a more modular and scalable microservices architecture. Central to the discussion is the management of complexity, which manifests in various forms, from network unreliability to concurrency issues and the dual write problem. Strategies such as timeouts, retries, idempotency, and circuit breakers offer effective tools for mitigating the risks associated with unreliable networks, while techniques like snapshot isolation, compare and set, and leases address the challenges of concurrency and lost writes. Furthermore, the critical issue of unreliable clocks underscores the importance of accurate time synchronization in distributed systems, with solutions ranging from NTP synchronization to the Precision Time Protocol. Additionally, the CAP theorem reminds us of the inherent trade-offs between availability and consistency, necessitating a thorough understanding of distributed consensus algorithms like Raft. In conclusion, mastering the maze of complexity in distributed systems requires a multifaceted approach, combining theoretical knowledge with practical strategies. By embracing these strategies and continuously adapting to the evolving landscape of distributed computing, engineers and developers can navigate the complexities with confidence, ensuring the reliability, scalability, and maintainability of their systems in the face of ever-changing challenges.
Recently, while working on a project at work, we came to an architectural choice: whether to use an API Gateway as the interface of a backend service, put the service behind a load balancer, or have the API Gateway route requests to a load balancer fronting the service. While debating these architectural choices with my peers, I realized this is a problem many software development engineers face while designing solutions in their domain. This article will hopefully simplify these concepts and help you choose the one that works best for your individual use cases. Callout Please understand the requirements or the problem you are working on first, as the choice you make will be highly dependent on the use cases or requirements you have. Application Programming Interface (API) Let's understand what an API is first. An API is how two actors of software systems (software components or users) communicate with each other. This communication happens through a defined set of interfaces and protocols. For example, the weather bureau's software system contains daily weather data. The weather app on your phone "talks" to this system via APIs and shows you daily weather updates on your phone. API Gateway An API Gateway is a component of the app-delivery infrastructure that sits between clients and services and provides centralized handling of API communication between them. In very simplistic terms, an API Gateway is the gateway to the APIs. It is the channel that helps users of the APIs communicate with the APIs while abstracting out complex details such as which services they live in, access control (authentication and authorization), security (preventing DDoS attacks), etc. Imagine it as the switchboard operator of a manual telephone exchange, whom users can call and ask to be connected with a specific number (analogous to the software component here). Let's discuss the pros and cons of an API Gateway a little. Pros Access control: Provides support for authenticating and authorizing clients before requests reach the backend systems. Security: Provides security and potential mitigations against DDoS (Distributed Denial of Service) attacks from the get-go. Abstraction: Abstracts out internal hosting details of the backend APIs and provides clean routing to backend services based on multiple techniques — path-based routing, query-string-params-based routing, etc. Monitoring and analytics: An API Gateway can provide additional support for API-level monitoring and analytics to help scale infrastructure gracefully. Cons Additional layer between users and services: An API Gateway adds another layer between users and services, thus adding additional complexity to the orchestration of requests. Performance impact: Since an additional layer is added to the service architecture, this could lead to a potential performance impact, as requests now have to pass through one more layer before reaching backend services. A small sketch of the kind of routing and access control an API Gateway performs is shown below.
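To make the single-entry-point idea concrete, here is a toy gateway sketch: it checks a token, then routes by path prefix to internal services. The upstream hostnames, the token, and the port are illustrative assumptions, and it assumes a Node.js runtime with a global fetch.
TypeScript
// Toy API gateway: single entry point, token check, and path-based routing.
// The upstream URLs, header value, and port are illustrative assumptions.
import { createServer } from 'node:http';

const routes: Record<string, string> = {
  '/orders': 'http://orders.internal:8081',
  '/customers': 'http://customers.internal:8082',
};

const server = createServer(async (req, res) => {
  // Centralized access control before anything reaches a backend service.
  if (req.headers['authorization'] !== 'Bearer demo-token') {
    res.writeHead(401).end('unauthorized');
    return;
  }

  // Path-based routing: the client never learns how services are partitioned.
  const prefix = Object.keys(routes).find((p) => req.url?.startsWith(p));
  if (!prefix) {
    res.writeHead(404).end('no route');
    return;
  }

  try {
    const upstream = await fetch(routes[prefix] + (req.url ?? ''), { method: req.method });
    res.writeHead(upstream.status).end(await upstream.text());
  } catch {
    res.writeHead(502).end('bad gateway'); // upstream unavailable
  }
});

server.listen(8080);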
Load Balancing and Load Balancer Load balancing is the technique of distributing load between multiple backend servers based on their capacity and the actual request pattern. Today's applications can have requests incoming at a high rate (read: hundreds or thousands of requests/second), asking backend services to perform actions (e.g., data processing, data fetching, etc.). This requires services to be hosted on multiple servers at once. This means we need a layer sitting on top of these backend servers (a load balancer) that can route incoming requests to those servers based on what they can handle "efficiently" while keeping the customer experience and service performance intact. Load balancers also ensure that no one server is overworked, as that could lead to requests failing or getting higher latencies. On a very high level, a load balancer does the following: Routes incoming requests to backend servers to "efficiently" distribute the load on the servers. Maintains the performance of the service by ensuring no one server is overworked. Lets the service efficiently and independently scale up or down and routes requests to the active hosts (the load balancer figures out which hosts are active by using a heartbeat technique). Pros Performance: Load balancers help maintain service performance by ensuring that the request load is distributed across the servers. Availability: Load balancers help maintain the availability of the service, as multiple servers can now host the same service. Support scalability: Helps the service scale up or down cleanly (horizontally) by letting new servers be added or removed whenever needed. Cons Potential single point of failure: Since all requests have to flow through a load balancer, it can become a single point of failure for the whole service if it is not configured with enough redundancy. Additional overhead: Load balancers use multiple algorithms to route requests to the backend services, e.g., round robin, least connections, adaptive, etc. This means a request has to flow through this additional step in the load balancer to figure out which server it should be forwarded to, which can add performance overhead to the requests. When To Use What Let's come to the crux of the article now. When do we use a load balancer for a service, and when do we use an API Gateway for it? When To Use API Gateway Going with the pros of the API Gateway mentioned above, the following are the cases when an API Gateway is best suited: When we need central access control (authentication and authorization) in front of the backend services/APIs. When we need central security mechanisms for issues like DDoS. When we are exposing APIs to external customers (read: the Internet) and don't want to share internal details of the services or infrastructure with the Internet. When we need out-of-the-box monitoring and analytics on the APIs and need insights on how to scale the backend services. When To Use Load Balancer When the service can get a high number of requests/second, which one server can't handle, so the service must be hosted on more than one server. When the service has defined availability SLAs and needs to adhere to them. When the service needs to be able to scale up or down as required. A toy sketch of the round-robin-with-heartbeat idea follows below.
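As a companion to the gateway sketch above, here is a toy round-robin balancer with a heartbeat-style health check. The backend addresses, the /health path, and the intervals are illustrative assumptions, and it assumes a recent Node.js runtime with a global fetch; production load balancers are, of course, far more sophisticated.
TypeScript
// Toy round-robin selection with a heartbeat-style health check.
// Server addresses and intervals are assumptions for illustration only.
class RoundRobinBalancer {
  private healthy: string[] = [];
  private index = 0;

  constructor(private servers: string[]) {
    this.healthy = [...servers];
  }

  // Heartbeat: periodically probe each server and keep only the responsive ones.
  startHeartbeat(intervalMs = 5_000) {
    setInterval(async () => {
      const checks = await Promise.all(
        this.servers.map(async (server) => {
          try {
            const res = await fetch(`${server}/health`, { signal: AbortSignal.timeout(1_000) });
            return res.ok ? server : null;
          } catch {
            return null;
          }
        }),
      );
      this.healthy = checks.filter((s): s is string => s !== null);
    }, intervalMs);
  }

  // Round robin: hand out healthy servers in rotation.
  next(): string {
    if (this.healthy.length === 0) throw new Error('no healthy backends');
    const server = this.healthy[this.index % this.healthy.length];
    this.index++;
    return server;
  }
}

const balancer = new RoundRobinBalancer(['http://10.0.0.1:8080', 'http://10.0.0.2:8080']);
balancer.startHeartbeat();
console.log(balancer.next()); // forward the request to this backend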
Through my years of building services, the RESTful API has been my primary go-to. However, even though REST has its merits, that doesn’t mean it’s the best approach for every use case. Over the years, I’ve learned that, occasionally, there might be better alternatives for certain scenarios. Sticking with REST just because I’m passionate about it — when it’s not the right fit — only results in tech debt and a strained relationship with the product owner. One of the biggest pain points with the RESTful approach is the need to make multiple requests to retrieve all the necessary information for a business decision. As an example, let’s assume I want a 360-view of a customer. I would need to make the following requests: GET /customers/{some_token} provides the base customer information GET /addresses/{some_token} supplies a required address GET /contacts/{some_token} returns the contact information GET /credit/{some_token} returns key financial information While I understand the underlying goal of REST is to keep responses laser-focused for each resource, this scenario makes for more work on the consumer side. Just to populate a user interface that helps an organization make decisions related to future business with the customer, the consumer must make multiple calls In this article, I’ll show why GraphQL is the preferred approach over a RESTful API here, demonstrating how to deploy Apollo Server (and Apollo Explorer) to get up and running quickly with GraphQL. I plan to build my solution with Node.js and deploy my solution to Heroku. When To Use GraphQL Over REST? There are several common use cases when GraphQL is a better approach than REST: When you need flexibility in how you retrieve data: You can fetch complex data from various resources but all in a single request. (I will dive down this path in this article.) When the frontend team needs to evolve the UI frequently: Rapidly changing data requirements won’t require the backend to adjust endpoints and cause blockers. When you want to minimize over-fetching and under-fetching: Sometimes REST requires you to hit multiple endpoints to gather all the data you need (under-fetching), or hitting a single endpoint returns way more data than you actually need (over-fetching). When you’re working with complex systems and microservices: Sometimes multiple sources just need to hit a single API layer for their data. GraphQL can provide that flexibility through a single API call. When you need real-time data pushed to you: GraphQL features subscriptions, which provide real-time updates. This is useful in the case of chat apps or live data feeds. (I will cover this benefit in more detail in a follow-up article.) What Is Apollo Server? Since my skills with GraphQL aren’t polished, I decided to go with Apollo Server for this article. Apollo Server is a GraphQL server that works with any GraphQL schema. The goal is to simplify the process of building a GraphQL API. The underlying design integrates well with frameworks such as Express or Koa. I will explore the ability to leverage subscriptions (via the graphql-ws library) for real-time data in my next article. Where Apollo Server really shines is the Apollo Explorer, a built-in web interface that developers can use to explore and test their GraphQL APIs. The studio will be a perfect fit for me, as it allows for the easy construction of queries and the ability to view the API schema in a graphical format. 
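Before walking through the GraphQL solution, here is roughly what the consumer side of the RESTful approach from the beginning of this article looks like: four round trips stitched together on the client. The base URL and the small helper are illustrative assumptions.
TypeScript
// Client-side aggregation with REST: one request per resource, stitched together afterwards.
// The base URL and response handling are illustrative assumptions.
const BASE_URL = 'https://api.example.com';

async function getJson<T>(path: string): Promise<T> {
  const response = await fetch(`${BASE_URL}${path}`);
  if (!response.ok) throw new Error(`${path} failed with ${response.status}`);
  return response.json() as Promise<T>;
}

async function getCustomer360(token: string) {
  // Four round trips just to build one view of the customer.
  const [customer, address, contact, credit] = await Promise.all([
    getJson(`/customers/${token}`),
    getJson(`/addresses/${token}`),
    getJson(`/contacts/${token}`),
    getJson(`/credit/${token}`),
  ]);
  return { customer, address, contact, credit };
}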
My Customer 360 Use Case For this example, let’s assume we need the following schema to provide a 360-view of the customer: TypeScript type Customer { token: String name: String sic_code: String } type Address { token: String customer_token: String address_line1: String address_line2: String city: String state: String postal_code: String } type Contact { token: String customer_token: String first_name: String last_name: String email: String phone: String } type Credit { token: String customer_token: String credit_limit: Float balance: Float credit_score: Int } I plan to focus on the following GraphQL queries: TypeScript type Query { addresses: [Address] address(customer_token: String): Address contacts: [Contact] contact(customer_token: String): Contact customers: [Customer] customer(token: String): Customer credits: [Credit] credit(customer_token: String): Credit } Consumers will provide the token for the Customer they wish to view. We expect to also retrieve the appropriate Address, Contact, and Credit objects. The goal is to avoid making four different API calls for all this information rather than with a single API call. Getting Started With Apollo Server I started by creating a new folder called graphql-server-customer on my local workstation. Then, using the Get Started section of the Apollo Server documentation, I followed steps one and two using a Typescript approach. Next, I defined my schema and also included some static data for testing. Ordinarily, we would connect to a database, but static data will work fine for this demo. Below is my updated index.ts file: TypeScript import { ApolloServer } from '@apollo/server'; import { startStandaloneServer } from '@apollo/server/standalone'; const typeDefs = `#graphql type Customer { token: String name: String sic_code: String } type Address { token: String customer_token: String address_line1: String address_line2: String city: String state: String postal_code: String } type Contact { token: String customer_token: String first_name: String last_name: String email: String phone: String } type Credit { token: String customer_token: String credit_limit: Float balance: Float credit_score: Int } type Query { addresses: [Address] address(customer_token: String): Address contacts: [Contact] contact(customer_token: String): Contact customers: [Customer] customer(token: String): Customer credits: [Credit] credit(customer_token: String): Credit } `; const resolvers = { Query: { addresses: () => addresses, address: (parent, args, context) => { const customer_token = args.customer_token; return addresses.find(address => address.customer_token === customer_token); }, contacts: () => contacts, contact: (parent, args, context) => { const customer_token = args.customer_token; return contacts.find(contact => contact.customer_token === customer_token); }, customers: () => customers, customer: (parent, args, context) => { const token = args.token; return customers.find(customer => customer.token === token); }, credits: () => credits, credit: (parent, args, context) => { const customer_token = args.customer_token; return credits.find(credit => credit.customer_token === customer_token); } }, }; const server = new ApolloServer({ typeDefs, resolvers, }); const { url } = await startStandaloneServer(server, { listen: { port: 4000 }, }); console.log(`Apollo Server ready at: ${url}`); const customers = [ { token: 'customer-token-1', name: 'Acme Inc.', sic_code: '1234' }, { token: 'customer-token-2', name: 'Widget Co.', sic_code: '5678' } ]; const addresses = [ { 
token: 'address-token-1', customer_token: 'customer-token-1', address_line1: '123 Main St.', address_line2: '', city: 'Anytown', state: 'CA', postal_code: '12345' }, { token: 'address-token-22', customer_token: 'customer-token-2', address_line1: '456 Elm St.', address_line2: '', city: 'Othertown', state: 'NY', postal_code: '67890' } ]; const contacts = [ { token: 'contact-token-1', customer_token: 'customer-token-1', first_name: 'John', last_name: 'Doe', email: 'jdoe@example.com', phone: '123-456-7890' } ]; const credits = [ { token: 'credit-token-1', customer_token: 'customer-token-1', credit_limit: 10000.00, balance: 2500.00, credit_score: 750 } ]; With everything configured as expected, we run the following command to start the server: Shell $ npm start With the Apollo server running on port 4000, I used the http://localhost:4000/ URL to access Apollo Explorer. Then I set up the following example query: TypeScript query ExampleQuery { addresses { token } contacts { token } customers { token } } This is how it looks in Apollo Explorer: Pushing the Example Query button, I validated that the response payload aligned with the static data I provided in the index.ts: JSON { "data": { "addresses": [ { "token": "address-token-1" }, { "token": "address-token-22" } ], "contacts": [ { "token": "contact-token-1" } ], "customers": [ { "token": "customer-token-1" }, { "token": "customer-token-2" } ] } } Before going any further in addressing my Customer 360 use case, I wanted to run this service in the cloud. Deploying Apollo Server to Heroku Since this article is all about doing something new, I wanted to see how hard it would be to deploy my Apollo server to Heroku. I knew I had to address the port number differences between running locally and running somewhere in the cloud. I updated my code for starting the server as shown below: TypeScript const { url } = await startStandaloneServer(server, { listen: { port: Number.parseInt(process.env.PORT) || 4000 }, }); With this update, we’ll use port 4000 unless there is a PORT value specified in an environment variable. Using Gitlab, I created a new project for these files and logged into my Heroku account using the Heroku command-line interface (CLI): Shell $ heroku login You can create a new app in Heroku with either their CLI or the Heroku dashboard web UI. For this article, we’ll use the CLI: Shell $ heroku create jvc-graphql-server-customer The CLI command returned the following response: Shell Creating jvc-graphql-server-customer... done https://jvc-graphql-server-customer-b62b17a2c949.herokuapp.com/ | https://git.heroku.com/jvc-graphql-server-customer.git The command also added the repository used by Heroku as a remote automatically: Shell $ git remote heroku origin By default, Apollo Server disables Apollo Explorer in production environments. For my demo, I want to leave it running on Heroku. To do this, I need to set the NODE_ENV environment variable to development. I can set that with the following CLI command: Shell $ heroku config:set NODE_ENV=development The CLI command returned the following response: Shell Setting NODE_ENV and restarting jvc-graphql-server-customer... done, v3 NODE_ENV: development Now we’re in a position to deploy our code to Heroku: Shell $ git commit --allow-empty -m 'Deploy to Heroku' $ git push heroku A quick view of the Heroku Dashboard shows my Apollo Server running without any issues: If you’re new to Heroku, this guide will show you how to create a new account and install the Heroku CLI. 
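One small aside on the PORT change above: with strict null checks enabled in tsconfig, Number.parseInt(process.env.PORT) will not compile, because the environment variable is typed as possibly undefined. A minimal adjustment to the article's snippet that keeps the same fallback behavior might look like this.
TypeScript
// process.env.PORT is typed as string | undefined, so give parseInt a definite string.
// The fallback to 4000 matches the original behavior.
const port = Number.parseInt(process.env.PORT ?? '', 10) || 4000;

const { url } = await startStandaloneServer(server, {
  listen: { port },
});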
Acceptance Criteria Met: My Customer 360 Example With GraphQL, I can meet the acceptance criteria for my Customer 360 use case with the following query: TypeScript query CustomerData($token: String) { customer(token: $token) { name sic_code token }, address(customer_token: $token) { token customer_token address_line1 address_line2 city state postal_code }, contact(customer_token: $token) { token, customer_token, first_name, last_name, email, phone }, credit(customer_token: $token) { token, customer_token, credit_limit, balance, credit_score } } All I need to do is pass in a single Customer token variable with a value of customer-token-1: JSON { "token": "customer-token-1" } We can retrieve all of the data using a single GraphQL API call: JSON { "data": { "customer": { "name": "Acme Inc.", "sic_code": "1234", "token": "customer-token-1" }, "address": { "token": "address-token-1", "customer_token": "customer-token-1", "address_line1": "123 Main St.", "address_line2": "", "city": "Anytown", "state": "CA", "postal_code": "12345" }, "contact": { "token": "contact-token-1", "customer_token": "customer-token-1", "first_name": "John", "last_name": "Doe", "email": "jdoe@example.com", "phone": "123-456-7890" }, "credit": { "token": "credit-token-1", "customer_token": "customer-token-1", "credit_limit": 10000, "balance": 2500, "credit_score": 750 } } } Below is a screenshot from Apollo Explorer running from my Heroku app: Conclusion I recall earlier in my career when Java and C# were competing against each other for developer adoption. Advocates on each side of the debate were ready to prove that their chosen tech was the best choice … even when it wasn’t. In this example, we could have met my Customer 360 use case in multiple ways. Using a proven RESTful API would have worked, but it would have required multiple API calls to retrieve all of the necessary data. Using Apollo Server and GraphQL allowed me to meet my goals with a single API call. I also love how easy it is to deploy my GraphQL server to Heroku with just a few commands in my terminal. This allows me to focus on implementation—offloading the burdens of infrastructure and running my code to a trusted third-party provider. Most importantly, this falls right in line with my personal mission statement: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” – J. Vester If you are interested in the source code for this article, it is available on GitLab. But wait… there’s more! In my follow-up post, we will build out our GraphQL server further, to implement authentication and real-time data retrieval with subscriptions. Have a really great day!
Hexagonal architecture is a software design pattern based on the separation of responsibilities. The goal is to decouple the business logic (domain) and the application from other external interfaces. Simplifying, in hexagonal architecture we communicate the core of the app (domain + application) with the external elements using ports and adapters. A port lives in the core; it is the interface any external code must use to interact with the core (or the core with the external code). The adapter is the external piece of code that follows the port interface and executes the tasks, gets the data, etc. You can imagine the port is a space reserved only for an exact type of vessel. The vessel can only enter the port and dock if the load/unload doors are of an expected size and are in the correct position. Multiple vessels can fit in a port and vessels can be replaced, but ports are unique and can not be moved. A key concept is that the core doesn’t know anything about the external components. The port defines the vessel door positions but doesn’t care about how the load is stored in the vessel. In this case, we will also use the repository pattern (that fits very well with hexagonal, as it defines a centralized and abstract way of accessing data and it is a very common pattern), and the dependency injection principle that allows us to create decoupled (or loosely coupled) software. Simplifying again, it allows us to replace an adapter with another one that follows the same port interface. Let’s see it in action with a small (and typical) example: Your domain (core) needs to get a list of users with a name, so you define the port that is a repository. The port defined a method to do that: getUsersByName(name: string): User[]. In English, it defines that the adapter must provide a method called getUsersByName that gets a name and should return the list of the users that match that name. A Real Case The Initial Context We have a single web application (frontend) that works for different clients (tenants), and that application uses a backend that provides the menu data. The backend returns something like this: JSON { "title": "Main Menu", "id": "main", "is_staff": false, "items": [ { "title": "Home", "icon": "", "url": "/", "is_staff": false }, { "title": "Dashboards", "icon": "dashboards", "id": "dashboards", "is_staff": false, "items": [ { "title": "Home", "icon": "dashboards-home", "url": "/dashboards", "is_staff": false }, { "title": "Config (X)", "icon": "dashboards-config", "url": "/dashboards-config", "is_staff": true }, { "title": "Advanced Reports", "icon": "", "id": "advanced_reports", "is_staff": false, "items": [ { "title": "Sales Analysis", "icon": "", "url": "/sales_analysis", "is_staff": false }, ... ] } ] } ] } The front end partially implements the repository pattern as it just returns the data the backend provides without more manipulation than removing the first level in the tree (the main menu item). The view executes the repository call using a service: that again just returns the same information it gets from the repository. The Issues This “architecture” works, but has some issues that can create serious problems in the future: The data structure is coupled to the backend data: All the data flows from the backend to the view using the same interfaces. If the backend changes just the name of a property, we need to follow the data flow in our code until the view changes it in all the places. 
The title string includes an emoji to allow users to visualize when a menu item is only for staff users: That information is also provided in the is_staff property. If we want to expose a menu item to regular users, we need to change it in 2 places, and that is never a good idea. Visuals are defined in the backend: The name of the icon to use is defined in the backend. Unless the icon would be an app (backend + frontend) global concept, it is not a good idea to pass that value front the backend. No domain: there is no domain, or at least no explicit one. Logic is applied in the view (that it is not bad per se, but if the logic is related to the business rules, it must live in the domain). The Problem Because of different reasons, the company decided to create a new version of the backend. This new backend (called v2) will not be retro-compatible with the legacy one, but it will represent semantically the same entities. The menu endpoint will return the same menu (it will provide more features) but the new endpoint response structure is completely different: JSON [ { "menuStateId": 3, "menuPosition": 1, "menuName": "Dashboards", "menuItemId": 9, "menuItemTitle": "Home", "menuItemPosition": 1, "menuItemLink": "/dashboards", "menuItemStateId": 3, "menuInternalName": "dashboard", "menuId": 12, "menuParentId": 1, "menuItemInternalName": "dashboard.home" }, { "menuStateId": 3, "menuPosition": 1, "menuName": "Dashboards", "menuItemId": 9, "menuItemTitle": "Sales analysis", "menuItemPosition": 1, "menuItemLink": "/sales_analysis", "menuItemStateId": 3, "menuInternalName": "dashboard", "menuId": 12, "menuParentId": 1, "menuItemInternalName": "dashboard.sales_analysis" }, { "menuStateId": 3, "menuPosition": 1, "menuName": "Dashboards", "menuItemId": 9, "menuItemTitle": "Config", "menuItemPosition": 1, "menuItemLink": "/dashboards-config", "menuItemStateId": 1, "menuInternalName": "dashboard", "menuId": 12, "menuParentId": 1, "menuItemInternalName": "dashboard.sales_analysis" }, { "menuStateId": 3, "menuPosition": 1, "menuName": "Dashboards", "menuItemId": 9, "menuItemTitle": "Sales analysis", "menuItemPosition": 1, "menuItemLink": "/sales_analysis", "menuItemStateId": 3, "menuInternalName": "dashboard", "menuId": 12, "menuParentId": 1, "menuItemInternalName": "dashboard.sales_analysis" }, { "menuStateId": 3, "menuPosition": 1, "menuName": "Main", "menuItemId": 11, "menuItemTitle": "Home", "menuItemPosition": 5, "menuItemLink": "/", "menuItemStateId": 3, "menuInternalName": "home", "menuId": 1, "menuParentId": 1, "menuItemInternalName": "home" }, ... ] The new backend endpoint returns the menu items and its parent menu data in the same row. The structure is flat (no nested items). Another difference is the is_staff is still there, but it’s a specific value for the menuItemStateId property. There is no icon name, but now we have an internalId as a semantic unique ID. Things Can Become Harder The new backend will not replace the legacy one, at least not in the next months. Clients will be migrated slowly to the new backend. So some clients will use the legacy backend and others will use the new one. That means we will have both backends working at the same time for months. As the data returned by both backends is very different, it seems tough to use the same frontend code to render the menu for all the clients, right? (Not really, as we will see later.) A possible solution is to create different menu-related components, code, etc. depending on the backend version adapting our application to them. 
This can work, but it means we will need to duplicate a lot of code; for example, the views, the services, etc., making maintenance harder. Decoupling Us From the Backend Let's forget for a while what the backend(s) return, and think about what we want to represent from the point of view of our application. We want to represent a menu that can have items with children items (and no link), and items with links and no children. Then let's create a model (models, in our case) in our domain as entities that will represent exactly that: TypeScript type State = 'disabled' | 'only_for_staff' | 'open' class Menu { readonly id: number = 0 readonly internalName: string = '' readonly title: string = '' readonly icon: string = '' readonly image: URL | undefined readonly state: State = 'open' readonly description: string = '' readonly position: number = 0 readonly children: (Menu | MenuItem)[] = [] constructor(values: MenuDto) { this.id = values.id this.internalName = values.internalName //... this.children = values.children } get onlyForStaff(): boolean { return this.state === 'only_for_staff' } } class MenuItem { readonly id: number = 0 readonly internalName: string = '' readonly title: string = '' readonly icon: string = '' readonly url: string = '' readonly state: State = 'open' readonly position: number = 0 readonly menuId: number = 0 constructor(values: MenuItemDto) { // hydrate the entity this.id = values.id //.... } get isStaff(): boolean { return this.state === 'only_for_staff' } public get external(): boolean { return this.url.includes('http://') } } This is a simplified version of the entities, but you can see the idea. We have a Menu entity that can have children that can be Menu or MenuItem entities. The MenuItem entity has a url property that can be used to know if the item is a link or not. We modeled the domain, and our application layer (and views) can access it. The key is this: we modeled our menu independently of our backends' data structures. We can use any backend that represents that entity to get the data, independently of its structure. The Port We should create the port that will allow us to get the menu's data from the backend(s). TypeScript interface MenuRepo { getMainMenu(states: State[]): Promise<(MenuItem | Menu)[]> } The port defines how the repository should look. In this case, we want a method that will return the main menu, filtered by state ('disabled' | 'only_for_staff' | 'open'). A small sketch of how the core consumes this port is shown below.
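To show how the core consumes the port without knowing anything about a concrete backend, here is a minimal application-layer sketch; the MenuService name and its filtering rule are assumptions, while MenuRepo, Menu, MenuItem, and State are the types defined above.
TypeScript
// Application-layer service: depends only on the port (MenuRepo), never on a concrete backend.
// MenuService and its filtering rule are illustrative assumptions.
class MenuService {
  constructor(private readonly menuRepo: MenuRepo) {}

  // Staff users see open and staff-only items; regular users only see open items.
  async getVisibleMenu(isStaffUser: boolean): Promise<(Menu | MenuItem)[]> {
    const states: State[] = isStaffUser ? ['open', 'only_for_staff'] : ['open'];
    return this.menuRepo.getMainMenu(states);
  }
}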
The domain should know about the onlyForStaff meaning, and the adapter should know how to get that information from the backend. In this case (for backend v2), we need to pass a query param called menuItemStateId with the value 1 to get the staff-only items. That is different for the legacy backend, or for any other backend that uses a different value for that filter, but the argument that represents what we want is always the same: onlyForStaff.

From the point of view of the layers on the right side of the port's line in the workflow (image above), it does not matter how the data is retrieved. The only thing that matters is that the data is returned as domain entities. That is our contract.

TypeScript

// menu.legacy.repo.ts

type Response = {
  // This type defines the shape of the data the legacy backend returns.
  // I do not include it here to put the focus on the data transformation into entities
}

class LegacyMenuRepo implements MenuRepo {
  async getMainMenu(states: State[]): Promise<(MenuItem | Menu)[]> { // [1]
    const data = await fetch('https://tenant.company.com/get_menu')
    const backendMenu: Response[] = await data.json()
    return backendMenu.map(item => this.responseToEntity(item))
  }

  private responseToEntity(response: Response): MenuItem | Menu {
    // transform the response into the domain entities
    if ('items' in response) {
      return new Menu({
        id: response.id,
        internalName: response.id,
        title: response.title,
        icon: mapIcon(response.id),      // [2]
        image: mapImage(response.image), // [2]
        state: mapState(response.state), // [3]
        children: response.items.map(item => this.responseToEntity(item))
      })
    } else {
      return new MenuItem({
        id: response.id,
        internalName: response.id,
        title: response.title,
        icon: mapIcon(response.id),      // [2]
        url: response.url,
        state: mapState(response.state), // [3]
        menuId: response.menuId
      })
    }
  }
}

Things to focus on:

[1]: The states argument the method receives is not used in the code. This is because the legacy backend does not accept any filter: it does the filtering using its own context, and it ensures it only returns the items the user has access to.

[2]: Those map functions are in charge of providing the correct icon and image. The domain no longer takes that information from the backend, so our repository implementation provides it. Remember: the repository implementation (adapter) is the one that knows all the external internals, and for the images, the adapter knows that if the id is "x," it should return the image "y" and the icon "z."

[3]: The mapState function behaves similarly to [2], but in this case the backend returns a number that represents the state. The adapter knows how to map that number to the domain state, and the mapping can be reversed to know which value should be sent to the backend.
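As an illustration, the legacy adapter's mapState and mapIcon helpers could look like the sketch below. The numeric state codes and the icon names are invented for the example; the article does not show the real legacy values:

TypeScript

// Hypothetical helpers living next to menu.legacy.repo.ts.
// The codes 0/1/2 and the icon names are assumptions made for this sketch.
const legacyStateByCode: Record<number, State> = {
  0: 'disabled',
  1: 'only_for_staff',
  2: 'open'
}

// Backend number -> domain state, with a safe default for unknown codes
function mapState(code: number): State {
  return legacyStateByCode[code] ?? 'disabled'
}

// Domain state -> backend number (the reverse mapping mentioned in note [3])
function stateToLegacyCode(state: State): number {
  const entry = Object.entries(legacyStateByCode).find(([, s]) => s === state)
  return entry ? Number(entry[0]) : 0
}

// The adapter owns the "visuals" knowledge: a given id maps to a given icon
function mapIcon(id: number): string {
  const iconsById: Record<number, string> = { 1: 'home', 12: 'dashboard' }
  return iconsById[id] ?? 'circle'
}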
We need to implement the adapter for the "new" backend (v2):

TypeScript

// menu.v2.repo.ts

type Response = {
  // This type defines the shape of the data the backend v2 returns.
  // I do not include it here to put the focus on the data transformation into entities
}

const stateMappings: Record<number, State> = {
  0: 'disabled',
  1: 'only_for_staff',
  2: 'open'
}

const stateMappingsReverse: Record<State, number> = {
  'disabled': 0,
  'only_for_staff': 1,
  'open': 2
}

class V2MenuRepo implements MenuRepo {
  async getMainMenu(states: State[]): Promise<(MenuItem | Menu)[]> {
    const params = new URLSearchParams()
    // Convert domain states to the backend's own values; how the list is
    // serialized is an implementation detail of this adapter
    states.forEach(state => params.append('menuItemStateId', String(stateMappingsReverse[state]))) // [2]
    const data = await fetch(`https://menu.company.com/company/get?${params}`) // [1]
    const backendMenu: Response[] = await data.json()
    return backendMenu.map(item => this.responseToEntity(item))
  }

  private responseToEntity(response: Response): MenuItem | Menu {
    // Here the transformation from flat to nested is more complex (it requires more
    // lines of code), so I'm going to ignore it in the example. Let's imagine it is
    // done after this line (a possible grouping helper is sketched after the notes below).
    // transform the response into the domain entities
    if ('items' in response) {
      return new Menu({
        id: response.id,
        internalName: response.internalName,
        title: response.title,
        icon: mapIcon(response.internalName),       // [3]
        image: mapImage(response.internalName),     // [3]
        state: stateMappings[response.menuStateId], // [4]
        children: response.items.map(item => this.responseToEntity(item))
      })
    } else {
      return new MenuItem({
        id: response.id,
        internalName: response.internalName,
        title: response.title,
        icon: mapIcon(response.internalName),           // [3]
        url: response.url,
        state: stateMappings[response.menuItemStateId], // [4]
        menuId: response.menuId
      })
    }
  }
}

Things to focus on in the backend v2 repo implementation:

[1]: The endpoint (even the domain) is different from the other repo. That is expected, as it is a different backend.

[2]: You need to convert the meaning of the filters to the backend's meaning. The adapter knows that the backend expects a query param called menuItemStateId with the values 0, 1, or 2 to get the items with the state disabled, only_for_staff, or open.

[3]: We have mapping functions for the icons and images, but these functions are different from the legacy ones.

[4]: We convert the menuItemStateId and menuStateId values to the domain state using the mappings.
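For the curious, the flat-to-nested step skipped in the code above could, for example, group the rows by menuId before building the entities. This is only a sketch under a few assumptions: it reuses the hypothetical V2MenuRow type from earlier, this adapter's stateMappings, mapIcon, and mapImage, and it ignores sub-menus (menuParentId):

TypeScript

// Hypothetical grouping helper for menu.v2.repo.ts: flat rows -> nested entities
function rowsToEntities(rows: V2MenuRow[]): Menu[] {
  const menusById = new Map<number, Menu>()
  for (const row of rows) {
    let menu = menusById.get(row.menuId)
    if (!menu) {
      menu = new Menu({
        id: row.menuId,
        internalName: row.menuInternalName,
        title: row.menuName,
        icon: mapIcon(row.menuInternalName),
        image: mapImage(row.menuInternalName),
        state: stateMappings[row.menuStateId] ?? 'open', // default for unmapped codes
        position: row.menuPosition,
        children: []
      })
      menusById.set(row.menuId, menu)
    }
    // Each row carries exactly one menu item; attach it to its parent menu
    menu.children.push(new MenuItem({
      id: row.menuItemId,
      internalName: row.menuItemInternalName,
      title: row.menuItemTitle,
      icon: mapIcon(row.menuItemInternalName),
      url: row.menuItemLink,
      state: stateMappings[row.menuItemStateId] ?? 'open',
      position: row.menuItemPosition,
      menuId: row.menuId
    }))
  }
  return [...menusById.values()]
}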
After the changes, the architecture looks like this: we now have two different adapters (one per backend) for the same port. Those adapters follow the contract and convert the backend data into the domain entities. The rest of the flow is the same: the domain does not know how the data is retrieved; it only knows how to use it. This gives us a lot of flexibility: we can change the backend without changing the domain, the application, or the views.

The Dependency Injection

The last piece of the "puzzle" is dependency injection, which allows us to replace a repository implementation with another one that follows the same port interface. Instead of importing it from the code that calls the repository, we inject it from the outside, which lets us swap it easily.

Let's suppose we have a use case (or application service) that uses the repository to get the menu:

TypeScript

class GetMainMenuUseCase {
  constructor(private menuRepo: MenuRepo) {}

  async execute(states: State[]): Promise<(MenuItemDto | MenuDto)[]> {
    const entities = await this.menuRepo.getMainMenu(states)
    return entities.map(entity => entity.toDto())
  }
}

We can use a factory to create the repository implementation:

TypeScript

function createMenuRepo(clientId: string): MenuRepo {
  if (['client123', 'client34'].includes(clientId)) {
    return new V2MenuRepo()
  } else {
    return new LegacyMenuRepo()
  }
}

const useCase = new GetMainMenuUseCase(createMenuRepo('client123'))
useCase.execute(['open', 'only_for_staff'])

Summarizing

Hexagonal architecture, the repository pattern, and dependency injection are very powerful tools that let us build decoupled software made of loosely coupled, independent pieces that can be changed easily, which makes maintenance simpler.

Those pieces should define a contract for the actions (the methods to execute) and for the returned data, and backend-specific values should not leak outside of them. For example, it is a bad practice to pass the filters directly to the repository implementation and use them as-is in the HTTP request, because that couples your application code to the backend; that is why the example maps the filter values inside the adapter.

As you can see in the example, we can change the backend at any moment: we only have to change the repository implementation we inject into the use case, without touching anything else. This works only if all the different backends return the same business concepts. If not, we are talking about different domain models, and we need to create different ports and adapters for each one. Achieving that can require time and knowledge of the domain and the business rules, but the benefits are worth it.
During a recent enlightening conversation with my mentor, it dawned on me that the language of security, brimming with intricate jargon, often becomes an obstacle when we attempt to apply it in practical, real-world scenarios. This article is my endeavor to bridge this gap, to convert the abstract into the tangible and make the complex understandable.

Let's visualize a situation where we are entrusted with the responsibility of protecting a Linux file server. It can be a daunting task if approached haphazardly, potentially leading to disorganized and ineffective efforts. So, before you immerse yourself in this article, pause for a moment. Picture the steps you would take, the strategy you would follow, and where you would initiate this process.

A great starting point for this visualization is a structured framework or established security approach. Let's consider the Defense in Depth architecture, a comprehensive strategy that includes:

- Technical controls, such as firewalls, WAF, secure web gateways, IDS/IPS, EDR software, and anti-malware software.
- Physical controls, including access control, alarm systems, ID scanners, and surveillance procedures.
- Administrative controls, primarily security policies.

We can take this structured approach a step further by adopting layered security. This strategy involves implementing a variety of security measures at different levels or "layers". Now, let's embark on a journey to understand how to fortify each layer of security for a Linux file server. Through this process, we'll transform the daunting task of server protection into a systematic, manageable process.

Administrative Controls

User Access Controls

Define who can access the server and what level of access they should have. Each user should have their own account and should only be given the permissions they need to perform their job.

- Establish file and directory permissions/access restrictions.
- Restrict root logins to system consoles.
- Use the useradd, userdel, and usermod commands to manage user accounts.

Password Policies

Implement strong password policies. This might include requirements for password length, complexity, and expiration.

- Verify that no accounts have empty password fields.
- Set account expiration parameters on active accounts.
- Use PAM (Pluggable Authentication Modules) to enforce password policies (example commands below).
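As an illustration, here are a few commands that cover the checks above on Debian/Ubuntu-style systems. The username and date are placeholders; adapt them to your accounts and policy:

# List accounts with an empty password field in /etc/shadow
sudo awk -F: '($2 == "") {print $1}' /etc/shadow

# Set an expiration date and a maximum password age for an account ("alice" and the date are placeholders)
sudo chage -E 2025-12-31 -M 90 alice

# Review an account's current aging and expiration settings
sudo chage -l alice

# Password complexity itself is typically enforced through PAM, e.g., the pam_pwquality
# module configured via /etc/security/pwquality.conf (minlen, character classes, etc.)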
Incident Response Plan

Have a plan in place for how to respond to security incidents. This might include steps for identifying, isolating, investigating, and resolving the incident.

- Identify the breach (using system logs or security tools).
- Isolate affected systems and patch the vulnerability.
- Restore the system from backups.

Security Audits

Regularly audit your server for security issues. This might include checking for unnecessary services, weak passwords, and unpatched vulnerabilities. Tools like Lynis can be used for security audits of Linux systems.

Employee Training

Train employees on security best practices. This might include training on topics like phishing, password security, and safe internet use.

Physical Barriers

Secure the physical server in a locked room to prevent unauthorized physical access, and implement alarm systems and surveillance cameras for additional security. Apply strict access control to this room: only authorized personnel should be allowed to enter. Use biometric controls or card-based access systems for added security.

Perimeter Security

Install and configure a Linux firewall like ufw (Uncomplicated Firewall) or iptables. Block all incoming connections except those that are necessary, and set up rules that allow only the required traffic.

# Block all incoming traffic by default
sudo ufw default deny incoming

# Allow outgoing traffic by default
sudo ufw default allow outgoing

# Allow incoming SSH connections
sudo ufw allow ssh

# Enable the firewall
sudo ufw enable

Set up an IDS/IPS system to monitor and potentially block suspicious network activity.

Network Security

Network Segmentation

If possible, place the file server in a dedicated network segment isolated from other parts of the network to limit potential attack vectors. Use VLANs or separate physical networks to segment your network.

Secure Remote Access

If remote access is necessary, it should be secured using a VPN or SSH with key-based authentication. Use OpenSSH for remote connections, disable password-based authentication, and use SSH keys for improved security.

Endpoint Security

Regular Updates

Keep the server's operating system and all installed software up to date to ensure any known vulnerabilities are patched.

# Update the system
sudo apt-get update
sudo apt-get upgrade

Antivirus

Install and configure an antivirus solution, even on a Linux system. For example, ClamAV is a popular choice for Linux servers.

Disable Unnecessary Services

The less software installed on your server, the fewer potential vulnerabilities it has.

Application Security

Secure Configuration

Ensure that any applications running on the server, such as a file service like Samba, are securely configured.

# Install and configure Samba
sudo apt-get install samba
sudo cp /etc/samba/smb.conf /etc/samba/smb.conf.backup
sudo nano /etc/samba/smb.conf

In the smb.conf file, you would configure your file shares and their permissions.

Regular Updates

Keep all applications up to date to mitigate known vulnerabilities. Use package managers like apt or yum to keep all software up to date.

Data Security

Access Controls

Implement strong access control policies for your files. Users should only be able to access the files they need.

Encryption

Encrypt sensitive data to protect it in case of unauthorized access. This could be done at the file level or for the entire disk.

Backups

Regularly back up data to protect against data loss. Ensure backups are stored securely, and regularly test restoring from backup.

Remember, security is not a one-off task but a continuous process of monitoring, updating, and improving your protective measures. Once you grasp the key terms in the security world, start applying them to truly understand how to secure the digital world.