The Maturing of Cloud-Native Microservices Development: Effectively Embracing Shift Left to Improve Delivery
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.

The cloud-native application protection platform (CNAPP) model is designed to secure applications that leverage cloud-native technologies. Applications outside its scope are typically legacy systems that were not designed to operate within modern cloud infrastructures. In practice, therefore, CNAPP covers the security of containerized applications, serverless functions, and microservices architectures, possibly running across different cloud environments.

Figure 1. CNAPP capabilities across different application areas

A good way to understand the goal of the security practices in CNAPPs is to look at the threat model, i.e., the attack scenarios against which applications are protected. Understanding these scenarios helps practitioners grasp the aim of features in CNAPP suites. Note also that the threat model might vary according to the industry, the usage context of the application, etc. In general, the threat model reflects the dynamic and distributed nature of cloud-native architectures. Such applications face a large attack surface and an intricate threat landscape, mainly because of the complexity of their execution environment. In short, the model typically accounts for unauthorized access, data breaches due to misconfigurations, inadequate identity and access management policies, or simply vulnerabilities in container images or third-party libraries. Also, due to the ephemeral and scalable characteristics of cloud-native applications, CNAPPs require real-time mechanisms to ensure consistent policy enforcement and threat detection, protecting applications from automated attacks and advanced persistent threats. Some common threats and occurrences are shown in Figure 2:

Figure 2. Typical threats against cloud-native applications

Overall, the scope of the CNAPP model is quite broad, and vendors in this space must cover a significant number of security domains to meet the needs of the entire model. Let's review the specific challenges that CNAPP vendors face and the opportunities to improve the breadth of the model to address an extended set of threats.

Challenges and Opportunities When Evolving the CNAPP Model

To keep up with the evolving threat landscape and the complexity of modern organizations, the evolution of the CNAPP model yields both significant challenges and opportunities. The challenges and opportunities discussed in the following sections are briefly summarized in Table 1:

Table 1. Challenges and opportunities with evolving the CNAPP model

Challenges | Opportunities
Integration complexity – connect tools, services, etc. | Automation – AI and orchestration
Technological changes – tools must continually evolve | Proactive security – predictive and prescriptive measures
Skill gaps – tools must be friendly and efficient | DevSecOps – integration with DevOps security practices
Performance – security has to scale with complexity | Observability – extend visibility to the SDLC's left and right
Compliance – region-dependent, evolving landscape | Edge security – control security beyond the cloud

Challenges

The integration challenges that vendors face due to the scope of the CNAPP model are compounded by rapid technological change: cloud technologies are continuously evolving, and vendors need to design tools that are user friendly.
Managing the complexity of cloud technology via simple, yet powerful, user interfaces allows organizations to cope with the notorious skill gaps in teams resulting from rapid technology evolution. An important aspect of the security measures delivered by CNAPPs is that they must be efficient enough to not impact the performance of the applications. In particular, when scaling applications, security measures should continue to perform gracefully. This is a general struggle with security — it should be as transparent as possible yet responsive and effective. An often industry-rooted challenge is regulatory compliance. The expansion of data protection regulations globally requires organizations to comply with evolving regulation frameworks. For vendors, this requires maintaining a wide perspective on compliance and incorporating these requirements into their tool capabilities. Opportunities In parallel, there are significant opportunities for CNAPPs to evolve to address the challenges. Taming complexity is an important factor to tackle head first to expand the scope of the CNAPP model. For that purpose, automation is a key enabler. For example, there is a significant opportunity to leverage artificial intelligence (AI) to accelerate routine tasks, such as policy enforcement and anomaly detection. The implementation of AI for operation automation is particularly important to address the previously mentioned scalability challenges. This capability enhances analytics and threat intelligence, particularly to offer predictive and prescriptive security capabilities (e.g., to advise users for the necessary settings in a given scenario). With such new AI-enabled capabilities, organizations can effectively address the skill gap by offering guided remediation, automated policy recommendations, and comprehensive visibility. An interesting opportunity closer to the code stage is integrating DevSecOps practices. While a CNAPP aims to protect cloud-native applications across their lifecycle, in contrast, DevSecOps embeds security practices that liaise between development, operations, and security teams. Enabling DevSecOps in the context of the CNAPP model covers areas such as providing integration with source code management tools and CI/CD pipelines. This integration helps detect vulnerabilities early and ensure that security is baked into the product from the start. Also, providing developers with real-time feedback on the security implications of their activities helps educate them on security best practices and thus reduce the organization’s exposure to threats. The main goal here is to "shift left" the approach to improve observability and to help reduce the cost and complexity of fixing security issues later in the development cycle. A last and rather forward-thinking opportunity is to evolve the model so that it extends to securing an application on “the edge,” i.e., where it is executed and accessed. A common use case is the access of a web application from a user device via a browser. The current CNAPP model does not explicitly address security here, and this opportunity should be seen as an extension of the operation stage to further “shield right” the security model. Technology Trends That Can Reshape CNAPP The shift left and shield right opportunities (and the related challenges) that I reviewed in the last section can be addressed by the technologies exemplified here. 
Firstly, the enablement of DevSecOps practices is an opportunity to further shift the security model to the left of the SDLC, moving security earlier in the development process. Current CNAPP practices already include looking at source code and container vulnerabilities. More often than not, visibility over these development artifacts starts once they have been pushed from the development laptop to a cloud-based repository. By using a secure implementation of cloud development environments (CDEs), from a CNAPP perspective, observability across performance and security can start from the development environment, as opposed to the online DevOps tool suites such as CI/CD and code repositories. Secondly, enforcing security for web applications at the edge is an innovative concept when looking at it from the perspective of the CNAPP model. This can be realized by integrating an enterprise browser into the model. For example: Security measures that aim to protect against insider threats can be implemented on the client side with mechanisms very similar to how mobile applications are protected against tampering. Measures to protect web apps against data exfiltration and prevent display of sensitive information can be activated based on injecting a security policy into the browser. Automation of security steps allows organizations to extend their control over web apps (e.g., using robotic process automation). Figure 3. A control component (left) fetches policies to secure app access and browsing (right) Figure 4 shows the impact of secure implementation of a CDE and enterprise browser on CNAPP security practices. The use of both technologies enables security to become a boon for productivity as automation plays the dual role of simplifying user-facing processes around security to the benefit of increased productivity. Figure 4. CNAPP model and DevOps SDLC augmented with secure cloud development and browsing Conclusion The CNAPP model and the tools that implement it should be evolving their coverage in order to add resilience to new threats. The technologies discussed in this article are examples of how coverage can be improved to the left and further to the right of the SDLC. The goal of increasing coverage is to provide organizations more control over how they implement and deliver security in cloud-native applications across business scenarios. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. In today's cloud computing landscape, businesses are embracing the dynamic world of hybrid and multi-cloud environments and seamlessly integrating infrastructure and services from multiple cloud vendors. This shift from a single provider is driven by the need for greater flexibility, redundancy, and the freedom to leverage the best features from each provider and create tailored solutions. Furthermore, the rise of cloud-native technologies is reshaping how we interact with the cloud. Containerization, serverless, artificial intelligence (AI), and edge computing are pushing the boundaries of what's possible, unlocking a new era of innovation and efficiency. But with these newfound solutions comes a new responsibility: cost optimization. The complexities of hybrid and multi-cloud environments, coupled with the dynamic nature of cloud-native deployments, require a strategic approach to managing cloud costs. This article dives into the intricacies of cloud cost management in this new era, exploring strategies, best practices, and frameworks to get the most out of your cloud investments. The Role of Containers in Vendor Lock-In Vendor lock-in occurs when a company becomes overly reliant on a specific cloud provider's infrastructure, services, and tools. This can have a great impact on both agility and cost. Switching to a different cloud provider can be a complex and expensive process, especially as apps become tightly coupled with the vendor's proprietary offerings. Additionally, vendor lock-in can limit you from negotiating better pricing options or accessing the latest features offered by other cloud providers. Containers are recognized for their portability and ability to package applications for seamless deployment across different cloud environments by encapsulating an application's dependencies within a standardized container image (as seen in Figure 1). This means that you can theoretically move your containerized application from one cloud provider to another without significant code modifications. This flexibility affects greater cost control as you're able to leverage the competitive nature of the cloud landscape to negotiate the best deals for your business. Figure 1. Containerization explained With all that being said, complete freedom from vendor lock-in remains a myth with containers. While application code may be portable, configuration management tools, logging services, and other aspects of your infrastructure might still be tied up with the specific vendor's offerings. An approach that leverages open-source solutions whenever possible can maximize the portability effects of containers and minimize the risk of vendor lock-in. The Importance of Cloud Cost Management With evolving digital technologies, where startups and enterprises alike depend on cloud services for their daily operations, efficient cloud cost management is essential. To maximize the value of your cloud investment, understanding and controlling cloud costs not only prevents budget overruns but also ensures that resources are used optimally. The first step in effective cloud cost management is understanding your cloud bill. Most cloud providers now offer detailed billing reports that break down your spending by service, resource type, and region. Familiarize yourself with these reports and identify the primary cost drivers for your environment. 
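As one way to do this programmatically (a hedged sketch assuming an AWS environment with Cost Explorer enabled; the date range is illustrative), a short boto3 snippet can break a month's spend down by service:

Python
import boto3

# Cost Explorer must be enabled on the account for this call to return data.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-04-01", "End": "2024-05-01"},  # illustrative month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print each service's share of the bill, largest first.
groups = response["ResultsByTime"][0]["Groups"]
for group in sorted(groups, key=lambda g: -float(g["Metrics"]["UnblendedCost"]["Amount"])):
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")

Other providers expose equivalent billing exports and APIs; the point is to turn the monthly bill into a ranked list of cost drivers you can act on.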
Common cost factors include:

Transfer rates
Storage needs
Compute cycles consumed by your services

Once you have an understanding of these drivers, the next step is to identify and eliminate any cloud waste. Wasteful cloud spending is often attributed to unused or underutilized resources, which can easily happen if you leave them running overnight or on weekends, and this can significantly inflate your cloud bill. You can eliminate this waste by leveraging tools like autoscaling to automatically adjust resources based on demand. Additionally, overprovisioning (allocating more resources than necessary) can be another significant cost driver. Practices such as rightsizing, where you adjust the scale of your cloud resources to match demand, can lead to cost savings. Continuous monitoring and analysis of resource utilization is necessary to ensure that each service is perfectly fitted to its needs, neither over- nor under-provisioned. Finally, most cloud providers now offer cost-saving programs that can help optimize your spending. These may include reserved instances, where you get discounts for committing to a specific resource for a fixed period, or Spot instances that allow you to use unused capacity at a significantly lower price. Taking advantage of such programs requires a deep understanding of your current and projected usage to select the most beneficial option. Effective cloud cost management is not just about cutting costs but also about optimizing cloud usage in a way that aligns with organizational goals and strategies.

Selecting the Best Cloud Options for Your Organization

As a one-size-fits-all approach doesn't really exist when working with the cloud, choosing the best options for your specific needs is paramount. Below are some strategies that can help.

Assessing Organizational Needs

A thorough assessment of your organizational needs involves analyzing your workload characteristics, scalability, and performance requirements. For example, mission-critical applications with high resource demands might need different cloud configurations than static web pages. You can evaluate your current usage patterns and future project needs using machine learning and AI. Security and compliance needs are equally important considerations. Certain industries face regulatory requirements that can dictate data-handling and processing protocols. Identifying a cloud provider that meets these security and compliance standards is non-negotiable for protecting sensitive information. This initial assessment will help you identify which cloud services are suitable for your business needs and implement a proactive approach to cloud cost optimization.

Evaluating Cloud Providers

Once you have a clear understanding, the next step is to compare the offerings of different cloud providers. Evaluate their services based on key metrics, such as performance, cost efficiency, and the quality of customer support. Take advantage of free trials and demos to test drive their services and better assess their suitability. The final decision often comes down to one question: adopt a single- or multi-cloud strategy? Each approach offers specific advantages and disadvantages, so the optimal choice depends on your needs and priorities. The table below compares the key features of single-cloud and multi-cloud strategies to help you make an informed decision:

Table 1. Single- vs. multi-cloud approaches
Feature | Single-Cloud | Multi-Cloud
Simplicity | Easier to manage; single point of contact | More complex to manage; requires expertise in multiple platforms
Cost | Potentially lower costs through volume discounts | May offer lower costs overall by leveraging the best pricing models from different providers
Vendor lock-in | High; limited flexibility to switch providers | Low; greater freedom to choose and switch providers
Performance | Consistent performance if the provider is chosen well | May require optimization for performance across different cloud environments
Security | Easier to implement and maintain consistent security policies | Requires stronger security governance to manage data across multiple environments
Compliance | Easier to comply with regulations if provider offerings align with needs | May require additional effort to ensure compliance across different providers
Scalability | Scalable within the chosen provider's ecosystem | Offers greater horizontal scaling potential by leveraging resources from multiple providers
Innovation | Limited to innovations offered by the chosen provider | Access to a wider range of innovations and features from multiple providers

Modernizing Cloud Tools and Architectures

Having selected the right cloud options and established a solid foundation for cloud cost management, you need to ensure your cloud environment is optimized for efficiency and cost control. This requires a proactive approach that continuously evaluates and modernizes your cloud tools and architectures. Here, we introduce a practical framework for cloud modernization and continuous optimization:

Assessment – Analyze your current cloud usage using cost management platforms and identify inefficiencies and opportunities for cost reduction. Pinpoint idle or underutilized resources that can be scaled down or eliminated.
Planning – Armed with these insights, define clear goals and objectives for your efforts. These goals might include reducing overall cloud costs by a specific percentage, optimizing resource utilization, or improving scalability. Once you establish your goals, choose the right optimization strategies that will help you achieve them.
Implementation – Now is the time to put your plan into action. This can mean implementing cost-saving measures like autoscaling, which automatically adjusts your resources based on demand. Cloud cost management platforms can also play a crucial role in providing real-time visibility and automated optimization recommendations.
Monitoring and optimization – Cloud modernization is an ongoing process that requires continuous monitoring and improvement. Regularly review your performance metrics, cloud costs, and resource utilization to adapt your strategies as needed.

Figure 2. A framework for modernizing cloud environments

By following this framework, you can systematically improve your cloud environment and make sure it remains cost effective.

Conclusion

Cloud technologies offer a lot of benefits for businesses of all sizes. However, without a strategic approach to cost management, these benefits can be overshadowed by unexpected expenses. By following the best practices in this article, from understanding your cloud requirements and selecting the best cloud option to adopting continuous optimization for your tools and architectures, you can keep your cloud journey under financial control.
Looking ahead, the future of cloud computing looks exciting as serverless, AI, and edge computing promise to unlock even greater agility, scalability, and efficiency. Staying informed about these advancements, new pricing models, and emerging tools will be essential to maximize the value of your cloud investment. Cost optimization is not a one-time endeavor but an ongoing process that requires continuous monitoring, adaptation, and a commitment to extract the most value out of your cloud resources. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added. In practice, data scientists often work with Jupyter Notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following: Amazon SageMaker: A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints. Use Case For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes. This post doesn’t go into the details of the model but demonstrates a way to build an ML pipeline that builds and deploys any ML model. Solution Overview The following diagram summarizes the approach for the retraining pipeline. The workflow contains the following elements: AWS Glue crawler: You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. AWS Glue triggers: Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers. AWS Glue job: An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location. AWS Glue workflow: An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image. The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training job runs on a Python shell running in AWS Glue jobs, which starts a training job in Amazon SageMaker based on a set of hyperparameters. When the training job is complete, an endpoint is created, which is hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in InService status. 
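As a rough illustration of that last step, the Glue Python shell job could wait for the endpoint using a boto3 sketch like the one below. This is not the article's actual TrainingJob.py code, and the endpoint name is an assumed placeholder; the pipeline derives the real name from the stage prefix.

Python
import boto3

sagemaker = boto3.client("sagemaker")
endpoint_name = "dev-blazingtext-endpoint"  # placeholder endpoint name

# Block until the endpoint reaches InService (the waiter raises if it fails).
waiter = sagemaker.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sagemaker.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print("Endpoint status:", status)  # expected: InService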
At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.

Setting up the Environment

To set up the environment, complete the following steps:

Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
Download the following code into your local directory.

Organization of Code

The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
  --glue_scripts
    --DataExtractionJob.py
    --DataProcessingJob.py
    --MessagingQueueJob.py
    --TrainingJob.py
  --base_resources.template
  --deploy.sh
  --glue_resources.template

The code directory is divided into three parts:

AWS CloudFormation templates: The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to the ARNs later in the workflow.
AWS Glue scripts: The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as the model training and deployment scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
Bash script: A wrapper script, deploy.sh, is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.

Implementing the Solution

Complete the following steps:

Go to the deploy.sh file and replace the algorithm_image name with the <ecr_path> based on your Region. The following code example is a path for Region us-west-2:

Shell
algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms.

Enter the following code in your terminal:

Shell
sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
On the AWS Glue console, manually start the pipeline. In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.
To begin the workflow, in the Workflow section, select DevMLWorkflow. From the Actions drop-down menu, choose Run.
View the progress of your workflow on the History tab and select the latest RUN ID.
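If you prefer to start and monitor the workflow from code rather than the console, a minimal boto3 sketch along the following lines should work; the workflow name DevMLWorkflow and the profile name come from the article, while the polling interval is an illustrative assumption.

Python
import time
import boto3

# Assumes the AWS CLI profile configured earlier.
session = boto3.Session(profile_name="your_profile_name")
glue = session.client("glue")

# Start the workflow created by the CloudFormation templates.
run_id = glue.start_workflow_run(Name="DevMLWorkflow")["RunId"]
print("Started workflow run:", run_id)

# Poll the run until it reaches a terminal state (the full run takes ~30 minutes).
while True:
    run = glue.get_workflow_run(Name="DevMLWorkflow", RunId=run_id)["Run"]
    status = run["Status"]
    print("Workflow status:", status)
    if status in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(60)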
The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion. After the workflow is successful, open the Amazon SageMaker console. Under Inference, choose Endpoint. The following screenshot shows that the endpoint deployed by the workflow is ready. Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.

Cleaning Up

Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

Python
def delete_resources(self):
    endpoint_name = self.endpoint
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print("Deleted Test Endpoint ", endpoint_name)
    except Exception as e:
        print('Model endpoint deletion failed')
    try:
        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
        print("Deleted Test Endpoint Configuration ", endpoint_name)
    except Exception as e:
        print(' Endpoint config deletion failed')
    try:
        sagemaker.delete_model(ModelName=endpoint_name)
        print("Deleted Test Endpoint Model ", endpoint_name)
    except Exception as e:
        print('Model deletion failed')

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.
Microservices-based architecture splits applications into multiple independent deployable services, where each service provides a unique functionality. Every architectural style has pros and cons. One of the challenges of micro-service architecture is complex debugging/troubleshooting. Distributed Tracing In a microservice world, distributed tracing is the key to faster troubleshooting. Distributed tracing enables engineers to track a request through the mesh of services and therefore help them troubleshoot a problem. To achieve this, a unique identifier, say, trace-id, is injected right at the initiation point of a request, which is usually an HTTP load balancer. As the request hops through the different components (third-party apps, service-mesh, etc.), the same trace-id should be recorded at every component. This essentially requires propagation of the trace-id from one hop to another. Over a period, different vendors adopted different mechanisms to define the unique identifier (trace-id), for example: Zipkin B3 headers Datadog tracing headers Google proprietary trace context AWS proprietary trace context Envoy request id W3C trace context An application can adopt one of these available solutions as per the need. Accordingly, the relevant header (e.g., x-cloud-trace-context if Google proprietary trace context is adopted) should get injected at the request initiation and thereafter same value should get propagated to each of the components involved in the request lifecycle to achieve distributed tracing. W3C Trace Context Standard As the microservice world is evolving, there is a need to have a standard mechanism for trace propagation. Consider the case when two different applications that adopted two different trace propagation approaches, are used together. Since they use two different headers for trace propagation, distributed tracing gets broken when they communicate. To address such problems, it is recommended to use the W3C trace context across all components. W3C trace context is the standard that is being adopted by all major vendors for supporting cross-vendor distributed traces. Problem: Broken Traces OpenTelemetry supports the W3C trace context header "traceparent" propagation using auto-instrumentation. This means, as an application developer, I need not write any code in my application for trace context propagation when I instrument it with OpenTelemetry. For example, if I have a Java application, I can instrument it as shown below: java -javaagent:opentelemetry-javaagent.jar -Dotel.service.name=app-name -jar app.jar The traceparent header will now be automatically generated/propagated by the instrumented Java application. However, when my application, instrumented using OpenTelemetry, gets deployed behind GCP or AWS HTTP load balancer, my expectation to visualize the complete trace starting from the load balancer fails. This is because GCP HTTP Load Balancer supports their proprietary trace context header "X-Cloud-Trace-Context". See GCP documentation for more details. AWS Elastic Load Balancer supports their proprietary trace context header "X-Amzn-Trace-Id". See AWS documentation for more details. My application generates and logs the W3C traceparent header. So, the unique-identifier generated by the GCP/AWS load balancer is not propagated further by my application. This is the typical problem of broken traces, also described above. So how can a developer leverage the out-of-the-box OpenTelemetry trace context propagation functionality? 
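To make the mismatch concrete before getting to the solution, here is a small Python sketch of the mapping between the two header formats. It assumes the documented shapes of the headers (GCP's X-Cloud-Trace-Context carries a 32-hex-character trace ID plus a decimal span ID, while W3C's traceparent is version-traceid-parentid-flags); it is only an illustration, not the Istio-based fix described below.

Python
# Illustrative only: maps a GCP trace header to a W3C traceparent header.
# Example input:  "4bf92f3577b34da6a3ce929d0e0e4736/123456789;o=1"
# Example output: "00-4bf92f3577b34da6a3ce929d0e0e4736-00000000075bcd15-01"

def gcp_to_traceparent(x_cloud_trace_context: str) -> str:
    trace_part, _, options = x_cloud_trace_context.partition(";")
    trace_id, _, span_id = trace_part.partition("/")
    sampled = "01" if "o=1" in options else "00"
    # traceparent: <version>-<32 hex trace id>-<16 hex parent/span id>-<flags>
    return f"00-{trace_id}-{int(span_id or 0):016x}-{sampled}"

print(gcp_to_traceparent("4bf92f3577b34da6a3ce929d0e0e4736/123456789;o=1"))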
Solution: GCP Trace Context Transformer

We have solved this problem by transforming the GCP/AWS proprietary trace context header (X-Cloud-Trace-Context/X-Amzn-Trace-Id) to the W3C trace context header (traceparent). Service mesh is a key component in a distributed system to enforce organization policies consistently across all the applications. One of the popular service meshes, Istio, can help in solving our problem. The diagram below elaborates on the solution:

A common Trace-Id value across all the logs generated from the load balancer, istio-ingress gateway, istio-sidecar, and application logs helps in stitching together all the logs for a request. Istio allows you to extend the data-plane behavior by writing custom logic using either Lua or WASM. We have extended the istio-ingress gateway by injecting a Lua filter. This filter extracts the trace-id from X-Cloud-Trace-Context and creates the traceparent request header using that value.

Note: For the sake of simplicity, the filter code below is built only for the GCP "X-Cloud-Trace-Context". One can write a similar filter for AWS "X-Amzn-Trace-Id". While adopting the filter in your infrastructure, don't forget to choose the right namespace and workloadSelector label. This filter has been tested on Istio 1.20.1.

YAML
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: gcp-trace-context-transformer-gateway
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: HTTP_FILTER # http connection manager is a filter in Envoy
    match:
      context: GATEWAY
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.lua
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua"
          inlineCode: |
            function envoy_on_request(request_handle)
              local z = request_handle:headers():get("traceparent")
              if z ~= nil then
                return
              end
              local x = request_handle:headers():get("X-Cloud-Trace-Context")
              if x == nil then
                return
              end
              local y = string.gmatch(x, "%x+")
              local traceid = y()
              if (traceid == nil) then
                return
              end
              -- generate a new 16 hex-character random span
              math.randomseed(os.time())
              local parentid = math.random(0xffffffff, 0x7fffffffffffffff)
              local traceparent = string.format("00-%s-%016x-01", traceid, parentid)
              request_handle:headers():add("traceparent", traceparent)
            end
            function envoy_on_response(response_handle)
              return
            end

Alternate Solution: Custom GCP Trace Context Propagator

Another possible solution could be extending OpenTelemetry to support the propagation of the GCP proprietary trace context. One implementation exists on GitHub but, alas, it is still in alpha state (at the time of publishing this article). Further, this solution will only work for GCP environments; similar propagators will be needed for other cloud providers (AWS, etc.).
For a long time, AWS CloudTrail has been the foundational technology that enabled organizations to meet compliance requirements by capturing audit logs for all AWS API invocations. CloudTrail Lake extends CloudTrail's capabilities by adding support for a SQL-like query language to analyze audit events. The audit events are stored in a columnar format called ORC to enable high-performance SQL queries. An important capability of CloudTrail Lake is the ability to ingest audit logs from custom applications or partner SaaS applications. With this capability, an organization can get a single aggregated view of audit events across AWS API invocations and their enterprise applications. As each end-to-end business process can span multiple enterprise applications, an aggregated view of audit events across them becomes a critical need. This article discusses an architectural approach to leverage CloudTrail Lake for auditing enterprise applications and the corresponding design considerations. Architecture Let us start by taking a look at the architecture diagram. This architecture uses SQS Queues and AWS Lambda functions to provide an asynchronous and highly concurrent model for disseminating audit events from the enterprise application. At important steps in business transactions, the application will call relevant AWS SDK APIs to send the audit event details as a message to the Audit event SQS queue. A lambda function is associated with the SQS queue so that it is triggered whenever a message is added to the queue. It will call the putAuditEvents() API provided by CloudTrail Lake to ingest Audit Events into the Event Data Store configured for this enterprise application. Note that the architecture shows two other Event Data stores to illustrate that events from the enterprise application can be correlated with events in the other data stores. Required Configuration Start by creating an Event Data Store which accepts events of category AuditEventLog. Note down the ARN of the event data store created. It will be needed for creating an integration channel. Shell aws cloudtrail create-event-data-store \ --name custom-events-datastore \ --no-multi-region-enabled \ --retention-period 90 \ --advanced-event-selectors '[ { "Name": "Select all external events", "FieldSelectors": [ { "Field": "eventCategory", "Equals": ["ActivityAuditLog"] } ] } ]' Create an Integration with the source as "My Custom Integration" and choose the delivery location as the event data store created in the previous step. Note the ARN of the channel created; it will be needed for coding the Lambda function. Shell aws cloudtrail create-channel \ --region us-east-1 \ --destinations '[{"Type": "EVENT_DATA_STORE", "Location": "<event data store arn>"}]' \ --name custom-events-channel \ --source Custom Create a Lambda function that would contain the logic to receive messages from an SQS queue, transform the message into an audit event, and send it to the channel created in the previous step using the putAuditEvents() API. Refer to the next section to understand the main steps to be included in the lambda function logic. Add permissions through an inline policy for the Lambda function, to be authorized to put audit events into the Integration channel. JSON { "Version": "2012-10-17", "Statement": [ { "Sid": "Statement1", "Effect": "Allow", "Action": "cloudtrail-data:PutAuditEvents", "Resource": "<channel arn>" }] } Create a SQS queue of type "Standard" with an associated dead letter queue. 
Add permissions to the Lambda function using an inline policy to allow receiving messages from the SQS queue.

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Action": "sqs:*",
      "Resource": "<SQS Queue arn>"
    }
  ]
}

In the Lambda function configuration, add a trigger by choosing the source as "SQS" and specifying the ARN of the SQS queue created in the previous step. Ensure that the "Report batch item failures" option is selected. Finally, ensure that permissions to send messages to this queue are added to the IAM role assigned to your enterprise application.

Lambda Function Code

The code sample will focus on the Lambda function, as it is at the crux of the solution.

Java
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.cloudtraildata.AWSCloudTrailData;
import com.amazonaws.services.cloudtraildata.AWSCloudTrailDataClientBuilder;
import com.amazonaws.services.cloudtraildata.model.AuditEvent;
import com.amazonaws.services.cloudtraildata.model.PutAuditEventsRequest;
import com.amazonaws.services.cloudtraildata.model.PutAuditEventsResult;
import com.amazonaws.services.cloudtraildata.model.ResultErrorEntry;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSBatchResponse;
import com.amazonaws.services.lambda.runtime.events.SQSBatchResponse.BatchItemFailure;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.amazonaws.services.lambda.runtime.events.SQSEvent.SQSMessage;

public class CustomAuditEventHandler implements RequestHandler<SQSEvent, SQSBatchResponse> {

    // ARN of the integration channel created earlier, e.g., supplied through
    // an environment variable or other configuration.
    private final String channelARN = System.getenv("CHANNEL_ARN");

    public SQSBatchResponse handleRequest(final SQSEvent event, final Context context) {
        List<SQSMessage> records = event.getRecords();
        AWSCloudTrailData client = AWSCloudTrailDataClientBuilder.defaultClient();
        PutAuditEventsRequest request = new PutAuditEventsRequest();
        List<AuditEvent> auditEvents = new ArrayList<AuditEvent>();
        request.setChannelArn(channelARN);
        for (SQSMessage record : records) {
            AuditEvent auditEvent = new AuditEvent();
            // Add logic in the transformToEventData() operation to transform contents of
            // the message to the event data format needed by Cloud Trail Lake.
            String eventData = transformToEventData(record);
            context.getLogger().log("Event Data JSON: " + eventData);
            auditEvent.setEventData(eventData);
            // Set a source event ID. This could be useful to correlate the event
            // data stored in Cloud Trail Lake to relevant information in the enterprise
            // application.
            auditEvent.setId(record.getMessageId());
            auditEvents.add(auditEvent);
        }
        request.setAuditEvents(auditEvents);
        PutAuditEventsResult putAuditEvents = client.putAuditEvents(request);
        context.getLogger().log("Put Audit Event Results: " + putAuditEvents.toString());
        SQSBatchResponse response = new SQSBatchResponse();
        List<BatchItemFailure> failures = new ArrayList<SQSBatchResponse.BatchItemFailure>();
        for (ResultErrorEntry result : putAuditEvents.getFailed()) {
            BatchItemFailure batchItemFailure = new BatchItemFailure(result.getId());
            failures.add(batchItemFailure);
            context.getLogger().log("Failed Event ID: " + result.getId());
        }
        response.setBatchItemFailures(failures);
        return response;
    }

    private String transformToEventData(SQSMessage record) {
        // Placeholder: convert the SQS message body into the event data JSON
        // expected by the CloudTrail Lake schema.
        return record.getBody();
    }
}

The first thing to note is that the type specification for the class uses SQSBatchResponse, as we want the audit event messages to be processed as batches. Each enterprise application would have its own format for representing audit messages. The logic to transform the messages to the format needed by the CloudTrail Lake data schema should be part of the Lambda function. This would allow for using the same architecture even if the audit events need to be ingested into a different (SIEM) tool instead of CloudTrail Lake. Apart from the event data itself, the putAuditEvents() API of CloudTrail Lake expects a source event ID to be provided for each event. This could be used to tie the audit event stored in CloudTrail Lake to relevant information in the enterprise application. The messages that failed to be ingested should be added to the list of failed records in the SQSBatchResponse object. This ensures that all the successfully processed records are deleted from the SQS queue and the failed records are retried at a later time. Note that the code is using the source event ID (result.getId()) as the ID for failed records.
This is because the source event id was set as the message id earlier in the code. If a different identifier has to be used as the source event id, it has to be mapped to the message id. The mapping will help with finding the message ids for records that were not successfully ingested while framing the lambda function response. Architectural Considerations This section discusses the choices made for this architecture and the corresponding trade-offs. These need to be considered carefully while designing your solution. FIFO VS Standard Queues Audit events are usually self-contained units of data. So, the order in which they are ingested into the CloudTrail Lake should not affect the information conveyed by them in any manner. Hence, there is no need to use a FIFO queue to maintain the information integrity of audit events. Standard queues provide higher concurrency than FIFO queues with respect to fanning out messages to Lambda function instances. This is because, unlike FIFO queues, they do not have to maintain the order of messages at the queue or message group level. Achieving a similar level of concurrency with FIFO queues would require increasing the complexity of the source application as it has to include logic to fan out messages across message groups. With standard queues, there is a small chance of multiple deliveries of the same message. This should not be a problem as duplicates could be filtered out as part of the Cloud Data Lake queries. SNS Vs SQS: This architecture uses SQS instead of SNS for the following reasons: SNS does not support Lambda functions to be triggered for standard topics. SQS through its retry logic, provides better reliability with respect to delivering messages to the recipient than SNS. This is a valuable capability, especially for data as important as audit events. SQS can be configured to group audit events and send those to Lambda to be processed in batches. This helps with the performance/cost of the Lambda function and avoids overwhelming CloudTrail Lake with a high number of concurrent connection requests. There are other factors to consider as well such as the usage of private links, VPC integration, and message encryption in transit, to securely transmit audit events. The concurrency and message delivery settings provided by SQS-Lambda integration should also be tuned based on the throughput and complexity of the audit events. The approach presented and the architectural considerations discussed provide a good starting point for using CloudTrail Lake with enterprise applications.
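As a closing illustration of the producer side of this architecture (the enterprise application dropping audit events onto the queue), a minimal boto3 sketch might look like the following; the queue URL and the event payload fields are illustrative assumptions, not part of the original design.

Python
import json
import uuid
import boto3

sqs = boto3.client("sqs")

# Illustrative queue URL; use the audit event queue created in the configuration steps.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/audit-event-queue"

def send_audit_event(actor: str, action: str, resource: str) -> str:
    """Send one audit event message; returns the SQS message ID."""
    event = {
        "eventId": str(uuid.uuid4()),
        "actor": actor,        # who performed the action (illustrative field)
        "action": action,      # e.g., "ApproveOrder"
        "resource": resource,  # business object affected
    }
    response = sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))
    return response["MessageId"]

print(send_audit_event("jane.doe", "ApproveOrder", "order-1042"))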
Over the years, Docker containers have completely changed how developers create, share, and run applications. With their flexible design, Docker containers ensure a consistent environment across various platforms, simplifying the process of deploying applications reliably. When integrated with .NET, developers can harness Docker's capabilities to streamline the development and deployment phases of .NET applications. This article delves into the advantages of using Docker containers with .NET applications and offers a guide on getting started.

Figure courtesy of Docker

Why Choose Docker for .NET?

1. Consistent Development Environment

Docker containers encapsulate all dependencies and configurations for running an application, guaranteeing consistency across development, testing, and production environments. By leveraging Docker, developers can avoid the typical "it works on my machine" issue, as they can create environments that operate flawlessly across various development teams and devices.

2. Simplified Dependency Management

Docker eliminates the need to manually install and manage dependencies on developer machines. By specifying dependencies in a Dockerfile, developers can effortlessly bundle their .NET applications with libraries and dependencies, reducing setup time and minimizing compatibility issues.

3. Scalability and Resource Efficiency

Due to its containerization technology, Docker is well suited for horizontally or vertically scaling .NET applications. Developers can easily set up instances of their applications using Docker Swarm or Kubernetes, which helps optimize resource usage and enhance application performance.

4. Simplified Deployment Process

Docker simplifies the deployment of .NET applications. Developers can wrap their applications into Docker images, which can be deployed to any Docker-compatible environment, including local servers, cloud platforms like AWS or Azure, and even IoT devices. This not only streamlines the deployment process but also accelerates the release cycle of .NET applications.

Starting With Docker and .NET

Step 1: Installing Docker

Installing Docker is easy: download Docker Desktop, which is available for Windows, Mac, and Linux. I have downloaded and installed it for Windows. Once installed, the Docker (whale) icon is shown in the system tray as shown below. When you click on the icon, it opens the Docker Desktop dashboard, where you can see the list of containers, images, volumes, builds, and extensions. The figure below shows the list of containers I have created on my local machine.

Step 2: Creating a .NET Application

Create a .NET application using the tool of your choice, such as Visual Studio, Visual Studio Code, or the .NET CLI. For example, you can use the following command directly from the command line.

PowerShell
dotnet new web -n MinimalApiDemo

Step 3: Setting up Your Application With a Dockerfile

Create a Dockerfile in the root folder of your .NET project to specify the Docker image for your application. Below is an example of a Dockerfile for the ASP.NET Core application created in the previous step.

Dockerfile
# Use the official ASP.NET Core runtime as a base image
FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS base
WORKDIR /app
EXPOSE 8080

# Use the official SDK image to build the application
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY ["MinimalApiDemo.csproj", "./"]
RUN dotnet restore "MinimalApiDemo.csproj"
COPY . .
WORKDIR "/src/"
RUN dotnet build "MinimalApiDemo.csproj" -c Release -o /app/build

# Publish the application
FROM build AS publish
RUN dotnet publish "MinimalApiDemo.csproj" -c Release -o /app/publish

# Final image with only the published application
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "MinimalApiDemo.dll"]

Step 4: Creating and Launching Your Docker Image

Create a Docker image by executing the following command from a terminal window (use lowercase letters).

PowerShell
docker build -t minimalapidemo .

After the build finishes, you are ready to start up your Docker image by running it inside a container. Run the docker command below to spin up a new container.

PowerShell
docker run -d -p 8080:8080 --name myminimalapidemo minimalapidemo

Your API service is now running within a Docker container and can be reached on localhost as shown below. Refer to my previous article to see how I created product controllers using Minimal APIs with different HTTP endpoints.

Here Are Some Recommended Strategies for Dockerizing .NET Applications

1. Reduce Image Size

Enhance the efficiency of your Docker images by utilizing multi-stage builds, eliminating unnecessary dependencies, and minimizing layers in your Dockerfile.

2. Utilize a .dockerignore File

Generate a .dockerignore file to exclude files and directories from being transferred into the Docker image, thereby decreasing image size and enhancing build speed.

3. Ensure Container Security

Adhere to security practices during the creation and operation of Docker containers, including updating base images, conducting vulnerability scans, and restricting container privileges.

4. Employ Docker Compose for Multi-Container Applications

For applications with multiple services or dependencies, leverage Docker Compose to define and manage multi-container applications, simplifying both development and deployment processes.

5. Monitor and Troubleshoot Containers

Monitor the performance and health of your Docker containers using Docker's own monitoring tools or third-party solutions. Make use of tools such as Docker logs and debugging utilities to promptly resolve issues and boost the efficiency of your containers.

Conclusion

Docker containers offer an efficient platform for the development, packaging, and deployment of .NET applications. By containerizing these applications, developers can create consistent development environments, simplify dependency management, and streamline deployment processes. Whether the focus is on microservices, web apps, or APIs, Docker provides a proficient method to operate .NET applications across various environments. By adhering to best practices and maximizing Docker's capabilities, developers can fully leverage the benefits of containerization, thereby accelerating the process of building and deploying .NET applications.
I’m a senior solution architect and polyglot programmer interested in the evolution of programming languages and their impact on application development. Around three years ago, I encountered WebAssembly (Wasm) through the .NET Blazor project. This technology caught my attention because it can execute applications at near-native speed across different programming languages. This was especially exciting to me as a polyglot programmer since my programming expertise ranges across multiple programming languages including .NET, PHP, Node.js, Rust, and Go. Most of the work I do is building cloud-native enterprise applications, so I have been particularly interested in advancements that broaden Wasm’s applicability in cloud-native development. WebAssembly 2.0 was a significant leap forward, improving performance and flexibility while streamlining integration with web and cloud infrastructures to make Wasm an even more powerful tool for developers to build versatile and dynamic cloud-native applications. I aim to share the knowledge and understanding I've gained, providing an overview of Wasm’s capabilities and its potential impact on the cloud-native development landscape. Polyglot Programming and the Component Model My initial attraction to WebAssembly stemmed from its capability to enhance browser functionalities for graphic-intensive and gaming applications, breaking free from the limitations of traditional web development. It also allows developers to employ languages like C++ or Rust to perform high-efficiency computations and animations, offering near-native performance within the browser environment. Wasm’s polyglot programming capability and component model are two of its flagship capabilities. The idea of leveraging the strengths of various programming languages within a unified application environment seemed like the next leap in software development. Wasm offers the potential to leverage the unique strengths of various programming languages within a single application environment, promoting a more efficient and versatile development process. For instance, developers could leverage Rust's speed for performance-critical components and .NET's comprehensive library support for business logic to optimize both development efficiency and application performance. This led me to Spin, an open-source tool for the creation and deployment of Wasm applications in cloud environments. To test Wasm’s polyglot programming capabilities, I experimented with the plugins and middleware models. I divided the application business logic into one component, and the other component with the Spin component supported the host capabilities (I/O, random, socket, etc.) to work with the host. Finally, I composed with http-auth-middleware, an existing component model from Spin for OAuth 2.0, and wrote more components for logging, rate limit, etc. All of them were composed together into one app and run on the host world (Component Model). Cloud-Native Coffeeshop App The first app I wrote using WebAssembly was an event-driven microservices coffeeshop app written in Golang and deployed using Nomad, Consul Connect, Vault, and Terraform (you can see it on my GitHub). I was curious about how it would work with Kubernetes, and then Dapr. I expanded it and wrote several use cases with Dapr such as entire apps with Spin, polyglot apps (Spin and other container apps with Docker), Spin apps with Dapr, and others. 
What I like about it is the speed of start-up time (it's very quick to get up and running) and the size of the app – it looks like a tiny but powerful app. The WebAssembly ecosystem has matured a lot in the past year as it relates to enterprise projects. For the types of cloud-native projects I'd like to pursue, it would benefit from a more developed support system for stateful applications, as well as an integrated messaging system between components. I would love to see more capabilities that my enterprise customers need, such as gRPC or other communication protocols (Spin currently only supports HTTP), data processing and transformation like data pipelines, a multi-threading mechanism, CQRS, polyglot programming language aggregations (internal modular monolith style or external microservices style), and content negotiation (XML, JSON, plain text). We also need real-world examples demonstrating Wasm's capabilities to tackle enterprise-level challenges, fostering a better understanding and wider technology adoption. We can see how well ZEISS does from their presentation at KubeCon in Paris last month. I would like to see more companies like them get involved; then, from the developer perspective, we will benefit a lot. Not only will we be able to easily develop WebAssembly apps, but many enterprise scenarios will also be addressed, and we will work together to make WebAssembly more practical and effective.

The WebAssembly Community

Sharing my journey with the WebAssembly community has been a rewarding part of my exploration, especially with the Spin community, who have been so helpful in sharing best practices and new ideas. Through tutorials and presentations at community events, I've aimed to contribute to the collective understanding of WebAssembly and cloud-native development, and I hope to see more people sharing their experiences. I will continue creating tutorials and educational content, as well as diving into new projects using WebAssembly, to inspire and educate others about its potential. I would encourage anyone getting started to get involved in the Wasm community of your choice to accelerate your journey.

WebAssembly's Cloud-Native Future

I feel positive about the potential for WebAssembly to change how we do application development, particularly in the cloud-native space. I'd like to explore how Wasm could underpin the development of hybrid cloud platforms and domain-specific applications. One particularly exciting prospect is the potential for building an e-commerce platform based on WebAssembly, leveraging its cross-platform capabilities and performance benefits to offer a superior user experience. The plugin model has existed for a long time in the e-commerce world (see what Shopify did), and with WebAssembly's component model, we can build the application with polyglot programming languages such as Rust, Go, TypeScript, .NET, Java, PHP, etc. WebAssembly 2.0 supports the development of more complex and interactive web applications, opening the door for new use cases such as serverless stateless functions, data transformation, and full-fledged web API functionality, as well as moving into edge devices (including some embedded components). New advancements like WASI 3.0 with asynchronous components are bridging the gaps. I eagerly anticipate the further impact of WebAssembly on our approaches to building and deploying applications. We're just getting started.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices. A recent conversation with a fellow staff engineer at a Top 20 technology company revealed that their underlying infrastructure is self-managed and does not leverage cloud-native infrastructure offered by major providers like Amazon, Google, or Microsoft. Hearing this information took me a minute to comprehend given how this conflicts with my core focus on leveraging frameworks, products, and services for everything that doesn't impact intellectual property value. While I understand the pride of a Top 20 technology company not wanting to contribute to the success of another leading technology company, I began to wonder just how successful they could be if they utilized a cloud-native approach. That also made me wonder how many other companies have yet to adopt a cloud-native approach… and the impact it is having on their APIs. Why Cloud? Why Now? For the last 10 years, I have been focused on delivering cloud-native API services for my projects. While cloud adoption continues to gain momentum, a decent percentage of corporations and technology providers still utilize traditional on-premises designs. According to The Cloud in 2021: Adoption Continues report by O'Reilly Media, Figure 1 provides a summary of the state of cloud adoption in December 2021. Figure 1. Cloud technology usage Image adapted from The Cloud in 2021: Adoption Continues, O'Reilly Media Since the total percentages noted in Figure 1 exceed 100%, the underlying assumption is that it is common for respondents to maintain both a cloud and on-premises design. However, for those who are late to enter the cloud native game, I wanted to touch on some common benefits that are recognized with cloud adoption: Focus on delivering or enhancing laser-focused APIs — stop worrying about and managing on-premises infrastructure. Scale your APIs up (and down) as needed to match demand — this is a primary use case for cloud adoption. Reduce risk by expanding your API presence — leverage availability zones, regions, and countries. Describe the supporting API infrastructure as code (IaC) — faster recovery and expandability into new target locations. Making the transition toward cloud native has become easier than ever, with the major providers offering free or discounted trial periods. Additionally, smaller platform-as-a-service (PaaS) providers like Heroku and Render provide solutions that allow teams to focus on their products and services and not worry about the underlying infrastructure design. The Cloud Native Impact on Your API Since this Trend Report is focused on modern API management, I wanted to focus on a few of the benefits that cloud native can have on APIs. Availability and Latency Objectives When providing APIs for your consumers to consume, the concept of service-level agreements (SLAs) is a common onboarding discussion topic. This is basically where expectations are put into easy-to-understand wording that becomes a binding contract between the API provider and the consumer. Failure to meet these expectations can result in fees and, in some cases, legal action. API service providers often take things a step further by establishing service-level objectives (SLOs) that are even more stringent. The goal here is to establish monitors and alerts to remediate issues before they breach contractual SLAs. 
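As a simple illustration of how such an objective might be monitored in practice, here is a minimal sketch that samples a health endpoint and compares the observed p95 latency against a latency SLO. The endpoint URL, threshold, and sample count are illustrative assumptions, not part of the original article.
Python
import statistics
import time

import requests  # third-party HTTP client

API_URL = "https://api.example.com/health"  # hypothetical endpoint
LATENCY_SLO_MS = 300                        # hypothetical p95 latency objective
SAMPLES = 50

def measure_latency_ms(url: str) -> float:
    # Time a single synchronous request to the API.
    start = time.perf_counter()
    requests.get(url, timeout=5)
    return (time.perf_counter() - start) * 1000

def check_latency_slo() -> None:
    samples = [measure_latency_ms(API_URL) for _ in range(SAMPLES)]
    p95 = statistics.quantiles(samples, n=20)[-1]  # rough 95th percentile
    if p95 > LATENCY_SLO_MS:
        print(f"ALERT: p95 latency {p95:.0f} ms exceeds the {LATENCY_SLO_MS} ms SLO")
    else:
        print(f"OK: p95 latency {p95:.0f} ms is within the SLO")

if __name__ == "__main__":
    check_latency_slo()
In a real setup, this kind of check would typically live in a monitoring platform rather than a script, but the idea is the same: measure continuously and alert before the contractual SLA is breached.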
But what happens when the SLOs and SLAs struggle to be met? This is where the primary cloud native use case can assist. If the increase in latency is due to hardware limitations, the service can be scaled up vertically (by increasing the hardware) or horizontally (by adding more instances). If the increase in latency is driven by geographical location, introducing service instances in closer regions is something cloud native providers can provide to remedy this scenario. API Management As your API infrastructure expands, a cloud-native design provides the necessary tooling to ease supportability and manageability efforts. From an infrastructure perspective, the underlying definition of the service is defined using an IaC approach, allowing the service itself to become defined in a single location. As updates are made to that base design, those changes can be rolled out to each target service instance, avoiding any drift between service instances. From an API management perspective, cloud native providers include the necessary tooling to manage the APIs from a usage perspective. Here, API keys can be established, which offer the ability to impose thresholds on requests that can be made or features that align with service subscription levels. Cloud Native !== Utopia While APIs flourish in cloud native implementations, it is important to recognize that a cloud-native approach is not without its own set of challenges. Cloud Cost Management CloudZero's The State Of Cloud Cost Intelligence 2022 report concluded that only 40% of respondents indicated that their cloud costs were at an expected level as noted in Figure 2. Figure 2. Cloud native cost realities Image adapted from The State Of Cloud Cost Intelligence, CloudZero This means that 60% of respondents are dealing with higher-than-expected cloud costs, which ultimately impact an organization's ability to meet planned objectives. Cloud native spending can often be remediated by adopting the following strategies: Require team-based tags or cloud accounts to help understand levels of spending at a finer grain. Focus on storage buckets and database backups to understand if the cost is in line with the value. Engage a cloud business partner that specializes in cloud spending analysis. Account Takeover The concept of accounts becoming "hacked" is prevalent in social media. At times, I feel like my social media feed contains more "my account was hacked" messages than the casual updates I was tuning in to read. Believe it or not, the concept of account takeover is becoming a common fear for cloud native adopters. Imagine starting your day only to realize you no longer have access to any of your cloud-native services. Soon thereafter, your customers begin to flood your support lines to ask what is going on… and where the data they were expecting to see with each API call is. Another potential consequence is that the APIs are shut down completely, forcing customers to seek out competing APIs. Remember, your account protection is only as strong as your weakest link. Make sure to employ everything possible to protect your account and move away from simple username + password account protection. Disaster Recovery It is also important to recognize that cloud native is not a replacement for maintaining a strong disaster recovery posture. Understand the impact of availability zone and region-wide outages — both are expected to happen. Plan to implement immutable backups — avoid relying on traditional backups and snapshots. 
Leverage IaC to establish all aspects of cloud native — and test it often. Alternative Flows Exist While a cloud-native approach provides an excellent landscape to help your business and partnerships be successful, there are use cases that present themselves as alternative flows for cloud native adoption: Regulatory requirements for a given service can make it ineligible for cloud native adoption. Point of presence requirements can also become a blocker for cloud native adoption when the closest cloud-native location is not close enough to meet the established SLAs and SLOs. On the Other Side of API Cloud Adoption By adopting a cloud-native approach, it is possible to extend an API across multiple availability zones and geographical regions within a given point of presence. Figure 3. Multi-region cloud native adoption In Figure 3, an API service instance is deployed in three different geographical regions. Additionally, each region contains API service instances running in three different availability zones — each with its own network and power source. In this example, there are nine distinct instances running across the United States. By introducing a global common name, consumers always receive a service response from the least-latent and available service instance. This approach easily allows for entire regions to be taken offline for disaster recovery validation without any interruptions of service at the consumer level. Conclusion Readers familiar with my work may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional: Focus your time on delivering features/functionality that extend the value of your intellectual property. Leverage frameworks, products, and services for everything else. —John Vester When I think about my conversation with the staff engineer at the Top 20 tech company, I wonder how much more successful his team would be if they did not have to worry about managing the underlying infrastructure with their on-premises approach. While the other side of cloud native is not without challenges, it does adhere to my mission statement. As a result, projects that I have worked on for the last 10 years have been able to remain focused on meeting the needs of API consumers while staying in line with corporate objectives. From an API perspective, cloud native offers additional ways to adhere to my personal mission statement by describing everything related to the service using IaC and leveraging built-in tooling to manage the APIs across different availability zones and regions. Have a really great day! This is an excerpt from DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices.
Imagine building a complex machine with numerous independent parts, each performing its function, but all needing to communicate effectively with each other to accomplish a task. This is the challenge we face when designing cloud-native applications, which consist of interconnected microservices and serverless components. In this article, we explore the specifics of designing robust and resilient communication systems that can effectively coordinate these independent elements both within and beyond the application boundaries. These fine-grained services engage in internal and external interactions, employing various communication methods, synchronous or asynchronous. In synchronous communication, a service invokes another service using HTTP or gRPC, awaiting a response within a specified timeframe before proceeding. Conversely, asynchronous communication involves exchanging messages without expecting an immediate response. Message brokers such as RabbitMQ or Kafka serve as intermediaries, buffering messages to ensure reliable delivery. In cloud-native applications, embracing a combination of communication patterns is often a practical approach. Let's begin with synchronous communication. What Is Synchronous Communication? Synchronous communication is like a conversation. One service (let's call it Service A) initiates a request and then waits for a response from another service (Service B) or external APIs. This is akin to asking a question and waiting for an answer. Service A sends a request over HTTP and waits. It's either waiting for a response from Service B or for a maximum waiting time to expire. During this waiting period, Service A is temporarily blocked, much like a person who pauses their activities to wait for a response. This pattern, often referred to as a request-reply pattern, is relatively simple to implement. However, using it extensively can introduce challenges that require careful consideration. Synchronous communication in the cloud Challenges of Synchronous Communication While synchronous communication is a powerful tool in our cloud-native toolkit, it comes with its own set of challenges. Temporal Coupling Excessive reliance on synchronous communication throughout the solution can lead to temporal coupling issues. It occurs when numerous synchronous calls are chained together, resulting in extended wait times for client applications to receive responses. Availability Dependency Synchronous communication necessitates the simultaneous availability of all communicating services. If backend services are under unexpected loads, client applications may experience failures due to timeout errors, impacting overall performance. Network Quality Impact The quality of the network, including available bandwidth and the time it takes for responses to travel between backend services, directly impacts the performance of synchronous communication. Despite these challenges, synchronous communication can prove invaluable in specific scenarios. Let us explore some use cases in the next section where synchronous communication might be the better choice. When To Use Synchronous Communication Here are some situations where using synchronous communication can prove to be a better choice. Real-Time Data Access or Guaranteed Outcome When immediate or real-time feedback is needed, synchronous communication offers efficiency.
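Here is a minimal sketch of such a blocking, request-reply call, using an inventory-check scenario. The endpoint, payload, and timeout value are illustrative assumptions rather than part of the original article.
Python
import requests  # third-party HTTP client

INVENTORY_URL = "http://inventory-service/api/stock"  # hypothetical Service B endpoint

def check_stock(sku: str) -> bool:
    # Service A blocks here until Service B answers or the timeout expires.
    try:
        response = requests.get(INVENTORY_URL, params={"sku": sku}, timeout=2.0)
        response.raise_for_status()
        return response.json().get("in_stock", False)
    except requests.Timeout:
        # The maximum waiting time expired; Service A must decide how to degrade.
        return False

if check_stock("SKU-12345"):
    print("Item is in stock; proceed with the order")
else:
    print("Item unavailable or the inventory service is too slow")
The explicit timeout keeps Service A from blocking indefinitely, which ties directly to the availability and network-quality concerns described above.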
For instance, when a customer places an order on an e-commerce website, the e-commerce front end needs to check the inventory system to ensure the item is in stock. This is a synchronous operation because the application needs to wait for the inventory system's response before it can proceed with the order. Orchestrating Sequence of Dependent Tasks In cases where a service must execute a sequence of tasks, each dependent on the previous one, synchronous communication maintains order. It is specifically appropriate for workflows where task order is critical. Maintaining Transactional Integrity When maintaining data consistency across multiple components is vital, synchronous communication can help maintain atomic transactions. It is relevant for scenarios like financial transactions where data integrity is paramount. Synchronous communication is a powerful tool, but it has its challenges. The good news is that we also have the option of asynchronous communication—a complementary style that can work alongside synchronous methods. Let us explore this further in the next section. What Is Asynchronous Communication? Asynchronous communication patterns offer a dynamic and efficient approach to inter-service communication. Unlike synchronous communication, asynchronous communication allows a service to initiate a request without awaiting an immediate response. In this model, responses may not be immediate and may arrive asynchronously on a separate channel, such as a callback queue. This mode of communication relies on protocols like the Advanced Message Queuing Protocol (AMQP) and messaging middleware, including message brokers or event brokers. This messaging middleware acts as an intermediary with minimal business logic. It receives messages from the source or producer service and then channels them to the intended consuming service. Integrating message middleware can significantly boost the resilience and fault tolerance of this decoupled approach. Asynchronous communication encompasses various implementations. Let us explore those further. One-To-One Communication In one-to-one message communication, a producer dispatches messages exclusively to a receiver using a message broker. Typically, the message broker relies on queues to ensure reliable communication and offer delivery guarantees, such as at least once. The implementation resembles a command pattern, where the delivered messages act as commands consumed by the subscriber service to trigger actions. Let us consider an example of an online retail shop to illustrate its use. An online business heavily depends on the reliability of its website. The pattern provides fault tolerance and message guarantees, ensuring that once a customer has placed an order on the website, the backend fulfillment systems receive the order to be processed. The message broker preserves the message even if the backend system is down and delivers it once the backend can process it. For instance, in an e-commerce application, when a customer places an order, the order details can be sent as a message from the order service (producer) to the fulfillment service (consumer) using a message broker. This is an example of one-to-one communication. Asynchronous One-to-One communication in the cloud
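To make the order example concrete, here is a minimal sketch of one-to-one messaging using RabbitMQ via the pika client. The queue name, connection details, and payload are illustrative assumptions; in practice, the producer and consumer would run as separate services.
Python
import json

import pika  # RabbitMQ client library

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)  # queue survives broker restarts

# Producer side: the order service publishes the order and moves on.
order = {"order_id": "1001", "sku": "SKU-12345", "quantity": 2}
channel.basic_publish(
    exchange="",
    routing_key="orders",
    body=json.dumps(order),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)

# Consumer side: the fulfillment service processes orders whenever it is running.
def handle_order(ch, method, properties, body):
    print("Fulfilling order:", json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)  # acknowledge for at-least-once delivery

channel.basic_consume(queue="orders", on_message_callback=handle_order)
channel.start_consuming()  # blocks and dispatches messages as they arrive
Because the broker persists the message, the order is not lost even if the fulfillment service is down when the customer places it; it is delivered once the consumer comes back online.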
An extension of the one-to-one message pattern is the asynchronous request-reply pattern. In this scenario, the dispatcher sends a message without expecting an immediate response. But in a few specific scenarios, the consumer must respond to the producing service, utilizing queues in the same message broker infrastructure. The response from the consumer may contain additional metadata, such as a correlation ID for the initial request or an address to send the reply to. Since the producer does not expect an immediate response, an independent producer workflow manages these replies. The fulfillment service (consumer) responds to the frontend order service (producer) once the order has been dispatched so that the customer can be updated on the website. Asynchronous One-to-One Request Reply communication in cloud The single-consumer communication comes in handy when two services communicate point to point. However, there could be scenarios when a publisher must send a particular event to multiple subscribers, which leads us to the following pattern. One-To-Many Communication This communication style is valuable when a single component (publisher) needs to broadcast an event to multiple components and services (subscribers). One-to-many communication uses the concept of a topic, which is analogous to an online forum: multiple users can post articles, and their followers can read them in their own time, responding as they see fit. Similarly, applications can have topics where producer services write to these topics, and consuming services can read from them. It is one of the most popular patterns in real-world applications. Consider again the e-commerce platform: if it has a service that updates product prices and multiple services need this information (like a subscription service, a recommendation service, etc.), the price update can be sent as a message to a topic in a message broker. All interested services (subscribers) can listen to this topic and receive the price update. This is an example of one-to-many communication. Several tools are available to implement this pattern, with Apache Kafka, Redis Pub/Sub, Amazon SNS, and Azure Event Grid ranking among the most popular choices. Asynchronous One-to-Many communication in cloud Challenges of Asynchronous Communication While asynchronous communication offers many benefits, it also introduces its own set of challenges. Resiliency and Fault Tolerance With numerous microservices and serverless components, each having multiple instances, failures are inevitable. Instances can crash, become overwhelmed, or experience transient faults. Moreover, the sender does not wait for the message to be processed, so it might not be immediately aware if an error occurs. We must adopt strategies like: Retry mechanisms: Retrying failed network calls for transient faults (a minimal sketch follows this section) Circuit Breaker pattern: Preventing repeated calls to failing services to avoid resource bottlenecks Distributed Tracing Asynchronous communication can span multiple services, making it challenging to monitor overall system performance. Implementing distributed tracing helps tie logs and metrics together to understand transaction flow. Complex Debugging and Monitoring Asynchronous communication can be more difficult to debug and monitor because operations do not follow a linear flow. Specialized tools and techniques are often required to effectively debug and monitor these systems. Resource Management Asynchronous systems often involve long-lived connections and background processing, which can lead to resource management challenges. Care must be taken to manage resources effectively to prevent memory leaks or excessive CPU usage.
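As a small illustration of the retry strategy listed above, here is the minimal retry-with-backoff sketch mentioned earlier. The operation, delay values, and attempt counts are illustrative assumptions, not a prescribed implementation.
Python
import random
import time

class TransientError(Exception):
    # Stand-in for a recoverable failure such as a timeout or dropped connection.
    pass

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    # Retry an operation on transient faults with exponential backoff and jitter.
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as error:
            if attempt == max_attempts:
                raise  # give up; a circuit breaker or alert should take over from here
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Hypothetical usage: wrap a flaky publish or network call.
def flaky_publish():
    if random.random() < 0.7:
        raise TransientError("broker temporarily unavailable")
    return "published"

print(retry_with_backoff(flaky_publish))
A circuit breaker complements this approach by short-circuiting calls once failures cross a threshold instead of retrying indefinitely.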
Understanding these challenges can help design more robust and resilient asynchronous communication systems in their cloud-native applications. Final Words The choice between synchronous and asynchronous communication patterns is not binary but rather a strategic decision based on the specific requirements of the application. Synchronous communication is easy to implement and provides immediate feedback, making it suitable for real-time data access, orchestrating dependent tasks, and maintaining transactional integrity. However, it comes with challenges such as temporal coupling, availability dependency, and network quality impact. On the other hand, asynchronous communication allows a service to initiate a request without waiting for an immediate response, enhancing the system’s responsiveness and scalability. It offers flexibility, making it ideal for scenarios where immediate feedback is not necessary. However, it introduces complexities in resiliency, fault tolerance, distributed tracing, debugging, monitoring, and resource management. In conclusion, designing robust and resilient communication systems for cloud-native applications requires a deep understanding of both synchronous and asynchronous communication patterns. By carefully considering the strengths and weaknesses of each pattern and aligning them with the requirements, architects can design systems that effectively coordinate independent elements, both within and beyond the application boundaries, to deliver high-performing, scalable, and reliable cloud-native applications.
Batch processing is a common need across varied machine learning use cases such as video production, financial modeling, drug discovery, or genomic research. The elasticity of the cloud provides efficient ways to scale and simplify batch processing workloads while cutting costs. In this post, you'll learn a scalable and cost-effective approach to configure AWS Batch Array jobs to process datasets that are stored on Amazon S3 and presented to compute instances with Amazon FSx for Lustre. To demonstrate this solution, we will create a sample machine learning batch processing workload. Our sample workload will run a random forest machine learning algorithm on an input dataset. Random forest is a supervised learning algorithm that builds multiple decision trees and merges them together to get a more accurate and stable prediction. We've trained a machine learning model that predicts customer behavior, based on a publicly available direct marketing dataset, and we provide the training script. You can download the dataset as a ZIP file. As our focus is on the scalability and cost-effectiveness of the batch processing approach, this post will not go into further details of the specific data or how to build a machine learning model. We will, however, provide the necessary model file and test dataset that can be used for a "real world" batch processing task you might deploy on AWS. Overview of the Solution The architecture we will create is depicted in Figure 1. Figure 1. The solution architecture shows AWS Batch managing EC2 Spot Instances that mount an Amazon FSx for Lustre shared filesystem to perform the analysis. Data are synced with Amazon S3, and container images are pulled from Amazon ECR registries. We will use Amazon S3 to store the input and final output datasets. The advantages of using Amazon S3 for data management are its scalability, security, object lifecycle management, and integrations with other AWS services. Using Amazon FSx for Lustre, we can access the input dataset on S3 from compute instances using normal POSIX file operations. We configure FSx for Lustre to import object metadata into the filesystem as objects are added to the S3 bucket. When applications access file data, FSx for Lustre fetches the S3 object data and saves it to the filesystem. Visit here for more information on FSx for Lustre. With the FSx for Lustre filesystem in place, we can access its data concurrently from multiple compute instances. We will use AWS Batch to manage those compute instances, as well as the batch processing job itself. AWS Batch enables developers, scientists, and engineers to easily and efficiently run batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. Our AWS CloudFormation template will create the following AWS Batch components: Compute environment: The Compute Environment will provision EC2 Spot Instances to process the workload. Spot Instances can provide up to a 90% discount compared to On-Demand Instance prices. We will use an instance type setting of "optimal", which instructs AWS Batch to select instances from the C, M, and R instance families to satisfy the demand of our Job Queue. We'll also specify a Launch Template to apply on instance startup. This Launch Template installs the Lustre client software and mounts the FSx for Lustre filesystem on the instance.
Job definition: The Job Definition specifies how Jobs are to be run. We configure the vCPU and memory to allocate and the Docker image repository location in Amazon ECR. We also specify the Docker volume and mount point used to mount the FSx for Lustre filesystem within the container and set the required environment variables. Job queue: We associate a Job Queue with the Compute Environment to enable us to submit Jobs for processing. Visit here for more information on AWS Batch and its components. Next, we'll create this architecture and submit a couple of test jobs. Walkthrough To create this architecture in your own AWS account, follow these steps. The steps we will perform are as follows: Create the AWS CloudFormation stack Upload input dataset to S3 Create and upload Docker image Submit sample AWS Batch jobs Create FSx for Lustre Data Repository export task The link to the GitHub repository containing the necessary scripts and templates is located here. To avoid unexpected charges, be sure to follow the clean-up procedures at the end of this post. Prerequisites For this walkthrough, you should have the following prerequisites: An AWS account The AWS CLI installed and configured Docker installed Create the CloudFormation Stack You can create the AWS CloudFormation stack using the console or the AWS CLI. The stack creation should take about 20 minutes to complete. To Create the Stack Using the Console Log in to the AWS CloudFormation console. Create the stack following these steps. Select the infra/template.yml file as the template to upload. Name the stack aws-hpc-blog-batch-processing. Once stack creation is complete, select the Outputs tab to view the identifiers for the resources that were created. These values will be used in the following steps. To Create the Stack Using the AWS CLI In a command line terminal, run the following command to create the stack: Shell aws cloudformation create-stack --stack-name aws-hpc-blog-batch-processing --template-body file://infra/template.yml --capabilities CAPABILITY_IAM Run the following command to wait for stack creation to complete: Shell aws cloudformation wait stack-create-complete --stack-name aws-hpc-blog-batch-processing Get the stack outputs to view the identifiers for the resources that were created. These values will be used in the following steps: Shell aws cloudformation describe-stacks --stack-name aws-hpc-blog-batch-processing --query "Stacks[].Outputs[]" --output text Upload Input Dataset to S3 We will add 100 copies of the test-data1.csv file to the S3 bucket created in the CloudFormation stack. Recall that the FSx for Lustre filesystem is configured to import the S3 object metadata into the Lustre filesystem as these objects are created. To Upload the Input Dataset to S3 Refer to the CloudFormation stack outputs to set a variable containing the value for <BucketName>: Shell BucketName=<BucketName from stack outputs> Upload the file: Shell aws s3 cp model/test-data1.csv s3://${BucketName}/input/ Create copies of the file: Shell for a in {2..100}; do aws s3 cp s3://${BucketName}/input/test-data1.csv s3://${BucketName}/input/test-data${a}.csv; done Create and Upload Docker Image Next, create the Docker image and upload it to Amazon ECR, where it will be accessible to the Compute Environment.
To Create and Upload the Docker Image Refer to the CloudFormation stack outputs to set a variable containing the value for <RepositoryUri>: Shell RepositoryUri=<RepositoryUri from stack outputs> Build the Docker image: Shell
cd model
docker build -t $RepositoryUri .
cd ..
Push the Docker image to ECR: Shell
aws ecr get-login-password | docker login --username AWS --password-stdin $RepositoryUri
docker push $RepositoryUri
Submit Sample AWS Batch Jobs Before we submit the first test job, let's review the Docker entry point bash script. This script demonstrates how each Array job worker selects its own discrete list of input files to process. It determines this list using the set of all input file names found on the FSx for Lustre filesystem, the number of workers assigned to the Array job, and the AWS Batch-provided AWS_BATCH_JOB_ARRAY_INDEX environment variable. Shell
#!/bin/bash -e

# Get sorted list of all input file names
SORTED_FILELIST=($(find $INPUT_DIR -type f | sort))

# Calculate number of files for this worker to process:
# ceiling(length(SORTED_FILELIST) / NUMBER_OF_WORKERS)
BATCH_SIZE=$(((${#SORTED_FILELIST[@]} + NUMBER_OF_WORKERS - 1) / NUMBER_OF_WORKERS))

# Select list of files for this worker to process
FILES_TO_PROCESS=(${SORTED_FILELIST[@]:$((AWS_BATCH_JOB_ARRAY_INDEX * BATCH_SIZE)):$BATCH_SIZE})

# Create worker output directory
WORKER_OUTPUT_DIR="${OUTPUT_DIR}/${AWS_BATCH_JOB_ID}"
mkdir -p $WORKER_OUTPUT_DIR

echo "job $(( AWS_BATCH_JOB_ARRAY_INDEX + 1 )) of ${NUMBER_OF_WORKERS}, processing ${#FILES_TO_PROCESS[@]} files"

for input_file in ${FILES_TO_PROCESS[@]}
do
  output_file="${WORKER_OUTPUT_DIR}/$(basename $input_file)"
  if [[ -f $output_file ]]
  then
    echo "output file $output_file already exists, skipping..."
    continue
  fi
  echo "processing $input_file"
  python predict.py --input_file $input_file --output_file $output_file
done
To Submit Array Job With Two Workers Using AWS CLI Run the following command to submit a job with two workers: Shell aws batch submit-job --cli-input-json file://test-2-workers.json Open the AWS Batch dashboard to view job status. In the left navigation pane, choose Jobs. For Job queue, select aws-hpc-blog-batch-processing-job-queue. Select the test-2-workers job. Select the Job index link to view worker details. Click on the Log stream name to view worker logs in CloudWatch. In our test, this job was completed in 6 minutes and 22 seconds. Since we used an input dataset of 100 files, each worker processed 50 files. The two workers altogether processed about 16 files per minute. To Submit Array Job With Ten Workers Using AWS CLI Run the following command to submit a job with ten workers: Shell aws batch submit-job --cli-input-json file://test-10-workers.json Open the AWS Batch dashboard to view job status. In the left navigation pane, choose Jobs. For Job queue, select aws-hpc-blog-batch-processing-job-queue. Select the test-10-workers job. Select the Job index link to view worker details. Click on the Log stream name to view worker logs in CloudWatch. In our test, this job was completed in 1 minute and 4 seconds. Since we used an input dataset of 100 files, each worker processed 10 files. The ten workers altogether processed about 94 files per minute. Table 1 has the job summary data.

Table 1. Summary of jobs and runtimes
Job              Workers  Input files  Files per worker  Total time  Files per minute
test-2-workers   2        100          50                6m 22s      16
test-10-workers  10       100          10                1m 4s       94
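The JSON files passed to submit-job above define, among other things, the array size (the number of workers). Their contents are not reproduced here, but the sketch below shows a rough boto3 equivalent of the two-worker submission. The job definition name and the NUMBER_OF_WORKERS override are assumptions for illustration; the actual names should come from the CloudFormation stack outputs.
Python
import boto3  # AWS SDK for Python

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="test-2-workers",
    jobQueue="aws-hpc-blog-batch-processing-job-queue",
    # Hypothetical job definition name; use the value from the stack outputs.
    jobDefinition="aws-hpc-blog-batch-processing-job-definition",
    arrayProperties={"size": 2},  # each worker receives its own AWS_BATCH_JOB_ARRAY_INDEX
    containerOverrides={
        # Assumes the entry point script reads NUMBER_OF_WORKERS from the environment.
        "environment": [{"name": "NUMBER_OF_WORKERS", "value": "2"}],
    },
)
print("Submitted job:", response["jobId"])
Scaling to ten workers is just a matter of changing the array size and the NUMBER_OF_WORKERS value, which is what makes this design simple to scale.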
Create FSx for Lustre Data Repository Export Task Now that the batch jobs have generated a set of predictions, let's copy them to S3 by running an FSx for Lustre Data Repository export task. To Create Data Repository Export Task Refer to the CloudFormation stack outputs to set a variable containing the value for <FSxFilesystemId>: Shell FSxFilesystemId=<FSxFilesystemId from stack outputs> Run the following command to create the export task: Shell aws fsx create-data-repository-task --type EXPORT_TO_REPOSITORY --file-system-id $FSxFilesystemId --report Enabled=false View the progress of the Data Repository task following this procedure. When complete, log in to the S3 console to view the output files on the S3 bucket at the FSx for Lustre filesystem's configured export path (output/). Cleaning Up To avoid incurring future charges, delete the resources using the following instructions: Follow this procedure to empty the S3 bucket created by the stack. Follow this procedure to delete the ECR repository created by the stack. Run the following command to delete the CloudFormation stack: Shell aws cloudformation delete-stack --stack-name aws-hpc-blog-batch-processing Conclusion In this post, we've demonstrated a batch inference approach using AWS Batch and Amazon FSx for Lustre to create scalable, cost-effective batch processing. We've shared a design that enables simple scaling of the number of concurrent jobs deployed to process a set of input files. To further enhance this process, you may consider wrapping it with a workflow orchestrator such as Amazon Managed Workflows for Apache Airflow or using the AWS Batch integration provided by AWS Step Functions. For analysis of batch inference results, consider cataloging them with AWS Glue for interactive querying with Amazon Athena.
Abhishek Gupta
Principal Developer Advocate,
AWS
Daniel Oh
Senior Principal Developer Advocate,
Red Hat
Pratik Prakash
Principal Solution Architect,
Capital One