Low-Code Development
Low code, no code, citizen development, AI automation, scalability — if you work in the tech world, it's likely that you have been encouraged to use tools in at least one of these spaces. And it's for a good reason as Gartner has projected that by 2025, 70% of applications developed within organizations will have been built using low- and/or no-code technologies. So does the practice live up to the hype? Year over year, the answer is a resounding "yes" as the industry continues to evolve. Organizations have an increased demand for more frequent application releases and updates, and with that comes the need for increased efficiencies. And this is where low-code and no-code development practices shine. Sprinkle AI automation into low- and no-code development, and the scalability opportunities are endless. This Trend Report covers the evolving landscape of low- and no-code development by providing a technical exploration of integration techniques into current development processes, the role AI plays in relation to low- and no-code development, governance, intelligent automated testing, and adoption challenges. In addition to findings from our original research, technical experts from the DZone Community contributed articles addressing important topics in the low code space, including scalability, citizen development, process automation, and much more. To ensure that you, the developer, can focus on higher priorities, this Trend Report aims to provide all the tools needed to successfully leverage low code in your tech stack.
Whether you are deploying to the cloud or setting up an AIOps pipeline, automation has simplified the setup, configuration, and installation of your deployments. Infrastructure as Code (IaC) plays an especially important role in provisioning that infrastructure: with IaC tools, you describe the desired configuration and state of your infrastructure in code. Popular IaC tools include Terraform, Pulumi, AWS CloudFormation, and Ansible; each offers different capabilities for automating the deployment and management of infrastructure in the cloud and on premises.

With the growing complexity of applications and a heightened focus on security in software development, tools such as SonarQube and Mend have become increasingly relevant. As explained in my previous article, SonarQube is a code analysis tool that helps developers ship high-quality code by spotting bugs and vulnerabilities across several programming languages. It integrates well with Continuous Integration/Continuous Deployment (CI/CD) pipelines, producing continuous feedback while enforcing coding standards. Mend (formerly WhiteSource) focuses on software composition analysis (SCA), helping organizations manage and secure their open-source components. It integrates well with IaC tools to improve the security posture of infrastructure deployments, automating vulnerability scanning and management for IaC code so that customers can catch security issues very early in the development cycle.

Terraform for Infrastructure as Code

Terraform is a HashiCorp-developed tool that enables developers and operations teams to define, provision, and manage infrastructure using a declarative language known as HashiCorp Configuration Language (HCL); HCL2 is the current version. Terraform is provider agnostic, so you can manage resources across several cloud platforms and services with a single tool. Some of Terraform's standout features include:

- Declarative syntax: You describe what you want, and Terraform figures out how to create it.
- Plan and apply workflow: Terraform's plan command shows what changes will be made before they are applied, reducing the risk of unintended modifications.
- State management: Terraform keeps track of your infrastructure's current state, which enables incremental changes and drift detection.
- Modularity: Reusable modules allow teams to standardize and share infrastructure elements across projects.

IaC Tools in the Ecosystem

Alongside Terraform, a number of other tools offer different capabilities depending on what users need and where they run their infrastructure:

- AWS CloudFormation: Designed specifically for AWS, it provides deep integration with AWS services but lacks multi-cloud support.
- Azure Resource Manager (ARM) templates: Similar to CloudFormation, but for Azure resources
- Google Cloud Deployment Manager: Google Cloud's native IaC solution
- Pulumi: Allows developers to use familiar programming languages like Python, TypeScript, and Go to define infrastructure (see the sketch after this list)
- Ansible: While primarily a configuration management tool, Ansible can also be used for infrastructure provisioning.
- Chef and Puppet: Configuration management tools that can be extended for infrastructure provisioning
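Since Pulumi expresses the same declarative model in a general-purpose language, here is a minimal, illustrative sketch in Python (not taken from this article). It assumes an AWS account, the pulumi and pulumi_aws packages, and an initialized Pulumi stack; the resource and tag names are hypothetical.

```python
import pulumi
import pulumi_aws as aws

# Declare the desired state: a private S3 bucket for build artifacts.
# Pulumi's engine compares this declaration against the stack's recorded
# state and computes a plan (preview) before applying any changes.
artifact_bucket = aws.s3.Bucket(
    "artifact-bucket",  # logical resource name (hypothetical)
    acl="private",
    tags={"managed-by": "pulumi", "env": "dev"},
)

# Export an output so other stacks or operators can look it up later.
pulumi.export("artifact_bucket_name", artifact_bucket.id)
```

Running `pulumi preview` and `pulumi up` against a program like this mirrors Terraform's plan-and-apply workflow: the preview shows the proposed changes, and the update applies them while recording state so drift can be detected later.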
Enhancing Security With Mend

With the growth of IaC adoption, the demand for better security management also grows. This is where Mend comes in, providing a robust solution for scanning and securing IaC code. Mend incorporates smoothly into the development process and continuously scans Terraform and other IaC tools. The following are some ways Mend boosts security without compromising productivity:

- Automated scanning: Mend scans your IaC code automatically for vulnerabilities, misconfigurations, and compliance issues.
- Early detection: Integrated with CI/CD pipelines, Mend spots security vulnerabilities early in the development phase, reducing the cost and effort of fixing them later.
- Custom policies: Teams can define custom security policies to meet their specific needs and compliance requirements.
- Remediation guidance: When a problem is detected, Mend provides clear instructions on how to rectify it, helping developers address security concerns promptly.
- Compliance mapping: Identified issues are mapped by Mend to the requirements of different standards and regulations so that organizations can maintain compliance.
- Continuous monitoring: Even after deployment, Mend continues to monitor your infrastructure for new vulnerabilities or drift from secure configurations.
- Integration with DevOps tools: Mend integrates with popular version control systems, CI/CD platforms, and ticketing systems, making it part of existing workflows.

This proactive approach allows teams that adopt Mend in their IaC practices to move fast and innovate while significantly reducing the risk of security breaches, misconfigurations, and compliance violations. Along with Terraform, Mend supports the following IaC environments and their configuration files: Bicep, CloudFormation, Kubernetes, ARM templates, Serverless, and Helm.

Integrate Mend With GitHub

Mend provides several integration options and tools that GitHub users can use to drive security and vulnerability management in their repositories.

Overview of Mend's Presence on GitHub

Mend for GitHub.com App: This GitHub App has both SCA and SAST capabilities. It can be installed directly from the GitHub Marketplace for easy integration with your repositories.

Mend Bolt: Mend Bolt scans repositories for vulnerabilities in open-source components. It is available free of charge as an app on the GitHub Marketplace, supports over 200 programming languages, and offers the following features:

- Scanning: Runs automatically after every push, detects vulnerabilities in open-source libraries, and has a five-scan-per-day limit per repository.
- Opens issues for vulnerable open-source libraries
- Dependency tree management, along with visualization of dependency trees
- Checks for suggested fixes for vulnerabilities
- Integration with GitHub Checks stops pull requests with new vulnerabilities from getting merged.

Mend Toolkit: Mend maintains a GitHub organization, "mend-toolkit", containing various repositories that host integration knowledge bases, implementation examples, and tools. This includes:

- Mend implementation examples
- Mend SBOM Exporter CLI
- Parsing scripts for YAML files
- Import tools for SPDX or CSV SBOMs into Mend

Mend Examples Repository: Under the mend-toolkit organization, there is a "mend-examples" repository with examples of several scanning and result-pulling techniques in Mend.
This includes, among other things:

- SCM integration
- Self-hosted repo setup integration
- CI/CD integration
- Examples of policy checks
- Mend scan prioritization by language
- Mend SAST and Mend SCA implementations

Set Up Mend for GitHub

In this article, you will learn how to set up Mend Bolt.

1. Install the Mend App

Go to the GitHub Marketplace, click "Install," and select the repositories you want to scan. After selecting the repositories, click Install and complete the authorization.

2. Complete the Mend Registration

You will be redirected to the Mend registration page. If you are a new Mend user, complete the registration and click Submit.

3. Merge the Configuration Pull Request

Mend automatically creates a pull request (PR) in your repository. This PR adds a .whitesource configuration file. Review the PR and merge it to initiate your first scan.

4. Customize Scan Settings

Open the .whitesource file in your repository and modify the settings as needed. The key setting that enables IaC scans is enableIaC: true.

```json
{
  "scanSettings": {
    "enableIaC": true,
    "baseBranches": ["main"]
  },
  "checkRunSettings": {
    "vulnerableCheckRunConclusionLevel": "failure",
    "displayMode": "diff",
    "useMendCheckNames": true
  },
  "issueSettings": {
    "minSeverityLevel": "LOW",
    "issueType": "DEPENDENCY"
  }
}
```

Check the other configuration options (Configure Mend for GitHub.com for IaC). Note: IaC scans can only be performed on base branches.

```json
{
  "scanSettings": {
    "enableIaC": true,
    "baseBranches": ["main"]
  },
  "checkRunSettings": {
    "useMendCheckNames": true,
    "iacCheckRunConclusionLevel": "failure"
  }
}
```

Commit the changes to update your scan configuration.

5. Monitor and Review Results

Mend now scans your repository on each push (limited to 5 scans per day per repository for Mend Bolt). Check the "Issues" tab in your GitHub repository for vulnerability reports, and review the Mend dashboard for a comprehensive overview of your security status.

6. Remediate Issues

For each vulnerability, Mend provides detailed information and suggested fixes. Create pull requests to update vulnerable dependencies based on Mend's recommendations.

7. Continuous Monitoring

Regularly review Mend scan results and GitHub issues, and keep your .whitesource configuration file updated as your security needs evolve.

You have now integrated Mend with GitHub, enabling automated security scanning and vulnerability management for your repositories. Along with GitHub, Mend supports GitHub Enterprise, GitLab, Bitbucket, and other platforms; you can find the full list in the Mend documentation.

Conclusion

The power of IaC tools like Terraform, combined with robust security solutions such as Mend, puts infrastructure management on a very strong footing. These technologies and best practices help keep organizations safe while ensuring adaptability and scalability for modern, fast-moving digital environments. The importance of integrating security throughout the whole life cycle of our infrastructure cannot be overemphasized as we continue raising the bar on what is possible with infrastructure automation. Additional best practices, such as version control, modularization, appropriate access permissions, and auditing your code for compliance, provide added security for your IaC code.
The first lie detector that relied on eye movement appeared in 2014. The Converus team, together with Dr. John C. Kircher, Dr. David C. Raskin, and Dr. Anne Cook, launched EyeDetect — a brand-new solution to detect deception quickly and accurately. This event became a turning point in the polygraph industry. In 2021, we finished working on a contactless lie detection technology based on eye-tracking and presented it at the International Scientific and Practical Conference. As I was part of the developers' team, in this article I would like to share some insights into how we worked on the creation of the new system, particularly how we chose our backend stack.

What Is a Contactless Lie Detector and How Does It Work?

We created a multifunctional hardware and software system for contactless lie detection. This is how it works: the system tracks a person's psychophysiological reactions by monitoring eye movements and pupil dynamics and automatically calculates the final test results. Its software consists of three applications:

- Administrator application: Allows the creation of tests and the administration of processes
- Operator application: Enables scheduling test dates and times, assigning tests, and monitoring the testing process
- Respondent application: Allows users to take tests using a special code

On the computer screen, along with simultaneous audio (either synthesized or pre-recorded by a specialist), the respondent is given instructions on how to take the test. This is followed by written true/false statements based on developed testing methodologies. The respondent reads each statement and presses the "true" or "false" key according to their assessment of the statement's relevance. After half a second, the computer displays the next statement. The lie detector then measures response time and error frequency, extracts characteristics from recordings of eye position and pupil size, and calculates the significance of the statement, or the "probability of deception." To make this more visual, here is a comparison of the traditional polygraph and the contactless lie detector.

| Criteria | Classic Polygraph | Contactless Lie Detector |
|---|---|---|
| Working principle | Registers changes in GSR, cardiovascular, and respiratory activity to measure emotional arousal | Registers involuntary changes in eye movements and pupil diameter to measure cognitive effort |
| Duration | Tests take from 1.5 to 5 hours, depending on the type of examination | Tests take from 15 to 40 minutes |
| Report time | From 5 minutes to several hours; written reports can take several days | Test results and reports are generated automatically in less than 5 minutes |
| Accuracy | Screening test: 85%; Investigation: 89% | Screening test: 86-90%; Investigation: 89% |
| Sensor contact | Sensors are placed on the body, some of which cause discomfort, particularly the two pneumatic tubes around the chest and the blood pressure cuff | No sensors are attached to the person |
| Objectivity | Specialists interpret changes in responses and can influence the result; manual evaluation of polygraphs requires training and is a potential source of errors | Automated testing process ensuring maximum reliability and objectivity; AI evaluates responses and generates a report |
| Training | Specialists undergo 2 to 10 weeks of training, plus regular advanced training courses | Standard operator training takes less than 4 hours; administrator training for creating tests takes 8 hours; remote training with a qualification exam |
As you can see, our lie detector made the process more comfortable and convenient compared to traditional lie detectors. First of all, the tests take less time, from 15 to 40 minutes. Besides, one can get the results almost immediately: they are generated automatically within minutes. Another advantage is that there are no physically attached sensors, which can be especially uncomfortable in an already stressful environment. Operator training is also less time-consuming. Most importantly, the credibility of the results remains very high.

Backend Stack Choice

Our team had experience with Python and asyncio. Previously, we developed projects using Tornado, but at that time FastAPI was gaining popularity, so this time we decided to use Python with FastAPI and SQLAlchemy (with asynchronous support). To complement our choice of a popular backend stack, we decided to host our infrastructure on virtual machines using Docker.

Avoiding Celery

Given the nature of our lie detector, several mathematical operations require time to complete, making real-time execution during HTTP requests impractical. We therefore developed multiple background tasks. Although Celery is a popular framework for such tasks, we opted to implement our own task manager. This decision stemmed from our use of CI/CD, where we restart various services independently; services could sometimes lose their connection to Redis during these restarts. Our custom task manager, which extends the base aioredis library, ensures reconnection if a connection is lost.

Background Tasks Architecture

At the project's outset we had a few background tasks, and their number grew as functionality expanded. Some tasks were interdependent and required sequential execution. Initially, we used a queue manager in which each task, upon completion, would trigger the next task via a message queue. However, asynchronous execution could lead to data issues due to the varying execution speeds of related tasks. We then replaced this with a task manager that uses gRPC to call related tasks, ensuring execution order and resolving data dependencies between tasks.

Logging

We couldn't use popular bug-tracking systems like Sentry for a few reasons. First, we didn't want to use any third-party services managed and deployed outside of our infrastructure, so we were limited to a self-hosted Sentry. At that time, we only had one dedicated server divided into multiple virtual servers, and there weren't enough resources for Sentry. Additionally, we needed to store not only bugs but also all information about requests and responses, which called for Elastic. Thus, we chose to store logs in Elasticsearch. However, memory leak issues led us to switch to Prometheus and Typesense. Maintaining backward compatibility between Elasticsearch and Typesense was a priority for us, as we were still determining whether the new setup would meet our needs. This decision worked quite well, and we saw improvements in resource usage. The main reason for switching from Elastic to Typesense was resource usage: Elastic often requires a huge amount of memory, which is never sufficient — a common problem discussed in various forums, such as this one. Since Typesense is developed in C++, it requires considerably fewer resources.

Full-Text Search (FTS)

Using PostgreSQL as our main database, we needed an efficient FTS mechanism. Based on previous experience, PostgreSQL's built-in tsquery and tsvector could have performed better with Cyrillic text. Thus, we decided to synchronize PostgreSQL with Elasticsearch. While not the fastest solution, it provided enough speed and flexibility for our needs.
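As an illustration only (this is not the project's actual code), here is a minimal sketch of one common way to keep an Elasticsearch index in sync with a PostgreSQL table: periodically pull rows updated since the last sync and upsert them into the index. The table, index, and column names are hypothetical, and it assumes the asyncpg and elasticsearch Python packages.

```python
import asyncio
from datetime import datetime, timezone

import asyncpg
from elasticsearch import AsyncElasticsearch

PG_DSN = "postgresql://app:secret@localhost:5432/liedetector"  # hypothetical DSN
ES_URL = "http://localhost:9200"
INDEX = "test_sessions"  # hypothetical index name


async def sync_once(last_sync: datetime) -> datetime:
    """Copy rows updated since last_sync from PostgreSQL into Elasticsearch."""
    pg = await asyncpg.connect(PG_DSN)
    es = AsyncElasticsearch(ES_URL)
    try:
        rows = await pg.fetch(
            "SELECT id, respondent_name, report_text, updated_at "
            "FROM test_sessions WHERE updated_at > $1",
            last_sync,
        )
        for row in rows:
            # Upsert each changed row so Elasticsearch serves the full-text queries.
            await es.index(
                index=INDEX,
                id=str(row["id"]),
                document={
                    "respondent_name": row["respondent_name"],
                    "report_text": row["report_text"],
                    "updated_at": row["updated_at"].isoformat(),
                },
            )
        return datetime.now(timezone.utc)
    finally:
        await es.close()
        await pg.close()


if __name__ == "__main__":
    asyncio.run(sync_once(datetime(1970, 1, 1, tzinfo=timezone.utc)))
```

In practice, a job like this could be driven by the background task manager described above, or replaced with change-data-capture via logical replication instead of polling.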
PDF Report Generation

As you may know, generating PDFs in Python can be quite complicated. The issue is fairly common: to generate a PDF in Python, you typically create an HTML file first and only then convert it to PDF, much as in other languages, and this conversion can produce unpredictable artifacts that are difficult to debug. Generating PDFs with JavaScript, meanwhile, is much easier. We used Puppeteer to create an HTML page and then save it as a PDF, just as we would in a browser, avoiding these problems altogether.

To Conclude

In conclusion, I would like to stress that this project turned out to be demanding in terms of choosing the right solutions, but at the same time it was more than rewarding. We received numerous unconventional customer requests that often challenged standard rules and best practices. The most exciting part of the journey was implementing mathematical models developed by another team into the backend architecture and designing a database architecture to handle a vast amount of unique data. It made me realize once again that popular technologies and tools are not always the best option for every case. We always need to explore different methodologies and remain open to unconventional solutions for common tasks.
When it comes to data integration, some people may wonder what there is to discuss — isn't it just ETL? That is, extracting from various databases, transforming, and ultimately loading into different data warehouses. However, with the rise of big data, data lakes, real-time data warehouses, and large-scale models, the architecture of data integration has evolved from the ETL of the data warehouse era to the ELT of the big data era and now to the current stage of EtLT. In the global tech landscape, emerging EtLT companies like Fivetran, Airbyte, and Matillion have appeared, while giants like IBM have invested $2.3 billion in acquiring StreamSets and webMethods to upgrade their product lines from ETL to EtLT (DataOps). Whether you're a data professional or a manager, it's crucial to re-evaluate the recent changes and future trends in data integration.

Chapter 1: From ETL to ELT, to EtLT

When it comes to data integration, many in the industry may think it's just about ETL. However, with the rise of big data, data lakes, real-time data warehouses, and large-scale models, the architecture of data integration has evolved from the ETL of the data warehouse era to the ELT of the big data era and now to EtLT. Globally, emerging EtLT companies like Fivetran, Airbyte, and Matillion have been established, while established players like IBM are upgrading their product lines from ETL to EtLT (DataOps) with offerings such as StreamSets and webMethods. Whether you're a manager in an enterprise or a professional in the data field, it's essential to re-examine the recent changes and future trends in data integration.

ETL Architecture

Most experts in the data field are familiar with the term ETL. During the heyday of data warehousing, ETL tools like IBM DataStage, Informatica, Talend, and Kettle were popular. Some companies still use these tools to extract data from various databases, transform it, and load it into different data warehouses for reporting and analysis. The pros and cons of the ETL architecture are as follows:

Advantages of the ETL architecture: data consistency and quality, integration of complex data sources, clear technical architecture, implementation of business rules

Disadvantages of the ETL architecture: lack of real-time processing, high hardware costs, limited flexibility, maintenance costs, limited handling of unstructured data

ELT Architecture

With the advent of the big data era, facing the challenges of ETL's inability to load complex data sources and its poor real-time performance, a variant of the ETL architecture, ELT, emerged. Companies started using ELT tools provided by various data warehousing vendors, such as Teradata's BTEQ/FastLoad/TPT and Hadoop Hive's Apache Sqoop. The ELT architecture loads data directly into data warehouses or big data platforms without complex transformations and then uses SQL or H-SQL to process the data. The pros and cons of the ELT architecture are as follows:

Advantages of the ELT architecture: handling large data volumes, improved development and operational efficiency, cost-effectiveness, flexibility and scalability, integration with new technologies

Disadvantages of the ELT architecture: limited real-time support, high data storage costs, data quality issues, dependence on target system capabilities

EtLT Architecture

With the popularity of data lakes and real-time data warehouses, the weaknesses of the ELT architecture in real-time processing and handling unstructured data have been highlighted. Thus, a new architecture, EtLT, has emerged.
The EtLT architecture enhances ELT by adding real-time data extraction from sources like SaaS, Binlog, and cloud components, as well as small-scale transformations before loading the data into the target storage. This trend has led to the emergence of several specialized companies worldwide, such as StreamSets, Attunity (acquired by Qlik), Fivetran, and SeaTunnel from the Apache Foundation. The pros and cons of the EtLT architecture are as follows:

Advantages of the EtLT architecture: real-time data processing, support for complex data sources, cost reduction, flexibility and scalability, performance optimization, support for large models, data quality and governance

Disadvantages of the EtLT architecture: technical complexity, dependence on target system capabilities, management and monitoring challenges, increased data change management complexity, dependency on tools and platforms

Overall, in recent years, with the rise of data lakes, real-time data warehouses, and large models, the EtLT architecture has gradually become mainstream worldwide in the field of data integration. For the historical details, you can refer to the article "ELT is dead, and EtLT will be the end of modern data processing architecture." Against this overarching trend, let's interpret the maturity model of the entire data integration track. Overall, there are four clear trends:

1. As ETL evolves into EtLT, the focus of data integration has shifted from traditional batch processing to real-time data collection and batch-stream unified data integration. The hottest scenarios have shifted from single-database batch integration to hybrid cloud, SaaS, and multiple data sources integrated in a batch-stream manner.
2. Complex data transformation has gradually shifted from traditional ETL tools into the data warehouse. At the same time, support for automatic schema changes (schema evolution) when DDL (field definitions) change during real-time data integration has begun, and even adapting to DDL changes in lightweight transformations has become a trend.
3. Support for data source types has expanded from files and traditional databases to emerging data sources, open-source big data ecosystems, unstructured data systems, cloud databases, and large models. These are the most common scenarios encountered in every enterprise, and in the future, real-time data warehouses, lakes, clouds, and large models will be used in different scenarios within each enterprise.
4. In terms of core capabilities and performance, diversity of data sources, high accuracy, and ease of troubleshooting are the top priorities for most enterprises, whereas capabilities such as high throughput and extreme real-time performance receive less scrutiny.

With data virtualization, DataFabric, and ZeroETL also mentioned in the report, let's delve into the interpretation of the data integration maturity model below.

Chapter 2: Data Integration Maturity Model Interpretation

Data Production

The data production segment refers to how data is obtained, distributed, transformed, and stored within the context of data integration. This part carries the greatest workload and challenges in integrating data. When users in the industry use data integration tools, their primary consideration is whether the tools support integration with their databases, cloud services, and SaaS systems.
If these tools do not support the user's proprietary systems, additional costs are incurred for customizing interfaces or exporting data into compatible files, which can pose challenges to the timeliness and accuracy of data.

Data Collection

Most data integration tools now support batch collection, rate limiting, and HTTP collection. However, real-time data acquisition (CDC) and DDL change detection are still in their growth and popularity stages. In particular, the ability to handle DDL changes in source systems is crucial: real-time data processing is often interrupted by changes in source system structures. Effectively addressing the technical complexity of DDL changes remains a challenge, and various industry vendors are still exploring solutions.

Data Transformation

With the gradual decline of ETL architectures, complex business processing (e.g., Join, Group By) within integration tools has gradually faded into history. Especially in real-time scenarios, there is limited memory available for operations like stream window joins and aggregation. Therefore, most ETL tools are migrating toward ELT and EtLT architectures. Lightweight data transformation using SQL-like languages has become mainstream, allowing developers to perform data cleaning without having to learn various data integration tools. Additionally, the integration of data content monitoring and DDL change transformation processing, combined with notifications, alerts, and automation, is making data transformation a more intelligent process.

Data Distribution

Traditional JDBC loading, HTTP, and bulk loading have become essential features of every mainstream data integration tool, with competition focusing on the breadth of data source support. Automated DDL changes reduce developers' workload and ensure the smooth execution of data integration tasks. Various vendors employ their own methods to handle complex scenarios where data table definitions change. Integration with large models is emerging as a new trend, allowing internal enterprise data to interface with large models, though it is currently the domain of enthusiasts in some open-source communities.

Data Storage

Next-generation data integration tools come with caching capabilities. Previously, this caching existed locally, but now distributed storage and distributed checkpoint/snapshot technologies are used. Effective utilization of cloud storage is also becoming a new direction, especially in scenarios involving large data caches that require data replay and recording.

Data Structure Migration

This part deals with whether automatic table creation and inspection can be performed during the data integration process. Automatic table creation involves automatically creating tables/data structures in the target system that are compatible with those in the source system, which significantly reduces the workload of data development engineers. Automatic schema inference is a more complex scenario: in the EtLT architecture, in the event of real-time DDL changes or changes in data fields, automatic inference of their validity allows users to identify issues with data integration tasks before they run. The industry is still in the experimentation phase in this aspect.

Computational Model

The computational model evolves with the changing landscape of ETL, ELT, and EtLT.
It has transitioned from emphasizing computation in the early stages to focusing on transmission in the middle stages, and now emphasizes lightweight computation during real-time transmission:

Offline Data Synchronization

This has become the most basic data integration requirement for every enterprise. However, performance varies under different architectures: overall, ETL-architecture tools have much lower performance than ELT and EtLT tools under large-scale data conditions.

Real-Time Data Synchronization

With the popularity of real-time data warehouses and data lakes, real-time data synchronization has become an essential factor for every enterprise to consider when integrating data, and more and more companies are beginning to use it.

Batch-Streaming Integration

New-generation data integration engines are designed from the outset with batch-stream unification in mind, providing more effective synchronization methods for different enterprise scenarios. In contrast, most traditional engines were designed to focus on either real-time or offline scenarios, resulting in poor performance for batch data synchronization. Unified batch and streaming performs better in data initialization and hybrid batch-stream environments.

Cloud-Native

Overseas data integration tools are more aggressive in this aspect because they are billed on a pay-as-you-go basis, so the ability to quickly obtain and release responsive computing resources for each task is the core competitiveness and profit source for every vendor. In contrast, progress on cloud-native big data integration in China is still relatively slow, so it remains a subject of exploration for only a few companies domestically.

Data Types and Typical Scenarios

File Collection

This is a basic feature of every integration tool. However, unlike in the past, apart from standard text files, the collection of data in formats like Parquet and ORC has become standard.

Big Data Collection

With the popularity of emerging data sources such as Snowflake, Redshift, Hudi, Iceberg, ClickHouse, Doris, and StarRocks, traditional data integration tools lag significantly in this regard. Users in China and the United States are generally at the same level in terms of big data usage, which requires vendors to adapt to these emerging data sources.

Binlog Collection

This is a burgeoning industry in China, as Binlog collection has replaced traditional tools like DataStage and Informatica during the process of informatization. However, the replacement of databases like Oracle and DB2 has not been as rapid, resulting in a large number of specialized Binlog data collection companies emerging overseas to solve CDC problems.

Informatization Data Collection

This is a scenario unique to China. With the process of informatization, numerous domestic databases have emerged. Whether a tool can adapt to both batch and real-time collection from these databases presents a higher challenge for Chinese vendors.

Sharding

In most large enterprises, sharding is commonly used to reduce the pressure on databases. Therefore, support for sharding has become a standard feature of professional data integration tools.

Message Queues

Driven by data lakes and real-time data warehouses, everything related to real-time is booming. Message queues, as the representatives of enterprise real-time data exchange centers, have become indispensable options for advanced enterprises.
Whether data integration tools support a sufficient number of memory/disk message queue types has become one of the hottest features.

Unstructured Data

Unstructured and semi-structured data sources such as MongoDB and Elasticsearch have become essential for enterprises, and data integration tools support such sources correspondingly.

Big Model Data

Numerous startups worldwide are working on quickly connecting enterprise data and large datasets with each other.

SaaS Integration

This is a very popular feature overseas but has yet to generate significant demand in China.

Data Unified Scheduling

Integrating data integration with scheduling systems, especially coordinating real-time data through scheduling systems with downstream data warehouse tasks, is essential for building real-time data warehouses.

Real-Time Data Warehouse/Data Lake

These are currently the most popular scenarios for enterprises. Real-time data entry into warehouses and lakes enables the advantages of next-generation data warehouses and lakes to be realized.

Data Disaster Recovery Backup

With the enhancement of real-time capabilities and CDC support in data integration, integration tools have entered the traditional disaster recovery field, and some data integration and disaster recovery vendors have begun to work in each other's areas. However, due to significant differences in detail between disaster recovery and integration scenarios, vendors penetrating each other's domains may lack functionality and require iterative improvements over time.

Operation and Monitoring

In data integration, operation and monitoring are essential functionalities. Effective operation and monitoring significantly reduce the workload of system operation and development personnel when data issues occur.

Flow Control

Modern data integration tools control traffic from multiple angles, such as task parallelism, single-task JDBC parallelism, and single-JDBC read volume, ensuring minimal impact on source systems.

Task/Table-Level Statistics

Task-level and table-level synchronization statistics are crucial for operations and maintenance personnel managing data integration processes.

Step-by-Step Trial Run

Because of support for real-time data, SaaS, and lightweight transformation, running a complex data flow directly has become more complicated. Therefore, some advanced vendors have introduced step-by-step trial run functionality for efficient development and operation.

Table Change Event Capture

This is an emerging feature in real-time data processing, allowing users to make changes or send alerts in a predefined manner when table changes occur in the source system, thereby maximizing the stability of real-time data.

Batch-Stream Integrated Scheduling

After real-time CDC and stream processing, integration with traditional batch data warehouse tasks is inevitable. However, ensuring accurate startup of batch jobs without affecting data stream operation remains a challenge; this is why integration and batch-stream unified scheduling are related.

Intelligent Diagnosis/Tuning/Resource Optimization

In cluster and cloud-native scenarios, effectively utilizing existing resources and recommending correct solutions when problems occur are hot topics among the most advanced data integration companies. However, achieving production-level intelligent applications may take some time.

Core Capabilities

There are many important functionalities in data integration, but the following points are the most critical. The lack of these capabilities may have a significant impact during enterprise usage.
Full/Incremental Synchronization

Separate full/incremental synchronization has become a necessary feature of every data integration tool. However, the automatic switch from full to incremental mode is not yet widespread among small and medium-sized vendors and still requires manual switching by users.

CDC Capture

As enterprise demands for real-time data increase, CDC capture has become a core competitive advantage of data integration. Support for CDC from multiple data sources, the prerequisites it imposes, and its impact on source databases often become the core competitiveness of data integration tools.

Data Diversity

Supporting multiple data sources has become a "red ocean competition" in data integration tools. Better support for users' existing system data sources often leads to a more advantageous position in business competition.

Checkpoint Resumption

Whether real-time and batch data integration supports checkpoint resumption helps quickly recover from erroneous data in many scenarios and assists recovery in some exceptional cases. However, only a few tools currently support this feature.

Concurrency/Rate Limiting

Data integration tools need to be highly concurrent when speed is required and to effectively reduce the impact on source systems when throttled. This has become a necessary feature of integration tools.

Multi-Table Synchronization/Whole-Database Migration

This refers not only to convenient selection in the interface but also to whether JDBC connections or existing integration tasks can be reused at the engine level, thereby making better use of existing resources and completing data integration quickly.

Performance Optimization

In addition to core capabilities, performance often determines whether users need more resources and whether the hardware and cloud costs of data integration tools are low enough. However, extreme performance is currently unnecessary, and it is often considered the third factor after interface support and core capabilities.

Timeliness

Minute-level integration has gradually exited the stage of history, and support for second-level data integration has become a very popular feature. However, millisecond-level data integration scenarios are still relatively rare, mostly appearing in special disaster recovery scenarios.

Data Scale

Most scenarios currently involve TB-level data integration, while PB-level data integration is implemented with open-source tools by Internet giants. EB-level data integration will not appear in the short term.

High Throughput

High throughput mainly depends on whether integration tools can effectively utilize network and CPU resources to approach the theoretical maximum of data integration. In this regard, tools based on ELT and EtLT have obvious advantages over ETL tools.

Distributed Integration

Dynamic fault tolerance is more important than dynamic scaling and cloud-native capabilities. The ability of a large data integration task to automatically tolerate hardware and network failures is a basic function for large-scale data integration; scalability and cloud-native support are derived requirements in this scenario.

Accuracy

How data integration ensures consistency is a complex task. In addition to using multiple technologies to ensure "exactly once" delivery, CRC verification is performed. Third-party data quality inspection tools are also needed rather than just "self-certification." Therefore, data integration tools often cooperate with data scheduling tools to verify data accuracy.
Stability

Stability is the result of multiple functions. Ensuring the stability of individual tasks matters for availability, task isolation, data isolation, permissions, and encryption control: when problems occur in a single task or department, they should not affect other tasks and departments.

Ecology

Excellent data integration tools have a large ecosystem that supports synchronization with multiple data sources and integration with upstream and downstream scheduling and monitoring systems. Moreover, tool usability is also an important indicator, as it affects enterprise personnel costs.

Chapter 3: Trends

In the coming years, as the EtLT architecture proliferates, many new scenarios will emerge in data integration, while data virtualization and DataFabric will also have significant impacts on its future:

Multi-Cloud Integration

This is already widespread globally, with most data integration tools offering cross-cloud integration capabilities. In China, due to the limited prevalence of public clouds, this aspect is still in the early incubation stage.

ETL Integration

As the ETL architecture enters its decline, most enterprises will gradually migrate from tools like Kettle, Informatica, and Talend to emerging EtLT architectures, thereby supporting batch-stream unified data integration and more emerging data sources.

ELT

Currently, most mainstream big data architectures are based on ELT. With the rise of real-time data warehouses and data lakes, ELT-related tools will gradually upgrade to EtLT tools, or add real-time EtLT capabilities to compensate for the lack of real-time data support in ELT architectures.

EtLT

Globally, companies like JPMorgan, Shein, and Shopee are adopting the EtLT architecture. More companies will integrate their internal data integration tools into the EtLT architecture, combined with batch-stream unified scheduling systems, to meet enterprise DataOps-related requirements.

Automated Governance

With the increase in data sources and real-time data, traditional governance processes cannot meet the timeliness requirements of real-time analysis. Automated governance will gradually rise within enterprises over the next few years.

Big Model Support

As large models penetrate enterprise applications, providing data to large models becomes a necessary capability of data integration. Traditional ETL and ELT architectures are relatively ill-suited to real-time, large-batch data scenarios, so the EtLT architecture will deepen its penetration into most enterprises along with the popularization of large models.

ZeroETL

This is a concept proposed by Amazon, suggesting that data stored on S3 can be accessed directly by various engines without the need for ETL between them. In a sense, if the data scenario is not complex and the data volume is small, a small number of engines can meet OLAP and OLTP requirements. However, due to limited scenario support and poor performance, it will take some time for more companies to recognize this approach.

DataFabric

Many companies currently propose using DataFabric metadata to manage all data, eliminating the need for ETL/ELT during queries and directly accessing underlying data. This technology is still in the experimental stage, with significant challenges in query response and scenario adaptation. It can meet the needs of simple scenarios with small data queries, but for complex big data scenarios, the EtLT architecture will still be necessary for the foreseeable future.
Data Virtualization

The basic idea is similar to the execution layer of DataFabric: data does not need to be moved; instead, it is queried directly through ad hoc query interfaces and compute engines (e.g., Presto, Trino) that translate queries against the underlying data storage or data engines. However, with large amounts of data, engine query efficiency and memory consumption often fail to meet expectations, so this approach is only used in scenarios with small amounts of data.

Conclusion

From an overall trend perspective, with the explosive growth of global data, the emergence of large models, and the proliferation of data engines for various scenarios, the rise of real-time data has brought data integration back to the forefront of the data field. If data is considered a new energy source, then data integration is like the pipeline for this new energy: the more data engines there are, the higher the efficiency, data source compatibility, and usability requirements of the pipeline will be. Although data integration will eventually face challenges from ZeroETL, data virtualization, and DataFabric, in the foreseeable future the performance, accuracy, and ROI of these technologies have yet to reach the level that would make them as widespread as data integration; otherwise, the most popular data engines in the United States would not be Snowflake or Delta Lake but Trino. Of course, I believe that in the next 10 years, in a world of DataFabric combined with large models, virtualization + EtLT + data routing may be the ultimate solution for data integration. In short, as long as data volume grows, the pipelines between data systems will always exist.

Chapter 4: How To Use the Data Integration Maturity Model

Firstly, the maturity model provides a comprehensive view of current and potential future technologies that may be utilized in data integration over the next 10 years. It offers individuals insight into personal skill development, assists enterprises in designing and selecting appropriate technological architectures, and guides key development areas within the data integration industry.

For enterprises, technology maturity aids in assessing the level of investment in a particular technology. A mature technology is likely to have been in use for many years, supporting business operations effectively; however, as its advancement reaches a plateau, consideration can be given to adopting newer, more promising technologies to achieve higher business value. Technologies in decline are likely to face increasing limitations and issues in supporting business operations and will gradually be replaced by newer technologies within 3-5 years; when introducing such technologies, it's essential to consider their business value and the current state of the enterprise. Popular technologies, on the other hand, are prioritized by enterprises because they have been widely validated by early adopters, with the majority of businesses and technology companies endorsing them; their business value has been verified, and they are expected to dominate the market in the next 1-2 years. Growing technologies require consideration based on their business value: they have passed the early adoption phase and had their technological and business value validated by early adopters, but they have not yet been fully embraced by the market for reasons such as branding and promotion, and they are likely to become popular technologies and future industry standards.
Forward-looking technologies are generally cutting-edge and used by early adopters, offering some business value. However, their general applicability and ROI have not been fully validated, so enterprises can consider limited adoption in areas where they provide significant business value.

For individuals, mature and declining technologies offer limited learning and research value, as they are already widely adopted. Focusing on popular technologies can be advantageous for employment prospects, as they are highly sought after in the industry; however, competition in this area is fierce, requiring a certain depth of understanding to stand out. Growing technologies are worth delving into, as they are likely to become popular in the future, and early experience can lead to expertise when they reach peak popularity. Forward-looking technologies, while potentially leading to groundbreaking innovations, may also fail; individuals may choose to invest time and effort based on personal interest. While these technologies may be far from job requirements and practical application, forward-thinking companies may ask about them during interviews to assess a candidate's foresight.

Definitions of Technological Maturity

- Forward-looking: Technologies are still in the research and development stage, with the community exploring their practical applications and potential market value. Although the industry's understanding of these technologies is still shallow, high-value demands have been identified.
- Growing: Technologies begin to enter the practical application stage, with increasing competition in the market and parallel development of various technological paths. The community focuses on overcoming challenges in practical applications and maximizing their commercial value, although their business value is not yet fully realized.
- Popular: Technology development reaches its peak, with the community striving to maximize technological performance. Industry attention peaks, and the technology begins to demonstrate significant commercial value.
- Declining: Technology paths begin to show clear advantages and disadvantages, with the market demanding higher optimization and integration. The industry begins to recognize the limitations and boundaries of the technology in enhancing business value.
- Mature: Technology paths tend to unify and standardize, with the community focusing on reducing costs and improving efficiency. The industry also focuses on cost-effectiveness analysis to evaluate the priority and breadth of technology applications.

Definitions of Business Value

- 5 stars: The cost reduction/revenue contribution of the relevant technologies/business units accounts for 50% or more of the department's total revenue, or is managed by senior directors or higher-level executives (e.g., VPs).
- 4 stars: The cost reduction/revenue contribution of the relevant technologies/business units accounts for between 40% and 50% of the department's total revenue, or is managed by directors.
- 3 stars: The cost reduction/revenue contribution of the relevant technologies/business units accounts for between 30% and 40% of the department's total revenue, or is managed by senior managers.
- 2 stars: The cost reduction/revenue contribution of the relevant technologies/business units accounts for between 20% and 30% of the department's total revenue, or is managed by managers.
- 1 star: The cost reduction/revenue contribution of the relevant technologies/business units accounts for between 5% and 20% of the department's total revenue, or is managed by supervisors.

Definitions of Technological Difficulty

- 5 stars: Requires a team of top industry experts invested for over 12 months
- 4 stars: Requires industry experts or senior architects invested for over 12 months
- 3 stars: Requires an architect team invested for approximately 6 months
- 2 stars: Requires a senior programmer team invested for 1-3 months
- 1 star: Requires an ordinary programmer team invested for 1-3 months
Speed and scalability are significant issues today, at least in the application landscape. In-memory data stores such as Redis and Memcached have been game changers in recent years, serving as critical enablers of fast data access. But which one should you choose? This article dives into a comparative analysis of these two popular technologies, highlights the critical performance metrics and scalability considerations, and, through real-world use cases, gives you the clarity to confidently make an informed decision.

We ran these benchmarks on AWS EC2 instances and designed a custom dataset to stay as close as possible to real application use cases. We compare throughput, operations per second, and latency under different loads, specifically at the P90 and P99 percentiles. We also dig deeper into a clustered environment to identify the scalability characteristics of both Redis and Memcached, including the implementation and management complexities of each. This level of comparison detail will give decision-makers the information they need to choose the more appropriate in-memory data store for their needs. Optimizing application performance and scalability can mean a vital competitive advantage for businesses. Our findings contribute to the existing knowledge base and offer practical recommendations for using these technologies effectively in real-world settings.

Introduction

In the world of high-performance applications, the foremost requirements are speed and scalability. Both Redis and Memcached help lead the way by storing data in RAM, which makes data access almost instant. But how do you make the right choice when what you need to evaluate is performance and scalability? Let's look at both systems in more detail as we flesh out their performance and scalability characteristics.

Methodology

We used several AWS instance types, including memory-optimized (m5.2xlarge) and compute-optimized (c5.2xlarge) instances. All instances used the default settings and configurations for Redis and Memcached.

Workloads

- Read heavy: Read and write operations in a ratio of 80% to 20%
- Write heavy: Write and read operations in a ratio of 80% to 20%

Metrics Collected

- Operations per second (Ops/sec): The number of operations the system can execute in one second
- P90 latency: The latency recorded at the 90th percentile; 90 percent of requests are satisfied at this level or faster. Under normal conditions, it gives you a good idea of system performance.
- P99 latency: The latency that only 1% of requests exceed. It measures the "tail latency" behavior of your system under its heaviest load or worst-case scenario.
- Throughput: Measured in gigabytes per second (GB/s), throughput is the rate at which the system moves data.

To simulate workloads and measure performance, standard benchmarking tools were applied: memtier_benchmark for Redis and memcached_benchmark for Memcached.

Procedure

Initialization

During the application's startup, a fixed dataset was used to ensure the results were standardized, which is a critical requirement. Sizes and data types were allowed to vary to reach a dataset resembling actual application usage. Moreover, the dataset was designed to model common usage in terms of read and write ratios and the most commonly used data structures, such as strings, hashes, and lists.
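For illustration only — this is not the dataset used in the study — here is a small Python sketch using the redis-py client that shows what a mixed dataset of strings, hashes, and lists can look like; the key names, counts, and value sizes are hypothetical.

```python
import random
import string

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def random_value(size: int) -> str:
    """Build a printable payload of the requested size."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=size))


# Seed a mix of data types roughly mirroring common application usage:
# plain strings (cache entries), hashes (objects), and lists (queues/feeds).
for i in range(10_000):
    r.set(f"session:{i}", random_value(random.choice([64, 256, 1024])))

for i in range(2_000):
    r.hset(f"user:{i}", mapping={"name": random_value(12), "plan": "basic", "visits": i})

for i in range(500):
    r.rpush(f"feed:{i}", *[random_value(80) for _ in range(20)])

print("keys loaded:", r.dbsize())
```

The mix of strings, hashes, and lists mirrors the data structures mentioned above; a real seeding script would match the study's exact size distribution and key space.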
Implementation

Average performance can only be measured over a long period once enough sample points are collected, so workloads were run for a constant amount of time, usually 1-2 hours. The execution time was long enough for the application to reach a steady state and produce good-quality measurements, meaning the average performance was measured under constant load.

Replication

Measurements were taken several (5-7) times to mitigate anomalies that could affect the results. The benchmark was run at least 5 times, and the results were averaged to correct for deviations in performance.

Data Aggregation

Metrics were averaged across the runs to capture overall performance. Critical metrics include operations per second (Ops/sec), P90/P99 latency, and throughput.

Sources of Potential Errors

Network latency, CPU load, and memory availability on the AWS instances may vary, which impacts the benchmarking results. The overhead contributed by the benchmarking tool can also influence performance measurements.

Results

Real-World Performance

Based on the methodology described above, we gathered the following performance data for Redis and Memcached on AWS instances.

Performance Metrics (AWS EC2 Instances)

| Instance Type | Workload Type | System | Ops/Sec | P90 Latency (ms) | P99 Latency (ms) | Throughput (GB/s) |
|---|---|---|---|---|---|---|
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Read / 20% Write | Memcached | 1,200,000 | 0.25 | 0.35 | 1.2 |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Read / 20% Write | Redis 7 | 1,000,000 | 0.30 | 0.40 | 1.0 |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Write / 20% Read | Memcached | 1,100,000 | 0.28 | 0.38 | 1.1 |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Write / 20% Read | Redis 7 | 900,000 | 0.33 | 0.45 | 0.9 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Read / 20% Write | Memcached | 1,300,000 | 0.23 | 0.33 | 1.3 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Read / 20% Write | Redis 7 | 1,100,000 | 0.28 | 0.38 | 1.1 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Write / 20% Read | Memcached | 1,200,000 | 0.26 | 0.36 | 1.2 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Write / 20% Read | Redis 7 | 1,000,000 | 0.31 | 0.41 | 1.0 |

Performance Summary

Redis offers versatility with many data structures and sustains performance on network-bound tasks thanks to threaded I/O, although its single-threaded command execution can slow CPU-bound tasks. Its tail latency is higher, especially under heavy write loads, as the P90 and P99 metrics show; still, Redis performs well during both read and write operations. Memcached benefits from its multi-threaded execution model and is highly optimized for high-speed, high-throughput caching, which makes it ideal for simple key-value operations with less overhead. Memcached generally performs better under heavy read and write loads, with lower P90 and P99 latency and higher throughput.

Scalability Comparison

Redis Scalability

Redis provides horizontal scaling through Redis Cluster, which shards data across many nodes. This improves Redis's fault tolerance, which is ideal for large-scale applications, but it makes managing the cluster more complex and consumes more resources.

- Scales horizontally by partitioning data across nodes
- Provides high availability and automatic failover
- Offers data durability options such as RDB snapshots and AOF logging

Memcached Scalability

Memcached uses consistent hashing to distribute load evenly across nodes (see the sketch after this list). Adding more nodes makes scaling easy and keeps performance smooth as data and traffic grow. Memcached's simplicity in scaling and management is a big plus.

- Adding nodes is straightforward.
- Ensures even load distribution and high availability
- Requires less maintenance and configuration
Scalability Benchmarking

Using a 10-node cluster configuration, we benchmarked the scalability of Redis 7 and Memcached on AWS, including P90 and P99 latency metrics to provide insight into tail-latency behavior.

Scalability Metrics (AWS EC2 Instances)

| Instance Type | Workload Type | System | Ops/Sec | P90 Latency (ms) | P99 Latency (ms) | Throughput (GB/s) |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Read, 20% Write | Memcached | 12,000,000 | 0.35 | 0.45 | 12.0 |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Read, 20% Write | Redis 7 | 10,000,000 | 0.40 | 0.50 | 10.0 |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Write, 20% Read | Memcached | 11,000,000 | 0.38 | 0.48 | 11.0 |
| m5.2xlarge (8 vCPUs, 32 GB RAM) | 80% Write, 20% Read | Redis 7 | 9,000,000 | 0.43 | 0.53 | 9.0 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Read, 20% Write | Memcached | 1,300,000 | 0.33 | 0.43 | 13.0 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Read, 20% Write | Redis 7 | 1,100,000 | 0.38 | 0.48 | 11.0 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Write, 20% Read | Memcached | 12,000,000 | 0.36 | 0.46 | 12.0 |
| c5.2xlarge (8 vCPUs, 16 GB RAM) | 80% Write, 20% Read | Redis 7 | 10,000,000 | 0.41 | 0.51 | 10.0 |

Scalability Summary

Redis scales through Redis Cluster, gaining a distributed architecture, high availability, and persistence, but at the cost of more demanding management and higher resource consumption. The P90 and P99 metrics show more tail latency under heavy loads, while large-scale operations still achieve good throughput.

Memcached is the simpler option and scales easily thanks to consistent hashing: adding nodes is straightforward, which keeps management lightweight. Memcached generally performs better under heavy read and write loads, with lower P90 and P99 latency and higher throughput.

The differences in throughput between the performance and scalability tests are primarily due to the different test setups and workload distributions. In the single-instance performance tests, throughput is limited by the resources of an individual instance. In the scalability tests, the workload is distributed across a larger cluster, allowing higher aggregate throughput through parallel processing and more efficient resource utilization. Network overheads and caching efficiencies also contribute to the higher throughput observed in clustered environments.

Analysis of Benchmark Results

Read-Heavy Workloads

Memcached delivered higher throughput and lower latency in read-heavy workloads because of its multi-threaded architecture, which can serve multiple read operations simultaneously. This results in faster response times and a higher rate of data flow.

Redis lagged slightly but still performed well. Its single-threaded command execution can limit the number of simultaneous read requests it serves. However, Redis is well suited to managing complex data structures, which is essential for applications that require sophisticated queries.

Write-Heavy Workloads

Memcached has a clear advantage over Redis in write-heavy scenarios. Its multi-threaded design lets it process many write operations concurrently, reducing overall latency and increasing throughput.

Redis showed higher latency and lower throughput as the write load grew heavier; its single-threaded execution becomes a bottleneck under heavy write operations. That said, Redis features such as durability via AOF and RDB and its ability to handle complex data structures make it robust and flexible in ways Memcached is not.

Recommendations for Optimizing Performance

Optimizing Redis

Leverage the threaded I/O introduced in Redis 6.0 and further improved in Redis 7.0 to get better performance on network-bound tasks.
Implement Redis Cluster to distribute load across several nodes for better scalability and fault tolerance.
Match data structures to their use cases, such as hashes for storing objects and sorted sets for ranking systems, to avoid wasting memory and to keep performance high.
Configure AOF and RDB snapshots based on the durability requirements of the application, making the right trade-off between performance and data safety.

Optimizing Memcached

Leverage Memcached's multithreading to serve high-concurrency workloads effectively.
Rely on Memcached's consistent hashing to balance load across nodes, so the cluster can scale out gradually while maintaining high availability.
Tune memory allocation settings to the workload characteristics to achieve maximum cache efficiency.
Keep cache operations simple and avoid complex data manipulations to stay fast, with high throughput and low latency.

Conclusion

Redis and Memcached are both compelling tools for high-performance applications, though their suitability depends on the use case. The versatility and rich feature set of Redis make it an excellent fit for complex applications that need real-time analytics, data persistence, and sophisticated data manipulation. Memcached is streamlined and fast, which makes it a great choice where simple key-value caching and rapid data retrieval are a must. Armed with an understanding of the strengths and weaknesses of both, plus additional factors such as ease of setup, maintenance and monitoring, security, and your own benchmarks, you can choose the option that optimizes your application's performance and scalability and delivers a smoother, more responsive user experience.

Additional Contributor

This article was co-authored by Seema Phalke.
Did you know that up to 84% of Waterfall and 47% of Agile projects either fail or don't meet expected results? That may sound alarming, but the point is not to be a doomsayer or to dwell on the negative. What those numbers reflect is an opportunity: an opportunity to turn a seemingly failed project into a flourishing one, to learn lessons, and to leverage the things that make Agile principles work so smoothly when they do.

The name for Agile methodologies was chosen well, and for a reason. To be agile is to be quick on your feet, to adapt quickly, and to solve unpredictable problems while keeping momentum. Just like in a fast-paced race, when things start to go wrong, they tend to do so quickly. Agile problems require agile solutions, solved as speedily as they appear, change, or ripple outwards. So, why do some projects fail and others don't, and what can be done about it?

The reasons for failure or underperformance can be as varied as the people involved or the project milestones. We've all heard of, or encountered, versions of them before: missed deadlines, budget overruns, low morale, personnel changes, scope creep — the list goes on. But there is a common denominator here: almost all of these issues are inevitable. Inevitable in the sense that project management is built around project failure — something will stray from the plan, so the goal is not to try to eliminate this aspect but to manage it with agility. The successful recovery of a project, or having to recover in the first place, is then not a failure of the system but often a failure of imagination. How, then, can we use the methodologies of Agile to recover a failing project?

Understanding Agile Principles

First, let's quickly recap the core tenets of Agile. Agile is a methodology, yes, but it's not just a method — it's a mindset that combines the best aspects of humans working together. It values flexibility, collaboration, and continuous improvement. Broadly, Agile is an umbrella term that encompasses various iterative software development methodologies, but it has branched out into all sectors where that workflow makes sense. It focuses on working in short cycles called sprints, and in doing so welcomes responding to change, collaboration, and delivering on customers' evolving needs.

| Feature | Agile | Waterfall |
| Approach | Iterative, incremental, flexible | Linear, sequential, rigid |
| Change | Embraces change throughout the project | Resistant to change after planning is complete |
| Planning | Continuous, adaptable planning | Detailed upfront planning |
| Team Structure | Self-organizing, cross-functional teams | Specialized teams working in silos |
| Communication | Frequent, informal communication | Formal communication with regular reports |
| Documentation | Prioritizes working software over comprehensive documentation | Emphasizes comprehensive documentation at each phase |

Diagnosing the Failing Project

Before we can fix the problem, we need to identify the problem. Sounds obvious, right? It needs to be diagnosed constructively, though, beyond simply labeling a cause and effect. It's the first step in getting to the root of the problem, not just trying to eliminate the symptoms.

1. Conduct a Retrospective

A structured meeting where the team reflects on the past sprint or iteration to identify what went well, what didn't, and what can be improved.
Encourage open and honest discussion and collaboration by creating a blame-free environment — emphasize learning from mistakes, and then take action based on the findings.

2. Analyze the Sprint Burndown Chart

Burndown charts visually track the remaining work in a sprint against the time remaining, helping to surface scope creep, underestimation, and blockers.

3. Review Velocity Trends

A declining or inconsistent velocity can signal problems such as team burnout, skill gaps, or poor estimation.

4. Gather Feedback From Stakeholders

Clients or end users can give valuable feedback and reveal issues such as misaligned goals, communication gaps, or dissatisfaction.

5. Conduct a Root Cause Analysis (RCA)

By using tools like the "5 Whys" technique (asking "why?" five times), you can drill down from surface-level symptoms to the root causes of project issues. This helps you address the fundamental problems, not just the symptoms.

Using Agile Methodologies To Get Back on Track

Let's now cover some of the aspects intrinsic to Agile and how they help rescue any project.

Individuals and Interactions Over Processes and Tools

Why it matters for recovery: Relying solely on rigid processes won't solve a failing project. Motivated, knowledgeable, and empowered individuals need to communicate and collaborate effectively.

How to apply it: Encourage open communication, trust your team's expertise, and empower them to come up with solutions and make decisions.

Working Solutions Over Comprehensive Documentation

Why it matters for recovery: Whether you create software or offer business solutions to clients, a failing project needs tangible results, not just more paperwork.

How to apply it: Use sprints to deliver value by cutting the fat — in other words, prioritize the features that matter to clients, get them in front of clients, and then gather feedback.

Customer Collaboration Over Contract Negotiation

Why it matters for recovery: Similar to the previous point, but more focused on involving the customer or client and steering back toward what they actually want.

How to apply it: Regularly gather feedback from the customer, involve them in sprint reviews, and be transparent about progress and challenges. They will appreciate the transparency and involvement.

Responding to Change Over Following a Plan

Why it matters for recovery: As mentioned earlier, the hallmark of an Agile project and a quick recovery is adapting to change. The more rigid a plan, the more likely it is to crumble under pressure.

How to apply it: Again, the practical approach is to use short sprints with tangible results, and then, based on those results, adapt the plan as needed while staying open to feedback.

To put it another way, we can use the 6 A's to define project recovery:

Aligned to strategy: Make sure everyone is on the same page, from those doing the work, to those with the expertise, to the clients and end users.

Active sponsorship: Without oversight, leadership, and adherence to budget, any project can soon derail. Sponsorship needs to be active, not passive and distant.

Agile team of teams: Every member of the team needs to have a stake and be empowered; whether managerial or delivery, all teams work together to be nimble yet focused.

All-team planning: Informed, capable, and experienced leaders know that planning needs to happen together. A plan can be meticulous and extensive, but if different teams or silos have different plans, alignment will always be out of reach.
Adaptive culture: This is the core of an Agile environment. The true reflection of a project, plan, or organization is not the ideology on paper but how it solves problems in practice. The culture needs to be encouraged from both the ground up and the top down: if someone notices an issue, they do what they can to help, not simply what is mandated.

Agile expertise: Knowledge is power. With rapid feedback, you also need to be able to apply that feedback promptly and effectively. This is where experience and knowledge of Agile systems and principles in practice come in. Without bespoke solutions, real-world application, and hands-on experience, Agile methodologies are in just as much danger of failing as any other. You need people on your team who know not only the map but also the territory.

When it comes to realigning a project, saving it from failure, or future-proofing new projects, all of these principles will stand you in good stead. They are tried and tested, not just by us but by countless organizations and projects around the world. This experience, iteration, and adaptation go beyond good business practice and filter through the whole organization to form a solid, dependable foundation.
In today's security landscape, OAuth2 has become a standard for securing APIs, providing a more robust and flexible approach than basic authentication. My journey into this domain began with a critical solution architecture decision: migrating from basic authentication to OAuth2 client credentials for obtaining access tokens. While Spring Security offers strong support for both authentication methods, I encountered a significant challenge: I could not find a declarative approach that seamlessly integrated basic authentication and JWT authentication within the same application. This gap in functionality motivated me to explore and develop a solution that not only meets the authentication requirements but also supports comprehensive integration testing.

This article shares my findings and provides a detailed guide on setting up Keycloak, integrating it with Spring Security and Spring Boot, and utilizing the Spock Framework for repeatable integration tests. By the end of this article, you will clearly understand how to configure and test your authentication mechanisms effectively with Keycloak as an identity provider, ensuring a smooth transition to OAuth2 while maintaining the flexibility to support basic authentication where necessary.

Prerequisites

Before you begin, ensure you have met the following requirements:

You have installed Java 21.
You have a basic understanding of Maven and Java.

The taptech-code-accelerator project is the parent project for the taptech-code-accelerator modules. It manages common dependencies and configurations for all the child modules. You can get it here: taptech-code-accelerator.

Building taptech-code-accelerator

To build the taptech-code-accelerator project, follow these steps:

Clone the project from the repository:
git clone https://github.com/glawson6/taptech-code-accelerator.git

Open a terminal and change the current directory to the root directory of the taptech-code-accelerator project:
cd path/to/taptech-code-accelerator

Run the following command to build the project:
./build.sh

This command cleans the project, compiles the source code, runs any tests, packages the compiled code into a JAR or WAR file, and installs the packaged code in your local Maven repository. It also builds the local Docker image that will be used later. Please ensure you have the necessary permissions to execute these commands.

Keycloak Initial Setup

Setting up Keycloak for integration testing involves several steps. This guide will walk you through creating a local environment configuration, starting Keycloak with Docker, configuring realms and clients, verifying the setup, and preparing a PostgreSQL dump for your integration tests.

Step 1: Create a local.env File

First, navigate to the taptech-common/src/test/resources/docker directory and create a local.env file to store the environment variables needed for the Keycloak service. Here's an example of what the local.env file might look like:

POSTGRES_DB=keycloak
POSTGRES_USER=keycloak
POSTGRES_PASSWORD=admin
KEYCLOAK_ADMIN=admin
KEYCLOAK_ADMIN_PASSWORD=admin
KC_DB_USERNAME=keycloak
KC_DB_PASSWORD=keycloak
SPRING_PROFILES_ACTIVE=secure-jwk
KEYCLOAK_ADMIN_CLIENT_SECRET=DCRkkqpUv3XlQnosjtf8jHleP7tuduTa
IDP_PROVIDER_JWKSET_URI=http://172.28.1.90:8080/realms/offices/protocol/openid-connect/certs

Step 2: Start the Keycloak Service

Next, start the Keycloak service using the provided docker-compose.yml file and the ./start-services.sh script. The docker-compose.yml file should define the Keycloak and PostgreSQL services.
version: '3.8'
services:
  postgres:
    image: postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      #- ./dump:/docker-entrypoint-initdb.d
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: ${KC_DB_USERNAME}
      POSTGRES_PASSWORD: ${KC_DB_PASSWORD}
    networks:
      node_net:
        ipv4_address: 172.28.1.31
  keycloak:
    image: quay.io/keycloak/keycloak:23.0.6
    command: start #--import-realm
    environment:
      KC_HOSTNAME: localhost
      KC_HOSTNAME_PORT: 8080
      KC_HOSTNAME_STRICT_BACKCHANNEL: false
      KC_HTTP_ENABLED: true
      KC_HOSTNAME_STRICT_HTTPS: false
      KC_HEALTH_ENABLED: true
      KEYCLOAK_ADMIN: ${KEYCLOAK_ADMIN}
      KEYCLOAK_ADMIN_PASSWORD: ${KEYCLOAK_ADMIN_PASSWORD}
      KC_DB: postgres
      KC_DB_URL: jdbc:postgresql://172.28.1.31/keycloak
      KC_DB_USERNAME: ${KC_DB_USERNAME}
      KC_DB_PASSWORD: ${KC_DB_PASSWORD}
    ports:
      - 8080:8080
    volumes:
      - ./realms:/opt/keycloak/data/import
    restart: always
    depends_on:
      - postgres
    networks:
      node_net:
        ipv4_address: 172.28.1.90
volumes:
  postgres_data:
    driver: local
networks:
  node_net:
    ipam:
      driver: default
      config:
        - subnet: 172.28.0.0/16

Then, start the services with the ./start-services.sh script.

Step 3: Access Keycloak Admin Console

Once Keycloak has started, log in to the admin console at http://localhost:8080 using the configured admin username and password (the default is admin/admin).

Step 4: Create a Realm and Client

Create a realm:
Log in to the Keycloak admin console.
In the left-hand menu, click on "Add Realm".
Enter the name of the realm (e.g., offices) and click "Create".

Create a client:
Select your newly created realm from the left-hand menu.
Click on "Clients" in the left-hand menu.
Click on "Create" in the right-hand corner.
Enter the client ID (e.g., offices), choose openid-connect as the client protocol, and click "Save".

Extract the admin-cli client secret:
Follow the directions in the doc EXTRACTING-ADMIN-CLI-CLIENT-SECRET.md to extract the admin-cli client secret, and save the client secret for later use.

Step 5: Verify the Setup With HTTP Requests

To verify the setup, you can use HTTP requests to obtain tokens.

Get an access token:
http -a admin-cli:[client secret] --form POST http://localhost:8080/realms/master/protocol/openid-connect/token grant_type=password username=admin password=Pa55w0rd

Step 6: Create a PostgreSQL Dump

After verifying the setup, create a PostgreSQL dump of the Keycloak database to use for seeding the database during integration tests.

Create the dump:
docker exec -i docker-postgres-1 /bin/bash -c "PGPASSWORD=keycloak pg_dump --username keycloak keycloak" > dump/keycloak-dump.sql

Save the file: Save the keycloak-dump.sql file locally. This file will be used to seed the database for integration tests.

Following these steps, you will have a Keycloak instance configured and ready for integration testing with Spring Security and the Spock Framework.

Spring Security and Keycloak Integration Tests

This section sets up integration tests for Spring Security and Keycloak using Spock and Testcontainers. This involves configuring dependencies, setting up Testcontainers for Keycloak and PostgreSQL, and creating a base class to hold the necessary configuration.

Step 1: Add Dependencies

First, add the necessary dependencies to your pom.xml file. Ensure that Spock, Testcontainers for Keycloak and PostgreSQL, and other required libraries are included (check here).

Step 2: Create the Base Test Class

Create a base class to hold the configuration for your integration tests.
package com.taptech.common.security.keycloak import com.taptech.common.security.user.InMemoryUserContextPermissionsService import com.fasterxml.jackson.databind.ObjectMapper import dasniko.testcontainers.keycloak.KeycloakContainer import org.keycloak.admin.client.Keycloak import org.slf4j.Logger import org.slf4j.LoggerFactory import org.springframework.beans.factory.annotation.Autowired import org.springframework.context.annotation.Bean import org.springframework.context.annotation.Configuration import org.testcontainers.containers.Network import org.testcontainers.containers.PostgreSQLContainer import org.testcontainers.containers.output.Slf4jLogConsumer import org.testcontainers.containers.wait.strategy.ShellStrategy import org.testcontainers.utility.DockerImageName import org.testcontainers.utility.MountableFile import spock.lang.Shared import spock.lang.Specification import spock.mock.DetachedMockFactory import java.time.Duration import java.time.temporal.ChronoUnit class BaseKeyCloakInfraStructure extends Specification { private static final Logger logger = LoggerFactory.getLogger(BaseKeyCloakInfraStructure.class); static String jdbcUrlFormat = "jdbc:postgresql://%s:%s/%s" static String keycloakBaseUrlFormat = "http://%s:%s" public static final String OFFICES = "offices"; public static final String POSTGRES_NETWORK_ALIAS = "postgres"; @Shared static Network network = Network.newNetwork(); @Shared static PostgreSQLContainer<?> postgres = createPostgresqlContainer() protected static PostgreSQLContainer createPostgresqlContainer() { PostgreSQLContainer container = new PostgreSQLContainer<>("postgres") .withNetwork(network) .withNetworkAliases(POSTGRES_NETWORK_ALIAS) .withCopyFileToContainer(MountableFile.forClasspathResource("postgres/keycloak-dump.sql"), "/docker-entrypoint-initdb.d/keycloak-dump.sql") .withUsername("keycloak") .withPassword("keycloak") .withDatabaseName("keycloak") .withLogConsumer(new Slf4jLogConsumer(logger)) .waitingFor(new ShellStrategy() .withCommand( "psql -q -o /dev/null -c \"SELECT 1\" -d keycloak -U keycloak") .withStartupTimeout(Duration.of(60, ChronoUnit.SECONDS))) return container } public static final DockerImageName KEYCLOAK_IMAGE = DockerImageName.parse("bitnami/keycloak:23.0.5"); @Shared public static KeycloakContainer keycloakContainer; @Shared static String adminCC = "admin@cc.com" def setup() { } // run before every feature method def cleanup() {} // run after every feature method def setupSpec() { postgres.start() String jdbcUrl = String.format(jdbcUrlFormat, POSTGRES_NETWORK_ALIAS, 5432, postgres.getDatabaseName()); keycloakContainer = new KeycloakContainer("quay.io/keycloak/keycloak:23.0.6") .withNetwork(network) .withExposedPorts(8080) .withEnv("KC_HOSTNAME", "localhost") .withEnv("KC_HOSTNAME_PORT", "8080") .withEnv("KC_HOSTNAME_STRICT_BACKCHANNEL", "false") .withEnv("KC_HTTP_ENABLED", "true") .withEnv("KC_HOSTNAME_STRICT_HTTPS", "false") .withEnv("KC_HEALTH_ENABLED", "true") .withEnv("KEYCLOAK_ADMIN", "admin") .withEnv("KEYCLOAK_ADMIN_PASSWORD", "admin") .withEnv("KC_DB", "postgres") .withEnv("KC_DB_URL", jdbcUrl) .withEnv("KC_DB_USERNAME", "keycloak") .withEnv("KC_DB_PASSWORD", "keycloak") keycloakContainer.start() String authServerUrl = keycloakContainer.getAuthServerUrl(); String adminUsername = keycloakContainer.getAdminUsername(); String adminPassword = keycloakContainer.getAdminPassword(); logger.info("Keycloak getExposedPorts: {}", keycloakContainer.getExposedPorts()) String keycloakBaseUrl = String.format(keycloakBaseUrlFormat, 
keycloakContainer.getHost(), keycloakContainer.getMappedPort(8080)); //String keycloakBaseUrl = "http://localhost:8080" logger.info("Keycloak authServerUrl: {}", authServerUrl) logger.info("Keycloak URL: {}", keycloakBaseUrl) logger.info("Keycloak adminUsername: {}", adminUsername) logger.info("Keycloak adminPassword: {}", adminPassword) logger.info("JDBC URL: {}", jdbcUrl) System.setProperty("spring.datasource.url", jdbcUrl) System.setProperty("spring.datasource.username", postgres.getUsername()) System.setProperty("spring.datasource.password", postgres.getPassword()) System.setProperty("spring.datasource.driverClassName", "org.postgresql.Driver"); System.setProperty("POSTGRES_URL", jdbcUrl) System.setProperty("POSRGRES_USER", postgres.getUsername()) System.setProperty("POSRGRES_PASSWORD", postgres.getPassword()); System.setProperty("idp.provider.keycloak.base-url", authServerUrl) System.setProperty("idp.provider.keycloak.admin-client-secret", "DCRkkqpUv3XlQnosjtf8jHleP7tuduTa") System.setProperty("idp.provider.keycloak.admin-client-id", KeyCloakConstants.ADMIN_CLI) System.setProperty("idp.provider.keycloak.admin-username", adminUsername) System.setProperty("idp.provider.keycloak.admin-password", adminPassword) System.setProperty("idp.provider.keycloak.default-context-id", OFFICES) System.setProperty("idp.provider.keycloak.client-secret", "x9RIGyc7rh8A4w4sMl8U5rF3HuNm2wOC3WOD") System.setProperty("idp.provider.keycloak.client-id", OFFICES) System.setProperty("idp.provider.keycloak.token-uri", "/realms/offices/protocol/openid-connect/token") System.setProperty("idp.provider.keycloak.jwkset-uri", authServerUrl + "/realms/offices/protocol/openid-connect/certs") System.setProperty("idp.provider.keycloak.issuer-url", authServerUrl + "/realms/offices") System.setProperty("idp.provider.keycloak.admin-token-uri", "/realms/master/protocol/openid-connect/token") System.setProperty("idp.provider.keycloak.user-uri", "/admin/realms/{realm}/users") System.setProperty("idp.provider.keycloak.use-strict-jwt-validators", "false") } // run before the first feature method def cleanupSpec() { keycloakContainer.stop() postgres.stop() } // run after @Autowired Keycloak keycloak @Autowired KeyCloakAuthenticationManager keyCloakAuthenticationManager @Autowired InMemoryUserContextPermissionsService userContextPermissionsService @Autowired KeyCloakManagementService keyCloakService @Autowired KeyCloakIdpProperties keyCloakIdpProperties @Autowired KeyCloakJwtDecoderFactory keyCloakJwtDecoderFactory def test_config() { expect: keycloak != null keyCloakAuthenticationManager != null keyCloakService != null } static String basicAuthCredsFrom(String s1, String s2) { return "Basic " + toBasicAuthCreds(s1, s2); } static toBasicAuthCreds(String s1, String s2) { return Base64.getEncoder().encodeToString((s1 + ":" + s2).getBytes()); } @Configuration @EnableKeyCloak public static class TestConfig { @Bean ObjectMapper objectMapper() { return new ObjectMapper(); } DetachedMockFactory mockFactory = new DetachedMockFactory() } } In the BaseKeyCloakInfraStructure class, a method named createPostgresqlContainer() is used to set up a PostgreSQL test container. This method configures the container with various settings, including network settings, username, password, and database name. This class sets up the entire Postgresql and Keycloak env. One of the key steps in this method is the use of a PostgreSQL dump file to populate the database with initial data. 
This is done using the withCopyFileToContainer() method, which copies a file from the classpath to a specified location within the container. If you have problems starting, you might need to restart the Docker Compose file and extract the client secret. This is explained in EXTRACTING-ADMIN-CLI-CLIENT-SECRET. The code snippet for this is: .withCopyFileToContainer(MountableFile.forClasspathResource("postgres/keycloak-dump.sql"), "/docker-entrypoint-initdb.d/keycloak-dump.sql") Step 3: Extend the Base Class End Run Your Tests package com.taptech.common.security.token import com.taptech.common.EnableCommonConfig import com.taptech.common.security.keycloak.BaseKeyCloakInfraStructure import com.taptech.common.security.keycloak.EnableKeyCloak import com.taptech.common.security.keycloak.KeyCloakAuthenticationManager import com.taptech.common.security.user.UserContextPermissions import com.taptech.common.security.utils.SecurityUtils import com.fasterxml.jackson.databind.ObjectMapper import org.slf4j.Logger import org.slf4j.LoggerFactory import org.springframework.beans.factory.annotation.Autowired import org.springframework.boot.test.autoconfigure.web.reactive.WebFluxTest import org.springframework.context.annotation.Bean import org.springframework.context.annotation.Configuration import org.springframework.security.oauth2.client.registration.InMemoryReactiveClientRegistrationRepository import org.springframework.test.context.ContextConfiguration import org.springframework.test.web.reactive.server.EntityExchangeResult import org.springframework.test.web.reactive.server.WebTestClient import spock.mock.DetachedMockFactory import org.springframework.boot.autoconfigure.security.reactive.ReactiveSecurityAutoConfiguration @ContextConfiguration(classes = [TestApiControllerConfig.class]) @WebFluxTest(/*controllers = [TokenApiController.class],*/ properties = [ "spring.main.allow-bean-definition-overriding=true", "openapi.token.base-path=/", "idp.provider.keycloak.initialize-on-startup=true", "idp.provider.keycloak.initialize-realms-on-startup=false", "idp.provider.keycloak.initialize-users-on-startup=true", "spring.test.webtestclient.base-url=http://localhost:8888" ], excludeAutoConfiguration = ReactiveSecurityAutoConfiguration.class) class TokenApiControllerTest extends BaseKeyCloakInfraStructure { private static final Logger logger = LoggerFactory.getLogger(TokenApiControllerTest.class); /* ./mvnw clean test -Dtest=TokenApiControllerTest ./mvnw clean test -Dtest=TokenApiControllerTest#test_public_validate */ @Autowired TokenApiApiDelegate tokenApiDelegate @Autowired KeyCloakAuthenticationManager keyCloakAuthenticationManager @Autowired private WebTestClient webTestClient @Autowired TokenApiController tokenApiController InMemoryReactiveClientRegistrationRepository clientRegistrationRepository def test_configureToken() { expect: tokenApiDelegate } def test_public_jwkkeys() { expect: webTestClient.get().uri("/public/jwkKeys") .exchange() .expectStatus().isOk() .expectBody() } def test_public_login() { expect: webTestClient.get().uri("/public/login") .headers(headers -> { headers.setBasicAuth(BaseKeyCloakInfraStructure.adminCC, "admin") }) .exchange() .expectStatus().isOk() .expectBody() .jsonPath(".access_token").isNotEmpty() .jsonPath(".refresh_token").isNotEmpty() } def test_public_login_401() { expect: webTestClient.get().uri("/public/login") .headers(headers -> { headers.setBasicAuth(BaseKeyCloakInfraStructure.adminCC, "bad") }) .exchange() .expectStatus().isUnauthorized() } def 
test_public_refresh_token() { given: def results = keyCloakAuthenticationManager.passwordGrantLoginMap(BaseKeyCloakInfraStructure.adminCC, "admin", OFFICES).toFuture().join() def refreshToken = results.get("refresh_token") expect: webTestClient.get().uri("/public/refresh") .headers(headers -> { headers.set("Authorization", SecurityUtils.toBearerHeaderFromToken(refreshToken)) headers.set("contextId", OFFICES) }) .exchange() .expectStatus().isOk() .expectBody() .jsonPath(".access_token").isNotEmpty() .jsonPath(".refresh_token").isNotEmpty() } def test_public_validate() { given: def results = keyCloakAuthenticationManager.passwordGrantLoginMap(BaseKeyCloakInfraStructure.adminCC, "admin", OFFICES).toFuture().join() def accessToken = results.get("access_token") expect: EntityExchangeResult<UserContextPermissions> entityExchangeResult = webTestClient.get().uri("/public/validate") .headers(headers -> { headers.set("Authorization", SecurityUtils.toBearerHeaderFromToken(accessToken)) }) .exchange() .expectStatus().isOk() .expectBody(UserContextPermissions.class) .returnResult() logger.info("entityExchangeResult: {}", entityExchangeResult.getResponseBody()) } @Configuration @EnableCommonConfig @EnableKeyCloak @EnableTokenApi public static class TestApiControllerConfig { @Bean ObjectMapper objectMapper() { return new ObjectMapper(); } DetachedMockFactory mockFactory = new DetachedMockFactory() } } Conclusion With this setup, you have configured Testcontainers to run Keycloak and PostgreSQL within a Docker network, seeded the PostgreSQL database with a dump file, and created a base test class to manage the lifecycle of these containers. You can now write your integration tests extending this base class to ensure your Spring Security configuration works correctly with Keycloak.
In this short tutorial, I will show how to build a business application with the Business Process Model and Notation (BPMN). This approach differs from the usual data-centric approach in that we focus on process management instead of data processing.

Data Processing vs. Process Management

When we follow the classic approach of building a data-centric business application, we usually design a data schema first. The data schema defines what kind of data can be managed, and the application allows us to create new data sets, edit existing data, and, of course, search for data.

In a process-centric business application, we instead start by asking how data should be processed so that each actor has the best possible access to information to reach a specific business goal. This kind of question becomes more and more important in today's rapidly evolving business landscape. BPMN offers the perfect approach to model a workflow, with its business goals, from beginning to end. BPMN models can be created with various tools, such as the open-source BPMN designer Open-BPMN. The big advantage of BPMN is that it not only gives all stakeholders a clear understanding of the process, but a BPMN 2.0 model can also be executed by a suitable process engine. This "low-code" or "model-driven" approach leads to a much more flexible way to implement business applications. Of course, data still plays an important role, and workflow engines allow us to manage business data in various ways. So let's see how this works.

Start With a Business Process

First of all, we have to think about the business process behind our business app. Before we think about data, we should focus on questions like:

Why do we need this kind of information?
Who is responsible for creating or updating the data?
Which steps are necessary to process the data correctly?
What do we expect to happen next?

As mentioned before, a business process defines the way to achieve a concrete business goal. In a BPMN model, we can describe this way from its beginning to its end. See the following example of a simple "Proposal Creation Process":

The BPMN model defines a start event (green) and an end event (red) to mark the beginning and end of the process. An activity element (the blue boxes in this diagram) defines a single task in the business process; this can also be a milestone to be reached. An event (blue circle) defines the transition into a new task or status. An event can be triggered externally (e.g., "Order received from customer") or by an actor (e.g., "Proposal created and submitted for review"). In this way, we get a sequence flow – the workflow.

Business Rules

In addition to the sequence flow, we can define business rules to implement additional business logic. For example, we can define that a review in this example process is only needed for proposals with a bid amount over 1,000.00 EUR. For this, we can add an exclusive gateway with conditional flows to the diagram. In this example, the transition (sequence flow) between "Gateway-1" and the "Review" task now contains a condition, e.g.:

workitem.getItemValueDouble('amount')>=1000.0

A BPMN engine can evaluate these kinds of conditions with different script languages, based on the data provided in the workflow. The data can either be stored directly in the process instance or referenced from an external data source. The BPMN 2.0 standard defines many more elements that allow the modeling of even much more complex business processes.
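To make the relationship between the gateway condition and the process data concrete, here is a small sketch using the Imixs ItemCollection API that appears again later in this tutorial. The class name, the amount of 1,500.00 EUR, and the standalone main method are illustrative assumptions only.

import org.imixs.workflow.ItemCollection;

// Illustrates that the conditional sequence flow simply evaluates a value
// stored in the process instance (the "workitem").
public class GatewayConditionSketch {

    public static void main(String[] args) {
        ItemCollection workitem = new ItemCollection();
        workitem.setItemValue("amount", 1500.0);

        // The same expression used on the flow between Gateway-1 and the Review task:
        boolean needsReview = workitem.getItemValueDouble("amount") >= 1000.0;
        System.out.println("Route to Review task: " + needsReview); // prints true
    }
}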
Run Your Business Process Now let’s see how we can start and control such a BPMN process in an application by using a BPMN engine. There are various BPMN engines available and a lot of them are open source. See the list of awesome-workflow-engines maintained by @meirwah on GitHub. In the following, I use the Imixs-Workflow engine, which supports BPMN 2.0 and provides a Docker container that allows us to start out of the box without the need to write any code. With the Imixs-Microservice we can start the BPMN Workflow engine in a container. Just create a local docker-compose.yaml file with the following content: services: imixs-db: image: postgres:13.11 environment: POSTGRES_PASSWORD: adminadmin POSTGRES_DB: workflow-db imixs-app: image: imixs/imixs-microservice:latest environment: TZ: "CET" LANG: "en_US.UTF-8" POSTGRES_USER: "postgres" POSTGRES_PASSWORD: "adminadmin" POSTGRES_CONNECTION: "jdbc:postgresql://imixs-db/workflow-db" ports: - "8080:8080" Start the service with: $ docker compose up The service starts with a short welcome page at http://localhost:8080 and provides a REST interface at http://localhost:8080/api/openapi-ui/index.html. This REST API user interface allows us to test various methods to start a workflow or to process existing ones. Upload Your Model First, we need to upload our model. For this, we can use the curl command and post our BPMN file to the REST Service endpoint: $ curl --user admin:adminadmin --request POST \ -Tmy-model.bpmn http://localhost:8080/api/model/bpmn Or we can use the REST API UI to post the model at the resource /api/model/bpmn/. You can download the test model from GitHub or you create a new file my-model.bpmn with the following content: <?xml version="1.0" encoding="UTF-8" standalone="no"?> <!-- origin at X=0.0 Y=0.0 --><bpmn2:definitions xmlns:bpmn2="http://www.omg.org/spec/BPMN/20100524/MODEL" xmlns:bpmndi="http://www.omg.org/spec/BPMN/20100524/DI" xmlns:dc="http://www.omg.org/spec/DD/20100524/DC" xmlns:di="http://www.omg.org/spec/DD/20100524/DI" xmlns:ext="http://org.eclipse.bpmn2/ext" xmlns:imixs="http://www.imixs.org/bpmn2" xmlns:open-bpmn="http://open-bpmn.org/XMLSchema" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" exporter="org.eclipse.bpmn2.modeler.core" exporterVersion="1.5.2.SNAPSHOT-v20200526-1743-B1" id="Definitions_1" targetNamespace="http://www.imixs.org/bpmn2"> <bpmn2:extensionElements> <imixs:item name="txtworkflowmodelversion" type="xs:string"> <imixs:value><![CDATA[proposal-en-1.0]]></imixs:value> </imixs:item> <imixs:item name="txtfieldmapping" type="xs:string"> <imixs:value><![CDATA[Team|team]]></imixs:value> <imixs:value><![CDATA[Creator|$creator]]></imixs:value> <imixs:value><![CDATA[CurrentEditor|$editor]]></imixs:value> </imixs:item> <imixs:item name="txtplugins" type="xs:string"> <imixs:value><![CDATA[org.imixs.workflow.engine.plugins.OwnerPlugin]]></imixs:value> <imixs:value><![CDATA[org.imixs.workflow.engine.plugins.HistoryPlugin]]></imixs:value> <imixs:value><![CDATA[org.imixs.workflow.engine.plugins.ResultPlugin]]></imixs:value> <imixs:value><![CDATA[org.imixs.workflow.engine.plugins.LogPlugin]]></imixs:value> <imixs:value><![CDATA[org.imixs.workflow.engine.plugins.ApplicationPlugin]]></imixs:value> </imixs:item> <open-bpmn:auto-align>true</open-bpmn:auto-align> </bpmn2:extensionElements> <bpmn2:collaboration id="Collaboration_1" name="Collaboration 1"> <bpmn2:participant id="Participant_1" name="Proposal" processRef="Process_1"> <bpmn2:documentation 
id="documentation_0zO0SQ"><![CDATA[Proposal Creation and Review Process]]></bpmn2:documentation> </bpmn2:participant> <bpmn2:participant id="Participant_2" name="Ticket Pool" processRef="ticket"/> <bpmn2:association id="Association_2" sourceRef="CallConversation_1" targetRef="IntermediateCatchEvent_3"/> <bpmn2:callConversation id="CallConversation_1" name="Stock Service"/> </bpmn2:collaboration> <bpmn2:process id="ticket" isExecutable="false" name="Ticket"> <bpmn2:documentation id="documentation_hijN6w"/> </bpmn2:process> <bpmn2:process definitionalCollaborationRef="Collaboration_1" id="Process_1" isExecutable="false" name="Proposal"> <bpmn2:laneSet id="LaneSet_4" name="Lane Set 4"> <bpmn2:lane id="Lane_2" name="Team"> <bpmn2:flowNodeRef>Task_2</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>IntermediateCatchEvent_1</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>StartEvent_1</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>Task_1</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>IntermediateCatchEvent_7</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>Task_4</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>EndEvent_1</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>IntermediateCatchEvent_8</bpmn2:flowNodeRef> <bpmn2:documentation id="documentation_jDXHNg"/> <bpmn2:flowNodeRef>TextAnnotation_2</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>gateway_R0B00Q</bpmn2:flowNodeRef> <bpmn2:flowNodeRef>gateway_7zm0sw</bpmn2:flowNodeRef> </bpmn2:lane> </bpmn2:laneSet> <bpmn2:task id="Task_2" imixs:processid="1100" name="Review"> <bpmn2:extensionElements> <imixs:item name="txtworkflowsummary" type="xs:string"> <imixs:value><![CDATA[<itemvalue>subject</itemvalue> ]]></imixs:value> </imixs:item> <imixs:item name="keyupdateacl" type="xs:boolean"> <imixs:value>true</imixs:value> </imixs:item> <imixs:item name="keyownershipfields" type="xs:string"/> <imixs:item name="keyaddwritefields" type="xs:string"/> </bpmn2:extensionElements> <bpmn2:documentation id="documentation_gfOqDA"/> <bpmn2:outgoing>sequenceFlow_fCuqCw</bpmn2:outgoing> <bpmn2:incoming>sequenceFlow_7jbDFQ</bpmn2:incoming> </bpmn2:task> <bpmn2:endEvent id="EndEvent_1" name="End"> <bpmn2:incoming>SequenceFlow_12</bpmn2:incoming> <bpmn2:documentation id="documentation_XhiRag"/> </bpmn2:endEvent> <bpmn2:intermediateCatchEvent id="IntermediateCatchEvent_1" imixs:activityid="10" name="Submit"> <bpmn2:extensionElements> <imixs:item name="rtfresultlog" type="CDATA"> <imixs:value><![CDATA[Order submitted by <itemvalue>$Editor</itemvalue>]]></imixs:value> </imixs:item> <imixs:item name="txtactivityresult" type="CDATA"> <imixs:value><![CDATA[<item name="batch.event.id">20</item>]]></imixs:value> </imixs:item> <imixs:item name="keyupdateacl" type="xs:boolean"> <imixs:value>false</imixs:value> </imixs:item> <imixs:item name="keyownershipfields" type="xs:string"/> <imixs:item name="keyaddreadfields" type="xs:string"/> <imixs:item name="keyaddwritefields" type="xs:string"/> </bpmn2:extensionElements> <bpmn2:documentation id="Documentation_12"><b>Submit</b> a new ticket</bpmn2:documentation> <bpmn2:incoming>SequenceFlow_11</bpmn2:incoming> <bpmn2:outgoing>SequenceFlow_3</bpmn2:outgoing> <bpmn2:outputSet id="OutputSet_1" name="Output Set 1"/> </bpmn2:intermediateCatchEvent> <bpmn2:sequenceFlow id="SequenceFlow_3" sourceRef="IntermediateCatchEvent_1" targetRef="gateway_7zm0sw"> <bpmn2:documentation id="documentation_Go9yMg"/> </bpmn2:sequenceFlow> <bpmn2:task id="Task_4" imixs:processid="1900" name="Completed"> <bpmn2:extensionElements> <imixs:item name="txtworkflowsummary" type="xs:string"> 
<imixs:value><![CDATA[<itemvalue>subject</itemvalue> ]]></imixs:value> </imixs:item> <imixs:item name="keyupdateacl" type="xs:boolean"> <imixs:value>true</imixs:value> </imixs:item> </bpmn2:extensionElements> <bpmn2:outgoing>SequenceFlow_12</bpmn2:outgoing> <bpmn2:documentation id="documentation_kIH5yg"/> <bpmn2:incoming>sequenceFlow_BYx0Eg</bpmn2:incoming> <bpmn2:incoming>sequenceFlow_O00HWA</bpmn2:incoming> </bpmn2:task> <bpmn2:sequenceFlow id="SequenceFlow_12" sourceRef="Task_4" targetRef="EndEvent_1"> <bpmn2:documentation id="documentation_yNPUlA"/> </bpmn2:sequenceFlow> <bpmn2:startEvent id="StartEvent_1" name="Start"> <bpmn2:outgoing>SequenceFlow_1</bpmn2:outgoing> <bpmn2:documentation id="documentation_igq0Jw"/> </bpmn2:startEvent> <bpmn2:sequenceFlow id="SequenceFlow_1" sourceRef="StartEvent_1" targetRef="Task_1"> <bpmn2:documentation id="documentation_JM9HUQ"/> </bpmn2:sequenceFlow> <bpmn2:task id="Task_1" imixs:processid="1000" name="Create Draft"> <bpmn2:extensionElements> <imixs:item name="txtworkflowsummary" type="xs:string"> <imixs:value><![CDATA[<itemvalue>subject</itemvalue> ]]></imixs:value> </imixs:item> <imixs:item name="txtworkflowabstract" type="CDATA"> <imixs:value><![CDATA[Create a new Ticket workflow]]></imixs:value> </imixs:item> </bpmn2:extensionElements> <bpmn2:documentation id="Documentation_1">Create a new ticket</bpmn2:documentation> <bpmn2:incoming>SequenceFlow_1</bpmn2:incoming> <bpmn2:outgoing>SequenceFlow_11</bpmn2:outgoing> <bpmn2:incoming>sequenceFlow_jspJig</bpmn2:incoming> </bpmn2:task> <bpmn2:sequenceFlow id="SequenceFlow_11" sourceRef="Task_1" targetRef="IntermediateCatchEvent_1"> <bpmn2:documentation id="documentation_O8sN9Q"/> </bpmn2:sequenceFlow> <bpmn2:intermediateCatchEvent id="IntermediateCatchEvent_7" imixs:activityid="10" name="Approve"> <bpmn2:extensionElements> <imixs:item name="rtfresultlog" type="CDATA"> <imixs:value><![CDATA[Oder placed, payment initialized]]></imixs:value> </imixs:item> <imixs:item name="keyupdateacl" type="xs:boolean"> <imixs:value>false</imixs:value> </imixs:item> <imixs:item name="keyaddwritefields" type="xs:string"/> <imixs:item name="keypublicresult" type="xs:string"> <imixs:value><![CDATA[1]]></imixs:value> </imixs:item> </bpmn2:extensionElements> <bpmn2:documentation id="documentation_7H9ztw"/> <bpmn2:incoming>sequenceFlow_Wn3NGw</bpmn2:incoming> <bpmn2:outgoing>sequenceFlow_BYx0Eg</bpmn2:outgoing> </bpmn2:intermediateCatchEvent> <bpmn2:intermediateCatchEvent id="IntermediateCatchEvent_8" imixs:activityid="20" name="Reject"> <bpmn2:extensionElements> <imixs:item name="rtfresultlog" type="CDATA"> <imixs:value><![CDATA[ticket solved by <itemvalue>namcurrentEditor</itemvalue>]]></imixs:value> </imixs:item> <imixs:item name="keyupdateacl" type="xs:boolean"> <imixs:value>false</imixs:value> </imixs:item> <imixs:item name="keyaddwritefields" type="xs:string"/> <imixs:item name="keypublicresult" type="xs:string"> <imixs:value><![CDATA[1]]></imixs:value> </imixs:item> </bpmn2:extensionElements> <bpmn2:documentation id="documentation_gk0TWg"/> <bpmn2:incoming>sequenceFlow_GNxY5A</bpmn2:incoming> <bpmn2:outgoing>sequenceFlow_jspJig</bpmn2:outgoing> </bpmn2:intermediateCatchEvent> <bpmn2:textAnnotation id="TextAnnotation_2" textFormat=""> <bpmn2:text><![CDATA[Proposal Creation and Review Process with conditional events.]]></bpmn2:text> <bpmn2:documentation id="documentation_XzWNOw"/> </bpmn2:textAnnotation> <bpmn2:eventBasedGateway gatewayDirection="Diverging" id="gateway_R0B00Q" name="Gateway-2"> <bpmn2:documentation 
id="documentation_rJq1dg"/> <bpmn2:incoming>sequenceFlow_fCuqCw</bpmn2:incoming> <bpmn2:outgoing>sequenceFlow_GNxY5A</bpmn2:outgoing> <bpmn2:outgoing>sequenceFlow_Wn3NGw</bpmn2:outgoing> </bpmn2:eventBasedGateway> <bpmn2:sequenceFlow id="sequenceFlow_fCuqCw" sourceRef="Task_2" targetRef="gateway_R0B00Q"> <bpmn2:documentation id="documentation_GihGJA"/> </bpmn2:sequenceFlow> <bpmn2:sequenceFlow id="sequenceFlow_GNxY5A" sourceRef="gateway_R0B00Q" targetRef="IntermediateCatchEvent_8"> <bpmn2:documentation id="documentation_y73G7A"/> </bpmn2:sequenceFlow> <bpmn2:sequenceFlow id="sequenceFlow_jspJig" sourceRef="IntermediateCatchEvent_8" targetRef="Task_1"> <bpmn2:documentation id="documentation_AG6lqA"/> </bpmn2:sequenceFlow> <bpmn2:sequenceFlow id="sequenceFlow_Wn3NGw" sourceRef="gateway_R0B00Q" targetRef="IntermediateCatchEvent_7"> <bpmn2:documentation id="documentation_DgUkAA"/> </bpmn2:sequenceFlow> <bpmn2:sequenceFlow id="sequenceFlow_BYx0Eg" sourceRef="IntermediateCatchEvent_7" targetRef="Task_4"> <bpmn2:documentation id="documentation_Hdvb8g"/> </bpmn2:sequenceFlow> <bpmn2:exclusiveGateway default="sequenceFlow_O00HWA" gatewayDirection="Diverging" id="gateway_7zm0sw" name="Gateway-1"> <bpmn2:documentation id="documentation_1V8eMg"/> <bpmn2:incoming>SequenceFlow_3</bpmn2:incoming> <bpmn2:outgoing>sequenceFlow_7jbDFQ</bpmn2:outgoing> <bpmn2:outgoing>sequenceFlow_O00HWA</bpmn2:outgoing> </bpmn2:exclusiveGateway> <bpmn2:sequenceFlow id="sequenceFlow_7jbDFQ" name=">=1.000" sourceRef="gateway_7zm0sw" targetRef="Task_2"> <bpmn2:documentation id="documentation_L6UXlA"/> <bpmn2:conditionExpression id="formalExpression_SDqySQ" xsi:type="bpmn2:tFormalExpression"><![CDATA[workitem.getItemValueDouble('amount')>=1000.0]]></bpmn2:conditionExpression> </bpmn2:sequenceFlow> <bpmn2:sequenceFlow id="sequenceFlow_O00HWA" name="<1.000" sourceRef="gateway_7zm0sw" targetRef="Task_4"> <bpmn2:documentation id="documentation_3uCjHQ"/> </bpmn2:sequenceFlow> </bpmn2:process> <bpmndi:BPMNDiagram id="BPMNDiagram_1" name="Default Process Diagram"> <bpmndi:BPMNPlane bpmnElement="Collaboration_1" id="BPMNPlane_1"> <bpmndi:BPMNShape bpmnElement="Participant_1" id="BPMNShape_Participant_1" isHorizontal="true"> <dc:Bounds height="330.0" width="1210.0" x="100.0" y="150.0"/> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="Lane_2" id="BPMNShape_Lane_2" isHorizontal="true"> <dc:Bounds height="330.0" width="1180.0" x="130.0" y="150.0"/> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="StartEvent_1" id="BPMNShape_1"> <dc:Bounds height="36.0" width="36.0" x="207.0" y="317.0"/> <bpmndi:BPMNLabel id="BPMNLabel_1" labelStyle="BPMNLabelStyle_1"> <dc:Bounds height="20.0" width="100.0" x="175.5" y="356.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="EndEvent_1" id="BPMNShape_2"> <dc:Bounds height="36.0" width="36.0" x="1237.0" y="317.0"/> <bpmndi:BPMNLabel id="BPMNLabel_2" labelStyle="BPMNLabelStyle_1"> <dc:Bounds height="20.0" width="100.0" x="1208.0" y="356.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="Task_1" id="BPMNShape_Task_1"> <dc:Bounds height="50.0" width="110.0" x="300.0" y="310.0"/> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="Task_2" id="BPMNShape_Task_2"> <dc:Bounds height="50.0" width="110.0" x="680.0" y="310.0"/> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="Task_4" id="BPMNShape_Task_4"> <dc:Bounds height="50.0" width="110.0" x="1070.0" y="310.0"/> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="IntermediateCatchEvent_1" 
id="BPMNShape_IntermediateCatchEvent_1"> <dc:Bounds height="36.0" width="36.0" x="457.0" y="317.0"/> <bpmndi:BPMNLabel id="BPMNLabel_8" labelStyle="BPMNLabelStyle_1"> <dc:Bounds height="20.0" width="100.0" x="422.0" y="358.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="CallConversation_1" id="BPMNShape_CallConversation_1"> <dc:Bounds height="50.0" width="58.0" x="679.0" y="560.0"/> <bpmndi:BPMNLabel id="BPMNLabel_38"> <dc:Bounds height="14.0" width="74.0" x="671.0" y="610.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="IntermediateCatchEvent_7" id="BPMNShape_IntermediateCatchEvent_7"> <dc:Bounds height="36.0" width="36.0" x="967.0" y="317.0"/> <bpmndi:BPMNLabel id="BPMNLabel_43"> <dc:Bounds height="20.0" width="100.0" x="938.0" y="356.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="IntermediateCatchEvent_8" id="BPMNShape_IntermediateCatchEvent_8"> <dc:Bounds height="36.0" width="36.0" x="867.0" y="397.0"/> <bpmndi:BPMNLabel id="BPMNLabel_44"> <dc:Bounds height="20.0" width="100.0" x="834.5" y="436.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNShape bpmnElement="TextAnnotation_2" id="BPMNShape_TextAnnotation_2"> <dc:Bounds height="89.0" width="176.0" x="183.0" y="166.0"/> </bpmndi:BPMNShape> <bpmndi:BPMNEdge bpmnElement="SequenceFlow_1" id="BPMNEdge_SequenceFlow_1" sourceElement="BPMNShape_1" targetElement="BPMNShape_Task_1"> <bpmndi:BPMNLabel id="BPMNLabel_3" labelStyle="BPMNLabelStyle_1"/> <di:waypoint x="243.0" y="335.0"/> <di:waypoint x="300.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="SequenceFlow_3" id="BPMNEdge_SequenceFlow_3" sourceElement="BPMNShape_IntermediateCatchEvent_1" targetElement="BPMNShape_D6F8wQ"> <bpmndi:BPMNLabel id="BPMNLabel_12" labelStyle="BPMNLabelStyle_1"/> <di:waypoint x="493.0" y="335.0"/> <di:waypoint x="540.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="SequenceFlow_12" id="BPMNEdge_SequenceFlow_12" sourceElement="BPMNShape_Task_4" targetElement="BPMNShape_2"> <bpmndi:BPMNLabel id="BPMNLabel_26" labelStyle="BPMNLabelStyle_1"/> <di:waypoint x="1180.0" y="335.0"/> <di:waypoint x="1237.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="SequenceFlow_11" id="BPMNEdge_SequenceFlow_11" sourceElement="BPMNShape_Task_1" targetElement="BPMNShape_IntermediateCatchEvent_1"> <bpmndi:BPMNLabel id="BPMNLabel_25"/> <di:waypoint x="410.0" y="335.0"/> <di:waypoint x="457.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="Association_2" id="BPMNEdge_Association_2" sourceElement="BPMNShape_CallConversation_1" targetElement="BPMNShape_IntermediateCatchEvent_3"> <di:waypoint x="708.0" xsi:type="dc:Point" y="560.0"/> <di:waypoint x="708.0" xsi:type="dc:Point" y="458.0"/> <di:waypoint x="708.0" xsi:type="dc:Point" y="356.0"/> <bpmndi:BPMNLabel id="BPMNLabel_39"/> </bpmndi:BPMNEdge> <bpmndi:BPMNShape bpmnElement="gateway_R0B00Q" id="BPMNShape_LOkyeA"> <dc:Bounds height="50.0" width="50.0" x="860.0" y="310.0"/> <bpmndi:BPMNLabel id="BPMNLabel_9Q0Elg"> <dc:Bounds height="20.0" width="100.0" x="834.0" y="366.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_fCuqCw" id="BPMNEdge_Og8YJw" sourceElement="BPMNShape_Task_2" targetElement="BPMNShape_LOkyeA"> <di:waypoint x="790.0" y="335.0"/> <di:waypoint x="860.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_GNxY5A" id="BPMNEdge_dksumA" sourceElement="BPMNShape_LOkyeA" targetElement="BPMNShape_IntermediateCatchEvent_8"> 
<di:waypoint x="885.0" y="360.0"/> <di:waypoint x="885.0" y="397.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_jspJig" id="BPMNEdge_9kE0mA" sourceElement="BPMNShape_IntermediateCatchEvent_8" targetElement="BPMNShape_Task_1"> <di:waypoint x="867.0" y="415.0"/> <di:waypoint x="355.0" y="415.0"/> <di:waypoint x="355.0" y="360.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_Wn3NGw" id="BPMNEdge_MDnJhQ" sourceElement="BPMNShape_LOkyeA" targetElement="BPMNShape_IntermediateCatchEvent_7"> <di:waypoint x="910.0" y="335.0"/> <di:waypoint x="967.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_BYx0Eg" id="BPMNEdge_EV7fFQ" sourceElement="BPMNShape_IntermediateCatchEvent_7" targetElement="BPMNShape_Task_4"> <di:waypoint x="1003.0" y="335.0"/> <di:waypoint x="1070.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNShape bpmnElement="gateway_7zm0sw" id="BPMNShape_D6F8wQ"> <dc:Bounds height="50.0" width="50.0" x="540.0" y="310.0"/> <bpmndi:BPMNLabel id="BPMNLabel_KzYbZw"> <dc:Bounds height="20.0" width="100.0" x="515.0" y="363.0"/> </bpmndi:BPMNLabel> </bpmndi:BPMNShape> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_7jbDFQ" id="BPMNEdge_r6zgwg" sourceElement="BPMNShape_D6F8wQ" targetElement="BPMNShape_Task_2"> <di:waypoint x="590.0" y="335.0"/> <di:waypoint x="680.0" y="335.0"/> </bpmndi:BPMNEdge> <bpmndi:BPMNEdge bpmnElement="sequenceFlow_O00HWA" id="BPMNEdge_spt5Tg" sourceElement="BPMNShape_D6F8wQ" targetElement="BPMNShape_Task_4"> <di:waypoint x="565.0" y="310.0"/> <di:waypoint x="565.0" y="225.0"/> <di:waypoint x="1124.0" y="225.0"/> <di:waypoint x="1124.0" y="310.0"/> </bpmndi:BPMNEdge> </bpmndi:BPMNPlane> <bpmndi:BPMNLabelStyle id="BPMNLabelStyle_1"> <dc:Font name="arial" size="9.0"/> </bpmndi:BPMNLabelStyle> </bpmndi:BPMNDiagram> </bpmn2:definitions> The status of all available models can be verified at the REST API endpoint, http://localhost:8080/api/model. Note: I use the user ID admin with the password adminadmin here. More users with different roles are defined by this service. Find the details on the GitHub repo linked earlier. Start a New Process Instance Now that the service is up and running, we can start our first process instance. For this, we just need to post an XML document containing the model information and our business data. We can use the following REST API Endpoint to post a new process instance: POST: /api/workflow/workitem. <document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <item name="$modelversion"><value xsi:type="xs:string">proposal-en-1.0</value></item> <item name="$taskid"><value xsi:type="xs:int">1000</value></item> <item name="$eventid"><value xsi:type="xs:int">10</value></item> <item name="subject"> <value xsi:type="xs:string">My first propsal...</value> </item> <item name="amount"> <value xsi:type="xs:double">500.0</value> </item> </document> Again, you can post the data directly with the REST API UI. In this example, I define the modelversion, the initial task and the event to be processed, as well as some custom business data (subject and amount). The workflow engine automatically executes the data according to our uploaded BPMN model and returns a result object. To verify the result we can check the tasklist with all process instances created by the user admin at http://localhost:8080/api/workflow/tasklist/creator/admin. As you can see, the workflow engine has processed our data and applied the status Completed to our new workflow instance. 
The instance was set to Completed because the amount was below 1,000.00. To test our business logic, we can now change the amount to a value greater than 1,000.00, which will start another new process instance in the status Review, according to our business rules. We can easily change the process logic and add new tasks or business rules at any time. All we have to do is upload a new version of the BPMN model; no redeployment of the application is needed. To update the data of an existing process instance, we post the data together with the $uniqueid and the corresponding event from our model:

<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <item name="$uniqueid"><value xsi:type="xs:string">8333c61c-b973-4591-aabf-ec92bd59d74b</value></item>
 <item name="$eventid"><value xsi:type="xs:int">10</value></item>
 <item name="address">
  <value xsi:type="xs:string">Baker Street 221b.</value>
 </item>
</document>

The field $uniqueid is returned when a new workflow instance is first created and serves as the reference for all further processing steps. In this example, I provide only the reference ID, an event ID, and some new data. The workflow engine will verify the request and the event against the assigned model. This ensures that the data is always processed according to our business process! We can also use the $uniqueid to fetch the data from the workflow engine – e.g., to display it in a web interface. How To Integrate BPMN Into Your App In this tutorial, I have deliberately omitted the design of a web interface because I wanted to focus on the business logic. When using a REST API – as in this example – the workflow engine can easily be integrated into an application using different frontend technologies. In addition, there are often other ways to connect a BPMN engine to your own application, depending on the framework used by the engine. For example, many open-source engines are based on Java and can be integrated via common build tools like Maven or Gradle. You can integrate the Imixs Workflow engine into a Jakarta EE app with the following dependencies:

...
<dependency>
    <groupId>org.imixs.workflow</groupId>
    <artifactId>imixs-workflow-engine</artifactId>
    <version>${org.imixs.workflow.version}</version>
</dependency>
<dependency>
    <groupId>org.imixs.workflow</groupId>
    <artifactId>imixs-workflow-index-lucene</artifactId>
    <version>${org.imixs.workflow.version}</version>
</dependency>
...

Call the engine within your code using the CDI framework like this:

@Inject
private org.imixs.workflow.engine.WorkflowService workflowService;

ItemCollection workitem = new ItemCollection().model("proposal-en-1.0").task(1000).event(10);
// assign some business data...
workitem.setItemValue("amount", 500.0);
// process the workitem
workitem = workflowService.processWorkItem(workitem);

This code example is equivalent to the REST API call shown above. Data Integration As mentioned before, another aspect is the way you handle your business data. In my example code, I embedded the business data directly into the workflow instance, so no external database was needed. But of course, in some scenarios it may be useful to store only a reference to an external dataset and keep the business data in a separate database. Or you may reference the workflow instance from your data processing application by extending its data schema.
In the latter two scenarios, we also need to provide a data service so that the process engine can apply the business logic described earlier to that data. All modern BPMN engines provide interfaces to connect a data source with the process logic. Conclusion In summary, incorporating BPMN 2.0 into your business application offers a structured and efficient way to handle your business processes. The ability to model processes visually not only simplifies the design and management of complex workflows but also enhances collaboration across technical and non-technical teams. By using tools like Open-BPMN or Imixs-Workflow, you can easily implement the BPMN 2.0 standard, ensuring that your business applications are both scalable and adaptable. Embracing BPMN 2.0 can lead to more organized, transparent, and effective business operations, setting the stage for greater success and growth.
This is Part 2, a continuation of Javac and Java Katas, Part 1: Class Path, where we will run through the same exercises (katas) but this time the main focus will be the usage of the Java Platform Module System. Getting Started As in Part 1, all commands in this article are executed inside a Docker container to make sure that they work and to mitigate any environment-specific setup. So, let's clone the GitHub repository and run the command below from its java-javac-kata folder: Shell docker run --rm -it --name java_kata -v .:/java-javac-kata --entrypoint /bin/bash maven:3.9.6-amazoncorretto-17-debian Kata 1: "Hello, World!" Warm Up We will start with a primitive Java application, /module-path-part/kata-one-hello-world-warm-up, which does not have any third-party dependencies. The directory structure is as follows: In the picture above, we can see the Java project package hierarchy with two classes in the com.example.kata.one package and the module-info.java file which is a module declaration. Compilation To compile our code, we are going to use javac in the single-module mode, which implies that the module-source-path option is not used: Shell javac -d ./target/classes $(find -name '*.java') As a result, the compiled Java classes should appear in the target/classes folder. The verbose option can provide more details on the compilation process: Shell javac -verbose -d ./target/classes $(find -name '*.java') We can also obtain the compiled module description as follows: Shell java --describe-module com.example.kata.one --module-path target/classes Execution Shell java --module-path target/classes --module com.example.kata.one/com.example.kata.one.Main What should result in Hello World! in your console. Various verbose:[class|module|gc|jni] options can provide more details on the execution process: Shell java -verbose:module --module-path target/classes --module com.example.kata.one/com.example.kata.one.Main Also, experimenting a bit during both the compilation and execution stages, by removing or changing classes and packages, should give you a good understanding of which issues lead to particular errors. Packaging Building Modular JAR According to JEP 261: Module System, "A modular JAR file is like an ordinary JAR file in all possible ways, except that it also includes a module-info.class file in its root directory. " With that in mind, let's build one: Shell jar --create --file ./target/hello-world-warm-up.jar -C target/classes/ . The jar file is placed in the target folder. Also, using the verbose option can give us more details: Shell jar --verbose --create --file ./target/hello-world-warm-up.jar -C target/classes/ . You can view the structure of the built jar by using the following command: Shell jar -tf ./target/hello-world-warm-up.jar And get a module description of the modular jar: Shell jar --describe-module --file ./target/hello-world-warm-up.jar Additionally, we can launch the Java class dependency analyzer, jdeps, to gain even more insight: Shell jdeps ./target/hello-world-warm-up.jar As usual, there is the verbose option, too: Shell jdeps -verbose ./target/hello-world-warm-up.jar With that, let's proceed to run our modular jar: Shell java --module-path target/hello-world-warm-up.jar --module com.example.kata.one/com.example.kata.one.Main Building Modular Jar With the Main Class Shell jar --create --file ./target/hello-world-warm-up.jar --main-class=com.example.kata.one.Main -C target/classes/ . 
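Before moving on, it is worth glancing at the module declaration itself, since everything in this kata revolves around it. The actual module-info.java lives in the kata's repository; a minimal declaration consistent with the module and package names used above would look roughly like this:

// module-info.java at the root of the source tree
// (a sketch; the repository version may additionally export the package)
module com.example.kata.one {
    // the warm-up has no third-party dependencies, so no 'requires' directives are needed
}

With the declaration in mind, let's get back to the modular JAR we just built with a main class.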
Having specified the main-class, we can run our app by omitting the <main-class> part in the module option: Shell java --module-path target/hello-world-warm-up.jar --module com.example.kata.one Kata 2: Third-Party Dependency Let's navigate to the /module-path-part/kata-two-third-party-dependency project and examine its structure. This kata is also a Hello World! application, but with a third-party dependency, guava-30.1-jre.jar, which has an automatic module name, com.google.common. You can check its name by using the describe-module option: Shell jar --describe-module --file lib/guava-30.1-jre.jar Compilation Shell javac --module-path lib -d ./target/classes $(find -name '*.java') The module-path option points to the lib folder that contains our dependency. Execution Shell java --module-path "target/classes:lib" --module com.example.kata.two/com.example.kata.two.Main Building Modular Jar Shell jar --create --file ./target/third-party-dependency.jar --main-class=com.example.kata.two.Main -C target/classes/ . Now, we can run our application as follows: Shell java --module-path "lib:target/third-party-dependency.jar" --module com.example.kata.two Kata 3: Spring Boot Application Conquest In the /module-path-part/kata-three-spring-boot-app-conquest folder, you will find a Maven project for a primitive Spring Boot application. To get started with this exercise, we need to execute the script below. Shell ./kata-three-set-up.sh The main purpose of this script is to download all necessary dependencies into the ./target/lib folder and remove all other files in the ./target directory. As seen in the picture above, the ./target/lib folder has three subdirectories. The test directory contains all test dependencies. The automatic-module directory stores dependencies referenced by the module declaration. The remaining dependencies used by the application are put into the unnamed-module directory. The intention of this separation will become clearer as we proceed. Compilation Shell javac --module-path target/lib/automatic-module -d ./target/classes/ $(find -P ./src/main/ -name '*.java') Note that for compilation, we only need the modules specified in the module-info.java, which are stored in the automatic-module directory. Execution Shell java --module-path "target/classes:target/lib/automatic-module" \ --class-path "target/lib/unnamed-module/*" \ --add-modules java.instrument \ --module com.example.kata.three/com.example.kata.three.Main As a result, you should see the application running. For a better understanding of how the class-path option works here together with the module-path, I recommend reading section 3.1, "The unnamed module," of "The State of the Module System." Building Modular Jar Let's package our compiled code as a modular jar, with the main class specified: Shell jar --create --file ./target/spring-boot-app-conquest.jar --main-class=com.example.kata.three.Main -C target/classes/ . Now, we can run it: Shell java --module-path "target/spring-boot-app-conquest.jar:target/lib/automatic-module" \ --class-path "target/lib/unnamed-module/*" \ --add-modules java.instrument \ --module com.example.kata.three Test Compilation For simplicity's sake, we will use the class path approach to run tests here. There's little benefit in struggling with tweaks to the module system and adding additional options to make the tests work.
With that, let's compile our test code: Shell javac --class-path "./target/classes:./target/lib/automatic-module/*:./target/lib/test/*" -d ./target/test-classes/ $(find -P ./src/test/ -name '*.java') Test Execution Shell java --class-path "./target/classes:./target/test-classes:./target/lib/automatic-module/*:./target/lib/unnamed-module/*:./target/lib/test/*" \ org.junit.platform.console.ConsoleLauncher execute --scan-classpath --disable-ansi-colors For more details, you can have a look at Part 1 of this series (linked in the introduction), which elaborates on the theoretical aspect of this command. Wrapping Up That's it. I hope you found this useful, and that these exercises have provided you with some practical experience regarding the nuances of the Java Platform Module System.
What Is Data Governance? Data governance is a framework that is developed through the collaboration of individuals with various roles and responsibilities. This framework aims to establish the processes, policies, procedures, standards, and metrics that help organizations achieve their goals. These goals include providing reliable data for business operations, establishing accountability and authoritativeness, developing accurate analytics to assess performance, complying with regulatory requirements, safeguarding data, ensuring data privacy, and supporting the data management life cycle. Creating a Data Governance Board or Steering Committee is a good first step when introducing a Data Governance program and framework. An organization's governance framework should be circulated to all staff and management so everyone understands the changes taking place. Several basic concepts are needed to successfully govern data and analytics applications: A focus on business value and the organization's goals An agreement on who is responsible for data and who makes decisions A model emphasizing data curation and data lineage for Data Governance Decision-making that is transparent and includes ethical principles Core governance components such as data security and risk management Ongoing training, with monitoring and feedback on its effectiveness A collaborative workplace culture, using Data Governance to encourage broad participation What Is Data Integration? Data integration is the process of combining and harmonizing data from multiple sources into a unified, coherent format that various users can consume for operational, analytical, and decision-making purposes. The data integration process consists of four critical components: 1. Source Systems Source systems, such as databases, file systems, Internet of Things (IoT) devices, media repositories, and cloud data storage, provide the raw information that must be integrated. The heterogeneity of these source systems results in data that can be structured, semi-structured, or unstructured. Databases: Centralized or distributed repositories designed to store, organize, and manage structured data. Examples include relational database management systems (RDBMS) like MySQL, PostgreSQL, and Oracle. Data is typically stored in tables with predefined schemas, ensuring consistency and ease of querying. File systems: Hierarchical structures that organize and store files and directories on disk drives or other storage media. Common file systems include NTFS (Windows), APFS (macOS), and EXT4 (Linux). Data can be of any type, including structured, semi-structured, or unstructured. Internet of Things (IoT) devices: Physical devices (sensors, actuators, etc.) that are embedded with electronics, software, and network connectivity. IoT devices collect, process, and transmit data, enabling real-time monitoring and control. Data generated by IoT devices can be structured (e.g., sensor readings), semi-structured (e.g., device configuration), or unstructured (e.g., video footage). Media repositories: Platforms or systems designed to manage and store various types of media files. Examples include content management systems (CMS) and digital asset management (DAM) systems. Data in media repositories can include images, videos, audio files, and documents. Cloud data storage: Services that provide on-demand storage and management of data online.
Popular cloud data storage platforms include Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage. Data in cloud storage can be accessed and processed from anywhere with an internet connection. 2. Data Acquisition Data acquisition involves extracting and collecting information from source systems. Different methods can be employed based on the source system's nature and specific requirements. These methods include batch processes built on technologies like ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform), APIs (Application Programming Interfaces), streaming, virtualization, data replication, and data sharing. Batch processes: Batch processes are commonly used for structured data. In this method, data is accumulated over a period of time and processed in bulk. This approach is advantageous for large datasets and ensures data consistency and integrity. Application Programming Interface (API): APIs serve as a communication channel between applications and data sources. They allow for controlled and secure access to data. APIs are commonly used to integrate with third-party systems and enable data exchange. Streaming: Streaming involves continuous data ingestion and processing. It is commonly used for real-time data sources such as sensor networks, social media feeds, and financial markets. Streaming technologies enable immediate analysis and decision-making based on the latest data. Virtualization: Data virtualization provides a logical view of data without physically moving or copying it. It enables seamless access to data from multiple sources, irrespective of their location or format. Virtualization is often used for data integration and reducing data silos. Data replication: Data replication involves copying data from one system to another. It enhances data availability and redundancy. Replication can be synchronous, where data is copied in real time, or asynchronous, where data is copied at regular intervals. Data sharing: Data sharing involves granting authorized users or systems access to data. It facilitates collaboration, enables insights from multiple perspectives, and supports informed decision-making. Data sharing can be implemented through various mechanisms such as data portals, data lakes, and federated databases. 3. Data Storage After data acquisition, storing data in a repository is crucial for efficient access and management. Various data storage options are available, each tailored to specific needs. These options include: Database Management Systems (DBMS): Relational Database Management Systems (RDBMS) are software systems designed to organize, store, and retrieve data in a structured format. These systems offer advanced features such as data security, data integrity, and transaction management. Examples of popular RDBMS include MySQL, Oracle, and PostgreSQL. NoSQL databases, such as MongoDB and Cassandra, are designed to store and manage semi-structured data. They offer flexibility and scalability, making them suitable for handling large amounts of data that may not fit well into a relational model. Cloud storage services: Cloud storage services offer scalable and cost-effective storage solutions in the cloud. They provide on-demand access to data from anywhere with an internet connection. Popular cloud storage services include Amazon S3, Microsoft Azure Storage, and Google Cloud Storage. Data lakes: Data lakes are large repositories of raw and unstructured data in their native format.
They are often used for big data analytics and machine learning. Data lakes can be implemented using Hadoop Distributed File System (HDFS) or cloud-based storage services. Delta lakes: Delta lakes are a type of data lake that supports ACID transactions and schema evolution. They provide a reliable and scalable data storage solution for data engineering and analytics workloads. Cloud data warehouse: Cloud data warehouses are cloud-based data storage solutions designed for business intelligence and analytics. They provide fast query performance and scalability for large volumes of structured data. Examples include Amazon Redshift, Google BigQuery, and Snowflake. Big data files: Big data files are large collections of data stored in a single file. They are often used for data analysis and processing tasks. Common big data file formats include Parquet, Apache Avro, and Apache ORC. On-premises Storage Area Networks (SAN): SANs are dedicated high-speed networks designed for data storage. They offer fast data transfer speeds and provide centralized storage for multiple servers. SANs are typically used in enterprise environments with large storage requirements. Network Attached Storage (NAS): NAS devices are file-level storage systems that connect to a network and provide shared storage space for multiple clients. They are often used in small and medium-sized businesses and offer easy access to data from various devices. Choosing the right data storage option depends on factors such as data size, data type, performance requirements, security needs, and cost considerations. Organizations may use a combination of these storage options to meet their specific data management needs. 4. Consumption This is the final stage of the data integration lifecycle, where the integrated data is consumed by various applications, data analysts, business analysts, data scientists, AI/ML models, and business processes. The data can be consumed in various forms and through various channels, including: Operational systems: The integrated data can be consumed by operational systems using APIs (Application Programming Interfaces) to support day-to-day operations and decision-making. For example, a customer relationship management (CRM) system may consume data about customer interactions, purchases, and preferences to provide personalized experiences and targeted marketing campaigns. Analytics: The integrated data can be consumed by analytics applications and tools for data exploration, analysis, and reporting. Data analysts and business analysts use these tools to identify trends, patterns, and insights from the data, which can help inform business decisions and strategies. Data sharing: The integrated data can be shared with external stakeholders, such as partners, suppliers, and regulators, through data-sharing platforms and mechanisms. Data sharing enables organizations to collaborate and exchange information, which can lead to improved decision-making and innovation. Kafka: Kafka is a distributed streaming platform that can be used to consume and process real-time data. Integrated data can be streamed into Kafka, where it can be consumed by applications and services that require real-time data processing capabilities. AI/ML: The integrated data can be consumed by AI (Artificial Intelligence) and ML (Machine Learning) models for training and inference. 
AI/ML models use the data to learn patterns and make predictions, which can be used for tasks such as image recognition, natural language processing, and fraud detection. The consumption of integrated data empowers businesses to make informed decisions, optimize operations, improve customer experiences, and drive innovation. By providing a unified and consistent view of data, organizations can unlock the full potential of their data assets and gain a competitive advantage. What Are Data Integration Architecture Patterns? In this section, we will delve into an array of integration patterns, each tailored to provide seamless integration solutions. These patterns act as structured frameworks, facilitating connections and data exchange between diverse systems. Broadly, they fall into three categories: Real-Time Data Integration Near Real-Time Data Integration Batch Data Integration 1. Real-Time Data Integration In various industries, real-time data ingestion serves as a pivotal element. Let's explore some practical real-life illustrations of its applications: Social media feeds display the latest posts, trends, and activities. Smart homes use real-time data to automate tasks. Banks use real-time data to monitor transactions and investments. Transportation companies use real-time data to optimize delivery routes. Online retailers use real-time data to personalize shopping experiences. Understanding real-time data ingestion mechanisms and architectures is vital for choosing the best approach for your organization. Indeed, there's a wide range of Real-Time Data Integration Architectures to choose from. Among them most commonly used architectures are: Streaming-Based Architecture Event-Driven Integration Architecture Lambda Architecture Kappa Architecture Each of these architectures offers its unique advantages and use cases, catering to specific requirements and operational needs. a. Streaming-Based Data Integration Architecture In a streaming-based architecture, data streams are continuously ingested as they arrive. Tools like Apache Kafka are employed for real-time data collection, processing, and distribution. This architecture is ideal for handling high-velocity, high-volume data while ensuring data quality and low-latency insights. Streaming-based architecture, powered by Apache Kafka, revolutionizes data processing. It involves continuous data ingestion, enabling real-time collection, processing, and distribution. This approach facilitates real-time data processing, handles large volumes of data, and prioritizes data quality and low-latency insights. The diagram below illustrates the various components involved in a streaming data integration architecture. b. Event-Driven Integration Architecture An event-driven architecture is a highly scalable and efficient approach for modern applications and microservices. This architecture responds to specific events or triggers within a system by ingesting data as the events occur, enabling the system to react quickly to changes. This allows for efficient handling of large volumes of data from various sources. c. Lambda Integration Architecture The Lambda architecture embraces a hybrid approach, skillfully blending the strengths of batch and real-time data ingestion. It comprises two parallel data pipelines, each with a distinct purpose. The batch layer expertly handles the processing of historical data, while the speed layer swiftly addresses real-time data. 
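Whether in a purely streaming-based design or in the speed layer of a Lambda setup, the ingestion side usually boils down to a small consumer loop. The sketch below uses the Apache Kafka Java client; the broker address, topic, and group ID are illustrative placeholders, not values from a real deployment:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SpeedLayerConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "integration-speed-layer");  // illustrative group ID
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("source-events"));  // illustrative topic
            while (true) {                                 // runs until the process is stopped
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // transform/enrich the event and hand it to the serving or storage layer
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}

In a Lambda architecture, a loop like this sits in the speed layer, while the batch layer periodically reprocesses the full history.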
This architectural design ensures low-latency insights, upholding data accuracy and consistency even in extensive distributed systems. d. Kappa Data Integration Architecture Kappa architecture is a simplified variation of Lambda architecture specifically designed for real-time data processing. It employs a solitary stream processing engine, such as Apache Flink or Apache Kafka Streams, to manage both historical and real-time data, streamlining the data ingestion pipeline. This approach minimizes complexity and maintenance expenses while simultaneously delivering rapid and precise insights. 2. Near Real-Time Data Integration In near real-time data integration, the data is processed and made available shortly after it is generated, which is critical for applications requiring timely data updates. Several patterns are used for near real-time data integration, a few of them have been highlighted below: a. Change Data Capture — Data Integration Change Data Capture (CDC) is a method of capturing changes that occur in a source system's data and propagating those changes to a target system. b. Data Replication — Data Integration Architecture With the Data Replication Integration Architecture, two databases can seamlessly and efficiently replicate data based on specific requirements. This architecture ensures that the target database stays in sync with the source database, providing both systems with up-to-date and consistent data. As a result, the replication process is smooth, allowing for effective data transfer and synchronization between the two databases. c. Data Virtualization — Data Integration Architecture In Data Virtualization, a virtual layer integrates disparate data sources into a unified view. It eliminates data replication, dynamically routes queries to source systems based on factors like data locality and performance, and provides a unified metadata layer. The virtual layer simplifies data management, improves query performance, and facilitates data governance and advanced integration scenarios. It empowers organizations to leverage their data assets effectively and unlock their full potential. 3. Batch Process: Data Integration Batch Data Integration involves consolidating and conveying a collection of messages or records in a batch to minimize network traffic and overhead. Batch processing gathers data over a period of time and then processes it in batches. This approach is particularly beneficial when handling large data volumes or when the processing demands substantial resources. Additionally, this pattern enables the replication of master data to replica storage for analytical purposes. The advantage of this process is the transmission of refined results. The traditional batch process data integration patterns are: Traditional ETL Architecture — Data Integration Architecture This architectural design adheres to the conventional Extract, Transform, and Load (ETL) process. Within this architecture, there are several components: Extract: Data is obtained from a variety of source systems. Transform: Data undergoes a transformation process to convert it into the desired format. Load: Transformed data is then loaded into the designated target system, such as a data warehouse. Incremental Batch Processing — Data Integration Architecture This architecture optimizes processing by focusing only on new or modified data from the previous batch cycle. This approach enhances efficiency compared to full batch processing and alleviates the burden on the system's resources. 
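To make the incremental idea concrete, here is a minimal JDBC sketch that extracts only the rows changed since the previous batch cycle. The connection string, table, and column names are hypothetical, and a suitable JDBC driver is assumed to be on the class path:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

public class IncrementalBatchExtract {
    public static void main(String[] args) throws Exception {
        // Watermark: the highest updated_at value handled by the previous batch cycle.
        // In a real pipeline this value is persisted (e.g., in a control table), not hard-coded.
        Timestamp lastWatermark = Timestamp.from(Instant.parse("2024-07-01T00:00:00Z"));

        try (Connection source = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/sales", "etl_user", "secret")) {  // hypothetical source DB

            String sql = "SELECT id, customer_id, amount, updated_at "
                       + "FROM orders WHERE updated_at > ? ORDER BY updated_at";    // hypothetical table
            try (PreparedStatement stmt = source.prepareStatement(sql)) {
                stmt.setTimestamp(1, lastWatermark);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // transform the changed row and load it into the target system here
                        Timestamp updatedAt = rs.getTimestamp("updated_at");
                        if (updatedAt.after(lastWatermark)) {
                            lastWatermark = updatedAt;  // advance the watermark
                        }
                    }
                }
            }
        }
        // Persist lastWatermark so that the next cycle picks up only newer changes.
    }
}

Full batch processing, by contrast, would simply drop the WHERE clause and reload everything on each run.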
Micro Batch Processing — Data Integration Architecture In Micro Batch Processing, small batches of data are processed at regular, frequent intervals. It strikes a balance between traditional batch processing and real-time processing. This approach significantly reduces latency compared to conventional batch processing techniques, providing a notable advantage. Partitioned Batch Processing — Data Integration Architecture In this partitioned batch processing approach, voluminous datasets are strategically divided into smaller, manageable partitions. These partitions can then be processed independently and efficiently, frequently leveraging the power of parallelism. This methodology offers a compelling advantage by reducing processing time significantly, making it an attractive choice for handling large-scale data. Conclusion Here are the main points to take away from this article: It's important to have a strong data governance framework in place when integrating data from different source systems. The data integration patterns should be selected based on the use case characteristics, such as volume, velocity, and veracity. There are three broad styles of data integration (real-time, near real-time, and batch), and we should choose the appropriate model based on these parameters.
Ever wondered how Netflix keeps you glued to your screen with uninterrupted streaming bliss? Netflix Architecture is responsible for the smooth streaming experience that attracts viewers worldwide behind the scenes. Netflix's system architecture emphasizes how important it is to determine how content is shaped in the future. Join us on a journey behind the scenes of Netflix’s streaming universe! Netflix is a term that means entertainment, binge-watching, and cutting-edge streaming services. Netflix’s rapid ascent to popularity may be attributed to its vast content collection, worldwide presence, and resilient and inventive architecture. From its start as a DVD rental service in 1997 to its development into a major worldwide streaming company, Netflix has consistently used cutting-edge technology to revolutionize media consumption. Netflix Architecture is designed to efficiently and reliably provide content to millions of consumers at once. The scalability of Netflix’s infrastructure is critical, given its 200 million+ members across more than 190 countries. So, let’s delve into the intricacies of Netflix Architecture and uncover how it continues shaping how we enjoy our favorite shows and movies. Why Understand Netflix System Architecture? It’s important to understand Netflix System Architecture for several reasons. Above all, it sheds light on how Netflix accommodates millions of customers throughout the globe with a flawless streaming experience. We can learn about the technology and tactics that underlie its success better by exploring the nuances of this architecture. Furthermore, other industries can benefit from using Netflix’s design as a blueprint for developing scalable, reliable, and efficient systems. Its design principles and best practices can teach us important lessons about building and optimizing complicated distributed systems. We may also recognize the continual innovation driving the development of digital media consumption by understanding Netflix’s Architecture. Understanding the Requirements for System Design System design is crucial in developing complex software or technological infrastructure. These specifications act as the basis around which the entire system is constructed, driving choices and forming the end product. However, what are the prerequisites for system design, and what makes them crucial? Let’s explore. Functional Requirements The system’s functional requirements specify the features, functions, and capabilities that it must include. These specifications outline the system’s main objective and detail how various parts or modules interact. Functional requirements for a streaming platform like Netflix, for instance, could encompass the following, including but not limited to: Account creation: Users should be able to create accounts easily, providing necessary information for registration. User login: Registered users should have the ability to securely log in to their accounts using authentication credentials. Content suggestion: The platform should offer personalized content suggestions based on user preferences, viewing history, and other relevant data. Video playback capabilities: Users should be able to stream videos seamlessly, with options for playback controls such as play, pause, rewind, and fast forward. Non-Functional Requirements Non-functional requirements define the system’s behavior under different scenarios and ensure that it satisfies certain quality requirements. 
They cover performance, scalability, dependability, security, and compliance aspects of the system. Non-functional requirements for a streaming platform like Netflix, for instance, could include but are not limited to: Performance requirements: During periods of high utilization, the system must maintain low latency and high throughput. Compliance requirements: Regarding user data protection, the platform must abide by Data Protection Regulations standards. Scalability requirements: The infrastructure must be scalable to handle growing user traffic without sacrificing performance. Security requirements: To prevent unwanted access to user information, strong authentication and encryption procedures must be put in place. Reliability and availability requirements: For uninterrupted service delivery, the system needs to include failover methods and guarantee high uptime. Netflix Architecture: Embracing Cloud-Native After a significant setback due to database corruption in August 2008, Netflix came to the crucial conclusion that it was necessary to move away from single points of failure and towards highly dependable, horizontally scalable, cloud-based solutions. Netflix started a revolutionary journey by selecting Amazon Web Services (AWS) as its cloud provider and moving most of its services to the cloud by 2015. Following seven years of intensive work, the cloud migration was finished in early January 2016, which meant that the streaming service’s last remaining data center components were shut down. But getting to the cloud wasn’t a simple task. Netflix adopted a cloud-native strategy, completely overhauling its operational model and technological stack. This required embracing NoSQL databases, denormalizing their data model, and moving from a monolithic application to hundreds of microservices. Changes in culture were also necessary, such as adopting DevOps procedures, continuous delivery, and a self-service engineering environment. Despite the difficulties, this shift has made Netflix a cloud-native business that is well-positioned for future expansion and innovation in the rapidly changing field of online entertainment. Netflix Architectural Triad A strong architectural triad — the Client, Backend, and Content Delivery Network (CDN) — is responsible for Netflix’s flawless user experience. With millions of viewers globally, each component is essential to delivering content. Client The client-side architecture lies at the heart of the Netflix experience. This includes the wide range of devices users use to access Netflix, such as computers, smart TVs, and smartphones. Netflix uses a mix of web interfaces and native applications to ensure a consistent user experience across different platforms. Regardless of the device, these clients manage playback controls, user interactions, and interface rendering to deliver a unified experience. Users may easily browse the extensive content library and enjoy continuous streaming thanks to the client-side architecture’s responsive optimization. Netflix Architecture: Backend Backend architecture is the backbone of Netflix’s behind-the-scenes operations. The management of user accounts, content catalogs, recommendation algorithms, billing systems, and other systems is done by a complex network of servers, databases, and microservices. In addition to handling user data and coordinating content delivery, the backend processes user requests. 
Furthermore, the backend optimizes content delivery and personalizes recommendations using state-of-the-art technologies like big data analytics and machine learning, which raises user satisfaction and engagement. The backend architecture of Netflix has changed significantly over time. It moved to cloud infrastructure in 2007 and adopted Spring Boot as its primary Java framework in 2018. When combined with the scalability and dependability provided by AWS (Amazon Web Services), proprietary technologies like Ribbon, Eureka, and Hystrix have been crucial in effectively coordinating backend operations. Netflix Architecture: Content Delivery Network The Content Delivery Network completes Netflix Architectural Triangle. A Content Delivery Network (CDN) is a strategically positioned global network of servers that aims to deliver content to users with optimal reliability and minimum delay. Netflix runs a Content Delivery Network (CDN) called Open Connect. It reduces buffering and ensures smooth playback by caching and serving material from sites closer to users. Even during times of high demand, Netflix reduces congestion and maximizes bandwidth utilization by spreading content over numerous servers across the globe. This decentralized method of content delivery improves global viewers’ watching experiences, also lowering buffering times and increasing streaming quality. Client-Side Components Web Interface Over the past few years, Netflix’s Web Interface has seen a considerable transformation, switching from Silverlight to HTML5 to stream premium video content. With this change, there would be no need to install and maintain browser plug-ins, which should simplify the user experience. Netflix has increased its compatibility with a wide range of online browsers and operating systems, including Chrome OS, Chrome, Internet Explorer, Safari, Opera, Firefox, and Edge, since the introduction of HTML5 video. Netflix’s use of HTML5 extends beyond simple playback. The platform has welcomed HTML5 adoption as an opportunity to support numerous industry standards and technological advancements. Mobile Applications The extension of Netflix’s streaming experience to users of smartphones and tablets is made possible via its mobile applications. These applications guarantee that users may access their favorite material while on the road. They are available on multiple platforms, including iOS and Android. By utilizing a combination of native development and platform-specific optimizations, Netflix provides a smooth and user-friendly interface for a wide range of mobile devices. With features like personalized recommendations, seamless playback, and offline downloading, Netflix’s mobile applications meet the changing needs of viewers on the go. Users of the Netflix mobile app may enjoy continuous viewing of their favorite series and films while driving, traveling, or just lounging around the house. Netflix is committed to providing a captivating and delightful mobile viewing experience with frequent upgrades and improvements. Smart TV Apps The Gibbon rendering layer, a JavaScript application for dynamic updates, and a native Software Development Kit (SDK) comprise the complex architecture upon which the Netflix TV Application is based. The application guarantees fluid UI rendering and responsiveness across multiple TV platforms by utilizing React-Gibbon, a customized variant of React. Prioritizing performance optimization means focusing on measures such as frames per second and key input responsiveness. 
Rendering efficiency is increased by methods like prop iteration reduction and inline component creation; performance is further optimized by style optimization and custom component development. With a constant focus on enhancing the TV app experience for consumers across many platforms, Netflix cultivates a culture of performance excellence. Revamping the Playback Experience: A Journey Towards Modernization Netflix has completely changed how people watch and consume digital media over the last ten years. But even though the streaming giant has been releasing cutting-edge features regularly, the playback interface’s visual design and user controls haven’t changed much since 2013. After realizing that the playback user interface needed to be updated, the Web UI team set out to redesign it. The team’s three main canvases were Pre Play, Video Playback, and Post Play. Their goal was to increase customer pleasure and engagement. By utilizing technologies like React.js and Redux to expedite development and enhance performance, Netflix revolutionized its playback user interface Netflix Architecture: Backend Infrastructure Content Delivery Network (CDN) Netflix’s infrastructure depends on its Content Delivery Network (CDN), additionally referred to as Netflix Open Connect, which allows content to be delivered to millions of viewers globally with ease. Globally distributed, the CDN is essential to ensuring that customers in various locations receive high-quality streaming content. The way Netflix Open Connect CDN works is that servers, called Open Connect Appliances (OCAs), are positioned strategically so that they are near Internet service providers (ISPs) and their users. When content delivery is at its peak, this proximity reduces latency and guarantees effective performance. Netflix is able to maximize bandwidth utilization and lessen its dependence on costly backbone capacity by pre-positioning content within ISP networks, which improves the total streaming experience. Scalability is one of Netflix’s CDN’s primary features. With OCAs installed in about 1,000 locations across the globe, including isolated locales like islands and the Amazon rainforest, Netflix is able to meet the expanding demand for streaming services across a wide range of geographic areas. Additionally, Netflix grants OCAs to qualified ISPs so they can offer Netflix content straight from their networks. This strategy guarantees improved streaming for subscribers while also saving ISPs’ running expenses. Netflix cultivates a win-win relationship with ISPs by providing localized content distribution and collaborating with them, which enhances the streaming ecosystem as a whole. Transforming Video Processing: The Microservices Revolution at Netflix By implementing microservices, Netflix has transformed its video processing pipeline, enabling unmatched scalability and flexibility to satisfy the needs of studio operations as well as member streaming. With the switch to the microservices-based platform from the monolithic platform, a new age of agility and feature development velocity was brought in. Each step of the video processing workflow is represented by a separate microservice, allowing for simplified orchestration and decoupled functionality. Together, these services—which range from video inspection to complexity analysis and encoding—produce excellent video assets suitable for studio and streaming use cases. 
Microservices have produced noticeable results by facilitating quick iteration and adaptation to shifting business requirements. Playback Process in Netflix Open Connect Worldwide customers can enjoy a flawless and excellent viewing experience thanks to Netflix Open Connect’s playback procedure. It functions as follows: Health reporting: Open Connect Appliances (OCAs) report to the cache control services in Amazon Web Services (AWS) on a regular basis regarding their learned routes, content availability, and overall health. User request: From the Netflix application hosted on AWS, a user on a client device requests that a TV show or movie be played back. Authorization and file selection: After verifying user authorization and licensing, the AWS playback application services choose the precise files needed to process the playback request. Steering service: The AWS steering service chooses which OCAs to serve files from based on the data that the cache control service has saved. The playback application services receive these OCAs from it when it constructs their URLs. Content delivery: On the client device, the playback application services send the URLs of the relevant OCAs. When the requested files are sent to the client device over HTTP/HTTPS, the chosen OCA starts serving them. Below is a visual representation demonstrating the playback process: Databases in Netflix Architecture Leveraging Amazon S3 for Seamless Media Storage Netflix’s ability to withstand the April 21, 2022, AWS outage demonstrated the value of its cloud infrastructure, particularly its reliance on Amazon S3 for data storage. Netflix’s systems were built to endure such outages by leveraging services like SimpleDB, S3, and Cassandra. Netflix’s infrastructure is built on the foundation of its use of Amazon S3 (Simple Storage Service) for media storage, which powers the streaming giant’s huge collection of films, TV series, and original content. Petabytes of data are needed to service millions of Netflix users worldwide, and S3 is the perfect choice for storing this data since it offers scalable, reliable, and highly accessible storage. Another important consideration that led Netflix to select S3 for media storage is scalability. With S3, Netflix can easily expand its storage capacity without having to worry about adding more hardware or maintaining complicated storage infrastructure as its content collection grows. To meet the growing demand for streaming content without sacrificing user experience or speed, Netflix needs to be scalable. Embracing NoSQL for Scalability and Flexibility The need for structured storage access throughout a highly distributed infrastructure drives Netflix’s database selection process. Netflix adopted the paradigm shift towards NoSQL distributed databases after realizing the shortcomings of traditional relational models in the context of Internet-scale operations. In their database ecosystem, three essential NoSQL solutions stand out: Cassandra, Hadoop/HBase, and SimpleDB. Amazon SimpleDB As Netflix moved to the AWS cloud, SimpleDB from Amazon became an obvious solution for many use cases. It was appealing because of its powerful query capabilities, automatic replication across availability zones, and durability. SimpleDB’s hosted solution reduced operational overhead, which is in line with Netflix’s policy of using cloud providers for non-differentiated operations. Apache HBase Apache HBase evolved as a practical, high-performance solution for Hadoop-based systems. 
Its dynamic partitioning strategy makes it easier to redistribute load and create clusters, which is crucial for handling Netflix’s growing volume of data. HBase’s robust consistency architecture is enhanced by its support for distributed counters, range queries, and data compression, which makes it appropriate for a variety of use cases. Apache Cassandra The open-source NoSQL database Cassandra provides performance, scalability, and flexibility. Its dynamic cluster growth and horizontal scalability meet Netflix’s requirement for unlimited scale. Because of its adaptable consistency, replication mechanisms, and flexible data model, Cassandra is perfect for cross-regional deployments and scaling without single points of failure. Since each NoSQL tool is best suited for a certain set of use cases, Netflix has adopted a number of them. While Cassandra excels in cross-regional deployments and fault-tolerant scaling, HBase connects with the Hadoop platform naturally. A learning curve and operational expense accompany a pillar of Netflix’s long-term cloud strategy, NoSQL adoption, but the benefits in terms of scalability, availability, and performance make the investment worthwhile. MySQL in Netflix’s Billing Infrastructure Netflix’s billing system experienced a major transformation as part of its extensive migration to AWS cloud-native architecture. Because Netflix relies heavily on billing for its operations, the move to AWS was handled carefully to guarantee that there would be as little of an impact on members’ experiences as possible and that strict financial standards would be followed. Tracking billing periods, monitoring payment statuses, and providing data to financial systems for reporting are just a few of the tasks that Netflix’s billing infrastructure handles. The billing engineering team managed a complicated ecosystem that included batch tasks, APIs, connectors with other services, and data management to accomplish these functionalities. The selection of database technology was one of the most important choices made during the move. MySQL was chosen as the database solution due to the need for scalability and the requirement for ACID transactions in payment processing. Building robust tooling, optimizing code, and removing unnecessary data were all part of the migration process in order to accommodate the new cloud architecture. Before transferring the current member data, a thorough testing process using clean datasets was carried out using proxies and redirectors to handle traffic redirection. It was a complicated process to migrate to MySQL on AWS; it required careful planning, methodical implementation, and ongoing testing and iteration. In spite of the difficulties, the move went well, allowing Netflix to use the scalability and dependability of AWS cloud services for its billing system. In summary, switching Netflix’s billing system to MySQL on AWS involved extensive engineering work and wide-ranging effects. Netflix's system architecture has updated its billing system and used cloud-based solutions to prepare for upcoming developments in the digital space. Here is Netflix’s post-migration architecture: Content Processing Pipeline in Netflix Architecture The Netflix content processing pipeline is a systematic approach for handling digital assets that are provided by partners in content and fulfillment. The three main phases are ingestion, transcoding, and packaging. 
Ingestion Source files, such as audio, timed text, or video, are thoroughly examined for accuracy and compliance throughout the ingestion stage. These verifications include semantic signal domain inspections, file format validation, decodability of compressed bitstreams, compliance with Netflix delivery criteria, and the integrity of data transfer. Transcoding and Packaging The sources go through transcoding to produce output elementary streams when they make it beyond the ingestion stage. After that, these streams are encrypted and placed in distribution-ready streamable containers. Ensuring Seamless Streaming With Netflix’s Canary Model Since client applications are the main way users engage with a brand, they must be of excellent quality for global digital products. At Netflix's system architecture, significant amounts of money are allocated towards guaranteeing thorough evaluation of updated application versions. Nevertheless, thorough internal testing becomes difficult because Netflix is accessible on thousands of devices and is powered by hundreds of independently deployed microservices. As a result, it is crucial to support release decisions with solid field data acquired during the update process. To expedite the assessment of updated client applications, Netflix’s system architecture has formed a specialized team to mine health signals from the field. Development velocity increased as a result of this system investment, improving application quality and development procedures. Client applications: There are two ways that Netflix upgrades its client apps: through direct downloads and app store deployments. Distribution control is increased with direct downloads. Deployment strategies: Although the advantages of regular, incremental releases for client apps are well known, updating software presents certain difficulties. Since every user’s device delivers data in a stream, efficient signal sampling is crucial. The deployment strategies employed by Netflix are customized to tackle the distinct challenges posed by a wide range of user devices and complex microservices. The strategy differs based on the kind of client — for example, smart TVs vs mobile applications. New client application versions are progressively made available through staged rollouts, which provide prompt failure handling and intelligent backend service scaling. During rollouts, keeping an eye on client-side error rates and adoption rates guarantees consistency and effectiveness in the deployment procedure. Staged rollouts: To reduce risks and scale backend services wisely, staged rollouts entail progressively deploying new software versions. AB tests/client canaries: Netflix employs an intense variation of A/B testing known as “Client Canaries,” which involves testing complete apps to guarantee timely upgrades within a few hours. Orchestration: Orchestration lessens the workload associated with frequent deployments and analysis. It is useful for managing A/B tests and client canaries. In summary, millions of customers may enjoy flawless streaming experiences thanks to Netflix’s use of the client canary model, which guarantees frequent app updates. Netflix Architecture Diagram Netflix system Architecture is a complex ecosystem made up of Python and Java with Spring Boot for backend services, and Apache Kafka and Flink for data processing and real-time event streaming. Redux, React.js, and HTML5 on the front end provide a captivating user experience. 
Numerous databases offer real-time analytics and handle enormous volumes of media content, including Cassandra, HBase, SimpleDB, MySQL, and Amazon S3. Jenkins and Spinnaker help with continuous integration and deployment, and AWS powers the entire infrastructure with scalability, dependability, and global reach. Netflix’s dedication to providing flawless entertainment experiences to its vast worldwide audience is demonstrated by the fact that these technologies only make up a small portion of its huge tech stack. Conclusion of Netflix Architecture Netflix System Architecture has revolutionized the entertainment industry. Throughout its evolution from a DVD rental service to a major worldwide streaming player, Netflix’s technological infrastructure has been essential to its success. Netflix Architecture, supported by Amazon Web Services (AWS), guarantees uninterrupted streaming for a global user base. Netflix ensures faultless content delivery across devices with its Client, Backend, and Content Delivery Network (CDN). The innovative usage of HTML5 and personalized suggestions by Netflix System Architecture improves user experience. Despite some obstacles along the way, Netflix came out stronger after making the switch to a cloud-native setup. In the quickly evolving field of online entertainment, Netflix has positioned itself for future development and innovation by embracing microservices, NoSQL databases, and cloud-based solutions. Any tech venture can benefit from understanding Netflix's system. Put simply, Netflix's System Architecture aims to transform the way we consume media — it’s not just about technology. This architecture secretly makes sure that everything runs well when viewers binge-watch, increasing everyone’s enjoyment of the entertainment.