Snowflake is a leading cloud-based data storage and analytics service that provides solutions for data warehousing, data engineering, AI/ML modeling, and other related services. Among its many features, one powerful data recovery capability is Time Travel, which allows users to access historical data from a point in the past. It is useful when a user runs into any of the following scenarios:

Retrieving the previous row or column value before the current DML operation
Recovering the last state of data for backup or redundancy
Recovering records that were updated or deleted from a table by mistake
Restoring the previous state of a table, schema, or database

Snowflake's Continuous Data Protection Life Cycle allows Time Travel within a retention window of 1 to 90 days; retention of up to 90 days requires the Enterprise edition or higher.

Time Travel SQL Extensions

Time Travel is invoked with the OFFSET, TIMESTAMP, and STATEMENT keywords used inside an AT or BEFORE clause.

Offset

If a user wants to query past data or recover a table from an older state using a relative time offset, the offset is defined in seconds:

SQL
SELECT * FROM any_table AT(OFFSET => -60*5);                      -- 5 minutes ago
CREATE TABLE recovered_table CLONE any_table AT(OFFSET => -3600); -- 1 hour ago

Timestamp

Suppose a user wants to query data or recover a schema as of a specific timestamp. Then the user can use the queries below:

SQL
SELECT * FROM any_table AT(TIMESTAMP => 'Sun, 05 May 2024 16:20:00 -0700'::timestamp_tz);
CREATE SCHEMA recovered_schema CLONE any_schema AT(TIMESTAMP => 'Wed, 01 May 2024 01:01:00 +0300'::timestamp_tz);

Statement

Users can also use the unique ID of a query to get the data as it existed before that statement ran:

SQL
SELECT * FROM any_table BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');
CREATE DATABASE recovered_db CLONE any_db BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');

The commands below set the data retention time when a table is created and increase or decrease it afterward:

SQL
CREATE TABLE any_table(id NUMERIC, name VARCHAR, created_date DATE) DATA_RETENTION_TIME_IN_DAYS=90;
ALTER TABLE any_table SET DATA_RETENTION_TIME_IN_DAYS=30;

If data retention is not required, we can set DATA_RETENTION_TIME_IN_DAYS=0.

Objects that do not have an explicitly defined retention period inherit it from the level above: a table without a specified retention period inherits it from its schema, and a schema without one inherits it from the database. The account is the highest level of the hierarchy; if Time Travel is not required by default, retention can be set to 0 days at the account level.

Now consider a case where a table, schema, or database is accidentally dropped and all its data appears to be lost. When a data object is dropped, Snowflake keeps it in the background until the retention period expires, so it can be brought back with the following SQL:

SQL
UNDROP TABLE any_table;
UNDROP SCHEMA any_schema;
UNDROP DATABASE any_database;

If a user creates a table with the same name as the dropped table, Snowflake creates a new table rather than restoring the old one; the UNDROP command above is what restores the old object. The user also needs the appropriate permissions or ownership to restore an object.
After the Time Travel window, if an object is not retrieved within the data retention period, it is moved to Snowflake Fail-safe, where users cannot query it. The only way to recover it is with Snowflake's assistance, and Fail-safe stores the data for a maximum of 7 days.

Challenges

Time Travel, though useful, has a few limitations:

Transient and temporary tables have a default (and maximum) Time Travel retention of one day.
Objects other than tables, schemas, and databases, such as views, UDFs, and stored procedures, are not supported.
If a table is recreated with the same name, Time Travel refers to the latest version by default; referring to the older version of the same name requires renaming the current table first.

Conclusion

The Time Travel feature is quick, easy, and powerful. It is always handy and gives users more confidence while operating on production-sensitive data. The great thing is that users can run these queries themselves without having to involve admins. With a maximum retention of 90 days, users have more than enough time to query back in time or fix any incorrectly updated data. In my opinion, it is Snowflake's strongest feature.

Reference: Understanding & Using Time Travel
Binary Search Trees (BSTs) are fundamental hierarchical data structures in computer science, renowned for their efficiency in organizing and managing data. Each node in a BST holds a key value, with left and right child nodes arranged according to a specific property: nodes in the left subtree have keys less than the parent node, while those in the right subtree have keys greater than the parent node. This property facilitates fast searching, insertion, and deletion operations, making BSTs invaluable in applications requiring sorted data. Traversal algorithms like in-order, pre-order, and post-order traversals further enhance their utility by enabling systematic node processing. BSTs find extensive use in databases, compilers, and various computer science algorithms due to their simplicity and effectiveness in data organization and manipulation. This article delves into the theory and practical implementation of BSTs, highlighting their significance in both academic and real-world applications.

Understanding Binary Search Trees

Binary Search Trees are commonly used in computer science to organize and manage data efficiently. Unlike linear structures like arrays or linked lists, which store data sequentially, BSTs arrange data in a hierarchical manner. Each node in a BST contains a key value and pointers to its left and right child nodes. The key property of a BST is that for any given node, all nodes in its left subtree have keys less than the node's key, and all nodes in its right subtree have keys greater than the node's key. This property enables quick searching, insertion, and deletion operations, as it allows the tree to be efficiently navigated based on the value of the keys.

BSTs are particularly useful in applications where data needs to be stored in sorted order. For example, in a phonebook application, BSTs can be used to store names alphabetically, allowing for fast lookups of phone numbers based on names. Similarly, in a file system, BSTs can be employed to store files in sorted order based on their names or sizes, facilitating efficient file retrieval operations.

Traversal algorithms, such as in-order, pre-order, and post-order traversals, allow us to systematically visit each node in the BST. In an in-order traversal, nodes are visited in ascending order of their keys, making it useful for obtaining data in sorted order. Pre-order and post-order traversals visit nodes in a specific order relative to their parent nodes, which can be helpful for various operations such as creating a copy of the tree or evaluating mathematical expressions.

Operations on Binary Search Trees

Operations on Binary Search Trees involve fundamental actions such as insertion, deletion, and searching, each essential for managing and manipulating the data within the tree efficiently.

Insertion

When inserting a new node into a BST, the tree's hierarchical structure must be maintained to preserve the ordering property. Starting from the root node, the algorithm compares the new node's key with the keys of existing nodes, recursively traversing the tree until an appropriate position is found. Once the correct location is identified, the new node is inserted as a leaf node, ensuring that the BST's ordering property is maintained.
Python
class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return TreeNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

In the insertion operation, we recursively traverse the BST starting from the root node. If the tree is empty (root is None), we create a new node with the given key. Otherwise, we compare the key with the current node's key and traverse left if the key is smaller or right if it's greater. We continue this process until we find an appropriate position to insert the new node.

Deletion

Removing a node from a BST requires careful consideration to preserve the tree's integrity. The deletion process varies depending on whether the node to be removed has zero, one, or two children. In cases where the node has no children or only one child, the deletion process involves adjusting pointers to remove the node from the tree. However, if the node has two children, a more intricate process is required to maintain the BST's ordering property. Typically, the node to be deleted is replaced by its successor (either the smallest node in its right subtree or the largest node in its left subtree), ensuring that the resulting tree remains a valid BST.

Python
def minValueNode(node):
    current = node
    while current.left is not None:
        current = current.left
    return current

def deleteNode(root, key):
    if root is None:
        return root
    if key < root.key:
        root.left = deleteNode(root.left, key)
    elif key > root.key:
        root.right = deleteNode(root.right, key)
    else:
        if root.left is None:
            return root.right
        elif root.right is None:
            return root.left
        temp = minValueNode(root.right)
        root.key = temp.key
        root.right = deleteNode(root.right, temp.key)
    return root

In the deletion operation, we recursively traverse the BST to find the node with the key to be deleted. Once found, we handle three cases: a node with no children, a node with one child, and a node with two children. For a node with no children or one child, we simply remove the node from the tree. For a node with two children, we find the in-order successor (the smallest node in the right subtree), copy its key to the current node, and then recursively delete the in-order successor.

Searching

Searching for a specific key in a BST involves traversing the tree recursively based on the key values. Starting from the root node, the algorithm compares the target key with the keys of nodes encountered during traversal. If the target key matches the key of the current node, the search is successful. Otherwise, the algorithm continues searching in the appropriate subtree based on the comparison of key values until the target key is found or determined to be absent.

Python
def search(root, key):
    if root is None or root.key == key:
        return root
    if root.key < key:
        return search(root.right, key)
    return search(root.left, key)

In the search operation, we recursively traverse the BST starting from the root node. If the current node is None or its key matches the target key, we return the current node. Otherwise, if the target key is greater than the current node's key, we search in the right subtree; otherwise, we search in the left subtree.

Traversal Techniques

Traversal techniques in Binary Search Trees (BSTs) are methods used to visit and process all nodes in the tree in a specific order. There are three main traversal techniques: in-order traversal, pre-order traversal, and post-order traversal.
In-Order Traversal

In in-order traversal, nodes are visited in ascending order of their keys. The process begins at the leftmost node (the node with the smallest key), then visits the parent node, and finally the right child node. In a BST, an in-order traversal will visit nodes in sorted order.

Python
def inorder_traversal(root):
    if root:
        inorder_traversal(root.left)
        print(root.key)
        inorder_traversal(root.right)

Pre-Order Traversal

In pre-order traversal, nodes are visited starting from the root node, followed by its left subtree, and then its right subtree. This traversal method is useful for creating a copy of the tree or for prefix expressions.

Python
def preorder_traversal(root):
    if root:
        print(root.key)
        preorder_traversal(root.left)
        preorder_traversal(root.right)

Post-Order Traversal

In post-order traversal, nodes are visited starting from the left subtree, then the right subtree, and finally the root node. This traversal method is useful for deleting nodes from the tree or evaluating postfix expressions.

Python
def postorder_traversal(root):
    if root:
        postorder_traversal(root.left)
        postorder_traversal(root.right)
        print(root.key)

Practical Applications of Binary Search Trees

BSTs find applications in various real-world scenarios, including:

Binary search in sorted arrays for efficient search operations.
Symbol tables for fast retrieval of key-value pairs.
Expression trees for evaluating mathematical expressions.

Best Practices and Considerations

Efficient implementation and usage of BSTs require attention to:

Balancing the tree to ensure optimal performance, especially in scenarios with skewed data.
Handling edge cases such as duplicate values or deleting nodes with two children.
Avoiding common pitfalls like memory leaks or infinite loops in recursive functions.

Conclusion

Binary Search Trees (BSTs) are powerful data structures that play a fundamental role in computer science. Their hierarchical organization and key property make them efficient for storing, retrieving, and manipulating data in sorted order. By understanding the theory behind BSTs and their various operations, including insertion, deletion, searching, and traversal techniques, developers can leverage their capabilities to solve a wide range of computational problems effectively. Whether in database systems, compilers, or algorithm design, BSTs offer a versatile and elegant solution for managing data and optimizing performance. Embracing the versatility and efficiency of BSTs opens up a world of possibilities for innovation and problem-solving in the realm of computer science.
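As a quick usage sketch, the functions defined in the preceding sections can be combined to build and query a small tree; this assumes the TreeNode, insert, deleteNode, search, and inorder_traversal definitions given above.

Python
# build a small BST from a list of keys using the insert function defined above
root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, key)

inorder_traversal(root)               # prints 20 30 40 50 60 70 80, one key per line
print(search(root, 60) is not None)   # True: 60 is present in the tree
root = deleteNode(root, 30)           # remove 30 while preserving the BST property
inorder_traversal(root)               # prints 20 40 50 60 70 80, one key per line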
Data Security Is an Evolving Challenge With JPA

In the present era, where data drives everything, keeping sensitive information safe is more important than ever. As developers, we use JPA (Java Persistence API) to work with relational databases, but we often overlook security aspects. Here, we will explore how to use JPA's features to create secure and reliable data access layers, protecting the integrity and confidentiality of your data.

The Security Threat Horizon

Data access layers are a major target for malicious actors. Common threats include exploiting vulnerabilities in queries where user input is directly incorporated into SQL statements, modifying sensitive data stored in your database, and bypassing authentication and authorization mechanisms to access data.

JPA Features for Secure Queries

While JPA doesn't automatically make your application secure, it provides several features that help reduce the risk of attacks. Prepared statements separate the data from the actual SQL query, eliminating the risk of SQL injection; parameterized queries use placeholders for parameters within the query for the same reason; and the Criteria API lets you construct dynamic queries using criteria builders and predicates, keeping the query logic separate from the actual SQL string.

Secure Coding Practices With JPA

In addition to using JPA features, it's crucial to follow these secure coding practices: always validate user input before using it in JPA queries, implement strong authentication and authorization mechanisms, and encrypt sensitive data both at rest and in transit.

The code sketch at the end of this section demonstrates these practices, such as using parameterized queries and the Criteria API.

Understanding the Code

In that example, the Java Persistence API securely retrieves user data. First, we obtain an instance of the EntityManager, which serves as the central interface for interacting with JPA entities and acts as a gateway for performing JPA operations. Next, we create a CriteriaBuilder object, which enables us to construct criteria queries dynamically; this approach allows for flexible and efficient querying.

Now let's focus on the crucial aspect of secure coding: defining the filtering condition for the query using a Predicate. Here we use the cb.equal method to create a predicate that checks whether the "isActive" field of the user entity is true, ensuring that only active users are retrieved and promoting data integrity and security. Finally, we build the query using a placeholder for the username. This approach separates the data (the username) from the actual SQL statement, preventing potential vulnerabilities like SQL injection. By using parameterized queries, we safeguard our application's database interactions.

Incorporate a Security-First Mindset

Building secure data access layers is an ongoing process. Additional tips include staying up to date with JPA libraries and frameworks, conducting regular security audits, and educating developers and stakeholders about data security best practices.

Workflow Wonders

JPA enables developers to build powerful and flexible data access layers. However, security is not a built-in guarantee. By using JPA's security features, observing secure coding practices, and maintaining a security-conscious approach, you can create robust data access layers that safeguard your valuable data and build trust in your applications.
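For reference, the sketch below illustrates the kind of criteria query described in this section. It is a minimal illustration rather than a definitive implementation: the User entity, its username and isActive fields, and the findActiveUser method are assumptions made for the example, and the imports use jakarta.persistence (older applications use javax.persistence instead).

Java
import jakarta.persistence.EntityManager;
import jakarta.persistence.TypedQuery;
import jakarta.persistence.criteria.CriteriaBuilder;
import jakarta.persistence.criteria.CriteriaQuery;
import jakarta.persistence.criteria.Predicate;
import jakarta.persistence.criteria.Root;

public class UserRepository {

    private final EntityManager entityManager;

    public UserRepository(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    // Hypothetical lookup: fetch an active user by name without concatenating SQL strings.
    public User findActiveUser(String username) {
        CriteriaBuilder cb = entityManager.getCriteriaBuilder();
        CriteriaQuery<User> query = cb.createQuery(User.class);
        Root<User> user = query.from(User.class);

        // Filtering conditions are expressed as Predicates, not spliced into a query string.
        Predicate isActive = cb.equal(user.get("isActive"), true);
        Predicate nameMatches = cb.equal(user.get("username"),
                                         cb.parameter(String.class, "username"));
        query.select(user).where(cb.and(isActive, nameMatches));

        // The username is bound as a parameter, keeping the data separate from the query.
        TypedQuery<User> typedQuery = entityManager.createQuery(query);
        typedQuery.setParameter("username", username);
        return typedQuery.getSingleResult();
    }
}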
Machine learning continues to be one of the most rapidly advancing and in-demand fields of technology. Machine learning, a branch of artificial intelligence, enables computer systems to learn and adopt human-like qualities, ultimately leading to the development of artificially intelligent machines. Eight key human-like qualities that can be imparted to a computer using machine learning, together with the AI discipline (using an ML approach) that provides each, are listed below:

Sight: Computer Vision
Speech: Natural Language Processing (NLP)
Locomotion: Robotics
Understanding: Knowledge Representation and Reasoning
Touch: Haptics
Emotional Intelligence: Affective Computing (a.k.a. Emotion AI)
Creativity: Generative Adversarial Networks (GANs)
Decision-Making: Reinforcement Learning

However, the process of creating artificial intelligence requires large volumes of data. In machine learning, the more data we have and train the model on, the better the model (AI agent) becomes at processing the given prompts or inputs and ultimately doing the task(s) for which it was trained.

This data is not fed into the machine learning algorithms in its raw form. It must first undergo various inspections and phases of data cleansing and preparation before it is fed into the learning algorithms. We call this phase of the machine learning life cycle the data preprocessing phase. As implied by the name, this phase consists of all the operations and procedures that will be applied to our dataset (rows/columns of values) to bring it into a cleaned state so that it will be accepted by the machine learning algorithm to start the training/learning process.

This article discusses the most popular data preprocessing techniques used for machine learning. We will explore various methods to clean, transform, and scale our data. All exploration and practical examples are done using Python code snippets to give you hands-on experience with how these techniques can be implemented effectively in your machine learning project.

Why Preprocess Data?

The literal, holistic reason for preprocessing data is so that the data is accepted by the machine learning algorithm and thus the training process can begin. However, if we look at the intrinsic inner workings of the machine learning framework itself, more reasons can be provided. Below are five key reasons (advantages) for preprocessing your data for the subsequent machine learning task:

Improved data quality: Data preprocessing ensures that your data is consistent, accurate, and reliable.
Improved model performance: Data preprocessing allows your AI model to capture trends and patterns on deeper and more accurate levels.
Increased accuracy: Data preprocessing allows the model evaluation metrics to be better and reflect a more accurate overview of the ML model.
Decreased training time: By feeding the algorithm data that has been cleaned, you allow the algorithm to run at its optimum level, reducing computation time and removing unnecessary strain on computing resources.
Feature engineering: By preprocessing your data, the machine learning practitioner can gauge the impact that certain features have on the model and select the features that are most relevant for model construction.

In its raw state, data can contain a magnitude of errors and noise. Data preprocessing seeks to clean and free the data from these errors.
Common challenges experienced with raw data include, but are not limited to, the following:

Missing values: Null values or NaN (Not-a-Number)
Noisy data: Outliers or incorrectly captured data points
Inconsistent data: Different data formatting inside the same file
Imbalanced data: Unequal class distributions (experienced in classification tasks)

In the following sections of this article, we work through hands-on examples of data preprocessing; the full code for every example is collected in the listing at the end of this article.

Data Preprocessing Techniques in Python

The frameworks that we will utilize for the practical examples of data preprocessing are NumPy, Pandas, and scikit-learn.

Handling Missing Values

The most popular techniques for handling missing values are removal and imputation. It is worth noting that, in many operations, a single null (NaN) value inside a calculation or process will propagate, causing the entire result to evaluate to NaN (a null/missing/error value).

Removal

This is when we remove the rows or columns that contain the missing value(s). This is typically done when the proportion of missing data is relatively small compared to the entire dataset.

Imputation

This is when we replace the missing values in our data with substituted values, commonly the mean, median, or mode of the data for that column. The term given to this process is imputation.

Handling Noisy Data

Our data is said to be noisy when outliers or irrelevant data points are present. This noise can distort our model and, therefore, our analysis. Common preprocessing techniques for handling noisy data include smoothing and binning.

Smoothing

This technique employs operations such as moving averages to reduce noise and identify trends, allowing the essence of the data to be captured.

Binning

This is a common process in statistics that follows the same underlying logic in machine learning data preprocessing. It involves grouping our data into bins to reduce the effect of minor observation errors.

Data Transformation

This technique plays a crucial role in preparing data for algorithms that require numerical features as input, because it converts raw data into a suitable format or range for the machine learning algorithm to work with. It is an especially important step for distance-based machine learning algorithms. The key data transformation techniques are normalization and standardization; as their names imply, they rescale the data within our features to a standard range or distribution.

Normalization

This technique scales our data to a range of [0, 1] or [-1, 1] (inclusive of both endpoints). It is useful when our features have different ranges and we want to bring them to a common scale.

Standardization

Standardization scales our data to have a mean of 0 and a standard deviation of 1. It is useful when the data contained within our features has different units of measurement or distributions.

Encoding Categorical Data

Our machine learning algorithms most often require the features matrix (input data) to be numerical/quantitative. However, our dataset may contain textual (categorical) data.
Thus, all categorical (textual) data must be converted into a numerical format before feeding the data into the machine learning algorithm. The most commonly implemented techniques for handling categorical data are one-hot encoding (OHE) and label encoding.

One-Hot Encoding

This technique converts categorical values into binary vectors. Each unique category becomes its own column inside the data frame, and whether an observation (row) contains that value is represented by a 1 or 0 in the new column.

Label Encoding

This is when our categorical values are converted into integer labels; each unique category is assigned a unique integer to represent it. In the example in the code listing, the label encoding is done as follows:

'Blue' -> 0
'Green' -> 1
'Red' -> 2

Note that the numerical assignment is zero-indexed (as with all collection types in Python).

Feature Extraction and Selection

Feature selection involves the machine learning practitioner selecting the most important features from the data, while feature extraction transforms the data into a reduced set of features.

Feature Selection

This technique helps us identify and select the features from our dataset that have the most significant impact on the model. Selecting the best features improves the performance of our model and reduces overfitting.

Correlation Matrix

This is a matrix that helps us identify features that are highly correlated, allowing us to remove redundant features. The correlation coefficients range from -1 to 1, where values closer to -1 or 1 indicate stronger correlation, while values closer to 0 indicate weaker or no correlation.

Chi-Square Statistic

The chi-square statistic tests the independence of two categorical variables, which makes it very useful when performing feature selection on categorical data. It produces a p-value for each feature that tells us how useful the feature is for the task at hand.

The output of the chi-square scores consists of two arrays: the first contains the chi-square statistic values for each feature, and the second contains the corresponding p-values. In our example, the first feature has a chi-square statistic of 0.0 and a p-value of 1.0, while the second feature has a chi-square statistic of 3.0 and a p-value of approximately 0.083.

The chi-square statistic measures the association between the feature and the target variable: a higher value indicates a stronger association, which tells us that the feature is useful in guiding the model toward the desired target output. The p-value measures the probability of observing the chi-square statistic under the null hypothesis that the feature and the target are independent; a low p-value (typically < 0.05) indicates that the association is statistically significant. For our first feature, the chi-square value of 0.0 and p-value of 1.0 indicate no association with the target variable. For the second feature, the chi-square value of 3.0 and p-value of approximately 0.083 suggest that there might be some association with the target variable.
Keep in mind that we are working with dummy data; in the real world, the data will give you a lot more variation and points of analysis.

Feature Extraction

This is a data preprocessing technique that allows us to reduce the dimensionality of the data by transforming it into a new set of features. Logically speaking, model performance can be drastically increased by employing feature selection and extraction techniques.

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that transforms our data into a set of right-angled (orthogonal) components, capturing the most variance present in our features.

With this, we have successfully explored a variety of the most commonly used data preprocessing techniques in Python machine learning tasks.

Conclusion

In this article, we explored popular data preprocessing techniques for machine learning with Python. We began by understanding the importance of data preprocessing, then looked at the common challenges associated with raw data, and then dove into various preprocessing techniques with hands-on examples in Python.

Ultimately, data preprocessing is a step that cannot be skipped in your machine learning project life cycle. Even if there are no changes or transformations to be made to your data, it is always worth the effort to apply these techniques where applicable, because in doing so you ensure that your data is cleaned and transformed for your machine learning algorithm, and your subsequent machine learning model development will see improvements in factors such as model accuracy, computational complexity, and interpretability.

In conclusion, data preprocessing lays the foundation for successful machine learning projects. By paying attention to data quality and employing appropriate preprocessing techniques, we can unlock the full potential of our data and build models that deliver meaningful insights and actionable results.
Code

Python
# -*- coding: utf-8 -*-
"""
@author: Karthik Rajashekaran
"""

# we import the necessary frameworks
import pandas as pd
import numpy as np

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# TECHNIQUE: ROW REMOVAL -> we remove rows with any missing values
df_cleaned = df.dropna()
print("Row(s) With Null Value(s) Deleted:\n" + str(df_cleaned), "\n")

# TECHNIQUE: COLUMN REMOVAL -> we remove columns with any missing values
df_cleaned_columns = df.dropna(axis=1)
print("Column(s) With Null Value(s) Deleted:\n" + str(df_cleaned_columns), "\n")

#%%
# IMPUTATION

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we impute the missing values with the mean and median
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].median())
print("DataFrame After Imputation:\n" + str(df), "\n")

#%%
# SMOOTHING

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we calculate the moving average for smoothing
df['A_smoothed'] = df['A'].rolling(window=2).mean()
print("Smoothed Column A DataFrame:\n" + str(df), "\n")

#%%
# BINNING

# we create dummy data to work with
data = {'A': [1, 2, None, 4],
        'B': [5, None, None, 8],
        'C': [10, 11, 12, 13]}

# we create and print the dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we bin the data into discrete intervals
bins = [0, 5, 10, 15]
labels = ['Low', 'Medium', 'High']

# we apply the binning on column 'C'
df['Binned'] = pd.cut(df['C'], bins=bins, labels=labels)
print("DataFrame Binned Column C:\n" + str(df), "\n")

#%%
# NORMALIZATION

# we import the necessary frameworks
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply min-max normalization to our data using sklearn
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Normalized DataFrame:\n" + str(df_normalized), "\n")

#%%
# STANDARDIZATION

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we import 'StandardScaler' from sklearn
from sklearn.preprocessing import StandardScaler

# we apply standardization to our data
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Standardized DataFrame:\n" + str(df_standardized), "\n")

#%%
# ONE-HOT ENCODING

# we import the necessary framework
from sklearn.preprocessing import OneHotEncoder

# we create dummy data to work with
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply one-hot encoding to our categorical features
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Color']))
print("OHE DataFrame:\n" + str(encoded_df), "\n")

#%%
# LABEL ENCODING

# we import the necessary framework
from sklearn.preprocessing import LabelEncoder

# we create dummy data to work with
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply label encoding to our dataframe
label_encoder = LabelEncoder()
df['Color_encoded'] = label_encoder.fit_transform(df['Color'])
print("Label Encoded DataFrame:\n" + str(df), "\n")

#%%
# CORRELATION MATRIX

# we import the necessary frameworks
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [5, 4, 3, 2, 1]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we compute the correlation matrix of our features
correlation_matrix = df.corr()

# we visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

#%%
# CHI-SQUARE STATISTIC

# we import the necessary frameworks
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# we create dummy data to work with
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': ['A', 'B', 'A', 'B', 'A'],
        'Label': [0, 1, 0, 1, 0]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we encode the categorical features in our dataframe
label_encoder = LabelEncoder()
df['Feature2_encoded'] = label_encoder.fit_transform(df['Feature2'])
print("Encoded DataFrame:\n" + str(df), "\n")

# we apply the chi-square statistic to our features
X = df[['Feature1', 'Feature2_encoded']]
y = df['Label']
chi_scores = chi2(X, y)
print("Chi-Square Scores:", chi_scores)

#%%
# PRINCIPAL COMPONENT ANALYSIS

# we import the necessary framework
from sklearn.decomposition import PCA

# we create dummy data to work with
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [5, 4, 3, 2, 1]}

# we print the original dataframe for viewing
df = pd.DataFrame(data)
print("Original DataFrame:\n" + str(df), "\n")

# we apply PCA to our features
pca = PCA(n_components=2)
df_pca = pd.DataFrame(pca.fit_transform(df), columns=['PC1', 'PC2'])

# we print the dimensionality reduced features
print("PCA Features:\n" + str(df_pca), "\n")
In the evolving landscape of data engineering, reverse ETL has emerged as a pivotal process for businesses aiming to leverage their data warehouses and other data platforms beyond traditional analytics. Reverse ETL, or "Extract, Transform, Load" in reverse, is the process of moving data from a centralized data warehouse or data lake to operational systems and applications within your data pipeline. This enables businesses to operationalize their analytics, making data actionable by feeding it back into the daily workflows and systems that need it most.

How Does Reverse ETL Work?

Reverse ETL can be visualized as a cycle that begins with data aggregated in a data warehouse. The data is then extracted, transformed (to fit the operational systems' requirements), and finally loaded into various business applications such as a CRM, marketing platforms, or other customer support tools. These concepts can be further explored in this resource on the key components of a data pipeline.

Key Components of Reverse ETL

To effectively implement reverse ETL, it's essential to understand its foundational elements. Each component plays a specific role in ensuring that the data flows smoothly from the data warehouse to operational systems, maintaining integrity and timeliness. Here's a closer look at the key components that make reverse ETL an indispensable part of modern data architecture:

Connectors: Connectors are the bridges between the data warehouse and target applications. They are responsible for the secure and efficient transfer of data.
Transformers: Transformers modify the data into the appropriate format or structure required by the target systems, ensuring compatibility and maintaining data integrity.
Loaders: Loaders are responsible for inserting the transformed data into the target applications, completing the cycle of data utilization.
Data quality: Data quality is paramount in reverse ETL, as it ensures that the data being utilized in operational systems is accurate, consistent, and trustworthy. Without high-quality data, business decisions made based on this data could be flawed, leading to potential losses and inefficiencies.
Scheduling: Scheduling is crucial for the timeliness of data in operational systems. It ensures that the reverse ETL process runs at optimal times to update the target systems with the latest data, which is essential for maintaining real-time or near-real-time data synchronization across the business.

Evolution of Data Management and ETL

The landscape of data management has undergone significant transformation over the years, evolving to meet the ever-growing demands for accessibility, speed, and intelligence in data handling. ETL processes have been at the core of this evolution, enabling businesses to consolidate and prepare data for strategic analysis and decision-making.

Understanding Traditional ETL

Traditional ETL (Extract, Transform, Load) is a foundational process in data warehousing that involves three key steps:

Extract: Data is collected from various operational systems, such as transactional databases, CRM systems, and other business applications.
Transform: The extracted data is cleansed, enriched, and reformatted to fit the schema and requirements of the data warehouse. This step may involve sorting, summarizing, deduplicating, and validating to ensure the data is consistent and ready for analysis.
Load: The transformed data is then loaded into the data warehouse, where it is stored and made available for querying and analysis.
Challenges With Traditional ETL

Traditional ETL has been a staple in data processing and analytics for many years; however, it presents several challenges that can hinder an organization's ability to access and utilize data efficiently.

Data Accessibility

Efficient data access is crucial for timely decision-making, yet traditional ETL can create barriers that impede this flow, such as:

Data silos: Traditional ETL processes often lead to data silos where information is locked away in the data warehouse, making it less accessible for operational use.
Limited integration: Integrating new data sources and operational systems can be complex and time-consuming, leading to difficulties in accessing a holistic view of the data landscape.
Data governance: While governance is necessary, it can also introduce access controls that, if overly restrictive, limit timely data accessibility for users and systems that need it.

Latency

The agility of data-driven operations hinges on the promptness of data delivery, but traditional ETL processes can introduce delays that affect the currency of data insights, exemplified by:

Batch processing: ETL processes are typically batch-based, running during off-peak hours. This means that data can be outdated by the time it's available in the data warehouse for operational systems, reporting, and analysis.
Heavy processing loads: Transformation processes can be resource-intensive, leading to delays, especially when managing large volumes of data.
Pipeline complexity: Complex data pipelines with numerous sources and transformation steps can increase the time it takes to process and load data.

An Introduction to Reverse ETL

Reverse ETL emerged as organizations began to recognize the need to not only make decisions based on their data but to operationalize those insights directly within their business applications. The traditional ETL process focused on aggregating data from operational systems into a central data warehouse for analysis. However, as analytics matured, the insights derived from this data needed to be put into action; this gave rise to differing methods of data transformation based on use case: ETL vs. ELT vs. reverse ETL.

The next evolutionary step was to find a way to move the data and insights from the data warehouse back into the operational systems, effectively turning these insights into direct business outcomes. Reverse ETL was the answer, creating a feedback loop from the data warehouse to operational systems. By transforming the data already aggregated, processed, and enriched within the data warehouse and then loading it back into operational tools (the "reverse" of ETL), organizations can enrich their operational systems with valuable, timely insights, complementing the traditional data analytics lifecycle.

Benefits of Reverse ETL

As part of the evolution of traditional ETL, reverse ETL presents two key advantages:

Data accessibility: With reverse ETL, data housed in a data warehouse can be transformed and seamlessly merged back into day-to-day business tools, breaking down silos and making data more accessible across the organization.
Real-time data synchronization: By moving data closer to the point of action, operational systems get updated with the most relevant, actionable insights, often in near-real-time, enhancing decision-making processes.
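To make the extract-transform-load cycle concrete, here is a minimal Python sketch of a single sync run. The warehouse table, CRM endpoint, payload fields, and run_sync entry point are hypothetical assumptions for illustration; a production pipeline would typically rely on a dedicated reverse ETL tool or the target system's official SDK.

Python
import json
import urllib.request

# Hypothetical warehouse query and CRM endpoint; a real pipeline would use your
# warehouse driver (e.g., a Snowflake or Redshift connector) and the CRM's API client.
WAREHOUSE_QUERY = "SELECT customer_id, email, lifetime_value FROM customer_metrics"
CRM_ENDPOINT = "https://crm.example.com/api/contacts/bulk_update"

def extract(conn):
    # Connector: pull modeled rows out of the warehouse over a DB-API connection.
    cursor = conn.cursor()
    cursor.execute(WAREHOUSE_QUERY)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

def transform(rows):
    # Transformer: reshape warehouse rows into the payload shape the CRM expects.
    return [
        {
            "external_id": row["customer_id"],
            "email": row["email"],
            "properties": {"lifetime_value": round(row["lifetime_value"], 2)},
        }
        for row in rows
    ]

def load(records, api_token):
    # Loader: push the transformed records into the operational system.
    request = urllib.request.Request(
        CRM_ENDPOINT,
        data=json.dumps({"records": records}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

def run_sync(conn, api_token):
    # One scheduled reverse ETL run: extract, transform, then load.
    rows = extract(conn)
    if rows:
        load(transform(rows), api_token)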
Common Challenges of Reverse ETL

Despite the key benefits of reverse ETL, there are several common challenges to consider:

Data consistency and quality: Ensuring the data remains consistent and high-quality as it moves back into varied operational systems requires rigorous checks and ongoing maintenance.
Performance impact on operational systems: Introducing additional data loads to operational systems can impact their performance, which must be carefully managed to avoid disruption to business processes.
Security and regulatory compliance: Moving data out of the data warehouse raises concerns about security and compliance, especially when dealing with sensitive or regulated data.

Understanding these challenges and benefits helps organizations effectively integrate reverse ETL into their data-driven workflow, enriching operational systems with valuable insights and enabling more informed decisions across the entire business.

Reverse ETL Use Cases and Applications

Reverse ETL unlocks the potential of data warehouses by bringing analytical insights directly into the operational tools that businesses use every day. Here are some of the most impactful ways that reverse ETL is being applied across various business functions:

Customer Relationship Management (CRM): Reverse ETL tools transform and sync demographic and behavioral data from the data warehouse into CRM systems, providing sales teams with enriched customer profiles for improved engagement strategies.
Marketing automation: Utilize reverse ETL's transformation features to tailor customer segments based on data warehouse insights and sync them with marketing platforms, enabling targeted campaigns and in-depth performance reporting.
Customer support: Transform and integrate product usage patterns and customer feedback from the data warehouse into support tools, equipping agents with actionable data to personalize customer interactions.
Product development: Usage-driven development leverages reverse ETL to transform and feed feature interaction data back into product management tools, guiding the development of features that align with user engagement and preferences.

In each of these use cases, reverse ETL tools not only move data but also apply the necessary transformations to ensure that the data fits the operational context of the target systems, enhancing the utility and applicability of the insights provided.

Five Factors to Consider Before Implementing Reverse ETL

When considering the implementation of reverse ETL at your organization, it's important to evaluate several factors that can impact the success and efficiency of the process. Here are some key considerations:

1. Data Volume

Assess the volume of data that will be moved to ensure that the reverse ETL tool can handle the load without performance degradation. Determine the data throughput needs, considering peak times and whether the tool can process large batches of data efficiently.

2. Data Integration Complexity

Consider the variety of data sources and target systems and whether the reverse ETL tool supports all necessary connectors. Evaluate the complexity of the data transformations required and whether the tool provides the functionality to implement these transformations easily.

3. Scalability

Ensure that the reverse ETL solution can scale with your business needs, handling increased data loads and additional systems over time.

4. Application Deployment and Maintenance

Verify that the tool is accessible through preferred web browsers like Chrome and Safari.
Determine whether the tool can be cloud-hosted or self-hosted, and understand the hosting preferences of your enterprise customers (on-prem vs. cloud). Look for built-in integration with version control systems like GitHub for detecting and applying configuration changes.

5. Security

When implementing reverse ETL, ensure robust security by confirming the tool's adherence to SLAs with uptime monitoring, a clear process for regular updates and patches, and compliance with data protection standards like GDPR. Additionally, verify the tool's capability for data tokenization, its encryption standards for data at rest, and its possession of key certifications like SOC 2 Type 2 and EU/US Privacy Shield.

By weighing these factors, organizations can ensure that the reverse ETL tool they select not only meets their data processing needs but also aligns with their technical infrastructure, security standards, and regulatory compliance requirements.

Reverse ETL Best Practices

To maximize the benefits of reverse ETL, it's essential to adhere to best practices that ensure the process is efficient, secure, and scalable. These practices lay the groundwork for a robust data infrastructure:

Data governance: Establish clear data governance policies to maintain data quality and compliance throughout the reverse ETL process.
Monitoring and alerting: Implement comprehensive monitoring and alerting to quickly identify and resolve issues with data pipelines.
Scalability and performance: Design reverse ETL workflows with scalability in mind to accommodate future growth and ensure that they do not negatively impact the performance of source or target systems.

Top Three Reverse ETL Tools

Choosing the right reverse ETL tool is crucial for success. Here's a brief overview of three popular platforms:

Hightouch: A platform that specializes in syncing data from data warehouses directly to business tools, offering a wide range of integrations and a user-friendly interface.
Census: Known for its strong integration capabilities, Census allows businesses to operationalize their data warehouse content across their operational systems.
Segment: Known for its customer data platform (CDP), Segment provides reverse ETL features that allow businesses to use their customer data effectively in marketing, sales, and customer service applications.

To help select the most suitable reverse ETL tool for your organization's needs, here's a comparison of key features and differences between example solutions.

Reverse ETL Tool Comparison

Feature                  Hightouch      Census         Segment
Core offering            Reverse ETL    Reverse ETL    CDP + limited reverse ETL
Connectors               Extensive      Broad          Broad
Custom connectors        Yes            Yes            Yes
Real-time sync           Yes            Yes            Yes
Transformation layer     Yes            Yes            Only available on customer data
Security & compliance    Strong         Strong         Strong
Pricing model            Rows-based     Fields-based   Tiered

Bottom Line: Is Reverse ETL Right for Your Business?

Reverse ETL can be a game-changer for businesses looking to leverage their data warehouse insights in operational systems and workflows. If your organization requires real-time data access, enhanced customer experiences, or more personalized marketing efforts, reverse ETL could be the right solution. However, it's essential to consider factors such as data volume, integration complexity, and security requirements to ensure that a reverse ETL tool aligns with your business objectives and technical requirements.
Integrating data from multiple sources like Salesforce and Oracle into Amazon Redshift is crucial for organizations looking to centralize their analytics. This article demonstrates how to connect to Salesforce and Oracle, extract data using SOQL and SQL queries, load it into Redshift staging tables, and perform transformations using Redshift stored procedures, all orchestrated through Python scripts.

Prerequisites

Salesforce: Access to Salesforce with the necessary API permissions.
Oracle: Access to an Oracle database with the necessary query permissions.
Amazon Redshift: An existing Redshift cluster.
Python: Installed with the necessary libraries (simple_salesforce, cx_Oracle, boto3, psycopg2).

Connecting to Salesforce and Extracting Data

First, let's connect to Salesforce and extract data using SOQL.

Python
from simple_salesforce import Salesforce
import pandas as pd

# Salesforce credentials
sf = Salesforce(username='your_username',
                password='your_password',
                security_token='your_security_token')

# SOQL query to fetch data from Salesforce
query = "SELECT Id, Name, AccountNumber FROM Account"
response = sf.query_all(query)

# Convert response to a DataFrame
data_sf = pd.DataFrame(response['records']).drop(columns='attributes')
print(data_sf.head())

Library import: We import Salesforce from simple_salesforce for easy Salesforce API interaction and pandas for data manipulation.
Salesforce connection: We establish a connection to Salesforce using the provided credentials. In practice, pass the username, password, and security token in from a configuration file or environment variables; never hardcode passwords.
SOQL query: We execute a SOQL (Salesforce Object Query Language) query to retrieve specific fields (Id, Name, AccountNumber) from the Account object.
Data conversion: The response is converted into a pandas DataFrame for easier manipulation and analysis.

Connecting to Oracle and Extracting Data

Next, let's connect to Oracle and extract data using SQL.

Python
import cx_Oracle

# Oracle credentials and connection details
oracle_dsn = cx_Oracle.makedsn("your_oracle_host", "your_oracle_port", service_name="your_service_name")
conn_oracle = cx_Oracle.connect(user="your_username", password="your_password", dsn=oracle_dsn)

# SQL query to fetch data from Oracle
sql_query = "SELECT ID, NAME, ACCOUNT_NUMBER FROM ACCOUNTS"
data_oracle = pd.read_sql(sql_query, con=conn_oracle)
print(data_oracle.head())

# Close Oracle connection
conn_oracle.close()

Library import: We import cx_Oracle for Oracle database connections.
Oracle connection: We establish a connection to Oracle using the provided credentials.
SQL query: We execute a SQL query to retrieve specific fields (ID, NAME, ACCOUNT_NUMBER) from the ACCOUNTS table.
Data conversion: The result is converted into a pandas DataFrame for easier manipulation and analysis.

Loading Data Into Redshift Staging Tables

Now, we load the extracted data from Salesforce and Oracle into Redshift staging tables.
Python
import boto3
import psycopg2
from io import StringIO

# Redshift credentials and connection details
redshift_host = 'your_redshift_host'
redshift_db = 'your_database'
redshift_user = 'your_user'
redshift_password = 'your_password'
redshift_port = 5439

# Connect to Redshift
conn_redshift = psycopg2.connect(
    host=redshift_host,
    dbname=redshift_db,
    user=redshift_user,
    password=redshift_password,
    port=redshift_port
)
cur_redshift = conn_redshift.cursor()

# Create staging tables (if they don't exist)
create_sf_table_query = """
CREATE TABLE IF NOT EXISTS staging_account_sf (
    Id VARCHAR(18),
    Name VARCHAR(255),
    AccountNumber VARCHAR(40)
);
"""
create_oracle_table_query = """
CREATE TABLE IF NOT EXISTS staging_account_oracle (
    ID VARCHAR(18),
    NAME VARCHAR(255),
    ACCOUNT_NUMBER VARCHAR(40)
);
"""
cur_redshift.execute(create_sf_table_query)
cur_redshift.execute(create_oracle_table_query)
conn_redshift.commit()

# Load Salesforce data into staging table
csv_buffer_sf = StringIO()
data_sf.to_csv(csv_buffer_sf, index=False, header=False)
csv_buffer_sf.seek(0)
cur_redshift.copy_from(csv_buffer_sf, 'staging_account_sf', sep=',')
conn_redshift.commit()

# Load Oracle data into staging table
csv_buffer_oracle = StringIO()
data_oracle.to_csv(csv_buffer_oracle, index=False, header=False)
csv_buffer_oracle.seek(0)
cur_redshift.copy_from(csv_buffer_oracle, 'staging_account_oracle', sep=',')
conn_redshift.commit()

Library import: We import boto3 for AWS interactions, psycopg2 for PostgreSQL/Redshift connections, and StringIO for in-memory file operations.
Redshift connection: We establish a connection to Redshift using the provided credentials.
Create staging tables: We create staging tables (staging_account_sf for Salesforce data and staging_account_oracle for Oracle data) if they don't already exist.
Data loading: The DataFrames are converted to CSV format and loaded into the respective staging tables using copy_from, which efficiently loads data into Redshift.

Executing Stored Procedures for ELT

Once the data is in the staging tables, we can call stored procedures in Redshift to transform the data and load it into the final tables.

Python
# Call stored procedure for transformation
stored_procedure_query = "CALL transform_data_procedure();"
cur_redshift.execute(stored_procedure_query)
conn_redshift.commit()

# Verify data in the final table
verify_query = "SELECT * FROM final_account_table LIMIT 10;"
cur_redshift.execute(verify_query)
for row in cur_redshift.fetchall():
    print(row)

# Close the connection
cur_redshift.close()
conn_redshift.close()

Stored procedure call: We call a stored procedure (transform_data_procedure) in Redshift that performs the necessary transformations and loads the data into the final tables. This encapsulates the ELT (Extract, Load, Transform) logic within the database, leveraging Redshift's processing power. A sketch of what such a procedure might look like appears at the end of this article.
Data verification: We run a query to verify that the data has been correctly transformed and loaded into the final table (final_account_table).
Close connection: Finally, we close the cursor and the database connection to clean up resources.

Conclusion

This script demonstrates a complete workflow for extracting data from Salesforce and Oracle, loading it into Amazon Redshift, and performing ELT operations using stored procedures. This approach leverages the strengths of each component: Salesforce and Oracle for CRM and relational data, Python for orchestration, and Redshift for scalable data transformations.
By centralizing data in Redshift, organizations can perform more comprehensive analyses and derive valuable insights from their Salesforce and Oracle data, enabling better decision-making and operational efficiency.
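As a side note on the credential handling mentioned above: the connection snippets use hardcoded placeholder values, and the recommendation is to read them from a config file or environment variables instead. The sketch below is a minimal, hypothetical example of that approach and is not part of the original script; the environment variable names (SF_USERNAME, ORACLE_HOST, REDSHIFT_HOST, and so on) are assumptions you would replace with your own.

Python

# Minimal sketch (assumed variable names, not from the original article):
# read all credentials from environment variables instead of hardcoding them.
import os

import cx_Oracle
import psycopg2
from simple_salesforce import Salesforce


def connect_salesforce() -> Salesforce:
    # Same simple_salesforce call as above, with secrets pulled from the environment
    return Salesforce(
        username=os.environ["SF_USERNAME"],
        password=os.environ["SF_PASSWORD"],
        security_token=os.environ["SF_SECURITY_TOKEN"],
    )


def connect_oracle():
    # makedsn/connect mirror the earlier Oracle snippet
    dsn = cx_Oracle.makedsn(
        os.environ["ORACLE_HOST"],
        int(os.environ.get("ORACLE_PORT", "1521")),
        service_name=os.environ["ORACLE_SERVICE_NAME"],
    )
    return cx_Oracle.connect(
        user=os.environ["ORACLE_USER"],
        password=os.environ["ORACLE_PASSWORD"],
        dsn=dsn,
    )


def connect_redshift():
    # psycopg2 connection, matching the Redshift snippet above
    return psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
    )

With helpers like these, the rest of the workflow stays unchanged; only the source of the secrets moves out of the code.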
Data analysis and visualization are fundamental skills in the realm of data science. Python, a versatile programming language, offers robust libraries like Pandas and Matplotlib to facilitate these tasks. Pandas provides powerful data manipulation capabilities, while Matplotlib is excellent for creating a wide range of visualizations. This tutorial will walk you through the process of analyzing a dataset and creating insightful visualizations using these libraries. By the end of this tutorial, you will be equipped with the knowledge to handle data more effectively and present your findings visually. Data Preparation The first step in any data analysis project is to prepare the data. Data preparation involves collecting, cleaning, and organizing data into a structured format. We'll start by importing the necessary libraries and creating a sample dataset that includes information on various products, their sales, and profit figures. Python import pandas as pd # Sample dataset data = { 'Product': ['A', 'B', 'C', 'D', 'E'], 'Sales': [150, 200, 300, 250, 100], 'Profit': [50, 70, 120, 100, 40] } # Create DataFrame df = pd.DataFrame(data) print(df) In this snippet, we initialize a DataFrame with product data. The `Product` column contains product names, while the `Sales` and `Profit` columns hold numerical data. This structured format allows for easy manipulation and analysis. Data Analysis Once the data is prepared, we can proceed with the analysis. This involves calculating basic statistics and exploring relationships within the data. Analyzing data helps us understand underlying patterns and trends, which can inform decision-making. Descriptive Statistics Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset's distribution, and Pandas makes computing them easy. Python # Summary statistics summary = df.describe() print(summary) The `describe` method provides a summary that includes the mean, standard deviation, and quartiles for the numerical columns in the DataFrame. These statistics give us a quick overview of the dataset's characteristics. Sales and Profit Analysis To gain deeper insights, we can calculate the profit margin for each product. The profit margin is a measure of profitability and is calculated as the profit divided by sales, expressed as a percentage. Python # Calculate profit margin df['Prft_mrgn'] = (df['Profit'] / df['Sales']) * 100 print(df[['Product', 'Prft_mrgn']]) This calculation adds a new column, `Prft_mrgn`, to the DataFrame, allowing us to compare the profitability of different products. Understanding profit margins helps in evaluating which products are more financially viable. Data Visualization Visualizing data helps to convey insights more effectively. Matplotlib is a comprehensive library for creating various types of plots. Visualization is crucial for interpreting data and communicating findings to a broader audience. Bar Chart A bar chart is ideal for comparing the sales of different products. It provides a clear visual representation of how each product performs in terms of sales. Python import matplotlib.pyplot as pyplt # Bar chart for sales pyplt.figure(figsize=(10, 6)) pyplt.bar(df['Product'], df['Sales'], color='skyblue') pyplt.xlabel('Product') pyplt.ylabel('Sales') pyplt.title('Sales by Product') pyplt.show() This code generates a bar chart, with product names along the x-axis and sales figures along the y-axis. The color and size of the chart can be customized to enhance readability.
Bar charts are effective for displaying categorical data. Pie Chart A pie chart is useful for showing the proportion of total sales contributed by each product. It visually demonstrates how each product's sales compare to the whole. Python # Pie chart for sales distribution pyplt.figure(figsize=(8, 8)) pyplt.pie(df['Sales'], labels=df['Product'], autopct='%1.1f%%', startangle=140) pyplt.title('Sales Distribution by Product') pyplt.show() The pie chart segments are labeled with product names and their corresponding sales percentages, providing a clear picture of each product's contribution to total sales. Pie charts are excellent for showing parts of a whole. Scatter Plot Scatter plots are effective for examining the relationship between two numerical variables. We use a scatter plot to show the relationship between sales and profit. Python # Scatter plot for sales vs. profit pyplt.figure(figsize=(10, 6)) pyplt.scatter(df['Sales'], df['Profit'], color='green') pyplt.xlabel('Sales') pyplt.ylabel('Profit') pyplt.title('Sales vs. Profit') pyplt.show() In this scatter plot, each point represents a product. The x-axis shows sales figures, while the y-axis represents profit. This plot helps identify trends or patterns, such as whether higher sales correlate with higher profit. Scatter plots are useful for detecting relationships between variables. Conclusion In this tutorial, I demonstrated how to perform basic data analysis and visualization using Pandas and Matplotlib. I started by preparing the data, moved on to calculating descriptive statistics and profit margins, and finally created various plots to visualize the data, including bar charts, pie charts, and scatter plots. Mastering these tools will enable you to analyze data effectively and communicate your findings through compelling visualizations. By leveraging the power of Pandas and Matplotlib, you can transform raw data into meaningful insights.
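As a small extension of the tutorial above (not part of the original article), the Pandas and Matplotlib pieces can be combined: the sketch below re-creates the sample DataFrame, sorts it by the Prft_mrgn column computed earlier, and plots the ranking as a horizontal bar chart, assuming the same sample data and the pyplt alias used above.

Python

# Minimal sketch: rank products by profit margin and plot the ranking.
import pandas as pd
import matplotlib.pyplot as pyplt

# Same sample dataset as in the tutorial
data = {
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Sales': [150, 200, 300, 250, 100],
    'Profit': [50, 70, 120, 100, 40]
}
df = pd.DataFrame(data)
df['Prft_mrgn'] = (df['Profit'] / df['Sales']) * 100

# Sort ascending so the product with the highest margin ends up at the top of the chart
ranked = df.sort_values('Prft_mrgn')

pyplt.figure(figsize=(10, 6))
pyplt.barh(ranked['Product'], ranked['Prft_mrgn'], color='orange')
pyplt.xlabel('Profit Margin (%)')
pyplt.ylabel('Product')
pyplt.title('Profit Margin by Product')
pyplt.show()

Sorting before plotting is a small design choice that makes the ranking readable at a glance, which is usually the point of a bar chart.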
TL;DR: I might be biased on this, but I’m also equipped with analytics on column-level lineage usage from a number of customers and users. Image courtesy of the Masthead Data team: Data Lineage Is Data Lineage a Pain Killer or Vitamin? First, it very much depends on the user organization’s current use cases and their level of maturity. In my humble opinion, data engineers love looking at data flows and have that visual understanding of dependencies, but do they really use data lineage at the end of the day? What is the usage frequency? What are the specific use cases? From what we observed, data lineage certainly drives interest. However, when it comes to actual usage, it is not the central feature. This could be because our implementation is limited to some data sources. However, having lineage limited to only some pipelines also seems less meaningful to me (i.e., lineage in dbt or Dataform), as ingestion and other processes are left in shades. A typical use case might involve someone in the organization searching for a specific pipeline or model about twice a week for a few minutes. Common Uses for Data Lineage The most common use cases for lineage we saw were: The company is migrating or rebuilding its data platform. The organization is onboarding new teammates, often for new data initiatives. These are the times when lineage becomes very handy. Basically, it’s when the company starts not just maintaining what is in their data warehouse or data lake, but actually building and modernizing the data ecosystem. Does this mean that having lineage is a must in this case? Absolutely not. But if you are interested in moving faster and smarter, then the answer is absolutely yes. Questions To Consider So, it very much depends on what the organization is currently doing. I am not trying to be assertive here, but rather intelligently honest by asking if you really need data lineage. You might want to start with questions like: What is it for? What level of coverage do you need? Does it need to visualize production sources, or is a data warehouse enough? Do you need a BI solution connected? If yes, to what extent? Then you speak to the universe and decide: buy or build. There’s a lot to consider here. My take is as follows: Will it be used by the data team only, or will business users also be involved? (Consider the level of UX/UI required.) How much are you ready to invest in it? (Calculate the cost of building it internally at the expense of your team’s hours and compare it to purchasing from a vendor.) Please, do not forget to double the hours your team initially promised to you. Hear me out; I'm speaking as a product manager here. Consider what you have already in your data platform: data lake, using third-party data sources, and the stack already in use by the data team. It sounds easy and fun until you start dealing with complex cases like cross-project dependencies, views of temporary tables, or, heaven forbid, sharded tables, etc., and the list goes on. What is your team’s strategic focus and their skill set? Is it a strategic investment for you, and do you have the capacity to maintain and evolve it? Because your data platform, whether you believe it or not, will evolve. Conclusion Ultimately, my personal belief is that data lineage as a standalone visualization is not effective. 
Our use case for data lineage is to help troubleshoot broken pipelines or model errors, because when organizations have an active warehouse with hundreds of pipelines and thousands of tables, it is impossible to keep track of whether everything is working as expected. Data quality checks are SQL rules for issues that are already anticipated and known, but pipelines and models are a different beast. It is a lot about the connectivity, compatibility, and effectiveness of the data platforms. Pairing data pipeline/model error detection with data lineage is the area where we see a lot of response and value for users. Additionally, it helps our clients save money, as it is also connected to cost insights. Having lineage alone does not solve the problem; it creates a new one: no one understands how the solution is being used, because lineage by itself does not move the needle. Rather, it helps move the needle faster in combination with anomaly detection and pipeline error detection. While data lineage alone may be seen as just another shiny tool, its true value emerges when paired with comprehensive monitoring mechanisms and a commitment from the organization and the data team to build up a robust and reliable data platform.
This article dives into the world of distributed systems and explores a fundamental principle called the CAP theorem. Distributed systems play a crucial role in many modern applications, and the CAP theorem helps us understand the trade-offs inherent in these systems. What Are Distributed Systems? Distributed systems distribute computations and data across multiple interconnected nodes within a network. This can involve offloading processing power or geographically dispersing data for faster delivery. Unlike centralized systems that store data in a single location, distributed systems spread data across multiple interconnected points. This distribution brings its own set of challenges, with the first hurdle being network failures. Network links can break, creating network partitions where the entire network gets divided into isolated groups. Each partition loses communication with the others. A system's ability to withstand such partitions is called partition tolerance. It's important to note that achieving 100% partition tolerance is practically impossible; all distributed systems are susceptible to network partitions at some point. These partitions, though temporary, disrupt communication between parts of the system. Let's consider a social media platform like Instagram or Facebook. If a partition occurs, a newly posted photo might not be visible to users in another partition until the network recovers. This leads us to another crucial property of distributed systems: consistency. You may have already noticed that when a partition occurs, the data seen by the two partitions can differ (in other words, become inconsistent). Consistency is a measure of whether the system's data is correct and in agreement across the network at a given time. Consistency plays a crucial role in financial applications. Unlike social media posts, financial data must become consistent across the system quickly; if it does not, serious problems such as the double-spending problem can occur. If you are not familiar with the double-spending problem, consider a financial system that holds each person's balance. Assume Alice has $100 in her account and all the servers agree on Alice's account balance of $100. Alice buys $90 worth of watches online through server A. Server A completes the transaction and sends a notification to the other servers to deduct $90 from Alice's account. But before the message propagates, a sudden partition occurs and server B does not receive Alice's transaction. If Alice connects to server B and performs another transaction, that server still considers Alice to have $100 and lets her spend it again. If Alice buys a bag for $50 through server B, the transaction still passes. As you can see, consistency matters a great deal in this kind of financial system, and it needs to be kept at a high level. In contrast, on social media platforms it does not matter much how fast you receive your friend's update. Now we know that financial transaction systems need stronger consistency. But how do we achieve it? There are many possible consistency levels, but let's analyze the following ones, which are the most commonly used in distributed systems: Linearizability Sequential Consistency Causal Consistency Eventual Consistency Linearizability Linearizability is the strongest consistency level. This consistency model works by placing a read lock on all the nodes in the system whenever any node's data needs to be updated.
By placing a read lock on all the nodes, we can make sure that none of the nodes in the system reads partially updated data. The lock is removed once all the data is in a consistent state. If there is a network partition, the system takes more time to reach a consistent state. What happens if a client connects to a node and requests to read data while the node is locked? Well, the client has to wait until the node releases the lock. This leads us to the third important property: availability. Availability is the property that when a client requests something from a node, the node responds. If the node is not ready to serve the request, the client gets no response or a failed response. In the linearizability example, we lock the nodes, so the client does not get a response. This means that until the data becomes consistent, the nodes are not available. We can conclude that the stronger the consistency, the less availability we can achieve. Sequential Consistency In contrast to linearizability, sequential consistency is a more relaxed consistency model. Instead of locking all the nodes for every data update, sequential consistency locks nodes only to keep updates in chronological order. Think about a chat application. When two people chat, both people's messages should appear in proper chronological order. If not, the conversation becomes a mess to follow. Let us understand it with an example. Alice wants to send a message to the chat group, so she sends the message to the system. Assuming the system has no partitions, her message propagates to all the nodes. Now Bob also sends a message to the group, but a network partition occurs and some nodes do not get updated with Bob's message. If Alice then sends another message in this partitioned system, some nodes are still missing Bob's previous message. If a node skips Bob's message and only adds Alice's new one, the chat becomes a mess. In this scenario, a sequentially consistent system puts a write lock on the node so that new data can be written only if all previously published data has already been applied. The node has to wait until the previously sent messages reach it before it accepts newer ones. In this consistency model, we only consider the sequential order of each node's data. As you can see, nodes are more available than in the linearizability model, where the lock is applied until the order of all events is resolved. The client can still get a response from a given node in the partitioned system, but only up to the last data that node holds in the correct sequential order. Causal Consistency Causal consistency is an even more relaxed consistency model that does not care about global order. It only considers the relations between data. For example, think about a social media platform where you can post images and add comments. Alice posts a photo to the platform, and at the same time, Bob also posts a photo. These two posts have no relation to each other. The order of the posts does not affect a third person, Charlie, looking at both photos. But each photo has a comment section, and when Charlie reads the comments on a photo, those comments should appear in chronological order for him to understand them. Comments in this system are causally consistent. This means that order matters in some cases, but not in all scenarios: unrelated entities (such as posts) can have more relaxed consistency, while comments depend on their related posts.
Eventual Consistency Now we can understand eventual consistency easily. Think about the same social media platform without a comment feature. Alice and Bob post photos, and Charlie sees their posts in his feed. It does not matter in which order he receives them. This leads us to another important fact about availability. At the linearizability level, we could not achieve high availability due to locking. With eventual consistency, we don't need any locks, so the server nodes are available at all times. What the CAP Theorem Is All About Now that we have discovered all the pieces of the CAP theorem, it is time to complete the puzzle. We have discussed three properties of a distributed system: partition tolerance, which tells us the system can tolerate partitioning; consistency, which keeps data consistent across the distributed system; and availability, which makes sure the system always responds to client requests. The CAP theorem states that we can only have two of these three properties at the same time. Let's have a look at all three cases. Select Partition Tolerance and Consistency Over Availability This is the scenario in which we allow the system to have partitions but need to keep it consistent. The financial transaction scenario we discussed belongs to this category. To maintain consistency, we have to lock the nodes until the data becomes consistent across the system; only then are reads allowed. Therefore, availability is limited. Select Partition Tolerance and Availability Over Consistency This is the scenario in which we don't need strict consistency in the system. Recall the social media system discussed under eventual consistency: we don't need strictly consistent data and can accept updates arriving in any order. With such a relaxed consistency level, we don't need to lock nodes, so the nodes are always available. Availability and Consistency Over Partition Tolerance This kind of system is really a centralized system rather than a distributed one. We cannot build a distributed system without network partitions occurring. If we guarantee availability and consistency at the same time, there cannot be any partitions, which means the system is centralized. Therefore, systems that are both available and consistent are not really part of the distributed systems discussion. Example Use Cases for CAP Theorem Now you know what the CAP theorem is. Let's see some example tools for each of the three cases above. Partition Tolerance and Consistent System Redis and MongoDB are popular database solutions for CP (consistent and partition-tolerant) systems. These databases are built to work as distributed databases and allow partitions to happen. Even when there are partitions, they let you keep consistent data across all the connected databases. Partition Tolerance and Available System This kind of system does not care much about consistency. Rather, it cares about being responsive and operating fast. This category includes databases such as Cassandra and CouchDB, which are built to operate in distributed environments with fast operations. Consistent and Available System This type of system is not meant to be distributed but is rather a centralized system. Databases such as SQLite and H2 are built to run on a single node. These databases are always consistent and available since they don't need to lock nodes; there is only one node. But you cannot use them in a distributed system, as they cannot tolerate partitioning. Why is MySQL not listed in any of those categories?
Well, MySQL is generally considered a CA system, designed to provide good consistency and availability while sacrificing partition tolerance. The reason is that MySQL typically runs in a master-slave mode where the master is the node the clients access, so it is not the kind of true distributed system we are talking about here. With different configurations, however, you can still operate it as a CP system. As you can see, the CAP theorem imposes limitations on distributed systems. As system designers and developers, we must choose which properties to prioritize: consistency, availability, or partition tolerance. The chosen consistency model will determine whether we prioritize high availability or sacrifice some availability to ensure consistency. Many tools exist to help you design your application based on these trade-offs. Choose wisely! The next time you encounter a distributed system, consider these three crucial properties.
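To make the double-spending example from earlier in the article concrete, here is a toy Python sketch (not from the original article; the class and variable names are hypothetical) of two replicas that stop exchanging updates during a partition. It illustrates why an available-but-eventually-consistent (AP) setup can let Alice spend the same money twice, while a CP setup would instead reject or delay the second write.

Python

# Toy sketch of the double-spending scenario under a network partition.
class Replica:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.peers = []           # replicas this node replicates to
        self.partitioned = False  # True once the network splits

    def spend(self, amount):
        if amount > self.balance:
            raise ValueError(f"{self.name}: insufficient funds")
        self.balance -= amount
        # AP behavior: answer immediately and replicate asynchronously;
        # the update simply never reaches peers on the other side of a partition.
        for peer in self.peers:
            if not (self.partitioned or peer.partitioned):
                peer.balance = self.balance

server_a = Replica("server A", 100)
server_b = Replica("server B", 100)
server_a.peers.append(server_b)
server_b.peers.append(server_a)

# The partition happens before server A's update can propagate.
server_a.partitioned = True
server_b.partitioned = True

server_a.spend(90)   # succeeds on server A; server B never hears about it
server_b.spend(50)   # also succeeds: server B still believes Alice has $100
print(server_a.balance, server_b.balance)  # 10 50 -- the replicas now disagree

# A CP system would refuse or block one of these writes during the partition,
# keeping a single correct balance at the cost of availability.

This is only an illustration of the trade-off, not a recipe; real systems address it with quorum writes, consensus protocols, or by accepting the conflict and reconciling it later.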
The AIDocumentLibraryChat project has been extended to support questions for searching relational databases. The user can input a question and then the embeddings search for relevant database tables and columns to answer the question. Then the AI/LLM gets the database schemas of the relevant tables and generates based on the found tables and columns a SQL query to answer the question with a result table. Dataset and Metadata The open-source dataset that is used has 6 tables with relations to each other. It contains data about museums and works of art. To get useful queries of the questions, the dataset has to be supplied with metadata and that metadata has to be turned in embeddings. To enable the AI/LLM to find the needed tables and columns, it needs to know their names and descriptions. For all datatables like the museum table, metadata is stored in the column_metadata and table_metadata tables. Their data can be found in the files: column_metadata.csv and table_metadata.csv. They contain a unique ID, the name, the description, etc. of the table or column. That description is used to create the embeddings the question embeddings are compared with. The quality of the description makes a big difference in the results because the embedding is more precise with a better description. Providing synonyms is one option to improve the quality. The Table Metadata contains the schema of the table to add only the relevant table schemas to the AI/LLM prompt. Embeddings To store the embeddings in Postgresql, the vector extension is used. The embeddings can be created with the OpenAI endpoint or with the ONNX library that is provided by Spring AI. Three types of embeddings are created: Tabledescription embeddings Columndescription embeddings Rowcolumn embeddings The Tabledescription embeddings have a vector based on the table description and the embedding has the tablename, the datatype = table, and the metadata id in the metadata. The Columndescription embeddings have a vector based on the column description and the embedding has the tablename, the dataname with the column name, the datatype = column, and the metadata id in the metadata. The Rowcolumn embeddings have a vector based on the content row column value. That is used for the style or subject of an artwork to be able to use the values in the question. The metadata has the datatype = row, the column name as dataname, the tablename, and the metadata id. Implement the Search The search has 3 steps: Retrieve the embeddings Create the prompt Execute query and return result Retrieve the Embeddings To read the embeddings from the Postgresql database with the vector extension, Spring AI uses the VectorStore class in the DocumentVSRepositoryBean: Java @Override public List<Document> retrieve(String query, DataType dataType) { return this.vectorStore.similaritySearch( SearchRequest.query(query).withFilterExpression( new Filter.Expression(ExpressionType.EQ, new Key(MetaData.DATATYPE), new Value(dataType.toString())))); } The VectorStore provides a similarity search for the query of the user. The query is turned in an embedding and with the FilterExpression for the datatype in the header values, the results are returned. 
The TableService class uses the repository in the retrieveEmbeddings method: Java private EmbeddingContainer retrieveEmbeddings(SearchDto searchDto) { var tableDocuments = this.documentVsRepository.retrieve( searchDto.getSearchString(), MetaData.DataType.TABLE, searchDto.getResultAmount()); var columnDocuments = this.documentVsRepository.retrieve( searchDto.getSearchString(), MetaData.DataType.COLUMN, searchDto.getResultAmount()); List<String> rowSearchStrs = new ArrayList<>(); if(searchDto.getSearchString().split("[ -.;,]").length > 5) { var tokens = List.of(searchDto.getSearchString() .split("[ -.;,]")); for(int i = 0;i<tokens.size();i = i+3) { rowSearchStrs.add(tokens.size() <= i + 3 ? "" : tokens.subList(i, tokens.size() >= i +6 ? i+6 : tokens.size()).stream().collect(Collectors.joining(" "))); } } var rowDocuments = rowSearchStrs.stream().filter(myStr -> !myStr.isBlank()) .flatMap(myStr -> this.documentVsRepository.retrieve(myStr, MetaData.DataType.ROW, searchDto.getResultAmount()).stream()) .toList(); return new EmbeddingContainer(tableDocuments, columnDocuments, rowDocuments); } First, documentVsRepository is used to retrieve the document with the embeddings for the tables/columns based on the search string of the user. Then, the search string is split into chunks of 6 words to search for the documents with the row embeddings. The row embeddings are just one word, and to get a low distance, the query string has to be short; otherwise, the distance grows due to all the other words in the query. Then the chunks are used to retrieve the row documents with the embeddings. Create the Prompt The prompt is created in the TableService class with the createPrompt method: Java private Prompt createPrompt(SearchDto searchDto, EmbeddingContainer documentContainer) { final Float minRowDistance = documentContainer.rowDocuments().stream() .map(myDoc -> (Float) myDoc.getMetadata().getOrDefault(MetaData.DISTANCE, 1.0f)).sorted().findFirst().orElse(1.0f); LOGGER.info("MinRowDistance: {}", minRowDistance); var sortedRowDocs = documentContainer.rowDocuments().stream() .sorted(this.compareDistance()).toList(); var tableColumnNames = this.createTableColumnNames(documentContainer); List<TableNameSchema> tableRecords = this.tableMetadataRepository .findByTableNameIn(tableColumnNames.tableNames()).stream() .map(tableMetaData -> new TableNameSchema(tableMetaData.getTableName(), tableMetaData.getTableDdl())).collect(Collectors.toList()); final AtomicReference<String> joinColumn = new AtomicReference<String>(""); final AtomicReference<String> joinTable = new AtomicReference<String>(""); final AtomicReference<String> columnValue = new AtomicReference<String>(""); sortedRowDocs.stream().filter(myDoc -> minRowDistance <= MAX_ROW_DISTANCE) .filter(myRowDoc -> tableRecords.stream().filter(myRecord -> myRecord.name().equals(myRowDoc.getMetadata() .get(MetaData.TABLE_NAME))).findFirst().isEmpty()) .findFirst().ifPresent(myRowDoc -> { joinTable.set(((String) myRowDoc.getMetadata() .get(MetaData.TABLE_NAME))); joinColumn.set(((String) myRowDoc.getMetadata() .get(MetaData.DATANAME))); tableColumnNames.columnNames().add(((String) myRowDoc.getMetadata() .get(MetaData.DATANAME))); columnValue.set(myRowDoc.getContent()); this.tableMetadataRepository.findByTableNameIn( List.of(((String) myRowDoc.getMetadata().get(MetaData.TABLE_NAME)))) .stream().map(myTableMetadata -> new TableNameSchema( myTableMetadata.getTableName(), myTableMetadata.getTableDdl())).findFirst() .ifPresent(myRecord -> tableRecords.add(myRecord)); }); var 
messages = createMessages(searchDto, minRowDistance, tableColumnNames, tableRecords, joinColumn, joinTable, columnValue); Prompt prompt = new Prompt(messages); return prompt; } First, the min distance of the rowDocuments is filtered out. Then a list row of documents sorted by distance is created. The method createTableColumnNames(...) creates the tableColumnNames record that contains a set of column names and a list of table names. The tableColumnNames record is created by first filtering for the 3 tables with the lowest distances. Then the columns of these tables with the lowest distances are filtered out. Then the tableRecords are created by mapping the table names to the schema DDL strings with the TableMetadataRepository. Then the sorted row documents are filtered for MAX_ROW_DISTANCE and the values joinColumn, joinTable, and columnValue are set. Then the TableMetadataRepository is used to create a TableNameSchema and add it to the tableRecords. Now the placeholders in systemPrompt and the optional columnMatch can be set: Java private final String systemPrompt = """ ... Include these columns in the query: {columns} \n Only use the following tables: {schemas};\n %s \n """; private final String columnMatch = """ Join this column: {joinColumn} of this table: {joinTable} where the column has this value: {columnValue}\n """; The method createMessages(...) gets the set of columns to replace the {columns} placeholder. It gets tableRecords to replace the {schemas} placeholder with the DDLs of the tables. If the row distance was beneath the threshold, the property columnMatch is added at the string placeholder %s. Then the placeholders {joinColumn}, {joinTable}, and {columnValue} are replaced. With the information about the required columns the schemas of the tables with the columns and the information of the optional join for row matches, the AI/LLM is able to create a sensible SQL query. Execute Query and Return Result The query is executed in the createQuery(...) method: Java public SqlRowSet searchTables(SearchDto searchDto) { EmbeddingContainer documentContainer = this.retrieveEmbeddings(searchDto); Prompt prompt = createPrompt(searchDto, documentContainer); String sqlQuery = createQuery(prompt); LOGGER.info("Sql query: {}", sqlQuery); SqlRowSet rowSet = this.jdbcTemplate.queryForRowSet(sqlQuery); return rowSet; } First, the methods to prepare the data and create the SQL query are called and then queryForRowSet(...) is used to execute the query on the database. The SqlRowSet is returned. The TableMapper class uses the map(...) method to turn the result into the TableSearchDto class: Java public TableSearchDto map(SqlRowSet rowSet, String question) { List<Map<String, String>> result = new ArrayList<>(); while (rowSet.next()) { final AtomicInteger atomicIndex = new AtomicInteger(1); Map<String, String> myRow = List.of(rowSet .getMetaData().getColumnNames()).stream() .map(myCol -> Map.entry( this.createPropertyName(myCol, rowSet, atomicIndex), Optional.ofNullable(rowSet.getObject( atomicIndex.get())) .map(myOb -> myOb.toString()).orElse(""))) .peek(x -> atomicIndex.set(atomicIndex.get() + 1)) .collect(Collectors.toMap(myEntry -> myEntry.getKey(), myEntry -> myEntry.getValue())); result.add(myRow); } return new TableSearchDto(question, result, 100); } First, the result list for the result maps is created. Then, rowSet is iterated for each row to create a map of the column names as keys and the column values as values. This enables returning a flexible amount of columns with their results. 
createPropertyName(...) adds the index integer to the map key to support duplicate key names. Summary Backend Spring AI supports creating prompts with a flexible amount of placeholders very well. Creating the embeddings and querying the vector table is also very well supported. Getting reasonable query results needs the metadata that has to be provided for the columns and tables. Creating good metadata is an effort that scales linearly with the amount of columns and tables. Implementing the embeddings for columns that need them is an additional effort. The result is that an AI/LLM like OpenAI or Ollama with the "sqlcoder:70b-alpha-q6_K" model can answer questions like: "Show the artwork name and the name of the museum that has the style Realism and the subject of Portraits." The AI/LLM can within boundaries answer natural language questions that have some fit with the metadata. The amount of embeddings needed is too big for a free OpenAI account and the "sqlcoder:70b-alpha-q6_K" is the smallest model with reasonable results. AI/LLM offers a new way to interact with relational databases. Before starting a project to provide a natural language interface for a database, the effort and the expected results have to be considered. The AI/LLM can help with questions of small to middle complexity and the user should have some knowledge about the database. Frontend The returned result of the backend is a list of maps with keys as column names and values column values. The amount of returned map entries is unknown, because of that the table to display the result has to support a flexible amount of columns. An example JSON result looks like this: JSON {"question":"...","resultList":[{"1_name":"Portrait of Margaret in Skating Costume","2_name":"Philadelphia Museum of Art"},{"1_name":"Portrait of Mary Adeline Williams","2_name":"Philadelphia Museum of Art"},{"1_name":"Portrait of a Little Girl","2_name":"Philadelphia Museum of Art"}],"resultAmount":100} The resultList property contains a JavaScript array of objects with property keys and values. To be able to display the column names and values in an Angular Material Table component, these properties are used: TypeScript protected columnData: Map<string, string>[] = []; protected columnNames = new Set<string>(); The method getColumnNames(...) of the table-search.component.ts is used to turn the JSON result in the properties: TypeScript private getColumnNames(tableSearch: TableSearch): Set<string> { const result = new Set<string>(); this.columnData = []; const myList = !tableSearch?.resultList ? [] : tableSearch.resultList; myList.forEach((value) => { const myMap = new Map<string, string>(); Object.entries(value).forEach((entry) => { result.add(entry[0]); myMap.set(entry[0], entry[1]); }); this.columnData.push(myMap); }); return result; } First, the result set is created and the columnData property is set to an empty array. Then, myList is created and iterated with forEach(...). For each of the objects in the resultList, a new Map is created. For each property of the object, a new entry is created with the property name as the key and the property value as the value. The entry is set on the columnData map and the property name is added to the result set. The completed map is pushed on the columnData array and the result is returned and set to the columnNames property. Then a set of column names is available in the columnNames set and a map with column name to column value is available in the columnData. 
The template table-search.component.html contains the material table: HTML @if(searchResult && searchResult.resultList?.length) { <table mat-table [dataSource]="columnData"> <ng-container *ngFor="let disCol of columnNames" matColumnDef="{{ disCol }}"> <th mat-header-cell *matHeaderCellDef>{{ disCol }}</th> <td mat-cell *matCellDef="let element">{{ element.get(disCol) }}</td> </ng-container> <tr mat-header-row *matHeaderRowDef="columnNames"></tr> <tr mat-row *matRowDef="let row; columns: columnNames"></tr> </table> } First, the searchResult is checked for existence and for objects in the resultList. Then, the table is created with the columnData map array as the datasource. The table header row is set with <tr mat-header-row *matHeaderRowDef="columnNames"></tr> to contain the columnNames. The table rows and columns are defined with <tr mat-row *matRowDef="let row; columns: columnNames"></tr>. The cells are created by iterating the columnNames like this: <ng-container *ngFor="let disCol of columnNames" matColumnDef="{{ disCol }}">. The header cells are created like this: <th mat-header-cell *matHeaderCellDef>{{ disCol }}</th>. The table cells are created like this: <td mat-cell *matCellDef="let element">{{ element.get(disCol) }}</td>. element is the map from the columnData array, and the map value is retrieved with element.get(disCol). Summary Frontend The new Angular syntax makes the templates more readable. The Angular Material table component is more flexible than expected and supports an unknown number of columns very well. Conclusion Querying a database with the help of an AI/LLM takes some effort to provide the metadata, and the users need a rough idea of what the database contains. AI/LLMs are not a natural fit for query creation because SQL queries require correctness. A pretty large model was needed to get the required query correctness, and GPU acceleration is required for productive use. A well-designed UI where the user can drag and drop the columns of the tables into the result table might be a good alternative for these requirements. Angular Material Components support drag and drop very well. Before starting such a project, the customer should make an informed decision on which alternative fits the requirements best.