Microsoft recently announced the introduction of vector search functionality in Azure Cosmos DB for MongoDB vCore. This feature enhances the capabilities of Cosmos DB by allowing developers to perform complex similarity searches on high-dimensional data, which is particularly useful in RAG-based applications, recommendation systems, image and document retrieval, and more. I am also participating in the Cosmos DB hackathon to explore how this can be used for retrieval-augmented generation. In this article, we will explore the details of this new functionality and its use cases, and provide a sample implementation in JavaScript (Node.js).

What Is a Vector Store?

A vector store (or vector database) is designed to store and manage vector embeddings. These embeddings are mathematical representations of data in a high-dimensional space. Each dimension corresponds to a feature of the data, and tens of thousands of dimensions might be used to represent sophisticated data. Words, phrases, entire documents, images, audio, and other types of data can all be vectorized. In simpler terms, a vector embedding is a list of numbers that places a piece of complex data as a point in a multi-dimensional space.

Example: Pen: [0.6715, 0.5562, 0.3566, 0.9787]

Now we can represent a pen inside a multi-dimensional space and then use vector search algorithms to perform a similarity search and retrieve the closest matching elements.

How Does a Vector Index Work?

In a vector store, vector search algorithms are used to index and query embeddings. Vector indexing is a technique used in ML and data analysis to efficiently search and retrieve information from large datasets. Some well-known algorithms include:

- Flat indexing
- Hierarchical Navigable Small World (HNSW)
- Inverted File (IVF) indexes
- Locality Sensitive Hashing (LSH) indexes

Vector search allows you to find similar items based on their data characteristics rather than exact matches on a property field. It is useful for applications such as:

- Searching for similar text
- Finding related images
- Making recommendations
- Detecting anomalies

Integrated Vector Database in Azure Cosmos DB for MongoDB vCore

The Integrated Vector Database in Azure Cosmos DB for MongoDB vCore enables you to efficiently store, index, and query high-dimensional vector data directly within your Cosmos DB instance. Transactional data and vector embeddings are stored together inside Cosmos DB, which eliminates the need to transfer data to a separate vector store and incur additional costs. It works in two steps:

1. Vector index creation: To perform a vector similarity search over vector properties in your documents, you first need to create a vector index. This index allows efficient querying based on vector characteristics.

2. Vector search: Once your data is inserted into your Azure Cosmos DB for MongoDB vCore database and collection, and your vector index is defined, you can perform a vector similarity search against a targeted query vector.

What Is Vector Search?

Vector search, also known as similarity search or nearest neighbor search, is a technique used to find objects that are similar to a given query object in a high-dimensional space. Unlike traditional search methods that rely on exact matches, vector search leverages the concept of distance between points in a vector space to find similar items. This is particularly useful for unstructured data like images, audio, and text embeddings.
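The two steps above can be expressed directly against the MongoDB vCore API. The following is a minimal, hedged sketch in Java (using the MongoDB sync driver) of what index creation and a subsequent similarity query might look like. The database, collection, and field names (order_db, orders, contentVector), the vector-ivf index kind, and the 1536-dimension/cosine settings are assumptions chosen to line up with the JavaScript sample later in this article and with the index options documented for Cosmos DB for MongoDB vCore.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.List;

public class CosmosVectorSearchSketch {

    public static void main(String[] args) {
        // Connection string of the Azure Cosmos DB for MongoDB vCore cluster
        try (MongoClient client = MongoClients.create(System.getenv("AZURE_COSMOSDB_CONNECTION_STRING"))) {
            MongoDatabase db = client.getDatabase("order_db");

            // 1. Vector index creation: an IVF index over the contentVector field,
            //    sized for 1536-dimensional embeddings and cosine similarity.
            Document createIndexCommand = new Document("createIndexes", "orders")
                .append("indexes", List.of(
                    new Document("name", "vectorSearchIndex")
                        .append("key", new Document("contentVector", "cosmosSearch"))
                        .append("cosmosSearchOptions", new Document("kind", "vector-ivf")
                            .append("numLists", 100)
                            .append("similarity", "COS")
                            .append("dimensions", 1536))));
            db.runCommand(createIndexCommand);

            // 2. Vector search: find the 5 documents closest to a query embedding.
            //    The query vector would normally come from the same embedding model used at ingestion time.
            List<Double> queryVector = List.of(/* 1536 floating point values */);
            Document searchStage = new Document("$search",
                new Document("cosmosSearch", new Document("vector", queryVector)
                        .append("path", "contentVector")
                        .append("k", 5))
                    .append("returnStoredSource", true));

            MongoCollection<Document> orders = db.getCollection("orders");
            orders.aggregate(List.of(searchStage))
                  .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```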
Benefits of Vector Search in Cosmos DB

- Efficient similarity searches: Enables fast and efficient searches on high-dimensional vectors, making it ideal for recommendation engines, image search, and natural language processing tasks.
- Scalability: Leverages the scalability of Cosmos DB to handle large datasets and high query volumes.
- Flexibility: Integrates seamlessly with existing MongoDB APIs, allowing developers to use familiar tools and libraries.

Use Cases

- Recommendation systems: Providing personalized recommendations based on user behavior and preferences
- Image and video retrieval: Searching for images or videos that are visually similar to a given input
- Natural language processing: Finding documents or text snippets that are semantically similar to a query text
- Anomaly detection: Identifying unusual patterns in high-dimensional data

Setting Up Vector Search in Cosmos DB

Prerequisites

- An Azure account with an active subscription
- Azure Cosmos DB for MongoDB vCore configured for your workload

Detailed Step-by-Step Guide With Sample Code

1. Create a Cosmos DB account: Navigate to the Azure portal, search for Azure Cosmos DB, and select the MongoDB (vCore) option. Follow the prompts to create your Cosmos DB account.
2. Configure your database: Create a database and a collection where you'll store your vectors. Ensure that the collection is appropriately indexed to support vector operations; specifically, you'll need to create an index on the vector field.
3. Insert vectors into the collection: Vectors can be stored as arrays of numbers in your MongoDB documents.
4. Set up your project: Create a new Node.js project (e.g., using Visual Studio or Visual Studio Code), import the necessary MongoDB and Azure OpenAI modules, and connect to the database using the Mongo client.
5. Insert data: The code below shows how to load order data from a local JSON file and insert embeddings into the contentVector field. Vector embeddings are generated with the Azure OpenAI getEmbeddings() method.
Here is the full code for your reference:

```javascript
const { MongoClient } = require('mongodb');
const { OpenAIClient, AzureKeyCredential } = require("@azure/openai");

// Set up the MongoDB client
const dbClient = new MongoClient(process.env.AZURE_COSMOSDB_CONNECTION_STRING);

// Set up the Azure OpenAI client
const aoaiClient = new OpenAIClient(
    "https://" + process.env.AZURE_OPENAI_API_INSTANCE_NAME + ".openai.azure.com/",
    new AzureKeyCredential(process.env.AZURE_OPENAI_API_KEY));

// Name of the Azure OpenAI embeddings deployment (adjust to your own deployment)
const embeddingsDeploymentName = process.env.AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME;

async function main() {
    try {
        await dbClient.connect();
        console.log('Connected to MongoDB');
        const db = dbClient.db('order_db');

        // Load order data from a local json file
        console.log('Loading order data');
        const orderRawData = "<local json file>";
        const orderData = (await (await fetch(orderRawData)).json())
            .map(order => cleanData(order)); // cleanData() normalizes each record (not shown here)

        await insertDataAndGenerateEmbeddings(db, orderData);
    } catch (error) {
        console.error('An error occurred:', error);
    } finally {
        await dbClient.close();
    }
}

// Insert data into the database and generate embeddings
async function insertDataAndGenerateEmbeddings(db, data) {
    const orderCollection = db.collection('orders');
    await orderCollection.deleteMany({});

    // Build the bulk operations first (the embedding calls are asynchronous)
    const operations = await Promise.all(data.map(async (order) => ({
        insertOne: {
            document: {
                ...order,
                contentVector: await generateEmbeddings(JSON.stringify(order))
            }
        }
    })));

    const result = await orderCollection.bulkWrite(operations);
    console.log(`${result.insertedCount} orders inserted`);
}

// Generate embeddings
async function generateEmbeddings(text) {
    const embeddings = await aoaiClient.getEmbeddings(embeddingsDeploymentName, text);
    // Rest period to avoid rate limiting on Azure OpenAI
    await new Promise(resolve => setTimeout(resolve, 500));
    return embeddings.data[0].embedding;
}

main();
```

Note: Remember to replace placeholders (Cosmos DB connection string, Azure OpenAI key, and endpoint) with actual values.

Managing Costs

To manage costs effectively when using vector search in Cosmos DB:

- Optimize indexes: Ensure that only necessary fields are indexed.
- Monitor usage: Use Azure Monitor to track and analyze usage patterns.
- Auto-scale: Configure auto-scaling to handle peak loads efficiently without over-provisioning resources.
- Data partitioning: Partition your data appropriately to ensure efficient querying and storage.

Conclusion

The introduction of vector search functionality in Azure Cosmos DB for MongoDB vCore opens up new possibilities for building advanced AI and machine learning applications. By leveraging this feature, developers can implement efficient similarity searches, enabling a wide range of applications from recommendation systems to anomaly detection. With the provided code examples, you can get started with integrating vector search into your Cosmos DB-based applications. For more detailed documentation, visit the Azure Cosmos DB documentation.
Relational Databases are the bedrock of any FinTech application, especially for OLTP (Online transaction Processing). This foundational component in any application architecture usually poses challenges around scaling as the business expands rapidly. So, it is imperative that all database activities are monitored closely in the production environment and issues like long-running queries are tracked and resolved. This article will explore the FinTech case study, which has built a Lending Platform. The company uses the MySQL database hosted in AWS as part of the AWS RDS service. It has multiple microservices using different database schemas hosted on the database instance. The MVP product offering was launched a few years back, and since then, they have been incorporating new features into the FinTech platform. We will cover commonly identified database issues and what was done to resolve these issues. Common Mistakes This section discusses common mistakes identified, the steps to resolve the issues, and additional guidelines. Using Database Views Inside Stored Procedures After it started seeing database growth, the major issue identified in the initial days with the MySQL database was that usage of views inside stored procedures resulted in long-running queries and full table scans. The pattern we saw with the developers was that they built multiple reusable views and used those in the stored procedures. The example below is a stored procedure invoking a view uv_consultations: MySQL CREATE PROCEDURE `usp_get_consultation`( in_dso_ref_id varchar(45) ) BEGIN select * from uv_consultations where dso_ref_id=in_dso_ref_id; END Here is what the view looks like: MySQL CREATE OR REPLACE VIEW uv_consultations AS SELECT c.id AS consultation_id, c.dso_ref_id AS dso_ref_id, c.pat_id AS pat_id, p.first_name AS pat_first_name, p.last_name AS pat_last_name, p.dso_pat_id AS dso_pat_id, p.dob AS pat_dob, COUNT(cn.id) AS notifs FROM ((((consultation c JOIN patient p ON ((c.pat_id = p.id))) LEFT JOIN application app ON ((c.id = app.consultation_id))) LEFT JOIN responsible_party rp ON ((app.primary_rp_id = rp.id))) LEFT JOIN consultation_notif cn ON (((cn.consultation_id = c.id) AND (cn.read = 0) AND (cn.hide = 0)))) GROUP BY c.id So, if we look at the view, it has multiple joins and a group by. It's a complex view query. MySQL executes views in two ways: view merging and view materialization. View Merging In this approach, MySQL merges the view query with the outer query in which the view has been referenced. The biggest advantage is that it uses underlying table indexes of the actual base table, improving the query timing. However, there are limitations to view merging. The query within the view must be a simple select statement with no aggregations, joins, sub-queries, or DISTINCT clauses. View Materialization When MySQL cannot perform a merge view, it defaults to View Materialization. In this approach, MySQL stores the view results in a temporary internal table and then uses the query on the view to query the internal table. The drawback is that it does not use base table indexes, causing the queries on the view to run slower. Views with GROUP BY, aggregation, DISTINCT, and complex joins trigger view materialization. In the above example, the view has aggregation and multiple joins, so the stored procedure's execution results in view materialization, which causes it to execute slowly. 
To mitigate this, the team refactored the stored procedure to use the direct complex SQL query and avoid the intermediate layer of views.

Leading Wildcard String Comparison

If we use the LIKE operator with a leading wildcard, e.g., LIKE '%AMOL', the query will not use the index on that column. Below is a sample query:

```sql
SELECT COUNT(*)
INTO v_ext_partner_service_count
FROM loan_ext_payload
WHERE loan_acct_id = v_loan_acct_id
AND ext_partner_service LIKE '%CREDIT_PULL';
```

For LIKE 'CREDIT_PULL%', MySQL uses the index efficiently, as indexes are structured in a way that makes this type of search inherently fast. On the contrary, for the leading wildcard LIKE '%CREDIT_PULL', the MySQL execution engine has to check each entry in the index to see whether it ends with 'CREDIT_PULL'. The index is optimized for prefixing (CREDIT_PULL%), not suffixing (%CREDIT_PULL), so its performance benefits are wiped out by leading-wildcard string comparisons. The recommendation is to avoid leading-wildcard string comparisons or searches. If this is unavoidable, use a full-text search.

Using Functions as Part of the WHERE Clause

Using functions to filter records in a WHERE condition can harm query performance. For instance, using the FIND_IN_SET() function as in the query below:

```sql
SELECT code, code_type, message, message_es
FROM locale_messages
WHERE code_type = v_code_type
AND FIND_IN_SET(code, v_codes) > 0;
```

FIND_IN_SET returns the position of a string if it is present (as a substring) within a list of strings. Using FIND_IN_SET in the query causes a full table scan as it does not use the index. MySQL has to scan each record to evaluate the function, which can be very expensive from a computing perspective for larger tables. Mathematical, string, or date functions have a similar side effect. Below are examples:

```sql
SELECT loan_id, loan_acct_id
FROM loan_application
WHERE YEAR(create_date) = 2024;
```

```sql
SELECT loan_id, loan_acct_id
FROM loan_application
WHERE ABS(merchant_discount_fee) > 5;
```

Casting columns as part of the WHERE clause again leads to a full table scan, impacting query performance. For example:

```sql
SELECT loan_id, loan_acct_id
FROM loan_application
WHERE CAST(approval_datetime AS DATE) = '2024-06-01';
```

As a resolution for FIND_IN_SET, the team enabled full-text search. For all other function-based WHERE conditions, function-based indexes were created.

Performing Deletes as Part of Regular Transactions

Frequent delete operations as part of regular transaction processing can degrade database performance. Deleting rows from a table leads to index rebalancing: MySQL has to update the indexes on the columns of that table because it must remove the index entries. For rollback and recovery, transaction logs are created for all delete operations, which can lead to rapid growth of transaction logs. Rows deleted in random order also fragment tables and indexes, degrading performance due to scattered reads.

To mitigate regular delete operations, rows can be soft-deleted, and an offline batch job can then perform the hard delete during non-peak hours, when the system can use its resources effectively. Batch or bulk deletes are also more performant than deletes spread across regular intervals.
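As an illustration of the soft-delete approach described above, here is a hedged Java/JDBC sketch: transactions only flag rows during business hours, and an off-peak job hard-deletes them in bounded batches. The table and column names (loan_document, is_deleted) and the batch_user account are hypothetical; only the host and schema are reused from the connection string shown later in this article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OffPeakPurgeJob {

    private static final String JDBC_URL = "jdbc:mysql://db.fintech.com:3306/icwdev";

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(JDBC_URL, "batch_user", System.getenv("DB_PASSWORD"))) {
            conn.setAutoCommit(true);

            // During business hours the application only soft-deletes:
            //   UPDATE loan_document SET is_deleted = 1 WHERE id = ?
            // The statement below is what the nightly batch job runs instead of
            // deleting rows inside regular transactions.
            try (PreparedStatement hardDelete = conn.prepareStatement(
                    "DELETE FROM loan_document WHERE is_deleted = 1 LIMIT ?")) {

                final int batchSize = 10_000;
                int deleted;
                do {
                    hardDelete.setInt(1, batchSize);
                    deleted = hardDelete.executeUpdate();   // each bounded batch is its own short transaction
                    System.out.println("Purged " + deleted + " rows");
                } while (deleted == batchSize);             // a partial batch signals we are done
            }
        }
    }
}
```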
Executing Multiple Inserts From the Application Instead of Batching

When multiple child rows have to be inserted for a parent record from the application layer, executing a stored procedure for each insert instead of batching them is inefficient: every insert requires a database round trip from the application layer and an individual stored procedure execution. To avoid multiple round trips to the database server, a multi-insert statement was built in the application layer, as shown below, and then sent to a database stored procedure to execute.

```sql
INSERT INTO verif_user_attrib_result
(id, application_id, user_attrib_id, doc_type_id, verif_result, verif_notes, verif_date, doc_upload_count)
VALUES (148712,146235,1,NULL,1,NULL,NULL,0),
       (148712,146235,2,NULL,1,NULL,NULL,0),
       (148712,146235,3,NULL,1,NULL,NULL,0),
       (148712,146235,4,NULL,-1,NULL,NULL,0);
```

```sql
CREATE PROCEDURE p_verif_create_user_attrib_results (
  IN v_user_attrib_result_multi_insert TEXT,
  OUT v_status INT)
BEGIN
  SET v_status = 1;
  SET @qry_user_attrib_result = v_user_attrib_result_multi_insert;
  PREPARE stmt_user_attrib_result FROM @qry_user_attrib_result;
  EXECUTE stmt_user_attrib_result;
  DEALLOCATE PREPARE stmt_user_attrib_result;
  SET v_status = 0;
END
```

Recommendations

Connection Labels

In the microservices world, where multiple microservices connect to different (or the same) schemas on the database server, it is often difficult to monitor which connection requests come from which microservice. If there is an issue with connection pools, connection draining, or database performance, we cannot identify which connection belongs to which service. This is where the program_name connection attribute comes in handy:

```java
jdbc:mysql://db.fintech.com:3306/icwdev?cacheCallableStmts=true
  &callableStmtCacheSize=1000&connectionAttributes=program_name:loan-application-api
```

This attribute labels every connection with the respective program name. The connection identifier helps tie database issues to a specific service. Apart from diagnosing issues, it also enables enhanced monitoring: we can build query performance metrics and error rates for specific microservices.

Purging Large Tables

For large tables, we purge data based on purging criteria. Utility tables managing user access tokens are periodically purged. To manage purging effectively, we implement time-based partitioning on these tables.

Index Columns Used in Join

We ensure that all columns listed in join conditions are indexed. By indexing join columns, the database engine filters data efficiently using indexed data, avoiding full table scans.

Keep Transactions Smaller

In high-concurrency applications, we must keep transactions small for optimal database performance. Smaller transactions reduce the duration of data locking and minimize resource contention, which helps avoid deadlocks and improves the transaction success rate.

Miscellaneous Common Asks

We need to ensure that UPDATE statements include a WHERE clause; at the FinTech organization, some complex stored procedures missed the WHERE clause, leading to unintended behavior in the test environment. And a strict no to "SELECT *" in queries.

Conclusion

We have covered common pitfalls and resolutions for fine-tuning the application and underlying MySQL database performance through the FinTech application case study.
Here are some important suggestions: avoid using complex views in stored procedures, stay away from leading wildcard searches, monitor queries with full table scans, and take appropriate corrective actions. Perform delete operations during offline hours through the batch job. Avoid multiple roundtrips to the server for multi-inserts; instead, use batching. Define a purging strategy. Have indexes defined on all join columns, and finally, try to keep the transaction scope smaller. With these recommendations incorporated, we can provide optimal database performance and faster end-user applications. Continuous monitoring and proactive issue resolution help maintain consistent application performance and availability.
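One of the recommendations above is to avoid multiple round trips for multi-inserts. Besides building a multi-insert string (as in the case study), plain JDBC batching achieves the same effect from the Java side; MySQL Connector/J can additionally rewrite the batch into a single multi-row INSERT when rewriteBatchedStatements=true. The following is a simplified, hedged sketch that reuses the verif_user_attrib_result table from the example above with a subset of its columns; the app_user account is a placeholder.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInsertExample {

    record UserAttribResult(long id, long applicationId, int userAttribId, int verifResult) {}

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://db.fintech.com:3306/icwdev?rewriteBatchedStatements=true";

        List<UserAttribResult> rows = List.of(
                new UserAttribResult(148712, 146235, 1, 1),
                new UserAttribResult(148712, 146235, 2, 1),
                new UserAttribResult(148712, 146235, 3, 1),
                new UserAttribResult(148712, 146235, 4, -1));

        try (Connection conn = DriverManager.getConnection(url, "app_user", System.getenv("DB_PASSWORD"));
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO verif_user_attrib_result " +
                     "(id, application_id, user_attrib_id, verif_result, doc_upload_count) " +
                     "VALUES (?, ?, ?, ?, 0)")) {

            conn.setAutoCommit(false);
            for (UserAttribResult row : rows) {
                ps.setLong(1, row.id());
                ps.setLong(2, row.applicationId());
                ps.setInt(3, row.userAttribId());
                ps.setInt(4, row.verifResult());
                ps.addBatch();                  // queue the row locally
            }
            ps.executeBatch();                  // one round trip for the whole batch
            conn.commit();
        }
    }
}
```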
Postgres continues to evolve the database landscape beyond traditional relational database use cases. Its rich ecosystem of extensions and derived solutions has made Postgres a formidable force, especially in areas such as time-series and geospatial, and most recently, gen(erative) AI workloads. Pgvector has become a foundational extension for gen AI apps that want to use Postgres as a vector database. In brief, pgvector adds a new data type, operators, and index types to work with vectorized data (embeddings) in Postgres. This allows you to use the database for similarity searches over embeddings. Pgvector started to take off in 2023, with rocketing GitHub stars: Pure vector databases, such as Pinecone, had to acknowledge the existence of pgvector and start publishing competitive materials. I consider this a good sign for Postgres. Why a good sign? Well, as my fellow Postgres community member Rob Treat put it, “First they ignore you. Then they laugh at you. Then they create benchmarketing. Then you win!” So, how is this related to the topic of distributed Postgres? The more often Postgres is used for gen AI workloads, the more frequently you’ll hear (from vendors behind other solutions) that gen AI apps built on Postgres will have: Scalability and performance issues Challenges with data privacy A hard time with high availability If you do encounter the listed issues, you shouldn’t immediately dump Postgres and migrate to a more scalable, highly available, secure vector database, at least not until you’ve tried running Postgres in a distributed configuration! Let’s discuss when and how you can use distributed Postgres for gen AI workloads. What Is Distributed Postgres? Postgres was designed for single-server deployments. This means that a single primary instance stores a consistent copy of all application data and handles both read and write requests. How do you make a single-server database distributed? You tap into the Postgres ecosystem! Within the Postgres ecosystem, people usually assume one of the following of distributed Postgres: Multiple standalone PostgreSQL instances with multi-master asynchronous replication and conflict resolution (like EDB Postgres Distributed) Sharded Postgres with a coordinator (like CitusData) Shared-nothing distributed Postgres (like YugabyteDB) Check out the following guide for more information on each deployment option. As for this article, let’s examine when and how distributed Postgres can be used for your gen AI workloads. Problem #1: Embeddings Use All Available Memory and Storage Space If you have ever used an embedding model that translates text, images, or other types of data into a vectorized representation, you might have noticed that the generated embeddings are quite large arrays of floating point numbers. For instance, you might use an OpenAI embedding model that translates a text value into a 1536-dimensional array of floating point numbers. Given that each item in the array is a 4-byte floating point number, the size of a single embedding is approximately 6KB — a substantial amount of data. Now, if you have 10 million records, you would need to allocate approximately 57 gigabytes of storage and memory just for those embeddings. Additionally, you need to consider the space taken by indexes (such as HNSW, IVFFlat, etc.), which many of you will create to expedite the vector similarity search. Overall, the greater the number of embeddings, the more memory and storage space Postgres will require to store and manage them efficiently. 
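The arithmetic behind those figures is worth making explicit (a back-of-the-envelope estimate that ignores per-row and index overhead):

$$1536\ \text{dimensions} \times 4\ \text{bytes} = 6144\ \text{bytes} \approx 6\ \text{KB per embedding}$$

$$10{,}000{,}000\ \text{embeddings} \times 6144\ \text{bytes} \approx 61.4\ \text{GB} \approx 57\ \text{GiB}$$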
You can reduce storage and memory usage by switching to an embedding model that generates vectors with fewer dimensions, or by using quantization techniques. But suppose I need those 1536-dimensional vectors and I don't want to apply any quantization techniques. In that case, if the number of embeddings continues to increase, I could outgrow the memory and storage capacity of my database instance.

This is an obvious area where you can tap into distributed Postgres. For example, by running sharded (CitusData) or shared-nothing (YugabyteDB) versions of PostgreSQL, you can allow the database to distribute your embeddings evenly across an entire cluster of nodes. Using this approach, you are no longer limited by the memory and storage capacities of a single node. If your application continues to generate more embeddings, you can always scale out the cluster by adding more nodes.

Problem #2: Similarity Search Is a Compute-Intensive Operation

This problem is closely related to the previous one but with a focus on CPU and GPU utilization. When we say "just perform the vector similarity search over the embeddings stored in our database," the task sounds straightforward and obvious to us humans. However, from the database server's perspective, it's a compute-intensive operation requiring significant CPU cycles. For instance, here is the formula used to calculate the cosine similarity between two vectors:

$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$

We typically use cosine similarity to find the most relevant data for a given user prompt. Imagine A as a vector or embedding of a newly provided user prompt, and B as a vector or embedding of your unique business data stored in Postgres. If you're in healthcare, B could be a vectorized representation of a medication and treatment of a specific disease. To find the most relevant treatment (vector B) for the provided user symptoms (vector A), the database must calculate the dot product and magnitude for each combination of A and B. This process is repeated for every dimension (the 'i' in the formula) in the compared embeddings. If your database contains a million 1536-dimensional vectors (treatments and medications), Postgres must perform a million calculations over these multi-dimensional vectors for each user prompt.

Approximate nearest neighbor (ANN) search allows us to reduce CPU and GPU usage by creating specialized indexes for vectorized data. However, with ANN you sacrifice some accuracy: you might not always receive the most relevant treatments or medications, since the database does not compare all the vectors. Additionally, these indexes come with a cost: they take time to build and maintain, and they require dedicated memory and storage.

If you don't want to be bound by the CPU and GPU resources of a single database server, you can consider using a distributed version of Postgres. Whenever compute resources become a bottleneck, you can scale your database cluster out and up by adding new nodes. Once a new node joins the cluster, a distributed database like YugabyteDB will automatically rebalance the embeddings and immediately begin utilizing the new node's resources.

Problem #3: Data Privacy

Whenever I demonstrate what a combination of LLM and Postgres can achieve, developers are inspired. They immediately try to match and apply these AI capabilities to the apps they work on. However, there is always a follow-up question related to data privacy: "How can I leverage an LLM and embedding model without compromising data privacy?" The answer is twofold.
First, if you don’t trust a particular LLM or embedding model provider, you can choose to use private or open-source models. For instance, use Mistral, LLaMA, or other models from Hugging Face that you can install and run from your own data centers or cloud environments. Second, some applications need to comply with data residency requirements to ensure that all data used or generated by the private LLM and embedding model never leaves a specific location (data center, cloud region, or zone). In this case, you can run several standalone Postgres instances, each working with data from a specific location, and allow the application layer to orchestrate access across multiple database servers. Another option is to use the geo-partitioning capabilities of distributed Postgres deployments, which automate data distribution and access across multiple locations, simplifying application logic. Let's continue with the healthcare use case to see how geo-partitioning allows us to distribute information about medications and treatments across locations required by data regulators. Here I’ve used YugabyteDB as an example of a distributed Postgres deployment. Imagine three hospitals, one in San Francisco, and the others in Chicago and New York. We deploy a single distributed YugabyteDB cluster, with several nodes in regions (or private data centers) near each hospital's location. To comply with data privacy and regulatory requirements, we must ensure that the medical data from these hospitals never leaves their respective data centers. With geo-partitioning, we can achieve this as follows: Create Postgres tablespaces mapping them to cloud regions in the US West, Central, and East. There is at least one YugabyteDB node in every region. SQL CREATE TABLESPACE usa_east_ts WITH ( replica_placement = '{"num_replicas": 1, "placement_blocks": [{"cloud":"gcp","region":"us-east4","zone":"us-east4-a","min_num_replicas":1}]}' ); CREATE TABLESPACE usa_central_ts WITH ( replica_placement = '{"num_replicas": 1, "placement_blocks": [{"cloud":"gcp","region":"us-central1","zone":"us-central1-a","min_num_replicas":1}]}' ); CREATE TABLESPACE usa_west_ts WITH ( replica_placement = '{"num_replicas": 1, "placement_blocks": [{"cloud":"gcp","region":"us-west1","zone":"us-west1-a","min_num_replicas":1}]}' ); Create a treatment table that keeps information about treatments and medications. Each treatment has an associated multi-dimensional vector – description_vector – that is generated for the treatment’s description with an embedding model. Finally, the table is partitioned by the hospital_location column. SQL CREATE TABLE treatment ( id int, name text, description text, description_vector vector(1536), hospital_location text NOT NULL ) PARTITION BY LIST (hospital_location); The partitions’ definition is as follows. For instance, the data of the hospital3, which is in San Francisco, will be automatically mapped to the usa_west_ts whose data belongs to the database nodes in the US West. 
```sql
CREATE TABLE treatments_hospital1 PARTITION OF treatment(id, name, description, description_vector, PRIMARY KEY (id, hospital_location))
  FOR VALUES IN ('New York') TABLESPACE usa_east_ts;

CREATE TABLE treatments_hospital2 PARTITION OF treatment(id, name, description, description_vector, PRIMARY KEY (id, hospital_location))
  FOR VALUES IN ('Chicago') TABLESPACE usa_central_ts;

CREATE TABLE treatments_hospital3 PARTITION OF treatment(id, name, description, description_vector, PRIMARY KEY (id, hospital_location))
  FOR VALUES IN ('San Francisco') TABLESPACE usa_west_ts;
```

Once you've deployed a geo-partitioned database cluster and defined the required tablespaces with partitions, let the application connect to it and allow the LLM to access the data. For instance, the LLM can query the treatment table directly using:

```sql
SELECT name, description
FROM treatment
WHERE 1 - (description_vector <=> $user_prompt_vector) > 0.8
AND hospital_location = $location;
```

The distributed Postgres database will automatically route the request to the node that stores data for the specified hospital_location. The same applies to INSERTs and UPDATEs: changes to the treatment table will always be stored in the partition -> tablespace -> nodes belonging to that hospital's location. These changes will never be replicated to other locations.

Problem #4: High Availability

Although Postgres was designed to function in a single-server configuration, this doesn't mean it can't be run in a highly available setup. Depending on your desired recovery point objective (RPO) and recovery time objective (RTO), there are several options. So, how is distributed Postgres useful?

With distributed PostgreSQL, your gen AI apps can remain operational even during zone, data center, or regional outages. For instance, with YugabyteDB, you simply deploy a multi-node distributed Postgres cluster and let the nodes handle fault tolerance and high availability. The nodes communicate directly. If one node fails, the others will detect the outage. Since the remaining nodes have redundant, consistent copies of data, they can immediately start processing application requests that were previously sent to the failed node. YugabyteDB provides RPO = 0 (no data loss) and RTO within the range of 3-15 seconds (depending on the database and TCP/IP configuration defaults). In this way, you can build gen AI apps and autonomous agents that never fail, even during region-level incidents and other catastrophic events.

Summary

Thanks to extensions like pgvector, PostgreSQL has evolved beyond traditional relational database use cases and is now a strong contender for generative AI applications. However, working with embeddings might pose some challenges, including significant memory and storage consumption, compute-intensive similarity searches, data privacy concerns, and the need for high availability. Distributed PostgreSQL deployments offer scalability, load balancing, and geo-partitioning, ensuring data residency compliance and uninterrupted operations. By leveraging these distributed systems, you can build gen AI applications that scale and never fail.
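For completeness, here is a hedged sketch of how an application might bind the $user_prompt_vector and $location placeholders from the geo-partitioned query above using plain JDBC. It assumes the pgvector extension is installed and passes the embedding as a text literal cast to the vector type (a form pgvector accepts); the connection URL, port, and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.stream.Collectors;
import java.util.stream.DoubleStream;

public class TreatmentSimilarityQuery {

    public static void main(String[] args) throws SQLException {
        // Embedding of the user prompt, produced by the same model used for description_vector (1536 dims).
        double[] userPromptVector = new double[1536]; // placeholder values; normally filled by the embedding model

        // pgvector accepts a '[v1,v2,...]' text literal cast to the vector type.
        String vectorLiteral = DoubleStream.of(userPromptVector)
                .mapToObj(Double::toString)
                .collect(Collectors.joining(",", "[", "]"));

        String sql = """
                SELECT name, description
                FROM treatment
                WHERE 1 - (description_vector <=> ?::vector) > 0.8
                  AND hospital_location = ?
                """;

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://yugabyte-host:5433/yugabyte", "yugabyte", "yugabyte");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, vectorLiteral);
            ps.setString(2, "San Francisco");   // routes to the usa_west_ts partition
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```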
By fetching data from the organization’s internal or proprietary sources, Retrieval Augmented Generation (RAG) extends the capabilities of FMs to specific domains, without needing to retrain the model. It is a cost-effective approach to improving model output so it remains relevant, accurate, and useful in various contexts. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows. With MongoDB Atlas vector store integration, you can build RAG solutions to securely connect your organization’s private data sources to FMs in Amazon Bedrock. Let's see how the MongoDB Atlas integration with Knowledge Bases can simplify the process of building RAG applications. Configure MongoDB Atlas MongoDB Atlas cluster creation on AWS process is well documented. Here are the high-level steps: This integration requires an Atlas cluster tier of at least M10. During cluster creation, choose an M10 dedicated cluster tier. Create a database and collection. For authentication, create a database user. Select Password as the Authentication Method. Grant the Read and write to any database role to the user. Modify the IP Access List – add IP address 0.0.0.0/0 to allow access from anywhere. For production deployments, AWS PrivateLink is the recommended way to have Amazon Bedrock establish a secure connection to your MongoDB Atlas cluster. Create the Vector Search Index in MongoDB Atlas Use the below definition to create a Vector Search index. { "fields": [ { "numDimensions": 1536, "path": "AMAZON_BEDROCK_CHUNK_VECTOR", "similarity": "cosine", "type": "vector" }, { "path": "AMAZON_BEDROCK_METADATA", "type": "filter" }, { "path": "AMAZON_BEDROCK_TEXT_CHUNK", "type": "filter" } ] } AMAZON_BEDROCK_TEXT_CHUNK – Contains the raw text for each data chunk. We are using cosine similarity and embeddings of size 1536 (we will choose the embedding model accordingly - in the the upcoming steps). AMAZON_BEDROCK_CHUNK_VECTOR – Contains the vector embedding for the data chunk. AMAZON_BEDROCK_METADATA – Contains additional data for source attribution and rich query capabilities. Configure the Knowledge Base in Amazon Bedrock Create an AWS Secrets Manager secret to securely store the MongoDB Atlas database user credentials. Create an Amazon Simple Storage Service (Amazon S3) storage bucket and upload any document(s) of your choice — Knowledge Base supports multiple file formats (including text, HTML, and CSV). Later, you will use the knowledge base to ask questions about the contents of these documents. Navigate to the Amazon Bedrock console and start configuring the knowledge base. In step 2, choose the S3 bucket you created earlier: Select Titan Embeddings G1 – Text embedding model MongoDB Atlas as the vector database. Enter the basic information for the MongoDB Atlas cluster along with the ARN of the AWS Secrets Manager secret you had created earlier. In the Metadata field mapping attributes, enter the vector store-specific details. They should match the vector search index definition you used earlier. Once the knowledge base is created, you need to synchronize the data source (S3 bucket data) with the MongoDB Atlas vector search index. Once that's done, you can check the MongoDB Atlas collection to verify the data. 
As per the index definition, the vector embeddings have been stored in AMAZON_BEDROCK_CHUNK_VECTOR along with the text chunk and metadata in AMAZON_BEDROCK_TEXT_CHUNK and AMAZON_BEDROCK_METADATA, respectively. Query the Knowledge Base You can now ask questions about your documents by querying the knowledge base — select Show source details to see the chunks cited for each footnote. You can also change the foundation model. For example, I switched to Claude 3 Sonnet. Use Retrieval APIs To Integrate Knowledge Base With Applications To build RAG applications on top of Knowledge Bases for Amazon Bedrock, you can use the RetrieveAndGenerate API which allows you to query the knowledge base and get a response. If you want to further customize your RAG solutions, consider using the Retrieve API, which returns the semantic search responses that you can use for the remaining part of the RAG workflow. More Configurations You can further customize your knowledge base queries using a different search type, additional filter, different prompt, etc. Conclusion Thanks to the MongoDB Atlas integration with Knowledge Bases for Amazon Bedrock, most of the heavy lifting is taken care of. Once the vector search index and knowledge base are configured, you can incorporate RAG into your applications. Behind the scenes, Amazon Bedrock will convert your input (prompt) into embeddings, query the knowledge base, augment the FM prompt with the search results as contextual information, and return the generated response. Happy building!
Mission-critical applications require high availability. The goal of high availability is to provide users with consistent access to services or resources, minimizing the chances of interruption. Automatic failover is a specific mechanism used to achieve high availability. It involves automatically detecting the failure of a system component (like a server, network, or database) and immediately switching operations to a standby component without human intervention. This increases resiliency. MariaDB MaxScale is a database proxy that includes features for high availability. In this article, I’ll show you how you can try it out with an online store simulator application implemented in Java and Svelte. Architecture The following diagram shows the architecture of the demo application: A web application developed with JavaScript and the Svelte framework makes HTTP requests to a Java backend. The backend answers with server-sent events that the frontend uses to update the user interface on the browser. The backend is implemented with Spring Boot and connects to a MariaDB database cluster using R2DBC (reactive). The backend logic is, in short, a simulation of reads and writes to an online store database. The simulation is parameterized, and the user can adjust: Product visits per minute: How many reads to the database per minute. Orders per minute: How many writes to the database per minute. Products per order: Write amplification. Timeout in milliseconds: How many seconds until a request to the database is considered failed. The database cluster is front-ended by a database proxy called MaxScale. This proxy makes the cluster look like a single logical database to the Java backend. MaxScale also performs read/write splitting (sending writes to the primary MariaDB server and reads to replicas), as well as load-balancing of reads among replica servers using a configurable algorithm. Data is automatically replicated from the primary to the replica database servers. Building the Docker Images From Source I have prepared custom Docker images for every component in the simulator. You can either build the images from the source (optional) or use the already built and published images from Docker Hub. If you decide to build the images yourself, you can find the source code on GitHub: MariaDB deployments: Custom images for easy deployment of replicated MariaDB topologies with MaxScale. DO NOT USE THESE IN PRODUCTION! These images are suitable only for demo applications. Use the official MariaDB Docker images for production deployments. Backend application: The backend application that connects to the database cluster. Frontend application: The frontend application that makes simulation configuration requests to the backend and receives events to show the simulation result. Each repository has Dockerfiles that you can use to build your own Docker images. For example, to build the backend application image, run: Shell docker build --tag alejandrodu/online-store-simulator-java-backend . Running the Simulation All the services can be started using the following Docker Compose file (docker-compose.yml): YAML version: "3.9" services: server-1: container_name: server-1 image: alejandrodu/mariadb ports: - "3306:3306" environment: - MARIADB_CREATE_DATABASE=demo - MARIADB_CREATE_USER=user:Password123! - MARIADB_CREATE_REPLICATION_USER=replication_user:ReplicationPassword123! - MARIADB_CREATE_MAXSCALE_USER=maxscale_user:MaxScalePassword123! 
server-2: container_name: server-2 image: alejandrodu/mariadb ports: - "3307:3306" environment: - MARIADB_REPLICATE_FROM=replication_user:ReplicationPassword123!@server-1:3306 server-3: container_name: server-3 image: alejandrodu/mariadb ports: - "3308:3306" environment: - MARIADB_REPLICATE_FROM=replication_user:ReplicationPassword123!@server-1:3306 maxscale: container_name: maxscale image: alejandrodu/mariadb-maxscale command: --admin_host 0.0.0.0 --admin_secure_gui false ports: - "4000:4000" - "8989:8989" - "27017:27017" environment: - MAXSCALE_USER=maxscale_user:MaxScalePassword123! - MARIADB_HOST_1=server-1 3306 - MARIADB_HOST_2=server-2 3306 - MARIADB_HOST_3=server-3 3306 healthcheck: test: ["CMD", "maxctrl", "list", "servers"] interval: 5s timeout: 10s retries: 5 java-backend: container_name: java-backend image: alejandrodu/online-store-simulator-java-backend ports: - "8080:8080" environment: - spring.r2dbc.url=r2dbc:mariadb://maxscale:4000/demo - spring.r2dbc.username=user - spring.r2dbc.password=Password123! - spring.liquibase.url=jdbc:mariadb://maxscale:4000/demo - spring.liquibase.user=user - spring.liquibase.password=Password123! depends_on: maxscale: condition: service_healthy svelte-frontend: container_name: svelte-fronted image: alejandrodu/online-store-simulator-svelte-frontend ports: - "5173:80" environment: - BACKEND_URL=http://java-backend:8080 Move to the directory in which the Docker Compose file is, and start the services in detached mode as follows: Shell docker compose up -d Configuring MaxScale Before you start the simulation, configure MaxScale for transaction replay. Also, adjust timeouts to make the simulation more interesting. Navigate to http://localhost:8989/ and log into the UI using: Username:admin Password:mariadb You’ll see a dashboard with the MariaDB cluster state. There’s a primary server (server-1), and two replicas (server-2 and server-3). Replication is already configured from server-1 (primary) to server-2 and server-3 (replicas). All servers should be up and running. Click on mdb_monitor and then on the pencil icon to enable parameter editing. Set the following parameters: auto_failover (true): This enables automatic failover. When a MariaDB server is down, MaxScale selects a replica server and reconfigures it as the new primary so that writes can continue to happen. auto_rejoin (true): This enables automatic rejoin of recovered servers. When a failed server is up again, MaxScale detects it and configures it as an available replica server. failcount (1): Sets the number of monitor (a component in MaxScale that checks server status) iterations required for a server to be down in order to activate the failover process. We set a value of 1 to make sure the failover starts immediately after failure. backend_connect_timeout (1000): Connection timeout for monitor connections. We set a low value (one second) to quickly activate failover for this demo. backend_read_timeout (1000): Read timeout for monitor connections. backend_write_timeout (1000): Write timeout for monitor connections. master_failure_timeout (1000): Primary failure timeout. monitor_interval (1000): How often the servers are monitored. WARNING: These values are appropriate for this demo but very likely not the best for production environments! Once the parameters are set, click on Done Editing and Confirm. You also need to enable transaction replay which automatically re-execute failed in-flight transactions on servers that went down just after a SQL statement was routed. 
This is a useful feature for software developers since it prevents the need for coding failure cases and transaction retry. On the main menu, click on Dashboard and then on any of the query_router_service links in the list of servers. Edit the parameters as follows: transaction_replay (true): Activates automatic retry of failed transactions. transaction_replay_retry_on_deadlock (true): Same as the previous when a deadlock occurs. transaction_replay_retry_on_mismatch (true): Same as the previous when a checksum mismatch occurs. Once the parameters are set, click on Done Editing and Confirm. Starting the Simulation With everything configured, you can start the simulation. Navigate to http://localhost:5173/ and configure the following parameters (names are, I hope, self-explanatory): Product visits per minute:6000 Orders per minute:60 Timeout in milliseconds:8000 But before you start the simulation, you need to create the products for the online store. Click on Data | Create products…. Leave the default values and click on Create. You should see the UI updating as products are created in the database. Now you can finally click on Start and see the simulation in action. Simulating a Server Failure At this point, the primary server is handling writes (orders). What happens if you stop that server? In the command line run: Shell docker stop server-1 Depending on multiple factors you might get some “disappointed visitors” or even a few “missed opportunities” in the simulator. Or maybe you don’t get any at all! Product visits (reads) and orders (writes) continue to happen thanks to MaxScale. Without automatic failover, you have to reconfigure everything manually which ends up more offline time and in many disappointed visitors and missed opportunities! Start the failed server: Shell docker start server-1 Go to the MaxScale Dashboard (http://localhost:8989/) and check that server-1 is now a functioning replica. You can perform a manual switchover to make server-1 the primary server again. Click on mdb_monitor and then mouse hover over the MASTER section. Click on the pencil icon and select server-1. Click Swap and check again in the Dashboard that the new primary server is server-1. Conclusion Automatic failover is only one of the components in highly available systems. You can use a database proxy like MaxScale to set up automatic failover, but also other components such as load-balancing, query routing, transaction retry, topology isolation, and more. Check out the documentation here.
Flyway is a popular open-source tool for managing database migrations. It makes it easy to manage and version control the database schema for your application. Flyway supports almost all popular databases, including Oracle, SQL Server, DB2, MySQL, Amazon RDS, Aurora MySQL, MariaDB, PostgreSQL, and more. For the full list of supported databases, you can check the official documentation.

How Flyway Migrations Work

Any change to the database is called a migration. Flyway supports two types of migrations: versioned and repeatable.

Versioned migrations are the most common type of migration; they are applied once, in the order they appear. Versioned migrations are used for creating, altering, and dropping tables, indexes, or foreign keys. Versioned migration files follow the naming convention [Prefix][Separator][Migration Description][Suffix], for example, V1__add_user_table.sql and V2__alter_user_table.sql.

Repeatable migrations, on the other hand, are (re-)applied every time they change. Repeatable migrations are useful for managing views, stored procedures, or bulk reference data updates where the latest version should replace the previous one without considering versioning. Repeatable migrations are always applied last, after all pending versioned migrations have been executed. Repeatable migration files use naming conventions such as R__add_new_table.sql.

The migration scripts can be written in either SQL or Java. When we start the application against an empty database, Flyway first creates a schema history table (flyway_schema_history). This table is used to track the state of the database. After the flyway_schema_history table is created, Flyway scans the classpath for migration files. The migrations are then sorted by version number and applied in order. As each migration is applied, the schema history table is updated accordingly.

Integrating Flyway in Spring Boot

In this tutorial, we will create a Spring Boot application that manages MySQL 8 database migrations using Flyway. This example uses Java 17, Spring Boot 3.2.4, and MySQL 8.0.26. For database operations, we will use Spring Data JPA.

Install Flyway Dependencies

First, add the following dependencies to your pom.xml or build.gradle file. The spring-boot-starter-data-jpa dependency is used for Spring Data JPA with Hibernate. The mysql-connector-j is the official JDBC driver for MySQL databases; it allows your Java application to connect to a MySQL database for operations such as creating, reading, updating, and deleting records. The flyway-core dependency is essential for integrating Flyway into your project, enabling migrations and version control for your database schema. The flyway-mysql dependency adds Flyway support for MySQL databases; it provides MySQL-specific functionality and optimizations for Flyway operations and is necessary when your application uses Flyway to manage database migrations on a MySQL database.
pom.xml XML <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-data-jpa</artifactId> </dependency> <dependency> <groupId>com.mysql</groupId> <artifactId>mysql-connector-j</artifactId> <scope>runtime</scope> </dependency> <dependency> <groupId>org.flywaydb</groupId> <artifactId>flyway-core</artifactId> </dependency> <dependency> <groupId>org.flywaydb</groupId> <artifactId>flyway-mysql</artifactId> </dependency> <!-- Other dependencies--> </dependencies> Configure the Database Connection Now let us provide the database connection properties in your application.properties file. Properties files # DB properties spring.datasource.url=jdbc:mysql://localhost:3306/flyway_demo spring.datasource.username=root spring.datasource.password=Passw0rd spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver #JPA spring.jpa.show-sql=true Create Database Changelog Files Let us now create a couple of database migration schema files inside the resources/db/migrations directory. V1__add_movies_table SQL CREATE TABLE movie ( id bigint NOT NULL AUTO_INCREMENT, title varchar(255) DEFAULT NULL, headline varchar(255) DEFAULT NULL, language varchar(255) DEFAULT NULL, region varchar(255) DEFAULT NULL, thumbnail varchar(255) DEFAULT NULL, rating enum('G','PG','PG13','R','NC17') DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB; V2__add_actor_table.sql SQL CREATE TABLE actor ( id bigint NOT NULL AUTO_INCREMENT, first_name varchar(255) DEFAULT NULL, last_name varchar(255) DEFAULT NULL, PRIMARY KEY (id) ) ENGINE=InnoDB; V3__add_movie_actor_relations.sql SQL CREATE TABLE movie_actors ( actors_id bigint NOT NULL, movie_id bigint NOT NULL, PRIMARY KEY (actors_id, movie_id), KEY fk_ref_movie (movie_id), CONSTRAINT fk_ref_movie FOREIGN KEY (movie_id) REFERENCES movie (id), CONSTRAINT fl_ref_actor FOREIGN KEY (actors_id) REFERENCES actor (id) ) ENGINE=InnoDB; R__create_or_replace_movie_view.sql SQL CREATE OR REPLACE VIEW movie_view AS SELECT id, title FROM movie; V4__insert_test_data.sql SQL INSERT INTO movie (title, headline, language, region, thumbnail, rating) VALUES ('Inception', 'A thief who steals corporate secrets through the use of dream-sharing technology.', 'English', 'USA', 'inception.jpg', 'PG13'), ('The Godfather', 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.', 'English', 'USA', 'godfather.jpg', 'R'), ('Parasite', 'A poor family, the Kims, con their way into becoming the servants of a rich family, the Parks. But their easy life gets complicated when their deception is threatened with exposure.', 'Korean', 'South Korea', 'parasite.jpg', 'R'), ('Amélie', 'Amélie is an innocent and naive girl in Paris with her own sense of justice. She decides to help those around her and, along the way, discovers love.', 'French', 'France', 'amelie.jpg', 'R'); -- Inserting data into the 'actor' table INSERT INTO actor (first_name, last_name) VALUES ('Leonardo', 'DiCaprio'), ('Al', 'Pacino'), ('Song', 'Kang-ho'), ('Audrey', 'Tautou'); -- Leonardo DiCaprio in Inception INSERT INTO movie_actors (actors_id, movie_id) VALUES (1, 1); -- Al Pacino in The Godfather INSERT INTO movie_actors (actors_id, movie_id) VALUES (2, 2); -- Song Kang-ho in Parasite INSERT INTO movie_actors (actors_id, movie_id) VALUES (3, 3); -- Audrey Tautou in Amélie INSERT INTO movie_actors (actors_id, movie_id) VALUES (4, 4); These tables are mapped to the following entity classes. 
Movie.java Java @Entity @Data public class Movie { @Id @GeneratedValue(strategy = GenerationType.IDENTITY) private Long id; private String title; private String headline; private String thumbnail; private String language; private String region; @Enumerated(EnumType.STRING) private ContentRating rating; @ManyToMany Set<Actor> actors; } public enum ContentRating { G, PG, PG13, R, NC17 } Actor.java Java @Entity @Data public class Actor { @Id @GeneratedValue(strategy = GenerationType.IDENTITY) Long id; String firstName; String lastName; } Configure Flyway We can control the migration process using the following properties in the application.properties file: application.properties spring.flyway.enabled=true spring.flyway.locations=classpath:db/migrations spring.flyway.baseline-on-migrate=true spring.flyway.validate-on-migrate=true Property Use spring.flyway.enabled=true Enables or disables Flyway's migration functionality for your application spring.flyway.validate-on-migrate=true When this property is set to true, Flyway will validate the applied migrations against the migration scripts every time it runs a migration. This ensures that the migrations applied to the database match the ones available in the project. If validation fails, Flyway will prevent the migration from running, which helps catch potential problems early. spring.flyway.baseline-on-migrate=true Used when you have an existing database that wasn't managed by Flyway and you want to start using Flyway to manage it. Setting this to true allows Flyway to baseline an existing database, marking it as a baseline and starting to manage subsequent migrations. spring.flyway.locations Specifies the locations of migration scripts within your project. Run the Migrations When you start your Spring Boot application, Flyway will automatically check the db/migrations directory for any new migrations that have not yet been applied to the database and will apply them in version order. ./mvnw spring-boot:run Reverse/Undo Migrations in Flyway Flyway allows you to revert migrations that were applied to the database. However, this feature requires you to have a Flyway Teams (Commercial) license. If you're using the community/free version of Flyway, the workaround is to create a new migration changelog file to undo the changes made by the previous migration and apply them. For example, V5__delete_movie_actors_table.sql DROP TABLE movie_actors; Now run the application to apply the V5 migration changelog to your database. Using Flyway Maven Plugin Flyway provides a maven plugin to manage the migrations from the command line. It provides 7 goals. Goal Description flyway:baseline Baselines an existing database, excluding all migrations up to and including baselineVersion. flyway:clean Drops all database objects (tables, views, procedures, triggers, ...) in the configured schemas. The schemas are cleaned in the order specified by the schemas property.. flyway:info Retrieves the complete information about the migrations including applied, pending and current migrations with details and status flyway:migrate Triggers the migration of the configured database to the latest version. flyway:repair Repairs the Flyway schema history table. This will remove any failed migrations on databases without DDL transactions flyway:undo Undoes the most recently applied versioned migration. Flyway teams only flyway:validate Validate applied migrations against resolved ones on the classpath. 
This detects accidental changes that may prevent the schema(s) from being recreated exactly. To integrate the Flyway Maven plugin into your Maven project, we need to add the flyway-maven-plugin to your pom.xml file. XML <properties> <database.url>jdbc:mysql://localhost:3306/flyway_demo</database.url> <database.username>YOUR_DB_USER</database.username> <database.password>YOUR_DB_PASSWORD</database.password> </properties> <build> <plugins> <plugin> <groupId>org.flywaydb</groupId> <artifactId>flyway-maven-plugin</artifactId> <version>10.10.0</version> <configuration> <url>${database.url}</url> <user>${database.username}</user> <password>${database.password}</password> </configuration> </plugin> <!-- other plugins --> </plugins> </build> Now you can use the Maven goals. ./mvnw flyway:migrate Maven allows you to define properties in the project's POM and pass their values from the command line. ./mvnw -Ddatabase.username=root -Ddatabase.password=Passw0rd flyway:migrate
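If you want to confirm exactly what Flyway has recorded for this database, you can also query its schema history table directly (named flyway_schema_history by default). A minimal sketch against the flyway_demo database used above:
SQL
SELECT installed_rank, version, description, type, script, success
FROM flyway_schema_history
ORDER BY installed_rank;
Every applied versioned (V) and repeatable (R) migration appears here together with its checksum and outcome; this is the same information that flyway:info reports and that validate-on-migrate checks against the scripts in your project.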
I would like to introduce to you a Java class with fewer than 170 lines of code that facilitates work with SQL queries called via the JDBC API. What makes this solution interesting? The class can be embedded in a Java version 17 script. Using a Java Script The advantage of a Java script is easy portability in text format and the possibility of running it without prior compilation, while still having considerable resources from the language's standard library available at runtime. Scripts lend themselves to various prototypes, in which even more complicated data exports or data conversions can be solved (after connecting to the database). Scripts are useful wherever we don't want to (or can't) put the implementation into a standard Java project. However, using a script has some limitations. For example, the code must be written in a single file. We can include all the necessary libraries when we run the script, but these will likely have additional dependencies, and simply listing them on the command line can be frustrating. The complications associated with the distribution of such a script probably do not need to be emphasized. For the above reasons, I believe that external libraries in scripts are best avoided. If we still want to go the script route, the choice falls on pure JDBC. Multi-line text literals can be used to advantage for writing SQL queries, and objects like PreparedStatement (which implements the AutoCloseable interface) can be closed automatically. So what's the problem? Mapping SQL Parameter Values For security reasons, it is advisable to map SQL parameter values to question marks. I consider the main handicap of JDBC to be the mapping of parameters using the sequence number of the question mark (starting with one). The first version of the parameter mapping to the SQL script often turns out well, but the risk of error increases as the number of parameters grows and the SQL is modified further. Remember that inserting a new parameter in the first position means all the following parameters must be renumbered. Another complication is the use of the IN operator, because for each value of the enumeration a question mark must be written in the SQL template, and each one must be mapped to a separate parameter. If the parameter list is dynamic, the list of question marks in the SQL template must also be dynamic. Debugging a larger number of more complex SQL statements can start to take a significant amount of time. For inserting SQL parameters using String Templates, we will have to wait a little longer. However, inserting SQL parameters could be facilitated by a simple wrapper over the PreparedStatement interface, which would (before calling the SQL statement) attach the parameters using JPA-style named tags (alphanumeric text starting with a colon). A wrapper could also simplify reading data from the database (with a SELECT statement) if it allowed the necessary methods to be chained into a single statement, preferably with a return type of Stream<ResultSet>. SqlParamBuilder Class A visualization of the SQL command with attached parameters would sometimes be useful for debugging or logging the SQL query. I present to you the class SqlParamBuilder. The priority of the implementation was to cover the stated requirements with a single Java class with minimalistic code. The programming interface was inspired by the JDBI library. The samples use the H2 database in in-memory mode; however, you will still need to add the database driver to the classpath.
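To make the pain point concrete before looking at the builder, here is roughly what positional parameters and a dynamic IN list look like in plain JDBC. This is a minimal, hypothetical sketch (it assumes java.sql and java.util imports, a java.sql.Connection named dbConnection, and the same employee table used in the sample below):
Java
// One '?' per IN value has to be generated by hand, and every value is bound by position.
List<String> codes = List.of("T", "V");
String placeholders = String.join(",", Collections.nCopies(codes.size(), "?"));
String sql = "SELECT t.id, t.name FROM employee t WHERE t.id < ? AND t.code IN (" + placeholders + ")";
try (PreparedStatement ps = dbConnection.prepareStatement(sql)) {
    int index = 1;
    ps.setInt(index++, 10);            // parameter #1: the id limit
    for (String code : codes) {        // parameters #2..n: each IN value needs its own index
        ps.setString(index++, code);
    }
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getInt("id") + " " + rs.getString("name"));
        }
    }
}
// Inserting a new parameter at the start of the WHERE clause shifts every index after it.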
Java void mainStart(Connection dbConnection) throws Exception { try (var builder = new SqlParamBuilder(dbConnection)) { System.out.println("# CREATE TABLE"); builder.sql(""" CREATE TABLE employee ( id INTEGER PRIMARY KEY , name VARCHAR(256) DEFAULT 'test' , code VARCHAR(1) , created DATE NOT NULL ) """) .execute(); System.out.println("# SINGLE INSERT"); builder.sql(""" INSERT INTO employee ( id, code, created ) VALUES ( :id, :code, :created ) """) .bind("id", 1) .bind("code", "T") .bind("created", someDate) .execute(); System.out.println("# MULTI INSERT"); builder.sql(""" INSERT INTO employee (id,code,created) VALUES (:id1,:code,:created), (:id2,:code,:created) """) .bind("id1", 2) .bind("id2", 3) .bind("code", "T") .bind("created", someDate.plusDays(7)) .execute(); builder.bind("id1", 11) .bind("id2", 12) .bind("code", "V") .execute(); System.out.println("# SELECT"); List<Employee> employees = builder.sql(""" SELECT t.id, t.name, t.created FROM employee t WHERE t.id < :id AND t.code IN (:code) ORDER BY t.id """) .bind("id", 10) .bind("code", "T", "V") .streamMap(rs -> new Employee( rs.getInt("id"), rs.getString("name"), rs.getObject("created", LocalDate.class))) .toList(); System.out.printf("# PRINT RESULT OF: %s%n", builder.toStringLine()); employees.stream() .forEach((Employee employee) -> System.out.println(employee)); assertEquals(3, employees.size()); assertEquals(1, employees.get(0).id); assertEquals("test", employees.get(0).name); assertEquals(someDate, employees.get(0).created); } } record Employee (int id, String name, LocalDate created) {} static class SqlParamBuilder {…} Usage Notes and Final Thoughts An instance of the type SqlParamBuilder can be recycled for multiple SQL statements. After calling the command, the parameters can be changed and the command can be run again. The parameters are assigned to the last used PreparedStatement object. The sql() method automatically closes the internal PreparedStatement object (if one was open before). If we change the group of parameters (typically for the IN operator), we need to send the same number of values for the same PreparedStatement; otherwise, the sql() method will need to be called again. After the last command execution, the SqlParamBuilder object needs to be closed explicitly. However, since it implements the AutoCloseable interface, it is enough to enclose the entire block in a try-with-resources block. Closing does not affect the contained database connection. In the Bash shell, the sample can be run with the script SqlExecutor.sh, which can download the necessary JDBC driver (here, for the H2 database). If we prefer Kotlin, we can try the Bash script SqlExecutorKt.sh, which converts the prepared Kotlin code to a script and runs it. Let's not get confused by the fact that the class is stored in a Maven-type project. One reason is the ease of running JUnit tests. The class is licensed under the Apache License, Version 2.0. Probably the fastest way to create your own implementation is to download the example script, rework the mainRun() method, and modify the connection parameters to your own database. Use your own JDBC driver to run it.
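The article does not list the SqlParamBuilder internals, but the central trick — translating JPA-style named tags into JDBC question marks — can be sketched in a few lines. This is a simplified, hypothetical illustration of the idea (assuming java.util.regex and java.util imports), not the actual implementation:
Java
// Replace each :name tag with '?' and record the order in which the names occur.
static String toJdbcSql(String namedSql, List<String> orderedNames) {
    Matcher matcher = Pattern.compile(":(\\w+)").matcher(namedSql);
    StringBuilder jdbcSql = new StringBuilder();
    while (matcher.find()) {
        orderedNames.add(matcher.group(1));      // remember which name belongs to this position
        matcher.appendReplacement(jdbcSql, "?");
    }
    matcher.appendTail(jdbcSql);
    return jdbcSql.toString();
}
// Binding then becomes positional again, but hidden from the caller: for each recorded name,
// look up its value in a Map<String, Object> and call preparedStatement.setObject(position, value).
A real wrapper also has to expand a name bound to several values (the IN (:code) case above) into the matching number of question marks before preparing the statement.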
I’ve lost too much time fiddling around with configurations and services just to spin up a compute instance in AWS. Sometimes, I just need a production-ready environment to play around with to test out applications and ideas. With Heroku, I can get that with just a few simple commands at the CLI. Recently, I learned that Heroku also includes support for PostGIS. I’ve personally never used PostGIS before. I know of several proprietary competitors to the product, so I figured this would be a great time to try out the open-source option. In this article, I’ll show you how to get a PostGIS-enabled Postgres instance running on Heroku. Then, I’ll run some sample queries on the database, just to give you a feel for how it works. And the best part? You can follow along and do your own exploring as we go! What Is PostGIS? Even if you’ve used Postgres for a while, you might not be familiar with PostGIS. The GIS stands for Geographic Information System. There are many solutions in the space. But the thing that makes PostGIS nice is that it’s based on the well-loved PostgreSQL database. In addition to all of the performance you’d expect from Postgres, we get a full-featured tool for storing geospatial data. Not only does PostGIS provide a good storage solution for this type of data, but it can be seamlessly integrated with several applications that can consume this data (such as ArcGIS and Tableau). TL;DR — If you need to process, store, or query location data, PostGIS is a great option for doing that. Fortunately, it’s as simple as adding an addon to a Heroku app to get a new Postgres instance up and running. So, let’s do that now. How Can We Use PostGIS? To get started, you’ll need an app of any size. Then, you add an instance of Heroku Postgres to your app. Create a Heroku App For my demo, I’m going to create an empty app. Attach a Heroku Postgres Add-On Once the app is created, I can create the Heroku Postgres add-on. Because my sample data set is too big for a Mini plan instance, I need to use the basic plan instead. I can do this from the command line: Shell $ heroku login $ heroku addons:create heroku-postgresql:basic -a postgis-demo Creating heroku-postgresql:basic on ⬢ postgis-demo... ~$0.013/hour (max $9/month) Database has been created and is available ! This database is empty. If upgrading, you can transfer ! data from another database with pg:copy Created postgresql-fitted-78461 as DATABASE_URL Once I’ve created my Postgres database, I only have a few more steps to set up PostGIS. Create the PostGIS Extension Heroku Postgres has many possible extensions we could install. To list them, we can ask our instance: SQL $ heroku pg:psql -a postgis-demo --> Connecting to postgresql-fitted-78461 … postgis-demo::DATABASE=> \x on; Expanded display is on. postgis-demo::DATABASE=> show extwlist.extensions; … address_standardizer,address_standardizer_data_us,amcheck,autoinc,bloom,btree_gin,btree_gist,citext,cube,dict_int,earthdistance,fuzzystrmatch,hstore,insert_username,intarray,isn,lo,ltree,moddatetime,pg_partman,pg_stat_statements,pg_trgm,pgcrypto,pgrowlocks,postgis,postgis_raster,postgis_topology,refint,seg,sslinfo,tablefunc,tcn,tsm_system_rows,tsm_system_time,unaccent,uuid-ossp We see postgis in the list of available extensions. From there, we can create the extension. 
SQL postgis-demo::DATABASE=> CREATE EXTENSION postgis; CREATE EXTENSION We can confirm the extension is installed and check the version: SQL postgis-demo::DATABASE=> SELECT postgis_version(); -[ RECORD 1 ]---+-------------------------------------- postgis_version | 3.4 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 Alright! It looks like we’re up and running with PostGIS 3.4. Load Initial Dataset Now that I’ve got the PostGIS extension enabled, I need to load in a dataset to play around with. I’m using the dataset provided by the Introduction to PostGIS book. The downloaded data bundle is a 21.5 MB zip file. In the data subfolder of the extracted archive, there’s a 9.5 MB file called nyc_data.backup. This is a file with all of the census data from the 2000 census of New York City, along with all of the streets, neighborhoods, and subway stations in the city. We can restore the data backup directly to our Heroku Postgres instance by using the heroku pg:backups:restore command. This is incredibly convenient. However, keep in mind the following caveats: The backup file that you can restore from cannot be uploaded from your local machine. It must be available online. Fortunately, I found a GitHub repo that makes nyc_data.backup available. Performing database restore starts by completely resetting your Heroku Postgres instance, including your installation of the postgis extension. So, even though we showed above how to install the extension manually, we’ll need to add a flag when we restore our database to pre-install the extension before loading the data. Here’s the command we would use to restore the database backup: Shell $ heroku pg:backups:restore \ https://github.com/Giorgi/PostgresSamples/raw/main/nyc_data.backup \ -e postgis \ -a postgis-demo Our backup file is specified through a publicly accessible URL. You can always download the dataset from the PostGIS tutorial, extract the nyc_data.backup file, and post it online to a location of your own choosing. The -e postgis flag specifies that we want to install the postgis extension prior to loading the backup’s schema and data. That was it! Not bad for a few simple commands. We have our database and data. Why Heroku? If you already know how to set up Postgres on a local machine, you might be wondering why I went with Heroku. For me, the biggest reason is simplicity. Besides choosing a large enough Heroku Postgres plan for the analysis I plan to do and installing the PostGIS extension, there’s nothing else I need to do to get up and running. Also, collaborating on any analysis I do is easy. I can grant other people access to my database as collaborators, or I can quickly build an application on top of the database and share access through a normal web interface, rather than the Postgres client. Finally, when I’m done working on a project and I don’t need it any longer, I can just delete the app on Heroku and it’s all gone. No data files on my computer to worry about. No extra software installed locally. I’m able to enjoy a quick excursion into a new technology and then move on when I’m done. Working With PostGIS Now, let’s take a look at how PostGIS works. Work Just as You Would With Postgres The first thing to remember is that PostGIS is an extension within Postgres. That means that you can also perform any standard Postgres query. Let’s say I wanted to find out how many streets in New York start with B. 
A simple SQL query will tell me: SQL postgis-demo::DATABASE=> SELECT count(*) postgis-demo::DATABASE-> FROM nyc_streets postgis-demo::DATABASE-> WHERE name LIKE 'B%'; count ------- 1282 (1 row) How about the number of neighborhoods in each borough? Again, a simple SQL query: SQL postgis-demo::DATABASE=> SELECT boroname, count(*) postgis-demo::DATABASE-> FROM nyc_neighborhoods postgis-demo::DATABASE-> GROUP BY boroname; boroname | count ---------------+------- Queens | 30 Brooklyn | 23 Staten Island | 24 The Bronx | 24 Manhattan | 28 (5 rows) So far we’ve just done standard PostgreSQL. Now, let’s take a look at how to use PostGIS features. Examples of Working With Geospatial Geometries Because our dataset includes all New York streets, we can ask how many kilometers of streets there are in the city with this query: SQL postgis-demo::DATABASE=> SELECT Sum(ST_Length(geom)) / 1000 as street_length FROM nyc_streets; street_length -------------------- 10418.904717199996 (1 row) We can also calculate areas, such as the acreage of the entirety of Manhattan: SQL postgis-demo::DATABASE=> SELECT Sum(ST_Area(geom)) / 4047 as acreage FROM nyc_neighborhoods WHERE boroname = 'Manhattan'; acreage ------------------- 13965.32012239119 (1 row) Note that these calculations come from the geospatial data, not from columns related to aggregated data of this sort. Not only that, but these queries execute extremely quickly. One final query that I’m really amazed by involves the use of spatial joins. Much like standard database joins, spatial joins can unite multiple tables, but on the basis of spatial relationships. For example, we can query for which neighborhood a specific subway station is in, using the spatial data. To do this, we can use ST_Contains from PostGIS to determine if the geometry of the neighborhood completely contains the geometry of the subway station. Based on the name of the subway (in nyc_subway_stations), we query for the neighborhood (in nyc_neighborhoods) for which ST_Contains is true. Our query looks like this: SQL postgis-demo::DATABASE=> SELECT subways.name AS subway_name, neighborhoods.name AS neighborhood_name, neighborhoods.boroname AS borough FROM nyc_neighborhoods AS neighborhoods JOIN nyc_subway_stations AS subways ON ST_Contains(neighborhoods.geom, subways.geom) WHERE subways.name = 'Broad St'; subway_name | neighborhood_name | borough -------------+--------------------+----------- Broad St | Financial District | Manhattan (1 row) PostGIS provides even more advanced location querying functionality with geometries, but that’s outside the scope of our simple demo here. Conclusion Having never used PostGIS before, I’m really impressed by what it can do. There’s a lot more I could do with this database too, since I’ve only made it about halfway through the official Introduction to PostGIS book. Not only that, I can build and deploy applications on top of PostGIS by using any number of languages supported by Heroku. In particular, I’m thinking I might want to find a use case for building a Rails app on top of PostGIS. I already found some documentation on how I can get started. But for now, I don’t need this instance anymore, so I’m going to clean it up and delete my app. From the CLI, this is what I need to do: Shell $ heroku apps:destroy postgis-demo ▸ WARNING: This will delete ⬢ postgis-demo including all add-ons. ▸ To proceed, type postgis-demo or re-run this command with --confirm postgis-demo > postgis-demo Destroying ⬢ postgis-demo (including all add-ons)... 
done Wait, that’s all? Yeah, that’s all. With a single command and confirmation, everything is torn down and I don’t need to worry about it anymore. Shell $ heroku apps You have no apps. $ heroku addons No add-ons. Now that I’ve deleted my app, you have an incredible opportunity: The unique app name postgis-demo is available for the first reader who wants to grab it on Heroku! Are you ready to build your next great PostGIS app? Today is the day!
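If you do spin the demo back up, a natural next step into the more advanced location queries mentioned earlier is distance search. Here is a hedged sketch against the same tutorial dataset (whose geometries are stored in a meter-based projection, which is what the length and area queries above rely on): find the subway stations within 500 meters of the Broad St station.
SQL
SELECT nearby.name
FROM nyc_subway_stations AS broad
JOIN nyc_subway_stations AS nearby
  ON ST_DWithin(broad.geom, nearby.geom, 500)  -- 500 meters in the dataset's projected units
WHERE broad.name = 'Broad St'
  AND nearby.name <> broad.name;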
Twenty-twenty-three has been referred to as “The Year of the Lock Manager’s Revenge” in a recent article that examined production challenges related to queries across multiple partitions and indexes. Kyle Hailey, who has extensive Oracle Database and performance troubleshooting experience, documented one such issue at Midjourney, highlighting how surprised he was by PostgreSQL's limitations. PostgreSQL’s partitioning is implemented on top of tables, much like inheritance and views. This results in too many partitions presenting the same types of problems as querying too many tables and indexes. In his write-up, he offers valuable insights into PostgreSQL partitioning challenges and scalability issues compared to Oracle Database. Optimizing the Management of Large Data Sets in PostgreSQL Managing terabyte-sized tables can be complex. Partitioning the data is a solution, but having hundreds of partitions — which are similar to tables — makes querying impractical. Although YugabyteDB shares a few of PostgreSQL's limitations, it enhances scalability with its underlying storage. It distributes table rows and index entries, reducing the reliance on declarative partitioning.* *NOTE: Declarative partitioning remains useful in YugabyteDB for, say, purging old data and ensuring data sovereignty, where managing tens of partitions is acceptable. When it comes to scaling both horizontally and linearly, YugabyteDB automatically and transparently splits and distributes data across partitions, optimizing reads and writes. In this article, I'll run two demos: On PostgreSQL with declarative partitioning and lock manager contention, highlighting some of the limitations On YugabyteDB, which has distributed tables, indexes, and linear scalability. PostgreSQL With Declarative Partitioning and LockManager Contention To summarize the main problem, I created a table in PostgreSQL with a hundred partitions: SQL CREATE TABLE payments_part ( primary key (payment_id) , payment_id bigint generated always as identity (cache 1000) , created timestamptz not null default now() , account_id bigint not null , amount decimal(10,2) not null ) partition by hash (payment_id) ; -- generate the partition DDL and \gexec it. select format('create table %I partition of %I for values with ( modulus %s, remainder %s)' , 'payments_'||to_char(num,'FM099') , 'payments_part' , max(num)over(), num-1 ) from generate_series(1,100) num ; \gexec create index payments_by_account_part on payments_part(account_id, amount desc); vacuum analyze payments_part; While it is not necessary to demonstrate the problem, let’s add a hundred million rows to provide a practical example to work with. SQL insert into payments_part (account_id, amount, created) select 1000 * random() as account, random() as amount, random() * interval '24.25 year' + timestamp '2000-01-01 00:00:00' as created from generate_series(1,100000) account_id \watch c=1000 vacuum analyze payments_part; The problem is evident in the execution plan for a basic query. Because indexes are local, the query performs a hundred Index Scans and concatenates the results.
SQL postgres=> explain (analyze, buffers, costs off) select * from payments_part where account_id = 42 and amount < 0.01 ; QUERY PLAN ----------------------------------------------------------------------------------------- Append (actual time=0.014..1.018 rows=495 loops=1) Buffers: shared hit=797 -> Index Scan using payments_001_account_id_amount_idx on payments_001 (actual time=0.013..0.018 rows=5 loops=1) Index Cond: ((account_id = 42) AND (amount < 0.01)) Buffers: shared hit=8 -> Index Scan using payments_002_account_id_amount_idx on payments_002 (actual time=0.007..0.008 rows=1 loops=1) Index Cond: ((account_id = 42) AND (amount < 0.01)) Buffers: shared hit=4 ... (for the 100 of partitions) -> Index Scan using payments_099_account_id_amount_idx on payments_099 (actual time=0.005..0.008 rows=4 loops=1) Index Cond: ((account_id = 42) AND (amount < 0.01)) Buffers: shared hit=7 -> Index Scan using payments_100_account_id_amount_idx on payments_100 (actual time=0.006..0.009 rows=4 loops=1) Index Cond: ((account_id = 42) AND (amount < 0.01)) Buffers: shared hit=7 Planning Time: 2.645 ms Execution Time: 1.415 ms (304 rows) In addition to the cost of going through a hundred B-Tree levels, the execution plan is huge. Even at the time of execution, some locks must be acquired in shared memory structures, and this limits the scalability. Here's an execution plan for a simple query retrieving the maximum amount for one account: SQL postgres=> explain (analyze, buffers, costs off, summary off ) select max(amount) from payments_part where account_id = 42 ; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Result (actual time=1.526..1.549 rows=1 loops=1) Buffers: shared hit=401 InitPlan 1 (returns $0) -> Limit (actual time=1.521..1.545 rows=1 loops=1) Buffers: shared hit=401 -> Merge Append (actual time=1.520..1.543 rows=1 loops=1) Sort Key: payments_001.amount DESC Buffers: shared hit=401 -> Index Only Scan using payments_001_account_id_amount_idx on payments_001 (actual time=0.026..0.027 rows=1 loops=1) Index Cond: ((account_id = 42) AND (amount IS NOT NULL)) Heap Fetches: 0 Buffers: shared hit=4 ... (for the 100 of partitions) -> Index Only Scan using payments_100_account_id_amount_idx on payments_100 (actual time=0.016..0.016 rows=1 loops=1) Index Cond: ((account_id = 42) AND (amount IS NOT NULL)) Heap Fetches: 0 Buffers: shared hit=4 Even with quick response times, reading the first row from each index partition makes a plan that reads one hundred table partitions unscalable, especially when concurrent executions will compete on the lock manager. Therefore, I used PgBench to run this query from two hundred client sessions: SQL cat > query.sql <<'SQL' \set id random(1,1000) select max(amount) from payments_part where account_id = :id; SQL pgbench -c 200 -T 300 -nf query.sql -M prepared -r During the run, wait events indicate that sessions devote approximately one-third of their execution time to waiting for locks. Database load with wait events The latency metrics from the PgBench client is not helpful since the client-server round trip consumes the majority of the time. 
However, the throughput gives an idea of the performance: SQL [ec2-user@ip-172-31-33-218 postgresql-16.2]$ cat > query.sql <<'SQL' \set id random(1,1000) select max(amount) from payments_part where account_id = :id; SQL pgbench -c 200 -T 300 -nf query.sql -M prepared -r pgbench (16.2, server 11.22) transaction type: query.sql scaling factor: 1 query mode: prepared number of clients: 200 number of threads: 1 maximum number of tries: 1 duration: 300 s number of transactions actually processed: 2348934 number of failed transactions: 0 (0.000%) latency average = 25.369 ms initial connection time = 2078.371 ms tps = 7883.760582 (without initial connection time) statement latencies in milliseconds and failures: 0.001 0 \set id random(1,1000) 24.580 0 select max(amount) from payments_part where account_id = :id; [ec2-user@ip-172-31-33-218 postgresql-16.2]$ Response time and latency are visible from pg_stat_statements, revealing 6407 transactions per second with an average execution time of 1.45 milliseconds: Top statements with wait events. I have run this on AWS db.m5.8xlarge multi-AZ ( 2x 32 vCPUs, 128 GiB ) managed by Amazon RDS. Adding more CPU would not help because most of the time is spent on lock manager contention. YugabyteDB With Distributed Tables and Indexes To Scale Performing the same operation as above with YugabyteDB, with a hundred tables, is feasible but lacks scalability. The reason is different: YugabyteDB doesn't need to acquire shared locks on a central memory structure. However, its deployment across multiple availability zones means that sending read operations to numerous partitions will introduce network latency for each partition. To distribute data, utilize the storage's automatic sharding. Distributed reads scale by batching the operations, incurring latency only once per execution, typically a millisecond in a single-region deployment. It also provides all SQL agility without limitations: global secondary indexes and referential integrity. It is also easier since the tablets are split automatically as the table grows. Suppose I want the equivalent of 100 partitions. In that case, I simply add "split into 100 tablets" to my CREATE TABLE statement: SQL CREATE TABLE payments ( primary key (payment_id) , payment_id bigint generated always as identity (cache 1000) , created timestamptz not null default now() , account_id bigint not null , amount decimal(10,2) not null ) split into 100 tablets ; If I load 8TB of data into this table, each tablet will be approximately 80GB, ideal for avoiding lengthy and expensive operations. If I add new nodes to my cluster, the tablet peers will automatically move to balance the node, with the tablet as the unit of adjustment. Full compaction, if needed, will occur tablet by tablet. A cluster can run hundreds of tablets per node. I don't need to manually set the number of tablets. They will split automatically according to their size and the number of nodes in the cluster. I did it here to demonstrate the executions with a level of partitioning comparable to what I did in the PostgreSQL demo. I can create any index without limitations on the key because all indexes are globally distributed. Here is an index I created for quick access to the payments from one account: SQL create index payments_by_account on payments(account_id, amount desc) split into 100 tablets ; In general, you can create more secondary indexes in YugabyteDB than in PostgreSQL because neither has the same performance overhead. 
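As a hypothetical illustration of that point (this index is not part of the original demo), a covering index that also stores the amount column can answer time-ordered lookups for one account straight from the index, without touching the table rows:
SQL
CREATE INDEX payments_by_account_created
    ON payments (account_id, created DESC)
    INCLUDE (amount);
Because YugabyteDB distributes the index entries automatically, adding such an index does not require any partition-by-partition maintenance.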
I generate the same amount of data: SQL insert into payments (account_id, amount, created) select 1000 * random() as account, random() as amount, random() * interval '24.25 year' + timestamp '2000-01-01 00:00:00' as created from generate_series(1,100000) account_id \watch c=1000 I have 100 million rows distributed into 100 table tablets and 100 index tablets over 3 availability zones. SQL yugabyte=> select count(*) from payments ; count ----------- 100000000 (1 row) yugabyte=> select num_tablets from yb_table_properties('payments'::regclass); num_tablets ------------- 100 (1 row) yugabyte=> select num_tablets from yb_table_properties('payments_by_account'::regclass); num_tablets ------------- 100 (1 row) yugabyte=> select host, cloud, region, zone from yb_servers() order by 2,3,4; host | cloud | region | zone -------------+-------+-----------+------------ 10.8.8.142 | aws | eu-west-1 | eu-west-1a 10.8.9.141 | aws | eu-west-1 | eu-west-1a 10.8.8.5 | aws | eu-west-1 | eu-west-1a 10.8.11.120 | aws | eu-west-1 | eu-west-1b 10.8.10.34 | aws | eu-west-1 | eu-west-1b 10.8.11.158 | aws | eu-west-1 | eu-west-1b 10.8.12.201 | aws | eu-west-1 | eu-west-1c 10.8.12.82 | aws | eu-west-1 | eu-west-1c 10.8.13.34 | aws | eu-west-1 | eu-west-1c (9 rows) The key insight into scalability comes from the execution plan. Here is a query that retrieves all payments from an account with a payment amount lower than one cent: SQL yugabyte=> explain (analyze, dist, costs off) select * from payments where account_id = 42 and amount < 0.01 ; QUERY PLAN ----------------------------------------------------------------------------------------- Index Scan using payments_by_account on payments (actual time=6.755..7.013 rows=500 loops=1) Index Cond: ((account_id = 42) AND (amount < 0.01)) Storage Table Read Requests: 1 Storage Table Read Execution Time: 4.420 ms Storage Index Read Requests: 1 Storage Index Read Execution Time: 1.908 ms Planning Time: 0.064 ms Execution Time: 7.082 ms Storage Read Requests: 2 Storage Read Execution Time: 6.328 ms Storage Write Requests: 0 Catalog Read Requests: 0 Catalog Write Requests: 0 Storage Flush Requests: 0 Storage Execution Time: 6.328 ms Peak Memory Usage: 8 kB (16 rows) This execution plan does not concatenate partitions, preventing contention on the lock manager in PostgreSQL. It is a single table in the SQL layer. The read requests to the index and the table are batched and sent to all nodes, where rows and index entries are distributed to tablets. Fetching the result of 500 rows from the distributed storage (in multiple availability zones) in 7 milliseconds. Because the indexes are global, they are very efficient, especially when covering all columns required by the query. Here is a query that retrieves the payment amount for a list of accounts. 
SQL yugabyte=> explain (analyze, dist, costs off) select amount from payments where account_id in (1,12,42,68,102,500,746,909) and amount < 0.01 order by amount ; QUERY PLAN ----------------------------------------------------------------------------------------- Sort (actual time=5.222..5.341 rows=3967 loops=1) Sort Key: amount Sort Method: quicksort Memory: 282kB -> Index Only Scan using payments_by_account on payments (actual time=2.852..4.538 rows=3967 loops=1) Index Cond: ((account_id = ANY ('{1,12,42,68,102,500,746,909}'::bigint[])) AND (amount < 0.01)) Heap Fetches: 0 Storage Index Read Requests: 1 Storage Index Read Execution Time: 2.673 ms Planning Time: 0.093 ms Execution Time: 5.526 ms Storage Read Requests: 1 Storage Read Execution Time: 2.673 ms Storage Write Requests: 0 Catalog Read Requests: 0 Catalog Write Requests: 0 Storage Flush Requests: 0 Storage Execution Time: 2.673 ms Peak Memory Usage: 508 kB (18 rows) The result is returned in 5 milliseconds, which is typically unattainable with a PostgreSQL table partitioned on payment_id because it necessitates reading all partitions, in this case, 100 indexes. However, YugabyteDB reads many tablets in a single distributed read request. Additionally, thanks to YugabyteDB's loose index scan, it's capable of reading for multiple values in a single request. Note that I didn't need to VACUUM to get a real Index Only Scan (with Heap Fetches: 0). YugabyteDB implements multi-version concurrency control in tables and indexes, eliminating the need to go to the table rows, or their visibility map, for consistent reads. Here is the execution plan for the simple query that I'll run concurrently: SQL yugabyte=> explain (analyze, dist, costs off, summary off ) select max(amount) from payments where account_id = 42 ; QUERY PLAN ----------------------------------------------------------------------------------------- Result (actual time=1.151..1.152 rows=1 loops=1) InitPlan 1 (returns $0) -> Limit (actual time=1.148..1.149 rows=1 loops=1) -> Index Only Scan using payments_by_account on payments (actual time=1.147..1.147 rows=1 loops=1) Index Cond: ((account_id = 42) AND (amount IS NOT NULL)) Heap Fetches: 0 Storage Index Read Requests: 1 Storage Index Read Execution Time: 1.038 ms (8 rows) I use PgBench to run it from one hundred concurrent sessions: SQL cat > query.sql <<'SQL' \set id random(1,1000) select max(amount) from payments where account_id = :id; SQL pgbench -c 200 -T 300 -nf query.sql -M prepared -r The throughput is 9707 transactions per second: SQL [ec2-user@ip-172-31-33-218 postgresql-16.2]$ cat > query.sql <<'SQL' \set id random(1,1000) select max(amount) from payments where account_id = :id; SQL pgbench -c 200 -T 300 -nf query.sql -M prepared -r pgbench (16.2, server 11.2-YB-2.18.6.0-b0) transaction type: query.sql scaling factor: 1 query mode: prepared number of clients: 200 number of threads: 1 maximum number of tries: 1 duration: 300 s number of transactions actually processed: 2807877 number of failed transactions: 0 (0.000%) latency average = 20.614 ms initial connection time = 10623.820 ms tps = 9702.109724 (without initial connection time) statement latencies in milliseconds and failures: 0.001 0 \set id random(1,1000) 19.536 0 select max(amount) from payments where account_id = :id; [ec2-user@ip-172-31-33-218 postgresql-16.2]$ The average response time is 1.62 milliseconds: I have run this on AWS m6i.xlarge Replication Factor 3 ( 9x 4 vCPUs, 16 GiB ) managed by YugabyteDB, which is resilient to zone failure. 
There's no lock manager and no single point of contention. It can scale by adding new nodes, and the tablets will be automatically rebalanced. I can check that the 200 tablets (100 for the table and 100 for the secondary index) are well-distributed, with 22 Raft leaders on each node and two Raft followers in other zones: Dashboard: List of servers with tablet leaders and followers. YugabyteDB Metrics I've run this on Yugabyte’s fully-managed DBaaS, YugabyteDB Managed, where you can explore the performance metrics in Grafana or Datadog. Since all nodes are active, I connected through the load balancer and can verify that the connections were distributed to all nodes: Connections per server SQL processing is well-balanced, and it's evident that my cluster can handle increased throughput: CPU usage per server The SQL Select is transformed to distributed read operations that are also distributed to all nodes: Operations per second per server. I highlighted these basic metrics to illustrate that YugabyteDB extends beyond data partitioning. All layers — connections, SQL processing, data storage, reads and writes — are distributed. In Conclusion….Partitioning Levels in YugabyteDB If your PostgreSQL database grows to terabytes, consider partitioning into multiple tables or sharding across multiple databases, especially if all access patterns can utilize a single partitioning key for all indexes. To scale further without changing the application, migrating to YugabyteDB is a good option. If you do, try not to use many SQL partitions as in PostgreSQL and instead rely on the built-in distribution for simplified operations and faster execution. There are two levels of partitioning in YugabyteDB: DocDB automatic sharding, which transparently distributes and replicates table rows and index entries, by range or hash on their key YSQL declarative partitioning, which enables data isolation by location (data residency and geo-distribution) or time (lifecycle management by deleting old partitions) using PostgreSQL list or range partitioning. Those two levels can be combined, with each SQL partition distributed within the data placement constraints defined by the partition's tablespace. Additional Resources Postgres Partition Conversion: minimal downtime explains how difficult it is to partition an existing non-partitioned table and the limitations that change your primary key and indexing possibilities Postgres Partition Pains — LockManager Waits explains that queries on many partitions do not scale because they compete on lock acquisition at execution time.
Microsoft's SQL Server is a powerful RDBMS that is extensively utilized in diverse industries for the purposes of data storage, retrieval, and analysis. The objective of this article is to assist novices in comprehending SQL Server from fundamental principles to advanced techniques, employing real-world illustrations derived from nProbe data. nProbe is a well-known network traffic monitoring tool that offers comprehensive insights into network traffic patterns. Getting Started With SQL Server 1. Introduction to SQL Server SQL Server provides a comprehensive database management platform that integrates advanced analytics, robust security features, and extensive reporting capabilities. It offers support for a wide range of data types and functions, enabling efficient data management and analysis. 2. Installation Begin by installing SQL Server. Microsoft offers different editions, including Express, Standard, and Enterprise, to cater to varying needs. The Express edition is free and suitable for learning and small applications. Here is the step-by-step guide to install SQL Server. 3. Basic SQL Operations Learn the fundamentals of SQL, including creating databases, tables, and writing basic queries: Create database: `CREATE DATABASE TrafficData;` Create table: Define a table structure to store nProbe data: MS SQL CREATE TABLE NetworkTraffic ( ID INT IDENTITY(1,1) PRIMARY KEY, SourceIP VARCHAR(15), DestinationIP VARCHAR(15), Packets INT, Bytes BIGINT, Timestamp DATETIME ); Intermediate SQL Techniques 4. Data Manipulation Inserting Data To insert data into the `NetworkTraffic` table, you might collect information from various sources, such as network sensors or logs. MS SQL INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES ('10.0.0.1', '192.168.1.1', 150, 2048, '2023-10-01T14:30:00'); Batch insert to minimize the impact on database performance: MS SQL INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES ('10.0.0.2', '192.168.1.2', 50, 1024, '2023-10-01T15:00:00'), ('10.0.0.3', '192.168.1.3', 100, 1536, '2023-10-01T15:05:00'), ('10.0.0.4', '192.168.1.4', 200, 4096, '2023-10-01T15:10:00'); Updating Data You may need to update records as new data becomes available or corrections are necessary. For instance, updating the byte count for a particular traffic record: MS SQL UPDATE NetworkTraffic SET Bytes = 3072 WHERE ID = 1; Update multiple fields at once: MS SQL UPDATE NetworkTraffic SET Packets = 180, Bytes = 3072 WHERE SourceIP = '10.0.0.1' AND Timestamp = '2023-10-01T14:30:00'; Deleting Data Removing data is straightforward but should be handled with caution to avoid accidental data loss. MS SQL DELETE FROM NetworkTraffic WHERE Timestamp < '2023-01-01'; Conditional delete based on network traffic analysis: MS SQL DELETE FROM NetworkTraffic WHERE Bytes < 500 AND Timestamp BETWEEN '2023-01-01' AND '2023-06-01'; Querying Data Simple Queries: Retrieve basic information from your data set. MS SQL SELECT * FROM NetworkTraffic WHERE SourceIP = '10.0.0.1'; Select specific columns: MS SQL SELECT SourceIP, DestinationIP, Bytes FROM NetworkTraffic WHERE Bytes > 2000; Aggregate Functions Useful for summarizing or analyzing large data sets.
MS SQL SELECT AVG(Bytes), MAX(Bytes), MIN(Bytes) FROM NetworkTraffic WHERE Timestamp > '2023-01-01'; Grouping data for more detailed analysis: MS SQL SELECT SourceIP, AVG(Bytes) AS AvgBytes FROM NetworkTraffic GROUP BY SourceIP HAVING AVG(Bytes) > 1500; Join Operations In scenarios where you have multiple tables, joins are essential. Assume another table `IPDetails` that stores additional information about each IP. MS SQL SELECT n.SourceIP, n.DestinationIP, n.Bytes, i.Location FROM NetworkTraffic n JOIN IPDetails i ON n.SourceIP = i.IPAddress WHERE n.Bytes > 1000; Complex Queries Combining multiple SQL operations to extract in-depth insights. MS SQL SELECT SourceIP, SUM(Bytes) AS TotalBytes FROM NetworkTraffic WHERE Timestamp BETWEEN '2023-01-01' AND '2023-02-01' GROUP BY SourceIP ORDER BY TotalBytes DESC; Advanced SQL Server Features 5. Indexing for Performance Optimizing SQL Server performance through indexing and leveraging stored procedures for automation is critical for managing large databases efficiently. Here’s an in-depth look at both topics, with practical examples, particularly focusing on enhancing operations within a network traffic database like the one collected from nProbe. Why Indexing Matters Indexing is a strategy to speed up the retrieval of records from a database by reducing the number of disk accesses required when a query is processed. It is especially vital in databases with large volumes of data, where search operations can become increasingly slow. Types of Indexes Clustered indexes: Change the way records are stored in the database as they sort and store the data rows in the table based on their key values. Tables can have only one clustered index. Non-clustered indexes: Do not alter the physical order of the data, but create a logical ordering of the data rows and use pointers to physical rows; each table can have multiple non-clustered indexes. Example: Creating an Index on Network Traffic Data Suppose you frequently query the `NetworkTraffic` table to fetch records based on `SourceIP` and `Timestamp`. You can create a non-clustered index to speed up these queries: MS SQL CREATE NONCLUSTERED INDEX idx_networktraffic_sourceip ON NetworkTraffic (SourceIP, Timestamp); This index would particularly improve performance for queries that look up records by `SourceIP` and filter on `Timestamp`, as the index helps locate data quickly without scanning the entire table. Below are additional instructions on utilizing indexing effectively. 6. Stored Procedures and Automation Benefits of Using Stored Procedures Stored procedures help in encapsulating SQL code for reuse and automating routine operations. They enhance security, reduce network traffic, and improve performance by minimizing the amount of information sent to the server. Example: Creating a Stored Procedure Imagine you often need to insert new records into the `NetworkTraffic` table. 
A stored procedure that encapsulates the insert operation can simplify the addition of new records: MS SQL CREATE PROCEDURE AddNetworkTraffic @SourceIP VARCHAR(15), @DestinationIP VARCHAR(15), @Packets INT, @Bytes BIGINT, @Timestamp DATETIME AS BEGIN INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES (@SourceIP, @DestinationIP, @Packets, @Bytes, @Timestamp); END; Using the Stored Procedure To insert a new record, instead of writing a full insert query, you simply execute the stored procedure: MS SQL EXEC AddNetworkTraffic @SourceIP = '192.168.1.1', @DestinationIP = '10.0.0.1', @Packets = 100, @Bytes = 2048, @Timestamp = '2024-04-12T14:30:00'; Automation Example: Scheduled Tasks SQL Server Agent can be used to schedule the execution of stored procedures. For instance, you might want to run a procedure that cleans up old records every night: MS SQL CREATE PROCEDURE CleanupOldRecords AS BEGIN DELETE FROM NetworkTraffic WHERE Timestamp < DATEADD(month, -1, GETDATE()); END; You can schedule this procedure to run automatically at midnight every day using SQL Server Agent, ensuring that your database does not retain outdated records beyond a certain period. By implementing proper indexing strategies and utilizing stored procedures, you can significantly enhance the performance and maintainability of your SQL Server databases. These practices are particularly beneficial in environments where data volumes are large and efficiency is paramount, such as in managing network traffic data for IFC systems. 7. Performance Tuning and Optimization Performance tuning and optimization in SQL Server are critical aspects that involve a systematic review of database and system settings to improve the efficiency of your operations. Proper tuning not only enhances the speed and responsiveness of your database but also helps in managing resources more effectively, leading to cost savings and improved user satisfaction. Key Areas for Performance Tuning and Optimization 1. Query Optimization Optimize queries: The first step in performance tuning is to ensure that the queries are as efficient as possible. This includes selecting the appropriate columns, avoiding unnecessary calculations, and using joins effectively. Query profiling: SQL Server provides tools like SQL Server Profiler and Query Store that help identify slow-running queries and bottlenecks in your SQL statements. Example: Here’s how you can use the Query Store to find performance issues: MS SQL SELECT TOP 10 qt.query_sql_text, rs.avg_duration FROM sys.query_store_query_text AS qt JOIN sys.query_store_query AS q ON qt.query_text_id = q.query_text_id JOIN sys.query_store_plan AS qp ON q.query_id = qp.query_id JOIN sys.query_store_runtime_stats AS rs ON qp.plan_id = rs.plan_id ORDER BY rs.avg_duration DESC; 2. Index Management Review and adjust indexes: Regularly reviewing the usage and effectiveness of indexes is vital. Unused indexes should be dropped, and missing indexes should be added where significant performance gains can be made. Index maintenance: Rebuilding and reorganizing indexes can help in maintaining performance, especially in databases with heavy write operations. Example: Rebuild an index using T-SQL: MS SQL ALTER INDEX ALL ON dbo.YourTable REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = OFF); 3. Database Configuration and Maintenance Database settings: Adjust database settings such as recovery model, file configuration, and buffer management to optimize performance.
Routine maintenance: Implement regular maintenance plans that include updating statistics, checking database integrity, and cleaning up old data. Example: Set up a maintenance plan in SQL Server Management Studio (SSMS) using the Maintenance Plan Wizard. 4. Hardware and Resource Optimization Hardware upgrades: Sometimes, the best way to achieve performance gains is through hardware upgrades, such as increasing memory, adding faster disks, or upgrading CPUs. Resource allocation: Ensure that the SQL Server has enough memory and CPU resources allocated, particularly in environments where the server hosts multiple applications. Example: Configure maximum server memory: MS SQL EXEC sp_configure 'max server memory', 4096; RECONFIGURE; 5. Monitoring and Alerts System monitoring: Continuous monitoring of system performance metrics is crucial. Tools like System Monitor (PerfMon) and Dynamic Management Views (DMVs) in SQL Server provide real-time data about system health. Alerts setup: Configure alerts for critical conditions, such as low disk space, high CPU usage, or blocking issues, to ensure that timely actions are taken. Example: Set up an alert in SQL Server Agent: MS SQL USE msdb ; GO EXEC dbo.sp_add_alert @name = N'High CPU Alert', @message_id = 0, @severity = 0, @enabled = 1, @delay_between_responses = 0, @include_event_description_in = 1, @notification_message = N'SQL Server CPU usage is high.', @performance_condition = N'SQLServer:SQL Statistics|Batch Requests/sec|_Total|>|1000', @job_id = N'00000000-1111-2222-3333-444444444444'; GO Performance tuning and optimization is an ongoing process, requiring regular adjustments and monitoring. By systematically addressing these key areas, you can ensure that your SQL Server environment is running efficiently, effectively supporting your organizational needs. Conclusion Mastering SQL Server is a journey that evolves with practice and experience. Starting from basic operations to leveraging advanced features, SQL Server provides a powerful toolset for managing and analyzing data. As your skills progress, you can handle larger datasets like those from nProbe, extracting valuable insights and improving your network's performance and security. For those looking to dive deeper, Microsoft offers extensive documentation and a community rich with resources to explore more complex SQL Server capabilities. Useful References nProbe SQL Server SQL server performance tuning
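Returning to the index-management advice above, a practical first check is how the existing indexes are actually used. Here is a hedged sketch using the standard index-usage DMV (the view and column names are standard; the interpretation thresholds are up to you):
MS SQL
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY (s.user_seeks + s.user_scans + s.user_lookups) ASC;  -- rarely-read indexes first
Indexes that show many user_updates but almost no reads are the usual candidates for dropping.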