DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Low-Code Development: Leverage low and no code to streamline your workflow so that you can focus on higher priorities.

DZone Security Research: Tell us your top security strategies in 2024, influence our research, and enter for a chance to win $!

Launch your software development career: Dive head first into the SDLC and learn how to build high-quality software and teams.

Open Source Migration Practices and Patterns: Explore key traits of migrating open-source software and its impact on software development.

Related

  • Getting Started With Apache Cassandra
  • Essential Techniques for Performance Tuning in Snowflake
  • Hello YugabyteDB: Running Kong on the Distributed PostgreSQL Database
  • Build a Philosophy Quote Generator With Vector Search and Astra DB

Trending

  • How To Plan a (Successful) MuleSoft VPN Migration (Part II)
  • Benchmarking Java Streams
  • GBase 8a Implementation Guide: Performance Optimization
  • Leveraging Test Containers With Docker for Efficient Unit Testing
  1. DZone
  2. Data Engineering
  3. Databases
  4. Using Primary, Partition, and Clustering Keys in ScyllaDB (or Cassandra)

Using Primary, Partition, and Clustering Keys in ScyllaDB (or Cassandra)

Learn core data modeling concepts that are critical for optimizing performance in wide-column NoSQL databases.

By 
Guy Shtub user avatar
Guy Shtub
·
May. 20, 24 · Tutorial
Like (1)
Save
Tweet
Share
780 Views

Join the DZone community and get the full member experience.

Join For Free

In ScyllaDB and other NoSQL databases, the data model is based on the queries and not just around the domain entities. When creating the data model, we take into account both the conceptual data model and the application workflow: which queries will be performed by which users and how often.

One of the main goals of data modeling in ScyllaDB and other wide-column databases (e.g., Apache Cassandra) is to return results fast. To achieve that, you want:

  • Even data distribution: Data should be evenly spread across the cluster so that every node holds roughly the same amount of data. ScyllaDB determines which node should store the data based on hashing the partition key. Therefore, choosing a suitable partition key is crucial. More on this later on.
  • To minimize the number of partitions accessed in a read query: To make reads faster, we’d ideally have all the data required in a read query stored in a single Table. Although it’s fine to duplicate data across tables, in terms of performance, it’s better if the data needed for a read query is in one table.

Things you should NOT focus on:

  • Avoiding data duplication: To get efficient reads, we sometimes have to duplicate data. More about that and denormalization later in this lesson. In a later session, we learn how to avoid duplication in some cases using Secondary Indexes.
  • Minimizing the number of writes: writes in ScyllaDB aren’t free, but they are very efficient and “cheap.” ScyllaDB is optimized for high write throughput. Reads, while still very fast, are usually more expensive than writes and are harder to fine-tune. We’d usually be ready to increase the number of writes to increase read efficiency. Keep in mind that the number of tables also affects consistency. 

In this tutorial, you will use an example based on a Veterinary Clinic named 4Paws Clinic. In this clinic, each admitted animal has a connected heart rate monitor, which logs heart rate and other vital information every five seconds.

What’s a Primary Key?

A Primary Key is defined within a table. It is one or more columns used to identify a row. All tables must include a definition for a Primary Key. For example, consider this table:

CQL
 
CREATE TABLE heartrate_v1 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id)
);


In the example above the primary key is a single column – the pet_chip_id. If a Primary Key is made up of a single column, it is called a Simple Primary Key. 

For the above table perform the query:

 
SELECT * from heartrate_v1 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23;


What happens if we want to query our data by pet_chip_id but also by time? That is if our query is:

 
SELECT * from heartrate_v1 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time >='2019-03-04 07:01:00' AND time <='2019-03-04 07:02:00';


In that case, the above query won’t work. We can define the Primary Key to include more than one column, in which case it is called a Composite (or Compound) key. Create the following table:

CQL
 
CREATE TABLE heartrate_v2 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   PRIMARY KEY (pet_chip_id, time)
);


And insert some data:

 
INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:05', 100);

INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:10', 90); 

INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:50', 96); 

INSERT INTO heartrate_v2(pet_chip_id, time, heart_rate) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-04-04 07:01:50', 99);  


Enter Partition Keys and Clustering Keys

In the case shown above, the first part of the Primary Key is called the Partition Key (pet_chip_id in the above example) and the second part is called the Clustering Key (time).

A Primary Key is composed of 2 parts:

  • The Partition Key is responsible for data distribution across the nodes. It determines which node will store a given row. It can be one or more columns.

Partition key

  • The Clustering Key is responsible for sorting the rows within the partition. It can be zero or more columns.

Now execute the query we previously saw, according to time:

 
SELECT * from heartrate_v2 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time >='2019-03-04 07:01:00' AND time <='2019-03-04 07:02:00';


Additionally, we previously had an issue with heartrate_v1, where a pet could only have one heart rate value recorded regardless of the time. Now that we defined the time to be a part of the primary key, each primary key, which is a combination of pet_chip_id and time, can have a heart rate value.

Read the data for the same pet:

 
SELECT * from heartrate_v2 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 ;


We can see that as opposed to the previous example. This time the value wasn’t overwritten.

The Partition Key and the Clustering Keys With Multiple Columns

As we just saw, both the Partition Key and the Clustering Key can include more than one column, for example, if our query is:

 
SELECT * from heartrate_v3 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time ='2019-03-04 07:01:00' AND pet_name = 'Duke';


We could define the table as follows:

CQL
 
CREATE TABLE heartrate_v3 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   pet_name text,
   PRIMARY KEY ((pet_chip_id, time), pet_name)
);


Create the above table, then insert some data:

 
INSERT INTO heartrate_v3(pet_chip_id, time, heart_rate, pet_name) VALUES (123e4567-e89b-12d3-a456-426655440b23, '2019-03-04 07:01:10', 90, 'Duke'); 


In this case, the partition key includes two columns: pet_chip_id and time, and the clustering key is pet_name. Keep in mind that every query must include all columns defined in the partition key.

Now try to execute this query:

 
SELECT * from heartrate_v3 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23; 


It fails, as the entire partition key is not given.

Now try this query:

 
SELECT * from heartrate_v3 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND time ='2019-03-04 07:01:10' AND pet_name = 'Duke';  


It succeeds as the complete partition key is given.

Similarly, if we want each partition to be based on the pet_chip_id but want to be able to query according to pet_name and heart_rate:

 
SELECT * from heartrate_v4 WHERE pet_chip_id = 123e4567-e89b-12d3-a456-426655440b23 AND pet_name = 'Duke' AND heart_rate = 100; 


it is possible to define (do this):

CQL
 
CREATE TABLE heartrate_v4 (
   pet_chip_id uuid,
   time timestamp,
   heart_rate int,
   pet_name text,
   PRIMARY KEY (pet_chip_id, pet_name, heart_rate)
);


Note:

  • If there is more than one column in the Clustering Key (pet_name and heart_rate in the example above), the order of these columns defines the clustering order. For a given partition, all the rows are physically ordered inside ScyllaDB by the clustering order. This order determines what select query you can efficiently run on this partition.
  • In this example, the ordering is first by pet_name and then by heart_rate.
  • In addition to the Partition Key columns, a query may include the Clustering Key. If it does include the Clustering Key columns they must be used in the same order as they were defined.

Additional Data Modeling Exercises

If you want to continue learning about primary keys and other data modeling concepts, feel free to play around with these additional exercises and quizzes building on this same Veterinary Clinic example.

Apache Cassandra Data modeling Database clustering Partition (database)

Published at DZone with permission of Guy Shtub. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Getting Started With Apache Cassandra
  • Essential Techniques for Performance Tuning in Snowflake
  • Hello YugabyteDB: Running Kong on the Distributed PostgreSQL Database
  • Build a Philosophy Quote Generator With Vector Search and Astra DB

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: