
Feature Store - Why Do You Need One?

How to set up a data architecture that saves your data scientists time and effort.

By Roger Oriol · Aug. 22, 2022 · Analysis

A feature store is a storage system for features. Features are properties of data computed by an ETL process or feature pipeline: the pipeline takes raw data and derives a property from it, usually a numeric value, that is useful to a machine learning model. Finding adequate, correct, high-quality features matters because feature quality is the single biggest contributor to a model's success. The model uses these features either to train itself or to make predictions, and a feature store helps organize and serve them.
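
To make this concrete, here is a minimal sketch of a feature pipeline, assuming a hypothetical raw table of purchase events with customer_id, amount, and event_time columns; the paths and names are placeholders, not part of any particular platform:

    import pandas as pd

    # Hypothetical raw data: one row per purchase event
    raw = pd.read_parquet("raw/purchases.parquet")  # columns: customer_id, amount, event_time

    # Keep only the last 30 days of events, relative to the newest event in the data
    cutoff = raw["event_time"].max() - pd.Timedelta(days=30)
    recent = raw[raw["event_time"] >= cutoff]

    # Derive one numeric feature per customer: total spend over the window
    features = (
        recent.groupby("customer_id")["amount"]
              .sum()
              .rename("spend_30d")
              .reset_index()
    )

    # Persist the computed feature so a feature store can organize and serve it
    features.to_parquet("features/customer_spend_30d.parquet")

The output, one numeric value per customer, is exactly the kind of artifact a feature store is built to catalog and serve.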

At its core, a feature store is just a database; more precisely, it is usually two databases. There is an offline store built to hold large volumes of data, such as HBase or S3, and an online store built for fast serving, such as Cassandra. Features are organized into feature groups, which can be thought of as tables: features that are used together live in the same feature group so they can be read quickly and without joins. Many ETL processes (think Spark) write to the offline store, and data from the offline store is replicated to the online store to keep the two consistent. Data streams can also write to both stores for fast access to real-time data.
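
As an illustration of that split, the sketch below writes one hypothetical feature group to an offline Parquet path and upserts the latest value per entity into Redis as the online store; the paths, key layout, and column names are all assumptions made for the example:

    import pandas as pd
    import redis

    # One feature group: a table of related features for the "customer" entity
    feature_group = pd.DataFrame({
        "customer_id": [1, 2],
        "spend_30d": [120.5, 33.0],
        "orders_30d": [4, 1],
        "event_time": pd.to_datetime(["2022-08-20", "2022-08-21"]),
    })

    # Offline store: append the full history (a local Parquet file stands in for S3/HBase here)
    feature_group.to_parquet("offline/customer_features/part-2022-08-21.parquet")

    # Online store: keep only the latest row per entity for fast, join-free reads
    r = redis.Redis(host="localhost", port=6379)
    latest = feature_group.sort_values("event_time").groupby("customer_id").tail(1)
    for row in latest.to_dict("records"):
        r.hset(
            f"customer_features:{row['customer_id']}",
            mapping={"spend_30d": float(row["spend_30d"]), "orders_30d": int(row["orders_30d"])},
        )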

Figure: the architecture of the Michelangelo Palette feature store developed at Uber.

In this article, I will lay out the advantages of including a feature store in your data architecture. Prescribing the same solution for every case without further thought is never the answer, but almost every data science team will benefit from having a feature store, even if it is small.

Reusable Features

The principal raison d'être of a feature store is to empower data scientists to reuse features. Building feature pipelines takes up roughly 80% of data scientists' time, so avoiding repeated feature engineering work results in a faster work cycle. One example of feature reuse is sharing features between training and inference: the features used for training are roughly the same as the features used to make a prediction. Another is reuse between teams or projects, since features related to core enterprise concepts tend to be used across different ML projects. To encourage reuse, features must be discoverable through the feature store.
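
With an open-source feature store such as Feast, reuse between training and inference looks roughly like the sketch below; the feature and entity names are hypothetical, and it assumes an already configured Feast repository with these features registered:

    import pandas as pd
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # assumes an existing Feast repository
    features = ["customer_features:spend_30d", "customer_features:orders_30d"]

    # Training: point-in-time correct historical values joined to labeled examples
    entity_df = pd.DataFrame({
        "customer_id": [1, 2],
        "event_timestamp": pd.to_datetime(["2022-08-01", "2022-08-15"]),
    })
    training_df = store.get_historical_features(entity_df=entity_df, features=features).to_df()

    # Inference: the same feature list, read from the online store
    online_features = store.get_online_features(
        features=features,
        entity_rows=[{"customer_id": 1}],
    ).to_dict()

Because both calls reference the same feature list, the values used to train the model and the values served at prediction time come from a single definition.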

Feature Consistency

Another benefit of centralizing features in a single feature store is feature consistency. Different data science teams might calculate similar features in slightly different ways. If those features really are the same concept, the data scientists will have to agree to unify them; then, if the calculation changes, it changes for every project that uses the feature. If they are different concepts, the data scientists will have to catalog them according to their separate quirks.

Point-in-Time Correctness

Feature stores also enable point-in-time correctness. The online store always has the latest value of a feature, while the offline store keeps every historical value the feature has had at any point. This lets data scientists work with old values, aggregate over time ranges, and so on. It also ensures the reproducibility of a model: at any point, we can recover the data used in a past training run or a past inference to debug the model.
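
The mechanism behind this is a point-in-time (as-of) join. The sketch below shows the idea with plain pandas rather than any specific feature store, using invented customer data: each training label is matched with the latest feature value known at its timestamp, so no future information leaks into training:

    import pandas as pd

    # Historical feature values (as kept in the offline store)
    features = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2022-07-01", "2022-08-01", "2022-07-15"]),
        "spend_30d": [80.0, 120.5, 33.0],
    }).sort_values("event_time")

    # Labeled events we want to train on
    labels = pd.DataFrame({
        "customer_id": [1, 2],
        "event_time": pd.to_datetime(["2022-08-10", "2022-08-10"]),
        "churned": [0, 1],
    }).sort_values("event_time")

    # As-of join: for each label, take the latest feature value at or before its timestamp
    training = pd.merge_asof(
        labels, features, on="event_time", by="customer_id", direction="backward"
    )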

Data Health

One can also generate statistics from the feature store to monitor the health of the data. If the data drifts (its statistical distribution shifts over time), that can be detected automatically in the pipeline. Statistics can also help explain how a feature affects the predictions of each model.
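
As a small, hedged example of such a check (the feature name, file paths, and threshold are arbitrary), one can compare a reference window against recent data with a two-sample Kolmogorov-Smirnov test:

    import pandas as pd
    from scipy.stats import ks_2samp

    # Reference window (e.g., what the model was trained on) vs. the latest data
    reference = pd.read_parquet("offline/customer_features/2022-07.parquet")["spend_30d"]
    current = pd.read_parquet("offline/customer_features/2022-08.parquet")["spend_30d"]

    statistic, p_value = ks_2samp(reference, current)
    if p_value < 0.01:  # arbitrary threshold; tune it to your tolerance for false alarms
        print(f"Possible drift in spend_30d (KS statistic={statistic:.3f})")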

Data Lineage

Using the catalog of features and models, you can draw a data lineage graph. The lineage shows the data sources used to create each feature, as well as the models and other feature pipelines that consume it. This graph makes it easy to debug data problems: it becomes trivial to track down where a piece of data came from and how it is being used.
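
As a toy illustration (the catalog entries are invented), lineage can be modeled as two mappings, sources to features and features to models, so that tracing a model back to its raw data is a simple traversal:

    # Hypothetical catalog: which sources feed each feature, and which features feed each model
    feature_sources = {
        "spend_30d": ["raw.purchases"],
        "orders_30d": ["raw.purchases"],
        "support_tickets_90d": ["raw.tickets"],
    }
    model_features = {
        "churn_model": ["spend_30d", "support_tickets_90d"],
    }

    def upstream_sources(model):
        """Trace a model back to the raw data sources it ultimately depends on."""
        return {src for feat in model_features[model] for src in feature_sources[feat]}

    print(upstream_sources("churn_model"))  # {'raw.purchases', 'raw.tickets'}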

Online Store

In some use cases, an ML model has a low-latency requirement. For example, if a model is called as part of an API request, the user expects a response within a few seconds, which requires very fast access to features. Instead of recalculating them on every request, we can read the precalculated features from the online store, which always holds the latest value of each feature and is optimized for sub-second queries.
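
In a request handler, the feature lookup then reduces to a single key read against the online store. The sketch below assumes the same hypothetical Redis key layout as the earlier sketches; it is not the API of any particular feature store product:

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def features_for_request(customer_id):
        """Fetch precalculated features with a single hash read instead of recomputing them."""
        return r.hgetall(f"customer_features:{customer_id}")

    # The model consumes these values directly inside the API handler
    features = features_for_request(1)  # e.g. {'spend_30d': '120.5', 'orders_30d': '4'}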

Don't use a feature store if you don't have to. But if your organization has a medium-sized ML team or several ML teams, or it has any of the needs described above, consider introducing a feature store. It will only benefit your data science teams in the long run.

How To Start Using a Feature Store Now?

You can build a feature store by putting together your own components, as Uber did with Michelangelo. You could use Hive for the offline store, Cassandra and Redis for the online store, Kafka for streaming real-time data, and a Spark cluster to run the ETL processes.
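
As a rough sketch of the batch leg of such a do-it-yourself setup (database, table, and path names are placeholders), a Spark job could compute a feature group and append it to a Hive-backed offline store, leaving a separate job or Kafka consumer to push the latest rows to the online store:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("feature-etl").enableHiveSupport().getOrCreate()

    # Read raw events from the data lake and compute one feature group
    purchases = spark.read.parquet("s3://datalake/raw/purchases/")
    features = (
        purchases
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("spend_30d"),
             F.count(F.lit(1)).alias("orders_30d"))
        .withColumn("event_time", F.current_timestamp())
    )

    # Offline store: a Hive table that accumulates the full feature history
    features.write.mode("append").saveAsTable("feature_store.customer_features")

On the other hand, you can also rely on feature stores that other people have already built. You can choose an open-source solution and host it yourself. Some open-source solutions are: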

  • Feast: a minimal feature store that lacks some capabilities, such as an ETL system and data lineage. Feast integrates with tools from GCP (BigQuery as the offline store, Datastore as the online store) and AWS (Redshift, DynamoDB). It also integrates with cloud-agnostic tools such as Snowflake, Redis, and Kafka.
  • Hopsworks: a very complete feature store. It includes a model registry, multi-tenant governance, data lineage, and much more. It can be deployed on GCP, AWS, Azure, or on premises. This is possible because Hopsworks provides its own technology instead of integrating with external services the way Feast does. Hopsworks is deployed on a Kubernetes cluster that includes a RonDB database for the online store and integrates with S3 or equivalent object-storage buckets for the offline store.

You can also choose a SaaS tool instead of an open-source one. Some examples include:

  • Databricks Feature Store: integrated into the Databricks Lakehouse Platform, so it is a good fit if you are already using Databricks as your ML platform. It uses Delta Lake as the offline store and can be paired with AWS DynamoDB, AWS RDS, or AWS Aurora as the online store.
  • SageMaker Feature Store: a fully managed feature store from AWS. It uses S3 as the offline store and DynamoDB as the online store. It integrates with the other tools in the SageMaker environment and with data sources within AWS such as Redshift, Athena, and S3.
  • Vertex AI Feature Store: a feature store fully managed by Google on GCP. It uses BigQuery as the offline store and Bigtable as the online store. It integrates with the other tools in the Vertex AI environment and with BigQuery and GCS as data sources.

