Apache Storm: Architecture

Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use! Let's dive into its architecture.

By Ayush Tiwari · Updated Nov. 17, 17 · Tutorial

Our previous post, Apache Storm: The Hadoop of Real-Time, introduced Apache Storm. This post continues from there and covers the basic architecture of Apache Storm.


Components of a Storm Cluster

An Apache Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm, you run topologies. Jobs and topologies themselves are very different — one key difference being that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).

There are two kinds of nodes in a Storm cluster: the master node and the worker nodes.

1. Master Node (Nimbus)

The master node runs a daemon called Nimbus that is similar to Hadoop’s JobTracker. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.

Nimbus is an Apache Thrift service, which lets you submit code written in any programming language. You can keep using the language you are proficient in rather than learning a new one just to use Apache Storm.

Nimbus relies on Apache ZooKeeper to monitor the message-processing tasks: all worker nodes write their task status to ZooKeeper.

2. Worker Nodes (Supervisor)

Each worker node runs a daemon called the Supervisor. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
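
The Supervisor's start-and-stop behavior described above can be pictured as a reconciliation step. The following is a minimal sketch in plain Python, not Storm's actual implementation; the function name `reconcile` and the slot numbers are hypothetical and used only for illustration.

```python
def reconcile(assigned, running):
    """Compare what Nimbus assigned with what is running on this machine.

    Returns (to_start, to_stop): worker slots that must be started and
    worker slots that must be stopped so that `running` matches `assigned`.
    """
    assigned, running = set(assigned), set(running)
    to_start = sorted(assigned - running)
    to_stop = sorted(running - assigned)
    return to_start, to_stop


# Hypothetical worker slots, identified here by port number.
assigned_by_nimbus = {6700, 6701, 6703}
currently_running = {6700, 6702}

to_start, to_stop = reconcile(assigned_by_nimbus, currently_running)
print("start:", to_start, "stop:", to_stop)
```

Running the same comparison repeatedly is what makes the approach tolerant of restarts: the Supervisor does not need to remember what it did last time, only what is assigned and what is running now.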

All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster. Additionally, the Nimbus and Supervisor daemons are fail-fast and stateless. Statelessness has its own disadvantages, but it helps Storm process real-time data quickly, because a daemon can be killed and restarted without losing anything it needs.

Storm is not entirely stateless, though: it stores its state in Apache ZooKeeper. Because the state lives in ZooKeeper, a failed Nimbus can be restarted and resume from where it left off. Service monitoring tools can watch Nimbus and restart it if it fails.
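
To make the idea of a stateless daemon with external state concrete, here is a small sketch under stated assumptions. `StateStore` is an in-memory stand-in for ZooKeeper and `Master` is a toy stand-in for Nimbus; neither class mirrors a real Storm or ZooKeeper API.

```python
class StateStore:
    """In-memory stand-in for an external coordination service."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key, default=None):
        return self._data.get(key, default)


class Master:
    """Toy master daemon: it keeps no state of its own, only a store handle."""

    def __init__(self, store):
        self.store = store

    def assign(self, topology, workers):
        # Every decision is written to the external store immediately.
        self.store.write(("assignment", topology), list(workers))

    def assignment(self, topology):
        return self.store.read(("assignment", topology), [])


store = StateStore()
master = Master(store)
master.assign("word-count", ["worker-1", "worker-2"])

# Simulate a crash and restart: the old object is discarded, a new one
# is created against the same store and sees the same assignments.
del master
restarted = Master(store)
print(restarted.assignment("word-count"))
```

The design choice is that a restart costs nothing beyond reconnecting to the store, which is why a monitoring tool can simply restart the daemon on failure.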

Apache Storm also offers an advanced kind of topology called a Trident topology, which adds state maintenance and provides a high-level API similar to Pig.

Topologies

To do real-time computation on Storm, you create what are called topologies. A topology is a graph of computation and is implemented as a DAG (directed acyclic graph) data structure.

Each node in a topology contains processing logic (a spout or a bolt), and the links between nodes (streams) indicate how data is passed from one node to another.
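
As a rough illustration of a topology as a DAG, the sketch below models components as graph nodes and streams as edges, using Python's standard `graphlib` module (Python 3.9+). The component names are hypothetical, and this is a model of the data structure only, not the Storm API.

```python
from graphlib import CycleError, TopologicalSorter

# Each component maps to the set of components it receives a stream from.
topology = {
    "sentence-spout": set(),
    "split-bolt": {"sentence-spout"},
    "count-bolt": {"split-bolt"},
}


def processing_order(graph):
    """Return components so that each one appears after its upstream sources.

    Raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(graph).static_order())


print(processing_order(topology))

# A graph with a cycle is rejected, which is what "acyclic" rules out.
try:
    processing_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected")
```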

A Storm topology

When a topology is submitted to a Storm cluster, the Nimbus service on the master node consults the Supervisor services on the worker nodes and distributes the topology to them. Each Supervisor creates one or more worker processes, each running in its own JVM. Each worker process runs threads called executors.

An executor thread runs the actual computational tasks of a spout or a bolt.

Running a topology is straightforward. First, you package all your code and dependencies into a single JAR. Then, you run a command like the following: 

storm jar all-my-code.jar org.apache.storm.MyTopology arg1 arg2

A stream is an unbounded sequence of tuples, where a tuple is the unit of data: a named list of values.

A stream of tuples flows from a spout to one or more bolts, or from one bolt to other bolts. Stream groupings, such as shuffle grouping, fields grouping, and global grouping, let you define how tuples are distributed among the tasks of the receiving bolt.
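
The sketch below shows how a stream grouping might pick the receiving task for a tuple. It is a simplified model: tuples are plain dictionaries, tasks are numbered from 0, and the CRC32 hash is a stand-in chosen for this example rather than what Storm uses internally.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick a task at random so load is spread evenly."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, num_tasks, field):
    """Pick a task from the value of one field, so tuples with the
    same value for that field always reach the same task."""
    key = str(tup[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


def global_grouping(tup, num_tasks):
    """Send the whole stream to a single task (the lowest-numbered one)."""
    return 0


tup = {"word": "storm", "count": 1}
print("shuffle ->", shuffle_grouping(tup, 4))
print("fields  ->", fields_grouping(tup, 4, "word"))
print("global  ->", global_grouping(tup, 4))
```

Fields grouping is the one to reach for when a bolt keeps per-key state, such as a count per word, because every tuple for a given key lands on the same task.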

Spouts

A spout is the entry point of a Storm topology and represents the source of its data. Generally, a spout reads from an external source, such as a database, a distributed file system, a messaging framework, or a message queue like Kafka that supplies continuous data. It converts that data into a stream of tuples and emits them to bolts for the actual processing. Spouts run as tasks inside worker processes, executed by executor threads.

Spouts can broadly be classified as follows:

  • Reliable: These spouts have the ability to replay the tuples (a unit of data in the data stream). This helps applications achieve the "at least once message processing" semantic as, in case of failures, tuples can be replayed and processed again. Spouts for fetching data from messaging frameworks are generally reliable, as these frameworks provide a mechanism to replay the messages.
  • Unreliable: These spouts don’t have the ability to replay the tuples. Once a tuple is emitted, it cannot be replayed, regardless of whether it was processed successfully. This type of spout follows the "at most once message processing" semantic.
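
The difference between the two kinds of spout comes down to whether emitted tuples are remembered until they are acknowledged. Below is a minimal sketch of a reliable spout in plain Python. The method names `next_tuple`, `ack`, and `fail` are modeled loosely on the spout callbacks in Storm, but the class itself is hypothetical and uses an in-memory queue instead of a real message source.

```python
from collections import deque


class ReliableSpout:
    """Toy spout that replays a tuple when its processing fails."""

    def __init__(self, messages):
        # Each message gets an id so it can be tracked after emission.
        self.queue = deque(enumerate(messages))
        self.pending = {}  # msg_id -> payload, emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Fully processed: forget the tuple.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: put the tuple back so it is emitted again.
        payload = self.pending.pop(msg_id, None)
        if payload is not None:
            self.queue.append((msg_id, payload))


spout = ReliableSpout(["a", "b"])
first = spout.next_tuple()
spout.fail(first[0])          # "a" will be replayed later
second = spout.next_tuple()
spout.ack(second[0])
replayed = spout.next_tuple()
print(first, second, replayed)
```

An unreliable spout would be the same class without `pending`: once a tuple is emitted there is nothing left to replay, which gives at-most-once processing.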

Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering and functions to aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
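
The two-step transformation described above can be sketched as two small bolt-like classes chained together. This is an illustration only: the class names and the `execute` method are hypothetical, the input is a list of image URLs standing in for retweets, and the count is a simple running total rather than a true rolling (time-windowed) count.

```python
from collections import Counter


class CountBolt:
    """Step 1: keep a running count of retweets per image."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, image_url):
        self.counts[image_url] += 1
        return image_url, self.counts[image_url]


class TopXBolt:
    """Step 2: track the latest count per image and emit the top X."""

    def __init__(self, x):
        self.x = x
        self.latest = {}

    def execute(self, pair):
        image_url, count = pair
        self.latest[image_url] = count
        # Highest count first; ties broken by URL for a stable order.
        ranked = sorted(self.latest.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.x]


count_bolt = CountBolt()
top_bolt = TopXBolt(x=2)

retweets = ["a.png", "b.png", "a.png", "c.png", "a.png", "b.png"]
top = []
for url in retweets:
    top = top_bolt.execute(count_bolt.execute(url))
print(top)
```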

Bolts can also emit more than one stream.

What Makes a Running Topology: Worker Processes, Executors, and Tasks

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  1. Worker processes
  2. Executors (threads)
  3. Tasks

Here is a simple illustration of their relationships:

The relationships of worker processes, executors (threads) and tasks in Storm

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing — each spout or bolt that you implement in your code executes as many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks.

By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.
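
The relationship between tasks and executors can be illustrated with a small sketch. The function below spreads a fixed number of tasks over a number of executor threads; the round-robin placement is an assumption made for this example and is not necessarily how Storm lays tasks out.

```python
def assign_tasks(num_tasks, num_executors):
    """Distribute task ids over executors, enforcing #threads <= #tasks."""
    if not 1 <= num_executors <= num_tasks:
        raise ValueError("need 1 <= num_executors <= num_tasks")
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors


# Four tasks on two executors: each thread runs two tasks.
print(assign_tasks(4, 2))

# Default-like case: one task per executor.
print(assign_tasks(3, 3))
```

Because the task count is fixed for the lifetime of the topology, changing the executor count only changes how many tasks each thread runs, never how many tasks exist.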

This pretty much sums up the architecture of Apache Storm. I hope it was helpful!

  • References: http://storm.apache.org/releases/1.1.1/index.html

This article was first published on the Knoldus blog.

Published at DZone with permission of Ayush Tiwari, DZone MVB. See the original article here.
