
Flexible Data Generation With Datafaker Gen

In this article, explore Datafaker Gen, a command-line tool designed to generate realistic data in various formats and sink to different destinations.

By Roman Rybak · Feb. 22, 24 · Tutorial

Introduction to Datafaker

Datafaker is a modern framework that enables JVM programmers to efficiently generate fake data for their projects, using more than 200 data providers that allow for quick setup and usage. Custom providers can be written when you need domain-specific data. In addition to providers, the generated data can be exported to popular formats such as CSV, JSON, SQL, XML, and YAML.
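
For a quick feel for the basic API, here is a minimal sketch (illustrative only, not part of the article's project) that pulls values from a few standard providers such as Name, Internet, and Address:

Java
 
import net.datafaker.Faker;

public class DatafakerIntro {

    public static void main(String[] args) {
        Faker faker = new Faker();

        // Each provider exposes domain-specific fake data.
        String fullName = faker.name().fullName();
        String email = faker.internet().emailAddress();
        String city = faker.address().city();

        System.out.printf("%s <%s> from %s%n", fullName, email, city);
    }
}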

For a good introduction to the basic features, see "Datafaker: An Alternative to Using Production Data."

Datafaker offers many features, such as working with sequences and collections and generating custom objects based on schemas (see "Datafaker 2.0").
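
For example, the collection support can be combined with providers to build a list of fake values. The snippet below is a rough sketch based on the collection builder shown in the Datafaker documentation; exact method names may vary slightly between versions:

Java
 
import java.util.List;

import net.datafaker.Faker;

public class CollectionSketch {

    public static void main(String[] args) {
        Faker faker = new Faker();

        // Build a list of 3 to 5 fake first or last names.
        List<String> names = faker.collection(
                        () -> faker.name().firstName(),
                        () -> faker.name().lastName())
                .len(3, 5)
                .generate();

        names.forEach(System.out::println);
    }
}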

Bulk Data Generation

In software development and testing, the need to generate data frequently arises, whether to conduct non-functional tests or to simulate burst loads. Let's consider a straightforward scenario in which we need to generate 10,000 messages in JSON format and send them to RabbitMQ.

From my perspective, these options are worth considering:

  1. Developing your own tool: One option is to write a custom application from scratch to generate these records (messages). If the generated data needs to be realistic, it makes sense to use Datafaker or JavaFaker.
  2. Using specific tools: Alternatively, we could pick a tool designed for a particular database or message broker. For example, voluble provides specialized functionality for generating and publishing messages to Kafka topics, while the more modern ShadowTraffic, which is still under development, takes a container-based approach that may not always be necessary.
  3. Datafaker Gen: Finally, we have the option to use Datafaker Gen, which I want to explore in this article.

Datafaker Gen Overview

Datafaker Gen offers a command-line generator based on the Datafaker library, which allows for the continuous generation of data in various formats and integration with different storage systems, message brokers, and backend services. Since the tool builds on Datafaker, the generated data can be quite realistic. The schema, format type, and sink can all be configured without rebuilding the project.

Datafaker Gen consists of the following main components that can be configured:

1. Schema Definition

Users can define the schema for their records in the config.yaml file. The schema specifies the field definitions of the record based on Datafaker providers. It also allows for the definition of embedded (nested) fields.

YAML
 
default_locale: en-EN
fields:
  - name: lastname
    generators: [ Name#lastName ]
  - name: firstname
    generators: [ Name#firstName ]


2. Format

Datafaker Gen allows users to specify the format in which records will be generated. Currently, there are basic implementations for the CSV, JSON, SQL, XML, and YAML formats. Additionally, formats can be extended with custom implementations. The configuration for formats is specified in the output.yaml file.

YAML
 
formats:
  csv:
    quote: "@"
    separator: $$$$$$$
  json:
    formattedAs: "[]"
  yaml:
  xml:
    pretty: true


3. Sink

The sink component determines where the generated data will be stored or published. The basic implementations include command-line output and text file sinks. Additionally, sinks can be extended with custom implementations such as RabbitMQ, as demonstrated in this article. The configuration for sinks is specified in the output.yaml file.

YAML
 
sinks:
  rabbitmq:
    batchsize: 1 # 1 = one document per message; >1 = a batch of documents per message
    host: localhost
    port: 5672
    username: guest
    password: guest
    exchange: test.direct.exchange
    routingkey: products.key


Extensibility via Java SPI

Datafaker Gen uses the Java SPI (Service Provider Interface) to make it easy to add new formats or sinks. This extensibility allows Datafaker Gen to be customized to specific requirements.
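
As a rough illustration of the mechanism (this is not Datafaker Gen's actual loading code), SPI implementations are typically discovered at runtime with java.util.ServiceLoader, which reads the provider configuration files from the classpath:

Java
 
import java.util.ServiceLoader;

import net.datafaker.datafaker_gen.sink.Sink;

public class SinkDiscoverySketch {

    public static void main(String[] args) {
        // ServiceLoader scans META-INF/services/net.datafaker.datafaker_gen.sink.Sink
        // on the classpath and instantiates every implementation listed there.
        ServiceLoader<Sink> sinks = ServiceLoader.load(Sink.class);
        for (Sink sink : sinks) {
            System.out.println("Discovered sink: " + sink.getName());
        }
    }
}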

How To Add a New Sink in Datafaker Gen

Before adding a new sink, you may want to check whether it already exists in the datafaker-gen-examples repository. If it does not, you can use the examples there as a reference for adding a new one.

When it comes to extending Datafaker Gen with new sink implementations, developers have two primary options to consider:

  1. Use this parent project: developers can implement the sink interface for their extension, similar to the examples available in the datafaker-gen-examples repository.
  2. Include the dependencies from the Maven repository to access the required interfaces. For this approach, Datafaker Gen has to be built and installed into the local Maven repository first. This approach provides more flexibility in project structure and requirements.

1. Implementing RabbitMQ Sink

To add a new RabbitMQ sink, one simply needs to implement the net.datafaker.datafaker_gen.sink.Sink interface.

This interface contains two methods:

  1. getName - This method defines the sink name.
  2. run - This method triggers the generation of records and then sends or saves all of them to the specified destination. Its parameters include the configuration specific to this sink (retrieved from the output.yaml file), the data generation function, and the desired number of lines to be generated.
Java
 
import java.util.Map;
import java.util.function.Function;

import com.google.gson.JsonArray;
import com.google.gson.JsonParser;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import net.datafaker.datafaker_gen.sink.Sink;

public class RabbitMqSink implements Sink {

    @Override
    public String getName() {
        return "rabbitmq";
    }

    @Override
    public void run(Map<String, ?> config, Function<Integer, ?> function, int numberOfLines) {
        // Read the output configuration specific to this sink (from output.yaml)
        int numberOfLinesToPrint = numberOfLines;
        String host = (String) config.get("host");
        int port = (int) config.get("port");
        String username = (String) config.get("username");
        String password = (String) config.get("password");
        String exchange = (String) config.get("exchange");
        String routingKey = (String) config.get("routingkey");

        // Generate the lines in the configured format (JSON here)
        String lines = (String) function.apply(numberOfLinesToPrint);

        // Send or save the results to the expected resource.
        // In this case, connect to RabbitMQ and publish the messages.
        // getConnectionFactory is a small helper (not shown here) that builds a ConnectionFactory.
        ConnectionFactory factory = getConnectionFactory(host, port, username, password);
        try (Connection connection = factory.newConnection()) {
            Channel channel = connection.createChannel();
            JsonArray jsonArray = JsonParser.parseString(lines).getAsJsonArray();
            jsonArray.forEach(jsonElement -> {
                try {
                    channel.basicPublish(exchange, routingKey, null, jsonElement.toString().getBytes());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}


2. Adding Configuration for the New RabbitMQ Sink

As previously mentioned, the configuration for sinks or formats can be added to the output.yaml file. The specific fields may vary depending on your custom sink. Below is an example configuration for a RabbitMQ sink:

YAML
 
sinks:
  rabbitmq:
    batchsize: 1 # 1 = one document per message; >1 = a batch of documents per message
    host: localhost
    port: 5672
    username: guest
    password: guest
    exchange: test.direct.exchange
    routingkey: products.key


3. Adding Custom Sink via SPI

Adding a custom sink via SPI (Service Provider Interface) involves including the provider configuration in the ./resources/META-INF/services/net.datafaker.datafaker_gen.sink.Sink file. This file contains the fully qualified class names of the sink implementations:

Properties files
 
net.datafaker.datafaker_gen.sink.RabbitMqSink


Those are the three simple steps needed to extend Datafaker Gen. This example does not show a complete implementation of the sink or how to use the additional libraries. For the complete implementation, refer to the datafaker-gen-rabbitmq module in the example repository.

How To Run

Step 1

Build a JAR file based on the new implementation:

Shell
 
./mvnw clean verify


Step 2

Define the schema for records in the config.yaml file and place this file in the appropriate location where the generator should run. Additionally, define the sinks and formats in the output.yaml file, as demonstrated previously.

Step 3

Datafaker Gen can be executed in two ways:

  1. Use the bash script from the bin folder in the parent project:

    Shell
     
    # Format json, number of lines 10000 and new RabbitMq Sink
    bin/datafaker_gen -f json -n 10000 -sink rabbitmq


2. Execute the JAR directly, like this:

Shell
 
java -cp [path_to_jar] net.datafaker.datafaker_gen.DatafakerGen -f json -n 10000 -sink rabbitmq


How Fast Is It?

The test was based on the schema described above, which means that each document consists of two fields. Documents are published one by one to the RabbitMQ queue in JSON format. The table below shows the timings for 10,000, 100,000, and 1M records on my local machine:

Records      Time
10,000       401 ms
100,000      11,613 ms
1,000,000    121,601 ms

Conclusion

The Datafaker Gen tool enables the creation of flexible and fast data generators for various types of destinations. Built on Datafaker, it facilitates realistic data generation. Developers can easily configure the content of records, the formats, and the sinks to suit their needs. As a simple Java application, it can be deployed anywhere you want, whether in Docker or on on-premises machines.

  • The full source code is available here.
  • I would like to thank Sergey Nuyanzin for reviewing this article.

Thank you for reading, and I am glad to be of help.


