As the volume of data increases exponentially and queries become more complex, relationships become a critical component of data analysis. In turn, specialized solutions such as graph databases, which explicitly optimize for relationships, are needed. Other databases aren't designed to search and query data based on the intricate relationships found in complex data structures. Graph databases are optimized to handle connected data by modeling the information as a graph, which maps data through nodes and relationships.

With this article, readers will traverse a beginner's guide to graph databases, their terminology, and comparisons with relational databases. They will also explore graph databases ranging from cloud offerings like AWS Neptune to open-source solutions. Additionally, this article can help develop a better understanding of how graph databases are useful for applications such as social network analysis, fraud detection, knowledge graphs, social media analytics, and many other areas.

What Is a Graph Database?

A graph database is a purpose-built NoSQL database specializing in data structured as a complex network of relationships, where entities and their relationships are interconnected. Data is modeled using graph structures, and the essential elements of this structure are nodes, which represent entities, and edges, which represent the relationships between entities. Both nodes and edges can carry attributes.

Critical Components of Graph Databases

Nodes: These are the primary data elements representing entities such as people, businesses, accounts, or any other item you might find in a database. Each node can store a set of key-value pairs as properties.
Edges: Edges are the lines that connect nodes, defining their relationships. Like nodes, edges can have properties – such as weight, type, or strength – that clarify the relationship.
Properties: Nodes and edges can each have properties used to store metadata about those objects, such as names, dates, or any other descriptive attributes relevant to a node or edge.

How Graph Databases Store and Process Data

In a graph database, nodes and relationships are treated as first-class citizens – in contrast to relational databases, where data is stored in tabular form and relationships are computed at query time. This lets graph databases treat relationships as having as much value as the data itself, which enables faster traversal of connected data. With their traversal algorithms, graph databases can explore the relationships between nodes and edges to answer complicated queries such as shortest path, fraud detection, or network analysis. Graph-specific query languages – Neo4j's Cypher and Apache TinkerPop's Gremlin – enable these operations by focusing on pattern matching and deep-link analytics.

Practical Applications and Benefits

Graph databases shine in any application where the relationships between data points are essential, such as web and social networks, recommendation engines, and a whole host of other apps where it's necessary to know how deep and wide the relationships go. In areas such as fraud detection and network security, it's essential to adjust and adapt dynamically; this is something graph databases do very well. In conclusion, graph databases offer a solid infrastructure for working with complex, highly connected data.
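Before comparing graph databases with other systems, here is a minimal sketch of the node/edge model in practice, using the official Neo4j Python driver (5.x API). The connection details, labels, and property names are illustrative assumptions, not taken from any particular deployment.

Python
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def create_friendship(tx, name1, name2):
    # Two Person nodes (entities) connected by a KNOWS edge (relationship) with a property.
    tx.run(
        "MERGE (a:Person {name: $name1}) "
        "MERGE (b:Person {name: $name2}) "
        "MERGE (a)-[:KNOWS {since: 2024}]->(b)",
        name1=name1, name2=name2,
    )

def hops_between(tx, name1, name2):
    # A traversal query: shortest path between two people, up to 5 hops.
    record = tx.run(
        "MATCH p = shortestPath((a:Person {name: $name1})-[:KNOWS*..5]-(b:Person {name: $name2})) "
        "RETURN length(p) AS hops",
        name1=name1, name2=name2,
    ).single()
    return record["hops"] if record else None

with driver.session() as session:
    session.execute_write(create_friendship, "Alice", "Bob")
    print(session.execute_read(hops_between, "Alice", "Bob"))

driver.close()

The point of the sketch is that the relationship (KNOWS) is stored explicitly and traversed directly, rather than being reconstructed through joins at query time.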
Graph databases offer many advantages over relational databases when it comes to modeling relationships and the interactions between data.

Key Components and Terminology

Nodes and Their Properties: Nodes are the basic building blocks of a graph database. They typically represent an object or a specific instance, be it a person, place, or thing; each node corresponds to a vertex in the graph structure. A node can also carry several properties, each a key-value pair whose value expands on or further describes the object; the content depends on the application of the graph database.
Edges: Defining Relationships: Edges are the links that tie nodes together. They are directional, with a start node and an end node, thus defining the flow from one node to another. Edges also define the nature of the relationship – whether it is organizational or social, for example.
Labels: Organizing Nodes: Labels group nodes that share similarities (Person nodes, Company nodes, etc.) so that graph databases can retrieve sets of nodes more quickly. For example, in a social network analysis, Person and Company nodes might be grouped using labels.
Relationships and Their Characteristics: Relationships connect nodes, but they can also have properties, such as strength, status, or duration, that describe how the relationship differs between nodes.

Graph Query Languages: Cypher and Gremlin

Graph databases require specialized languages to work with their often complicated structure, and these languages differ between graph databases. Cypher, used with Neo4j, is a declarative, pattern-based language. Gremlin, used with other graph databases, is more procedural and can traverse more complex graph structures. Both languages are expressive and powerful, capable of queries that would be veritable nightmares to write in the languages used with traditional databases.

Tools for Managing and Exploring Graph Data

Neo4j offers a suite of tools designed to enhance the usability of graph databases:

Neo4j Bloom: Explore graph data visually without using a graph query language.
Neo4j Browser: A web-based application for executing Cypher queries and visualizing the results.
Neo4j Data Importer and Neo4j Desktop: Tools for importing data into a Neo4j database and managing Neo4j database instances, respectively.
Neo4j Ops Manager: Useful for managing multiple Neo4j instances so that large-scale deployments can be managed and optimized.
Neo4j Graph Data Science: A library that extends Neo4j with capabilities more commonly associated with data science, enabling sophisticated analytical tasks to be performed directly on graph data.

Equipped with these fundamental components and tools, users can wield the power of graph databases to handle complex data and make knowledgeable decisions based on networked knowledge systems.

Comparing Graph Databases With Other Databases

While graph and relational databases are both designed to store data and help us make sense of it, they fundamentally differ in how they accomplish this. Graph databases are built on the foundation of nodes and edges, making them uniquely fitted for dealing with complex relationships between data points: connected entities are represented through nodes and their relationships through edges.
Relational databases arrange data in rows and columns – tables – whereas graph databases use nodes and edges. This structural difference makes a direct comparison between the two kinds of databases instructive. Graph databases organize connected data naturally, whereas representing relationships between certain types of data points is not as easy in relational databases. After all, relational databases were invented to deal with transactions (i.e., a series of exchanges of rows and columns between two sides, such as a payment or refund between a seller and a customer).

Data Models and Scalability

Graph databases store data as a graph with nodes, edges, and properties. They are instrumental in domains with complex relationships, such as social networks or recommendation engines. At the opposite end of the spectrum, relational databases store data in tables, which is well suited for applications requiring high levels of data integrity, such as financial systems or customer relationship management. Graph databases also tend to favor horizontal scalability – growing with demand by adding more machines to a network – over the vertical scalability (adding more power to an existing machine) typical of a relational database.

Query Performance and Flexibility

Graph databases are generally much faster at executing complex queries with deep relationships because they traverse nodes and edges directly, unlike relational databases, which may have to perform many joins that slow down as the data set grows. In addition, graph databases excel in the ease with which the data model can be changed without severe consequences. As business requirements evolve and users learn more about how their data should interact, a graph database can be adapted more readily without costly redesigns. Relational databases, though better suited to providing strong transactional guarantees and ACID compliance, are less adept at model adjustments.

Use of Query Languages

The query languages also reflect the distinct nature of these databases. Graph databases tend to use a language tailored to the way a graph is traversed – such as Gremlin or Cypher – while relational databases have long been managed and queried through SQL, a well-established language for structured data.

Suitability for Different Data Types

Relational databases are well suited to handling large datasets with a regular and relatively simple structure. In contrast, graph databases shine in environments where the structures are highly interconnected and the relationships are as meaningful as the data. In conclusion, while graph and relational databases each have pros and cons, which one to use depends on the application's requirements. Graph databases are better for analyzing intricate and evolving relationships, which makes them ideal for modern applications that demand a detailed understanding of networked data.

Advantages of Graph Databases

Graph databases are renowned for their efficiency and flexibility, particularly when dealing with complex, interconnected data sets. Here are some of the key advantages they offer:

High Performance and Real-Time Data Handling

Performance is a huge advantage of graph databases. It comes from the ease, speed, and efficiency with which they can query linked data. Graph databases often beat relational databases at handling complex, connected data.
They are well suited to continual, real-time updates and queries, unlike, for example, Hadoop HDFS.

Enhanced Data Integrity and Contextual Awareness

By keeping connections intact across channels and data formats, graph databases maintain rich data relationships and allow that data to be easily linked. This structure surfaces nuance in interactions that humans could not otherwise discern, saving time and making the data more consumable. It gives users relevant insights to understand the data better and helps businesses make more informed decisions.

Scalability and Flexibility

Graph databases are designed to scale well. They can accommodate the incessant expansion of the underlying data and the constant evolution of the data schema without downtime. They can also scale in terms of the number of data sources they can link, and this linking can accommodate continuous evolution of the schema without interrupting service. They are, therefore, particularly well suited to environments in which rapid adaptation is essential.

Advanced Query Capabilities

Graph-based systems can quickly run powerful recursive path queries to retrieve direct ("one hop") and indirect ("two hops" to "twenty hops") connections, making complex subgraph pattern-matching queries easy to run. Complex group-by-aggregate queries (such as Netflix's tag aggregation) are also natively supported, allowing arbitrary flexibility in aggregating selective dimensions – useful in big-data setups with many dimensions such as time series, demographics, or geography.

AI and Machine Learning Readiness

Because graph databases naturally represent entities and their interrelations as a structured set of connections, they are especially well suited as foundational infrastructure for AI and machine learning: they support fast, real-time changes and rely on expressive, ergonomic declarative query languages that make deep-link traversal and scaling straightforward – features that are critical for next-generation data analytics and inference.

These advantages make graph databases a good fit for organizations that need to manage dataset relationships and efficiently draw meaningful insights from them.

Everyday Use Cases for Graph Databases

Graph databases are being adopted by more industries because they are particularly well suited to handling complex connections between data while keeping the whole system fast. Let's look at some of the most common uses for graph databases.

Financial and Insurance Services

The financial and insurance services sector increasingly uses graph databases to detect fraud and other risks. Modeling business events and customer data as a graph allows these systems to detect suspicious links between various entities, and the technique of Entity Link Analysis takes this a step further, allowing the detection of potential fraud in the interactions between different kinds of entities.

Infrastructure and Network Management

Graph databases are well suited for infrastructure mapping and keeping network inventories up to date. Serving up an interactive map of the network estate and performing network-tracing algorithms that walk across the graph is straightforward. Likewise, it becomes much easier to write new algorithms to identify problematic dependencies, vulnerable bottlenecks, or higher-order latency issues.
Recommendation Systems

Many companies – including major e-commerce giants like Amazon – use graph databases to power recommendation engines. These keep track of which products and services you have purchased and browsed in the past to suggest things you might like, improving customer experience and engagement.

Social Networking Platforms

Social networks such as Facebook, Twitter, and LinkedIn use graph databases to manage and query huge amounts of relational data about people, their relationships, and their interactions. This makes them very good at quickly navigating vast social networks, finding influential users, detecting communities, and identifying key players.

Knowledge Graphs in Healthcare

Healthcare organizations assemble critical knowledge about patient profiles, past ailments, and treatments in knowledge graphs, while graph queries run on graph databases identify patient patterns and trends. These insights can positively influence how treatments proceed and how patients fare.

Complex Network Monitoring

Graph databases are used to model and monitor complex network infrastructures, including telecommunications networks and end-to-end cloud environments (data-center infrastructure spanning physical networking, storage, and virtualization). This application is crucial for the robustness and scalability of the systems and environments that form the backbone of modern information infrastructure.

Compliance and Governance

Organizations also use graph databases to manage data related to compliance and governance, such as access controls, data retention policies, and audit trails, to ensure they can continue to meet high standards of data security and regulatory compliance.

AI and Machine Learning

Graph databases are also essential for developing artificial intelligence and machine learning applications. They allow developers to create standardized means of storing and querying data for applications such as natural language processing, computer vision, and advanced recommendation systems, which is essential for making AI applications more intelligent and responsive.

Unraveling Financial Crimes

Graphs provide a way to trace the structure of shell corporate entities that criminals use to launder money, revealing whether the patterns of supplies to shell companies and cash flows from shell companies to other entities are suspicious. Such applications help law enforcement and regulatory agencies unravel complex money-laundering networks and fight financial crime.

Automotive Industry

In the automotive industry, graph queries help analyze the relationships between tens of thousands of car parts, enabling real-time interactive analysis that can improve manufacturing and maintenance processes.

Criminal Network Analysis

In law enforcement, graph databases are used to identify criminal networks, map address patterns, and identify critical links in criminal organizations so that operations can be brought down efficiently from all sides.

Data Lineage Tracking

Graph technology can also track data lineage – the details of where an item of data, such as a fact or number, was created, how it was copied, and where it was used. This is important for auditing and for verifying that data assets are not corrupted.

This diverse array of applications underscores the versatility of graph databases and their utility in representing and managing complex, interconnected data across many different fields.
Challenges and Considerations

Graph databases are built around modeling the structures of a specific domain, in a process resembling knowledge or ontology engineering, and this is a practical challenge that can require specialized "graph data engineers." These requirements raise important scalability questions and can limit the technology's appeal beyond its most committed proponents. Data inconsistency across the system remains a critical issue, since building homogeneous systems that maintain data consistency while preserving flexibility and expressivity is challenging.

While graph queries do not require as much coding as SQL, traversal paths across the data still have to be spelled out explicitly. This increases the effort needed to write queries and prevents graph queries from being as easily abstracted and reused as SQL code, impairing their generalization. Furthermore, because there is no unified standard for capabilities or query languages, developers invent their own – a further step toward API fragmentation. Another significant issue is knowing which machine is the best place to put a given piece of data: with all the subtle relationships between nodes, that decision is crucial to performance but hard to make on the fly. Moreover, many existing graph database systems were not architected for today's high data volumes, so they can become performance bottlenecks.

From a project management standpoint, failure to accurately capture and map business requirements to technical requirements often results in confusion and delay. Poor data quality, inadequate access to data sources, and verbose or time-consuming data modeling will magnify the pain of a graph data project. On the end-user side, asking people to learn new languages or skills in order to read graphs can deter adoption, while difficulty sharing those graphs or collaborating on the analysis will eventually limit the range and impact of the insights. The Windows 95 interface had an early advantage in the virtue of simplicity; the same story applies to graph technologies today. Adoption also suffers when the analysis process is perceived as too time-consuming.

From a technical perspective, managing large graphs – storing and querying complex structures – presents significant challenges. For example, the data must be distributed across a cluster of multiple machines, adding another level of complexity for developers. Data is typically sharded (split) into smaller parts and stored on various machines, coordinated by an "intelligent" virtual server that manages access control and queries across multiple shards.

Choosing the Right Graph Database

When selecting a graph database, it's crucial to consider the complexity of the queries and the interconnectedness of the data. A well-chosen graph database can significantly enhance the performance and scalability of data-driven applications.

Key Factors to Consider

Native graph storage and processing: Opt for databases designed from the ground up to handle graph data structures.
Property graphs and graph query languages: Ensure the database supports robust graph query languages and can handle property graphs efficiently.
Data ingestion and integration capabilities: The ability to seamlessly integrate and ingest data from various sources is vital for dynamic data environments.
Development tools and graph visualization: Tools that facilitate development and allow intuitive graph visualizations improve usability and insight.
Graph data science and analytics: Databases with advanced analytics and data science capabilities can provide deeper insights.
Support for OLTP, OLAP, and HTAP: Depending on the application, support for transactional (OLTP), analytical (OLAP), and hybrid (HTAP) processing may be necessary.
ACID compliance and system durability: Essential for ensuring data integrity and reliability in transaction-heavy environments.
Scalability and performance: The database should scale both vertically and horizontally to handle growing data loads.
Enterprise security and privacy features: Robust security features are crucial to protect sensitive data and ensure privacy.
Deployment flexibility: The database should match the organization's deployment strategy, whether on-premises or in the cloud.
Open-source foundation and community support: A strong community and open-source foundation can provide extensive support and flexibility.
Business and technology partnerships: Partnerships can offer additional support and integration options, enhancing the database's capabilities.

Comparing Popular Graph Databases

Dgraph: The most performant and scalable option for enterprise systems that need to handle massive amounts of fast-flowing data.
Memgraph: An open-source, in-memory database with a query language designed for real-time data and analytics.
Neo4j: Offers a comprehensive graph data science library and is well suited to static data storage and Java-oriented developers.

Each of these databases has its advantages: Memgraph is the strongest contender in the Python ecosystem (you can choose Python, C++, or Rust for your custom stored procedures), and Neo4j's managed solution offers the most control over your deployment in the cloud (its AuraDB service provides a lot of power and flexibility).

Community and Free Resources

Memgraph has a free community edition and a paid enterprise edition, and Neo4j has a community "Labs" edition, a free enterprise trial, and hosting services. These are all great ways for developers to get their feet wet without investing upfront.

In conclusion, choosing the proper graph database depends on understanding the realities of your project and the potential of the database you are selecting. If you bear this in mind, your organization will be able to use graph databases to their full potential to enhance its data infrastructure and insights.

Conclusion

Having navigated the expansive realm of graph databases, the hope is that you now know not only the basics of these databases – from nodes and edges to vertex storage and indexing – but also their applications across industries, including finance, government, and healthcare. This guide offers a comprehensive introduction to graph databases, catering to both newcomers and experienced practitioners in the database field. Every reader is now better prepared to take the next steps in understanding how graph databases work, how they compare against traditional and non-relational databases, and where they are used in the real world. We have seen that choosing a graph database requires careful consideration of the project's requirements and features.
The reflections and difficulties highlighted the importance of correct implementation and the advantage a graph database can bring in changing the way we process and look at data. The complexity and power of graph databases allow us to uncover new insights and compute more efficiently; in this way, new data management and analysis methods may be developed.
Synopsis

Many databases contain bitmaps stored as blobs or files: photos, document scans, medical images, etc. When these bitmaps are retrieved by various database clients and applications, it is sometimes desirable to uniquely watermark them as they are being retrieved, so that they can be identified later. In some cases, you may even want to make this watermark invisible. This kind of dynamic bitmap manipulation can easily be done by a programmable database proxy, without changing the persisted bitmaps. This approach has the following benefits:

The watermark can be customized for each retrieval and can contain information about the date, time, user identity, IP address, etc.
The image processing is done by the proxy, which puts no extra load on the database.
It requires no changes to the database or to the database clients.

The End Result

Given a bitmap stored in a database, a programmable database proxy can modify the bitmap on its way to the client to include a watermark containing any desired information.

How This Works

The architecture is simple: instead of the normal connection between database clients and servers, the clients connect to the proxy, and the proxy connects to the server. The proxy can then manipulate the bitmaps as needed when they are retrieved. For instance, it can watermark only some bitmaps, or it can use different styles of watermarks, depending on the circumstances. The bitmaps stored in the database are completely unaffected: they are modified on the fly as they are forwarded to the clients.

Advantages

The clients and the database are blissfully unaware – this is completely transparent to them.
Each image can be watermarked uniquely when it is retrieved (e.g., date/time, user name, IP address of the client, etc.).
No additional load is put on the database server(s).

Disadvantages

The system is more complex with the addition of the proxy.
There will be a (typically modest) increase in latency, depending mostly on the size of the images, but this should be compared to the alternatives.

Example

Using a proxy, we can create a simple filter to add a watermark to certain bitmaps.
If we assume that our database contains a table called images, with a column called bitmap of type blob or varbinary (depending on your database), we can create a result set filter in the proxy with the following parameter:

Query pattern: regex:select.*from.*images.*

and a bit of JavaScript code (which also uses the underlying Java engine):

JavaScript
// Get the value of the bitmap column as a byte stream
let stream = context.packet.getJavaStream("bitmap");
if (stream === null) {
    return;
}

// The text to use as the watermark
const now = new Date();
const watermark = "Retrieved by " + context.connectionContext.userName +
    " on " + now.getFullYear() + "/" + (now.getMonth() + 1) + "/" + now.getDate();

// Read the bitmap
const ImageIO = Java.type("javax.imageio.ImageIO");
let img = ImageIO.read(stream);

// Create the Graphics to draw the text
let g = img.createGraphics();
const Color = Java.type("java.awt.Color");
g.setColor(new Color(255, 255, 0, 150));
const Font = Java.type("java.awt.Font");
const textFont = new Font("sans-serif", Font.BOLD, 16);
g.setFont(textFont);

// Draw the text at the bottom of the bitmap, horizontally centered
let textRect = textFont.getStringBounds(watermark, g.getFontRenderContext());
g.drawString(watermark,
    (img.getWidth() / 2) - (textRect.getWidth() / 2),
    img.getHeight() - (textRect.getHeight() / 2));

// Write the bitmap back to the column value
const ByteArrayOutputStream = Java.type("java.io.ByteArrayOutputStream");
let outStream = new ByteArrayOutputStream();
ImageIO.write(img, "png", outStream);
context.packet.bitmap = outStream.toByteArray();

With this filter in place, bitmaps retrieved from this table will include a watermark containing the name of the database user and a timestamp. The database is never affected: the bitmaps stored in the database are completely unchanged; they are modified on the fly as they are delivered to the client. Obviously, we can watermark bitmaps selectively, change the text of the watermark depending on any relevant factors, and play with fonts, colors, positioning, transparency, etc. See this example for details.

Secret Watermarks

In some cases, it might be desirable to mark the bitmaps in a way that is not visible to the naked eye. One trivial way to do this would be to edit the image's metadata, but if we need something more subtle, we can use steganography to distribute a secret message within the bitmap in a way that makes it difficult to detect. The example above can be modified to use the Adumbra library:

JavaScript
// Get the value of the bitmap column as a byte stream
let inStream = context.packet.getJavaStream("bitmap");
if (inStream === null) {
    return;
}

// The hidden message
const now = new Date();
const message = "Retrieved by " + context.connectionContext.userName +
    " on " + now.getFullYear() + "/" + (now.getMonth() + 1) + "/" + now.getDate();
const messageBytes = context.utils.getUTF8BytesForString(message);
const keyBytes = context.utils.getUTF8BytesForString("This is my secret key");

// Hide the message in the bitmap
const Encoder = Java.type("com.galliumdata.adumbra.Encoder");
const ByteArrayOutputStream = Java.type("java.io.ByteArrayOutputStream");
let outStream = new ByteArrayOutputStream();
let encoder = new Encoder(1);
encoder.encode(inStream, outStream, "png", messageBytes, keyBytes);
context.packet.bitmap = outStream.toByteArray();

With this in place, the modified bitmaps served to the clients will contain a secret watermark that is difficult to detect and almost impossible to extract without the secret key.

What Else Can You Do With This?
This watermarking technique can also be applied to documents other than bitmaps:

Documents such as PDF and MS Word files can be given extra metadata on the fly, or they can be given a visible or invisible watermark – see this example for PDF documents.
Text documents can be subtly marked using techniques such as altering spacing, spelling, layout, fonts and colors, zero-width characters, etc.
Any digital document that can sustain minor changes without losing significant meaning – such as bitmaps, audio files, and sample sets – can be altered in a similar way.

In fact, entire data sets can be watermarked by subtly modifying some non-critical aspects of the data, making it possible to identify these datasets later on and know exactly where they came from. This is beyond the scope of this article, but there are many ways to make data traceable back to its origin.

Conclusion

When you need a custom watermark for every retrieval of bitmaps or documents from a database, the technique shown here is a solid approach that avoids any additional load on the database and requires no changes to the clients or servers.
Traditional machine learning (ML) models and AI techniques often suffer from a critical flaw: they lack uncertainty quantification. These models typically provide point estimates without accounting for the uncertainty surrounding their predictions, which undermines our ability to assess the reliability of the model's output. Moreover, traditional ML models are data-hungry, often require correctly labeled data, and as a result tend to struggle with problems where data is limited. Furthermore, these models lack a systematic framework for incorporating expert domain knowledge or prior beliefs. Without the ability to leverage domain-specific insights, a model might overlook crucial nuances in the data and fail to perform to its potential. ML models are also becoming more complex and opaque, while there is a growing demand for transparency and accountability in decisions derived from data and AI.

Probabilistic Programming: A Solution To Addressing These Challenges

Probabilistic programming provides a modeling framework that addresses these challenges. At its core lies Bayesian statistics, a departure from the frequentist interpretation of statistics.

Bayesian Statistics

In frequentist statistics, probability is interpreted as the long-run relative frequency of an event. Data is considered random and the result of sampling from a fixed, defined distribution; hence, noise in measurement is associated with sampling variation. Frequentists believe that probability exists and is fixed, and that infinitely repeated experiments converge to that fixed value. Frequentist methods do not assign probability distributions to parameters, and their interpretation of uncertainty is rooted in the long-run frequency properties of estimators rather than explicit probabilistic statements about parameter values.

In Bayesian statistics, probability is interpreted as a measure of uncertainty in a particular belief. Data is considered fixed, while the unknown parameters of the system are regarded as random variables and are modeled using probability distributions. Bayesian methods capture uncertainty within the parameters themselves and hence offer a more intuitive and flexible approach to uncertainty quantification.

Frequentist vs. Bayesian Statistics [1]

Probabilistic Machine Learning

In frequentist ML, model parameters are treated as fixed and estimated through Maximum Likelihood Estimation (MLE), where the likelihood function quantifies the probability of observing the data given the statistical model. MLE seeks point estimates of the parameters that maximize this probability. To implement MLE:

Assume a model and the underlying model parameters.
Derive the likelihood function based on the assumed model.
Optimize the likelihood function to obtain point estimates of the parameters.

Hence, frequentist models, which include deep learning, rely on optimization, usually gradient-based, as their fundamental tool. By contrast, Bayesian methods model the unknown parameters and their relationships with probability distributions and use Bayes' theorem to compute and update these probabilities as new data is obtained.

Bayes' Theorem

"Bayes’ rule tells us how to derive a conditional probability from a joint, conditioning tells us how to rationally update our beliefs, and updating beliefs is what learning and inference are all about" [2]. This is a simple but powerful equation.
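In symbols, for model parameters θ and observed data D, the theorem in its standard form is:

LaTeX
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}

The terms of this equation are exactly the quantities described next.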
Prior: represents the initial belief about the unknown parameters.
Likelihood: represents the probability of the data given the assumed model.
Marginal likelihood: the model evidence, which acts as a normalizing coefficient.
Posterior: represents our updated beliefs about the parameters, incorporating both prior knowledge and observed evidence.

In Bayesian machine learning, inference is the fundamental tool. The distribution over parameters represented by the posterior is used for inference, offering a more comprehensive understanding of uncertainty.

Bayesian update in action: The plot below illustrates the posterior distribution for a simple coin-toss experiment across various sample sizes and with two distinct prior distributions. This visualization provides insight into how the combination of different sample sizes and prior beliefs influences the resulting posterior distributions.

Impact of Sample Size and Prior on Posterior Distribution

How to Model the Posterior Distribution

The seemingly simple posterior distribution is, in most cases, hard to compute. In particular, the denominator, the marginal likelihood integral, tends to be intractable, especially when working with a higher-dimensional parameter space. In most cases there is no closed-form solution, and numerical integration methods are computationally intensive. To address this challenge, we rely on a special class of algorithms called Markov Chain Monte Carlo methods to model the posterior distribution. The idea is to sample from the posterior distribution rather than modeling it explicitly, and to use those samples to represent the distribution of the model parameters.

Markov Chain Monte Carlo (MCMC)

"MCMC methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain" [3]. A few of the commonly used MCMC samplers are:

Metropolis-Hastings
Gibbs Sampler
Hamiltonian Monte Carlo (HMC)
No-U-Turn Sampler (NUTS)
Sequential Monte Carlo (SMC)

Probabilistic Programming

Probabilistic programming is a programming framework for Bayesian statistics: it concerns the development of syntax and semantics for languages that denote conditional inference problems, together with "solvers" for those inference problems. In essence, probabilistic programming is to Bayesian modeling what automated differentiation tools are to classical machine learning and deep learning models [2]. There exists a diverse ecosystem of probabilistic programming languages, each with its own syntax, semantics, and capabilities. Some of the most popular languages include:

BUGS (Bayesian inference Using Gibbs Sampling) [4]: One of the earliest probabilistic programming languages, known for its user-friendly interface and support for a wide range of probabilistic models. It implements Gibbs sampling and other Markov Chain Monte Carlo (MCMC) methods for inference.
JAGS (Just Another Gibbs Sampler) [5]: A specialized language for Bayesian hierarchical modeling, particularly suited to complex models with nested structures. It utilizes the Gibbs sampling algorithm for posterior inference.
STAN: A probabilistic programming language renowned for its expressive modeling syntax and efficient sampling algorithms. Stan is widely used in academia and industry for a variety of Bayesian modeling tasks.
"Stan differs from BUGS and JAGS in two primary ways. First, Stan is based on a new imperative probabilistic programming language that is more flexible and expressive than the declarative graphical modeling languages underlying BUGS or JAGS, in ways such as declaring variables with types and supporting local variables and conditional statements. Second, Stan’s Markov chain Monte Carlo (MCMC) techniques are based on Hamiltonian Monte Carlo (HMC), a more efficient and robust sampler than Gibbs sampling or Metropolis-Hastings for models with complex posteriors" [6]. BayesDB: BayesDB is a probabilistic programming platform designed for large-scale data analysis and probabilistic database querying. It enables users to perform probabilistic inference on relational databases using SQL-like queries [7] PyMC3: PyMC3 is a Python library for Probabilistic Programming that offers an intuitive and flexible interface for building and analyzing probabilistic models. It leverages advanced sampling algorithms such as Hamiltonian Monte Carlo (HMC) and Automatic Differentiation Variational Inference (ADVI) for inference [8]. TensorFlow Probability: "TensorFlow Probability (TFP) is a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware (TPU, GPU)" [9]. Pyro: "Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling" [10]. These languages share a common workflow, outlined below: Model definition: The model defines the processes governing data generation, latent parameters, and their interrelationships. This step requires careful consideration of the underlying system and the assumptions made about its behavior. Prior distribution specification: Define the prior distributions for the unknown parameters within the model. These priors encode the practitioner's beliefs, domain, or prior knowledge about the parameters before observing any data. Likelihood specification: Describe the likelihood function, representing the probability distribution of observed data conditioned on the unknown parameters. The likelihood function quantifies the agreement between the model predictions and the observed data. Posterior distribution inference: Use a sampling algorithm to approximate the posterior distribution of the model parameters given the observed data. This typically involves running Markov Chain Monte Carlo (MCMC) or Variational Inference (VI) algorithms to generate samples from the posterior distribution. Case Study: Forecasting Stock Index Volatility In this case study, we will employ Bayesian modeling techniques to forecast the volatility of a stock index. Volatility here measures the degree of variation in a stock's price over time and is a crucial metric for assessing the risk associated with a particular stock. Data: For this analysis, we will utilize historical data from the S&P 500 stock index. The S&P 500 is a widely used benchmark index that tracks the performance of 500 large-cap stocks in the United States. By examining the percentage change in the index's price over time, we can gain insights into its volatility. 
S&P 500 — Share Price and Percentage Change From the plot above, we can see that the time series — price change between consecutive days has: Constant Mean Changing variance over time, i.e., the time series exhibits heteroscedasticity Modeling Heteroscedasticity: "In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance" [11]. Auto-regressive Conditional Heteroskedasticity (ARCH) models are specifically designed to address heteroscedasticity in time series data. Bayesian vs. Frequentist Implementation of ARCH Model The key benefits of Bayesian modeling include the ability to incorporate prior information and quantify uncertainty in model parameters and predictions. These are particularly useful in settings with limited data and when prior knowledge is available. In conclusion, Bayesian modeling and probabilistic programming offer powerful tools for addressing the limitations of traditional machine-learning approaches. By embracing uncertainty quantification, incorporating prior knowledge, and providing transparent inference mechanisms, these techniques empower data scientists to make more informed decisions in complex real-world scenarios. References Fornacon-Wood, I., Mistry, H., Johnson-Hart, C., Faivre-Finn, C., O'Connor, J.P. and Price, G.J., 2022. Understanding the differences between Bayesian and frequentist statistics. International journal of radiation oncology, biology, physics, 112(5), pp.1076-1082. Van de Meent, J.W., Paige, B., Yang, H. and Wood, F., 2018. An Introduction to Probabilistic Programming. arXiv preprint arXiv:1809.10756. Markov chain Monte Carlo Spiegelhalter, D., Thomas, A., Best, N. and Gilks, W., 1996. BUGS 0.5: Bayesian inference using Gibbs sampling manual (version ii). MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK, pp.1-59. Hornik, K., Leisch, F., Zeileis, A. and Plummer, M., 2003. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of DSC (Vol. 2, No. 1). Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.A., Guo, J., Li, P. and Riddell, A., 2017. Stan: A probabilistic programming language. Journal of statistical software, 76. BayesDB PyMC TensorFlow Probability Pyro AI Homoscedasticity and heteroscedasticity Introduction to ARCH Models pymc.GARCH11
Data drives business in the modern economy; the faster businesses can get to data and provide meaningful insights, the more they can enable informed decision-making. Snowflake has come a long way in this space in recent years, and the progress is impressive. Snowflake is also being increasingly adopted by many firms, as it is well known for its large-dataset processing and computing power. It provides scalability, affordability, security, ease of use, customization, and easy data integration. In addition, Snowflake provides a host of specialized services, like Snowflake Arctic, Snowflake for Big Data, Snowflake Data Sharing, and Snowpipe, as required depending on the use case. Together, these give enterprises striving to capitalize on strategic data utilization a powerful toolset. In this article, I will explore how data sharing works in Snowflake.

Data sharing is the process of making data available to multiple users, applications, or organizations while maintaining its quality. Organizations often need to share data with customers, suppliers, and partners, but they face significant challenges such as poor governance, outdated solutions, manual data transfers, and being tied to specific vendors. To become truly data-driven, organizations need an improved method for sharing data. Snowflake offers a modern solution to these challenges, enabling seamless and secure data sharing.

Data Sharing

When using Snowflake as your data warehouse, you can share selected data objects with another Snowflake account holder or even with someone who doesn't have a Snowflake account (through a Reader Account). One major advantage of data sharing in Snowflake is that the data isn't copied or transferred between accounts. Instead, any updates made in the provider account are immediately available to the consumer.

Provider

The provider, also referred to as the data provider or producer, is the user of a Snowflake account responsible for creating a share and making it available to other Snowflake accounts for consumption. As the creator of the share, the provider holds the authority to determine which data and resources are shared and accessible to other users within the Snowflake ecosystem.

Consumer

A data consumer is any account that chooses to create a database from a share provided by a data provider. As a data consumer, once you add a shared database to your account, you can access and query its contents in the same manner as any other database in your account.

There are different methods for sharing data in Snowflake. You can either restrict access based on specific permissions, ensuring only authorized users can view certain objects, or you can make the data available for all intended users to read. This flexibility allows for secure and efficient data collaboration.

Direct Share

Direct Share is the simplest method for consumers to access data shared by a provider when the provider and consumer are in the same region. This approach requires the data provider to have the account IDs of the consumer accounts. Once set up, consumers can easily view and use the shared data objects.

Consumers With a Snowflake Account

Consumers with a Snowflake account can be given access to data shared by a provider. The shared objects can only be accessed by these consumers. In this setup, the provider is charged for storage, while the consumer is charged for compute usage.
When sharing and consuming data via Snowflake shares, it's important to follow best practices. These include validating the data shares, auditing access to the shared data, and adding or removing objects from the shares as needed. This ensures secure and efficient data sharing. Below is a list of commands to be executed in Snowflake to share an object from the provider to the consumer:

SQL
CREATE DATABASE ROLE MYSHARE;
GRANT USAGE ON SCHEMA PUBLIC TO DATABASE ROLE MYSHARE;
GRANT SELECT ON VIEW VW_CUSTOMER TO DATABASE ROLE MYSHARE;
SHOW GRANTS TO DATABASE ROLE MYSHARE;

CREATE OR REPLACE SHARE MY_TEST_SHARE;
GRANT USAGE ON DATABASE SAMPLE_DB TO SHARE MY_TEST_SHARE;
GRANT USAGE ON SCHEMA SAMPLE_DB.PUBLIC TO SHARE MY_TEST_SHARE;
GRANT SELECT ON TABLE SAMPLE_DB.PUBLIC.CUSTOMER_TEST TO SHARE MY_TEST_SHARE;
SHOW SHARES;
SHOW GRANTS TO SHARE MY_TEST_SHARE;
ALTER SHARE MY_TEST_SHARE ADD ACCOUNTS = ACC12345;

------ TO SHARE ALL OBJECTS
GRANT SELECT ON ALL TABLES IN SCHEMA SAMPLE_DB.PUBLIC TO SHARE MY_TEST_SHARE;
GRANT SELECT ON ALL TABLES IN DATABASE SAMPLE_DB TO SHARE MY_TEST_SHARE;

Consumers With No Snowflake Account (Reader Account)

As a data provider, you might want to share data with a consumer who does not have a Snowflake account or is not ready to become a licensed Snowflake customer. Consumers without a Snowflake account cannot access shared data directly. To enable access, you can create a Reader Account and share the data with it. With Reader Accounts, sharing data is quick, simple, and affordable, without requiring the user to sign up for a Snowflake account. The provider account that creates a reader account is in charge of managing it, and the provider covers both the compute and storage costs the consumer incurs.

SQL
CREATE MANAGED ACCOUNT READER_ACCT
  ADMIN_NAME = 'READER_ACCT',
  ADMIN_PASSWORD = '**********',
  TYPE = READER;

SHOW MANAGED ACCOUNTS;

-- Snowflake URL with locator (share it with the consumer), then add the
-- reader account (the locator from that URL) to the share:
ALTER SHARE QCUFBZG.AXYZ5751.POC_SNOWFLAKE_SECURE_SHARE
  ADD ACCOUNTS = XYZ11993
  SHARE_RESTRICTIONS = FALSE;

When sharing data with a consumer, the consumer can see all the shared information, and the provider cannot hide any of it. It is therefore suggested that you share data using secure views. This way, only the attributes meant for the consumer are visible, and access to other data is restricted. Views should be defined as secure if they are meant to ensure data privacy.

Listing

Listing is a more advanced way of securely sharing data and operates on the same producer and consumer model as Direct Share. However, it differs in that it is not restricted to the same region; data can be accessed by Snowflake accounts in different regions. Data can be shared with specific accounts or published on the Snowflake Marketplace. Listings come in two types: private and public.

Private: Private listings are accessible only to specific consumers. They allow you to use listing features to share data and other information directly with other Snowflake accounts in any Snowflake region.
Public: Data products can be shared publicly in the Snowflake Marketplace. By offering listings on the Snowflake Marketplace, you can promote your data product across the Snowflake Data Cloud. This allows you to share curated data offerings with multiple consumers at once, instead of managing sharing arrangements with each consumer individually.
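On the consumer side, once a share or listing is available, the consumer typically creates a database from it and queries it like any other database. As a rough sketch (the provider account identifier and the consumer credentials below are placeholders; the share and view names reuse the examples above), this could be scripted from Python with the Snowflake connector:

Python
import snowflake.connector

# Placeholder credentials for the consumer's Snowflake account.
conn = snowflake.connector.connect(
    account="consumer_account_identifier",
    user="consumer_user",
    password="********",
    role="ACCOUNTADMIN",
)

cur = conn.cursor()
# Mount the share as a database, then query the shared (secure) view.
cur.execute("CREATE DATABASE SHARED_DB FROM SHARE PROVIDER_ACCT.MY_TEST_SHARE")
cur.execute("SELECT * FROM SHARED_DB.PUBLIC.VW_CUSTOMER LIMIT 10")
for row in cur.fetchall():
    print(row)
conn.close()

The same two statements can, of course, be run directly in a Snowflake worksheet; the connector is only used here to show the consumer workflow end to end.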
Conclusion

Snowflake's data-sharing capabilities provide a modern solution for organizations looking to share data securely and efficiently. By leveraging the features described above, businesses can overcome traditional data-sharing challenges and unlock the full potential of their data. Determine the best approach to sharing your data with consumers, or to accessing data from providers, based on the use case. For more detailed instructions and best practices on implementing data sharing, refer to the official Snowflake documentation.
A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model's ability to autonomously learn and adapt in production as new data is added. In practice, data scientists often work with Jupyter notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:

Amazon SageMaker: A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly
AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data

In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.

Use Case
For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes. This post doesn't go into the details of the model but demonstrates a way to build an ML pipeline that builds and deploys any ML model.

Solution Overview
The following diagram summarizes the approach for the retraining pipeline. The workflow contains the following elements:

AWS Glue crawler: You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
AWS Glue triggers: Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers.
AWS Glue job: An AWS Glue job encapsulates a script that connects to source data, processes it, and writes it to a target location.
AWS Glue workflow: An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components.

When the workflow is triggered, it follows the chain of operations described in the preceding image. The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validation sets in AWS Glue jobs. The training job runs on a Python shell in AWS Glue, which starts a training job in Amazon SageMaker based on a set of hyperparameters. When the training job is complete, an endpoint is created and hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it waits until the endpoint is in InService status. A simplified sketch of this training step is shown below.
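For illustration only, here is a minimal sketch of how such a Glue Python shell step might launch a SageMaker training job with boto3 and then wait for the endpoint. The job names, image URI, role ARN, bucket paths, and hyperparameters are placeholders, not values taken from this post's actual scripts.

Python
import boto3

sagemaker_client = boto3.client("sagemaker")

# Start a SageMaker training job for the built-in BlazingText algorithm.
sagemaker_client.create_training_job(
    TrainingJobName="blazingtext-dbpedia-demo",            # placeholder name
    AlgorithmSpecification={
        "TrainingImage": "<blazingtext_ecr_image_uri>",     # region-specific image
        "TrainingInputMode": "File",
    },
    RoleArn="<sagemaker_execution_role_arn>",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://<bucket>/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://<bucket>/output/"},
    ResourceConfig={"InstanceType": "ml.c5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    HyperParameters={"mode": "supervised", "epochs": "10"},  # example hyperparameters
)

# Block until training finishes. The real script would then create a model,
# endpoint configuration, and endpoint before waiting for InService status.
sagemaker_client.get_waiter("training_job_completed_or_stopped").wait(
    TrainingJobName="blazingtext-dbpedia-demo")
sagemaker_client.get_waiter("endpoint_in_service").wait(
    EndpointName="blazingtext-dbpedia-demo-endpoint")

In the actual pipeline, the resource names and ARNs come from AWS Systems Manager Parameter Store rather than being hard-coded.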
At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.

Setting up the Environment
To set up the environment, complete the following steps:
1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
3. Download the following code into your local directory.

Organization of Code
The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
  --glue_scripts
    --DataExtractionJob.py
    --DataProcessingJob.py
    --MessagingQueueJob.py
    --TrainingJob.py
  --base_resources.template
  --deploy.sh
  --glue_resources.template

The code directory is divided into three parts:
AWS CloudFormation templates: The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to the ARNs further along in the workflow.
AWS Glue scripts: The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL scripts as well as the model training and deployment scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
Bash script: A wrapper script, deploy.sh, is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as part of the workflow using crawlers.

Implementing the Solution
Complete the following steps:
1. Go to the deploy.sh file and replace the algorithm_image name with <ecr_path> based on your Region. The following code example is a path for Region us-west-2:

Shell
algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms.
2. Enter the following code in your terminal:

Shell
sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline.
3. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
4. On the AWS Glue console, manually start the pipeline. In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.
5. To begin the workflow, in the Workflow section, select DevMLWorkflow. From the Actions drop-down menu, choose Run.
6. View the progress of your workflow on the History tab and select the latest RUN ID.
The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.

After the workflow is successful, open the Amazon SageMaker console. Under Inference, choose Endpoints. The following screenshot shows that the endpoint deployed by the workflow is ready. Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.

Cleaning Up
Make sure to delete the Amazon SageMaker hosting resources: endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

Python
def delete_resources(self):
    endpoint_name = self.endpoint
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print("Deleted Test Endpoint ", endpoint_name)
    except Exception as e:
        print('Model endpoint deletion failed')
    try:
        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
        print("Deleted Test Endpoint Configuration ", endpoint_name)
    except Exception as e:
        print(' Endpoint config deletion failed')
    try:
        sagemaker.delete_model(ModelName=endpoint_name)
        print("Deleted Test Endpoint Model ", endpoint_name)
    except Exception as e:
        print('Model deletion failed')

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.
Snowflake is a leading cloud-based data storage and analytics service that provides various solutions for data warehousing, data engineering, AI/ML modeling, and other related services. Among its many features and functionalities, one powerful data recovery feature is Time Travel, which allows users to access historical data from the past. It is beneficial when a user comes across any of the scenarios below:

Retrieving the previous row or column value before the current DML operation
Recovering the last state of data for backup or redundancy
Recovering records that were updated or deleted from a table by mistake
Restoring the previous state of a table, schema, or database

Snowflake's Continuous Data Protection Life Cycle allows Time Travel within a window of 1 to 90 days; retention of up to 90 days requires the Enterprise edition.

Time Travel SQL Extensions
Time Travel can be used through the OFFSET, TIMESTAMP, and STATEMENT keywords in combination with the AT or BEFORE clause.

Offset
If a user wants to retrieve past data or recover a table from an older state using a time parameter, the user can use the queries below, where the offset is defined in seconds.

SQL
SELECT * FROM any_table AT(OFFSET => -60*5); -- For 5 minutes
CREATE TABLE recovered_table CLONE any_table AT(OFFSET => -3600); -- For 1 hour

Timestamp
Suppose a user wants to query data from the past or recover a schema as of a specific timestamp. Then, the user can utilize the queries below.

SQL
SELECT * FROM any_table AT(TIMESTAMP => 'Sun, 05 May 2024 16:20:00 -0700'::timestamp_tz);
CREATE SCHEMA recovered_schema CLONE any_schema AT(TIMESTAMP => 'Wed, 01 May 2024 01:01:00 +0300'::timestamp_tz);

Statement
Users can also use any unique query ID to get the state of the data as it was before that statement ran.

SQL
SELECT * FROM any_table BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');
CREATE DATABASE recovered_db CLONE any_db BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');

The commands below set the data retention time, which can be increased or decreased.

SQL
CREATE TABLE any_table(id NUMERIC, name VARCHAR, created_date DATE) DATA_RETENTION_TIME_IN_DAYS=90;
ALTER TABLE any_table SET DATA_RETENTION_TIME_IN_DAYS=30;

If data retention is not required, it can be disabled with SET DATA_RETENTION_TIME_IN_DAYS=0;. Objects that do not have an explicitly defined retention period inherit the retention period from the level above them. For instance, tables without a specified retention period inherit it from their schema, and schemas without a defined retention period inherit it from the database level. The account level is the highest level of the hierarchy and should be set up with 0 days for data retention.

Now consider a case where a table, schema, or database is accidentally dropped, causing all of its data to be lost. When a data object gets dropped, it is kept in Snowflake's back end until the data retention period expires. For such cases, Snowflake has a similarly great feature that brings those objects back with the SQL below.

SQL
UNDROP TABLE any_table;
UNDROP SCHEMA any_schema;
UNDROP DATABASE any_database;

If a user creates a table with the same name as the dropped table, Snowflake creates a new table and does not restore the old one. When the user uses the UNDROP command above, Snowflake restores the old object. Also, the user needs the appropriate permission or ownership to restore the object.
If an object is not retrieved within the Time Travel data retention period, it is transferred to Snowflake Fail-safe, where users cannot query it. The only way to recover it is with Snowflake's help, and Fail-safe stores the data for a maximum of 7 days.

Challenges
Time Travel, though useful, has a few challenges, as listed below.

Time Travel has a default retention of one day for transient and temporary tables in Snowflake.
Objects other than tables, schemas, and databases, such as views, UDFs, and stored procedures, are not supported.
If a table is recreated with the same name, referring to the older version of the same name requires renaming the current table because, by default, Time Travel refers to the latest version.

Conclusion
The Time Travel feature is quick, easy, and powerful. It's always handy and gives users more comfort while operating on production-sensitive data. The great thing is that users can run these queries themselves without having to involve admins. With a maximum retention of 90 days, users have more than enough time to query back in time or fix any incorrectly updated data. In my opinion, it is Snowflake's strongest feature.

Reference
Understanding & Using Time Travel
Healthcare has ushered in a transformative era dominated by artificial intelligence (AI) and machine learning (ML), which are now central to data analytics and operational utilities. The transformative power of AI and ML is unlocking unprecedented value by rapidly converting vast datasets into actionable insights. These insights not only enhance patient care and streamline treatment processes but also pave the way for groundbreaking medical discoveries. With the precision and efficiency brought by AI and ML, diagnoses and treatment strategies become significantly more accurate and effective, accelerating the pace of medical research and marking a fundamental shift in healthcare.

Benefits of AI in Healthcare
AI and ML will influence the healthcare industry's entire ecosystem. From more accurate diagnostic procedures to personalized treatment recommendations and operational efficiency, everything can be improved with the help of AI and ML. AI technologies give healthcare providers real-time data analytics, predictive analysis, and decision-support capabilities, enabling a more proactive and highly personalized approach to patient care. For instance, AI algorithms can increase diagnostic accuracy by studying images, while ML models can analyze historical data to predict patient outcomes and inform the treatment approach used.

Machine Learning in Health Data Analysis
Machine learning sits at the heart of the revolution in health data, offering powerful tools that identify patterns and predict future outcomes based on historical data. Of prime importance are algorithms that forecast disease progression, improve treatment methodologies, and streamline healthcare delivery. These findings enable more personalized medicine, with better strategies for slowing disease progression and improving patient care. Most importantly, ML algorithms optimize healthcare operations through thorough analysis of trends such as patient admission levels and resource utilization, streamlining hospital workflows to yield improved service delivery.

Example: Patient Admission Rates With Random Forest
Explanation
Data loading: Load your data from a CSV file. Replace 'patient_data.csv' with the path to your actual data file.
Feature selection: Only the features relevant to hospital admissions, such as age, blood pressure, heart rate, and previous admissions, are selected.
Data splitting: Split the data into training and testing sets to evaluate model performance.
Feature scaling: Rescale the features so that the model considers all features equally, because logistic regression is sensitive to feature scaling.
Model training: Train a logistic regression model using the training data, then make admission predictions with the model on the test set.
Evaluation: Evaluate the model using accuracy, a confusion matrix, and a detailed classification report on the test set to validate its predictions for patient admission. A hedged reconstruction of this workflow is sketched below.
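The code listing for this example is not reproduced in this excerpt, so the following is a minimal, hypothetical sketch that follows the explanation above. Note that although the heading mentions Random Forest, the explanation describes a logistic regression workflow, and the sketch follows the explanation; the column names and the 'admitted' target label are assumptions.

Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Data loading: replace 'patient_data.csv' with the path to your data file.
data = pd.read_csv("patient_data.csv")

# Feature selection: features relevant to hospital admissions (assumed column names).
features = ["age", "blood_pressure", "heart_rate", "previous_admissions"]
X = data[features]
y = data["admitted"]  # assumed binary target column

# Data splitting: hold out a test set to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: logistic regression is sensitive to feature scales.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model training and prediction.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation: accuracy, confusion matrix, and classification report.
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Swapping LogisticRegression for sklearn.ensemble.RandomForestClassifier would match the heading with minimal changes; tree-based models also do not require feature scaling.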
Natural Language Processing in Health Data Analysis
Natural language processing (NLP) is another critical capability, allowing useful information to be extracted from sources such as clinical notes, patient feedback, and medical journals. NLP tools help analyze and interpret the overwhelming volume of text data produced in health settings daily, easing access to the appropriate information. This capability is especially valuable for supporting clinical decisions and research, delivering fast insights from existing patient records and literature and improving the speed and accuracy of medical diagnostics and patient management.

Example: Deep Learning Model for Disease Detection in Medical Imaging
Explanation
ImageDataGenerator: Automatically augments the image data during training (for example, rotation, width shift, and height shift), which helps the model generalize better from limited data.
flow_from_directory: Loads images directly from a directory structure, resizing them as necessary and applying the transformations specified in ImageDataGenerator.
Model architecture: In sequence, the model uses several convolutional (Conv2D) and pooling (MaxPooling2D) layers. Convolutional layers help the model learn features in the images, and pooling layers reduce the dimensionality of each feature map.
Dropout: This layer randomly sets a fraction of the input units to 0 at each update during training, which helps prevent overfitting.
Flatten: Converts the pooled feature maps into a single column passed to the densely connected layers.
Dense: Fully connected layers whose neurons take input from the features in the data. The final layer uses a sigmoid activation function to produce a binary classification output.
Compilation and training: The model is compiled with a binary cross-entropy loss function, which suits this classification task, and the given optimizer, and it is trained using the .fit method on the data from train_generator, with validation using validation_generator.
Saving the model: Save the trained model for later use, whether for deployment in medical diagnostic applications or further refinement. A hedged reconstruction of this model is sketched below.
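As with the previous example, the original listing is not included here; the code below is a minimal sketch consistent with the explanation above. The directory paths (data/train, data/val), the 150x150 input size, and the layer sizes are assumptions.

Python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# Augment training images (rotation, width/height shifts) to generalize from limited data.
train_datagen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=20,
                                   width_shift_range=0.1, height_shift_range=0.1)
val_datagen = ImageDataGenerator(rescale=1.0 / 255)

# Load images directly from a directory structure, resizing as needed.
train_generator = train_datagen.flow_from_directory(
    "data/train", target_size=(150, 150), batch_size=32, class_mode="binary")
validation_generator = val_datagen.flow_from_directory(
    "data/val", target_size=(150, 150), batch_size=32, class_mode="binary")

# Convolutional layers learn image features; pooling reduces feature-map dimensionality.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    MaxPooling2D(2, 2),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(2, 2),
    Dropout(0.5),                    # randomly drops units to reduce overfitting
    Flatten(),                       # flattens pooled feature maps for the dense layers
    Dense(128, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary classification output
])

# Binary cross-entropy suits this two-class disease detection task.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_generator, epochs=10, validation_data=validation_generator)

# Save the trained model for later diagnostic use or further refinement.
model.save("disease_detection_model.h5")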
Deep Learning in Health Data Analysis
Deep learning is an advanced branch of machine learning that uses neural networks to analyze highly complex data structures. The technology has proven helpful in areas such as medical imaging, where deep learning models detect and diagnose diseases from images with a level of precision that is sometimes higher than that of human experts. In genomics, deep learning aids in parsing and understanding genetic sequences, offering insights central to personalized medicine and treatment planning.

Example: Deep Learning for Genomic Sequence Classification
Explanation
Data preparation: We simulate sequence data where each base of the DNA sequence (A, C, G, T) is represented as a one-hot encoded vector, meaning each base is converted into a vector of four elements. The sequences and corresponding labels (binary classification) are randomly generated for demonstration.
Model architecture and Conv1D layers: These convolutional layers are specifically useful for sequence data (like time series or genetic sequences). They process data in a way that respects its temporal or sequential nature.
MaxPooling1D layers: These layers reduce the spatial size of the representation, decreasing the number of parameters and the computation in the network, and hence help to prevent overfitting.
Flatten layer: This layer flattens the output from the convolutional and pooling layers for use as input to the densely connected layers.
Dense layers: These are fully connected layers. Dropout between these layers reduces overfitting by preventing complex co-adaptations on the training data.
Compilation and training: The model is compiled with the 'adam' optimizer and the 'categorical_crossentropy' loss function, typical for multi-class classification tasks. It is trained using the .fit method, and performance is validated on a separate test set.
Evaluation: After training, the model's performance is evaluated on the test set to see how well it generalizes to new, unseen data. A hedged reconstruction of this example is sketched below.
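The original listing for this example is likewise not included, so here is a minimal sketch under the assumptions stated in the explanation (simulated one-hot DNA sequences with randomly generated binary labels); the sequence length, layer sizes, and epoch count are illustrative.

Python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from tensorflow.keras.utils import to_categorical

# Data preparation: simulate one-hot encoded DNA sequences (A, C, G, T)
# with random binary labels, purely for demonstration.
num_sequences, seq_length = 1000, 100
X = np.eye(4)[np.random.randint(0, 4, size=(num_sequences, seq_length))]  # shape: (1000, 100, 4)
y = to_categorical(np.random.randint(0, 2, size=num_sequences), num_classes=2)

split = int(0.8 * num_sequences)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# Conv1D layers respect the sequential nature of genetic data;
# pooling reduces dimensionality and helps prevent overfitting.
model = Sequential([
    Conv1D(32, kernel_size=5, activation="relu", input_shape=(seq_length, 4)),
    MaxPooling1D(pool_size=2),
    Conv1D(64, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(64, activation="relu"),
    Dropout(0.5),                    # reduces co-adaptation on the training data
    Dense(2, activation="softmax"),  # two-class output
])

# 'adam' optimizer with categorical cross-entropy, as described above.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

# Evaluation on the held-out test set.
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")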
AI Applications in Diagnostics and Treatment Planning
AI has dramatically improved the speed and accuracy of disease diagnosis by scanning medical images, genetic indicators, and patient histories for even the most minor signs of disease. In addition, AI algorithms help develop personalized treatment regimens by filtering through enormous amounts of treatment data and patient responses to provide tailored care, optimizing therapeutic effectiveness while minimizing side effects.

Challenges and Ethical Considerations in AI and Health Data Analysis
Integrating AI and ML into healthcare also brings ethical considerations. Data privacy, algorithmic bias, and transparent decision-making processes are the key areas that must be addressed for the proper, responsible use of AI in healthcare. It is necessary to ensure that patient data is kept safe and protected, and any deployment should guard against bias so that trust and fairness in the service are not lost.

Conclusion
The future of healthcare is promising, with AI and ML technologies adding new sophistication to the spectrum of analytical tools, such as AR in surgical procedures and virtual health assistants powered by AI. These advances will make better diagnosis and treatment possible while ensuring smooth operations, ultimately contributing to more tailor-made and effective patient care. As AI/ML technologies continue to develop and be integrated, healthcare delivery will shift toward more efficient, accurate, and patient-centered service. This also means that several regulatory constraints need to be considered in addition to the business and technical challenges discussed.

In today's rapidly evolving technological landscape, it is crucial for any business or application to efficiently manage and utilize data. NoSQL databases have emerged as an alternative to traditional relational databases, offering flexibility, scalability, and performance advantages. These benefits become even more pronounced when combined with Java, a robust and widely used programming language. This article explores three key benefits of understanding and learning NoSQL databases with Java, highlighting the polyglot philosophy and its efficiency in software architecture.

Enhanced Flexibility and Scalability
One significant benefit of NoSQL databases is their capability to handle various data models, such as key-value pairs, documents, wide-column stores, and graph databases. This flexibility enables developers to select the most suitable data model for their use case. When combined with Java, a language renowned for its portability and platform independence, the adaptability of NoSQL databases can be fully utilized.

Improved Performance and Efficiency
Performance is a crucial aspect of database management, and NoSQL databases excel in this area because of their distributed nature and optimized storage mechanisms. When developers combine these performance-enhancing features with Java, they can create applications that are not only efficient but also high-performing.

Embracing the Polyglot Philosophy
The polyglot philosophy in software development encourages using multiple languages, frameworks, and databases within a single application to take advantage of each one's strengths. Understanding and learning NoSQL databases with Java perfectly embodies this approach, offering several benefits for modern software architecture.

Leveraging Eclipse JNoSQL for Success With NoSQL Databases and Java
To fully utilize NoSQL databases with Java, developers can use Eclipse JNoSQL, a framework created to streamline the integration and management of NoSQL databases in Java applications. Eclipse JNoSQL supports over 30 databases and is aligned with the Jakarta NoSQL and Jakarta Data specifications, providing a comprehensive solution for modern data handling needs.

Eclipse JNoSQL: Bridging Java and NoSQL Databases
Eclipse JNoSQL is a framework that simplifies the interaction between Java applications and NoSQL databases. With support for over 30 different NoSQL databases, Eclipse JNoSQL enables developers to work efficiently across various data stores without compromising flexibility or performance. Key features of Eclipse JNoSQL include:

Support for Jakarta Data Query Language: Enhances the power and flexibility of querying across databases
Cursor pagination: Processes large datasets efficiently by utilizing cursor-based pagination rather than traditional offset-based pagination
NoSQLRepository: Simplifies the creation and management of repository interfaces
New column and document templates: Simplify data management with predefined templates

Jakarta NoSQL and Jakarta Data Specifications
Eclipse JNoSQL is designed to support the Jakarta NoSQL and Jakarta Data specifications, standardizing and simplifying database interactions in Java applications.

Jakarta NoSQL: This comprehensive framework offers a unified API and a set of powerful annotations, making it easier to work with various NoSQL data stores while maintaining flexibility and productivity.
Jakarta Data: This specification provides an API for easier data access across different database types, enabling developers to create custom query methods on repository interfaces.

Introducing Eclipse JNoSQL 1.1.1
The latest release, Eclipse JNoSQL 1.1.1, includes significant enhancements and new features, making it a valuable tool for Java developers working with NoSQL databases. Key updates include:

Support for cursor pagination
Support for Jakarta Data Query
Several bug fixes and performance enhancements

For more details, visit the Eclipse JNoSQL Release 1.1.1 notes.

Practical Example: Java SE Application With Oracle NoSQL
To illustrate the practical use of Eclipse JNoSQL, let's consider a Java SE application using Oracle NoSQL. This example showcases the effectiveness of cursor pagination and JDQL for querying. The first pagination method we will discuss is cursor pagination, which offers a more efficient way to handle large datasets than traditional offset-based pagination. Below is a code snippet demonstrating cursor pagination with Oracle NoSQL.

Java
@Repository
public interface BeerRepository extends OracleNoSQLRepository<Beer, String> {

    @Find
    @OrderBy("hop")
    CursoredPage<Beer> style(@By("style") String style, PageRequest pageRequest);

    @Query("From Beer where style = ?1")
    List<Beer> jpql(String style);
}

public class App4 {

    public static void main(String[] args) {
        var faker = new Faker();
        try (SeContainer container = SeContainerInitializer.newInstance().initialize()) {
            BeerRepository repository = container.select(BeerRepository.class).get();
            for (int index = 0; index < 100; index++) {
                Beer beer = Beer.of(faker);
                // repository.save(beer);
            }

            PageRequest pageRequest = PageRequest.ofSize(3);
            var page1 = repository.style("Stout", pageRequest);
            System.out.println("Page 1");
            page1.forEach(System.out::println);

            PageRequest pageRequest2 = page1.nextPageRequest();
            var page2 = repository.style("Stout", pageRequest2);
            System.out.println("Page 2");
            page2.forEach(System.out::println);

            System.out.println("JDQL query: ");
            repository.jpql("Stout").forEach(System.out::println);
        }
        System.exit(0);
    }
}

In this example, BeerRepository efficiently retrieves and paginates data using cursor pagination. The style method employs cursor pagination, while the jpql method demonstrates a JDQL query.

API Changes and Compatibility Breaks in Eclipse JNoSQL 1.1.1
The release of Eclipse JNoSQL 1.1.1 includes significant updates and enhancements aimed at improving functionality and aligning with the latest specifications. However, it's important to note that these changes may cause compatibility issues for developers, which need to be understood and addressed in their projects.

1. Annotations Moved to Jakarta NoSQL Specification
Annotations like Embeddable and Inheritance were previously included in the Eclipse JNoSQL framework. In the latest version, however, they have been relocated to the Jakarta NoSQL specification to establish a more consistent approach across various NoSQL databases. As a result, developers will need to update their imports and references to these annotations.

Java
// Old import
import org.jnosql.mapping.Embeddable;

// New import
import jakarta.nosql.Embeddable;

The updated annotations can be accessed at the Jakarta NoSQL GitHub repository.

2. Unified Query Packages
To simplify and unify the query APIs, SelectQuery and DeleteQuery have been consolidated into a single package.
Consequently, specific query classes like DocumentQuery, DocumentDeleteQuery, ColumnQuery, and ColumnDeleteQuery have been removed.

Impact: Any code using these removed classes will no longer compile and must be refactored to use the new unified classes.
Solution: Refactor your code to use the new query classes in the org.eclipse.jnosql.communication.semistructured package. For example:

Java
// Old usage
DocumentQuery query = DocumentQuery.select().from("collection").where("field").eq("value").build();

// New usage
SelectQuery query = SelectQuery.select().from("collection").where("field").eq("value").build();

Similar adjustments will be needed for delete queries.

3. Migration of Templates
Templates such as ColumnTemplate, KeyValueTemplate, and DocumentTemplate have been moved from the Jakarta specification to Eclipse JNoSQL.

Java
// Old import
import jakarta.nosql.document.DocumentTemplate;

// New import
import org.eclipse.jnosql.mapping.document.DocumentTemplate;

4. Default Query Language: Jakarta Data Query Language (JDQL)
Another significant update in Eclipse JNoSQL 1.1.1 is the adoption of Jakarta Data Query Language (JDQL) as the default query language. JDQL provides a standardized way to define queries using annotations, making it simpler and more intuitive for developers.

Conclusion
The use of a NoSQL database is a powerful asset in modern applications. It allows software architects to employ polyglot persistence, utilizing the best persistence capability in each scenario. Eclipse JNoSQL assists Java developers in implementing these NoSQL capabilities into their applications.
In this article, we'll explore how to build intelligent AI agents using Azure OpenAI and Semantic Kernel (Microsoft's C# SDK). You can combine it with OpenAI, Azure OpenAI, Hugging Face, or any other model. We'll cover the fundamentals, dive into implementation details, and provide practical code examples in C#. Whether you're a beginner or an experienced developer, this guide will help you harness the power of AI for your applications.

What Is Semantic Kernel?
In Kevin Scott's talk on "The era of the AI copilot," he showcased how Microsoft's Copilot system uses a mix of AI models and plugins to enhance user experiences. At the core of this setup is an AI orchestration layer, which allows Microsoft to combine these AI components to create innovative features for users. For developers looking to create their own copilot-like experiences using AI plugins, Microsoft has introduced Semantic Kernel.

Semantic Kernel is an open-source framework that enables developers to build intelligent agents by providing a common interface for various AI models and algorithms. The Semantic Kernel SDK lets you integrate the power of large language models (LLMs) into your own applications: developers can send prompts to LLMs, work with the results in their code, and potentially craft their own copilot-like experiences. It allows developers to focus on building intelligent applications without worrying about the underlying complexities of the AI models. Semantic Kernel is built on top of the .NET ecosystem and provides a robust and scalable platform for building intelligent apps and agents.

Figure courtesy of Microsoft

Key Features of Semantic Kernel
Modular architecture: Semantic Kernel has a modular architecture that allows developers to easily integrate new AI models and algorithms.
Knowledge graph: Semantic Kernel provides a built-in knowledge graph that enables developers to store and query complex relationships between entities.
Machine learning: Semantic Kernel supports various machine learning algorithms, including classification, regression, and clustering.
Natural language processing: Semantic Kernel provides natural language processing capabilities, including text analysis and sentiment analysis.
Integration with external services: Semantic Kernel allows developers to integrate with external services, such as databases and web services.

Let's dive into writing some intelligent code using the Semantic Kernel C# SDK. The steps below make it easy to follow along.

Step 1: Setting up the Environment
You will need the following to follow along:
.NET 8 or later
Semantic Kernel SDK (available on NuGet)
Your preferred IDE (Visual Studio, Visual Studio Code, etc.)
Azure OpenAI access

Step 2: Creating a New Project in VS
Open Visual Studio and create a blank, empty .NET 8 console application.

Step 3: Install NuGet References
Right-click on the project and choose Manage NuGet Packages to install the two latest NuGet packages:
1) Microsoft.SemanticKernel
2) Microsoft.Extensions.Configuration.Json

Note: To avoid hardcoding the Azure OpenAI key and endpoint, I am storing them as key-value pairs in appsettings.json; using the second package, I can easily retrieve them by key.

Step 4: Create and Deploy an Azure OpenAI Model
Once you have obtained access to the Azure OpenAI service, log in to the Azure portal or Azure OpenAI Studio to create an Azure OpenAI resource.
The screenshots below are from the Azure portal. You can also create an Azure OpenAI service resource using the Azure CLI by running the following command:

PowerShell
az cognitiveservices account create -n <nameoftheresource> -g <Resourcegroupname> -l <location> \
    --kind OpenAI --sku s0 --subscription subscriptionID

You can also see your resource in Azure OpenAI Studio by navigating to this page and selecting the resource that was created.

Deploy a Model
Azure OpenAI includes several types of base models, shown in the studio when you navigate to the Deployments tab. You can also create your own custom models from existing base models as your requirements dictate. Let's use the deployed GPT-35-turbo model and see how to consume it in Azure OpenAI Studio. Fill in the details and click Create. Once the model is deployed, grab the Azure OpenAI key and endpoint and paste them into the appsettings.json file as shown below.

Step 5: Create the Kernel in the Code
Step 6: Create a Plugin to Call the Azure OpenAI Model
Step 7: Use the Kernel To Invoke the LLM Models
Once you run the program by pressing F5, you will see the response generated from the Azure OpenAI model.

Complete Code
C#
using Microsoft.Extensions.Configuration;
using Microsoft.SemanticKernel;

var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: true)
    .Build();

var builder = Kernel.CreateBuilder();
builder.Services.AddAzureOpenAIChatCompletion(
    deploymentName: config["AzureOpenAI:DeploymentModel"] ?? string.Empty,
    endpoint: config["AzureOpenAI:Endpoint"] ?? string.Empty,
    apiKey: config["AzureOpenAI:ApiKey"] ?? string.Empty);

var semanticKernel = builder.Build();

Console.WriteLine(await semanticKernel.InvokePromptAsync("Give me shopping list for cooking Sushi"));

Conclusion
By combining LLMs with Semantic Kernel, you'll create intelligent applications that go beyond simple keyword matching. Experiment, iterate, and keep learning to build remarkable apps that truly understand and serve your needs.
The rapid integration of artificial intelligence (AI) into software systems brings unprecedented opportunities and challenges for the software development community. As developers, we're not only responsible for building functional AI systems, but also for ensuring they operate safely, ethically, and responsibly. This article delves into the technical details of the NIST AI Risk Management Framework, providing concrete guidance for software developers building and deploying AI solutions.

Image from NIST webpage

The NIST framework lays out four important steps for AI developers to adopt to reduce the risk associated with AI.

1. Govern: Setting up the Fundamentals
Governance is the most important step and the foundation of this framework. Effective governance of AI risk starts with solid technical groundwork. To implement robust governance, developers of AI systems should explore the following approaches:

Version control and reproducibility: Implement rigorous version control for datasets, model architectures, training scripts, and configuration parameters. This ensures reproducibility, enabling tracking of changes, debugging issues, and auditing model behavior.
Documentation and code review: Establish clear documentation requirements for all aspects of AI development. Conduct thorough code reviews to identify potential vulnerabilities, enforce coding best practices, and ensure adherence to established standards.
Testing and validation frameworks: Build comprehensive testing frameworks to validate data quality, model performance, and system robustness. Employ unit tests, integration tests, and regression tests to catch errors early in the development cycle. A small testing sketch follows Table 1 below.

Table 1: Examples of Technical Governance Approaches
Aspect: Version Control. Approach: Utilize Git for tracking code, data, and model versions. Example: Document commit messages with specific changes; link to relevant issue trackers.
Aspect: Documentation. Approach: Use Sphinx or MkDocs to generate documentation from code comments and Markdown files. Example: Include API references, tutorials, and explanations of design decisions.
Aspect: Testing. Approach: Employ frameworks like Pytest or JUnit for automated testing. Example: Write tests for data loading, model training, prediction accuracy, and security vulnerabilities.
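As a purely illustrative companion to the Testing row in Table 1, here is a minimal pytest sketch. The data loader, the synthetic data, and the 0.8 accuracy threshold are assumptions for demonstration, not part of the NIST framework or any particular project.

Python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def load_training_data() -> pd.DataFrame:
    # Stand-in for a real project loader; replace with your own data source.
    X, y = make_classification(n_samples=500, n_features=5, class_sep=2.0, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
    df["label"] = y
    return df


def test_data_loading_has_no_missing_values():
    # Data quality check: the loaded frame should contain no nulls.
    df = load_training_data()
    assert not df.isnull().any().any(), "training data contains missing values"


def test_model_meets_minimum_accuracy():
    # Regression test: a freshly trained model should clear a minimum accuracy bar.
    df = load_training_data()
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    assert model.score(X_test, y_test) >= 0.8, "accuracy regression detected"

Running pytest against a file like this in CI catches data and model regressions early in the development cycle.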
2. Map: Identifying Technical Risks in AI Systems
Understanding the technical nuances of AI systems is crucial for identifying potential risks. Some of the key areas to explore for mapping AI risks are:

Data quality and bias: Assess the quality and representativeness of training data. Identify potential biases stemming from data collection, labeling, or sampling methodologies. Implement data pre-processing techniques (e.g., outlier detection, data cleaning) to mitigate data quality issues.
Model robustness and adversarial attacks: Evaluate the vulnerability of AI models to adversarial examples (inputs designed to mislead the model). Implement adversarial training techniques to enhance model robustness and resilience against malicious inputs.
Security vulnerabilities: Analyze the software architecture for security flaws. Implement secure coding practices to prevent common vulnerabilities like SQL injection, cross-site scripting, and authentication bypass. Employ penetration testing and vulnerability scanning tools to identify and address security weaknesses.

Table 2: Examples of Technical Risk Identification
Risk Category: Data Bias. Description: Training data reflects historical or societal biases. Example: An AI-powered credit card approval model trained on data with historical bias against certain demographic groups might unfairly deny credit cards to individuals from those groups.
Risk Category: Adversarial Attacks. Description: Maliciously crafted inputs designed to fool the model. Example: An image recognition system could be tricked by an adversarial image into misclassifying a positive as a negative result.
Risk Category: Data Poisoning. Description: Injecting malicious data into the training dataset to compromise model performance. Example: An attacker could insert corrupted data into a spam detection system's training set, causing it to misclassify spam messages as legitimate.

3. Measure: Evaluating and Measuring Technical Risks
Evaluating the technical severity of risks requires quantitative metrics and rigorous analysis. A few approaches we could deploy to measure the performance of AI include:

Model performance metrics: Utilize relevant performance metrics to assess model accuracy, precision, recall, and F1 score. Monitor these metrics over time to detect performance degradation and identify potential retraining needs. A minimal metrics sketch follows Table 3 below.
Explainability and interpretability: Implement techniques like LIME (Local Interpretable Model-agnostic Explanations) or SHAP (SHapley Additive exPlanations) to understand model decision-making processes. Utilize visualization tools to interpret model behavior and identify potential biases.
Security assessment tools: Employ static code analysis tools to identify security flaws in the source code. Use dynamic analysis tools (e.g., fuzzing, penetration testing) to uncover vulnerabilities in running systems.

Table 3: Technical Risk Measurement Techniques
Technique: Confusion Matrix. Description: Visualizes the performance of a classification model by showing true positives, true negatives, false positives, and false negatives. Example: Analyzing a confusion matrix can reveal whether a model is consistently misclassifying certain categories, indicating potential bias.
Technique: LIME. Description: Generates local explanations for model predictions by perturbing input features and observing the impact on the output. Example: Using LIME, you can understand which features were most influential in a specific loan denial decision made by an AI model.
Technique: Penetration Testing. Description: Simulates real-world attacks to identify security vulnerabilities in a system. Example: A penetration test could uncover SQL injection vulnerabilities in an AI-powered chatbot, enabling attackers to steal user data.
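To make the measurement step concrete, here is a minimal sketch of the metrics named above (accuracy, precision, recall, F1, and a confusion matrix) computed for a classifier trained on synthetic data; the model choice and dataset are assumptions for illustration only.

Python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real labeled dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Track these metrics over time to detect performance degradation
# and identify potential retraining needs.
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

# The confusion matrix shows whether certain classes are consistently
# misclassified, which can indicate potential bias.
print(confusion_matrix(y_test, y_pred))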
4. Manage: Implementing Risk Controls
Managing technical risks demands the implementation of robust controls and mitigation strategies. Some of the strategies to explore for managing technical risks are:

Data de-biasing techniques: Implement techniques like re-weighting, data augmentation, or adversarial de-biasing to address biases in training data. If possible, conduct fairness audits using appropriate metrics to evaluate the fairness of model outcomes.
Secure software development practices: Adhere to secure coding principles to minimize security vulnerabilities. Use strong authentication mechanisms, encrypt sensitive data, and implement access control measures to safeguard systems and data.
Model monitoring and anomaly detection: Establish continuous monitoring systems to track model performance and detect anomalies. Implement techniques like statistical process control or machine learning-based anomaly detection to identify deviations from expected behavior.

Table 4: Technical Risk Mitigation Strategies
Risk: Data Bias. Mitigation Strategy: Data augmentation, i.e., generate synthetic data to increase the representation of underrepresented groups. Example: Augment a dataset for facial recognition with synthetic images of individuals from diverse ethnic backgrounds to reduce bias.
Risk: Adversarial Attacks. Mitigation Strategy: Adversarial training, i.e., train the model on adversarial examples to improve its robustness against such attacks. Example: Use adversarial training to improve the resilience of an image classification model against attacks that aim to manipulate image pixels.
Risk: Data Poisoning. Mitigation Strategy: Data sanitization, i.e., implement rigorous data validation and cleaning processes to detect and remove malicious data. Example: Employ anomaly detection algorithms to identify and remove outliers or malicious data points injected into a training dataset.

Conclusion
As AI developers, we play a pivotal role in shaping the future of AI. By integrating the NIST AI Risk Management Framework into our development processes, we can build AI systems that are not only technically sound but also ethically responsible, socially beneficial, and worthy of public trust. This framework empowers us to address the technical complexities of AI risk, allowing us to create innovative solutions that benefit individuals, organizations, and society as a whole.