Understanding the Scale Limitations of Graph Databases | eWEEK

Graph databases and models have been around for well over a decade, and are among the most impactful technologies to emerge from the NoSQL movement.

Graph data models are natively designed to focus on the relationships within and between data, representing data as nodes connected by edges. As such, the graph model is strikingly similar to the way humans often think and talk.

The node-edge-node pattern in a graph corresponds directly to the subject-predicate-object pattern common to languages like English. So, if you’ve ever used mind-mapping technology or diagrammed ideas on a whiteboard, you’ve created a graph.
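To make the node-edge-node pattern concrete, here is a minimal sketch in Python that models a graph as subject-predicate-object triples. All node and edge names are invented for illustration; they do not come from any real dataset.

```python
# A graph sketched as subject-predicate-object triples.
# Each triple is one node-edge-node link: (node, edge, node).
edges = [
    ("Alice", "works_at", "Acme"),
    ("Alice", "knows", "Bob"),
    ("Bob", "lives_in", "Berlin"),
]

def neighbors(node):
    """Return the (edge, node) pairs reachable from `node` in one hop.

    Querying a graph largely amounts to matching patterns like this
    over the set of triples.
    """
    return [(pred, obj) for subj, pred, obj in edges if subj == node]

print(neighbors("Alice"))  # [('works_at', 'Acme'), ('knows', 'Bob')]
```

The same subject-predicate-object shape underlies mind maps and whiteboard diagrams: "Alice works at Acme" is a sentence, and it is also an edge.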

Graph data models have become part of the standard toolkit for data scientists applying artificial intelligence (AI) to everything from fraud detection and manufacturing control systems to recommendation engines and customer 360s.

Given this broad applicability, it’s no surprise Gartner believes that graph database technologies will be used in more than 80% of data and analytics innovations, including real-time event streaming, by 2025. But as adoption accelerates, limitations and challenges are emerging. And one of the most significant limitations graph databases face is their inability to scale.


Volume and Velocity of Modern Data Generation

Much has changed since the most recent generation of graph databases emerged a decade ago. Enterprises are dealing with previously unimaginable volumes of data to potentially query. That data enters and streams through the enterprise in a variety of channels, and enterprises want to act on that information in real time.

Original graph designs couldn’t have imagined today’s sheer volume of data or the computation power needed to put that data to work. And it’s not just the volume of data dragging graph databases down. It’s the velocity of that data.

While graph databases can excel at computation on moderately sized sets of data at rest, they become siloed and suffer significant tradeoffs when real-time action on streaming data is required. Streaming data is data in motion: it arrives constantly from diverse sources.

And enterprises want to act on it immediately in event-processing pipelines, because when certain events are not caught as they happen, the opportunity to act disappears. Examples include security incidents, transaction processing (such as fraud or credit validations), and automated machine-to-machine actions.

Anomalies and patterns need to be recognized with AI and ML algorithms that can automate (or at least escalate) an action. And that recognition needs to occur before an automated action can proceed.

Graph databases were simply never built for this scenario. They are typically restricted to hundreds or thousands of events per second. But today’s enterprises need to be able to process a velocity of millions of events per second and, in some advanced use cases, tens of millions.

There’s a hard limit both on how quickly graph systems can process data and on how much query complexity (such as the number of hops in a traversal) they can handle. Because of those limits, graph systems often go unused, leaving data engineering teams no option but to recreate graph database-like functionality spread throughout their microservices architecture.
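To see why hop count drives query cost, here is a minimal sketch of a breadth-first k-hop traversal over an invented adjacency list. The frontier of nodes to visit can grow with every additional hop, which is one reason deep multi-hop queries strain graph systems at scale.

```python
from collections import deque

# Illustrative adjacency list; a production graph would have
# millions of nodes and edges.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": ["f"],
    "f": [],
}

def k_hop(start, k):
    """Return every node reachable from `start` in at most k hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # hop budget exhausted along this path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

print(sorted(k_hop("a", 2)))  # ['a', 'b', 'c', 'd', 'e']
```

Each extra hop widens the set of nodes touched, so a query's cost grows with its hop limit even on this toy graph.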


The Rise of Custom Data Pipeline Development

These workarounds to query the event streams in real time require significant effort. Developers typically turn to event stream processing systems like Flink and ksqlDB, which make it possible, but not easy, to use familiar SQL syntax to query the event streams.

It’s not uncommon for enterprises to have teams of data engineers spend months or years developing extensive and complex microservice architectures to meet the scale and speed demands of streaming data. However, these systems tend to lack the expressive query structures needed to find complex patterns in streams efficiently.

As noted, to operate at the volume and velocity that enterprises require, these systems have had to make tough tradeoffs that lead to significant limitations.

For example, time windows restrict a system’s ability to connect events that do not arrive within a narrow interval (often measured in seconds or minutes). Rather than providing critical insight or business value, an event that arrives even seconds too late is simply ignored.

Even with costly limitations like time windows, event stream processing systems have been successful. Many can even scale to process millions of events per second—but with significant effort and limitations that fail to deliver the full power of graph data models.


Innovation Will Rise to Meet Demand

The demand for real-time insights from event data streams, and the value those insights deliver, has never been higher. As adoption accelerates, businesses should expect to see new data infrastructure emerge that eliminates many of the scale struggles holding back the power of graph data models.

About the Author: 

Rob Malnati is the COO of thatDot.
