Large Scale Data Analysis

Matt Walker

Subscribe to Matt Walker: eMailAlertsEmail Alerts
Get Matt Walker: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn

Related Topics: Java Developer Magazine, OpenMP

Java Developer : Article

It's a Multi-Core World: Let the Data Flow

A functional parallelism paradigm that fits multi-core processor architecture

The multi-core buzz is everywhere. Pick up a newspaper and the local electronics mega-store is advertising multi-core desktops and laptops to the consumer. Interesting, but what does it mean to the everyday Java programmer? Maybe nothing. If you live in the application server world writing EJB-based applications your application server does most of the heavy lifting for you. It handles concurrency just fine. But that doesn't cover all applications. Multi-core technology will especially affect applications that must process large amounts of data in a non-transactional (outside of a database context) manner. For this class of applications, the implications of multi-core are huge.

Why? Well first, notice the processing speeds of multi-core processors. They're not getting faster. In fact, they may be slowing down. As manufacturers add more cores to a chip, the processing speed of each core is usually slowed down to prevent overheating. The 80-core chip that Intel demonstrated recently wasn't 80 cores of x86 architecture but a simpler architecture. This may be an industry trend as more and more cores are squeezed onto a chip. The processor architecture may become heterogeneous, with a few full-power "legacy" cores and many specialized cores. The IBM Cell architecture already employs this scheme, with a single PowerPC core at the center and eight SPU cores connected using high-speed interconnects.

One implication you should take away from all of these processor changes: your single-threaded application may actually slow down on a multi-core system. If you need faster runtimes to meet shrinking SLA windows, you 'll have to multithread your applications, now! No problem, right? Java has included the java.util.concurrent package since Java5. This library contains many powerful constructs from which you can build a fully concurrent and scalable application. But, that isn't always easy or straightforward. The java.util.concurrent package is a set of building blocks that you must master and put to good use. There are several good books on this subject. We highly recommend Java Concurrency in Practice by Brian Goetz, for one.

There's a technology that's been around for years called dataflow that can solve the multi-core dilemma. How? This article will go into detail about dataflow, but the gist is this: dataflow provides a functional parallelism paradigm that fits well into multi-core processor architecture. A dataflow instance consists of a directed graph of processing nodes connected with FIFO data queues. This pipelined architecture lets applications be built from small-size reusable components stitched together with queues. The diagram below gives a small example of dataflow graph. Since it's a pipelined architecture, it naturally takes advantage of many processing cores. But more on that later.

First, we'll cover the overall nature and architecture of dataflow technology, specifically focusing on dataflow in software. Then we'll cover the history of dataflow technology, how it was first conceived and has matured and morphed over the years. Then we'll discuss several implementations of dataflow technology that exist in the marketplace today, highlighting a Java implementation. Read on for a peek into a technology that may have been ahead of its time but appears poised for the new multi-core world.

Yet Another Programming Paradigm
You may be asking: why another programming paradigm? First, the languages we have available to us don't directly support the needs of application builders. As mentioned before, the java.util.concurrent package provides most of the constructs needed to build scalable applications. However, these are lower-level building blocks. It can take many months to become familiar with these constructs and apply them right.

Second, the programming frameworks that have traditionally been used to build highly scalable applications have been targeted at the academic and scientific community. Frameworks such as MPI and OpenMP have been used to solve very large complex problems, taking advantage of some of the world's largest computers and computer grids. However, the confluence of the information explosion with the availability of very inexpensive, off-the-shelf hardware has put high-performance computing within reach of even small and medium-sized businesses. These businesses have ever-increasing demands to process more data in shorter time periods. What they don't have is a staff of concurrent programming experts. (See Figure 1)

On the one extreme we have Java and all of the functionality that it provides. The building blocks to create scalable applications are there, but a cost must be paid to tap into this functionality. On the other extreme are frameworks such as MPI and OpenMP. Again, they provide high functionality, but have traditionally been used by the academic and scientific computing world. They are not easy tools to use. Something is needed to bridge this gap to provide high-performance, highly scalable data processing to the business world. Dataflow technology can be one way to bridge that gap.

Dataflow Programming Model
Dataflow is an alternative to the standard von Neumann model of computation. Typically, we think of a program as a series of instructions each executed one after the other by a processor keeping track of its progress with an instruction pointer. In dataflow, on the other hand, channels transmitting data in one direction join computations to one another. Conceptually, you can think of this structure as a directed graph with data channels as edges and processes performing computation on the data as nodes. The processes each operate only when data is available - the data flowing through the network is all that's needed to organize the computation. The immediate advantage is that many of the processes can be operating simultaneously, thus allowing dataflow applications to take advantage of hardware with multiple processor cores. Notice the concurrency happens external to the process; the developer doesn't have to bother with threads, deadlock detection, starvation, or concurrent memory access to build parallelism into his application.

This type of implicit parallelism stands in stark contrast to the concurrency mechanisms of many other programming paradigms. Gone are the locks of concurrent programming in imperative languages like C, which lack composability - two correct snippets of code using locks may not be correct when they're combined. Dataflow, on the other hand, allows composability: as long as the I/O contract is correct, sub-graphs may replace nodes or be spliced between them in the original dataflow network. This facilitates both program correctness, since sub-graphs can be tested as they're constructed then linked together to form larger programs, and code reuse, since commonly used sub-graphs can be copied from one application to the next.

Dataflow process networks bear some relationship to the dataflow variables of declarative programming languages like Oz. A dataflow variable is simply an unbound variable whose value can only be determined by a separate thread operating in parallel. If the dataflow variable is referenced before it's been bound, the referencing thread pauses awaiting the value. Combined with a single-assignment store (variables can be bound once at most), these variables lead to the nice property that it doesn't matter in what order we evaluate simultaneously executing expressions. Likewise, the outcome of a dataflow network is determined uniquely by its input, regardless of the order in which processes fire. Firing order impacts queue sizes and performance, of course, but this can be dealt with elsewhere besides explicitly within the program itself, dramatically simplifying the task of the dataflow programmer. (See Figure 2)

The History of Dataflow
In the early 1970s, many people grew skeptical of the von Neumann architecture's ability to cope with parallelism. The global instruction pointer and memory could both become bottlenecks in concurrent software if it wasn't carefully designed. Dataflow architecture arose as the only compelling competitor. Designed with concurrency in mind, it eliminated the global instruction pointer and memory by organizing the computation based on the flow of data through a network of processes. However, these radically different architectures proved difficult compiler targets for traditional imperative programs. Dataflow programming and languages arose in response to this need.

At this time, Jack B. Dennis developed the static dataflow model and applied it to the design of computer architectures. His model limited nodes to primitive computations, and the edges were seen as representing data dependencies among the various operations occurring in the network - they held only one data token at a time. The work of Gilles Kahn extended this idea in two ways. First, the edges of his dataflow graphs were unbounded first-in, first-out queues, providing for a flexible rate of flow across each node. Second, he allowed each node to be a complete sequential process, which is often called large-grain dataflow. This approach tends to be more effective in creating efficient software since the threads implementing processes are given a larger fraction of work, reducing the amount of time spent switching between them. Further, the model freed the concept of dataflow from defining the language of its processes. Now it was conceivable to implement them in standard programming languages such as C or Java but still have the network of code operate according to dataflow principles.

More Stories By Jim Falgout

Jim Falgout has 20+ years of large-scale software development experience and is active in the Java development community. As Chief Technologist for Pervasive DataRush, he’s responsible for setting innovative design principles that guide the company’s engineering teams as they develop new releases and products for partners and customers. He applied dataflow principles to help architect Pervasive DataRush.

Prior to Pervasive, Jim held senior positions with NexQL, Voyence Net Perceptions/KD1 Convex Computer, Sequel Systems and E-Systems. Jim has a B.Sc. (Cum Laude) in Computer Science from Nicholls State University. He can be reached at

More Stories By Matt Walker

Matt Walker is a scientist at Etsy, where he is building out their data analytics platform and researching techniques for search and advertising. Previously, he worked as a researcher at Adtuitive and as an engineer at Pervasive Software. He holds an MS in computer science from UT and received his BS in electrical and computer engineering from Rice University.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.