Large Scale Data Analysis

Matt Walker


Related Topics: Java Developer Magazine, OpenMP


It's a Multi-Core World: Let the Data Flow

A functional parallelism paradigm that fits multi-core processor architecture

Dataflow Implementation
The Pervasive DataRush framework implements many of the basic structures of dataflow. Processing nodes (called processes in DataRush) are written in Java and communicate through dataflow queues. The dataflow queues in DataRush are typed and support the native Java types as well as string, date, timestamp, and binary types.

The dataflow queues in DataRush are somewhat comparable in functionality to the blocking queue implementations in the java.util.concurrent package introduced in the Java 5 release. Both are memory-based queues that block readers on empty queues and block writers on full queues. The DataRush queues, however, must also support deadlock detection and handling. Because a queue can have multiple readers and a process can have multiple inputs and outputs, cycles of dependencies can be created in a dataflow graph. These cycles can lead to deadlock, in which writers and readers wait on one another in a way that requires intervention before the graph can continue working. A deadlock algorithm in the DataRush engine detects deadlock situations and handles them, normally by temporarily expanding the size of the problematic queue.
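To make the blocking behavior concrete, here is a minimal sketch using the standard java.util.concurrent.ArrayBlockingQueue rather than a DataRush queue. The class name, the run method, and the end-of-stream sentinel are assumptions made for illustration only; they are not part of the DataRush API, and the sketch includes no deadlock detection. It uses lambdas, so it needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueuePipeline {
    // End-of-stream marker: an assumption of this sketch, not a DataRush convention.
    private static final Integer END = Integer.MIN_VALUE;

    /** Pushes 0..count-1 through a bounded queue; returns the doubled values the reader saw. */
    static List<Integer> run(int capacity, int count) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        List<Integer> results = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    queue.put(i);               // blocks while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    Integer value = queue.take();   // blocks while the queue is empty
                    if (value.equals(END)) {
                        break;
                    }
                    results.add(value * 2);         // only the reader thread touches results
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();                          // join makes results visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(2, 5));              // prints [0, 2, 4, 6, 8]
    }
}
```

With a capacity of 2, the writer stalls whenever it gets two items ahead of the reader, which is the back-pressure that keeps a pipeline of processes in step.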

Besides the pipeline scalability that a dataflow architecture already provides, the Pervasive DataRush framework has built-in support for two other types of scalability: horizontal partitioning and vertical partitioning. Horizontal partitioning replicates a section of dataflow logic and segments the input data into chunks, flowing the data concurrently through the replicated dataflow sections. Figure 2 depicts this scenario using a lookup component as an example. In this example, the lookup operator is replicated with a data partitioner spreading the data load evenly to each lookup instance. This lets each lookup operator run in parallel, fully utilizing multiple cores on the system. Vertical partitioning supports running different dataflow logic in parallel on each field of an input stream.

Figure 1: The high-level architecture of the Pervasive DataRush framework, including design and execution components. The user works in an IDE such as Eclipse to create DFXML assemblies and Java processes and customizers.

Figure 2: Horizontal partitioning, one of three types of scalability that can be implemented using Pervasive DataRush.
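The horizontal partitioning of a lookup operator described above can be approximated in plain Java. The sketch below is not DataRush code: the class, the lookupAll method, the round-robin partitioner, and the "UNKNOWN" default are all hypothetical stand-ins, and a thread pool takes the place of replicated dataflow sections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HorizontalPartitionSketch {
    /** Deals rows round-robin to `partitions` lookup replicas and runs the replicas concurrently. */
    static List<String> lookupAll(List<Integer> keys, Map<Integer, String> table, int partitions) {
        // Partitioner: spread row indexes evenly so each replica gets a similar load.
        List<List<Integer>> chunks = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < keys.size(); i++) {
            chunks.get(i % partitions).add(i);
        }

        String[] out = new String[keys.size()];     // each replica writes disjoint slots
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<Integer> chunk : chunks) {
                // One replica of the lookup logic per chunk; the table is only read.
                pending.add(pool.submit(() -> {
                    for (int index : chunk) {
                        out[index] = table.getOrDefault(keys.get(index), "UNKNOWN");
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get();                            // wait for every replica to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return Arrays.asList(out);
    }
}
```

Recording each row's original index lets the results be reassembled in input order; a real dataflow engine would more likely stream the partitioned outputs onward through queues.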

Why Java?
As the article on dataflow points out, there have been many instantiations of dataflow technology over the years. Most of them have been implemented in C or C++, which makes sense given the prevalence of those languages when the systems were built. When DataRush was first being developed, the decision was made to use Java as the programming language. This decision was based on several factors: portability, flexibility, extensibility, and scalability - and you can throw in productivity for good measure. The decision was also based on the high level of industry investment in JVM technology. Over the past few years, we've seen significant performance improvements with each JDK release. The number of open source libraries available is also astounding. With such a rich environment, the decision has proved to be a good one.

The question of Java and performance always arises. What we've found, with the introduction of the java.nio package and other JVM performance enhancements, is that native speeds can be obtained from Java. This is especially true for frameworks like DataRush, in which a static set of classes (the process nodes) is used over a relatively long period of time. This scenario provides an environment well suited to JIT compilers.

A Simple Benchmark
To demonstrate the scalability of the DataRush framework, we developed a simple benchmark implementing a one-pass K-means algorithm. The algorithm takes points made up of two double-typed values and clusters the points into like groups. The benchmark measures the performance of running K-means on 100 input columns over 10 million rows of data. For this particular test, the input data is generated. As can be seen from Figure 3, the performance of the benchmark test improves as more CPU resources are made available. A snapshot of the CPU utilization is also provided in Figure 4, showing that the DataRush framework was able to keep the machine heavily utilized for the duration of the test.

Figure 3: Benchmark results of a K-means test run on an 8-core machine, demonstrating how a non-parallelized application fails to scale as more compute resources are added.

Figure 4: CPU usage during the K-means benchmark. The Pervasive DataRush platform has scaled to take full advantage of all 8 cores available on the machine used for this test.
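The article does not show the benchmark code, so the following is only one plausible reading of "one-pass K-means": a sequential variant in which each two-value point is assigned to its nearest centroid and that centroid is immediately updated as a running mean. The class name, the method signature, and the choice to give the initial centroids zero weight are assumptions of this sketch, and it is single-threaded rather than a DataRush graph.

```java
public class OnePassKMeans {
    /**
     * Clusters 2-D points in a single pass.
     * centroids holds k initial guesses as {x, y} pairs and is updated in place.
     * Returns the index of the cluster each point was assigned to.
     */
    static int[] cluster(double[][] points, double[][] centroids) {
        int[] counts = new int[centroids.length];   // points seen so far per cluster
        int[] labels = new int[points.length];

        for (int i = 0; i < points.length; i++) {
            // Find the nearest centroid by squared Euclidean distance.
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dx = points[i][0] - centroids[c][0];
                double dy = points[i][1] - centroids[c][1];
                double dist = dx * dx + dy * dy;
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            labels[i] = best;

            // Running-mean update; the initial guess carries no weight,
            // so the first assigned point replaces it.
            counts[best]++;
            centroids[best][0] += (points[i][0] - centroids[best][0]) / counts[best];
            centroids[best][1] += (points[i][1] - centroids[best][1]) / counts[best];
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {1, 1}, {9, 9}};
        double[][] centroids = {{0, 0}, {10, 10}};
        int[] labels = cluster(points, centroids);
        System.out.println(java.util.Arrays.toString(labels));  // prints [0, 1, 0, 1]
    }
}
```

Because every row is visited exactly once, the work per row is fixed, which makes this kind of algorithm a natural fit for streaming rows through a dataflow graph.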

The DataRush application development framework implements dataflow concepts that enable Java programmers to create highly scalable applications that can process many millions of rows of data. The framework is currently in beta release and is available for download. DataRush is built completely in Java, so it is easy to install and begin using right away. A user interface in the Eclipse IDE is being developed, so please check back with the site periodically for updates on that development. The site also includes more information on DataRush and forums for discussion and questions.

More Stories By Jim Falgout

Jim Falgout has 20+ years of large-scale software development experience and is active in the Java development community. As Chief Technologist for Pervasive DataRush, he’s responsible for setting innovative design principles that guide the company’s engineering teams as they develop new releases and products for partners and customers. He applied dataflow principles to help architect Pervasive DataRush.

Prior to Pervasive, Jim held senior positions with NexQL, Voyence, Net Perceptions/KD1, Convex Computer, Sequel Systems and E-Systems. Jim has a B.Sc. (Cum Laude) in Computer Science from Nicholls State University. He can be reached at [email protected]

More Stories By Matt Walker

Matt Walker is a scientist at Etsy, where he is building out their data analytics platform and researching techniques for search and advertising. Previously, he worked as a researcher at Adtuitive and as an engineer at Pervasive Software. He holds an MS in computer science from UT and received his BS in electrical and computer engineering from Rice University.
