Large Scale Data Analysis

Matt Walker

Subscribe to Matt Walker: eMailAlertsEmail Alerts
Get Matt Walker: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn


Related Topics: Java Developer Magazine, OpenMP

Java Developer : Article

It's a Multi-Core World: Let the Data Flow

A functional parallelism paradigm that fits multi-core processor architecture

Though the potential for distinguishing between the process language and the language or mechanism for coordinating these processes was recognized early on, it wasn't until the late 1980s and early 1990s that the idea drew much attention. Thomas M. Parks presented a Kahn process network scheduling policy in his PhD thesis that ensured bounded queues for infinite inputs, making practical implementations realizable. Simultaneously, projects focused on the software engineering aspects of dataflow began showing up. While not technically dataflow because it doesn't obey the dataflow execution model, J. Paul Morrison's flow-based programming explored the reusability of large-grain processes implemented in common programming languages. He applied these ideas to large systems in the banking industry, empirically measuring an improvement in programmer productivity. More modern frameworks combine the engineering benefits of languages such as Java with the dataflow execution model.

Application and Existing Implementations
Now that you have some background in dataflow technology, you're probably starting to see applications for it. The traditional arena for dataflow is signal processing, since that's where it got its start. And dataflow is still used in that area of the industry today. This is especially true in the academic world. A quick search for dataflow on the Web will show many universities with research activity in the area of signal processing using dataflow technology.

Along those lines, the LabVIEW toolset created by National Instruments has an architecture based on dataflow technology (www.ni.com/labview/). The outstanding user experience offered by this toolset lets a user build up a dataflow graph of data collection and data processing nodes very quickly. It appears to have used dataflow concepts as more of a functional paradigm than for performance. However, with the advent of multi-core and the availability of processing power at much more affordable price points, the LabVIEW toolset is poised to provide highly scalable processing due to its use of dataflow architecture.

There are other applications for dataflow technology beyond signal processing. The pipeline nature of dataflow implementations provides a natural fit for data processing applications. Since the data is pipelined in a dataflow architecture, massive amounts of data can be processed in a highly scalable way. This ability implies that dataflow techniques can be applied to many industry problems, including:

  • Data mining and data warehousing
  • Data analytics and business intelligence
  • ETL (extract, transform, and load) processing
  • Data quality
  • Fraud detection

One of the hybrid approaches to dataflow implementation has been used to create massively scalable data processing engines. Way back in the 1990s, a start-up company called Torrent created a C++-based dataflow framework named Orchestrate. This framework implemented dataflow techniques and could run across a cluster of homogeneous systems. Several Torrent customers created business intelligence applications using this dataflow framework. Torrent was eventually acquired by Ascential, which was then acquired by IBM in 2005.

Another data processing engine using dataflow architecture has been built with 100% Java technology. This engine, available from Pervasive Software, uses more of the style of flow-based programming (www.pervasivedatarush.com). It's currently available as part of a free public beta program sponsored by Pervasive. See the sidebar for more information.

Conclusion
As the information age and the multi-core wave continue to collide, more pressure is exerted on software developers to provide access to increasingly inexpensive computing horsepower. High-performance computing was once the domain of system experts, government agencies and universities. Now it's in demand by large as well as medium-sized companies with massive amounts of data to process and shrinking time windows. Compute power that used to cost millions is now available in systems that can be ordered on the Web for $20k.

All of this leads to the need for a better programming paradigm: a paradigm that encourages the developer to build highly performing and scalable software without the burden of low-level system knowledge. Dataflow is one such paradigm. Dataflow, being a pipelined architecture, is inherently scalable. And as we've pointed out, dataflow concepts have been around for many years. They've undergone change and growth as the ideas have matured in the academic community.

There are also several commercial implementations of dataflow in the marketplace, a sign that dataflow technology is real and has many benefits to bring to the software development community. We encourage you to investigate dataflow concepts and determine how they can fit into your software architectures going forward.

Resources
• Johnston, Hanna, and Millar. "Advances in Dataflow Programming Languages." 2004.
http://portal.acm.org/citation.cfm?id=1013208.1013209
• Najjar, Lee, and Gao. "Advances in the Dataflow Computational Model." 1999.
http://ptolemy.eecs.berkeley.edu/publications/papers/99/dataflow/
• Parks. "Bounded Scheduling of Process Networks." 1995.
http://ptolemy.eecs.berkeley.edu/publications/papers/95/parksThesis/
• Harris, Marlow, Jones, and Herlihy. "Composable Memory Transactions." 2006.
http://research.microsoft.com/~simonpj/papers/stm/#composble
• Van Roy and Haridi. "Concepts, Techniques, and Models of Computer Programming." 2004.
WWW.INFO.UCL.AC.BE/~PVR/BOOK.HTML

SIDEBAR

Pervasive DataRush: A Dataflow Implementation

The article on dataflow concepts introduced you to dataflow technology and its relevance to multi-core processing. Here we'll discuss an implementation of dataflow technology built with Java. The application framework is called Pervasive DataRush and is currently in beta release from Pervasive Software.

DataRush is an application development framework. Its purpose is to enable the user to build data processing applications that can easily take advantage of multi-core processors to produce highly scalable software. DataRush implements many dataflow concepts and extends some of these concepts to provide several ways to dramatically increase scalability of applications.

DataRush implements a scripting language, based on XML that provides the means to create reusable components and data processing applications. This language, called DFXML (dataflow XML), is simple in syntax and very flexible. DFXML is used to compose what are called assemblies. An assembly is a composite operator and can be composed of other assemblies, processes, and customizers. A process is the lowest level of operator. It's written in Java and performs the work in the dataflow graph. A customizer is a compiler helper also written in Java that enables the dynamic nature of the DataRush framework. DataRush includes a component library that consists of more than 50 pre-built, ready-to-use components provided by the framework. Included are components that provide connectivity to data and data processing components such as sort, join, merge, lookup, group, and so on.

The architecture block diagram in Figure 1 depicts the high-level architecture of the DataRush framework.

The user utilizes an IDE such as Eclipse to create DFXML assemblies and Java processes and customizers (www.eclipse.org). The DataRush assembler is used to convert DFXML scripts into a binary form. The DataRush execution engine is then invoked to compile the binary files into a dataflow graph for execution. This compilation step is run for every engine instantiation, providing a very dynamic graph-generation capability. Once the dataflow graph is generated, the DataRush engine creates threads and dataflow queues representing the graph and executes the graph. Execution monitoring is provided via JMX.


More Stories By Jim Falgout

Jim Falgout has 20+ years of large-scale software development experience and is active in the Java development community. As Chief Technologist for Pervasive DataRush, he’s responsible for setting innovative design principles that guide the company’s engineering teams as they develop new releases and products for partners and customers. He applied dataflow principles to help architect Pervasive DataRush.

Prior to Pervasive, Jim held senior positions with NexQL, Voyence Net Perceptions/KD1 Convex Computer, Sequel Systems and E-Systems. Jim has a B.Sc. (Cum Laude) in Computer Science from Nicholls State University. He can be reached at [email protected]

More Stories By Matt Walker

Matt Walker is a scientist at Etsy, where he is building out their data analytics platform and researching techniques for search and advertising. Previously, he worked as a researcher at Adtuitive and as an engineer at Pervasive Software. He holds an MS in computer science from UT and received his BS in electrical and computer engineering from Rice University.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.