Plotting Large Datasets Without Waiting Forever

Many years ago in his landmark 1977 book, Exploratory Data Analysis, John Tukey made a comment that in his pre-PC, pre-Excel, pre-most-everything-we-think-we-need-to-do-analysis time was pretty radical, what which is today pretty common-sense:

“…there is never a good reason to not look at a plot of your data.”

Any test engineer worth his or her salt will admit that we often do our best work when we heed that truly sage advice. Looking at a plot allows us to see the data’s basic “shape” and begin to develop something of an intuitive understanding of relationships that might exist in the data. The problem is that in a world where “Big Data” is the catchphrase of the day, that simple advice can be difficult to put into practice. It has become increasingly common to see people on the user forum wondering how to effectively view work with datasets that incorporate hundreds, thousands or even hundreds of thousands of datapoints.

Seeing the Problem

To assist us in visualizing some of the problems inherent in handling large datasets, I have put together a test dataset consisting of 3 traces, each with over 19,000 datapoints. Now when I just read the data and plot it, this is what I get:

Entire dataset with white bars

Clearly there is an issue here – I mean what is up with the wide vertical bars? But there is an even larger problem. Let’s say I change the size of the plot by making it just 9 pixels wider.

But now the bars have moved (9px)

Now what is going on? The white bars have changed and if you look at the peaks in the data carefully, some of them appear to have moved or even disappeared. In order to get your head wrapped around what is happening, consider what LabVIEW is having to do behind the scenes. I mentioned that the dataset had over 19,000 datapoints (19,555, to be exact) but the active plot area of the display is only 350 pixels wide. If you do the math, you discover that to generate this plot, each pixel has to represent about 57 datapoints. The problem of course is that you can’t subdivide a pixel into 57 pieces. So what is LabVIEW to do?

Well it does what any graphing package does when it is confronted with this challenge: it decimates the data. In other words it takes 57-datapoint chunks of the data, performs some sort of statistical operation on each chunk (min, max, mean, etc) and then uses the resulting summary value to represent that chunk of data on the plot. There are several potential problems with this way of handling the situation, but they typically don’t become an issue unless the dataset being plotted is very large relative to the size of the graph. For example, this is why the data on the graph appeared to change as a function of the size of the plot area. As the plot area changed (even slightly) the chunking changes so the data appears to change as well – you can think of it as sort of visual aliasing.

More subtle problems have to do with the way the graphing routines “summarize” the data chunks. Depending upon the shape of your dataset, the operations I mentioned earlier can give dramatically different output and to make matters worse you have no idea what techniques the graphing functions are using. But even if you can live with the visual effects there are good reasons to take action to address the issue.

Finally, in order to plot these huge datasets you have to be carrying them around inside your program. Consequently, rather than having just one copy of these monsters, you can have several – perhaps dozens – it all depends on how your code it written. From this discussion we can then see the two imperatives for our solution:

  1. The approach must minimize the number of copies that LabVIEW has to make of the dataset.
  2. It must reduce the number of datapoints that actually need to be plotted.

Let’s start by looking at the data management aspect of the problem, remembering of course that these two issues are inextricably linked together.

Low-Overhead Storage

Decades ago, people in the nascent computer-science discipline realized that if you had a value, like an array, that consisted of multiple items, the most efficient way of making it available throughout your code was to store it in one location in memory and give the code that needed to access it a “pointer” that served as a reference to that value. Originally this mechanism was pretty primitive with the pointer often consisting of simply the value’s starting address in RAM. In addition, there was no real way of preventing race conditions or security intrusions because there was no way of controlling access to the data. It would be nice to think that we have learned the errors of our ways and fixed all the holes, but such is not always the case. Remember the “Heartbleed” bug panic from last year?

The good news is that LabVIEW does not suffer from the same problems because while we have at our disposal a mechanism that fills the same role as the primitive pointer, it lacks the problems. I am talking about the Data Value Reference, or DVR. It meets the low-overhead storage mandate by accessing the data through a reference that is only 4 bytes long. The DVR is also secure because the buffer that is creates is strongly typed, meaning that you can’t just store anything in it or read whatever you want from it. The data going in and coming out must match the definition of the data structure that was used when the DVR was defined. Finally, the DVR removes problems resulting from simultaneous access to the same resource by defining a new structure that automatically serializes access on a first-come, first-served basis. So the first thing we need to do is get our data into the DVR, and here’s some code to do just that.

Load  Big Data

The VI starts by reading a binary file containing the data which, to simplify this example, is already formatted correctly for how we are going to use it. The resulting array drives a box called an inplace structure that guarantees there will be no other accesses to the DVR occurring in parallel with this one. However, the structure does something else too: Inplace structures operate something like compiler directives telling the LabVIEW compiler that its OK to attempt additional optimizations that would not otherwise be safe to make. For example, they allow to LabVIEW operates on the inplace data without making the copies that the compiler might otherwise need make.

The other thing to note is that funny-looking function in the middle of the inner inplace structure. It’s called Swap Values and its help description really doesn’t do it justice. If all you did was read the context help you might assume that it is simply some sort of switch for routing signals around, and stifling a yawn, go on to consider matters that you think look more exciting. To see why you should consider this function very exciting, we need to look under LabVIEW’s hood.

To store data internally, LabVIEW uses memory data buffers. In fact much of what we think of as “dataflow” consists of the manipulation of those buffers. Now when LabVIEW stores a complex datatype like a cluster (which is what the DVR in this case is holding) it uses a combination of techniques. For simple fix-sized data like numerics or booleans, LabVIEW simply includes the data values directly in the cluster’s memory space. However, it needs a different approach when storing data values like arrays or strings that can vary in length. When a cluster includes an item that can change in size, the item is stored outside the cluster in its own memory buffer and the cluster only holds a reference to that buffer. That way if the item changes in size it can do so without effecting the memory allocation of the cluster containing it.

However this explanation also reveals why the Swap Values node is so important. Let’s look at this code from the standpoint of buffers. Coming into the inner inplace structure there are two buffers allocated that are holding arrays: One contains the data I read from the file, and one the (empty) array that is contained in the cluster that is the contents of the DVR. Now there are two ways that we could initialize that array. The most obvious one is to leave the unbundle (left) side of the cluster inplace structure unwired and wire the array containing the data directly to the bundle (right) side of the cluster inplace structure. While this would work, coding it in that way would result in LabVIEW needing to copy the data contained in the incoming array’s buffer to the array buffer associated with the cluster – and the larger the dataset is, the longer this copy can take.

Now consider what happens when Swap Values is used. Although the node resides inside an inplace structure, it would seem logical that you can’t replace an empty array with an array containing thousands of datapoints in place. Well actually you can. The key point to remember is that at a very low level, the clusters don’t actually contain the arrays, rather they hold references that point to the arrays that are associated with them. So what Swap Values does is it leaves the two arrays in place and simply swaps the references that the clusters contain. Thanks to this optimization, populating this cluster with data will take the exact same amount of time whether the input data contains 2 datapoints or 200,000 datapoints because the only thing that is really being moved is a pair of 4-byte memory buffer references.

Getting Data Out

So we have gotten our data into the DVR as efficiently as we can, but if this storage is going to be of any use, there clearly needs to be a way to get data out of it as well. However, here we face the issue of plotting data that is too large. At the same time we are pulling it out, we also need to be reducing or decimating it to more closely match the size of the available graphing area. To meet those dual requirements I created this VI.

Read and Decimate Big Data

At first this code might seem intimidating, but if you take it step-by-step and analyse what it’s doing, it isn’t really so very different from the example we looked for initializing the data in the DVR. Starting at the left side, the code unbundles the data array from the DVR and passing it into a loop that will execute three times – once for each plot in the dataset. The first point of optimization is in how this loop operates. Note the node with the “P” in it. The presence of this node means that the for loop is set for parallel operation. There are many situations where, even though you specify a for loop, there is no logical reason that the iterations have to operate sequentially. When LabVIEW encounters a “parallelized” loop the optimizer essentially flattens the loop out, creates the necessary parallel code to execute each iteration simultaneously, and then reassemble the output data in the correct order. To find out if a loop is parallelizable, there is an option under Tools>>Profile called Find Parallelizable Loops…. This operation opens a dialog that allows you to identify the loops that can and cannot be run in parallel mode.

Inside the loop, the array drives an inplace structure that indexes out one element, and the resulting cluster feeds a second inplace structure that unbundles the two items in the cluster. The processing of this data occurs in two distinct steps. First the Start and Length inputs produce a subset of the total dataset representing the portion of the data that is to be displayed. Because this operation causes LabVIEW to copy the selected data into a new memory buffer, the code passes the resulting arrays into another inplace structure to ensure that the subset will also be manipulated inplace.

The code inside this inner-most inplace structure performs the second half of the processing – the decimation to reduce the size of the data being plotted. Note that if the selected portion of the dataset is already smaller than the width of plot area, the following code is bypassed. The first step in the decimation process is to reshape the 1D array into a 2D array where each row contains one chunk of data to be statistically summarized. To obtain the final X values, the code takes the first value of each chunk, while the final Y values are the maximum Y for each chunk. Note that this processing occurs in another parallelized loop that auto-indexes the output arrays, which are swapped into the output dataset as they work their way back out through the inplace structures.

Summarizing Options and Challenges

The real heart of this VI is the function that is being used to summarize the Y values for each chunk of data. Right now, I am using the function that returns the minimum and maximum values contained in the array. One of the advantages that it offers is that it is deals well with datasets containing missing datapoints represented by the value NaN. This consideration is important because it is a common (and valuable) practice to represent missing data points using that value. Without the NaN datapoints, any graph will simply connect the dots on either side of the missing datapoints resulting in a graph that visually misrepresents the data being presented. However, with the NaN values, the missing points are shown as breaks in the line (or gaps between bars), thus highlighting the missing data.

The statistical function I selected to summarize the data in the chunks simply returns the minimum and maximum values of the elements that are not NaN. However, most other analysis routines follow the basic rule that any calculation which has NaN as an operand will return an answer of NaN – which in this situation will not be real helpful. More often, what you will want is, for example, the average value of the datapoints that are present in the dataset chunk. If you are wanting to use the chunk mean or median value to summarize a dataset that you know contains NaN value, you should include something like this before the statistical operation:

Filtering out NaN

Basically it works by first sorting the array to move any NaN values to the end of the array. It then looks for the first NaN and simply trims off it (and anything after it). This works because a mean operation doesn’t care about data order, and the first thing a median function does is sort the data anyway.

Let’s See How it Works

When you run the top-level VI in the linked project, the graph that comes up will look a lot like the first image in this post, but minus the vertical white bars. As you make changes to the display that effect the X axis range, you will notice that the resulting image will zoom in on the data, showing ever greater levels of detail. Try manually typing in new X axis end points, or use the horizontal zoom tool on the graph palette to select a range of data points that you want to zoom in on.

Zoom in far enough and you will see why there were white bars on the original plot: There are a lot of missing datapoints. Using the default decimation resulted in wide white bars because the presence of the NaN values effectively hid dozens of real datapoints.

Plotting Large Datasets – Release 1
Toolbox – Release 11

Hopefully this discussion will give you something to think about, and experiment with.

The Big Tease

One of the things that developers often have to face is adding functionality to an existing VI without disrupting, or even modifying what is already there. One way to accomplish this (seemingly impossible) task is to use what are sometimes called “drop-in” VIs. These routines are simply dropped down on an existing block diagram and they do what they do without interaction with the existing code. To demonstrate how this could work, next time we’ll get back to our test bed application and give it the ability to customize the font and size of the test that are on its various displays.

Until Next Time…
Mike…