Plotting Large Datasets Without Waiting Forever

Many years ago in his landmark 1977 book, Exploratory Data Analysis, John Tukey made a comment that in his pre-PC, pre-Excel, pre-most-everything-we-think-we-need-to-do-analysis time was pretty radical, but which today is pretty common-sense:

“…there is never a good reason to not look at a plot of your data.”

Any test engineer worth his or her salt will admit that we often do our best work when we heed that truly sage advice. Looking at a plot allows us to see the data’s basic “shape” and begin to develop something of an intuitive understanding of relationships that might exist in the data. The problem is that in a world where “Big Data” is the catchphrase of the day, that simple advice can be difficult to put into practice. It has become increasingly common to see people on the user forum wondering how to effectively view and work with datasets that incorporate hundreds, thousands or even hundreds of thousands of datapoints.

Seeing the Problem

To assist us in visualizing some of the problems inherent in handling large datasets, I have put together a test dataset consisting of 3 traces, each with over 19,000 datapoints. Now when I just read the data and plot it, this is what I get:

Entire dataset with white bars

Clearly there is an issue here – I mean what is up with the wide vertical bars? But there is an even larger problem. Let’s say I change the size of the plot by making it just 9 pixels wider.

But now the bars have moved (9px)

Now what is going on? The white bars have changed and if you look at the peaks in the data carefully, some of them appear to have moved or even disappeared. In order to get your head wrapped around what is happening, consider what LabVIEW is having to do behind the scenes. I mentioned that the dataset had over 19,000 datapoints (19,555, to be exact) but the active plot area of the display is only 350 pixels wide. If you do the math, you discover that to generate this plot, each pixel has to represent about 57 datapoints. The problem of course is that you can’t subdivide a pixel into 57 pieces. So what is LabVIEW to do?

Well, it does what any graphing package does when confronted with this challenge: it decimates the data. In other words, it takes 57-datapoint chunks of the data, performs some sort of statistical operation on each chunk (min, max, mean, etc.) and then uses the resulting summary value to represent that chunk of data on the plot. There are several potential problems with this way of handling the situation, but they typically don’t become an issue unless the dataset being plotted is very large relative to the size of the graph. This, for example, is why the data on the graph appeared to change as a function of the size of the plot area. As the plot area changes (even slightly), the chunking changes, so the data appears to change as well – you can think of it as a sort of visual aliasing.

More subtle problems have to do with the way the graphing routines “summarize” the data chunks. Depending upon the shape of your dataset, the operations I mentioned earlier can give dramatically different output and, to make matters worse, you have no idea which techniques the graphing functions are using. But even if you can live with the visual effects, there are good reasons to take action to address the issue.

Finally, in order to plot these huge datasets you have to be carrying them around inside your program. Consequently, rather than having just one copy of these monsters, you can have several – perhaps dozens – depending on how your code is written. From this discussion we can then see the two imperatives for our solution:

  1. The approach must minimize the number of copies that LabVIEW has to make of the dataset.
  2. It must reduce the number of datapoints that actually need to be plotted.

Let’s start by looking at the data management aspect of the problem, remembering of course that these two issues are inextricably linked together.

Low-Overhead Storage

Decades ago, people in the nascent computer-science discipline realized that if you had a value, like an array, that consisted of multiple items, the most efficient way of making it available throughout your code was to store it in one location in memory and give the code that needed to access it a “pointer” that served as a reference to that value. Originally this mechanism was pretty primitive with the pointer often consisting of simply the value’s starting address in RAM. In addition, there was no real way of preventing race conditions or security intrusions because there was no way of controlling access to the data. It would be nice to think that we have learned the errors of our ways and fixed all the holes, but such is not always the case. Remember the “Heartbleed” bug panic from last year?

The good news is that LabVIEW does not suffer from the same problems because, while we have at our disposal a mechanism that fills the same role as the primitive pointer, it lacks those problems. I am talking about the Data Value Reference, or DVR. It meets the low-overhead storage mandate by accessing the data through a reference that is only 4 bytes long. The DVR is also secure because the buffer that it creates is strongly typed, meaning that you can’t just store anything in it or read whatever you want from it. The data going in and coming out must match the definition of the data structure that was used when the DVR was defined. Finally, the DVR removes problems resulting from simultaneous access to the same resource by defining a new structure that automatically serializes access on a first-come, first-served basis. So the first thing we need to do is get our data into the DVR, and here’s some code to do just that.

Load Big Data
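As an aside: LabVIEW code is graphical, so there is no text listing to show here, but the essential behaviour of a DVR can be sketched in a few lines of Python. Everything below (the DataValueRef class and its methods) is an illustrative assumption of mine, not a LabVIEW API; it just mimics the three properties described above: one stored copy, a small reference, and serialized access.

    import threading
    import numpy as np

    class DataValueRef:
        """Rough analogue of a LabVIEW DVR: a single copy of the data,
        reached only through a small reference, with access serialized."""

        def __init__(self, value):
            self._value = value            # the one and only stored copy
            self._lock = threading.Lock()  # first-come, first-served access

        def modify(self, func):
            """Loose analogue of an inplace element structure: while 'func'
            runs, no other caller can touch the stored value."""
            with self._lock:
                self._value = func(self._value)

        def read(self, func):
            with self._lock:
                return func(self._value)

    # Every part of the program passes around the tiny 'ref',
    # not the multi-thousand-point array itself.
    ref = DataValueRef(np.empty(0))
    ref.modify(lambda _old: np.arange(19_555, dtype=np.float64))
    print(ref.read(len))   # 19555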

The VI starts by reading a binary file containing the data which, to simplify this example, is already formatted correctly for how we are going to use it. The resulting array drives a box called an inplace structure that guarantees there will be no other accesses to the DVR occurring in parallel with this one. However, the structure does something else too: inplace structures operate something like compiler directives, telling the LabVIEW compiler that it’s OK to attempt additional optimizations that would not otherwise be safe to make. For example, they allow LabVIEW to operate on the inplace data without making the copies that the compiler might otherwise need to make.

The other thing to note is that funny-looking function in the middle of the inner inplace structure. It’s called Swap Values and its help description really doesn’t do it justice. If all you did was read the context help you might assume that it is simply some sort of switch for routing signals around, and stifling a yawn, go on to consider matters that you think look more exciting. To see why you should consider this function very exciting, we need to look under LabVIEW’s hood.

To store data internally, LabVIEW uses memory data buffers. In fact much of what we think of as “dataflow” consists of the manipulation of those buffers. Now when LabVIEW stores a complex datatype like a cluster (which is what the DVR in this case is holding) it uses a combination of techniques. For simple fixed-size data like numerics or booleans, LabVIEW simply includes the data values directly in the cluster’s memory space. However, it needs a different approach when storing data values like arrays or strings that can vary in length. When a cluster includes an item that can change in size, the item is stored outside the cluster in its own memory buffer and the cluster only holds a reference to that buffer. That way if the item changes in size it can do so without affecting the memory allocation of the cluster containing it.

However, this explanation also reveals why the Swap Values node is so important. Let’s look at this code from the standpoint of buffers. Coming into the inner inplace structure there are two buffers allocated that are holding arrays: one contains the data I read from the file, and the other holds the (empty) array contained in the cluster that makes up the contents of the DVR. Now there are two ways that we could initialize that array. The most obvious is to leave the unbundle (left) side of the cluster inplace structure unwired and wire the array containing the data directly to the bundle (right) side. While this would work, coding it that way would result in LabVIEW copying the data from the incoming array’s buffer into the array buffer associated with the cluster – and the larger the dataset, the longer that copy takes.

Now consider what happens when Swap Values is used. Even inside an inplace structure, it might seem impossible to replace an empty array with an array containing thousands of datapoints “in place.” Actually, you can. The key point to remember is that, at a very low level, the clusters don’t actually contain the arrays; rather, they hold references that point to the arrays associated with them. So what Swap Values does is leave the two arrays where they are and simply swap the references. Thanks to this optimization, populating this cluster with data takes exactly the same amount of time whether the input data contains 2 datapoints or 200,000 datapoints, because the only thing that is really being moved is a pair of 4-byte memory buffer references.
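Since the diagrams themselves can’t be reproduced in text, here is a hedged Python analogue of the copy-versus-swap distinction. The variable names are mine; the point is simply that swapping two references costs the same regardless of how big the arrays are, while copying scales with the data.

    import numpy as np

    # Two "buffers": the freshly read data and the (empty) array living
    # inside the cluster stored in the DVR.
    file_data = np.arange(200_000, dtype=np.float64)
    cluster = {"name": "Trace 0", "y_data": np.empty(0)}

    # Copy-style initialization: every element is duplicated, so the cost
    # grows with the size of the dataset.
    cluster["y_data"] = file_data.copy()

    # Swap-style initialization (conceptually what Swap Values does): only
    # the two references are exchanged, the arrays themselves never move,
    # so the cost is the same for 2 points or 200,000 points.
    cluster["y_data"], file_data = file_data, cluster["y_data"]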

Getting Data Out

So we have gotten our data into the DVR as efficiently as we can, but if this storage is going to be of any use, there clearly needs to be a way to get data out of it as well. Here, however, we also face the issue of plotting data that is too large: at the same time as we are pulling it out, we need to reduce or decimate it to more closely match the size of the available graphing area. To meet those dual requirements I created this VI.

Read and Decimate Big Data

At first this code might seem intimidating, but if you take it step-by-step and analyse what it’s doing, it isn’t really so very different from the example we looked at for initializing the data in the DVR. Starting at the left side, the code unbundles the data array from the DVR and passes it into a loop that will execute three times – once for each plot in the dataset. The first point of optimization is in how this loop operates. Note the node with the “P” in it. The presence of this node means that the for loop is set for parallel operation. There are many situations where, even though you specify a for loop, there is no logical reason that the iterations have to execute sequentially. When LabVIEW encounters a “parallelized” loop, the optimizer essentially flattens the loop out, creates the necessary parallel code to execute each iteration simultaneously, and then reassembles the output data in the correct order. To find out whether a loop is parallelizable, there is an option under Tools>>Profile called Find Parallelizable Loops…. This operation opens a dialog that allows you to identify the loops that can and cannot be run in parallel mode.
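Parallel for loops have no direct text equivalent either, but the behaviour they give you (iterations farmed out concurrently, outputs reassembled in their original order) can be approximated in Python with concurrent.futures. The process_one_plot stand-in below is purely hypothetical; it just marks where the real per-plot work would go.

    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def process_one_plot(y):
        # Stand-in for the real per-plot work (subsetting and decimating,
        # sketched in the next section).
        return y[::56]

    if __name__ == "__main__":
        plots = [np.sin(np.linspace(0.0, 100.0, 19_555)) for _ in range(3)]

        # map() hands the iterations to worker processes but still returns
        # the results in order, just as a parallelized for loop reassembles
        # its auto-indexed outputs.
        with ProcessPoolExecutor() as pool:
            decimated = list(pool.map(process_one_plot, plots))
        print([len(d) for d in decimated])   # three reduced traces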

Inside the loop, the array drives an inplace structure that indexes out one element, and the resulting cluster feeds a second inplace structure that unbundles the two items in the cluster. The processing of this data occurs in two distinct steps. First the Start and Length inputs produce a subset of the total dataset representing the portion of the data that is to be displayed. Because this operation causes LabVIEW to copy the selected data into a new memory buffer, the code passes the resulting arrays into another inplace structure to ensure that the subset will also be manipulated inplace.

The code inside this inner-most inplace structure performs the second half of the processing – the decimation that reduces the size of the data being plotted. Note that if the selected portion of the dataset is already smaller than the width of the plot area, the following code is bypassed. The first step in the decimation process is to reshape the 1D array into a 2D array where each row contains one chunk of data to be statistically summarized. To obtain the final X values, the code takes the first value of each chunk, while the final Y values are the maximum Y for each chunk. Note that this processing occurs in another parallelized loop that auto-indexes the output arrays, which are swapped into the output dataset as they work their way back out through the inplace structures.
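For readers who want the decimation step spelled out in text form, here is a short Python sketch of the same idea. The decimate function and its signature are my own invention, not part of the toolset above, but it follows the steps just described: skip the work if the data already fits, otherwise chunk the selected subset, keep the first X value of each chunk and the maximum Y of each chunk.

    import numpy as np

    def decimate(x, y, plot_width_px):
        """Reduce (x, y) to roughly one point per pixel."""
        if len(y) <= plot_width_px:
            return x, y                                  # already small enough

        chunk = int(np.ceil(len(y) / plot_width_px))     # datapoints per pixel
        usable = (len(y) // chunk) * chunk               # drop ragged tail (a simplification)

        x_out = x[:usable].reshape(-1, chunk)[:, 0]                # first X of each chunk
        y_out = np.nanmax(y[:usable].reshape(-1, chunk), axis=1)   # max Y, ignoring NaN
        return x_out, y_out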

Summarizing Options and Challenges

The real heart of this VI is the function that is being used to summarize the Y values for each chunk of data. Right now, I am using the function that returns the minimum and maximum values contained in the array. One of the advantages it offers is that it deals well with datasets containing missing datapoints represented by the value NaN. This consideration is important because it is a common (and valuable) practice to represent missing datapoints using that value. Without the NaN datapoints, any graph will simply connect the dots on either side of the missing datapoints, resulting in a graph that visually misrepresents the data being presented. However, with the NaN values, the missing points are shown as breaks in the line (or gaps between bars), thus highlighting the missing data.

The statistical function I selected to summarize the data in the chunks simply returns the minimum and maximum values of the elements that are not NaN. However, most other analysis routines follow the basic rule that any calculation which has NaN as an operand will return an answer of NaN – which in this situation will not be very helpful. More often, what you will want is, for example, the average value of the datapoints that are present in the dataset chunk. If you want to use the chunk mean or median value to summarize a dataset that you know contains NaN values, you should include something like this before the statistical operation:

Filtering out NaN

Basically it works by first sorting the array to move any NaN values to the end of the array. It then looks for the first NaN and simply trims it off (along with anything after it). This works because a mean operation doesn’t care about data order, and the first thing a median function does is sort the data anyway.
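Again, as a hedged text-form illustration of the same trick (the function name is mine, and NumPy is doing the sorting rather than a LabVIEW primitive):

    import numpy as np

    def drop_nan_for_stats(values):
        """Sort so NaNs land at the end, then trim from the first NaN onward."""
        ordered = np.sort(values)                    # NumPy also sorts NaN to the end
        nan_mask = np.isnan(ordered)
        if nan_mask.any():
            ordered = ordered[: nan_mask.argmax()]   # index of the first NaN
        return ordered

    chunk = np.array([3.0, np.nan, 1.0, 2.0, np.nan])
    print(np.mean(drop_nan_for_stats(chunk)))        # 2.0, not NaN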

Let’s See How it Works

When you run the top-level VI in the linked project, the graph that comes up will look a lot like the first image in this post, but minus the vertical white bars. As you make changes to the display that affect the X axis range, you will notice that the resulting image will zoom in on the data, showing ever greater levels of detail. Try manually typing in new X axis end points, or use the horizontal zoom tool on the graph palette to select a range of datapoints that you want to zoom in on.

Zoom in far enough and you will see why there were white bars on the original plot: There are a lot of missing datapoints. Using the default decimation resulted in wide white bars because the presence of the NaN values effectively hid dozens of real datapoints.

Plotting Large Datasets – Release 1
Toolbox – Release 11

Hopefully this discussion will give you something to think about, and experiment with.

The Big Tease

One of the things that developers often have to face is adding functionality to an existing VI without disrupting, or even modifying, what is already there. One way to accomplish this (seemingly impossible) task is to use what are sometimes called “drop-in” VIs. These routines are simply dropped down on an existing block diagram and they do what they do without interaction with the existing code. To demonstrate how this could work, next time we’ll get back to our test bed application and give it the ability to customize the font and size of the text that is on its various displays.

Until Next Time…
Mike…

When are UDEs not the Right Answer?

As we have seen, UDEs can be a powerful way of passing data between processes, but they are no Silver Bullet, no Panacea, no Balm of Gilead, no … well, you get the point. Given that we aren’t going to do something silly like only use one method for passing all our data, we need to ask an important question: When should you consider the other techniques?

Knowing What is Important

The reason people choose one thing over another (or at least why people should choose one thing over another) is that they are looking for better performance – however you might define that word. But before we can say one technique is better than another, we need to understand the key features of each technique. With that knowledge in hand we can then apply those techniques in ways where the features work to our advantage, but we avoid problematic side-effects. So, since our present concern is data passing, let’s consider the issue I call immediacy.

There are times when something needs to happen as quickly as possible after an input value changes. In the hardware world this sort of condition is called an edge trigger and as a system architect working with them you are often primarily interested in when the change occurred. Of course an event can carry data, but the key distinguishing factor for this sort of communications is timing. These are the sorts of signals for which UDEs are an excellent choice because UDEs are very good at telling remote processes that something has changed.

On the other hand, there are signals where, functionally, it doesn’t really matter when they changed. All the code really cares about is what their value is the next time they are used. These are the signals that you could pass with a UDE, but to do so would be wasteful of the receiving process’ time. Remember that when an event fires, any process registered to receive that event has to stop what it was doing to handle the event and then return to what it was doing before it was interrupted – even though it fundamentally doesn’t care when the value changes. Although these diversions might be small, they do take some time and so can affect loop timing. Moreover, if you have something running in a Timeout event, a UDE (or any other type of event) can drastically affect the timing of when the next timeout occurs. Remember that a timeout of, say, 1000 msec does not mean that the code in it will execute every 1000 msec. Rather, the event triggers when the time since the last occurrence of any event exceeds 1000 msec.

In fact, in this situation, if you have a UDE that fires every 500 msec, the timeout will never occur. To summarize this point, a good way to think about this is that an event actually carries two pieces of information: data and timing. What we need sometimes, though, is just the data.
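If it helps to see that behaviour in runnable form, here is a small Python analogue that uses a queue in place of the event structure; the 500 msec producer and 1000 msec timeout are the same numbers used above, and everything else is an illustrative assumption.

    import queue
    import threading
    import time

    events = queue.Queue()

    def fire_ude_every_500ms():
        for _ in range(10):
            time.sleep(0.5)
            events.put("value changed")

    threading.Thread(target=fire_ude_every_500ms, daemon=True).start()

    for _ in range(10):
        try:
            msg = events.get(timeout=1.0)   # wait up to 1000 msec for an event
            print("handled event:", msg)
        except queue.Empty:
            print("timeout fired")          # never reached while events keep
                                            # arriving every 500 msec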

Just the Data

The good news is that LabVIEW offers a variety of options for passing just the data. The oldest (and still very useful) way is with a Functional Global Variable (FGV), also sometimes called a LabVIEW 2 style global. An FGV uses an uninitialized shift register to hold data in memory. More recent additions to our arsenal include feedback nodes, data value references (DVRs) and shared variables.

Your choice between these options should be driven by memory and performance concerns. An FGV is easy to create and very flexible, but if the data being buffered is large, it can suffer performance issues due to the copying of memory buffers. A DVR, on the other hand, is a little trickier to use because there is a reference you have to pass around, but it is very efficient. With large datasets I will often split the difference and combine the two: I will use the DVR to store the data, but use an FGV to buffer and store the DVR reference. Although it is sort of overkill for the data that I will be transferring, I’ll show you the technique in the following example.
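Before we get to the LabVIEW version, the shape of the technique can be summarized in a few lines of text code. This is a rough Python analogue, and none of these names are LabVIEW APIs: the module-level variable plays the part of the FGV’s uninitialized shift register, and the DataValueRef object plays the part of the DVR, so the large data itself is only ever stored once.

    import threading

    class DataValueRef:
        """Minimal DVR analogue: one copy of the data behind a lock."""
        def __init__(self, value=None):
            self._value = value
            self._lock = threading.Lock()

        def set(self, value):
            with self._lock:
                self._value = value

        def get(self):
            with self._lock:
                return self._value

    # Plays the role of the FGV's uninitialized shift register: it outlives
    # any single call and every caller sees the same contents.
    _shift_register = None
    _init_lock = threading.Lock()

    def get_buffer():
        """FGV-style access point: create the DVR on first use, then keep
        handing back that same small reference."""
        global _shift_register
        with _init_lock:
            if _shift_register is None:
                _shift_register = DataValueRef()
        return _shift_register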

What We’ll Build

Right now in the testbed application the delay between samples is hardcoded to 1000 msec. So to demonstrate the passing of non-time-critical data, let’s modify our code to allow the UI process to vary the delay between data samples.

The first thing we want to do is create the buffer we will use to transfer the data. So in the project directory create a subdirectory named _buffers and inside it create a subdirectory named Sample Rate. Finally inside that subdirectory create a LabVIEW library named Sample Rate.lvlib and two more subdirectories called _subVIs and _typedefs. If you have been following this blog from the beginning, this arrangement should look familiar as it is very similar to what we did for organizing UDEs — and we are repeating it here for all the same reasons (unique name space, access control for subVIs, etc.).

The Buffer

The first real code will be the buffer and this is what it looks like:

DVR-Based Data Cache

As promised, it has the basic structure of a FGV, but the global data is the DVR reference. Note also that the DVR’s datatype is a cluster (typedef’d of course) that contains a single integer. The typedef is saved in the typedef directory with the name Data.ctl and the buffer is saved in the subVI directory with the name Buffer.vi. The two subdirectories have been added to the library and the access scope for the subVI directory is set to Private.

Reading and Writing the Buffer

With the buffer itself created we need to add to the library a pair of publicly accessible VIs for accessing it — and here they are. First the read:

Read DVR Data Cache

…and now the write:

Write DVR Data Cache

Note that if anything about the data or how it is stored would ever need to change in the future, this one library would be the only place in the code that would be impacted.
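In text form, the public face of such a library might look something like the sketch below. The names (Data, read_sample_period, write_sample_period) are illustrative stand-ins for Data.ctl and the two public VIs, and the simple module-level Data object stands in for the private Buffer.vi; the point being illustrated is that callers only ever touch the two accessors, so this one module is the only place that changes if the stored data changes.

    from dataclasses import dataclass

    @dataclass
    class Data:
        """Analogue of Data.ctl: the typedef'd cluster holding one integer."""
        sample_period_ms: int = 1000

    # Private storage, standing in for what Buffer.vi holds in its shift register.
    _buffer = Data()

    def write_sample_period(period_ms: int) -> None:
        """Public accessor: called from the UI's value-change handler."""
        _buffer.sample_period_ms = period_ms

    def read_sample_period() -> int:
        """Public accessor: called by the acquisition loop on each iteration."""
        return _buffer.sample_period_ms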

Modifying the Testbed

All we have to do now is modify the testbed to add in the logic that we just created, but thanks to the infrastructure already in place, that will be an easy task. In fact it will take exactly three changes – total modification time, less than 5 minutes.

  1. Add an I32 control to the front panel of the display process and create a value change event for it. In the event handler for the value change, wire the NewVal event data node to the input of the buffer write VI.

    Sample Period Value Change Event

  2. Set up the initialization by adding a property node to fire the value change event using the default control value.

    Initialize Sample Period Event

  3. Add the buffer read to the acquisition VI as shown.

    Modified Acquisition Loop Timing

That’s all there is to it. Another good point for this approach is that because the DVR’s datatype is defined as a typedef, you can make changes to the data in the buffer without recreating everything. In addition, because the data is a cluster, the existing code won’t break if you add another value to the cluster. Go ahead and give it a try: open the typedef and add another control to the cluster. After you have saved the change, note that the existing code did not break.

Check here for the updated project:
http://svn.notatamelion.com/blogProject/testbed application/Tags/Release 6

Oh yes, one more thing. Remember how we created a zip archive holding the basic template files for a UDE? It might be a good idea to do the same thing for this structure. Until next time…

Mike…