# University of Twente Student Theses

## On a Design Space for Aggregating over Sliding Windows on a Stream

Meijer, Michael
(2011)
*On a Design Space for Aggregating over Sliding
Windows on a Stream.*

PDF
3MB |

Abstract: | On-line, continuous processing over streaming data challenges performance: processing time and memory space usage. This research is about the best performing way to compute an aggregate function over sliding windows containing the most recent data of a stream. The performance depends on the output semantics: input, aggregate function, accuracy requirements, retention as the quantity of elements aggregated and slide as the quantity of elements to move over the stream. Two approaches are identified, that under the same output semantics, differ in the way sliding windows are materialized. The instantaneous approach takes a window covering the retention and computes the aggregate function over the window upon every slide, independent of previous windows. The incremental approach has a synopsis structure covering the retention, updated by consecutive disjoint batches covering a slide. A proposed theoretical model relates the approaches to processing time and memory space usage. Scalability is assessed by instantiating that model with the computational complexity of the approaches. It is shown that the incremental approach with a trivial synopsis storing the entire retention can have processing time benefits over the instantaneous approach. Moreover it seems approximations of aggregate functions are the only way to reduce memory space usage when sliding one element at a time. Properties of aggregates from literature are related to the approaches and theoretical model. Aggregates with the properties of a priori bounded space and incremental / decremental functions are called bounded differential aggregates. Under the incremental approach they require constant processing time and can significantly reduce memory space usage in practice when the slide is sufficiently large. By instantiating the theoretical model with empirical approximations of performance, approaches can be compared by their performance in practice. Performance in practice shows differences between approaches belonging to the same complexity class, more average-case like performance and (unexpected) performance side-effects. A taxonomy is proposed categorizing synopsis structures by support for fixed-size (e.g. count-based) or variable-size (e.g. time-based) retention, exact or approximate aggregates produced and the deterministic or randomized / probablistic result of approximations. A few synopsis structures covering the taxonomy are explored in the context of the selected aggregates variance and quantiles of (most) interest to Info Support. Exact computations are explored by the trivial synopsis that stores the entire retention and panes that exploit a priori space bounds of aggregates for sufficiently large batches. Approximations are explored by sampling, histograms for variance only and summaries for quantiles only. The selected aggregates under both approaches along with synopsis structures are implemented on the streaming data processing system Microsoft StreamInsight 1.1 of interest to Info Support. A stream of uniform random numbers is used. Accuracy of 95%-100% is of interest to Info Support. The effect of retention is observed by increasing it at a fixed slide. The effect of slide (inversely: overlap) is observed by a fixed retention and increasing slide. The empirical results show the instantaneous approach outperforms the incremental approach when the slide (and batch) is sufficiently large; at about half the retention. Sampling achieves high accuracy and the best performance under the incremental approach. Panes have the best performance when the batches are large enough. Quantile summaries reduce memory space usage significantly, have high accuracy and low processing time. Using SPSS 17 the empirical results are generalized on the environment by regression models for processing time and memory space usage per aggregate, per accuracy requirements and per approach (differentiated by synopsis structure). Violation of the assumptions of homoscedasticity, independence or normality of residuals during analysis, suggests the use of more advanced analysis. Validation of the models on an independent subset of the measurements indicates errors similar to the model errors from analysis. A design space is proposed that uses the theoretical model to compare the approaches in terms of performance, either by scalability or by empirical approximations of performance in practice, and is augmented with guidelines derived during the research. It indicates the best performing approach regarding how to compute an aggregate function over sliding windows, given output semantics. By exploring other aggregates and synopsis structures, the design space can be extended. |

Item Type: | Essay (Master) |

Clients: | Info Support B.V. |

Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |

Subject: | 54 computer science |

Programme: | Computer Science MSc (60300) |

Link to this item: | http://purl.utwente.nl/essays/61062 |

Export this item as: | BibTeX EndNote HTML Citation Reference Manager |

Repository Staff Only: item control page