Managing Memory Overhead for Published Data Sets

Overview

This article describes the various ways to estimate the Live DataMart memory overhead for published data sets.

Memory Estimation Guidelines

What is the overhead for Live DataMart when it comes to memory utilization? For example, if you have 100 MB of raw data and load it into Live DataMart, what will the memory footprint be?

To answer this question as accurately as possible, consider your Live DataMart project's details (for example, field rules, preprocessors, transforms, and so on).

TIBCO recommends a guideline of 3X for estimating Live DataMart general memory overhead, in the absence of any knowledge of the data types and sizes, number of indexes, or other parameters.
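For instance, applying the guideline to the 100 MB question above takes only a few lines of Java (a sketch, runnable in jshell; the figures are illustrative):

  // 3X guideline: budget roughly three times the raw data size in heap.
  double rawDataMB = 100.0;                // measured raw data size
  double estimatedHeapMB = rawDataMB * 3;  // general 3X guideline
  System.out.printf("~%.0f MB of heap for %.0f MB of raw data%n",
          estimatedHeapMB, rawDataMB);     // prints ~300 MB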

The 3X guideline is a little higher than what you might use for a strictly StreamBase application. Live DataMart adds several internal indexes on all tables, keeps some per-row metadata, and, depending on the configuration, has additional overhead that is sometimes per-row and sometimes tied to publish rates, as well as to the number and types of queries.

The first thing to understand is how the raw data size is being measured. One common measure is the number of bytes a CSV representation of the data takes. One significant multiplier appears when many strings are involved: a string takes roughly 2X heap memory relative to its CSV character count, because Java stores each string character as two bytes in memory. For any string field that is indexed, add another 2X to the number of bytes for the index storage.
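As a rough sketch of these two rules (the helper name and the ~18-byte per-object overhead are assumptions for illustration, consistent with the 20-bytes-per-one-character-string figure used below, not a TIBCO-published formula; runnable in jshell):

  // Approximate heap cost of one string field: ~2 bytes per character
  // plus an assumed ~18 bytes of object overhead; an indexed string
  // field is counted twice because the index stores another copy.
  long stringHeapBytes(int charCount, boolean indexed) {
      long heap = 18 + 2L * charCount;
      return indexed ? 2 * heap : heap;
  }

  System.out.println(stringHeapBytes(1_000, true));  // ~4,036 bytes in heap
  System.out.println(stringHeapBytes(1, false));     // ~20 bytes in heap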

So if you have a two-column table with a long as the primary index and a string as a secondary index, and your data is a long plus a 1,000-character string, you may need about 4X more heap memory than your raw data size: the string takes 2X its character count in heap, and the secondary index stores another copy, doubling that again (see the sketch after the next example).

Now suppose your data table contains 201 fields: a long primary index and 200 strings. If your data is a long and 200 one-character strings, you will need 200 × 20 bytes (mostly object overhead), or about 4,000 bytes of heap, for each approximately 208 bytes of raw data size. This is more like 20X overhead.
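Both examples can be checked with a short sketch (the field sizes and the ~18-byte string overhead are assumptions carried over from the rules above; runnable in jshell):

  // Heap-to-raw ratios for the two example tables above.
  long stringHeapBytes(int chars, boolean indexed) {
      long heap = 18 + 2L * chars;       // object overhead + 2 bytes/char
      return indexed ? 2 * heap : heap;  // secondary index keeps a copy
  }

  // Example 1: long primary index + one indexed 1,000-character string
  long raw1 = 8 + 1_000;
  long heap1 = 8 + stringHeapBytes(1_000, true);
  System.out.printf("Example 1: ~%.1fX%n", (double) heap1 / raw1);  // ~4.0X

  // Example 2: long primary index + 200 one-character strings
  long raw2 = 8 + 200;
  long heap2 = 8 + 200 * stringHeapBytes(1, false);
  System.out.printf("Example 2: ~%.1fX%n", (double) heap2 / raw2);  // ~19.3X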

The LVStatistics system table's MBMemoryUsed column shows the approximate number of megabytes the specified table currently consumes. Applying the 3X guideline to this number is also a useful way to estimate memory overhead.
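For example (a sketch; how you read MBMemoryUsed, whether through LiveView Desktop, lv-client, or a client API query against LVStatistics, is up to you, and the value here is made up):

  // Apply the 3X guideline to a reading taken from LVStatistics.
  double mbMemoryUsed = 250.0;   // hypothetical MBMemoryUsed value
  double estimatedBudgetMB = mbMemoryUsed * 3;
  System.out.printf("Budget ~%.0f MB of heap for this table%n",
          estimatedBudgetMB);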