CSV File Reader Input Adapter

Introduction

The Spotfire Streaming CSV File Reader is an embedded adapter that reads comma-separated value (CSV) files.

An embedded adapter is an adapter that runs in the same process as StreamBase Server. The CSV File Reader reads records from a CSV file, creates tuples from these records, then sends these tuples to the operator downstream from it in its StreamBase application. A record typically consists of a line in the CSV file. If quoted, however, a record can span more than one line in the file.
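
For example, a hypothetical input file for a schema with a string field, a double field, and a second string field might contain the following records. The first line is a comment (comment handling is described below), the second line is a complete record, and the third record spans two lines because its quoted final field contains a line break:

# sample trade data
IBM,125.50,regular trade
MSFT,310.25,"this note is quoted and
continues on a second line"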

The CSV File Reader is similar to an input stream that supplies its own input from a CSV file. As with an input stream, a schema needs to be specified for the CSV File Reader. The schema used by the CSV File Reader is specified in the Edit Schema tab of the Properties view in StreamBase Studio.

An embedded adapter that reads from a CSV file differs from an external data source in that it consumes its input file as rapidly as it can. This means the rate at which it consumes records and produces tuples is governed only by the speed at which it can read records from disk and create tuples from them. This is not typically true of an external data source, and it may not be the behavior you want. The CSV File Reader's Period property governs the rate at which it consumes records. The period is the amount of time that the CSV File Reader pauses between consuming records. That is, the CSV File Reader reads one record, processes it to completion, pauses for the specified period, and then reads another record.
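
The following sketch is a conceptual illustration of this throttling behavior only, not the adapter's implementation; the file name and period value are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;

// Conceptual sketch: read one record, process it to completion (here, just
// print it as a stand-in for emitting a tuple), then pause for the period
// before reading the next record.
public class ThrottledRead {
    public static void main(String[] args) throws Exception {
        long periodMillis = 100; // hypothetical Period value, in milliseconds
        try (BufferedReader in = new BufferedReader(new FileReader("input.csv"))) {
            String record;
            while ((record = in.readLine()) != null) {
                System.out.println("tuple: " + record);
                Thread.sleep(periodMillis);
            }
        }
    }
}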

The name of the CSV file is specified as a property of the CSV File Reader. If you use the File Name field without the Start Control Port option, the specified file must exist in the same project folder in StreamBase Studio, or in a referenced project's folder. If you use the File Name field in conjunction with the Start Control Port option, you can specify a relative or absolute path to the CSV file. If you specify a relative path, the named file is searched for in the directory specified in the StreamBase Server configuration file. In the <global> section, look for the operator-resource-search parameter. By default, it is commented out. Uncomment the element and specify a path. For example:

<global>
  <operator-resource-search directory="/home/sbuser/mysbapps/resources"/>
</global>

The size of a CSV file may be limited by practical considerations, and it may not be practical to provide the desired amount of data in a single file. One possible solution is to iterate over one CSV file a number of times, which is provided for by the Repeat property. If 0 is specified for Repeat, then the CSV File Reader iterates over the CSV file indefinitely.

Note that the CSV file can be either imported into your StreamBase Studio project, or created and edited in Studio. To create a new one, select File>New>File. In the New File dialog, specify the file's name and project. A new, empty file is opened in a text editor, where you can edit and save it.

The CSV File Reader allows you to specify a string that, when encountered in an incoming CSV field, will be translated into a null tuple field value. The default string is null, but you can specify any string in the NULL String property.
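
For example, with the default NULL String of null, a hypothetical record such as the following produces a tuple whose second field is null:

IBM,null,closing auction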

The CSV File Reader can read files compressed in the zip, gzip, or bzip2 formats, automatically extracting the file to be read from the compressed archive. For this to work, the adapter requires the target file to have the extension .zip, .gz, or .bz2, and expects to find exactly one CSV file inside each compressed file. This feature allows the adapter to read market data files provided by a market data vendor in compressed format, without needing to uncompress the files in advance.
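
The sketch below illustrates the general technique in plain Java, not the adapter's own code: CSV records are read directly from a gzip-compressed file by way of the standard java.util.zip.GZIPInputStream class, without first uncompressing the file on disk. The file name is hypothetical:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

// Reads CSV records straight from a .gz archive, decompressing on the fly.
public class GzipCsvRead {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("quotes.csv.gz"))))) {
            String record;
            while ((record = in.readLine()) != null) {
                System.out.println(record);
            }
        }
    }
}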

The CSV File Reader considers lines starting with the number sign (#), also known as the hash character, to be comments and discards them.

CSV File Reader Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

In the tables in this section, the Property column shows each property name as found in one or more of the adapter properties tabs of the Properties view for this adapter.

General Tab

Name: Use this required field to specify or change the name of this instance of this component. The name must be unique within the current EventFlow module. The name can contain alphanumeric characters, underscores, and escaped special characters. Special characters can be escaped as described in Identifier Naming Rules. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this adapter starts.

Enable Error Output Port: Select this checkbox to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally, enter text to briefly describe the purpose and function of the component. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Property Data Type Default Description
File Name drop-down list None

The name of the CSV file to read, without any path. The specified file must be in the current project folder, or in a referenced project's folder. You must enter a file name in this field, or enable the Start Control Port, or both. If Start Control Port is disabled, the file specified in this field is the only file to be read by the current adapter instance. If Start Control Port is enabled, a file specified in this field is the default file to be read, as described below.

This adapter automatically uncompresses the input file before attempting to interpret the CSV content, if the input file was compressed with Zip and has the .zip extension, with Gzip and has the .gz extension, or with Bzip2 and has the .bz2 extension.

Read As Resource check box Selected If selected and the specified path is not absolute, the file is resolved as a resource file.
Use Default Charset check box Selected If selected, the Java platform's default character set is used. If cleared, a valid character set name must be specified for the Character Set property.
Character Set string None The name of the character set encoding that the adapter is to use to read input or write output.
Start Control Port check box Cleared

Select this checkbox to give this adapter instance an input port that you can use to control which CSV files to read, and in which order. The input schema for the Start Control Port must have at least one field of type string. You can optionally define a more complex schema for this port for use with the Map Control Port to Event Port option; in this case, the first field must be of type string and the second field, used to specify the user, must also be of type string. The schema is typechecked as you define it.

If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. Specify the full, absolute path to the CSV file to be read in the first field of the tuple, and optionally specify the user as the second field. There is no need to surround the full path with quotes if the path contains spaces.

If the File Name property specifies a file name, there are two cases:

  1. If a control tuple received on this port has an empty or null string, the file specified in the File Name property is read or re-read.

  2. If a control tuple contains the path to a CSV file, then that specified file is read, as above, ignoring the File Name field.

Start Event Port check box Cleared

Select this checkbox to create an output port that emits an informational tuple each time a CSV file is opened or closed. The informational tuple schema has five fields:

  • Type, string

  • Object, string

  • Action, string

  • Status, int

  • Info, string

For a file open event, the event port tuple's Type field is set to "Open", while the Object field is set to the path name of the CSV file being opened. If the file open procedure succeeds, the Status field is set to 0. If an error occurs while opening the file, Status is set to -1 and Info contains an error message.

For a file close event, Type is set to "Close", Object is set to the path name of the CSV file being closed, and Status is set to the number of rows that were read from the CSV file. The Close event tuple is sent after the adapter processes the entire CSV file and emits data tuples for each record in the file.

If an unexpected error occurs, Type is set to "Error", Status is set to the number of the record where the error occurred, and Info contains the error message.

If you enable the Map Control Port to Event Port option below, the event port tuple also includes a sixth field named ControlInfo of type tuple.

When running in Studio, remember that tuples from more than one output port may appear in the Output Streams view in a different order than they are emitted from the adapter. Thus, you may see the Close event appear on the output of this event port while data tuples are still displaying.
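
For example, successfully reading a hypothetical 1,250-record file produces one Open event followed, after all of the data tuples, by one Close event, along these lines (only the fields described above are shown):

Type="Open",  Object="/home/sbuser/data/trades.csv", Status=0
Type="Close", Object="/home/sbuser/data/trades.csv", Status=1250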

Map Control Port to Event Port check box Cleared

Select this checkbox to pass all information received on the control input port to the event output port. When enabled, this property adds a field of type tuple named ControlInfo to the tuple passed to the event output stream. The ControlInfo field contains the entire tuple of the input stream sent to the Control Port.

Tail Mode check box Cleared

Select this checkbox to process records as they are appended to the CSV file. Newly appended records are not emitted until the reader detects the line ending character appropriate for the operating system.

Ignore Existing Records check box Selected

Select this checkbox to ignore existing records in the CSV file when in tail mode.

Tail Update Interval int 1000

The time, in milliseconds, between checks for updates to the CSV file when in tail mode.

Read Files Synchronously check box Cleared If enabled, files are read synchronously.
Log Level drop-down list INFO Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.

Parsing Options Tab

Property Data Type Default Description
Field Delimiter string , (comma) The delimiter used to separate tokens in the input file. Control characters can be entered as &#ddd; where ddd is the character's ASCII value. For example, use &#9; for a tab character. A special exception also allows the \t character to be used in this field to represent a tab delimiter.
String Quote Character string " (double quote) The optional quote character used in pairs to delimit string constants.
Timestamp Format string yyyy-MM-dd HH:mm:ss.SSSZ

The string format used to represent timestamp fields extracted from the input file. The format string follows the pattern syntax expected by the java.text.SimpleDateFormat class, as described in the Oracle Java Platform SE reference documentation; see the example at the end of this section.

If a timestamp value is read that does not match the specified format string, the entire record is discarded and a WARN message appears on the console that includes the text invalid timestamp value.

Lenient Parsing boolean Selected

Select this checkbox to parse timestamp values that do not conform to the specified format, using default formats.

NULL String string None The string that, if encountered in a CSV field when reading a file, is translated to a null value for the corresponding tuple field. If unspecified, the default string is null. You can designate any string as the null value string.
Preserve Whitespace boolean Cleared Set this to true to preserve leading and trailing white space in string fields.
Header Type drop-down list No header

The type of header used in the CSV file. Choose one of the following:

No header

The CSV file contains no header and is to be parsed without a header.

Ignore header

The first line of the CSV file is to be considered the header. The first line is skipped and not read into the adapter as a tuple.

Read header

The first line of the CSV file is to be considered the header, and is compared against the schema used in your StreamBase application. Header fields that do not match the schema are not parsed (nor are the corresponding fields in subsequent rows), and fields outside the range of the header are not parsed. Field order does not matter, because the adapter reorders the CSV fields to fit the schema of the StreamBase application.

Incomplete Records radio button Populate with nulls

Specifies what should be done when the adapter reads a record with fewer than the required number of fields.

Discard

Discard records with fewer than the required number of fields.

Populate with nulls

When records with fewer than the required number of fields are encountered, process the records after populating the unspecified fields with nulls.

Discard Empty Records check box Selected

This is a special case for handling empty lines. Leave this selected to emit tuples for rows that contain at least one field while discarding empty lines. Clear it to emit empty tuples for empty lines as well.

Log Warning check box Cleared

Select this checkbox if warning messages are to be logged when incomplete records are encountered. If cleared, no warning messages are logged for records with fewer than the required number of fields.
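
To check a Timestamp Format pattern against sample data before configuring the adapter, you can try the pattern directly with java.text.SimpleDateFormat, whose pattern syntax the Timestamp Format property follows. The sketch below uses the adapter's default pattern with a hypothetical timestamp value:

import java.text.SimpleDateFormat;
import java.util.Date;

// Parses a sample value with the adapter's default Timestamp Format pattern.
public class TimestampPatternCheck {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSZ");
        Date parsed = fmt.parse("2023-06-01 09:30:00.000-0400"); // hypothetical value
        System.out.println(parsed);
    }
}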

Emit Options Tab

Property Data Type Default Description
Repeat int 1 The number of times to iterate over the CSV file. 0 specifies iterating indefinitely. Note that if this property is set to iterate indefinitely, a new file sent to be read by way of the control port is not picked up.
Emit Policy Radio button Periodic Specifies whether to emit tuples with a regular period or based on a field in the data.

Specify Periodic, the default setting, to use the Period property below. In this case, the two Time field properties are dimmed.

Specify Field based to use a field in the output tuple to control the tuple emission rate. In this case, the Period property is dimmed. Specify the field to use in the Time field property, and specify how to use that field with a selection in the Time field meaning property.

Period int 0 Active only when Emit Policy is Periodic. Specifies the time, in milliseconds, to wait between the processing of records.
Time field meaning Drop-down list Emission times relative to the first record. Active only when Emit Policy is Field based. In the drop-down list, select one of the following options to specify how to use the time field named in the next property.

  • Absolute delays before the first record

  • Emission times relative to the first record

  • Emission times relative to zero

Time field string none Active only when Emit Policy is Field based. Specifies the name of a field in the output tuple whose values are used to control the tuple emission rate.
Capture Transform Strategy radio button FLATTEN The strategy to use when transforming capture fields for this operator: FLATTEN or NEST.

Edit Schema Tab

Use the Edit Schema tab to specify the schema of the output tuple for this adapter. For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.

Cluster Aware Tab

Use the settings in this tab to enable this operator or adapter for runtime start and stop conditions in a multi-node cluster. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.

Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Typechecking

Typechecking fails if the schema does not have at least one parameter, if the Delimiter is not a single character string, if the QuoteChar is longer than one character, or if the TimestampFormat is malformed. The File Name field fails to typecheck only if it is blank and you have not enabled the Start Control Port option.

A warning is emitted if the File Name property is empty and a null control tuple is received on the Start Control Port.

Suspend and Resume Behavior

On suspend, the CSV File Reader adapter finishes processing the current record, outputs the tuple, and then pauses. The input file remains open and the adapter retains its position in the file. The adapter stays paused until it is either shut down or resumed.

On resumption, the CSV File Reader adapter continues processing with the next record in the input file.
