
HDFS CSV File Reader Input Adapter

Contents

Introduction
HDFS CSV File Reader Properties
Typechecking
Connecting to the Amazon S3 File System
Suspend and Resume Behavior

Introduction

The CSV File Reader for Apache HDFS is an embedded adapter that reads comma-separated value (CSV) files from a Hadoop Distributed File System resource.

An embedded adapter is an adapter that runs in the same process as StreamBase Server. The HDFS CSV File Reader reads records from a CSV file, creates tuples from these records, then sends these tuples to the operator downstream from it in its StreamBase application. A record typically consists of a line in the CSV file. If quoted, however, a record can span more than one line in the file.

The HDFS CSV File Reader is similar to an input stream that supplies its own input from a CSV file. As with an input stream, a schema needs to be specified for the HDFS CSV File Reader. The schema used by the HDFS CSV File Reader is specified in the Edit Schema tab of the Properties view in StreamBase Studio.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

An embedded adapter that reads from a CSV file differs from an external data source in that it consumes its input file as rapidly as it can. This means the rate at which it consumes records and produces tuples is governed only by the speed at which it can read records from the Hadoop HDFS file system and create tuples from them. This is not typically true of an external data source, and it may not be the desired behavior. The Period property of the HDFS CSV File Reader governs the rate at which the adapter consumes records. The period is the amount of time that the HDFS CSV File Reader pauses between consuming records. That is, the HDFS CSV File Reader reads one record, processes it to completion, pauses for the specified period, and then reads another record.

The name of the CSV file is specified as a property of the HDFS CSV File Reader.

The size of a CSV file may be limited by practical considerations, and it may not be practical to provide the desired amount of data in a single file. One possible solution is to iterate over one CSV file a number of times, which is provided for by the Repeat property. If 0 is specified for Repeat, then the HDFS CSV File Reader iterates over the CSV file indefinitely.

The HDFS CSV File Reader allows you to specify a string that, when encountered in an incoming CSV field, will be translated into a null tuple field value. The default string is null, but you can specify any string in the NULL String property.

The HDFS CSV File Reader can read files compressed in the zip, gzip, or bzip2 formats, automatically extracting the file to be read from the compressed file. For this to work, the adapter requires the target file to have the extension .zip, .gz, or .bz2, and expects to find exactly one CSV file inside each compressed file. This feature allows the adapter to read market data files provided by a market data vendor in compressed format, without needing to uncompress the files in advance.

The HDFS CSV File Reader considers lines starting with the number sign (#), also known as the hash character, to be comments and discards them.
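The comment and NULL String rules described above can be illustrated with a small sketch. This is not the adapter's implementation: the class and method names are hypothetical, the line is split naively on commas, and quoted fields and whitespace trimming are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineSketch {
    /**
     * Returns null for a comment line. Otherwise returns the fields of the
     * line, with any field equal to nullString replaced by a Java null.
     */
    static List<String> parseLine(String line, String nullString) {
        if (line.startsWith("#")) {
            return null; // comment line: discarded, no tuple is produced
        }
        List<String> fields = new ArrayList<>();
        // Naive split for illustration only; -1 keeps trailing empty fields.
        for (String raw : line.split(",", -1)) {
            fields.add(raw.equals(nullString) ? null : raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("# vendor file, generated nightly", "null")); // prints null
        System.out.println(parseLine("IBM,null,42", "null"));                      // prints [IBM, null, 42]
    }
}
```

In the second call, the middle field matches the NULL string and becomes a null field value, while the other two fields pass through unchanged.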

HDFS CSV File Reader Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

In the tables in this section, the Property column shows each property name as it appears in the adapter properties tabs of the Properties view for this adapter.

General Tab

Name: Use this required field to specify or change the name of this instance of this component. The name must be unique within the current EventFlow module. The name can contain alphanumeric characters, underscores, and escaped special characters. Special characters can be escaped as described in Identifier Naming Rules. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this adapter starts.

Enable Error Output Port: Select this checkbox to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally, enter text to briefly describe the purpose and function of the component. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Property	Data Type	Default	Description
File Name	String	None	The name of the CSV file to read. You must enter a file name in this field, or enable the Start Control Port, or both. If Start Control Port is disabled, the file specified in this field is the only file to be read by the current adapter instance. If Start Control Port is enabled, a file specified in this field is the default file to be read, as described below. This adapter automatically uncompresses the input file before attempting to interpret the CSV content, if the input file was compressed with Zip and has the `.zip` extension, with Gzip and has the `.gz` extension, or with Bzip2 and has the `.bz2` extension.
User	String	None	The user name with which to access the HDFS file system when none is provided on the control input port. If no user name is provided by either the control port or this field, the user running the application is used.
Read As Resource	check box	Selected	If selected and the given path is not absolute, the file is resolved as a resource file.
Use Default Charset	check box	Selected	If selected, the Java platform's default character set is used. If cleared, a valid character set name must be specified for the Character Set property.
Character Set	string	None	The name of the character set encoding that the adapter is to use to read input or write output.
Start Control Port	check box	Cleared	Select this checkbox to give this adapter instance an input port that you can use to control which CSV files to read, and in which order. The input schema for the Start Control Port must have at least one field of type string. You can optionally define a more complex schema for this port for use with the Map Control Port to Event Port option; in this case, the first field must be of type string and the second field used for user must also be of type string. The schema is typechecked as you define it. If the File Name property is empty, the adapter begins reading when it receives a control tuple on this port. Specify the full, absolute path to the CSV file to be read in the first field of the tuple, and optionally specify the user as the second field. There is no need to surround the full path with quotes if the path contains spaces. If the File Name property specifies a file name, there are two cases: If a control tuple received on this port has an empty or null string, the file specified in the File Name property is read or re-read. If a control tuple contains the path to a CSV file, then that specified file is read, as above, ignoring the File Name field.
Start Event Port	check box	Cleared	Select this checkbox to create an output port that emits an informational tuple each time a CSV file is opened or closed. The informational tuple schema has five fields: `Type` (string), `Object` (string), `Action` (string), `Status` (int), and `Info` (string). For a file open event, the event port tuple's `Type` field is set to "Open", while the `Object` field is set to the path name of the CSV file being opened. If the file open procedure succeeded, the `Status` field will be "0". If an error occurred while opening the file, `Status` will be set to "-1" and `Info` will contain an error message. For a file close event, `Type` is set to "Close", `Object` is set to the path name of the CSV file being closed, and `Status` is set to the number of rows that were read from the CSV file. The Close event tuple is sent after the adapter processes the entire CSV file and emits data tuples for each record in the file. If an unexpected error occurs, `Type` is set to "Error", `Status` is set to the record where the error occurred, and `Info` contains the error message. If you enable the Map Control Port to Event Port option below, the event port tuple also includes a sixth field named `ControlInfo` of type tuple. When running in Studio, remember that tuples from more than one output port may appear in the Output Streams view in a different order than they are emitted from the adapter. Thus, you may see the Close event appear on the output of this event port while data tuples are still displaying.
Map Control Port to Event Port	check box	Cleared	Select this checkbox to pass all information received on the control input port to the event output port. When enabled, this property adds a field of type tuple named `ControlInfo` to the tuple passed to the event output stream. The `ControlInfo` field contains the entire tuple of the input stream sent to the Control Port.
Log Level	drop-down list	INFO	Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level will be used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.
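The precedence rules in the File Name, User, and Start Control Port rows above can be summarized in a short sketch. It only restates those rules as code: the class and method names are hypothetical and do not correspond to any StreamBase API.

```java
public class ControlTupleSketch {
    /**
     * Chooses the file to read. A path in the control tuple wins; an empty or
     * null control field falls back to the File Name property. Returns null
     * when neither is available.
     */
    static String resolveFile(String fileNameProperty, String controlPath) {
        if (controlPath != null && !controlPath.isEmpty()) {
            return controlPath; // control tuple overrides the File Name property
        }
        if (fileNameProperty != null && !fileNameProperty.isEmpty()) {
            return fileNameProperty; // read or re-read the default file
        }
        return null; // nothing to read
    }

    /**
     * Chooses the HDFS user: control tuple first, then the User property,
     * then the user running the application.
     */
    static String resolveUser(String userProperty, String controlUser) {
        if (controlUser != null && !controlUser.isEmpty()) {
            return controlUser;
        }
        if (userProperty != null && !userProperty.isEmpty()) {
            return userProperty;
        }
        return System.getProperty("user.name");
    }

    public static void main(String[] args) {
        System.out.println(resolveFile("default.csv", null));           // prints default.csv
        System.out.println(resolveFile("default.csv", "/data/day2.csv")); // prints /data/day2.csv
        System.out.println(resolveUser("etl", null));                   // prints etl
    }
}
```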

Parsing Options Tab

Property	Data Type	Default	Description
Field Delimiter	string	, (comma)	The delimiter used to separate tokens in the input file. Control characters can be entered as `&#ddd;` where `ddd` is the character's ASCII value. For example, use `&#009;` for a tab character. A special exception also allows the `\t` character to be used in this field to represent a tab delimiter.
String Quote Character	string	" (double quote)	The optional quote character used in pairs to delimit string constants.
Timestamp Format	string	yyyy-MM-dd HH:mm:ss.SSSZ	The string format used to represent timestamp fields extracted from the input file. The default and ideal is the form expected by the `java.text.SimpleDateFormat` class described in the Oracle Java Platform SE reference documentation. If a timestamp value is read that does not match the specified format string, the entire record is discarded and a WARN message appears on the console that includes the text `invalid timestamp value`.
Lenient Parsing	boolean	Selected	Set this to true if you would like to parse timestamp values that do not conform to the specified format using default formats.
NULL String	string	None	The string which, if encountered in a CSV field when reading a file, is to be translated as a null tuple field value for the corresponding tuple field. If unspecified, the default string is `null`. You can designate any string to be considered the null value string.
Preserve Whitespace	boolean	Cleared	Set this to true to preserve leading and trailing white space in string fields.
Header Type	drop-down list	No header	The type of header used in the CSV file. Choose one of the following: (1) No header: the CSV file contains no header and is to be parsed without a header. (2) Ignore header: the first line of the CSV file is to be considered the header. The first line is skipped and not read into the adapter as a tuple. (3) Read header: the first line of the CSV file is to be considered the header, and compared against the schema used in your StreamBase application. Fields that do not match the schema are not parsed (including the subsequent fields in the following rows), and fields outside the range of the header are not parsed. Field order does not matter, because the adapter reorganizes the CSV file to fit the schema of the StreamBase application.
Incomplete Records	radio button	Populate with nulls	Specifies what to do when the adapter reads a record with fewer than the required number of fields. Discard: discard records with fewer than the required number of fields. Populate with nulls: process such records after populating the unspecified fields with nulls.
Discard Empty Records	check box	Selected	A special case that handles empty lines. Leave this selected if rows that contain at least some fields should produce output but empty lines should not. Clear it to send empty tuples for empty lines.
Log Warning	check box	Cleared	Select this checkbox if warning messages are to be logged when incomplete records are encountered. If cleared, no warning messages are logged for records with fewer than the required number of fields.
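As an illustration of the default Timestamp Format, the following sketch parses values with `java.text.SimpleDateFormat` and the pattern `yyyy-MM-dd HH:mm:ss.SSSZ`. The class and method names are hypothetical, and the sketch shows only strict pattern matching. It does not model the adapter's Lenient Parsing option, which falls back to default formats.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class TimestampFormatSketch {
    // The adapter's documented default Timestamp Format.
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss.SSSZ";

    /** Returns the parsed Date, or null if the value does not match PATTERN. */
    static Date parseOrNull(String value) {
        SimpleDateFormat format = new SimpleDateFormat(PATTERN);
        format.setLenient(false); // reject out-of-range fields such as month 13
        try {
            return format.parse(value);
        } catch (ParseException e) {
            return null; // the adapter would discard the record and log a WARN
        }
    }

    public static void main(String[] args) {
        // A value in the default format, with an RFC 822 time zone offset.
        System.out.println(parseOrNull("2024-03-01 09:30:00.250-0500") != null); // prints true
        // A value in some other format does not match.
        System.out.println(parseOrNull("03/01/2024") == null);                   // prints true
    }
}
```

The `Z` pattern letter expects a time zone offset such as `-0500`, so a value without one does not match the default format.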

Emit Options Tab

Property	Data Type	Default	Description
Repeat	int	1	The number of times to iterate over the CSV file. 0 specifies iterating indefinitely. Note that when this property is set to iterate indefinitely, a new file sent through the control port is not picked up.
Emit Policy	Radio button	Periodic	Specifies whether to emit tuples with a regular period or based on a field in the data. Specify Periodic, the default setting, to use the Period property below. In this case, the two Time field properties are dimmed. Specify Field based to use a field in the output tuple to control the tuple emission rate. In this case, the Period property is dimmed. Specify the field to use in the Time field property, and specify how to use that field with a selection in the Time field meaning property.
Period	int	0	Active only when Emit Policy is Periodic. Specifies the time, in milliseconds, to wait between the processing of records.
Time field meaning	Drop-down list	Emission times relative to the first record	Active only when Emit Policy is Field based. In the drop-down list, select one of the following options to specify how to use the time field named in the next property: Absolute delays before the first record; Emission times relative to the first record; Emission times relative to zero.
Time field	string	none	Active only when Emit Policy is Field based. Specifies the name of a field in the output tuple whose values are used to control the tuple emission rate.
Capture Transform Strategy	radio button	FLATTEN	The strategy to use when transforming capture fields for this operator: FLATTEN or NEST.
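The interaction of Repeat and Period under the Periodic emit policy can be pictured as a simple loop. The sketch below is an assumption-laden illustration, not the adapter's code: the class and method names are hypothetical, the records are held in memory rather than read from HDFS, and whether the adapter also pauses after the last record of a pass is not stated in this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class EmitLoopSketch {
    /**
     * Sends every record downstream, iterating over the list `repeat` times
     * (0 means indefinitely) and pausing `periodMillis` after each record.
     * Returns the number of records emitted.
     */
    static int replay(List<String> records, int repeat, long periodMillis,
                      Consumer<String> downstream) {
        int emitted = 0;
        for (int pass = 0; repeat == 0 || pass < repeat; pass++) {
            for (String record : records) {
                downstream.accept(record); // process the record to completion
                emitted++;
                if (periodMillis > 0) {
                    try {
                        Thread.sleep(periodMillis); // pause before the next record
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return emitted; // stop early if interrupted
                    }
                }
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        int count = replay(Arrays.asList("a,1", "b,2"), 3, 10L, out::add);
        System.out.println(count); // prints 6
    }
}
```

With Repeat set to 0 the outer loop never ends, which matches the note in the Repeat row that a new file sent on the control port is not picked up in that mode.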

HDFS Tab

Property	Data Type	Description
Buffer Size (Bytes)	int	The size of the buffer to be used. If empty, the default is used.
Configuration Files	Strings	The HDFS configuration files to use when connecting to the HDFS file system. For example, use the standard file, `core-defaults.xml`.

Edit Schema Tab

Use the Edit Schema tab to specify the schema of the output tuple for this adapter. For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.

Cluster Aware Tab

Use the settings in this tab to enable this operator or adapter for runtime start and stop conditions in a multi-node cluster. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.

Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Typechecking

Typechecking fails if the schema does not have at least one parameter, if the Delimiter is not a single character string, if the QuoteChar is longer than one character, or if the TimestampFormat is malformed. The File Name field fails to typecheck only if it is blank and you have not enabled the Start Control Port option.

A warning is emitted if the File Name property is empty and a null control tuple is received on the Start Control Port.
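The typecheck conditions listed above can be restated as a small validation routine. This is a hedged sketch only: the class, method, and message strings are hypothetical, the delimiter is assumed to have already been decoded from any `&#ddd;` or `\t` notation, and the timestamp pattern is validated by constructing a `java.text.SimpleDateFormat`, which may differ from the check the adapter performs.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;

public class TypecheckSketch {
    /** Returns a list of typecheck errors; an empty list means the settings pass. */
    static List<String> check(int schemaFieldCount, String delimiter, String quoteChar,
                              String timestampFormat, String fileName,
                              boolean startControlPort) {
        List<String> errors = new ArrayList<>();
        if (schemaFieldCount < 1) {
            errors.add("schema must have at least one field");
        }
        if (delimiter == null || delimiter.length() != 1) {
            errors.add("Delimiter must be a single-character string");
        }
        if (quoteChar != null && quoteChar.length() > 1) {
            errors.add("QuoteChar is longer than one character");
        }
        try {
            if (timestampFormat == null) {
                throw new IllegalArgumentException("missing pattern");
            }
            new SimpleDateFormat(timestampFormat); // throws on an illegal pattern letter
        } catch (IllegalArgumentException e) {
            errors.add("TimestampFormat is malformed");
        }
        boolean noFileName = fileName == null || fileName.isEmpty();
        if (noFileName && !startControlPort) {
            errors.add("File Name is blank and Start Control Port is disabled");
        }
        return errors;
    }

    public static void main(String[] args) {
        String pattern = "yyyy-MM-dd HH:mm:ss.SSSZ";
        System.out.println(check(3, ",", "\"", pattern, "in.csv", false)); // prints []
        System.out.println(check(3, ",", "\"", pattern, "", true));        // prints []
        System.out.println(check(3, ",,", "\"", pattern, "", false).size()); // prints 2
    }
}
```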

Connecting to the Amazon S3 File System

The HDFS Adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.

In order for the HDFS adapters to be able to access the Amazon S3 file system, you must also supply S3 authentication information via a configuration file or one of the other supported ways described here: Authenticating with S3.

You must also add the following dependency to your pom file to include the Apache Hadoop Amazon Web Services Support module:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.8.1</version>
</dependency>

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
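To make the `s3a://` URI style concrete, the following sketch splits such a URI into its bucket and object key using only `java.net.URI`. The class name, method names, and the bucket name `example-bucket` are hypothetical, and the sketch does not connect to S3 or use the Hadoop libraries.

```java
import java.net.URI;

public class S3aUriSketch {
    /** Returns the bucket name, which is the authority part of an s3a URI. */
    static String bucketOf(String location) {
        URI uri = URI.create(location);
        if (!"s3a".equals(uri.getScheme())) {
            throw new IllegalArgumentException("not an s3a URI: " + location);
        }
        return uri.getHost();
    }

    /** Returns the object key, which is the URI path without its leading slash. */
    static String keyOf(String location) {
        String path = URI.create(location).getPath();
        return path.startsWith("/") ? path.substring(1) : path;
    }

    public static void main(String[] args) {
        String location = "s3a://example-bucket/yourfile.txt";
        System.out.println(bucketOf(location)); // prints example-bucket
        System.out.println(keyOf(location));    // prints yourfile.txt
    }
}
```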

Suspend and Resume Behavior

On suspend, the HDFS CSV File Reader adapter finishes processing the current record, outputs the tuple, and then pauses. The input file remains open and the adapter retains its position in the file. The adapter stays paused until it is either shut down or resumed.

On resumption, the HDFS CSV File Reader adapter continues processing with the next record in the input file.