HDFS File Writer Adapter

Introduction

The Spotfire Streaming File Writer Adapter for Apache HDFS writes a file to a configured Hadoop Distributed File System resource.

The adapter opens and closes files in response to tuples received on its control input port, and writes file contents from tuples received on its data input port.

The adapter has a sample, described in HDFS File Writer and File System Adapter Sample.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

HDFS File Writer Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this required field to specify or change the name of this instance of this component. The name must be unique within the current EventFlow module. The name can contain alphanumeric characters, underscores, and escaped special characters. Special characters can be escaped as described in Identifier Naming Rules. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this adapter starts.

Enable Error Output Port: Select this checkbox to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally, enter text to briefly describe the purpose and function of the component. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Property Description
Default File Name The default file name to create on start-up. If no file name is specified, the control port must be enabled so that control tuples can open a file for writing.
Serialization Format The serialization format to use when writing in binary format.
User The default user if none is provided on the control input port.
Data Field The field from the incoming data port to write to the file. If left blank, the entire tuple is written as a string or binary, depending on the serialization type selected.
FileName Field The field from the incoming data tuple that determines which file the data is written to. This option is valid only when Enable Multiple Open Files is enabled.
Enable Control Port If checked, enables the control port, which allows the user to send control commands to open, flush, and close files.
Enable Status Port If checked, enables the status port, which gives information about the actions being taken against files.
Write Line Separator After Tuple If checked, writes the system new line character after each tuple is received and written to the file.
File Create Mode This determines how a file will be created:
  1. Append: If the file exists, its contents remain and new data is appended at the end.

  2. Overwrite: If the file exists, its contents are overwritten.

  3. Fail: If the file exists, a Reject status tuple is emitted and the file is not opened for writing.

File Compression This determines how a file will be compressed when written:
  1. None: The file will be written with no compression.

  2. GZip: The file will be written with the GZip compression format.

  3. BZip2: The file will be written with the BZip2 compression format.

  4. Zip: The file will be written with the standard Zip format and contain a single entry with the same name as the file name you specify, minus the extension.

Flush Interval The interval in milliseconds at which to flush the file to write out any buffered data. A value of 0 means a flush is performed after each tuple is received.
Synchronized Open Command If enabled, the operator performs the open command and waits for it to complete before continuing.
Synchronized Close Command If enabled, the operator performs the close command and waits for it to complete before continuing.
Synchronized Flush Command If enabled, the operator performs the flush command and waits for it to complete before continuing.
Synchronized Write Command If enabled, the operator performs the write command and waits for it to complete before continuing.
Enable Multiple Open Files If enabled, the operator allows multiple files to be open at the same time. This option requires every command to have an associated file name.
Log Level Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.
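
The create-mode and compression behaviors described above can be illustrated with a short Python sketch. This is a simplified, local-filesystem stand-in for the adapter's HDFS logic, not the product API; the function name and signature are hypothetical:

```python
import bz2
import gzip
import os

def open_for_write(path, create_mode="Append", compression="None"):
    """Open a file following the adapter's File Create Mode rules (illustrative).

    Append keeps existing contents and appends; Overwrite replaces them;
    Fail refuses to open an existing file.
    """
    if create_mode == "Fail" and os.path.exists(path):
        # The adapter would emit a Reject status tuple; this sketch raises instead.
        raise FileExistsError(f"{path} exists and create mode is Fail")
    mode = "ab" if create_mode == "Append" else "wb"
    if compression == "GZip":
        return gzip.open(path, mode)
    if compression == "BZip2":
        return bz2.open(path, mode)
    if compression == "None":
        return open(path, mode)
    # Zip is omitted here; note that Zip cannot be combined with Append mode.
    raise ValueError(f"unsupported compression: {compression}")
```

Note that the Append/Overwrite choice reduces to the file open mode, while the compression setting selects the stream wrapper around the file.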

HDFS Tab

Property Data Type Description
Buffer Size (Bytes) Int The size of the buffer to be used. If empty, the default is used.
Replication Short The required block replication for the file. If empty, the server default is used. This value applies only during file creation.
Block Size (Bytes) Long The data block size for the file. If empty, the server default is used. This value applies only during file creation.
Configuration Files String The HDFS configuration files to use when connecting to the HDFS file system. For example, use the standard file core-default.xml.

Serialize Options

This section describes the options available when a serialization format is selected.

Options

Property Description
Output Schema

The schema used by the serialization type as the target output schema. If this field is empty, the system generates the required schema based on the input tuple. If the log level is set to debug, the serializer outputs the generated schema to the log.

Default Mapping

This section describes the default schema mapping when no schema is provided.

Avro

Because all StreamBase fields are nullable, each StreamBase data type is converted to a union of the Avro null type and the Avro type described below.

StreamBase Data Type Avro Data Type
BLOB BYTES
BOOL BOOLEAN
DOUBLE DOUBLE
INT INT
LIST MAP or ARRAY; MAP if the list element is a tuple with two fields (key, value)
LONG LONG
TIMESTAMP LONG
STRING STRING
FUNCTION STRING
TUPLE RECORD
CAPTURE NOT SUPPORTED
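
The nullable-union rule above can be sketched in Python by building the Avro record-field fragment for a StreamBase field. The mapping table comes from the text; the helper itself is hypothetical, not part of the product:

```python
# StreamBase-to-Avro type mapping for scalar types, per the table above.
SB_TO_AVRO = {
    "BLOB": "bytes",
    "BOOL": "boolean",
    "DOUBLE": "double",
    "INT": "int",
    "LONG": "long",
    "TIMESTAMP": "long",
    "STRING": "string",
    "FUNCTION": "string",
}

def avro_field(name, sb_type):
    """Return an Avro record-field dict as a union of null and the mapped
    type, because every StreamBase field is nullable."""
    return {"name": name, "type": ["null", SB_TO_AVRO[sb_type]]}
```

For example, a StreamBase DOUBLE field named price becomes the Avro field {"name": "price", "type": ["null", "double"]}.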
Parquet

All data types are marked as optional.

StreamBase Data Type Parquet Data Type
BLOB BINARY
BOOL BOOLEAN
DOUBLE DOUBLE
INT INT32
LIST MAP or ARRAY; MAP if the list element is a tuple with two fields (key, value)
LONG INT64
TIMESTAMP INT64
STRING BINARY (UTF-8)
FUNCTION BINARY (UTF-8)
TUPLE GROUP
CAPTURE NOT SUPPORTED

Serialize Format Data Types

Avro Data Type Map to StreamBase

StreamBase Data Type Avro Data Type
BLOB BYTES, FIXED, STRING, UNION[NULL, BYTES], UNION[NULL, FIXED], UNION[NULL, STRING]
BOOL BOOLEAN, UNION[NULL, BOOLEAN]
DOUBLE DOUBLE, FLOAT, UNION[NULL, DOUBLE], UNION[NULL, FLOAT]
INT INT, UNION[NULL, INT]
LIST ARRAY, UNION[NULL, ARRAY]
LIST<TUPLE<STRING key, ? value>> MAP, UNION[NULL, MAP]
LONG LONG, UNION[NULL, LONG]
STRING STRING, UNION[NULL, STRING]
TIMESTAMP STRING, LONG, DOUBLE, UNION[NULL, STRING], UNION[NULL, LONG], UNION[NULL, DOUBLE]
TUPLE RECORD, UNION[NULL, RECORD]
FUNCTION STRING, UNION[NULL, STRING]
CAPTURE * Not supported

Parquet Data Type Map to StreamBase

StreamBase Data Type Parquet Data Type
BLOB BINARY
BOOL BOOLEAN
DOUBLE DOUBLE, FLOAT
INT INT
LIST LIST
LIST<TUPLE<? key, ? value>> MAP
LONG LONG
STRING BINARY UTF8
TIMESTAMP BINARY UTF8, INT64, DOUBLE
TUPLE GROUP
FUNCTION BINARY UTF8
CAPTURE * Not supported

Cluster Aware Tab

Use the settings in this tab to enable this operator or adapter for runtime start and stop conditions in a multi-node cluster. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.

Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Description of This Adapter's Ports

The File Writer adapter's ports are used as follows:

  • Data (input): Tuples are received on this port and written to the current file. The schema for this port can be anything.

  • Control (input): Tuples enqueued on this port cause the adapter to perform actions on files. The schema for this port has the following fields:

    • Command, string. The command to perform:

      1. Open: Open a file for writing with the options given.

      2. Close: Close the currently open file.

      3. Flush: Force a flush operation on the current file to write out any buffered data.

    • FileName, string. Required by the Open command. The name of the file to open.

    • CreateMode, string. Determines how a file will be created. If null, this value remains unchanged and the last value loaded is used:

      1. Append: If the file exists, its contents remain and new data is appended at the end.

      2. Overwrite: If the file exists, its contents are overwritten.

      3. Fail: If the file exists, a Reject status tuple is emitted and the file is not opened for writing.

    • Compression, string. Determines how a file will be compressed when written. If null, this value remains unchanged and the last value loaded is used:

      1. None: The file will be written with no compression.

      2. GZip: The file will be written with the GZip compression format.

      3. BZip2: The file will be written with the BZip2 compression format.

      4. Zip: The file will be written with the standard Zip format and contain a single entry with the same name as the file name you specify, minus the extension.

    • FlushInterval, int. The interval in milliseconds at which the file is flushed to write out any buffered data. If null, this value remains unchanged and the last value loaded is used.

    • WriteLineSeparator, boolean. Determines whether the system new line character is written after each tuple is received. If null, this value remains unchanged and the last value loaded is used.

  • Status (output): The adapter emits tuples from this port when significant events occur, such as when an attempt to open, flush, or close a file occurs. The schema for this port has the following fields:

    • Type, string: returns one of the following values to convey the type of event:

      • User Input

      • System

    • Action, string: returns an action associated with the event Type:

      • Rejected

      • Open

      • Close

      • Flush

      • Error

    • Object, string. Returns an event type-specific value, such as the name of the file for which an open failed or the control input tuple that was rejected.

    • Message, string. Returns a human-readable description of the event.

    • Time, timestamp. Returns the time this event occurred.

    • InputTuple, tuple. Returns a combined data and control tuple which was used during this event.
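
As a concrete illustration of the control schema above, the following Python sketch assembles a control tuple as a plain dict. The dict is a hypothetical stand-in for an actual StreamBase tuple; the field names match the control port schema described above:

```python
def make_control_tuple(command, file_name=None, create_mode=None,
                       compression=None, flush_interval=None,
                       write_line_separator=None):
    """Build a dict with the control port's fields.

    Fields left as None (null) leave the adapter's last-loaded value
    unchanged; FileName is required by the Open command.
    """
    if command == "Open" and file_name is None:
        raise ValueError("FileName is required by the Open command")
    return {
        "Command": command,
        "FileName": file_name,
        "CreateMode": create_mode,
        "Compression": compression,
        "FlushInterval": flush_interval,
        "WriteLineSeparator": write_line_separator,
    }
```

For example, make_control_tuple("Open", file_name="/data/out.txt", create_mode="Append") opens a file for appending while leaving the compression, flush interval, and line-separator settings unchanged.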

Typechecking and Error Handling

The File Writer adapter uses typecheck messages to help you configure the adapter within your StreamBase application. In particular, the adapter generates typecheck messages for the following reasons:

  • The Control Input Port does not have the required schema.

  • The field specified in the Data Field property does not exist in the incoming data schema.

  • The flush interval is not an integer value greater than or equal to 0.

  • The Default File Name property is blank and the control port is disabled.

  • The compression mode is Zip and the file create mode is Append.
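
The rules above can be summarized in a small validation sketch. This is a simplified illustration of the checks, not the adapter's actual typecheck code, and the function name and parameters are hypothetical:

```python
def typecheck(default_file_name, control_port_enabled,
              flush_interval, create_mode, compression):
    """Return a list of typecheck error messages mirroring the rules above."""
    errors = []
    if not isinstance(flush_interval, int) or flush_interval < 0:
        errors.append("Flush interval must be an integer >= 0")
    if not default_file_name and not control_port_enabled:
        errors.append("Default File Name is blank and the control port is disabled")
    if compression == "Zip" and create_mode == "Append":
        errors.append("Zip compression cannot be combined with Append create mode")
    return errors
```

A configuration with a default file name, a non-negative flush interval, and a non-Zip/Append combination passes all three checks and returns an empty list.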

Suspend and Resume Behavior

When suspended, the adapter stops processing requests to write files.

When resumed, the adapter once again starts processing requests to write files.

Related Topics