HDFS File Writer Adapter

Introduction

The TIBCO StreamBase® File Writer Adapter for Apache HDFS writes a file to a configured Hadoop Distributed File System resource.

The adapter opens and closes files in response to tuples received on its control input port, and writes the contents of tuples received on its data input port to the currently open file.

The adapter has a sample, described in HDFS File Writer and File System Adapter Sample.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

HDFS File Writer Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Start with application: If this field is set to Yes (default) or to a module parameter that evaluates to true, this instance of this adapter starts as part of the JVM engine that runs this EventFlow module. If this field is set to No or to a module parameter that evaluates to false, the adapter instance is loaded with the engine, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Default File Name: The default file name to create on start-up. If no file name is specified, the control port must be enabled so that control tuples can be sent to open a file for writing.

Serialization Format: The serialization format to use when writing in binary format.

User: The default user to use if none is provided on the control input port.

Data Field: The field from the incoming data port to write to the file. If left blank, the entire tuple is written, as a string or as binary depending on the selected serialization format.

FileName Field: The field from the incoming data tuple that determines which file the data is written to. This option is valid only when Enable Multiple Open Files is enabled.

Enable Control Port: If checked, enables the control port, which lets you send control commands to open, flush, and close files.

Enable Status Port: If checked, enables the status port, which emits information about the actions taken against files.

Write Line Separator After Tuple: If checked, the system line separator is written after each tuple is received and written to the file.

File Create Mode: Determines how a file is created:

  1. Append: If the file exists, its contents remain and the new data is appended at the end.

  2. Overwrite: If the file exists, its contents are overwritten.

  3. Fail: If the file exists, a Reject status tuple is emitted and the file is not opened for writing.

File Compression: Determines how a file is compressed as it is written:

  1. None: The file is written with no compression.

  2. GZip: The file is written in the GZip compression format.

  3. BZip2: The file is written in the BZip2 compression format.

  4. Zip: The file is written in the standard Zip format and contains a single entry with the same name as the file name you specify, minus the extension.

Flush Interval: The interval, in milliseconds, at which the file is flushed to write out any buffered data. A value of 0 means a flush is performed after each tuple is received.

Synchronized Open Command: If enabled, the operator performs the open command and waits for it to complete before continuing.

Synchronized Close Command: If enabled, the operator performs the close command and waits for it to complete before continuing.

Synchronized Flush Command: If enabled, the operator performs the flush command and waits for it to complete before continuing.

Synchronized Write Command: If enabled, the operator performs the write command and waits for it to complete before continuing.

Enable Multiple Open Files: If enabled, the operator allows multiple files to be open at the same time. This option requires every command to have an associated file name.

Log Level: Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level; if set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.

HDFS Tab

Buffer Size (Bytes): Int. The size of the buffer to use. If empty, the default is used.

Replication: Short. The required block replication for the file; used only during file creation. If empty, the server default is used.

Block Size (Bytes): Long. The data block size for the file; used only during file creation. If empty, the server default is used.

Configuration Files: String. The HDFS configuration files to use when connecting to the HDFS file system. For example, the standard file core-default.xml.
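
The Configuration Files setting corresponds to Hadoop's standard client configuration mechanism. As a rough sketch (plain Hadoop Java API, not the adapter's internal code; the file path and URI shown are placeholders), configuration resources feed the file-system connection like this:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConfigSketch {
        public static void main(String[] args) throws Exception {
            // Load client settings from a configuration file, then connect.
            Configuration conf = new Configuration();
            conf.addResource(new Path("/path/to/core-site.xml"));
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
            System.out.println("Connected to: " + fs.getUri());
        }
    }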

Serialize Options

This section describes the options available when a serialization format is selected.

Options

Output Schema: The schema used by the selected serialization format as the target output schema. If this field is empty, the system generates the required schema based on the input tuple. If the log level is set to DEBUG, the serializer outputs the generated schema to the log.

Default Mapping

This section describes the default schema mapping when no schema is provided.

Avro

Because all StreamBase fields are nullable, each data type is converted to a union of the null type and the corresponding Avro type shown below.

StreamBase Data Type → Avro Data Type
BLOB → BYTES
BOOL → BOOLEAN
DOUBLE → DOUBLE
INT → INT
LIST → MAP or ARRAY (MAP if the list element is a tuple with two fields, key and value)
LONG → LONG
TIMESTAMP → LONG
STRING → STRING
FUNCTION → STRING
TUPLE → RECORD
CAPTURE → NOT SUPPORTED
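
For illustration, the following sketch uses Avro's Java SchemaBuilder API to construct the kind of schema this mapping produces for a hypothetical input tuple with a string field and a double field (the record and field names are invented, not generated by the adapter):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class AvroMappingSketch {
        public static void main(String[] args) {
            // Hypothetical record with two nullable fields, matching the
            // union-of-null mapping described above.
            Schema schema = SchemaBuilder.record("Trade")
                .fields()
                .optionalString("symbol")   // STRING -> UNION[NULL, STRING]
                .optionalDouble("price")    // DOUBLE -> UNION[NULL, DOUBLE]
                .endRecord();
            System.out.println(schema.toString(true)); // prints the JSON form
        }
    }
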
Parquet

All data types are marked as optional.

StreamBase Data Type → Parquet Data Type
BLOB → BINARY
BOOL → BOOLEAN
DOUBLE → DOUBLE
INT → INT32
LIST → MAP or ARRAY (MAP if the list element is a tuple with two fields, key and value)
LONG → INT64
TIMESTAMP → INT64
STRING → BINARY (UTF-8)
FUNCTION → BINARY (UTF-8)
TUPLE → GROUP
CAPTURE → NOT SUPPORTED
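
A comparable sketch for Parquet, using the parquet-mr Java API (again with invented names), shows the all-optional schema this mapping implies:

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class ParquetMappingSketch {
        public static void main(String[] args) {
            // Hypothetical schema in which every field is optional, per the
            // mapping above.
            MessageType schema = MessageTypeParser.parseMessageType(
                "message Trade {"
              + "  optional binary symbol (UTF8);"  // STRING -> BINARY (UTF-8)
              + "  optional double price;"          // DOUBLE -> DOUBLE
              + "  optional int64 ts;"              // TIMESTAMP -> INT64
              + "}");
            System.out.println(schema);
        }
    }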

Serialize Format Data Types

Avro Data Type Mapping to StreamBase

StreamBase Data Type → Avro Data Type
BLOB → BYTES, FIXED, STRING, UNION[NULL, BYTES], UNION[NULL, FIXED], UNION[NULL, STRING]
BOOL → BOOLEAN, UNION[NULL, BOOLEAN]
DOUBLE → DOUBLE, FLOAT, UNION[NULL, DOUBLE], UNION[NULL, FLOAT]
INT → INT, UNION[NULL, INT]
LIST → ARRAY, UNION[NULL, ARRAY]
LIST<TUPLE<STRING key, ? value>> → MAP, UNION[NULL, MAP]
LONG → LONG, UNION[NULL, LONG]
STRING → STRING, UNION[NULL, STRING]
TIMESTAMP → STRING, LONG, DOUBLE, UNION[NULL, STRING], UNION[NULL, LONG], UNION[NULL, DOUBLE]
TUPLE → RECORD, UNION[NULL, RECORD]
FUNCTION → STRING, UNION[NULL, STRING]
CAPTURE → Not supported

Parquet Data Type Mapping to StreamBase

StreamBase Data Type → Parquet Data Type
BLOB → BINARY
BOOL → BOOLEAN
DOUBLE → DOUBLE, FLOAT
INT → INT
LIST → LIST
LIST<TUPLE<? key, ? value>> → MAP
LONG → LONG
STRING → BINARY (UTF-8)
TIMESTAMP → BINARY (UTF-8), INT64, DOUBLE
TUPLE → GROUP
FUNCTION → BINARY (UTF-8)
CAPTURE → Not supported

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Description of This Adapter's Ports

The File Writer adapter's ports are used as follows:

  • Data (input): Tuples are received on this port and written to the current file. The schema for this port can be anything.

  • Control (input): Tuples enqueued on this port cause the adapter to perform actions on files; see the enqueue example following this port list. The schema for this port has the following fields:

    • Command, string. The command to perform:

      1. Open: Open a file for writing with the given options.

      2. Close: Close the currently open file.

      3. Flush: Force a flush operation on the current file to write out any buffered data.

    • FileName, string. Required by the Open command; the name of the file to open.

    • CreateMode, string. Determines how the file is created. If null, this setting remains unchanged and the last value loaded is used:

      1. Append: If the file exists, its contents remain and the new data is appended at the end.

      2. Overwrite: If the file exists, its contents are overwritten.

      3. Fail: If the file exists, a Reject status tuple is emitted and the file is not opened for writing.

    • Compression, string. Determines how the file is compressed as it is written. If null, this setting remains unchanged and the last value loaded is used:

      1. None: The file is written with no compression.

      2. GZip: The file is written in the GZip compression format.

      3. BZip2: The file is written in the BZip2 compression format.

      4. Zip: The file is written in the standard Zip format and contains a single entry with the same name as the file name you specify, minus the extension.

    • FlushInterval, int. The interval, in milliseconds, at which the file is flushed to write out any buffered data. If null, this setting remains unchanged and the last value loaded is used.

    • WriteLineSeparator, boolean. Determines whether the system line separator is written after each tuple is received. If null, this setting remains unchanged and the last value loaded is used.

  • Status (output): The adapter emits tuples from this port when significant events occur, such as when an attempt to open, flush, or close a file occurs. The schema for this port has the following fields:

    • Type, string: returns one of the following values to convey the type of event:

      • User Input

      • System

    • Action, string: returns an action associated with the event Type:

      • Rejected

      • Open

      • Close

      • Flush

      • Error

    • Object, string. Returns an event type-specific value, such as the name of the file for which an open failed, or the control input tuple that was rejected.

    • Message, string. Returns a human-readable description of the event.

    • Time, timestamp. Returns the time this event occurred.

    • InputTuple, tuple. Returns the combined data and control tuple that was used during this event.
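
For example, assuming the adapter's control input stream is exposed as ControlIn (the stream name, file path, and field values here are hypothetical), you could open a file from the command line with sbc enqueue, supplying one CSV tuple on standard input in the schema order shown above:

    sbc enqueue ControlIn
    Open,/user/sbuser/output/trades.txt,Overwrite,None,1000,true

Leaving a field empty (null) keeps the corresponding setting unchanged, as described above.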

Typechecking and Error Handling

The File Writer adapter uses typecheck messages to help you configure the adapter within your StreamBase application. In particular, the adapter generates typecheck messages for the following reasons:

  • The Control Input Port does not have the required schema.

  • The field specified in the Data Field property does not exist in the incoming data schema.

  • The flush interval is not an integer value greater than or equal to 0.

  • The Default File Name property is blank and the control port is disabled.

  • The compression mode is Zip and the file create mode is Append.

Connecting to the Amazon S3 File System

The HDFS Adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.

For the HDFS adapters to access the Amazon S3 file system, you must also supply S3 authentication information via a configuration file, or in one of the other supported ways described in Authenticating with S3.

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
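
For example, a minimal configuration file supplying the keys might look like the following (the property names are Hadoop's standard S3A settings; the values are placeholders):

    <configuration>
      <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
      </property>
    </configuration>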

Suspend and Resume Behavior

When suspended, the adapter stops processing requests to write files.

When resumed, the adapter once again starts processing requests to write files.
