HDFS Binary File Writer Output Adapter

Introduction

The TIBCO StreamBase® Binary File Writer for Apache HDFS is an embedded output adapter that takes tuples from a stream and writes them to a structured binary file on a connected Hadoop Distributed File System resource.

In general, you will use the HDFS Binary File Writer and HDFS Binary File Reader as an adapter pair to write a stream of tuples to a file, then read them back from that file at a different point in your application. The HDFS Binary File Reader and Writer adapters are much like the HDFS CSV File Reader and Writer adapter pair, except that reading and writing binary files is significantly faster than reading and writing text files in CSV format. You might use the HDFS Binary File Writer and Reader as part of a High Availability scenario to preserve the contents of a tuple stream to disk. As part of a failover event, the secondary server in a StreamBase Server cluster could restore the contents of a stream with a fast read of the saved file.

You can also generate an output file of binary tuple data for use as an input for a feed simulation. In this case, be sure to use the .bin file name extension.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

Ports

By default, an instance of the Binary File Writer adapter has a single input port.

You can add an optional Error Output Port, which outputs a StreamBase error tuple for any adapter error.

Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

In the tables in this section, the Property column shows each property name as found in the Adapter Properties tab of the Properties view for this adapter.

When using this adapter in a StreamSQL program with the APPLY JAVA statement, you must convert the Studio property names to parameter names using the simple formula described in APPLY Statement.

General Tab

This section describes the properties on the General tab in the Properties view for the HDFS Binary File Writer adapter.

Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes (default) or to a module parameter that evaluates to true, this instance of this adapter starts as part of the JVM engine that runs this EventFlow module. If this field is set to No or to a module parameter that evaluates to false, the adapter instance is loaded with the engine, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Settings Tab

File Name (string, default: none)

Specify the name of the binary output file to write.

When using this adapter to generate binary tuple data as input for a feed simulation, use a file name with the .bin extension, even when using the Compress data option described below.

If this adapter is configured with a maximum file size, the file name is appended with a counter extension when the maximum file size is reached. For example, if you specify a file name of output, the file is rolled over to output.0000001, then to output.0000002, and so on.

User (string, default: none)

The user name with which to access the file. If blank, the currently logged-in user is used.

Capture Transform Strategy (radio button, default: FLATTEN)

The strategy to use when transforming capture fields for this operator. The FLATTEN strategy expands capture fields at the same level as non-captured fields; FLATTEN is the default and is usually sufficient. The NEST strategy expands captured fields by encapsulating them as type Tuple; use NEST to avoid duplication of field names, at the cost of a more complex schema. For example, a capture field that captures fields named price and qty yields top-level price and qty fields with FLATTEN, but a single tuple-valued field containing price and qty with NEST.

Max File Size (int, default: 0, meaning no rollover)

The maximum size, in bytes, of the file on disk. If the file reaches this limit, it is renamed with the current timestamp, and new data is written to the name specified in the File Name property.

Max Roll Seconds (int, default: 0, meaning no rollover)

The maximum number of seconds before file names are rolled over, as described above.

Flush Interval (int, default: 1)

How often, in seconds, to force tuples to disk. Set this value to zero to flush immediately.

Sync on flush (check box, default: false (cleared))

Syncs operating system buffers to the file system on each flush, to make sure that all changes are written. Using this option incurs a large performance penalty.

Compress data (check box, default: false (cleared))

If selected, the adapter compresses its output in gzip format.

If File Exists (radio buttons, default: Truncate existing file)

The action to take if the specified binary file already exists when the adapter is started: Truncate existing file or Fail.

Throttle Error Messages (check box, default: false (cleared))

If selected, the adapter shows any given error message only once.

Log Level (drop-down list, default: INFO)

Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level; if set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.

HDFS Tab

Buffer Size (Bytes) (Int)

The size of the buffer to use when writing. If empty, the default is used.

Replication (Short)

The required block replication for the file. This value is used only during file creation; if empty, the server default is used.

Block Size (Bytes) (Long)

The data block size for the file. This value is used only during file creation; if empty, the server default is used.
Configuration Files (String)

The HDFS configuration files to use when connecting to the HDFS file system; for example, the standard file core-site.xml.
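
A minimal HDFS configuration file for connecting to a cluster might look like the following sketch, where the namenode host and port are placeholders for your cluster's values and fs.defaultFS is the standard Hadoop property naming the default file system:

  <?xml version="1.0"?>
  <configuration>
    <!-- Standard Hadoop property naming the default file system;
         replace the host and port with your namenode's values. -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
  </configuration>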

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Typechecking and Error Handling

Typechecking fails in the following circumstances:

  • The File Name property is null or a zero-length string.

  • The Flush Interval is less than zero.

  • The Max File Size is less than zero.

Connecting to the Amazon S3 File System

The HDFS adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.

For the HDFS adapters to access the Amazon S3 file system, you must also supply S3 authentication information, either in a configuration file or in one of the other supported ways described in Authenticating with S3.

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
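
For example, a configuration file supplying the access key and secret key directly might resemble the following sketch; fs.s3a.access.key and fs.s3a.secret.key are the standard Hadoop S3A credential properties, and the values shown are placeholders:

  <?xml version="1.0"?>
  <configuration>
    <!-- Hadoop S3A credential properties; replace the values with
         your own AWS access key and secret key. -->
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>
  </configuration>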

Suspend and Resume Behavior

On suspend, this adapter stops processing tuples and closes the current file.

On resume, the adapter opens a new output file named according to the file name rollover logic described for the File Name property above, and resumes processing tuples. That is, if the output file is named output.bin, the new output file on resume is named output.bin.000000, then output.bin.000001, and so on.