HDFS File System Adapter

Introduction

The TIBCO StreamBase® File System Adapter for HDFS performs various operations on an HDFS file system, as determined by the Command value of the input tuple. HDFS is the well-known acronym for Apache's Hadoop Distributed File System.

The adapter shares a sample with the HDFS File Writer adapter, as described in HDFS File Writer and File System Adapter Sample.

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

HDFS File System Properties

This section describes the properties you can set for this adapter, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Start with application: If this field is set to Yes (default) or to a module parameter that evaluates to true, this instance of this adapter starts as part of the JVM engine that runs this EventFlow module. If this field is set to No or to a module parameter that evaluates to false, the adapter instance is loaded with the engine, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

Enable Status Port (check box; default: Selected): If enabled, a status output port is present on which tuples are emitted to convey information about significant events within the adapter.

Enable HDFS Monitor Event Port (check box; default: Selected): If enabled, the startmonitor command can be used to start monitoring a path in the HDFS file system. When changes occur, tuples are emitted on this port.

HDFS Configuration Files (strings; default: blank): The HDFS configuration files to use when connecting to the HDFS file system. For example, core-defaults.xml is the standard file to use.

Log Level (drop-down list; default: INFO): Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.
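
The HDFS Configuration Files property names standard Hadoop client configuration files. The following minimal Java sketch (illustrative only, not the adapter's own code; the file paths are hypothetical) shows how such files are typically layered onto a Hadoop Configuration object to select the target cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConfigSketch {
        public static void main(String[] args) throws Exception {
            // Each listed configuration file is layered onto a single
            // Configuration object; later files override earlier ones.
            Configuration conf = new Configuration();
            conf.addResource(new Path("/etc/hadoop/conf/core-site.xml")); // hypothetical path
            conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml")); // hypothetical path

            // The merged configuration (fs.defaultFS and related settings)
            // determines which HDFS cluster the connection targets.
            try (FileSystem fs = FileSystem.get(conf)) {
                System.out.println("Connected to: " + fs.getUri());
            }
        }
    }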

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Description of This Adapter's Ports

The HDFS File System adapter's ports are used as follows (a sketch of the equivalent Hadoop client calls appears after this list):

  • Control (input): Tuples enqueued on this port cause the adapter to perform an operation on the HDFS file system. The schema for this port has the following fields:

    • Command, string. The command to perform on the HDFS file system. Valid commands are:

      • exists: Checks the HDFS file system and determines if a path exists. This works for both files and directories.

      • mkdirs: Creates a directory at the given path, including any missing parent directories. The optional Permissions tuple specifies the permissions to apply.

      • rename: Renames a file or directory in the HDFS file system; both Path and PathDestination fields are required.

      • listfiles: Lists files in a path. Use the Recursive flag to list all subdirectories recursively as well.

      • filestatus: Gets the file status for a path. This can be a file or a directory.

      • delete: Deletes a path (file or directory).

      • status: Gets system status information about the HDFS file system (Capacity, Remaining, and Used).

      • startmonitor: Starts monitoring an HDFS path. Updates are emitted on the monitor event port, which must be enabled.

      • stopmonitor: Stops monitoring an HDFS path.

    • Path, string or list<string>. The path or paths to use when performing an operation. This value is used by all commands but is optional for status; if no path is supplied for status, the status of the entire HDFS file system is returned.

    • PathDestination, string. The destination path when performing a rename operation.

    • Recursive, boolean. Used during the delete and listfiles operations.

      • delete: If the path is a directory and Recursive is true, the directory and all its contents are deleted. If false and the directory is not empty, an exception occurs.

      • listfiles: If true, subdirectories are traversed recursively.

    • User, string. The user name to use when performing the operation. If null, the user executing the sbapp is used.

    • Permissions, tuple. A set of permissions to apply when performing the mkdirs operation.

      • User: The user permissions to set.

      • Group: The group permissions to set.

      • Other: The other permissions to set.

      The value of each permissions string must be exactly one of the following:

      • NONE

      • EXECUTE

      • WRITE

      • WRITE_EXECUTE

      • READ

      • READ_EXECUTE

      • READ_WRITE

      • ALL

  • Data (output): The adapter emits tuples from this port when an input command produces output:

    • Exists, boolean. Used for the exists command.

    • Files, list. Used for the listfiles and filestatus commands. This returns a detailed status list for the files.

    • Status, tuple. Used for the status command. This returns basic file system status.

    • PassThrough, tuple. Returns the input tuple that caused the output to occur.

  • Status (output): The adapter emits tuples from this port when significant events occur. The schema for this port has the following fields:

    • type, string. Returns the severity level of this event (ERROR, WARN, INFO).

    • action, string. Returns the action associated with the event type, which is the command that was entered.

    • object, string. Returns an event type-specific value.

    • message, string. Returns a human-readable description of the event.

    • time, timestamp. The time the status event occurred.

    • inputTuple, tuple. Returns the input tuple that caused the status message.

  • MonitorEvents (output): The adapter emits tuples from this port when the startmonitor command has been used and HDFS file system changes occur on the path being monitored:

    • Action, string. The type of action that occurred (append, close, create, metadata, rename, unlink, unknown).

    • Path, string. The path affected by this event.

    • PathDestination, string. If a rename operation, this is the destination path.

    • Length, long. The size of the file on close events.

    • CreationTime, timestamp. The time the path was created.

    • ModificationTime, timestamp. The time the path was changed.

    • Permissions, tuple. Returns a tuple containing the permissions if they changed and the event is metadata.
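
Each control command corresponds closely to an operation in the public Hadoop client API, and the MonitorEvents Action values mirror the HDFS inotify event types. The following minimal Java sketch (illustrative only, not the adapter's implementation; the cluster URI and all paths are hypothetical) shows the equivalent Hadoop calls:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.fs.permission.FsAction;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;
    import org.apache.hadoop.hdfs.inotify.DFSInotifyEventInputStream;
    import org.apache.hadoop.hdfs.inotify.Event;
    import org.apache.hadoop.hdfs.inotify.EventBatch;

    public class HdfsCommandSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            URI cluster = URI.create("hdfs://namenode:8020"); // hypothetical cluster

            try (FileSystem fs = FileSystem.get(cluster, conf)) {
                // exists: works for both files and directories
                System.out.println(fs.exists(new Path("/data/example.txt")));

                // rename: needs both Path and PathDestination
                fs.rename(new Path("/data/example.txt"), new Path("/data/renamed.txt"));

                // listfiles: the boolean argument is the Recursive flag
                RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/data"), true);
                while (it.hasNext()) {
                    System.out.println(it.next().getPath());
                }

                // filestatus: detailed status for a single file or directory
                FileStatus status = fs.getFileStatus(new Path("/data"));
                System.out.println(status.getLen());

                // delete: Recursive=true removes a non-empty directory and its contents
                fs.delete(new Path("/data/old"), true);

                // status: overall Capacity, Remaining, and Used for the file system
                FsStatus fsStatus = fs.getStatus();
                System.out.printf("capacity=%d remaining=%d used=%d%n",
                        fsStatus.getCapacity(), fsStatus.getRemaining(), fsStatus.getUsed());

                // mkdirs with a Permissions tuple: the FsAction constants carry the
                // same names as the permission values listed above (NONE ... ALL)
                fs.mkdirs(new Path("/data/new"),
                        new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));
            }

            // startmonitor/stopmonitor correspond to the HDFS inotify event stream
            // (Hadoop 2.7+ API, typically requiring HDFS superuser privileges);
            // its event types map to the MonitorEvents Action values above.
            HdfsAdmin admin = new HdfsAdmin(cluster, conf);
            DFSInotifyEventInputStream events = admin.getInotifyEventStream();
            EventBatch batch = events.take(); // blocks until changes occur
            for (Event event : batch.getEvents()) {
                System.out.println(event.getEventType()); // CREATE, CLOSE, APPEND, RENAME, METADATA, UNLINK
            }
        }
    }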

Typechecking and Error Handling

The HDFS File System adapter uses typecheck messages to help you configure the adapter within your StreamBase application. In particular, the adapter generates typecheck messages for the following reasons:

  • The Control input port does not have exactly the required fields, each with the correct data type.

  • The adapter generates warning messages during runtime under various conditions.

Connecting to the Amazon S3 File System

The HDFS Adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.

In order for the HDFS adapters to be able to access the Amazon S3 file system, you must also supply S3 authentication information via a configuration file or one of the other supported methods described in Authenticating with S3.

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
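
As an illustration of what that configuration must supply, the following minimal Java sketch (the bucket name and credential values are placeholders) sets the standard Hadoop S3A credential properties programmatically and opens an s3a:// connection; in the adapter, these properties would normally come from a configuration file instead:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aConnectionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standard Hadoop S3A credential properties; placeholder values.
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            try (FileSystem fs = FileSystem.get(URI.create("s3a://mybucket/"), conf)) {
                System.out.println(fs.exists(new Path("s3a://mybucket/yourfile.txt")));
            }
        }
    }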

Suspend and Resume Behavior

When suspended, the adapter stops processing requests to perform operations on the file system.

When resumed, the adapter once again starts processing requests to perform operations on the file system.

Related Topics