The TIBCO StreamBase® Binary File Reader For Apache Hadoop Distributed File System (HDFS) is an embedded input adapter that reads structured binary files from an Apache Hadoop HDFS file system. The files must have been created with the StreamBase HDFS Binary File Writer Output Adapter. The adapter is not a general-purpose tool for reading arbitrary binary files.
In general, you use the HDFS Binary File Writer and HDFS Binary File Reader as an adapter pair to write a stream of tuples to a file, then read them back from that file at a different point in your application. The HDFS Binary File Reader and Writer adapters are much like the HDFS CSV File Reader and Writer adapter pair, except that reading and writing binary files is significantly faster than reading and writing text files in CSV format. You might use the HDFS Binary File Writer and Reader as part of a High Availability scenario to preserve the contents of a tuple stream to disk. As part of a failover event, the secondary server in a StreamBase Server cluster could restore the contents of a stream with a fast read of the saved file.
If the HDFS Binary File Writer wrote its output file with its Compress data option enabled (which writes the output file in gzip format with a .gz extension), then the HDFS Binary File Reader adapter automatically uncompresses the file when reading it.
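This page does not say how the Reader detects compression. One common approach, shown here as a minimal Python sketch outside StreamBase, is to check for the two-byte gzip magic number instead of relying on the .gz extension. The function name open_maybe_gzip is hypothetical and is not part of any StreamBase API.

```python
import gzip
import io

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(raw: bytes):
    """Return a readable binary stream over raw, decompressing gzip data.

    Illustrative only: the real adapter may detect compression differently.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb")
    return io.BytesIO(raw)
```

With this approach, the caller reads records from the returned stream in the same way whether or not the file was compressed.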
Note
When using the HDFS core libraries on Windows, you may see a console message similar to "Failed to locate the winutils binary in the hadoop binary path". The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.
By default, an instance of the Binary File Reader adapter has a single output port. In this case, the name of the file to process is designated in the File Name setting, and the adapter begins processing that file as soon as it starts.
You can add one or more of the following optional ports:
- An Error Output Port, which outputs a StreamBase error tuple for any adapter error.
- A Control Input Port, which accepts a tuple containing the name of an input file to process.
- An Event Output Port, which outputs a tuple every time an input file is opened or closed.
The optional ports are further described in the next section.
The Binary File Reader Adapter has the following properties on the General tab in its Properties View in StreamBase Studio:
Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.
Adapter: A read-only field that shows the formal name of the adapter.
Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.
Start with application: If this field is set to Yes (default) or to a module parameter that evaluates to true, this instance of this adapter starts as part of the JVM engine that runs this EventFlow module. If this field is set to No or to a module parameter that evaluates to false, the adapter instance is loaded with the engine, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager.
Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.
Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.
The Binary File Reader Adapter has the following properties on the Adapter Settings tab in its Properties View in StreamBase Studio:
Property | Data Type | Description |
---|---|---|
File Name | string | Specify the name of a binary input file to read. The file must have been created by the Binary File Writer Adapter. The Reader adapter automatically detects whether the Writer adapter created the file with its compression option enabled. As an alternative to hard-coding an input file name, you can use the Start Control Port setting described below. |
User | string | The user to use when reading the file at startup. |
Capture Transform Strategy | radio button | The strategy to use when transforming capture fields for this operator. The FLATTEN strategy expands capture fields at the same level as non-captured fields. FLATTEN is the default and is usually sufficient. The NEST strategy expands captured fields by encapsulating them as type Tuple. Use NEST to avoid duplication of field names, at the cost of a more complex schema. |
Period | int | The time, in milliseconds, to wait between processing each record. If 0, the adapter reads the file one record after another without pausing. Use this setting to slow down processing of the input file. |
Start Control Port | check box | If selected, this setting adds an input port to the adapter instance. The input port is a control port that accepts a schema with a single field of type string and, optionally, a second field of type string that represents the user. Your application can send a tuple to this control port containing the name of a binary file to read and, optionally, the user for accessing the file. In this case, the adapter waits for a tuple to arrive at the input control port before it begins reading any files. You can send a sequence of file names to the control port, and the adapter processes each file in turn. |
Start Event Port | check box | If selected, this setting adds an event output port to the adapter instance. The event port outputs a tuple for each file open and file close event in the adapter. That is, an event tuple is sent every time an input file is opened or closed. |
Log Level | INFO | Controls the level of verbosity the adapter uses to issue informational traces to the console. This setting is independent of the containing application's overall log level. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE. |
Property | Data Type | Description |
---|---|---|
Buffer Size (Bytes) | int | The size of the buffer to be used. If empty, the default is used. |
Configuration Files | Strings | The HDFS configuration files to use when connecting to the HDFS file system. For example, use the standard file, core-defaults.xml. |
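The interaction of the Period, Start Control Port, and Start Event Port settings can be pictured with a small Python simulation. This is only a sketch of the behavior described in the table, not adapter code: the function process_files, the read_records callback, and the shapes of the emitted items are all hypothetical, and the real event port schema is not described on this page.

```python
import time


def process_files(file_names, read_records, period_ms=0, sleep=time.sleep):
    """Simulate the reader: handle queued file names one after another.

    file_names   -- names arriving on the (simulated) control port, in order
    read_records -- hypothetical callback returning the records of one file
    period_ms    -- pause after each record; 0 means no pause
    sleep        -- injectable so that tests do not actually wait
    """
    for name in file_names:
        yield ("event", "open", name)       # event port: file opened
        for record in read_records(name):
            yield ("record", record)        # main output port
            if period_ms > 0:
                sleep(period_ms / 1000.0)   # Period is given in milliseconds
        yield ("event", "close", name)      # event port: file closed
```

Each file name produces one open event, its records in order, and one close event before the next file starts.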
Use the Edit Schema tab to specify the schema of the output tuple for this adapter. For general instructions on using the Edit Schema tab, see the Properties: Edit Schema Tab section of the Defining Input Streams page.
Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.
Caution
Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.
Typechecking fails if the schema does not have at least one parameter. The File Name field fails to typecheck only if it is blank and you have not enabled the Start Control Port option.
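The two typecheck rules above can be restated as a short Python sketch. The function name, parameter names, and error messages are hypothetical, and treating a whitespace-only file name as blank is an assumption.

```python
def typecheck_errors(schema_fields, file_name, start_control_port):
    """Return a list of typecheck problems; an empty list means success."""
    errors = []
    if len(schema_fields) == 0:
        errors.append("schema must define at least one field")
    # A blank File Name is acceptable only when file names arrive
    # through the control port instead.
    if not file_name.strip() and not start_control_port:
        errors.append("File Name is blank and Start Control Port is not enabled")
    return errors
```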
The HDFS Adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.
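To make the parts of an s3a URI concrete, the following Python sketch splits one into its bucket and object path using only the standard library. The helper name split_s3a_uri and the example bucket and file names are hypothetical.

```python
from urllib.parse import urlsplit


def split_s3a_uri(uri: str):
    """Split an s3a://bucket/path URI into (bucket, object_path)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3a":
        raise ValueError("expected an s3a:// URI, got: " + uri)
    # The bucket sits in the authority position; the rest is the object path.
    return parts.netloc, parts.path.lstrip("/")
```

For example, split_s3a_uri("s3a://my-bucket/data/ticks.bin") separates the bucket my-bucket from the object path data/ticks.bin.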
In order for the HDFS adapters to be able to access the Amazon S3 file system, you must also supply S3 authentication information via a configuration file or one of the other supported ways described here: Authenticating with S3.
The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
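The sample's own configuration file is not reproduced here. As a rough sketch, a Hadoop-style configuration file that carries S3A credentials generally sets the fs.s3a.access.key and fs.s3a.secret.key properties. The Python helper below generates such a file; build_hadoop_config is a hypothetical name, and the key values are placeholders that you would replace with your own.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_hadoop_config(properties: dict) -> str:
    """Render name/value pairs as a Hadoop <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Pretty-print so the file is easy to review by hand.
    compact = ET.tostring(root, encoding="unicode")
    return minidom.parseString(compact).toprettyxml(indent="  ")


if __name__ == "__main__":
    print(build_hadoop_config({
        "fs.s3a.access.key": "YOUR_ACCESS_KEY",   # placeholder
        "fs.s3a.secret.key": "YOUR_SECRET_KEY",   # placeholder
    }))
```

You would then list the generated file in the adapter's Configuration Files property. Keep such a file out of version control, because it contains credentials.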
On suspend, the Binary File Reader will finish processing the current record, output the tuple, and then pause. The input file will remain open and the adapter will retain its position in the file. It will stay paused until it is either shut down or resumed.
When it resumes, the Binary File Reader continues processing with the next record in the input file.
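The suspend and resume behavior can be modeled with a small Python class that keeps its position between calls. This is a simplified sketch, not the adapter's implementation; the class PausableReader and its method names are hypothetical, and an in-memory list stands in for the open input file.

```python
class PausableReader:
    """Toy model of a reader that pauses between records and keeps its place."""

    def __init__(self, records):
        self._records = list(records)   # stands in for the open input file
        self._position = 0              # index of the next record to emit
        self._suspended = False

    def suspend(self):
        # Takes effect between records: a record already returned by
        # poll() counts as fully processed.
        self._suspended = True

    def resume(self):
        self._suspended = False

    def poll(self):
        """Return the next record, or None while suspended or at end of file."""
        if self._suspended or self._position >= len(self._records):
            return None
        record = self._records[self._position]
        self._position += 1
        return record
```

Because the position is retained while suspended, the first poll() after resume() returns the record that follows the last one emitted.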