HDFS XML File Writer Output Adapter

Introduction

The TIBCO StreamBase® XML File Writer for Apache Hadoop Distributed File System (HDFS) is an embedded adapter suitable for streaming tuples to a text file formatted with XML tags.

The XML file created by this adapter does not contain a root node. Each tuple streamed by the adapter should be treated as an individual document. Input fields of type list or tuple are not supported.
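Because the output has no root node, a standard XML parser rejects the file as a whole. The following minimal sketch shows one way a downstream consumer might read it, assuming each tuple is written as a <tuple> element as described under XML Formatting, that the file carries no XML declaration, and that field values are well-formed (for example, written with Escape selected). The function name read_tuples and the sample field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_tuples(text):
    """Parse rootless adapter output into a list of {field: value} dicts."""
    # Wrap the fragments in a synthetic root so the parser sees one document.
    root = ET.fromstring("<root>" + text + "</root>")
    return [
        {field.tag: field.text or "" for field in tup}
        for tup in root.findall("tuple")
    ]


# Hypothetical sample content with two tuples and no root element.
sample = (
    "<tuple>\n    <Symbol>IBM</Symbol>\n    <Price>101.5</Price>\n</tuple>\n"
    "<tuple>\n    <Symbol>TIBX</Symbol>\n    <Price>24.0</Price>\n</tuple>\n"
)
tuples = read_tuples(sample)
```

All values come back as strings; converting them to typed values is left to the consumer.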

Each field value can optionally be parsed for XML reserved characters such as <, >, and &, which are XML-escaped if detected. Parsing for XML escaping incurs a performance penalty; leave this feature disabled (the default) if you are sure your input data contains no XML reserved characters.
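The effect of the escaping option can be illustrated with Python's standard xml.sax.saxutils.escape, which replaces &, <, and > with their entity forms. This is only a sketch of the behavior described above, not the adapter's implementation; render_value and its escape_enabled parameter are hypothetical names.

```python
from xml.sax.saxutils import escape


def render_value(value, escape_enabled=False):
    """Return the text to place inside a field element."""
    text = str(value)
    # With escaping off (the default), reserved characters pass through verbatim.
    return escape(text) if escape_enabled else text


raw = render_value("a < b & c > d")
escaped = render_value("a < b & c > d", escape_enabled=True)
```

With escaping off, the value "a < b & c > d" would produce a fragment that XML parsers reject, which is why the option matters whenever input data may contain reserved characters.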

The adapter rolls over its output file when the file reaches the size specified in the optional Max File Size parameter. When an output file is rolled over, the current date, formatted as yyyyMMddhhmmss, is appended to the current File Name. If an output file with that date suffix already exists, subsequent files receive an additional suffix of -0, -1, -2, and so on.
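The naming rule can be sketched as follows. Two assumptions are made that the text above does not confirm: the date is concatenated directly onto File Name with no separator, and the pattern follows Java date-pattern conventions, where hh is the 12-hour clock (hence %I below). The function rolled_file_name is hypothetical.

```python
from datetime import datetime


def rolled_file_name(file_name, now, existing_names):
    """Pick the name for a rolled-over file, avoiding names already taken."""
    # %I is the 12-hour clock, mirroring "hh" in a Java-style pattern (assumption).
    candidate = file_name + now.strftime("%Y%m%d%I%M%S")
    if candidate not in existing_names:
        return candidate
    # Same date suffix already used: try -0, -1, -2, and so on.
    index = 0
    while f"{candidate}-{index}" in existing_names:
        index += 1
    return f"{candidate}-{index}"


now = datetime(2024, 3, 5, 14, 7, 9)
first = rolled_file_name("out.xml", now, set())
second = rolled_file_name("out.xml", now, {first})
third = rolled_file_name("out.xml", now, {first, second})
```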

The adapter flushes writes to disk after every tuple unless Flush Interval is specified. If Flush Interval is specified, flush() is called asynchronously at that interval. Each flush and each write to the output file is synchronized to prevent partial tuples from being written to disk.
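The interaction between per-tuple flushing, interval flushing, and synchronization can be sketched with a toy writer in which one lock guards both operations. This is an illustration under stated assumptions, not the adapter's code: TupleWriter, its sink argument (an in-memory stand-in for the output file), and the background thread are all hypothetical.

```python
import io
import threading


class TupleWriter:
    """Toy writer: one lock guards both writes and flushes."""

    def __init__(self, sink, flush_interval_ms=0):
        self._sink = sink            # stands in for the output file
        self._pending = []           # whole tuples not yet flushed
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._interval = flush_interval_ms / 1000.0
        self._thread = None
        if flush_interval_ms > 0:
            self._thread = threading.Thread(target=self._flush_loop, daemon=True)
            self._thread.start()

    def _flush_locked(self):
        # Caller must hold the lock, so only whole tuples reach the sink.
        self._sink.write("".join(self._pending))
        self._pending.clear()

    def write(self, tuple_xml):
        with self._lock:
            self._pending.append(tuple_xml)
            if self._thread is None:  # no interval: flush after every tuple
                self._flush_locked()

    def _flush_loop(self):
        # wait() returns False on timeout, True once close() sets the event.
        while not self._stop.wait(self._interval):
            with self._lock:
                self._flush_locked()

    def close(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        with self._lock:
            self._flush_locked()


immediate_sink = io.StringIO()
immediate = TupleWriter(immediate_sink)
immediate.write("<tuple><a>1</a></tuple>\n")
```

Because the lock covers both the append and the flush, a flush can never observe half of a tuple, which mirrors the guarantee described above.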

Note

When using the HDFS core libraries on Windows, you may see a console message similar to Failed to locate the winutils binary in the hadoop binary path. The Apache Hadoop project acknowledges that this is an error to be fixed in a future Hadoop release. In the meantime, you can either ignore the error message as benign, or you can download, build, and install the winutils executable as discussed on numerous Internet resources.

Properties

The HDFS XML File Writer adapter is configured with the properties shown in the following sections. In certain cases, properties can also be set in the StreamBase Server configuration file.

General Tab

Name: Use this required field to specify or change the name of this instance of this component, which must be unique in the current EventFlow module. The name must contain only alphabetic characters, numbers, and underscores, and no hyphens or other special characters. The first character must be alphabetic or an underscore.

Adapter: A read-only field that shows the formal name of the adapter.

Class name: Shows the fully qualified class name that implements the functionality of this adapter. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start with application: If this field is set to Yes (default) or to a module parameter that evaluates to true, this instance of this adapter starts as part of the JVM engine that runs this EventFlow module. If this field is set to No or to a module parameter that evaluates to false, the adapter instance is loaded with the engine, but does not start until you send an sbadmin resume command, or until you start the component with StreamBase Manager.

Enable Error Output Port: Select this check box to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally enter text to briefly describe the component's purpose and function. In the EventFlow canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Adapter Properties Tab

File Name (string): The local file path. Default: none.

User (string): The user to use for HDFS file access. An empty value means use the current system user. Default: none.

Use Default Charset (optional bool): Specifies whether the default Java platform character set should be used. If this check box is cleared, a valid character set name must be specified for the Character Set property. Default: cleared.

Character Set (optional string): The character set for the XML file you are writing. Default: the setting of the system property streambase.tuple-charset, or US-ASCII if the property is not set.

Note

If you specify a non-ASCII character set such as UTF-8, remember that character mappings are not necessarily one byte to one byte in non-ASCII character sets.

Truncate File (bool): If true, truncate the file on startup. Default: false (append).

Max File Size (optional int): The maximum XML file size in bytes before rollover; 0 = no rollover. Default: 0 (unlimited).

Buffer Size (optional int): The buffer size, which must be greater than or equal to 0; 0 = no buffer. Default: 0 (no buffer).

Flush Interval (optional int): The frequency with which to force a flush, in milliseconds; 0 = always flush. Default: 0 (write immediately).

Escape (check box): If selected, characters in input tuples such as <, >, and & are XML-encoded as &lt;, &gt;, and &amp; respectively. If cleared, such characters are embedded in the resulting XML verbatim. Default: cleared.

Log Level: Controls the level of verbosity the adapter uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. Default: INFO.

HDFS Tab

Buffer Size (Bytes) (int): The size of the buffer to be used. If empty, the default is used.

Replication (short): The required block replication for the file. If empty, the server default is used. This value applies only during file creation.

Block Size (Bytes) (long): The default data block size. If empty, the server default is used. This value applies only during file creation.

Configuration Files (string): The HDFS configuration files to use when connecting to the HDFS file system. For example, use the standard file, core-defaults.xml.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

XML Formatting

The XML file emitted from this adapter takes the following form for each input tuple:

<tuple>
    <FieldName1>FieldValue1</FieldName1>
    <FieldName2>FieldValue2</FieldName2>
    <FieldName3>FieldValue3</FieldName3>
    ...
</tuple>
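
The layout above can be reproduced with a short sketch, useful for building test fixtures for downstream consumers. The four-space indentation and trailing newline are assumptions taken from the listing, not a statement of the adapter's exact whitespace; format_tuple and the sample field names are hypothetical.

```python
def format_tuple(fields):
    """Render one tuple, given as a {field name: value} dict, in the form shown."""
    lines = ["<tuple>"]
    for name, value in fields.items():
        if isinstance(value, (list, tuple, dict)):
            # Mirrors the restriction that list and tuple fields are unsupported.
            raise TypeError(f"field {name}: nested values are not supported")
        lines.append(f"    <{name}>{value}</{name}>")
    lines.append("</tuple>")
    return "\n".join(lines) + "\n"


fragment = format_tuple({"Symbol": "IBM", "Price": 101.5})
```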

Typechecking and Error Handling

Typechecking fails if the input tuple includes fields of type list or tuple, or if any of the following optional parameters is specified and does not meet its condition:

  • Max File Size must be greater than or equal to 0.

  • Flush Interval must be greater than or equal to 0.

  • Buffer Size must be greater than or equal to 0.
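
The rules in this section can be expressed as a small validation sketch. It is only an illustration of the conditions listed; the function typecheck, its parameters, and the message wording are hypothetical, and None stands for an optional parameter that was not specified.

```python
def typecheck(field_types, max_file_size=None, flush_interval=None, buffer_size=None):
    """Return a list of typecheck errors; an empty list means success."""
    errors = []
    # field_types maps each input field name to its type name.
    for name, type_name in field_types.items():
        if type_name in ("list", "tuple"):
            errors.append(f"field {name}: type {type_name} is not supported")
    optional = (
        ("Max File Size", max_file_size),
        ("Flush Interval", flush_interval),
        ("Buffer Size", buffer_size),
    )
    for label, value in optional:
        # Unspecified parameters (None) are skipped, as they are optional.
        if value is not None and value < 0:
            errors.append(f"{label} must be greater than or equal to 0")
    return errors


ok = typecheck({"Symbol": "string", "Price": "double"}, max_file_size=0)
bad = typecheck({"Legs": "list"}, flush_interval=-1)
```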

Connecting to the Amazon S3 File System

The HDFS adapters can connect to and work with the Amazon S3 file system using the s3a://{bucket}/yourfile.txt URI style.

For the HDFS adapters to access the Amazon S3 file system, you must also supply S3 authentication information through a configuration file or one of the other supported methods described in Authenticating with S3.

The HDFS S3 sample gives an example of using a configuration file with the access key and secret key supplied.
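
As a rough illustration of what such a configuration file might contain, the sketch below generates a Hadoop-style configuration document that sets fs.s3a.access.key and fs.s3a.secret.key, the property names used by the Hadoop S3A connector. Whether the HDFS S3 sample uses exactly this layout is an assumption; build_s3a_config and the placeholder credentials are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_s3a_config(access_key, secret_key):
    """Return Hadoop-style configuration XML carrying S3A credentials."""
    config = ET.Element("configuration")
    settings = (
        ("fs.s3a.access.key", access_key),
        ("fs.s3a.secret.key", secret_key),
    )
    for name, value in settings:
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(config, encoding="unicode")


# Placeholder values only; real keys should never be committed to source control.
config_xml = build_s3a_config("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
```

The resulting text could be saved to a file and listed in the Configuration Files property on the HDFS tab.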

Suspend and Resume Behavior

On suspend, all tuples are written to disk and the currently open XML file is closed.

On resume, caching is restarted. The first flush occurs at resume time + Flush Interval.