Using the Multilayer Perceptron Classification Operator

Introduction

The Spotfire Streaming Multilayer Perceptron Classification Operator is used to build multilayer perceptron neural networks.

Multilayer Perceptron Classification Properties

This section describes the properties you can set for this operator, using the various tabs of the Properties view in StreamBase Studio.

General Tab

Name: Use this required field to specify or change the name of this instance of this component. The name must be unique within the current EventFlow module. The name can contain alphanumeric characters, underscores, and escaped special characters. Special characters can be escaped as described in Identifier Naming Rules. The first character must be alphabetic or an underscore.

Operator: A read-only field that shows the formal name of the operator.

Class name: Shows the fully qualified class name that implements the functionality of this operator. If you need to reference this class name elsewhere in your application, you can right-click this field and select Copy from the context menu to place the full class name in the system clipboard.

Start options: This field provides a link to the Cluster Aware tab, where you configure the conditions under which this operator starts.

Enable Error Output Port: Select this checkbox to add an Error Port to this component. In the EventFlow canvas, the Error Port shows as a red output port, always the last port for the component. See Using Error Ports to learn about Error Ports.

Description: Optionally, enter text to briefly describe the purpose and function of the component. In the EventFlow Editor canvas, you can see the description by pressing Ctrl while the component's tooltip is displayed.

Operator Properties Tab

Property Description
Activation function for hidden layer: Specify the activation function for the hidden layer: linear, sigmoid, tanh, or rectifier.
Number of hidden neurons: Specify the number of hidden neurons. More neurons let the network build a more flexible model, but training may take longer and the model may overfit the data.
Number of epochs: Specify the number of epochs, that is, the number of passes over the training data.
Learning rate: Specify the learning rate. A value that is too small can lead to long training times. A value that is too large can cause oscillation, so that the algorithm fails to find a set of weights that adequately minimizes the loss function.
Momentum: Specify the value of the momentum term.
Weight decay: Specify a positive value to perform weight decay regularization, which helps reduce overfitting.
Update existing model: If enabled, new training data is used to update an existing model instead of building a new model.
Batch update: If enabled, training is done in batch mode; that is, the weights of the network are updated only after all of the data has propagated through the network.
Factor coding: The MLP classification operator requires categorical factors to be coded as numeric variables, and offers either one-hot or reference coding. One-hot coding creates k new columns in the design matrix, one for each of the k levels of the categorical factor. For a given observation, the value in the ith column, where 1 <= i <= k, is 1 if the factor equals the ith level and 0 otherwise. Reference coding is the same as one-hot coding except that the column for the last level is dropped. The last level is determined by sorting the values of the categorical variable in ascending order (numerically if the column is an integral type, alphabetically if it is a string type). An observation that takes the kth level is therefore coded as the zero vector of length k-1.
Log Level: Controls the level of verbosity the operator uses to send notifications to the console. This setting can be higher than the containing application's log level. If set lower, the system log level is used. Available values, in increasing order of verbosity, are: OFF, ERROR, WARN, INFO, DEBUG, TRACE.
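The two schemes described under Factor coding can be illustrated with a small Python sketch. This is not the operator's implementation; the function names (level_order, one_hot, reference_coding) and the sample data are hypothetical and only mirror the description above.

```python
def level_order(values):
    """Distinct levels in ascending order (numeric for ints, alphabetical for strings)."""
    return sorted(set(values))


def one_hot(values, levels):
    """k indicator columns: column i is 1 when the value equals level i, else 0."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]


def reference_coding(values, levels):
    """One-hot coding with the column for the last level dropped (k-1 columns)."""
    return [row[:-1] for row in one_hot(values, levels)]


colors = ["red", "blue", "green", "red"]   # hypothetical categorical factor
levels = level_order(colors)               # ['blue', 'green', 'red']

print(one_hot(colors, levels))
# [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(reference_coding(colors, levels))
# [[0, 0], [1, 0], [0, 1], [0, 0]]
```

In the reference-coded output, the observations with the value "red" (the last level after sorting) become the zero vector.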

Testing Tab

Property Description
Use last k rows for testing, k=: Specify the number of rows, k, at the end of the incoming data to use as a hold-out (test) sample. The last k rows are not used to estimate the model's parameters. Predictions and model summary measures are available for the test data.
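As an illustration of the hold-out behavior, the following Python sketch splits a list of rows so that the last k rows form the test sample. The function name split_last_k and the sample rows are hypothetical; the operator performs this split internally.

```python
def split_last_k(rows, k):
    """Return (training_rows, test_rows), holding out the last k rows for testing."""
    if k < 0 or k > len(rows):
        raise ValueError("k must be between 0 and the number of rows")
    cut = len(rows) - k          # computing the cut index keeps k = 0 well-defined
    return rows[:cut], rows[cut:]


rows = [f"row{i}" for i in range(1, 11)]   # ten hypothetical incoming rows
train, test = split_last_k(rows, 3)

print(len(train), test)
# 7 ['row8', 'row9', 'row10']
```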

Unmatched Categories Tab

Property Description
How to handle unmatched categories: If Set predictions to missing data is selected, cases with categorical levels that were not observed in the training data are ignored. If Stop scoring and display error is selected and the scoring data contains a categorical level that was not observed in the training data, no predictions are created and an error is displayed.
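The two policies can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function score, the policy names "missing" and "error", and the stand-in predict callable are all hypothetical, and "ignored" is interpreted here as producing a missing (None) prediction for the affected row.

```python
def score(rows, field, training_levels, predict, policy="missing"):
    """Score rows, handling levels of `field` that were not seen in training.

    policy="missing": rows with an unseen level get a missing (None) prediction.
    policy="error":   any unseen level stops scoring before predictions are made.
    """
    unseen = sorted({r[field] for r in rows if r[field] not in training_levels})
    if unseen and policy == "error":
        raise ValueError(f"unmatched categories in {field!r}: {unseen}")
    return [predict(r) if r[field] in training_levels else None for r in rows]


training_levels = {"a", "b"}                      # levels observed in training
scoring_rows = [{"group": "a"}, {"group": "c"}]   # "c" was never observed


def predict(row):
    """Stand-in for the trained model."""
    return "yes"


print(score(scoring_rows, "group", training_levels, predict))
# ['yes', None]
```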

Field Select Tab

Property Description
Predictors: Specify the list of predictor variables. Regular expression matching is supported. Integral and string variables are assumed to be categorical variables and are coded according to the Factor coding option specified on the Operator Properties tab.
Response: Specify the single categorical response variable.
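The following Python sketch shows one way predictor selection by regular expression and the categorical/continuous distinction could work together. The schema, the patterns, and the function select_predictors are hypothetical, and matching against the full field name is an assumption; the operator's exact matching rules are not specified here.

```python
import re

# Types treated as categorical, per the description above; double is continuous.
CATEGORICAL_TYPES = {"int", "long", "string"}


def select_predictors(schema, patterns):
    """Map each field whose full name matches a pattern to its predictor kind."""
    selected = [name for name in schema
                if any(re.fullmatch(p, name) for p in patterns)]
    return {name: "categorical" if schema[name] in CATEGORICAL_TYPES else "continuous"
            for name in selected}


# Hypothetical input schema: field name -> type name.
schema = {"age": "double", "region": "string", "visits": "int", "label": "string"}

print(select_predictors(schema, ["age", "re.*", "vis.*"]))
# {'age': 'continuous', 'region': 'categorical', 'visits': 'categorical'}
```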

Code Select Tab

The Code Select tab allows you to run a statistical operator on a subset of the incoming data where the selected cases take on specific values of one or more categorical variables.

Column Description
Field: The name of the categorical variable.
Codes: Specify the codes to use in the analysis.
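Conceptually, Code Select filters the incoming rows before analysis. The Python sketch below illustrates the idea; the function select_cases and the sample data are hypothetical, and combining multiple fields with a logical AND is an assumption.

```python
def select_cases(rows, codes_by_field):
    """Keep rows whose value for every listed field is among that field's codes."""
    return [r for r in rows
            if all(r[field] in codes for field, codes in codes_by_field.items())]


rows = [
    {"region": "east", "tier": 1},
    {"region": "west", "tier": 2},
    {"region": "east", "tier": 3},
]

# Analyze only the cases in region "east" with tier 1 or 2.
print(select_cases(rows, {"region": ["east"], "tier": [1, 2]}))
# [{'region': 'east', 'tier': 1}]
```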

Cluster Aware Tab

Use the settings in this tab to enable this operator or adapter for runtime start and stop conditions in a multi-node cluster. During initial development of the fragment that contains this operator or adapter, and for maximum compatibility with releases before 10.5.0, leave the Cluster start policy control in its default setting, Start with module.

Cluster awareness is an advanced topic that requires an understanding of StreamBase Runtime architecture features, including clusters, quorums, availability zones, and partitions. See Cluster Awareness Tab Settings on the Using Cluster Awareness page for instructions on configuring this tab.

Concurrency Tab

Use the Concurrency tab to specify parallel regions for this instance of this component, or multiplicity options, or both. The Concurrency tab settings are described in Concurrency Options, and dispatch styles are described in Dispatch Styles.

Caution

Concurrency settings are not suitable for every application, and using these settings requires a thorough analysis of your application. For details, see Execution Order and Concurrency, which includes important guidelines for using the concurrency options.

Operator Ports

The operator expects that the response variable to be analyzed is of type 'int', 'long', or 'string', continuous predictors are of type 'double', and categorical predictors are of type 'int', 'long', or 'string'. The two input ports are described below.

Input Ports

Port Description
Training/Testing: Receives the training and testing data. Incoming training data is used to estimate the model parameters, whereas incoming testing data is held out to evaluate the performance of the trained model.
Scoring: Incoming scoring data is scored by the trained model.

Output Ports

Port Description
Model summary: The output tuple consists of the incoming data passed through, along with a list of the following analytic results:

  1. Misclassification rate for training data

  2. Misclassification rate for testing data

Train/Test predictions: Predicted values and predicted probabilities for both the training and testing data.
Scoring predictions: Predicted values and predicted probabilities for the scoring data.
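The misclassification rate reported on the Model summary port is the fraction of cases whose predicted class differs from the observed class. A minimal Python sketch, with a hypothetical function name and made-up labels:

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases where the predicted class differs from the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


actual = ["a", "b", "a", "b"]       # observed response values
predicted = ["a", "b", "b", "b"]    # model predictions (one error)

print(misclassification_rate(actual, predicted))
# 0.25
```

Computing this rate separately for the training rows and for the held-out test rows gives the two summary measures listed above.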