Diagnosing Service Discovery Issues

This page lists network-related issues that may be encountered with service discovery, and describes how to diagnose and resolve them.

Node Installation Failures

A node requires that service discovery functions properly at least locally on the machine where the node is installed.

Discovery Port in Use

If another program has the service discovery UDP port opened exclusively, node installation fails with a message similar to:

$ epadmin install node --nodename mynode.mycluster
[mynode.mycluster]      Installing node
[mynode.mycluster]              DEVELOPMENT executables
[mynode.mycluster]              File shared memory
[mynode.mycluster]              4 concurrent allocation segments
[mynode.mycluster]              Host name fig.local
[mynode.mycluster]              Starting node services
[mynode.mycluster]              Loading node configuration
[mynode.mycluster]              Auditing node security
Service discovery verification failed: Could not start the discovery service listener 
           on port 54321, network error: SWSocket::initServer: 
           Call to 'bind' failed: Address already in use [errno:98].

Resolution: either choose another port for service discovery using the --discoveryport option (see epadmin-node(1)), or find and terminate the program that is using the port.
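
Before reinstalling, it can help to confirm whether the discovery port is actually free. The following is a minimal Python sketch, not part of epadmin; the function name udp_port_is_free is hypothetical. It assumes that failing to bind a plain UDP socket means the port is unavailable to the node as well.

```python
import socket


def udp_port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a UDP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Raised for "Address already in use" and for "Permission denied".
            return False
    return True

# Example: udp_port_is_free(54321) checks the port from the message above.
```

A result of False for port 54321 is consistent with the bind failure shown above. The check is only a snapshot: another program may open the port after the check completes.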

Discovery Port Permission Error

If a port that the installing user is not permitted to bind (for example, a privileged port below 1024) is specified for service discovery using the --discoveryport option (see epadmin-node(1)), node installation fails with a message similar to:

$ epadmin install node --nodename mynode.mycluster --discoveryport 1
[mynode.mycluster]      Installing node
[mynode.mycluster]              DEVELOPMENT executables
[mynode.mycluster]              File shared memory
[mynode.mycluster]              4 concurrent allocation segments
[mynode.mycluster]              Host name fig.local
[mynode.mycluster]              Starting node services
[mynode.mycluster]              Loading node configuration
[mynode.mycluster]              Auditing node security
Service discovery verification failed: Could not start the discovery service listener 
           on port 1, network error: SWSocket::initServer: Call to 'bind' failed: 
           Permission denied [errno:13].

Resolution: choose another unused UDP port in the range of 1024 to 65535.
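
A script that wraps node installation could reject out-of-range ports before calling epadmin. This is a small Python sketch under that assumption; check_discovery_port is a hypothetical helper, and the range is the one given in the resolution above.

```python
def check_discovery_port(value: str) -> int:
    """Parse a port argument and confirm it lies in the range 1024 to 65535."""
    port = int(value)  # raises ValueError for non-numeric input
    if not 1024 <= port <= 65535:
        raise ValueError(f"port {port} is outside the range 1024 to 65535")
    return port
```

The check covers only the numeric range. It does not detect whether the chosen port is already in use.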

Service Name Already in Use

If the name of the node being installed (either the default or one specified using the --nodename option) is already in use by another node using the same service discovery port, node installation will fail with a message similar to:

$ epadmin install node --nodename mynode.mycluster
[mynode.mycluster]      Installing node
install of node mynode.mycluster using discovery port 54321 failed: the service name is 
           already in use by service address fig.local:35883

Resolution: either stop and remove the other node, or choose a different node name. See epadmin-node(1).

UDP Packets Being Dropped or Filtered

Service discovery uses UDP broadcast packets for making discovery requests, and socket-to-socket UDP packets for responses. If these packets are being filtered or dropped by the operating system, or by routers in between the node and epadmin, either the discovery requests will not reach the discovery server running within the node, or its responses will not reach the client.

When UDP packets are being filtered or dropped on the machine where the node is installed, epadmin install node will fail with a message similar to:

$ epadmin install node --nodename mynode.mycluster
[mynode.mycluster]      Installing node
[mynode.mycluster]              DEVELOPMENT executables
[mynode.mycluster]              File shared memory
[mynode.mycluster]              4 concurrent allocation segments
[mynode.mycluster]              Host name fig.local
[mynode.mycluster]              Starting node services
[mynode.mycluster]              Loading node configuration
[mynode.mycluster]              Auditing node security
Service discovery verification failed: Service discovery did not find any results

Resolution: ensure that UDP packets are not being filtered on the port being used by the discovery service.
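
As a first sanity check, basic UDP delivery on the local machine can be exercised independently of epadmin. The Python sketch below is an illustration only: it sends one unicast datagram over the loopback interface, so it does not use the discovery protocol, broadcast addressing, or the discovery port. The function name udp_loopback_ok is hypothetical.

```python
import socket


def udp_loopback_ok(timeout: float = 1.0) -> bool:
    """Send one datagram to a local listener and report whether it arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as listener, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.settimeout(timeout)
        sender.sendto(b"probe", listener.getsockname())
        try:
            data, _ = listener.recvfrom(64)
        except socket.timeout:
            return False  # nothing arrived within the timeout
        return data == b"probe"
```

A result of True does not rule out filtering of broadcast traffic or of the discovery port itself. A result of False suggests a more general problem with UDP on the machine.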

Service Discovery Diagnostics

During node installation, local service discovery verification is done. Failures cause the installation to fail (see Node Installation Failures).

Service discovery verification may also be done as a stand-alone epadmin command, with or without any nodes installed, using the epadmin verify services command.

Running Verify Services with Defaults

By default, the epadmin verify services command runs both a discovery server and a discovery client locally. The client makes a request, and verifies that it receives the expected response.

$ epadmin verify services
Service discovery is functioning properly locally.

Running Verification Server and Clients Separately

The verification server may be run independently of the client. Run the server in one terminal:

$ epadmin verify services --mode server
Service discovery server started. Interrupt to exit.

Note

The server does not return until interrupted.

In another terminal run a verification client:

$ epadmin verify services --mode client
Service discovery is functioning properly locally.

Note

The verification client may be run multiple times using the same verification server.

Running Verification Server and Client on Different Machines

Start the server on one machine:

$ hostname
fig.local
$ epadmin verify services --mode server
Service discovery server started. Interrupt to exit.

Run the client on another machine:

$ hostname
mulberry.local
$ epadmin verify services --mode client
Service discovery is functioning properly locally.

Global Options

The --discoveryport, --discoveryhosts, and --discoverytimeout global options are honored by the epadmin verify services command. See epadmin-globals(1).

The --debug global option also affects the verify services command and is shown in the next section.

Debug Tracing

The --debug global option enables the output of debug tracing for the service discovery verification server and client.

Start the verification server with the --debug global option:

$ epadmin --debug verify services --mode server
2018-05-23 15:02:19.159215|DSV|INFO |5214|discovery.cpp(288)|SWDiscovery::Discovery for 
      service x.y.zz.y, type test-type, address test-address, started on port 54321
Service discovery server started. Interrupt to exit.

The trace indicates that the server has successfully started listening on the default discovery port, and contains the x.y.zz.y service.
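
When comparing many trace lines, it can be convenient to split them into fields. The Python sketch below assumes the layout seen in the sample output: six fields separated by "|". The field names are guesses made for illustration, not documented names. In particular, ident is only assumed to be a process identifier. Trace lines that are wrapped in this page must be joined into a single line first.

```python
from typing import NamedTuple


class TraceLine(NamedTuple):
    timestamp: str
    component: str  # "DSV" in the samples
    level: str      # for example INFO or DEBUG
    ident: str      # assumed to be a process identifier
    source: str     # source file and line, for example discovery.cpp(288)
    message: str


def parse_trace_line(line: str) -> TraceLine:
    """Split one unwrapped debug trace line into its six fields."""
    parts = [part.strip() for part in line.split("|", 5)]
    if len(parts) != 6:
        raise ValueError("not a debug trace line: " + line)
    return TraceLine(*parts)
```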

Run the verification client with the --debug global option:

$ epadmin --debug verify services --mode client
2018-05-23 15:07:41.095600|DSV|DEBUG|6225|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:3:10.240.6.255/6225/3/0:x.y.zz.y:test-type:: 
           on 10.240.6.255:54321
2018-05-23 15:07:41.095656|DSV|DEBUG|6225|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:3:255.255.255.255/6225/4/0:x.y.zz.y:test-type:: 
           on 255.255.255.255:54321
2018-05-23 15:07:41.095818|DSV|DEBUG|6225|client.cpp(674)|Client getResults matched 
           response: PDU:A5:2:DiscoverServicesResponse:
           4:10.240.6.255/6225/3/0:x.y.zz.y:test-type:test-address: from 10.240.6.56:46180
Service discovery is functioning properly locally.

The trace shows the client sending two service discovery broadcast requests, looking for the service name (x.y.zz.y) and the service type (test-type). The first request is sent on the broadcast address for the current host name (in this case 10.240.6.255), and a second request goes out on the limited broadcast address (255.255.255.255).

2018-05-23 15:07:41.095600|DSV|DEBUG|6225|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:3:10.240.6.255/6225/3/0:x.y.zz.y:test-type:: 
           on 10.240.6.255:54321
2018-05-23 15:07:41.095656|DSV|DEBUG|6225|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:3:255.255.255.255/6225/4/0:x.y.zz.y:test-type:: 
           on 255.255.255.255:54321
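
The first address in the trace is the broadcast address of the host's subnet. As a sketch of how such an address is derived, the Python standard ipaddress module can compute it from a host address and prefix length. The /24 prefix length used in the comment is an assumption; the trace does not state the netmask of the example network.

```python
import ipaddress


def subnet_broadcast(host_ip: str, prefix_len: int) -> str:
    """Return the broadcast address of the subnet containing host_ip."""
    interface = ipaddress.ip_interface(f"{host_ip}/{prefix_len}")
    return str(interface.network.broadcast_address)

# Assuming a /24 netmask, subnet_broadcast("10.240.6.56", 24)
# yields "10.240.6.255", the address seen in the trace.
```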

The client trace then shows it having received a response from the verification server:

2018-05-23 15:07:41.095818|DSV|DEBUG|6225|client.cpp(674)|Client getResults matched 
           response: PDU:A5:2:DiscoverServicesResponse:
           4:10.240.6.255/6225/3/0:x.y.zz.y:test-type:test-address: from 10.240.6.56:46180

The verification server terminal shows the server receiving the two requests and sending responses to each of them:

2018-05-23 15:07:41.095708|DSV|DEBUG|6186|discovery.cpp(412)|Discovery test-address received: 
           PDU:A5:2:DiscoverServicesRequest:3:10.240.6.255/6225/3/0:x.y.zz.y:test-type:: 
           from 10.240.6.56:38912
2018-05-23 15:07:41.095730|DSV|DEBUG|6186|util.cpp(365)|Discovery sending response to 
           10.240.6.56:38912 : PDU:A5:2:DiscoverServicesResponse:
           4:10.240.6.255/6225/3/0:x.y.zz.y:test-type:test-address:
2018-05-23 15:07:41.095793|DSV|DEBUG|6186|discovery.cpp(412)|Discovery test-address received: 
           PDU:A5:2:DiscoverServicesRequest:3:255.255.255.255/6225/4/0:x.y.zz.y:test-type:: 
           from 10.240.6.56:47951
2018-05-23 15:07:41.095805|DSV|DEBUG|6186|util.cpp(365)|Discovery sending response to 
           10.240.6.56:47951 : PDU:A5:2:DiscoverServicesResponse:
           4:255.255.255.255/6225/4/0:x.y.zz.y:test-type:test-address:

In the example above, the client returned successfully after receiving the first matching response, because its request contained a fully qualified node name (see the Service Names section of the Spotfire Streaming Architects Guide).

When the service discovery request does not specify a fully qualified node name, the client waits for the full discovery timeout period (default: 1 second; see --discoverytimeout in epadmin-globals(1)) and discards duplicate responses. This is shown below by running a standard epadmin display services command against the still-running verification server from above.

$ epadmin --debug display services --servicetype test-type
2018-05-23 18:30:09.331217|DSV|DEBUG|13417|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:2:10.240.6.255/13417/3/0::test-type:: 
           on 10.240.6.255:54321
2018-05-23 18:30:09.331250|DSV|DEBUG|13417|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:2:255.255.255.255/13417/4/0::test-type:: 
           on 255.255.255.255:54321
2018-05-23 18:30:09.331314|DSV|DEBUG|13417|client.cpp(674)|Client getResults matched 
           response: PDU:A5:2:DiscoverServicesResponse:
           4:10.240.6.255/13417/3/0:x.y.zz.y:test-type:test-address: from 10.240.6.56:37287
2018-05-23 18:30:09.331375|DSV|DEBUG|13417|client.cpp(674)|Client getResults matched 
           response: PDU:A5:2:DiscoverServicesResponse:
           4:255.255.255.255/13417/4/0:x.y.zz.y:test-type:test-address: from 10.240.6.56:46595
2018-05-23 18:30:09.331382|DSV|DEBUG|13417|results.cpp(300)|Discarding duplicate 
           response from x.y.zz.y:test-type:test-address
Service Name = x.y.zz.y
Service Type = test-type
Network Address = dtm://test-address
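
The duplicate handling seen in this trace can be pictured as keeping the first response for each combination of service name, service type, and address. The Python sketch below illustrates that idea only. It is not the product's implementation, and the choice of key is an assumption based on the "Discarding duplicate response" trace line.

```python
from typing import Iterable, List, Tuple

# (service name, service type, address); the key choice is an assumption.
Response = Tuple[str, str, str]


def discard_duplicates(responses: Iterable[Response]) -> List[Response]:
    """Keep the first occurrence of each response, preserving arrival order."""
    seen = set()
    unique = []
    for response in responses:
        if response in seen:
            continue  # same service already reported by an earlier response
        seen.add(response)
        unique.append(response)
    return unique
```

Applied to the two matched responses above, both of which describe x.y.zz.y:test-type:test-address, this leaves the single service that the command displays.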

Debug Tracing of No Results

In cases where UDP packet filtering is suspected, debug tracing can be used to determine if the discovery server is receiving the requests and responding, and if the client is receiving the responses.

Start a verification discovery server, with debug tracing enabled:

$ epadmin --debug verify services --mode server
2018-05-24 09:45:52.884984|DSV|INFO |8504|discovery.cpp(288)|SWDiscovery::Discovery 
           for service x.y.zz.y, type test-type, address test-address, started on port 54321
Service discovery server started. Interrupt to exit.

In another terminal on the same machine (or on another machine, if debugging cross-machine service discovery), run a verification discovery client with debug tracing enabled. These traces show the client successfully sending two discovery requests but not receiving any responses:

$ epadmin --debug verify services --mode client
2018-05-24 09:55:01.533096|DSV|DEBUG|8521|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:3:10.240.6.255/8521/3/0:x.y.zz.y:test-type:: 
           on 10.240.6.255:54321
2018-05-24 09:55:01.533141|DSV|DEBUG|8521|client.cpp(351)|Client sending: 
           PDU:A5:2:DiscoverServicesRequest:3:255.255.255.255/8521/4/0:x.y.zz.y:test-type:: 
           on 255.255.255.255:54321
Service discovery verification failed: Service discovery did not find any results

Nothing was output in the discovery server terminal, showing that it did not receive either of the requests.
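
The comparison of client and server traces described in this section can be written as a small decision procedure. The Python sketch below is illustrative: locate_udp_drop is a hypothetical helper, and it relies only on the marker phrases that appear in the sample traces on this page ("Client sending", "getResults matched", "received:", and "sending response").

```python
def locate_udp_drop(client_trace: str, server_trace: str) -> str:
    """Suggest where discovery packets are lost, given both debug traces."""
    sent = "Client sending" in client_trace
    matched = "getResults matched" in client_trace
    received = "received:" in server_trace
    responded = "sending response" in server_trace

    if matched:
        return "client received a response; discovery is working"
    if not sent:
        return "client did not send any requests"
    if not received:
        return "requests are being lost before they reach the server"
    if not responded:
        return "server received requests but did not respond"
    return "responses are being lost before they reach the client"
```

For the traces in this section, the client trace contains "Client sending" and the server trace contains no "received:" line, so the sketch reports that requests are being lost before they reach the server.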