Health Check

Evaluating the health of an application means evaluating the health of various features, and aggregating these local evaluations into a process health evaluation. The Health Check feature checks for things such as:

the engine server CPU and memory consumption are within reasonable ranges
there is a running connection to the Data Server
there is a running connection to the database

Should one of these critical resources be down, the server can be considered to be in a critical state.

This extensible and configurable framework supplies an HTTP REST API, which allows Kubernetes sensors as well as external monitoring tools to request the health of a Calypso server.

The REST API is available on all Calypso servers.

 

1. Health Check Layout

The Health Check framework considers each server as a collection of sub-systems (such as System, Calypso, or the Engines that are configured to run on the server). Each sub-system maintains a collection of health check sensors, such as memory usage or unconsumed events.

Upon reception of a health request, the framework cascades the request down to each live sub-system and then down to each sensor handled by the sub-system.

Each sensor evaluates its health and returns its health status through an object carrying a state (Green, Amber or Red) and some additional details.

Each sub-system aggregates its sensors' health and computes its state accordingly.  The framework aggregates sub-systems' health and computes the overall health state.
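The cascade described above can be pictured as a small tree walk. The following is a minimal sketch, not the framework's implementation: the class names `Sensor` and `SubSystem`, the `check` methods, and the sample values are hypothetical, and state aggregation is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sensor:
    """Hypothetical sensor: a name plus a callable returning its health details."""
    name: str
    probe: Callable[[], Dict[str, object]]

    def check(self) -> Dict[str, object]:
        return self.probe()


@dataclass
class SubSystem:
    """Hypothetical sub-system holding a collection of sensors."""
    name: str
    sensors: List[Sensor] = field(default_factory=list)

    def check(self) -> Dict[str, object]:
        # Cascade the health request down to every sensor of this sub-system.
        return {sensor.name: sensor.check() for sensor in self.sensors}


def check_server(sub_systems: List[SubSystem]) -> Dict[str, object]:
    # Cascade the health request down to every sub-system of the server.
    return {sub.name: sub.check() for sub in sub_systems}


system = SubSystem("System", [Sensor("memory-usage", lambda: {"used": "57 %"})])
engine = SubSystem("TaskEngine", [
    Sensor("running", lambda: {"value": True}),
    Sensor("unconsumed-events", lambda: {"count": 0}),
])
report = check_server([system, engine])
```

The nested dictionary in `report` mirrors the request path: server, then sub-system, then sensor.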

The table below gives examples of sub-systems and related sensors:

Sub-System Samples

Available Sensors

The System sub-system could have:

a sensor checking for memory usage
a sensor checking the GC activity
a sensor monitoring CPU usage

 

Presently, the System only supplies the memory usage sensor.

Engine sub-systems have:

a sensor checking that the engine is up and running
a sensor monitoring the number of unconsumed events to be processed.

 

1.1 Sensor Health Check

Each sensor evaluates its health by computing a metric value that it compares with an amber and/or a red threshold. At least one of the two thresholds must not be null.

If the metric value hits the red threshold, the sensor returns a health object whose state is red.
If the metric value hits the amber threshold, the sensor returns a health object whose state is amber.
Otherwise, the sensor returns a health object whose state is green.
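The three rules above can be sketched as a single function. This is an illustration under stated assumptions rather than the framework's code: the names `State` and `evaluate_sensor` are hypothetical, "hits" is read as "greater than or equal to", and the metric is assumed to get worse as it grows (as memory usage does).

```python
from enum import Enum
from typing import Optional


class State(Enum):
    GREEN = "GREEN"
    AMBER = "AMBER"
    RED = "RED"


def evaluate_sensor(value: float,
                    amber: Optional[float] = None,
                    red: Optional[float] = None) -> State:
    """Compare a metric with its thresholds (assumes higher values are worse)."""
    if amber is None and red is None:
        raise ValueError("at least one of the two thresholds must be set")
    # The red threshold is tested first so that it wins over amber.
    if red is not None and value >= red:
        return State.RED
    if amber is not None and value >= amber:
        return State.AMBER
    return State.GREEN
```

For example, with an amber threshold of 90 and a red threshold of 99, `evaluate_sensor(57, amber=90, red=99)` yields `State.GREEN` and `evaluate_sensor(95, amber=90, red=99)` yields `State.AMBER`.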

 

1.2 Sub-System Health Check

Once the health of the child sensors is collected, the sub-system computes its health through the following algorithm:

If a critical sensor has a red state, the sub-system state is red.
If a non-critical sensor has a red state or if a critical sensor has an amber state, the sub-system state is amber.
Otherwise, the sub-system state is green.
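The algorithm above can be written as a short aggregation function. This is a minimal sketch: the function name `aggregate` and the representation of each child as a `(state, is_critical)` pair are assumptions made for illustration.

```python
from typing import Iterable, Tuple

GREEN, AMBER, RED = "GREEN", "AMBER", "RED"


def aggregate(children: Iterable[Tuple[str, bool]]) -> str:
    """Aggregate (state, is_critical) pairs into a single state."""
    children = list(children)
    # Rule 1: a critical child in red makes the parent red.
    if any(state == RED and critical for state, critical in children):
        return RED
    # Rule 2: any remaining red is non-critical; it, or a critical amber,
    # makes the parent amber.
    if any(state == RED or (state == AMBER and critical)
           for state, critical in children):
        return AMBER
    # Rule 3: everything else (including a non-critical amber) is green.
    return GREEN
```

Note that, read literally, the rules leave a sub-system green when only a non-critical sensor is amber; the sketch follows that reading.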

 

1.3 Process Health Check

The same algorithm that applies to the sub-system health check applies to the process health check:

If a critical sub-system has a red state, the process state is red.
If a non-critical sub-system has a red state or if a critical sub-system has an amber state, the process health state is amber.
Otherwise, the process health state is green.

 

1.4 Format of Health Check Response

After defining sub-systems and sensors, it is possible to detail the structure of the health check response.

This response provides an HTTP status code and a JSON HTTP response body. The JSON body provides:

a state field
a timestamp field - gives the UTC timestamp captured at the end of the query processing
a duration field - gives the number of milliseconds required to process the query
additional sections supplying indications about the various sub-systems' health.

Sub-systems that are shared by all processes are displayed in the Common section of the response, which comes first. This way, the first items of the response are the same whichever process is queried.
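As an illustration of this structure, the sketch below splits a response body into its three top-level fields and its sub-system sections. The function name `parse_envelope` is hypothetical, and the field formats (an ISO 8601 timestamp ending in `Z`, a duration written as `<integer> ms`) are assumptions taken from the sample outputs in section 3.

```python
import json
from datetime import datetime, timedelta

RESERVED = {"state", "timestamp", "duration"}


def parse_envelope(body: str):
    """Split a health response into its top-level fields and its sections."""
    doc = json.loads(body)
    # Replace the trailing "Z" so that datetime.fromisoformat() also accepts
    # the value on Python versions older than 3.11.
    timestamp = datetime.fromisoformat(doc["timestamp"].replace("Z", "+00:00"))
    # Assumes the duration is always rendered as "<integer> ms".
    amount, unit = doc["duration"].split()
    if unit != "ms":
        raise ValueError(f"unexpected duration unit: {unit}")
    duration = timedelta(milliseconds=int(amount))
    # Everything that is not a reserved field is a sub-system section.
    sections = {k: v for k, v in doc.items() if k not in RESERVED}
    return doc["state"], timestamp, duration, sections


sample = ('{"state": "GREEN", "timestamp": "2020-11-19T07:37:41.479Z", '
          '"duration": "8 ms", "Common": {}}')
state, timestamp, duration, sections = parse_envelope(sample)
```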

Engine Server Health Check

 

1.5 Authentication

The API is exposed as a REST service and all calls to the API are secured with Calypso Application Server authentication.

A provision has been made to enable or disable authentication for the REST API. Users who do not want any authentication in the REST API can set the property DISABLE_HEALTHCHECK_AUTH=true in the user environment property file.

By default, authentication is enabled.

 

2. HTTP API

To perform an HTTP health check request, you need to call a GET on the related web admin URL, followed by /api/v1/health or /actuator/health.

 

For example, on the Data Server: 

% curl --user <user>:<password> --request GET http://localhost:8100/api/v1/health

 

or, if SSL is enabled: 

% curl --insecure --user <user>:<password> --request GET https://localhost:8101/api/v1/health
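The same request can be prepared from a script. The sketch below only builds the request and does not send it; the function name `build_health_request` is hypothetical, and it assumes the server accepts HTTP Basic credentials, which is what `curl --user` sends by default.

```python
import base64
import urllib.request


def build_health_request(host: str, port: int, user: str, password: str,
                         use_ssl: bool = False) -> urllib.request.Request:
    """Build (but do not send) a GET request for the health endpoint."""
    scheme = "https" if use_ssl else "http"
    url = f"{scheme}://{host}:{port}/api/v1/health"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    # curl --user sends HTTP Basic credentials; this header is the equivalent.
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_health_request("localhost", 8100, "calypso_user", "calypso")
# To send it: urllib.request.urlopen(request, timeout=5)
```

When sending the request, keep in mind that `urllib.request.urlopen` raises `urllib.error.HTTPError` for a 500 response, so the body has to be read from the exception in that case.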

 

The application will return a JSON formatted response with the following indications:

| HTTP Code | Health Check state field | Description |
| --- | --- | --- |
| 500 | Red | Unhealthy – Severe condition, component restart is required, memory consumption > 99%. For Kubernetes, the component might become eligible for an automatic restart. |
| 200 | Amber | Warning - component is reporting a kind of failure that suggests admin take a look, but server restart not necessarily needed, memory consumption > 90%, network latency with Data Server > 1 ms (or whatever is the least required latency). |
| 200 | Green | Healthy |
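A monitoring script can use this mapping to cross-check a response. The sketch below is illustrative only: the names `EXPECTED_CODE`, `interpret`, and `probe_passes` are hypothetical. It also shows why only a Red state fails a Kubernetes HTTP probe, since such probes treat any status code from 200 to 399 as a success.

```python
import json

# Mapping taken from the table above: Red is served with HTTP 500,
# Amber and Green with HTTP 200.
EXPECTED_CODE = {"RED": 500, "AMBER": 200, "GREEN": 200}


def interpret(status_code: int, body: str) -> str:
    """Return the reported state after checking that it matches the HTTP code."""
    state = json.loads(body)["state"].upper()
    if state not in EXPECTED_CODE:
        raise ValueError(f"unknown state: {state}")
    if EXPECTED_CODE[state] != status_code:
        raise ValueError(f"state {state} does not match HTTP {status_code}")
    return state


def probe_passes(status_code: int) -> bool:
    # Kubernetes HTTP probes count any code from 200 to 399 as a success,
    # so Amber still passes and only Red (500) fails the probe.
    return 200 <= status_code < 400
```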

 

3. API Specification

All resource URLs are suffixed with /api/v1/health, such as:

http://<host>:<port>/api/v1/health

 

For example: 

Data Server health URL: http://localhost:8100/api/v1/health
Engine Server health URL: http://localhost:8140/api/v1/health

 

The following URIs are exposed:

  1. Retrieving the health status of the Engine Server

Resource

http://localhost:8140/api/v1/health

HTTP Method

GET

Input

N/A

Optional Query Parameters

N/A

Output

JSON object representing the health of the server, comprising:

  • State

  • Timestamp

  • Duration

  • System (Memory usage)

  • Calypso Database

  • Server specific checks (unconsumed events in case of Engine Server)

Description

Retrieves health of Engine Server

Sample Invocation

Engine Server health URL: 

http://localhost:8140/api/v1/health

Sample Request

curl --user calypso_user:calypso --request GET http://localhost:8140/api/v1/health

Sample Output

{
  "state": "AMBER",
  "timestamp": "2020-11-19T07:41:27.921Z",
  "duration": "243 ms",
  "Common": {
    "Dataserver": {
      "state": "GREEN",
      "ds-connection": {
        "value": true
      }
    },
    "CalypsoDatabase": {
      "state": "GREEN",
      "availability": {
        "value": true
      }
    },
    "System": {
      "state": "GREEN",
      "memory-usage": {
        "used": "57 %",
        "total": "2048 MB"
      }
    }
  },
  "EngineServer": {
    "AccountingEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "CollateralManagementEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "CreEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "DiaryEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "EcoHedgeEnrichmentEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 1
      },
      "running": {
        "value": true
      }
    },
    "FTPEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "ImportMsg_DTCC_GTR": {
      "state": "AMBER",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "state": "RED",
        "value": false
      }
    },
    "IncomingEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "InventoryEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "LifeCycleEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "LiquidationEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "MarginCallPositionEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "MarginController": {
      "state": "AMBER",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "state": "RED",
        "value": false
      }
    },
    "MatchableBuilderEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "MatchingEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "MessageEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "PositionEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "RelationshipManagerEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "SftrEngine": {
      "state": "AMBER",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "state": "RED",
        "value": false
      }
    },
    "TaskEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "TransferEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "UpdateManagerEngine": {
      "state": "GREEN",
      "unconsumed-events": {
        "count": 0
      },
      "running": {
        "value": true
      }
    },
    "EnginesMonitoring": {
      "state": "GREEN"
    }
  }
}
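With this many engines, it helps to extract only the entries that are not green. The sketch below walks a parsed response and lists them; the function name `non_green` is hypothetical, and the dictionary is an abridged copy of the sample output above with most engines left out. It relies on the observation that, in the sample, a sensor carries its own state field only when it is not green.

```python
from typing import Iterator, Tuple


def non_green(node: dict, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (dotted path, state) for every nested object whose state is not GREEN."""
    for key, value in node.items():
        if not isinstance(value, dict):
            continue
        child_path = path + (key,)
        state = value.get("state")
        if isinstance(state, str) and state != "GREEN":
            yield ".".join(child_path), state
        # Keep descending: sensors are nested inside sub-systems.
        yield from non_green(value, child_path)


# Abridged copy of the sample output above (most engines left out).
response = {
    "state": "AMBER",
    "Common": {"System": {"state": "GREEN", "memory-usage": {"used": "57 %"}}},
    "EngineServer": {
        "TaskEngine": {"state": "GREEN", "running": {"value": True}},
        "SftrEngine": {
            "state": "AMBER",
            "unconsumed-events": {"count": 0},
            "running": {"state": "RED", "value": False},
        },
    },
}
problems = list(non_green(response))
# problems == [("EngineServer.SftrEngine", "AMBER"),
#              ("EngineServer.SftrEngine.running", "RED")]
```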

 

 

 

2. Retrieving the health status of the Data Server

 

Resource

http://localhost:8100/api/v1/health

HTTP Method

GET

Input

N/A

Optional Query Parameters

N/A

Output

JSON object representing the health of the server, comprising:

  • State

  • Timestamp

  • Duration

  • System (Memory usage)

  • Calypso Database

     

Description

Retrieves health of Data Server

Sample Invocation

Data Server health URL: 

http://localhost:8100/api/v1/health

Sample Request

curl --user calypso_user:calypso --request GET http://localhost:8100/api/v1/health

Sample Output

{
  "state": "GREEN",
  "timestamp": "2020-11-19T07:37:41.479Z",
  "duration": "8 ms",
  "Common": {
    "CalypsoDatabase": {
      "state": "GREEN",
      "availability": {
        "value": true
      }
    },
    "System": {
      "state": "GREEN",
      "memory-usage": {
        "used": "39 %",
        "total": "2048 MB"
      }
    }
  }
}

 

List of other servers exposing Health Check REST API: 

| Server Name | Example URL |
| --- | --- |
| Auth Server | http://localhost:8090/api/v1/health |
| Event Server | http://localhost:8080/api/v1/health |
| Risk Server | http://localhost:8160/api/v1/health |
| In Memory Risk Server | http://localhost:8500/api/v1/health |
| CD ISDA Model Server | http://localhost:8380/api/v1/health |
| Curve Server | http://localhost:8580/api/v1/health |
| CBSL Server | http://localhost:8420/api/v1/health |
| Position Keeping Server | http://localhost:8180/api/v1/health |
| ERS Risk Server | http://localhost:8260/risk-services/api/v1/health |
| Analysis Server | http://localhost:8620/api/v1/health |
| eDealing Server | http://localhost:8480/api/v1/health |
| XVA Server | http://localhost:8360/api/v1/health |
| ERS Limits Server | http://localhost:8280/api/v1/health |
| Liquidity Server | http://localhost:8220/api/v1/health |
| Calypso Services Interface Server | http://localhost:9160/api/v1/health |
| Entitlement Server | http://localhost:8320/api/v1/health |
| BO Report Server | http://localhost:8600/api/v1/health |
| Calypso Services | http://localhost:9140/audit/api/v1/health, http://localhost:9140/core/api/v1/health |
| Calypso Services Collateral MS | http://localhost:9150/audit/api/v1/health, http://localhost:9150/core/api/v1/health |
| ERS Compliance Server | http://localhost:8300/api/v1/health |
| Regulatory Risk Server | http://localhost:8560/api/v1/health |
| Scheduler | http://localhost:9180/scheduler-service/api/v1/health |
| Calypso Messaging Server | http://localhost:9380/api/v1/health |

 

4. Environment Property

The property SERVER_STARTUP_RETRY_COUNT sets the retry count after which the server stops trying to hit the health check URL. This parameter is configurable and can be increased or decreased as required. The default value is 20.
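The effect of such a retry count can be sketched as a bounded loop. This is an illustration only, not the server's startup code: the function name `wait_until_healthy`, the `check` callable, and the pause between attempts are all hypothetical, and the sketch assumes the count is the total number of attempts.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       retry_count: int = 20,
                       delay_seconds: float = 1.0) -> bool:
    """Call check() up to retry_count times; return True on the first success."""
    for attempt in range(1, retry_count + 1):
        if check():
            return True
        if attempt < retry_count:
            time.sleep(delay_seconds)  # hypothetical pause between attempts
    # The retry count is exhausted: stop trying.
    return False
```

Here `check` stands for whatever performs the GET on the health check URL and reports whether it succeeded.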