Health Check

Evaluating the health of an application means evaluating the health of various features, and aggregating these local evaluations into a process health evaluation. The Health Check feature checks for things such as:

• the engine server CPU and memory consumption are within reasonable ranges
• there is a running connection to the Data Server
• there is a running connection to the database

Should one of these critical resources be down, the server can be considered to be in a critical state.

This extensible and configurable framework provides an HTTP REST API, which makes it possible to implement Kubernetes sensors and lets external monitoring tools request the health of a Calypso server.

The REST API is available on all Calypso servers.

1. Health Check Layout

The Health Check framework considers each server as a collection of sub-systems (such as System, Calypso, or the Engines that are configured to run on the server). Each sub-system maintains a collection of health check sensors, such as memory usage, unconsumed events, etc.

Upon reception of a health request, the framework cascades the request down to each sub-system and then down to each sensor handled by the sub-system.

Each sensor evaluates its health and returns its health status through an object carrying a state (Green, Amber or Red) and some additional details.

Each sub-system aggregates its sensors' health and computes its state accordingly. The framework aggregates sub-systems' health and computes the overall health state.

The table below gives examples of sub-systems and related sensors:

Sub-System Samples

Available Sensors

The System sub-system could have:

• a sensor checking memory usage
• a sensor checking the GC activity
• a sensor monitoring CPU usage

Presently, the System sub-system only supplies the memory usage sensor.

Engine sub-systems have:

• a sensor checking that the engine is up and running
• a sensor monitoring the number of unconsumed events to be processed

1.1 Sensor Health Check

Each sensor evaluates its health by computing a metric value that it compares with an amber and/or a red threshold. At least one of the two thresholds must not be null.

• If the metric value hits the red threshold, the sensor returns a health object whose state is red.
• If the metric value hits the amber threshold, the sensor returns a health object whose state is amber.
• Otherwise, the sensor returns a health object whose state is green.
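The threshold comparison above can be sketched as follows. This is only an illustration, not the framework's actual code: the function name, the `State` enum, and the assumptions that higher metric values are worse and that a threshold is "hit" when the value is greater than or equal to it are all hypothetical.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    GREEN = "GREEN"
    AMBER = "AMBER"
    RED = "RED"


def evaluate_sensor(value: float,
                    amber: Optional[float] = None,
                    red: Optional[float] = None) -> State:
    """Compare a metric value with the amber and/or red thresholds."""
    if amber is None and red is None:
        # At least one of the two thresholds must be set.
        raise ValueError("at least one threshold is required")
    # Assumption: higher values are worse, and a threshold is "hit"
    # when the value is greater than or equal to it.
    if red is not None and value >= red:
        return State.RED
    if amber is not None and value >= amber:
        return State.AMBER
    return State.GREEN


# Hypothetical memory-usage sensor with thresholds at 90 % and 99 %.
print(evaluate_sensor(57, amber=90, red=99))    # State.GREEN
print(evaluate_sensor(93, amber=90, red=99))    # State.AMBER
print(evaluate_sensor(99.5, amber=90, red=99))  # State.RED
```

The red threshold is tested first so that a value above both thresholds is reported as red rather than amber.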

1.2 Sub-System Health Check

Once the health of its child sensors has been collected, the sub-system computes its own health through the following algorithm:

• If a critical sensor has a red state, the sub-system state is red.
• If a non-critical sensor has a red state or if a critical sensor has an amber state, the sub-system state is amber.
• Otherwise, the sub-system state is green.
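A minimal sketch of this aggregation rule is shown below. The `ChildHealth` class and its `critical` flag are hypothetical names introduced for illustration; how a sensor is actually marked as critical is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChildHealth:
    name: str
    state: str      # "GREEN", "AMBER" or "RED"
    critical: bool  # hypothetical flag marking the child as critical


def aggregate(children: List[ChildHealth]) -> str:
    """Apply the red / amber / green rule to a list of child results."""
    # A critical child in red makes the parent red.
    if any(c.critical and c.state == "RED" for c in children):
        return "RED"
    # A non-critical red or a critical amber makes the parent amber.
    if any((c.state == "RED" and not c.critical) or
           (c.state == "AMBER" and c.critical) for c in children):
        return "AMBER"
    return "GREEN"


# Assumed example: the "running" sensor is treated as non-critical.
engine_sensors = [
    ChildHealth("unconsumed-events", "GREEN", critical=True),
    ChildHealth("running", "RED", critical=False),
]
print(aggregate(engine_sensors))  # AMBER
```

Note that, under the rule as stated, a non-critical child in amber does not degrade its parent.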

1.3 Process Health Check

The algorithm used for the sub-system health check also applies to the process health check, with sub-systems taking the place of sensors:

• If a critical sub-system has a red state, the process health state is red.
• If a non-critical sub-system has a red state or if a critical sub-system has an amber state, the process health state is amber.
• Otherwise, the process health state is green.

1.4 Format of Health Check Response

With sub-systems and sensors defined, the structure of the health check response can be detailed.

The response consists of an HTTP status code and a JSON HTTP response body. The JSON body provides:

• a state field
• a timestamp field, which gives the timestamp in UTC format, captured at the end of the query processing
• a duration field, which gives the number of milliseconds required for processing the query
• additional sections supplying indications about the health of the various sub-systems

Sub-systems that are shared by all processes are displayed in the Common section of the response, which comes first. This way, the first items of the response are the same whichever process is queried.
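As an illustration of this structure, the sketch below parses a response body and separates the three top-level fields from the sub-system sections. The body is the Data Server sample output given in the API Specification section; the `parse_health` helper is a hypothetical name, and the assumption that the duration always has the form "<number> ms" is taken from the samples only.

```python
import json
from datetime import datetime, timezone

body = (
    '{ "state": "GREEN", "timestamp": "2020-11-19T07:37:41.479Z", '
    '"duration": "8 ms", "Common": { "CalypsoDatabase": { "state": "GREEN", '
    '"availability": { "value": true } }, "System": { "state": "GREEN", '
    '"memory-usage": { "used": "39 %", "total": "2048 MB" } } } }'
)


def parse_health(raw: str):
    """Split a health response into state, timestamp, duration and sections."""
    doc = json.loads(raw)
    # The timestamp is in UTC, with a trailing "Z".
    timestamp = datetime.strptime(
        doc["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    # Assumption: the duration is formatted as "<number> ms".
    duration_ms = int(doc["duration"].split()[0])
    # Everything else is a section describing sub-systems.
    sections = {key: value for key, value in doc.items()
                if key not in ("state", "timestamp", "duration")}
    return doc["state"], timestamp, duration_ms, sections


state, timestamp, duration_ms, sections = parse_health(body)
print(state, duration_ms, list(sections["Common"]))
# GREEN 8 ['CalypsoDatabase', 'System']
```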

Engine Server Health Check

1.5 Authentication

The API is exposed as a REST service and all calls to the API are secured with Calypso Application Server authentication.

A provision has been made to enable or disable authentication for the REST API. Users who do not want any authentication on the REST API can set the property DISABLE_HEALTHCHECK_AUTH=true in the user environment property file.

By default, authentication is enabled.

2. HTTP API

To perform an HTTP health check request, send a GET request to the /api/v1/health or /actuator/health endpoint on the web admin port of the relevant server.

For example, on the Data Server:

% curl --user <user>:<password> --request GET http://localhost:8100/api/v1/health

or, if SSL is enabled:

% curl --insecure --user <user>:<password> --request GET https://localhost:8101/api/v1/health

The application will return a JSON formatted response with the following indications:

HTTP Code	Health Check state field	Description
500	Red	Unhealthy: severe condition, a component restart is required, for example memory consumption > 99%. For Kubernetes, the component might become eligible for an automatic restart.
200	Amber	Warning: the component is reporting a failure that an administrator should look at, but a server restart is not necessarily needed, for example memory consumption > 90% or network latency with the Data Server > 1 ms (or whatever the minimum required latency is).
200	Green	Healthy
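A monitoring client could combine the HTTP code and the state field along the following lines. This is a sketch using only the Python standard library; the function names are hypothetical, and it assumes HTTP Basic authentication, which is what the curl --user examples above send by default.

```python
import base64
import json
import urllib.error
import urllib.request


def interpret(status_code: int, body: str) -> str:
    """Map an HTTP code and JSON body to GREEN, AMBER or RED."""
    if status_code == 500:
        # Per the table above, 500 always means Red.
        return "RED"
    if status_code == 200:
        return json.loads(body)["state"].upper()
    raise ValueError(f"unexpected HTTP status {status_code}")


def fetch_health(url: str, user: str, password: str,
                 timeout: float = 5.0) -> str:
    """Call a health URL with Basic authentication and return the state."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret(response.status, response.read().decode())
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx and 5xx responses.
        return interpret(error.code, error.read().decode())
```

For example, `fetch_health("http://localhost:8100/api/v1/health", "<user>", "<password>")` would query the Data Server. Any code other than 200 or 500, such as an authentication failure, is reported as an error rather than as a health state.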


3. API Specification

All resource URLs are suffixed with /api/v1/health, such as:

http://<host>:<port>/api/v1/health

For example:

• Data Server health URL: http://localhost:8100/api/v1/health
• Engine Server health URL: http://localhost:8140/api/v1/health

The following URIs are exposed:

1. Retrieving the health status of the Engine Server

Resource	http://localhost:8140/api/v1/health
HTTP Method	GET
Input	N/A
Optional Query Parameters	N/A
Output	JSON object representing the health of the server, comprising: State, Timestamp, Duration, System (memory usage), Calypso Database, and server-specific checks (unconsumed events in the case of the Engine Server)
Description	Retrieves health of Engine Server
Sample Invocation	Engine Server health URL: http://localhost:8140/api/v1/health
Sample Request	curl --user calypso_user:calypso --request GET http://localhost:8140/api/v1/health
Sample Output	{ "state": "AMBER", "timestamp": "2020-11-19T07:41:27.921Z", "duration": "243 ms", "Common": { "Dataserver": { "state": "GREEN", "ds-connection": { "value": true } }, "CalypsoDatabase": { "state": "GREEN", "availability": { "value": true } }, "System": { "state": "GREEN", "memory-usage": { "used": "57 %", "total": "2048 MB" } } }, "EngineServer": { "AccountingEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "CollateralManagementEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "CreEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "DiaryEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "EcoHedgeEnrichmentEngine": { "state": "GREEN", "unconsumed-events": { "count": 1 }, "running": { "value": true } }, "FTPEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "ImportMsg_DTCC_GTR": { "state": "AMBER", "unconsumed-events": { "count": 0 }, "running": { "state": "RED", "value": false } }, "IncomingEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "InventoryEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "LifeCycleEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "LiquidationEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "MarginCallPositionEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "MarginController": { "state": "AMBER", "unconsumed-events": { "count": 0 }, "running": { "state": "RED", "value": false } }, "MatchableBuilderEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "MatchingEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "MessageEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "PositionEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "RelationshipManagerEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "SftrEngine": { "state": "AMBER", "unconsumed-events": { "count": 0 }, "running": { "state": "RED", "value": false } }, "TaskEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "TransferEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "UpdateManagerEngine": { "state": "GREEN", "unconsumed-events": { "count": 0 }, "running": { "value": true } }, "EnginesMonitoring": { "state": "GREEN" } } }

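When an Engine Server reports a state other than green, a script can walk the EngineServer section to find which engines and sensors are responsible. The sketch below does this on an excerpt of the sample output above; the helper name is hypothetical, and the observation that a sensor carries its own "state" key only when it is not green is inferred from that sample rather than from a documented contract.

```python
import json

# Excerpt of the Engine Server sample output.
excerpt = """
{
  "state": "AMBER",
  "EngineServer": {
    "AccountingEngine": {"state": "GREEN",
      "unconsumed-events": {"count": 0}, "running": {"value": true}},
    "EcoHedgeEnrichmentEngine": {"state": "GREEN",
      "unconsumed-events": {"count": 1}, "running": {"value": true}},
    "ImportMsg_DTCC_GTR": {"state": "AMBER",
      "unconsumed-events": {"count": 0},
      "running": {"state": "RED", "value": false}},
    "EnginesMonitoring": {"state": "GREEN"}
  }
}
"""


def engines_needing_attention(response: dict) -> dict:
    """Return {engine name: [non-green sensor names]} for non-green engines."""
    report = {}
    for name, engine in response.get("EngineServer", {}).items():
        if engine.get("state") == "GREEN":
            continue
        # Inferred from the sample: a sensor has a "state" key only
        # when it is amber or red.
        report[name] = [
            sensor for sensor, details in engine.items()
            if isinstance(details, dict)
            and details.get("state") in ("AMBER", "RED")
        ]
    return report


print(engines_needing_attention(json.loads(excerpt)))
# {'ImportMsg_DTCC_GTR': ['running']}
```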
2. Retrieving the health status of the Data Server

Resource	http://localhost:8100/api/v1/health
HTTP Method	GET
Input	N/A
Optional Query Parameters	N/A
Output	JSON object representing the health of the server, comprising: State, Timestamp, Duration, System (memory usage), and Calypso Database
Description	Retrieves health of Data Server
Sample Invocation	Data Server health URL: http://localhost:8100/api/v1/health
Sample Request	curl --user calypso_user:calypso --request GET http://localhost:8100/api/v1/health
Sample Output	{ "state": "GREEN", "timestamp": "2020-11-19T07:37:41.479Z", "duration": "8 ms", "Common": { "CalypsoDatabase": { "state": "GREEN", "availability": { "value": true } }, "System": { "state": "GREEN", "memory-usage": { "used": "39 %", "total": "2048 MB" } } } }

List of other servers exposing Health Check REST API:

Server Name	Example URL
Auth Server	http://localhost:8090/api/v1/health
Event Server	http://localhost:8080/api/v1/health
Risk Server	http://localhost:8160/api/v1/health
In Memory Risk Server	http://localhost:8500/api/v1/health
CD ISDA Model Server	http://localhost:8380/api/v1/health
Curve Server	http://localhost:8580/api/v1/health
CBSL Server	http://localhost:8420/api/v1/health
Position Keeping Server	http://localhost:8180/api/v1/health
ERS Risk Server	http://localhost:8260/risk-services/api/v1/health
Analysis Server	http://localhost:8620/api/v1/health
eDealing Server	http://localhost:8480/api/v1/health
XVA Server	http://localhost:8360/api/v1/health
ERS Limits Server	http://localhost:8280/api/v1/health
Liquidity Server	http://localhost:8220/api/v1/health
Calypso Services Interface Server	http://localhost:9160/api/v1/health
Entitlement Server	http://localhost:8320/api/v1/health
BO Report Server	http://localhost:8600/api/v1/health
Calypso Services	http://localhost:9140/audit/api/v1/health http://localhost:9140/core/api/v1/health
Calypso Services Collateral MS	http://localhost:9150/audit/api/v1/health http://localhost:9150/core/api/v1/health
ERS Compliance Server	http://localhost:8300/api/v1/health
Regulatory Risk Server	http://localhost:8560/api/v1/health
Scheduler	http://localhost:9180/scheduler-service/api/v1/health
Calypso Messaging Server	http://localhost:9380/api/v1/health

4. Environment Property

The property SERVER_STARTUP_RETRY_COUNT sets the number of retries after which the server stops trying to reach the health check URL. This parameter is configurable and can be increased or decreased as required. The default value is 20.
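The effect of such a retry limit can be pictured with the generic bounded-retry loop below. It is not the server's startup code: the function name, the `probe` callable, and the delay between attempts are all assumptions made for illustration; only the default of 20 comes from the property described above.

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool],
                       retry_count: int = 20,
                       delay_seconds: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to retry_count times; stop at the first success."""
    for attempt in range(1, retry_count + 1):
        if probe():
            return True
        # Hypothetical pause between attempts; skipped after the last one.
        if attempt < retry_count:
            sleep(delay_seconds)
    return False


# Simulated probe that succeeds on the third call.
calls = {"count": 0}


def fake_probe() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until_healthy(fake_probe, sleep=lambda seconds: None))  # True
```

With a probe that never succeeds, the function returns False after exactly `retry_count` attempts.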