- Feature Name: OpenTelemetry Tracing Integration
- Start Date: 2021-01-12
- RFC PR: (leave this empty)
- Fabric Component: core, sdks
- Fabric Issue: (leave this empty)
This request for comments proposes integrating Hyperledger Fabric, including its SDKs, core and chaincode components, with the OpenTelemetry project. It introduces tracing, which reports the execution of chaincode, core components and SDKs so that activities can be correlated with the chain.
The OpenTelemetry project is a Cloud Native Computing Foundation (CNCF) project that aims to standardize observability. It is working to create a standard representation for metrics, traces and logs. This proposal aims to bring traceability of code execution to Hyperledger Fabric across peers, orderers and chaincode.
Observability comes from the DevOps world and is defined in Wikipedia as “a measure of how well internal states of a system can be inferred from knowledge of its external outputs”. Observable software exposes its behavior by sending traces, which represent the path of execution of a particular request. Each trace can be represented as a set of spans linked by parent/child or sequential relationships. Observable software also exposes metrics, which represent the internal state of its components, and logs emitted during its execution.
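To make the span model concrete, here is a minimal Go sketch of a trace as a flat set of spans linked through parent IDs. The `Span` type, its fields, the helper functions and the span names in the usage are illustrative assumptions, not the OpenTelemetry data model or a Fabric API.

```go
package main

// Span is a minimal, hypothetical model of a tracing span.
type Span struct {
	TraceID  string // shared by every span in the same trace
	SpanID   string // unique within the trace
	ParentID string // empty for the root span
	Name     string
}

// Root returns the span that has no parent, and false if there is none.
func Root(spans []Span) (Span, bool) {
	for _, s := range spans {
		if s.ParentID == "" {
			return s, true
		}
	}
	return Span{}, false
}

// ChildrenOf returns, in input order, the names of the spans whose parent
// is the given span within the same trace.
func ChildrenOf(spans []Span, parent Span) []string {
	var names []string
	for _, s := range spans {
		if s.TraceID == parent.TraceID && s.ParentID == parent.SpanID {
			names = append(names, s.Name)
		}
	}
	return names
}
```

In this model the root is typically the client span created by the SDK, and each peer, orderer or chaincode span points at the span that caused it.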
In Hyperledger Fabric, we rely on the OpenTelemetry framework to report traces and correlate them across services.
In practice, this means that a request made by a client connected to a node through the Fabric SDK can carry the context of a trace. Peers and orderers propagate that trace context and create spans describing their own part of the interaction.
Blockchain operators can reconstruct a graph of the interactions between all the components involved to build a service map. This helps uncover performance trends and issues, and shortens the time it takes to investigate problems. It also offers some security capabilities, such as detecting unexpected executions or reacting quickly to performance changes. Traces also record errors when components are unavailable, which supports monitoring and alerting.
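One way to derive a service map from exported spans is to emit an edge whenever a span's parent was produced by a different service. The sketch below assumes each span carries the name of the service that emitted it (OpenTelemetry records this as the `service.name` resource attribute); the `Span` type, `ServiceEdges` and the service names in the usage are hypothetical.

```go
package main

import "sort"

// Span keeps only what a service map needs. Field names are assumptions.
type Span struct {
	SpanID   string
	ParentID string
	Service  string // e.g. client application, peer, orderer, chaincode
}

// ServiceEdges returns sorted, de-duplicated "caller -> callee" edges: a
// span whose parent was emitted by another service marks a cross-service call.
func ServiceEdges(spans []Span) []string {
	byID := make(map[string]Span, len(spans))
	for _, s := range spans {
		byID[s.SpanID] = s
	}
	seen := make(map[string]bool)
	var edges []string
	for _, s := range spans {
		parent, ok := byID[s.ParentID]
		if !ok || parent.Service == s.Service {
			continue // root span, parent not exported, or in-process child
		}
		edge := parent.Service + " -> " + s.Service
		if !seen[edge] {
			seen[edge] = true
			edges = append(edges, edge)
		}
	}
	sort.Strings(edges)
	return edges
}
```

Aggregating these edges over many traces yields the graph of component interactions; counting errored spans per edge would be a natural extension for alerting.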
Hyperledger Fabric developers can take advantage of these techniques with no code changes. Each SDK execution generates a top-level trace and reports it to an endpoint configured through the environment.
Developers may also create a trace before calling out to Fabric in their client code. The current trace information will be sent along with the message to the peer.
The trace information is passed in as an optional gRPC metadata header.
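No particular header format is specified here. OpenTelemetry's default propagation format is W3C Trace Context, so this sketch assumes its `traceparent` header: version `00`, a 32-hex-character trace ID, a 16-hex-character parent span ID and two hex characters of flags, joined by dashes. `Metadata` stands in for gRPC's `metadata.MD` (a `map[string][]string`) so the example needs only the standard library; `SpanContext`, `Inject` and `Extract` are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Metadata mirrors the shape of gRPC metadata (map[string][]string).
type Metadata map[string][]string

// traceparentKey assumes the W3C Trace Context header; gRPC keys are lowercase.
const traceparentKey = "traceparent"

// SpanContext holds the identifiers that travel with a request.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// Inject writes sc into md as a version-00 traceparent value.
func Inject(md Metadata, sc SpanContext) {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	md[traceparentKey] = []string{fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)}
}

// Extract reads the header. The bool is false when it is absent, which is
// valid because the header is optional; a malformed value yields an error.
func Extract(md Metadata) (SpanContext, bool, error) {
	vals := md[traceparentKey]
	if len(vals) == 0 {
		return SpanContext{}, false, nil
	}
	parts := strings.Split(vals[0], "-")
	if len(parts) != 4 || parts[0] != "00" || !isHex(parts[1], 32) || !isHex(parts[2], 16) || !isHex(parts[3], 2) {
		return SpanContext{}, false, errors.New("malformed traceparent")
	}
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return SpanContext{}, false, errors.New("all-zero trace or span ID")
	}
	flags, _ := strconv.ParseUint(parts[3], 16, 8) // already validated as hex
	return SpanContext{TraceID: parts[1], SpanID: parts[2], Sampled: flags&1 == 1}, true, nil
}

// isHex reports whether s has length n and only lowercase hex digits.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}
```

Treating a missing header as "no trace" rather than an error keeps the header optional, so clients without tracing keep working unchanged.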
Client SDKs automatically create a trace to capture the call to the chain.
The trace is created with the client span kind (Span.Client).
If a current trace exists, the new client span uses it as its parent.
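The parent rule for client spans can be sketched as follows: when the application already has a current span, the SDK's client span reuses its trace ID and records it as the parent; otherwise it starts a new trace with a fresh random trace ID. The `Span` type, the `SpanKind` constants and `StartClientSpan` are hypothetical stand-ins for what an OpenTelemetry SDK provides (in the OpenTelemetry Go API, for instance, the kind would be `trace.SpanKindClient`).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// SpanKind mirrors a subset of the OpenTelemetry span kinds.
type SpanKind int

const (
	SpanKindInternal SpanKind = iota
	SpanKindClient
)

// Span is a minimal, hypothetical span record.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty when the span starts a new trace
	Name     string
	Kind     SpanKind
}

// randomHex returns n random bytes encoded as 2n lowercase hex characters.
func randomHex(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // treated as unrecoverable in this sketch
	}
	return hex.EncodeToString(b)
}

// StartClientSpan starts the SDK's span around a call to the network. If the
// caller has a current span, the new span joins that trace as its child;
// otherwise it becomes the root of a new trace.
func StartClientSpan(current *Span, name string) Span {
	s := Span{SpanID: randomHex(8), Name: name, Kind: SpanKindClient}
	if current != nil {
		s.TraceID = current.TraceID
		s.ParentID = current.SpanID
	} else {
		s.TraceID = randomHex(16)
	}
	return s
}
```

The resulting trace ID and span ID are the values the SDK would then place in the outgoing message header.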
The trace and span IDs must be sent to the chain as a message header.
The SDK should use the standard OpenTelemetry environment variables to let users decide whether and how to report trace data to an endpoint of their choosing.
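As an illustration of honoring the standard variables, the sketch below resolves a simplified subset of them: `OTEL_TRACES_EXPORTER=none` turns export off, and `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` takes precedence over `OTEL_EXPORTER_OTLP_ENDPOINT`, falling back to `http://localhost:4317`, the default the OpenTelemetry specification lists for OTLP over gRPC. Real SDKs support many more variables and, for OTLP over HTTP, append a signal path to the generic endpoint; `TraceExportConfig` and the functions here are hypothetical.

```go
package main

import "os"

// TraceExportConfig is what an SDK needs to decide whether and where to export.
type TraceExportConfig struct {
	Enabled  bool
	Endpoint string
}

// defaultOTLPEndpoint assumes OTLP over gRPC.
const defaultOTLPEndpoint = "http://localhost:4317"

// ResolveTraceExport applies a simplified subset of the OpenTelemetry
// environment variables. lookup has the signature of os.LookupEnv so tests
// can substitute their own environment.
func ResolveTraceExport(lookup func(string) (string, bool)) TraceExportConfig {
	if v, ok := lookup("OTEL_TRACES_EXPORTER"); ok && v == "none" {
		return TraceExportConfig{Enabled: false}
	}
	// The signal-specific endpoint wins over the generic one.
	for _, key := range []string{"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "OTEL_EXPORTER_OTLP_ENDPOINT"} {
		if v, ok := lookup(key); ok && v != "" {
			return TraceExportConfig{Enabled: true, Endpoint: v}
		}
	}
	return TraceExportConfig{Enabled: true, Endpoint: defaultOTLPEndpoint}
}

// FromProcessEnv resolves the configuration from the real process environment.
func FromProcessEnv() TraceExportConfig {
	return ResolveTraceExport(os.LookupEnv)
}
```

Injecting the lookup function keeps the resolution logic testable without mutating the process environment.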
Peers and orderers capture and propagate trace information using an optional gRPC metadata header, if enabled.
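On peers and orderers, propagation could look like this when enabled: the node reads the incoming `traceparent` value, records its own span, and forwards the same trace ID and flags to the next hop (for example, a chaincode call) with its own span ID as the new parent. Assumptions: the W3C header format, that a node with tracing disabled neither records nor forwards trace data, and that a node does not start a new trace when the optional header is absent. `Propagate` and `Metadata` are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"strings"
)

// Metadata mirrors the shape of gRPC metadata (map[string][]string).
type Metadata map[string][]string

// newSpanID returns 8 random bytes as 16 lowercase hex characters.
func newSpanID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// Propagate returns the metadata to attach to the outgoing call and the ID of
// the span this node records. Empty results mean nothing is recorded or sent.
func Propagate(enabled bool, incoming Metadata) (Metadata, string) {
	outgoing := Metadata{}
	if !enabled {
		return outgoing, "" // tracing switched off on this node
	}
	vals := incoming["traceparent"]
	if len(vals) == 0 {
		return outgoing, "" // the header is optional
	}
	parts := strings.Split(vals[0], "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return outgoing, "" // malformed: drop rather than forward it
	}
	local := newSpanID()
	// Keep the version, trace ID and flags; this node's span becomes the parent.
	outgoing["traceparent"] = []string{strings.Join([]string{parts[0], parts[1], local, parts[3]}, "-")}
	return outgoing, local
}
```

Because the trace ID never changes, every hop's span lands in the same trace and can be stitched into the service map.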
OpenTelemetry is still relatively young, but its tracing support has reached maturity.
OpenTelemetry reports data over the OpenTelemetry Protocol (OTLP), which carries Protobuf-encoded messages over a connection that can be secured; containers and client applications send their data this way. This requires an OpenTelemetry-compatible endpoint to be available to receive the data.
This design allows full observability of Hyperledger Fabric, at a level of detail that helps developers understand the impact of their deployment topology and organize operations around a current, standard observability framework. It lets Hyperledger Fabric report meaningful data just like cloud-native applications.
Alternatively, developers can build their own designs to report data along the way, or rely on logs alone to understand how the system performed and to investigate issues.
We have brought a similar design to Hyperledger Besu, where we started adding tracing to the HTTP JSON-RPC service used by clients to communicate with the node. We also created traces in critical processes throughout the client to understand the performance of the code running there. We created a complete application that can report the state of the client and the health of its internal mechanisms.
We have also published a webinar with the OpenTelemetry project showing how Hyperledger Fabric can rely on OpenTelemetry to report its state of execution and remove the black-box feeling operators contend with.
In addition to the current integration testing scenarios, the system must be tested alongside an instance of the OpenTelemetry Collector collecting traces from the client applications. The collector can export this information to a Zipkin instance so that testers can verify the correctness of the published traces.
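To verify the published traces, a test could retrieve the spans of one trace (for example from Zipkin; the retrieval step is omitted) and check structural invariants. The checks below are an assumption about what "correct" means: a single trace ID, unique span IDs, every parent present in the trace, and exactly one root. `Span` and `CheckTrace` are hypothetical.

```go
package main

import "fmt"

// Span holds the fields needed for structural checks.
type Span struct {
	TraceID, SpanID, ParentID, Name string
}

// CheckTrace returns one message per problem found in a single trace.
func CheckTrace(spans []Span) []string {
	if len(spans) == 0 {
		return []string{"trace has no spans"}
	}
	var problems []string
	ids := make(map[string]bool, len(spans))
	for _, s := range spans {
		if s.TraceID != spans[0].TraceID {
			problems = append(problems, fmt.Sprintf("%s: trace ID %s differs from %s", s.Name, s.TraceID, spans[0].TraceID))
		}
		if ids[s.SpanID] {
			problems = append(problems, fmt.Sprintf("%s: duplicate span ID %s", s.Name, s.SpanID))
		}
		ids[s.SpanID] = true
	}
	roots := 0
	for _, s := range spans {
		switch {
		case s.ParentID == "":
			roots++
		case !ids[s.ParentID]:
			problems = append(problems, fmt.Sprintf("%s: parent %s not found", s.Name, s.ParentID))
		}
	}
	if roots != 1 {
		problems = append(problems, fmt.Sprintf("expected 1 root span, found %d", roots))
	}
	return problems
}
```

Returning every problem instead of stopping at the first makes a failing integration run easier to diagnose.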