Monitoring GraphQL Gateways with OpenTelemetry

Valentin Cocaud

The Importance of Monitoring and Tracing for GraphQL Gateways

GraphQL Gateways play a crucial role in modern API architectures, acting as the central point for request routing, schema composition, and performance optimization. As these gateways handle increasing amounts of traffic and complexity, monitoring their performance and behavior becomes essential for maintaining reliable and efficient systems.

Current State of GraphQL Gateway Monitoring

Traditionally, monitoring GraphQL Gateways has been challenging due to several factors:

  1. Distributed Nature: GraphQL Gateways often communicate with multiple subgraphs, making it difficult to trace requests across the entire system.
  2. Query Planning Complexity: The gateway’s query planning phase, which determines how to split and route queries across subgraphs, is a critical but often opaque process that can significantly impact performance.
  3. Performance Bottleneck Identification: Complex queries that span multiple subgraphs make it challenging to identify which part of the execution is causing slowness, whether it’s the gateway’s planning, a specific subgraph, or the network communication.
  4. Error Source Tracing: When errors occur, it’s often difficult to trace whether they originate from the gateway itself, a specific subgraph, or the communication between them, making debugging a complex process.

OpenTelemetry: A Game-Changer for GraphQL Monitoring

OpenTelemetry has emerged as a powerful solution for monitoring GraphQL Gateways, offering:

  1. Standardized Instrumentation: OpenTelemetry provides consistent APIs for collecting metrics, traces, and logs across different programming languages and frameworks.
  2. Rich Context: The ability to propagate context across service boundaries helps track requests through the entire system.
  3. Cross Platform and Languages: OpenTelemetry is available for most languages and platforms, allowing you to extend the boundaries of your observable system.
  4. Vendor Agnostic: OpenTelemetry’s vendor-agnostic approach allows you to choose your preferred observability backend.
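To make the idea of propagating context across service boundaries more concrete, here is a minimal sketch of the W3C `traceparent` header, which is the format commonly used to carry trace context between services. The `TraceParent` type and the two helper functions are illustrative names made up for this example; in a real setup, the propagators shipped with OpenTelemetry do this work for you.

```typescript
// Conceptual sketch of W3C Trace Context propagation. The helper names are
// illustrative; real propagators ship with the OpenTelemetry SDKs.
interface TraceParent {
  traceId: string // 32 lowercase hex characters
  spanId: string // 16 lowercase hex characters
  sampled: boolean // lowest bit of the trace-flags byte
}

// Serialize the context that a gateway would attach to a subgraph request.
function formatTraceParent(ctx: TraceParent): string {
  const flags = ctx.sampled ? '01' : '00'
  return '00-' + ctx.traceId + '-' + ctx.spanId + '-' + flags
}

// Parse an incoming header; returns undefined when the header is malformed.
function parseTraceParent(header: string): TraceParent | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim())
  if (!match) return undefined
  const traceId = match[1]
  const spanId = match[2]
  // All-zero ids are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined
  return { traceId, spanId, sampled: (parseInt(match[3], 16) & 1) === 1 }
}

// Example ids, made up for illustration.
const exampleHeader = formatTraceParent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true
})
```

A subgraph that receives this header can continue the same trace, which is what makes a request traceable across the whole system.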

Hive Gateway’s New OpenTelemetry Integration

While OpenTelemetry is becoming a widely used solution, we listened to feedback from many of our production users to build the best GraphQL integration possible.

Hive Gateway’s OpenTelemetry integration has been entirely rewritten to make better use of the standard SDK features and to allow for better interoperability with other tools (whether custom ones or third-party libraries). We’re excited to announce significant improvements to Hive Gateway’s OpenTelemetry integration.

New cross-runtime simplified configuration helper

You can now enable OpenTelemetry support with just one CLI option! You can easily test our new integration by providing the --opentelemetry option of our hive-gateway CLI. You can either provide a custom endpoint or rely on the standard default OTLP over HTTP exporter.

For more control over the OpenTelemetry setup, we also now provide a cross-runtime openTelemetrySetup function from @graphql-mesh/plugin-opentelemetry/setup, and we support the standard NodeSDK from @opentelemetry/sdk-node for a Node.js-specific setup.

hive-gateway supergraph --opentelemetry

Sampling

Along with this new way of configuring OpenTelemetry, we now offer a sampling strategy out of the box:

telemetry.ts
import { openTelemetrySetup } from '@graphql-mesh/plugin-opentelemetry/setup'
import { AsyncLocalStorageContextManager } from '@opentelemetry/context-async-hooks'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
 
openTelemetrySetup({
  contextManager: new AsyncLocalStorageContextManager(),
  traces: {
    exporter: new OTLPTraceExporter({ url: process.env['OTLP_EXPORTER_URL'] })
  },
  samplingRate: 0.1 // Only 10% of requests will be traced (or if enabled via context propagation)
})
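To give an intuition of what such a sampling rate means, here is a minimal sketch of a parent-based ratio decision. It is not the SDK's implementation: `shouldSample` and `SamplingInput` are hypothetical names, the way the trace id is turned into a number is a simplification, and the handling of an explicitly unsampled parent is an assumption.

```typescript
// Conceptual sketch of a parent-based ratio sampler. This is NOT the SDK's
// implementation; `shouldSample` and its inputs are illustrative names.
interface SamplingInput {
  traceId: string // 32 hex characters
  parentSampled?: boolean // sampled flag propagated by the caller, if any
}

function shouldSample(input: SamplingInput, samplingRate: number): boolean {
  // A decision made upstream wins, so a trace is either complete or absent.
  // (Assumption: an explicitly unsampled parent disables tracing.)
  if (input.parentSampled !== undefined) return input.parentSampled
  // Otherwise derive a deterministic number in [0, 1) from the trace id, so
  // every service that sees this trace id reaches the same decision.
  const bucket = parseInt(input.traceId.slice(-8), 16) / 0x100000000
  return bucket < samplingRate
}
```

Deriving the decision from the trace id rather than from a random number keeps it stable for a given trace, which avoids traces with missing spans.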

Limits

To improve the safety of your observability infrastructure, OpenTelemetry’s Limits can now be configured:

telemetry.ts
import { openTelemetrySetup } from '@graphql-mesh/plugin-opentelemetry/setup'
import { AsyncLocalStorageContextManager } from '@opentelemetry/context-async-hooks'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
 
openTelemetrySetup({
  contextManager: new AsyncLocalStorageContextManager(),
  traces: {
    exporter: new OTLPTraceExporter({ url: process.env['OTLP_EXPORTER_URL'] }),
    spanLimits: {
      // limits specific to spans, like `eventCountLimit` or `attributeCountLimit`
    }
  },
  generalLimits: {
    // global default limits, like `attributeValueLengthLimit` and `attributeCountLimit`
  }
})
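As an illustration of what attribute limits protect against, the sketch below applies a count limit and a value length limit to a set of attributes. `applyAttributeLimits` is a hypothetical helper written for this post, not an SDK function; the SDK enforces the configured limits internally, and its exact behavior may differ in details.

```typescript
// Conceptual sketch of how attribute limits bound the size of a span.
// `applyAttributeLimits` is an illustrative helper, not an SDK function.
type Attributes = { [key: string]: string | number | boolean }

interface AttributeLimits {
  attributeCountLimit: number
  attributeValueLengthLimit: number
}

function applyAttributeLimits(attributes: Attributes, limits: AttributeLimits): Attributes {
  const limited: Attributes = {}
  let count = 0
  for (const key of Object.keys(attributes)) {
    // Attributes beyond the count limit are dropped.
    if (count >= limits.attributeCountLimit) break
    const value = attributes[key]
    // String values longer than the length limit are truncated.
    limited[key] =
      typeof value === 'string' ? value.slice(0, limits.attributeValueLengthLimit) : value
    count++
  }
  return limited
}
```

With limits in place, a single request carrying an unexpectedly large value (a long query string, for example) cannot inflate the size of the exported telemetry.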

Diagnostic Logs

The OpenTelemetry diagnostic logs (the internal OpenTelemetry logging system that lets you debug your configuration) are now configured to use the Hive Gateway logger by default and follow the standard OTEL_LOG_LEVEL environment variable.

Span parenting and Critical Path Analysis

Spans are now nested within each other, making it easier to understand the execution flow. It also allows the use of Critical Path Analysis in your visualization tool (like Grafana’s Critical Path Highlighting).
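To show why nesting matters for this kind of analysis, here is a simplified sketch of how a critical path can be extracted from a tree of spans. `SpanNode` and `criticalPath` are illustrative names, the span names and timings are made up, and visualization tools use their own, more complete algorithms.

```typescript
// Conceptual sketch of critical path extraction over nested spans.
// `SpanNode` and `criticalPath` are illustrative, not part of any SDK.
interface SpanNode {
  name: string
  start: number // milliseconds since the start of the trace
  end: number
  children: SpanNode[]
}

function criticalPath(span: SpanNode): string[] {
  // Walk backwards from the end of the span: the child that finished last is
  // the one the parent was waiting for, then repeat from that child's start.
  const latestFirst = span.children.slice().sort((a, b) => b.end - a.end)
  const segments: string[][] = []
  let cursor = span.end
  for (const child of latestFirst) {
    if (child.end <= cursor) {
      segments.unshift(criticalPath(child))
      cursor = child.start
    }
  }
  let path = [span.name]
  for (const segment of segments) path = path.concat(segment)
  return path
}

// Two subgraph calls run in parallel; only the slower one is on the critical path.
const exampleTrace: SpanNode = {
  name: 'http',
  start: 0,
  end: 100,
  children: [
    { name: 'graphql.parse', start: 0, end: 5, children: [] },
    {
      name: 'graphql.execute',
      start: 10,
      end: 95,
      children: [
        { name: 'subgraph accounts', start: 10, end: 40, children: [] },
        { name: 'subgraph products', start: 10, end: 90, children: [] }
      ]
    }
  ]
}

const examplePath = criticalPath(exampleTrace)
// examplePath: ['http', 'graphql.parse', 'graphql.execute', 'subgraph products']
```

Without parent/child relationships between spans, this walk is not possible, which is why flat traces cannot be analyzed this way.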

Improved Span coverage

Thanks to a brand new Plugin API, spans are now more precise and cover each phase entirely, including plugin execution time.

You can also instrument plugins if you need to track the performance of a specific plugin.

Support of HTTP batching

Our integration now fully supports batched queries with the addition of a new graphql.operation span.

This span represents the processing and execution of a single GraphQL operation. A single HTTP span can contain multiple graphql.operation spans in the case of a batched HTTP request.

A screenshot demoing batched graphql operation tracing
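For readers less familiar with batching, here is a small sketch of what a batched request body commonly looks like: a JSON array in which each entry is one operation. The operation names and queries are made up, and `operationsOf` is a hypothetical helper that only illustrates how one HTTP request maps to several operations.

```typescript
// Sketch of a batched GraphQL-over-HTTP payload: a JSON array in which each
// entry is one operation. Operation names and queries are made up.
interface GraphQLRequest {
  operationName?: string
  query: string
  variables?: { [name: string]: unknown }
}

// A body is either a single operation or an array of operations (a batch).
function operationsOf(body: GraphQLRequest | GraphQLRequest[]): GraphQLRequest[] {
  return Array.isArray(body) ? body : [body]
}

const exampleBatch: GraphQLRequest[] = [
  { operationName: 'Me', query: 'query Me { me { id } }' },
  { operationName: 'Products', query: 'query Products { products { name } }' }
]

// One HTTP request carries the whole batch, so the trace would hold one HTTP
// span with one `graphql.operation` child span per entry.
const expectedOperationSpans = operationsOf(exampleBatch).length
```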

Support of HTTP Retry

In production, upstream subgraph requests are often retried in case of failure to improve the success rate of the gateway.

Those retry requests are now properly recorded, and the standard http.request.resend_count attribute is used to identify each retry attempt.

A screenshot demoing failed and retried upstream request with appropriate attribute
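The sketch below illustrates the idea of recording one span per upstream attempt. It is a simplified model rather than Hive Gateway's code: `AttemptSpan`, `requestWithRetry` and the span name are hypothetical, and the request is reduced to a synchronous function returning a status code to keep the example short.

```typescript
// Conceptual sketch of recording one span per upstream attempt. `AttemptSpan`
// and `requestWithRetry` are illustrative stand-ins, not Hive Gateway APIs.
interface AttemptSpan {
  name: string
  attributes: { [key: string]: string | number | boolean }
  failed: boolean
}

function requestWithRetry(send: () => number, maxRetries: number): AttemptSpan[] {
  const spans: AttemptSpan[] = []
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const span: AttemptSpan = { name: 'upstream request', attributes: {}, failed: false }
    // The first attempt is not a resend, so the attribute is only set on retries.
    if (attempt > 0) span.attributes['http.request.resend_count'] = attempt
    const statusCode = send()
    span.attributes['http.response.status_code'] = statusCode
    // In this sketch, only server errors are considered worth retrying.
    span.failed = statusCode >= 500
    spans.push(span)
    if (!span.failed) break
  }
  return spans
}
```

With one span per attempt, a slow response caused by two failed attempts followed by a success is visible as three sibling spans instead of a single long one.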

Standard OpenTelemetry Context support

The OpenTelemetry integration now fully supports the standard Context, which opens up a wide range of new customization possibilities that we will see later in this post.

Our integration also has an internal Context Manager, so that span parenting works on all runtimes, even those without an AsyncLocalStorage API implementation.

Support of standard instrumentations

Thanks to OpenTelemetry Context, standard instrumentations are now also supported and correctly nested in Hive Gateway’s spans.

This allows you to keep track of the performance of low-level Node.js internals and of third-party libraries, such as database clients.

A screenshot demoing support of standard fetch instrumentation

Easier custom spans

You can now easily register custom spans from your custom plugins! Thanks to the standard OpenTelemetry Context, your custom spans will be parented correctly.

gateway.config.ts
import './telemetry.ts'
import { defineConfig } from '@graphql-hive/gateway'
import { trace } from '@opentelemetry/api'
 
export const gatewayConfig = defineConfig({
  openTelemetry: { traces: true },
 
  genericAuth: {
    mode: 'protect-granular',
    resolveUserFn: ctx =>
      // Everything started inside the callback is parented to this custom span.
      trace.getTracer('gateway').startActiveSpan('auth.validate_user', async span => {
        try {
          const token = ctx.request.headers.get('authorization')
          const user = await fetch(`https://auth/validate?token=${token}`).then(r => r.json())
          return user
        } finally {
          // End the span even when validation throws, so it is always exported.
          span.end()
        }
      })
  }
})

A screenshot showing custom span registered at context building time

Custom Span Attributes

You can now easily add your own custom attributes to any span using the standard @opentelemetry/api module.

The most straightforward usage is to add an attribute to the current span, for example the graphql.execute span:

gateway.config.ts
import './telemetry.ts'
import { defineConfig } from '@graphql-hive/gateway'
import { trace } from '@opentelemetry/api'
 
export const gatewayConfig = defineConfig({
  openTelemetry: { traces: true },
  genericAuth: {
    //...
  },
 
  plugins: () => [
    {
      onExecute({ context }) {
        // The active span at this point is the `graphql.execute` span.
        trace.getActiveSpan()?.setAttribute('auth.user.id', context.user?.id ?? '<unauthenticated>')
      }
    }
  ]
})

You can also access one of the “root” spans (HTTP, GraphQL operation and subgraph execution) to add or modify their attributes:

gateway.config.ts
import './telemetry.ts'
import { defineConfig } from '@graphql-hive/gateway'
import { trace } from '@opentelemetry/api'
 
export const gatewayConfig = defineConfig({
  openTelemetry: { traces: true },
 
  genericAuth: {
    mode: 'protect-granular',
    async resolveUserFn(ctx) {
      const token = ctx.request.headers.get('authorization')
      const user = await fetch(`https://auth/validate?token=${token}`).then(r => r.json())
 
      const httpSpan = trace.getSpan(ctx.opentelemetry.httpContext())
      httpSpan?.setAttribute('auth.user.id', user?.id ?? '<unauthenticated>')
 
      return user
    }
  }
})

A screenshot showing an example of custom attribute

Custom Baggage

OpenTelemetry Baggage lets you attach data to the current context, making it easy to enrich future spans. You can now set Baggage using the new instrumentation Plugin API:

gateway.config.ts
import './telemetry.ts'
import { defineConfig } from '@graphql-hive/gateway'
import { context, propagation } from '@opentelemetry/api'
 
export const gatewayConfig = defineConfig({
  openTelemetry: { traces: true },
  plugins: () => [
    {
      instrumentation: {
        request({ request }, wrapped) {
          const requestId = request.headers.get('request-id')
          // Without a request id there is nothing to attach, so run the pipeline unchanged.
          if (!requestId) return wrapped()
          // Baggage entries are objects of the form `{ value: string }`.
          const baggage = propagation.createBaggage({ requestId: { value: requestId } })
          const ctxWithBaggage = propagation.setBaggage(context.active(), baggage)
 
          // The wrapped function represents the processing of the request. It allows you to wrap
          // the entire request pipeline into another function call, the OTEL context in this case.
          return context.with(ctxWithBaggage, wrapped)
        }
      }
    }
  ]
})
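When a baggage propagator is configured, baggage entries travel between services in the W3C `baggage` HTTP header. The sketch below shows the shape of that header; `serializeBaggage` and `parseBaggage` are hypothetical helpers written for illustration, covering only the basic `key=value` form, and are not part of the OpenTelemetry API.

```typescript
// Conceptual sketch of the W3C `baggage` header format. `serializeBaggage`
// and `parseBaggage` are illustrative helpers, not OpenTelemetry APIs.
type BaggageEntries = { [key: string]: string }

function serializeBaggage(entries: BaggageEntries): string {
  return Object.keys(entries)
    .map(key => key + '=' + encodeURIComponent(entries[key]))
    .join(',')
}

function parseBaggage(header: string): BaggageEntries {
  const entries: BaggageEntries = {}
  for (const member of header.split(',')) {
    // Drop optional `;property` metadata, keeping only `key=value`.
    const pair = member.split(';')[0].trim()
    const separator = pair.indexOf('=')
    if (separator <= 0) continue
    entries[pair.slice(0, separator)] = decodeURIComponent(pair.slice(separator + 1))
  }
  return entries
}
```

Because baggage is sent along with outgoing requests, it is best kept small and free of sensitive data.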

Log Correlation

The new OpenTelemetry integration leverages the new Hive Logger, which integrates with the beta Logger API of OpenTelemetry.

You can now correlate the logs related to a trace for the best debugging experience!

gateway.config.ts
import { defineConfig } from '@graphql-hive/gateway'
import { Logger } from '@graphql-hive/logger'
import {
  OpenTelemetryLogWriter,
  openTelemetrySetup
} from '@graphql-mesh/plugin-opentelemetry/setup'
import { AsyncLocalStorageContextManager } from '@opentelemetry/context-async-hooks'
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
 
openTelemetrySetup({
  contextManager: new AsyncLocalStorageContextManager(),
  traces: {
    exporter: new OTLPTraceExporter({ url: process.env['OTLP_EXPORTER_URL'] })
  }
})
 
export const gatewayConfig = defineConfig({
  openTelemetry: { traces: true },
  logging: new Logger({
    writers: [
      new OpenTelemetryLogWriter({
        exporter: new OTLPLogExporter({ url: process.env['OTLP_EXPORTER_URL'] })
      })
    ]
  })
})
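Correlation works because each exported log record carries the ids of the trace and span that were active when it was written. The sketch below illustrates the principle with made-up data; `LogRecord` and `logsForTrace` are simplified, hypothetical stand-ins for what your observability backend does when you jump from a trace to its logs.

```typescript
// Conceptual sketch of trace/log correlation. Field names mirror the ids
// carried by OpenTelemetry log records; the data below is made up.
interface LogRecord {
  message: string
  traceId?: string // set when the log was written inside an active trace
  spanId?: string
}

// Select the logs written while a given trace was active.
function logsForTrace(logs: LogRecord[], traceId: string): LogRecord[] {
  return logs.filter(log => log.traceId === traceId)
}

const exampleLogs: LogRecord[] = [
  { message: 'gateway started' },
  { message: 'user validated', traceId: 'aaaa', spanId: '0001' },
  { message: 'subgraph timeout', traceId: 'bbbb', spanId: '0002' },
  { message: 'subgraph retried', traceId: 'bbbb', spanId: '0003' }
]

const slowTraceLogs = logsForTrace(exampleLogs, 'bbbb')
```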

A unified plugin for Hive Gateway and GraphQL Yoga

Our new OpenTelemetry integration is not limited to Hive Gateway; it is also usable in GraphQL Yoga! You can use @graphql-mesh/plugin-opentelemetry to improve your GraphQL server observability and take advantage of all the new features presented in this blog post:

server.ts
import './telemetry.ts'
import { createServer } from 'node:http'
import { createYoga } from 'graphql-yoga'
import { useOpenTelemetry } from '@graphql-mesh/plugin-opentelemetry'
import { schema } from './schema'
 
const server = createServer(
  createYoga({
    schema,
    plugins: [useOpenTelemetry({ traces: true })]
  })
)
 
server.listen(4000, () => {
  console.info('Server is running on http://localhost:4000/graphql')
})

More coming soon!

Building the best GraphQL Gateway OpenTelemetry integration is not a short journey, and we are preparing more improvements for the future!

We joined the OpenTelemetry GraphQL Working Group to push the state of GraphQL observability further.

Some problems specific to GraphQL are yet to be solved, and we can’t wait to share the exciting solutions we will build together:

  • Subscription traces
  • Better error reporting
  • Attribute cardinality optimization
  • Resolver level spans
  • Data access level spans (integration with DataLoader)

Conclusion

The OpenTelemetry integration we’re releasing today lays the foundation for several exciting future improvements. Thanks to OpenTelemetry’s unified context, logs and traces will be automatically correlated, making it easier to debug errors and performance issues. When a problem occurs, you’ll be able to jump from a slow query trace directly to the relevant logs, providing a complete picture of what happened during problematic requests.

Hive Gateway’s integration brings comprehensive monitoring capabilities to your GraphQL infrastructure. By providing detailed traces, metrics, and logs with unified context, it helps you identify and resolve issues faster, whether they’re in the gateway, subgraphs, or the communication between them.

To learn more about Hive Gateway’s OpenTelemetry integration or to get started with monitoring your GraphQL Gateway, visit our documentation or contact our team.
