How to debug random unhandled EPIPE error #4470

florian-besser · 2024-08-28T10:47:30Z

Node.js Version

v20.17.0

NPM Version

10.8.1

Operating System

Darwin ip-192-168-201-130.ap-southeast-1.compute.internal 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6030 arm64

Subsystem

events, process

Description

My application randomly terminates with the following lines being last in the log:

node:events:497
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -32,
  code: 'EPIPE',
  syscall: 'write'
}

Node.js v20.17.0

I am unsure how to best debug this issue as the stack trace doesn't contain any part in our codebase.

We use NestJS, which adds another layer of complexity as the pipe which seemingly couldn't be written to might not even be something in our application code.

Due to the random nature I'm hesitant to run always with NODE_DEBUG=net, as I cannot reproduce the issue reliably.

Minimal Reproduction

Unable, codebase is proprietary and issue appears randomly

Output

node:events:497
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -32,
  code: 'EPIPE',
  syscall: 'write'
}

Node.js v20.17.0

Before You Submit

I have looked for issues that already exist before submitting this
My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask

The text was updated successfully, but these errors were encountered:

RedYetiDev · 2024-08-28T17:08:28Z

Unable, codebase is proprietary and issue appears randomly

It's hard to debug when there isn't any code provided. Maybe try asking on another platform?

gireeshpunathil · 2024-08-29T01:19:49Z

in general, installing an error handler and logging incidents for all your streams is a best practice, that way we will know which stream broke.

florian-besser · 2024-08-29T01:56:14Z

@gireeshpunathil we do add error handler to all our streams, unfortunately this seems to not trigger in the case above.

For example when we open a WebSocket we add:

ws.addEventListener('error', (event: WebSocket.ErrorEvent) => {
      logger.error(event.error, `WebSocket error encountered`);
    });

In the case above tough we see no log lines, meaning some other stream seems to have faulted.

As we don't control all code opening streams (such as e.g. code from NestJS or some other library we use) we can't just "check / instrument all streams".

Thus the ask: How do we debug NodeJS errors like this one? If I may be so bold: A system that throws errors that are not debuggable isn't a great system - as any library / developer error has the power to break the system in a way that can't be traced.

Surely there's some method to at least get a complete stacktrace as to what part failed to handle the error?

RedYetiDev · 2024-08-29T02:02:45Z

If you increase your stack trace limit, the error will provide more context.

gireeshpunathil · 2024-08-29T02:42:34Z

A system that throws errors that are not debuggable isn't a great system - as any library / developer error has the power to break the system in a way that can't be traced.

agree, but that is where NODE_DEBUG=net comes to help, but your scenario cannot use it due to randomness of the breakage. I would still recommend this method and run it in non-production, until you get a crash even if it takes a long time. there is another method (low level) of using linux strace to see which file descriptor experienced EPIPE and map it back to the socket, but its heavy-weight compared to the NODE_DEBUG and generates a lot of output.

florian-besser · 2024-08-30T08:04:24Z

I followed your suggestion and managed to get a nice snippet:

NET 20: _final: not ended, call shutdown()
NET 20: afterShutdown destroyed=false
NET 20: destroy
NET 20: close
NET 20: close handle
NET 20: emit close
NET 20: createConnection [
  [Object: null prototype] {
    protocol: undefined,
    hostname: 'jaeger-collector.elastic.svc.cluster.local',
    port: '4318',
    path: null,
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-protobuf',
      'User-Agent': 'OTel-OTLP-Exporter-JavaScript/0.41.2'
    },
    agent: Agent {
      _events: [Object: null prototype],
      _eventsCount: 2,
      _maxListeners: undefined,
      defaultPort: 80,
      protocol: 'http:',
      options: [Object: null prototype],
      requests: [Object: null prototype] {},
      sockets: [Object: null prototype],
      freeSockets: [Object: null prototype] {},
      keepAliveMsecs: 1000,
      keepAlive: true,
      maxSockets: Infinity,
      maxFreeSockets: 256,
      scheduling: 'lifo',
      maxTotalSockets: Infinity,
      totalSocketCount: 0,
      [Symbol(shapeMode)]: false,
      [Symbol(kCapture)]: false
    },
    host: 'jaeger-collector.elastic.svc.cluster.local',
    keepAlive: true,
    noDelay: true,
    servername: 'jaeger-collector.elastic.svc.cluster.local',
    _agentKey: 'jaeger-collector.elastic.svc.cluster.local:4318:',
    encoding: null,
    keepAliveInitialDelay: 1000
  },
  [Function (anonymous)],
  [Symbol(normalizedArgs)]: true
]
NET 20: pipe false null
NET 20: connect: find host jaeger-collector.elastic.svc.cluster.local
NET 20: connect: dns options { family: undefined, hints: 32 }
NET 20: connect: autodetecting
NET 20: _read - n 16384 isConnecting? true hasHandle? true
NET 20: _read wait for connection
NET 20: connect/multiple: only one address found, switching back to single connection
NET 20: connect: attempting to connect to 172.20.165.67:4318 (addressType: 4)
NET 20: afterConnect
NET 20: _read - n 16384 isConnecting? false hasHandle? true
NET 20: Socket._handle.readStart
NET 20: _read - n 16384 isConnecting? false hasHandle? true
NET 20: destroy
NET 20: close
NET 20: close handle
Uncaught Exception Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
npm notice
npm notice New patch version of npm available! 10.8.2 -> 10.8.3
npm notice Changelog: https://github.com/npm/cli/releases/tag/v10.8.3
npm notice To update run: npm install -g [email protected]
npm notice

Judging from that I'm going to debug our connection to Jaeger (we send telemetry there, and any failures of sending telemetry should not impact stability)

gireeshpunathil · 2024-08-30T08:11:21Z

this is great finding! looks like we already read 32KB data before the pipe broke.

so you are right, telemetry shouldn't break the app, so worthwhile to see i) why the remote is breaking up in the middle, ii) if the jaeger connection has an error handler.

RedYetiDev added the needs more info issues that need more info from the author label Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to debug random unhandled EPIPE error #4470

How to debug random unhandled EPIPE error #4470

florian-besser commented Aug 28, 2024

RedYetiDev commented Aug 28, 2024

gireeshpunathil commented Aug 29, 2024

florian-besser commented Aug 29, 2024

RedYetiDev commented Aug 29, 2024

gireeshpunathil commented Aug 29, 2024

florian-besser commented Aug 30, 2024

gireeshpunathil commented Aug 30, 2024

How to debug random unhandled EPIPE error #4470

How to debug random unhandled EPIPE error #4470

Comments

florian-besser commented Aug 28, 2024

Node.js Version

NPM Version

Operating System

Subsystem

Description

Minimal Reproduction

Output

Before You Submit

RedYetiDev commented Aug 28, 2024

gireeshpunathil commented Aug 29, 2024

florian-besser commented Aug 29, 2024

RedYetiDev commented Aug 29, 2024

gireeshpunathil commented Aug 29, 2024

florian-besser commented Aug 30, 2024

gireeshpunathil commented Aug 30, 2024