Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to debug random unhandled EPIPE error #4470

Open
2 tasks done
florian-besser opened this issue Aug 28, 2024 · 7 comments
Open
2 tasks done

How to debug random unhandled EPIPE error #4470

florian-besser opened this issue Aug 28, 2024 · 7 comments
Labels
needs more info issues that need more info from the author

Comments

@florian-besser
Copy link

Node.js Version

v20.17.0

NPM Version

10.8.1

Operating System

Darwin ip-192-168-201-130.ap-southeast-1.compute.internal 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6030 arm64

Subsystem

events, process

Description

My application randomly terminates with the following lines being last in the log:

node:events:497
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -32,
  code: 'EPIPE',
  syscall: 'write'
}

Node.js v20.17.0

I am unsure how to best debug this issue as the stack trace doesn't contain any part in our codebase.

We use NestJS, which adds another layer of complexity as the pipe which seemingly couldn't be written to might not even be something in our application code.

Due to the random nature I'm hesitant to run always with NODE_DEBUG=net, as I cannot reproduce the issue reliably.

Minimal Reproduction

Unable, codebase is proprietary and issue appears randomly

Output

node:events:497
      throw er; // Unhandled 'error' event
      ^

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -32,
  code: 'EPIPE',
  syscall: 'write'
}

Node.js v20.17.0

Before You Submit

  • I have looked for issues that already exist before submitting this
  • My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask
@RedYetiDev
Copy link
Member

Unable, codebase is proprietary and issue appears randomly

It's hard to debug when there isn't any code provided. Maybe try asking on another platform?

@gireeshpunathil
Copy link
Member

in general, installing an error handler and logging incidents for all your streams is a best practice, that way we will know which stream broke.

@florian-besser
Copy link
Author

@gireeshpunathil we do add error handler to all our streams, unfortunately this seems to not trigger in the case above.

For example when we open a WebSocket we add:

ws.addEventListener('error', (event: WebSocket.ErrorEvent) => {
      logger.error(event.error, `WebSocket error encountered`);
    });

In the case above tough we see no log lines, meaning some other stream seems to have faulted.

As we don't control all code opening streams (such as e.g. code from NestJS or some other library we use) we can't just "check / instrument all streams".

Thus the ask: How do we debug NodeJS errors like this one? If I may be so bold: A system that throws errors that are not debuggable isn't a great system - as any library / developer error has the power to break the system in a way that can't be traced.

Surely there's some method to at least get a complete stacktrace as to what part failed to handle the error?

@RedYetiDev
Copy link
Member

If you increase your stack trace limit, the error will provide more context.

@gireeshpunathil
Copy link
Member

A system that throws errors that are not debuggable isn't a great system - as any library / developer error has the power to break the system in a way that can't be traced.

agree, but that is where NODE_DEBUG=net comes to help, but your scenario cannot use it due to randomness of the breakage. I would still recommend this method and run it in non-production, until you get a crash even if it takes a long time. there is another method (low level) of using linux strace to see which file descriptor experienced EPIPE and map it back to the socket, but its heavy-weight compared to the NODE_DEBUG and generates a lot of output.

@RedYetiDev RedYetiDev added the needs more info issues that need more info from the author label Aug 29, 2024
@florian-besser
Copy link
Author

I followed your suggestion and managed to get a nice snippet:

NET 20: _final: not ended, call shutdown()
NET 20: afterShutdown destroyed=false
NET 20: destroy
NET 20: close
NET 20: close handle
NET 20: emit close
NET 20: createConnection [
  [Object: null prototype] {
    protocol: undefined,
    hostname: 'jaeger-collector.elastic.svc.cluster.local',
    port: '4318',
    path: null,
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-protobuf',
      'User-Agent': 'OTel-OTLP-Exporter-JavaScript/0.41.2'
    },
    agent: Agent {
      _events: [Object: null prototype],
      _eventsCount: 2,
      _maxListeners: undefined,
      defaultPort: 80,
      protocol: 'http:',
      options: [Object: null prototype],
      requests: [Object: null prototype] {},
      sockets: [Object: null prototype],
      freeSockets: [Object: null prototype] {},
      keepAliveMsecs: 1000,
      keepAlive: true,
      maxSockets: Infinity,
      maxFreeSockets: 256,
      scheduling: 'lifo',
      maxTotalSockets: Infinity,
      totalSocketCount: 0,
      [Symbol(shapeMode)]: false,
      [Symbol(kCapture)]: false
    },
    host: 'jaeger-collector.elastic.svc.cluster.local',
    keepAlive: true,
    noDelay: true,
    servername: 'jaeger-collector.elastic.svc.cluster.local',
    _agentKey: 'jaeger-collector.elastic.svc.cluster.local:4318:',
    encoding: null,
    keepAliveInitialDelay: 1000
  },
  [Function (anonymous)],
  [Symbol(normalizedArgs)]: true
]
NET 20: pipe false null
NET 20: connect: find host jaeger-collector.elastic.svc.cluster.local
NET 20: connect: dns options { family: undefined, hints: 32 }
NET 20: connect: autodetecting
NET 20: _read - n 16384 isConnecting? true hasHandle? true
NET 20: _read wait for connection
NET 20: connect/multiple: only one address found, switching back to single connection
NET 20: connect: attempting to connect to 172.20.165.67:4318 (addressType: 4)
NET 20: afterConnect
NET 20: _read - n 16384 isConnecting? false hasHandle? true
NET 20: Socket._handle.readStart
NET 20: _read - n 16384 isConnecting? false hasHandle? true
NET 20: destroy
NET 20: close
NET 20: close handle
Uncaught Exception Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:95:16)
    at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)
npm notice
npm notice New patch version of npm available! 10.8.2 -> 10.8.3
npm notice Changelog: https://github.com/npm/cli/releases/tag/v10.8.3
npm notice To update run: npm install -g [email protected]
npm notice

Judging from that I'm going to debug our connection to Jaeger (we send telemetry there, and any failures of sending telemetry should not impact stability)

@gireeshpunathil
Copy link
Member

this is great finding! looks like we already read 32KB data before the pipe broke.

so you are right, telemetry shouldn't break the app, so worthwhile to see i) why the remote is breaking up in the middle, ii) if the jaeger connection has an error handler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs more info issues that need more info from the author
Projects
None yet
Development

No branches or pull requests

4 participants
@gireeshpunathil @florian-besser @RedYetiDev and others