Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PROTON-2818: Move epoll proctor connection logic to a task thread. #427

Closed
wants to merge 2 commits into from

Conversation

cliffjansen
Copy link
Contributor

This patch moves connection logic that can possibly block from the calling thread to the first actual scheduling of the task itself. It otherwise makes no change to the existing logic. Including in the following cases:

The trimmed "if (conn->disconnected)..." bits are code simplifications where the value of "disconnected" cannot have changed.

Similarly, the schedule_if_inactive() calls are trimmed as the inactive test is necessarily false with the newly added task in the task list.

@@ -1139,6 +1141,21 @@ static pn_event_batch_t *pconnection_process(pconnection_t *pc, uint32_t events,
}
if (sched_ready) schedule_done(&pc->task);

if (pc->first_schedule) {
// Normal case: resumed logic from pn_proactor_connect2.
// But possible tie: pn_connection_wake() or pn_proactor_disconnect().
Copy link
Member

@astitcher astitcher Apr 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo? "possibility of"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo but obviously not clear. While the original logic is preserved as much as possible, the dropping of the task lock and context switch allows competitor threads that were not possible prior to this change. Either of those two calls are possible from an arbitrary thread between the setting of first_schedule and arriving at this code.

A comment which doesn't make sense on its own is obviously not helpful. I will try to rework the comments and code structure for clarity on their own.

Copy link
Member

@astitcher astitcher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general if this works and passes the tests it's fine. But I feel that we've added (yet another) state in the implicit lifecycle state machine of connections and raw_connections. It would be easier to understand the lifecycle IMO if this state machine was explicit instead of being represented by ostensibly orthogonal booleans which I suspect are not truely orthogonal.

Comment on lines 425 to 436
if (rc->first_schedule) {
// Normal case: resumed logic from pn_proactor_raw_connect.
// But possible tie: pn_raw_connection_wake()
// Defer wake check until getaddrinfo is done.
rc->first_schedule = false;
assert(!events); // No socket yet.
praw_connection_first_connect_lh(rc); // Drops and reacquires lock.
if (rc->psocket.epoll_io.fd != -1 && !pni_task_wake_pending(&rc->task)) {
unlock(&rc->task.mutex);
return NULL;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this code should be the first piece of code in the possibilities, as it is the first that should happen; currently the logi cto kick off the connect is first; but now the logic to do the lookup must be earlier in the lifecycle of the connection so for clarity it should be the first condition in the sequence (unless the semantics mean that this doesn't work for some reason).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully addressed in the reworked version.

Agreed about the lifecycle state issue. This proposed fix strives for the minimal code logic changes to move the blocking activity to a different thread. The subsequent "real fix" for the parent JIRA will necessarily introduce a new state (presumably with early cancel option compared to the current blocked-until-done). The initiating of the getaddrinfo call will also presumably be sensibly moved back to the pn_xxx_connect call to avoid a pointless thread switch and the first_call boolean will have no purpose.

@astitcher
Copy link
Member

@cliffjansen Thanks for these changes - this is definitely clearer than the original change.
I'd still love a clearer lifecycle state machine but this will do for now.

@astitcher
Copy link
Member

Merged

@astitcher astitcher closed this May 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants