Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape path prefix and restrict it to be a pathname in Speculation Rules #951

Merged

Conversation

jeremyroman
Copy link
Contributor

@jeremyroman jeremyroman commented Jan 20, 2024

On unusual deployment setups (but notably WordPress Playground), the path prefix might contain one of the URL pattern special characters, such as :.

Escape these characters in the path prefix, and since some of these also confuse the shorthand constructor string syntax, use the longhand syntax which specifies that the presence of a : here does not indicate a protocol/scheme.

Summary

Fixes #950.

Relevant technical choices

  • Prefixes are escaped in PLSR_URL_Pattern_Prefixer::get_default_contexts(), so that pattern syntax works in the particular excluded patterns.
  • All replacements are made, consistent with https://urlpattern.spec.whatwg.org/#escape-a-pattern-string
  • If the prefix contains a character that causes a component change out of path (:, ? or #, though the latter two should be percent-encoded anyway since they'd have a problem in plain URLs, too), it's enclosed in braces to ensure it's treated literally by the URL pattern parser.

Checklist

  • PR has either [Focus] or Infrastructure label.
  • PR has a [Type] label.
  • PR has a milestone or the no milestone label.

On unusual deployment setups (but notably WordPress Playground), the
path prefix might contain one of the URL pattern special characters,
such as :.

Escape these characters in the path prefix, and since some of these also
confuse the shorthand constructor string syntax, use the longhand syntax
which specifies that the presence of a : here does not indicate a
protocol/scheme.
@jeremyroman
Copy link
Contributor Author

@felixarntz Noticed this as part of the internal email thread talking about this plugin. It's been at least a decade since I wrote PHP, but this seems to work for me locally and I think I managed to match WordPress coding style. PTAL?

@mukeshpanchal27 mukeshpanchal27 added [Type] Bug An existing feature is broken [Focus] JS & CSS no milestone PRs that do not have a defined milestone for release [Plugin] Speculative Loading Issues for the Speculative Loading plugin (formerly Speculation Rules) labels Jan 22, 2024
Copy link
Member

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jeremyroman. PHP code wise this looks good indeed, though I have a few minor suggestions and questions to better understand the changes.

Comment on lines 43 to 50
$href_exclude_paths = array_unique(
array_map(
array( $prefixer, 'prefix_path_pattern' ),
array_merge(
$base_href_exclude_paths,
array_merge(
$base_href_exclude_paths,
array_map(
array( $prefixer, 'prefix_path_pattern' ),
$href_exclude_paths
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify the rationale for this change? I'm not sure I understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, this is simply because the previous code applies $prefixer->prefix_path_pattern to the base exclusions twice -- once at lines 22 and 23, and again here. This kinda works when the home_url is a prefix of site_url, because of the str_starts_with check within, though even that would fall apart in some funny cases, like where the user installed WP in a directory itself named /wp-admin/ or something like that.

However because these paths are also sent by default to the plsr_speculation_rules_href_exclude_paths filter, this diff isn't really sufficient to resolve that issue, and regardless this is mostly orthogonal to the goal of this PR so I'll revert it and let that question be potentially resolved elsewhere in the future.

Comment on lines -89 to +90
'home' => trailingslashit( wp_parse_url( home_url( '/' ), PHP_URL_PATH ) ),
'site' => trailingslashit( wp_parse_url( site_url( '/' ), PHP_URL_PATH ) ),
'home' => self::escape_pattern_string( trailingslashit( wp_parse_url( home_url( '/' ), PHP_URL_PATH ) ) ),
'site' => self::escape_pattern_string( trailingslashit( wp_parse_url( site_url( '/' ), PHP_URL_PATH ) ) ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This escapes the "path prefixes", which makes sense. But wouldn't we also want to escape any other parts of the path string in the same way?

Any string that a 3P developer may provide via the plsr_speculation_rules_href_exclude_paths filter below could also contain one of those characters that would need to be escaped as far as I can tell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, such strings may in fact be patterns and not literal paths, so we probably do want the special characters to have their special meanings there.

For example, the readme.txt example is '/cart/*', and in this case the developer probably did mean "all paths within /cart/", not "the literal path /cart/*".

However, if WordPress is serving from a directory named /* (however odd a choice that may be), then we need to escape the asterisk because it's not intended as a wildcard.

Comment on lines 104 to 115
$replacements = array(
'+' => '\\+',
'*' => '\\*',
'?' => '\\?',
':' => '\\:',
'{' => '\\{',
'}' => '\\}',
'(' => '\\(',
')' => '\\)',
'\\' => '\\\\',
);
return strtr( $str, $replacements );
Copy link
Member

@felixarntz felixarntz Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether there's a native PHP function that does this 🤔

But if not, I think we could simplify this to be a regular expression, since the replacement pattern is always the same (prepend with a single backslash, i.e. \\):

Suggested change
$replacements = array(
'+' => '\\+',
'*' => '\\*',
'?' => '\\?',
':' => '\\:',
'{' => '\\{',
'}' => '\\}',
'(' => '\\(',
')' => '\\)',
'\\' => '\\\\',
);
return strtr( $str, $replacements );
// Escape any of these individual characters by prefixing with a single backslash.
return preg_replace( '/[+*?:{}()\\\\]{1}/', '\\\\$0', $str );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I could find. addslashes and addcslashes both do it for different sets of special characters, but strtr seemed to be the most common approach here.

I can use preg_replace (the {1} is implied I think but otherwise I'd come up with the same line of code) if you'd prefer.

I'd avoided it because the PHP manual and the results of my Stack Overflow searches both advised not using it where something weaker than regular expressions would suffice, for performance reasons (presumably compiling the pattern has a cost).

I think both are similarly readable but happy to defer to your expertise if preg_replace is preferred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addcslashes() function allows you to specify the characters to escape, too. So I think this is better:

return addcslashes( $str, '+*?:{}()\\' );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! When I first read the manual page I misunderstood the section about how characters that have special C meaning, like \n and \r, and thought it only applied to characters with meaningful C escape sequences. But that's not the case, so addcslashes is unambiguously the best solution.

Comment on lines 75 to 85
'href_matches' => array( 'pathname' => $prefixer->prefix_path_pattern( '/*' ) ),
),
// Except for WP login and admin URLs.
array(
'not' => array(
'href_matches' => $href_exclude_paths,
'href_matches' => array_map(
static function ( string $path ): array {
return array( 'pathname' => $path );
},
$href_exclude_paths
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you're saying in the PR description, this is potentially controversial. I'd like to understand whether this is truly needed? Wouldn't the escaping alone solve the problem (potentially being expanded to the full path string, rather than just the "path prefix")?

Maybe limiting this to paths makes sense, but at the moment this would prevent limiting by query parameters for instance, which is relevant for some WordPress sites (most importantly for the small subset of sites that doesn't use "pretty permalinks").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. Unfortunately there are two different kinds of escaping needed here (at least, unless we fix the URL pattern parsing logic to be a little smarter).

The backslash escaping gets us out of the wrong tokenizer behavior (treating : as the start of a name token), but we still end up with the higher level parser seeing the : and thinking "hey! this must be an absolute URL and the stuff to the left of this is a protocol". So it still ends up thinking we have a scheme of "/scope" when we pass it "/scope:0.123/". The URL parser is...smarter about this.

I was trying to force it to be a pathname in this way, but upon a second look I can work around it in a different way, by using a group to force it to not consider this a component boundary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatwg/urlpattern#210 has some more context if you're curious, but even if we do make this syntax legal it won't be accepted by Chromium for at least two milestones, so still worth this workaround.

* @param string $str String to be escaped.
* @return string String with backslashes added where required.
*/
public static function escape_pattern_string( string $str ): string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this method is only used within this class. If it remains like that, it should probably become private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Maybe it would be useful to some other plugin that did want to do this escaping too, but that's not immediately foreseen.

Co-authored-by: Weston Ruter <[email protected]>
@jeremyroman
Copy link
Contributor Author

Anything else I need to change for this to be mergeable?

// one) affects the meaning of the * wildcard, so is left outside the braces.
$context_path = $this->contexts[ $context ];
$escaped_context_path = $context_path;
if ( strcspn( $context_path, ':?#' ) !== strlen( $context_path ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever. I would have done str_contains( $context_path, ':' ) || str_contains( $context_path, '?' ) || str_contains( $context_path, '#' ) but clearly what you've done is faster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather:

Suggested change
if ( strcspn( $context_path, ':?#' ) !== strlen( $context_path ) ) {
if ( preg_match( '/[:?#]/', $context_path ) ) {

This would be the more common way to do it (and perhaps more readable?) There are 31 uses of strcspn() in core, whereas there are 635 of preg_replace().

I don't feel strongly either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably being a C++ developer by day, but my mild preference is for the most efficient option if the options are similar in readability (though obviously neither is going to have a massive effect on your WordPress performance).

Happy to change if you felt more strongly; it seems not, so leaving this as-is for now.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit and a comment.

Copy link
Member

@adamsilverstein adamsilverstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@westonruter westonruter merged commit 9d6cab4 into WordPress:feature/speculationrules Jan 29, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no milestone PRs that do not have a defined milestone for release [Plugin] Speculative Loading Issues for the Speculative Loading plugin (formerly Speculation Rules) [Type] Bug An existing feature is broken
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants