Escape path prefix and restrict it to be a pathname in Speculation Rules #951

jeremyroman · 2024-01-20T01:09:01Z

On unusual deployment setups (but notably WordPress Playground), the path prefix might contain one of the URL pattern special characters, such as :.

Escape these characters in the path prefix, and since some of these also confuse the shorthand constructor string syntax, use the longhand syntax which specifies that the presence of a : here does not indicate a protocol/scheme.

Summary

Fixes #950.

Relevant technical choices

Prefixes are escaped in PLSR_URL_Pattern_Prefixer::get_default_contexts(), so that pattern syntax works in the particular excluded patterns.
All replacements are made, consistent with https://urlpattern.spec.whatwg.org/#escape-a-pattern-string
If the prefix contains a character that causes a component change out of path (:, ? or #, though the latter two should be percent-encoded anyway since they'd have a problem in plain URLs, too), it's enclosed in braces to ensure it's treated literally by the URL pattern parser.

Checklist

PR has either [Focus] or Infrastructure label.
PR has a [Type] label.
PR has a milestone or the no milestone label.

On unusual deployment setups (but notably WordPress Playground), the path prefix might contain one of the URL pattern special characters, such as :. Escape these characters in the path prefix, and since some of these also confuse the shorthand constructor string syntax, use the longhand syntax which specifies that the presence of a : here does not indicate a protocol/scheme.

jeremyroman · 2024-01-20T01:11:26Z

@felixarntz Noticed this as part of the internal email thread talking about this plugin. It's been at least a decade since I wrote PHP, but this seems to work for me locally and I think I managed to match WordPress coding style. PTAL?

felixarntz

Thanks @jeremyroman. PHP code wise this looks good indeed, though I have a few minor suggestions and questions to better understand the changes.

felixarntz · 2024-01-22T17:28:55Z

modules/js-and-css/speculation-rules/helper.php

 	$href_exclude_paths = array_unique(
-		array_map(
-			array( $prefixer, 'prefix_path_pattern' ),
-			array_merge(
-				$base_href_exclude_paths,
+		array_merge(
+			$base_href_exclude_paths,
+			array_map(
+				array( $prefixer, 'prefix_path_pattern' ),
 				$href_exclude_paths
 			)
 		)


Can you clarify the rationale for this change? I'm not sure I understand.

Ah, sorry, this is simply because the previous code applies $prefixer->prefix_path_pattern to the base exclusions twice -- once at lines 22 and 23, and again here. This kinda works when the home_url is a prefix of site_url, because of the str_starts_with check within, though even that would fall apart in some funny cases, like where the user installed WP in a directory itself named /wp-admin/ or something like that.

However because these paths are also sent by default to the plsr_speculation_rules_href_exclude_paths filter, this diff isn't really sufficient to resolve that issue, and regardless this is mostly orthogonal to the goal of this PR so I'll revert it and let that question be potentially resolved elsewhere in the future.

felixarntz · 2024-01-22T17:32:03Z

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php

-			'home' => trailingslashit( wp_parse_url( home_url( '/' ), PHP_URL_PATH ) ),
-			'site' => trailingslashit( wp_parse_url( site_url( '/' ), PHP_URL_PATH ) ),
+			'home' => self::escape_pattern_string( trailingslashit( wp_parse_url( home_url( '/' ), PHP_URL_PATH ) ) ),
+			'site' => self::escape_pattern_string( trailingslashit( wp_parse_url( site_url( '/' ), PHP_URL_PATH ) ) ),


This escapes the "path prefixes", which makes sense. But wouldn't we also want to escape any other parts of the path string in the same way?

Any string that a 3P developer may provide via the plsr_speculation_rules_href_exclude_paths filter below could also contain one of those characters that would need to be escaped as far as I can tell.

Well, such strings may in fact be patterns and not literal paths, so we probably do want the special characters to have their special meanings there.

For example, the readme.txt example is '/cart/*', and in this case the developer probably did mean "all paths within /cart/", not "the literal path /cart/*".

However, if WordPress is serving from a directory named /* (however odd a choice that may be), then we need to escape the asterisk because it's not intended as a wildcard.

felixarntz · 2024-01-22T17:52:00Z

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php

+		$replacements = array(
+			'+'  => '\\+',
+			'*'  => '\\*',
+			'?'  => '\\?',
+			':'  => '\\:',
+			'{'  => '\\{',
+			'}'  => '\\}',
+			'('  => '\\(',
+			')'  => '\\)',
+			'\\' => '\\\\',
+		);
+		return strtr( $str, $replacements );


I wonder whether there's a native PHP function that does this 🤔

But if not, I think we could simplify this to be a regular expression, since the replacement pattern is always the same (prepend with a single backslash, i.e. \\):

Suggested change

$replacements = array(

'+' => '\\+',

'*' => '\\*',

'?' => '\\?',

':' => '\\:',

'{' => '\\{',

'}' => '\\}',

'(' => '\$',

')' => '\$',

'\\' => '\\\\',

);

return strtr( $str, $replacements );

// Escape any of these individual characters by prefixing with a single backslash.

return preg_replace( '/[+*?:{}()\\\\]{1}/', '\\\\$0', $str );

Not that I could find. addslashes and addcslashes both do it for different sets of special characters, but strtr seemed to be the most common approach here.

I can use preg_replace (the {1} is implied I think but otherwise I'd come up with the same line of code) if you'd prefer.

I'd avoided it because the PHP manual and the results of my Stack Overflow searches both advised not using it where something weaker than regular expressions would suffice, for performance reasons (presumably compiling the pattern has a cost).

I think both are similarly readable but happy to defer to your expertise if preg_replace is preferred.

The addcslashes() function allows you to specify the characters to escape, too. So I think this is better:

return addcslashes( $str, '+*?:{}()\\' );

You're right! When I first read the manual page I misunderstood the section about how characters that have special C meaning, like \n and \r, and thought it only applied to characters with meaningful C escape sequences. But that's not the case, so addcslashes is unambiguously the best solution.

felixarntz · 2024-01-22T17:55:39Z

modules/js-and-css/speculation-rules/helper.php

+						'href_matches' => array( 'pathname' => $prefixer->prefix_path_pattern( '/*' ) ),
 					),
 					// Except for WP login and admin URLs.
 					array(
 						'not' => array(
-							'href_matches' => $href_exclude_paths,
+							'href_matches' => array_map(
+								static function ( string $path ): array {
+									return array( 'pathname' => $path );
+								},
+								$href_exclude_paths
+							),


As you're saying in the PR description, this is potentially controversial. I'd like to understand whether this is truly needed? Wouldn't the escaping alone solve the problem (potentially being expanded to the full path string, rather than just the "path prefix")?

Maybe limiting this to paths makes sense, but at the moment this would prevent limiting by query parameters for instance, which is relevant for some WordPress sites (most importantly for the small subset of sites that doesn't use "pretty permalinks").

Yeah, that's a good point. Unfortunately there are two different kinds of escaping needed here (at least, unless we fix the URL pattern parsing logic to be a little smarter).

The backslash escaping gets us out of the wrong tokenizer behavior (treating : as the start of a name token), but we still end up with the higher level parser seeing the : and thinking "hey! this must be an absolute URL and the stuff to the left of this is a protocol". So it still ends up thinking we have a scheme of "/scope" when we pass it "/scope:0.123/". The URL parser is...smarter about this.

I was trying to force it to be a pathname in this way, but upon a second look I can work around it in a different way, by using a group to force it to not consider this a component boundary.

whatwg/urlpattern#210 has some more context if you're curious, but even if we do make this syntax legal it won't be accepted by Chromium for at least two milestones, so still worth this workaround.

felixarntz · 2024-01-22T17:56:41Z

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php

+	 * @param string $str String to be escaped.
+	 * @return string String with backslashes added where required.
+	 */
+	public static function escape_pattern_string( string $str ): string {


Right now this method is only used within this class. If it remains like that, it should probably become private.

Done. Maybe it would be useful to some other plugin that did want to do this escaping too, but that's not immediately foreseen.

…code review comments).

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php

Co-authored-by: Weston Ruter <[email protected]>

jeremyroman · 2024-01-26T22:08:21Z

Anything else I need to change for this to be mergeable?

westonruter · 2024-01-26T22:37:31Z

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php

+		// one) affects the meaning of the * wildcard, so is left outside the braces.
+		$context_path         = $this->contexts[ $context ];
+		$escaped_context_path = $context_path;
+		if ( strcspn( $context_path, ':?#' ) !== strlen( $context_path ) ) {


Clever. I would have done str_contains( $context_path, ':' ) || str_contains( $context_path, '?' ) || str_contains( $context_path, '#' ) but clearly what you've done is faster.

Or rather:

Suggested change

if ( strcspn( $context_path, ':?#' ) !== strlen( $context_path ) ) {

if ( preg_match( '/[:?#]/', $context_path ) ) {

This would be the more common way to do it (and perhaps more readable?) There are 31 uses of strcspn() in core, whereas there are 635 of preg_replace().

I don't feel strongly either way.

It's probably being a C++ developer by day, but my mild preference is for the most efficient option if the options are similar in readability (though obviously neither is going to have a massive effect on your WordPress performance).

Happy to change if you felt more strongly; it seems not, so leaving this as-is for now.

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php

westonruter

Just a nit and a comment.

…efixer.php Co-authored-by: Weston Ruter <[email protected]>

adamsilverstein

Looks good!

jeremyroman requested a review from felixarntz as a code owner January 20, 2024 01:09

mukeshpanchal27 added [Type] Bug An existing feature is broken [Focus] JS & CSS no milestone PRs that do not have a defined milestone for release [Plugin] Speculative Loading Issues for the Speculative Loading plugin (formerly Speculation Rules) labels Jan 22, 2024

felixarntz reviewed Jan 22, 2024

View reviewed changes

Use braces to deal with component confusion issue instead (and other …

3c8d640

…code review comments).

westonruter reviewed Jan 23, 2024

View reviewed changes

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php Outdated Show resolved Hide resolved

Use addcslashes

3a16b14

Co-authored-by: Weston Ruter <[email protected]>

westonruter reviewed Jan 26, 2024

View reviewed changes

modules/js-and-css/speculation-rules/class-plsr-url-pattern-prefixer.php Outdated Show resolved Hide resolved

westonruter requested changes Jan 26, 2024

View reviewed changes

Update modules/js-and-css/speculation-rules/class-plsr-url-pattern-pr…

78fae85

…efixer.php Co-authored-by: Weston Ruter <[email protected]>

adamsilverstein approved these changes Jan 29, 2024

View reviewed changes

westonruter approved these changes Jan 29, 2024

View reviewed changes

westonruter merged commit 9d6cab4 into WordPress:feature/speculationrules Jan 29, 2024
7 checks passed

westonruter mentioned this pull request Feb 5, 2024

Speculation rules: URL patterns need escaping to work correctly when special characters are present #950

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escape path prefix and restrict it to be a pathname in Speculation Rules #951

Escape path prefix and restrict it to be a pathname in Speculation Rules #951

jeremyroman commented Jan 20, 2024 •

edited

Loading

jeremyroman commented Jan 20, 2024

felixarntz left a comment

felixarntz Jan 22, 2024

jeremyroman Jan 23, 2024

felixarntz Jan 22, 2024

jeremyroman Jan 23, 2024

felixarntz Jan 22, 2024 •

edited

Loading

jeremyroman Jan 23, 2024

westonruter Jan 23, 2024

jeremyroman Jan 23, 2024

felixarntz Jan 22, 2024

jeremyroman Jan 23, 2024

jeremyroman Jan 23, 2024

felixarntz Jan 22, 2024

jeremyroman Jan 23, 2024

jeremyroman commented Jan 26, 2024

westonruter Jan 26, 2024

westonruter Jan 26, 2024

jeremyroman Jan 29, 2024

westonruter left a comment

adamsilverstein left a comment

	if ( strcspn( $context_path, ':?#' ) !== strlen( $context_path ) ) {
	if ( preg_match( '/[:?#]/', $context_path ) ) {

Escape path prefix and restrict it to be a pathname in Speculation Rules #951

Escape path prefix and restrict it to be a pathname in Speculation Rules #951

Conversation

jeremyroman commented Jan 20, 2024 • edited Loading

Summary

Relevant technical choices

Checklist

jeremyroman commented Jan 20, 2024

felixarntz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

felixarntz Jan 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeremyroman commented Jan 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonruter left a comment

Choose a reason for hiding this comment

adamsilverstein left a comment

Choose a reason for hiding this comment

jeremyroman commented Jan 20, 2024 •

edited

Loading

felixarntz Jan 22, 2024 •

edited

Loading