Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: count string by codepoint #44

Merged
merged 4 commits into from
Aug 14, 2024

Conversation

yumetodo
Copy link
Contributor

@yumetodo yumetodo commented Aug 13, 2024

Abstruct

Unicode says that there are 4 ways to count string length. https://unicode.org/faq/char_combmark.html#7

This commit supports counting by Code points.

Motivation

When we write text something like Japanese, surrogate pair will be used as usual. In such context, restricting string length is painful without considering surrogate pair.

@yumetodo yumetodo changed the title feat: count string by codepoint Draft! feat: count string by codepoint Aug 13, 2024
@yumetodo yumetodo force-pushed the feat/count_by_codepoint branch 2 times, most recently from 1ece1c7 to ac71454 Compare August 13, 2024 04:09
@yumetodo yumetodo changed the title Draft! feat: count string by codepoint feat: count string by codepoint Aug 13, 2024
Unicode says that there are 4 ways to count string length.
https://unicode.org/faq/char_combmark.html#7

This commit supports counting by Code points.
Copy link
Member

@azu azu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: new Intl.Segmenter("ja-JP", { granularity: "grapheme" }) is more precise, but also more complex to implement due to language dependencies.
(probably, Intl.Segmenter is slower than others)

https://blog.jxck.io/entries/2017-03-02/unicode-in-javascript.html#unicode-text-segmentation
https://github.com/tc39/proposal-intl-segmenter
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter

* By default or set to "code units", count string by UTF-16 code unit(= using `String.prototype.length`).
* If set to "codepoints", count string by codepoint.
*/
countBy?: "code units" | "codepoints";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
countBy?: "code units" | "codepoints";
countBy?: "codeunits" | "codepoints";

I think it would be better to align without spaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const reporter: TextlintRuleReporter<Options> = (context, options = {}) => {
const maxLength = options.max ?? defaultOptions.max;
const skipPatterns = options.skipPatterns ?? options.exclusionPatterns ?? defaultOptions.skipPatterns;
const skipUrlStringLink = options.skipUrlStringLink ?? defaultOptions.skipUrlStringLink;
const strLen =
options.countBy == null || options.countBy === "code units" ? (s: string) => s.length : strLenByCodePoint;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a function like strLenByCodeUnits and use it?

const countBy = options?.countBy ?? defaultOptions.countBy;
const strLen = countBy === "codeunits" ? strLenByCodeUnits : strLenByCodePoint;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yumetodo
Copy link
Contributor Author

@azu Thank you for your review! I applied your suggestions.

FYI: new Intl.Segmenter("ja-JP", { granularity: "grapheme" }) is more precise, but also more complex to implement due to language dependencies.

I just now noticed the API. When we pass undefined as locale, it will cause unstable lint result. So, we need to decide what is to be specified and how to specify it.

However, I think it's out of this PR's scope. countBy? can be extendable to some thing like countBy?: "codeunits" | "codepoints" | "grapheme";.

@azu
Copy link
Member

azu commented Aug 14, 2024

However, I think it's out of this PR's scope. countBy? can be extendable to some thing like countBy?: "codeunits" | "codepoints" | "grapheme";.

Yes, I agree.

@azu azu merged commit a6873ea into textlint-rule:master Aug 14, 2024
2 checks passed
@azu azu added the Type: Feature New Feature label Aug 14, 2024
@yumetodo yumetodo deleted the feat/count_by_codepoint branch August 14, 2024 06:50
@azu
Copy link
Member

azu commented Aug 14, 2024

https://github.com/textlint-rule/textlint-rule-sentence-length/releases/tag/v5.2.0 released

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature New Feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants