Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shorter encoding using toString #2

Open
837951602 opened this issue Dec 30, 2016 · 10 comments
Open

Shorter encoding using toString #2

837951602 opened this issue Dec 30, 2016 · 10 comments
Assignees

Comments

@837951602
Copy link
Contributor

Let's say we encoding 'hhhhh'. I directly type 'hhhhh' and get 16679 chars. Even if I select 'All new browsers', 6014 chars are generated. However 1889567..toString(18) also be 'hhhhh' and it's length is 3532 chars(All env.) and 1611 chars(new browser),

@fasttime
Copy link
Owner

fasttime commented Dec 30, 2016

Thanks for reporting this issue. I'm sure JScrewIt could do much better in merging adjacent characters like in the case you describe.

The reason why I've always shunned this optimization is that it's hard to get it right without trying a potentially very large number of functions, parameters, and most of all start and end points. To summarize:

  • atob, btoa, toString, toUpperCase and unescape can all handle multiple characters at the same time. When in doubt what to choose, try all available options.
  • toString accepts a radix parameter that can make a difference: 16223185..toString(31) evaluates to "hhhhh" and is shorter than 1889567..toString(18).
  • In longer strings, it is not evident what adjacent characters work well when merged together: is "hasty" best written as a toString expression? What about "hash"? Note that the complexity grows quadratically with the length of the string.

As a first step, maybe trying to encode the whole string as a toString expression would be fine. I'll be thinking about this.

@837951602
Copy link
Contributor Author

'toString' costs 3241 and numbers are cheap that 999999999999999 costs 705. If there's two high-cost letters near, merging them will always give positive effect. 'g' can be treated as half an high-cost letter.
unescape('%aa')+'bb'+unescape('cc') equals to unescape('%aa'+'bb'.replace('%','%25')+'cc'). As numbera are cheap, the first be better only if there're too many '%'s.
Low-cost characters may be left away. Won't differ too much.

@fasttime
Copy link
Owner

Thanks a lot for your ideas. It is important to me that new optimizations make encoding not just better, but as good as possible, while preserving linear complexity. JScrewIt can calculate the length of each single character quite efficiently, so there is no need for approximate cost estimations.

Since you are the first person to ask for this long-due optimization, I've decided I'm going to start working on this after the next release, so don't expect anything before version 2.6.0. I'll be posting updates here as the work progresses.

@fasttime fasttime self-assigned this Feb 1, 2017
@fasttime
Copy link
Owner

fasttime commented Feb 5, 2017

I've started to work on this issue. The next release branch already contains the logic to encode multiple characters in a toString group. This looks all very promising but there is still a lot to be done.

@837951602
Copy link
Contributor Author

837951602 commented Feb 6, 2017

11e-13 is +(+!![]+(+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+!![])+(!![]+[])[!![]+!![]+!![]]+(+((+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+!![]))+[])[!![]+!![]]+(+!![])+(!![]+!![])) while +'11e-13' is +(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(+((+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+!![]))+[])[!![]+!![]]+(+!![])+(!![]+!![]+!![])) and +'0.0000000000011' is +(+[]+(+(+!![]+[+!![]]+(!![]+[])[!![]+!![]+!![]]+(!![]+!![])+(+[]))+[])[+!![]]+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+[])+(+!![])+(+!![]))

@fasttime
Copy link
Owner

fasttime commented Feb 6, 2017

This happens because 11e-13 is being encoded as +"1.1e-12", in line with its canonical representation. There will be some improvements to numeric encoding in the next patch release, but the case you note still needs some research, particularly because I'm not sure that all engines would handle all edge cases correctly.

@837951602
Copy link
Contributor Author

If 1.1e21 can be written into 11e20, why not 1.1e-12 into 11e-13?
Maybe a new issue is a better choice if the part of number need deeper talk?

@fasttime
Copy link
Owner

fasttime commented Feb 7, 2017

@837951602 Sure, feel free to open a new issue.

@fasttime
Copy link
Owner

toString optimization is now complete in branch 2.6. Not perfect, but complete. I'll be surely working to improve this and add support for more optimizations like atob clusters in future versions. Also, numeric encoding has become better, and all tests are green, phew.

@fasttime
Copy link
Owner

I've just completed a new optimizer: the complex optimizer uses the same principle as the toString optimizer, and has now replaced the older complex encoding mechanism. The optimizations are still the same, for example having "Number" expressed as 0..constructor.name when it is convenient.

Although this change is mostly a streamlined new implementation of existing functionality, there are a few exceptional cases where the output length would actually improve (e.g. with input "Stringpool"), and that's because the complex optimizer does not take precedence over other optimizations, unlike the old implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants