
parquet-compression for lgbm #15

Draft · wants to merge 42 commits into main

Conversation

@YYYasin19 (Contributor) commented Mar 5, 2023

Initial try at a Parquet compression implementation.

This is a draft!

core idea:

  • create a dataframe / table-like structure from the tree information
  • encode it with a custom pyarrow schema
  • write the Arrow Table into a Parquet file, piggybacking (🐷) off pyarrow/Parquet's compression tools and efficient data format
  • instead of writing to an actual file, write into a bytes object that then gets put into the pickle (see the sketch below)
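
A minimal sketch of that dump/load round trip, assuming pandas and pyarrow; the columns are illustrative, not the actual schema used here:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

trees_df = pd.DataFrame({"tree_idx": [0, 1], "num_leaves": [31, 31]})

# dump: DataFrame -> Arrow Table -> Parquet bytes, entirely in memory
table = pa.Table.from_pandas(trees_df)
stream = pa.BufferOutputStream()
pq.write_table(table, stream)
parquet_bytes = stream.getvalue().to_pybytes()  # this goes into the pickle

# load: Parquet bytes -> DataFrame
restored = pq.read_table(pa.BufferReader(parquet_bytes)).to_pandas()
```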

todos:

  • implement dumping
  • implement loading
  • remove pandas inter-dependency
  • fix tests
  • validate effect of compression on the table (i.e. not on the pickle, though that should help as well)
  • speeeeeeed

@github-actions bot commented Mar 6, 2023

(benchmark 4547923027 / attempt 1)

All cells: base results / our results / change.

| Model | Size | Dump time | Load time |
| --- | --- | --- | --- |
| sklearn rf 20M | 20.8 MiB / 3.0 MiB / 6.87x | 0.02 s / 0.05 s / 2.22x | 0.02 s / 0.04 s / 1.99x |
| sklearn rf 20M lzma | 6.5 MiB / 2.0 MiB / 3.26x | 13.29 s / 1.37 s / 0.10x | 0.63 s / 0.23 s / 0.36x |
| sklearn rf 200M | 212.3 MiB / 30.6 MiB / 6.94x | 0.18 s / 0.41 s / 2.21x | 0.21 s / 0.40 s / 1.91x |
| sklearn rf 200M lzma | 47.4 MiB / 14.6 MiB / 3.24x | 112.79 s / 21.01 s / 0.19x | 4.52 s / 1.62 s / 0.36x |
| sklearn rf 1G | 1157.5 MiB / 166.8 MiB / 6.94x | 1.28 s / 2.06 s / 1.61x | 1.25 s / 1.82 s / 1.46x |
| sklearn rf 1G lzma | 258.1 MiB / 98.1 MiB / 2.63x | 587.17 s / 130.46 s / 0.22x | 26.33 s / 9.61 s / 0.36x |
| sklearn gb 2M | 2.2 MiB / 1.1 MiB / 2.08x | 0.04 s / 0.47 s / 11.31x | 0.07 s / 0.22 s / 3.25x |
| sklearn gb 2M lzma | 0.6 MiB / 0.2 MiB / 3.81x | 1.14 s / 0.64 s / 0.56x | 0.13 s / 0.28 s / 2.18x |
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02x | 0.12 s / 0.47 s / 3.95x | 0.01 s / 0.50 s / 34.10x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72x | 1.77 s / 0.49 s / 0.28x | 0.09 s / 0.37 s / 4.23x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37x | 0.22 s / 0.74 s / 3.33x | 0.03 s / 0.69 s / 25.00x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81x | 4.33 s / 0.97 s / 0.22x | 0.17 s / 0.74 s / 4.46x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06x | 0.90 s / 2.78 s / 3.07x | 0.12 s / 2.71 s / 22.19x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05x | 23.85 s / 3.83 s / 0.16x | 0.67 s / 2.66 s / 3.95x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61x | 3.96 s / 12.34 s / 3.12x | 0.65 s / 11.50 s / 17.75x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55x | 107.94 s / 17.34 s / 0.16x | 2.81 s / 12.62 s / 4.49x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56x | 0.45 s / 1.10 s / 2.42x | 0.05 s / 0.69 s / 13.89x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88x | 2.45 s / 1.34 s / 0.55x | 0.14 s / 0.72 s / 5.22x |

@YYYasin19 (Contributor, Author) commented:

Current state of performance:

```
Timer unit: 1e-06 s

Total time: 4.75943 s
File: /Users/ytatar/projects/2302-pickle-compression/slim_trees/lgbm_booster.py
Function: _decompress_booster_handle at line 178

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   178                                           # @profile
   179                                           def _decompress_booster_handle(
   180                                                   compressed_state: Tuple[str, bytes, bytes, bytes, str]
   181                                           ) -> str:
   182        10          4.0      0.4      0.0      (
   183        10          0.0      0.0      0.0          front_str,
   184        10          0.0      0.0      0.0          trees_df_bytes,
   185        10          0.0      0.0      0.0          nodes_df_bytes,
   186        10          0.0      0.0      0.0          leaf_value_bytes,
   187        10          0.0      0.0      0.0          back_str,
   188        10          2.0      0.2      0.0      ) = compressed_state
   189        10          4.0      0.4      0.0      assert type(front_str) == str
   190        10          4.0      0.4      0.0      assert type(back_str) == str
   191                                           
   192        10      34970.0   3497.0      0.7      trees_df = pq_bytes_to_df(trees_df_bytes)
   193        10    2020509.0 202050.9     42.5      nodes_df = pq_bytes_to_df(nodes_df_bytes).groupby("tree_idx").agg(lambda x: list(x))
   194        10          2.0      0.2      0.0      leaf_values_df = (
   195        10     357533.0  35753.3      7.5          pq_bytes_to_df(leaf_value_bytes).groupby("tree_idx")["leaf_value"].apply(list)
   196                                               )
   197                                               # merge trees_df, nodes_df, and leaf_values_df on tree_idx
   198        10      11741.0   1174.1      0.2      trees_df = trees_df.merge(nodes_df, on="tree_idx")
   199        10       9704.0    970.4      0.2      trees_df = trees_df.merge(leaf_values_df, on="tree_idx")
   200                                           
   201                                               # handle = front_str
   202                                           
   203        10          3.0      0.3      0.0      tree_strings = [front_str]
   204                                           
   205                                               # TODO: directly go over trees and nodes
   206     20000     620494.0     31.0     13.0      for i, tree in trees_df.iterrows():

   222                                           
   223     20000      79986.0      4.0      1.7          num_leaves = int(tree["num_leaves"])
   224     20000       2578.0      0.1      0.1          num_nodes = num_leaves - 1
   225                                           
   226     20000      26030.0      1.3      0.5          tree_strings.append(f"""Tree={i}
   227     20000      61575.0      3.1      1.3  num_leaves={int(tree["num_leaves"])}
   228     20000      59082.0      3.0      1.2  num_cat={tree['num_cat']}
   229     20000     148784.0      7.4      3.1  split_feature={' '.join([str(x) for x in tree["split_feature"]])}
   230     20000       4304.0      0.2      0.1  split_gain={("0" * num_nodes)[:-1]}
   231     20000     357808.0     17.9      7.5  threshold={' '.join([str(x) for x in tree['threshold']])}
   232     20000     149148.0      7.5      3.1  decision_type={' '.join([str(x) for x in tree["decision_type"]])}
   233     20000     151402.0      7.6      3.2  left_child={" ".join([str(x) for x in tree["left_child"]])}
   234     20000     151562.0      7.6      3.2  right_child={" ".join([str(x) for x in tree["right_child"]])}
   235     20000     369298.0     18.5      7.8  leaf_value={" ".join([str(x) for x in tree["leaf_value"]])}
   236     20000       4519.0      0.2      0.1  leaf_weight={("0 " * num_leaves)[:-1]}
   237     20000       4008.0      0.2      0.1  leaf_count={("0 " * num_leaves)[:-1]}
   238     20000       3483.0      0.2      0.1  internal_value={("0 " * num_nodes)[:-1]}
   239     20000       3363.0      0.2      0.1  internal_weight={("0 " * num_nodes)[:-1]}
   240     20000       3289.0      0.2      0.1  internal_count={("0 " * num_nodes)[:-1]}
   241     20000      61646.0      3.1      1.3  is_linear={tree['is_linear']}
   242     20000      57860.0      2.9      1.2  shrinkage={tree['shrinkage']}
   243                                           
   244                                           
  
   276                                           
   277        10          1.0      0.1      0.0      tree_strings.append(back_str)
   278                                           
   279                                               # handle += back_str
   280        10       4732.0    473.2      0.1      return "".join(tree_strings)
```
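
The profile shows that ~42% of the load time goes into `groupby("tree_idx").agg(lambda x: list(x))`. As a hedged aside (a sketch, not what this PR currently does): if the node rows are stored tree by tree, the per-tree lists can be recovered with `np.split` on the group boundaries, avoiding the Python-level lambda:

```python
import numpy as np
import pandas as pd

def split_by_tree(nodes_df: pd.DataFrame, column: str) -> list:
    """One array per tree for `column`; assumes nodes_df is sorted by tree_idx."""
    tree_idx = nodes_df["tree_idx"].to_numpy()
    boundaries = np.flatnonzero(np.diff(tree_idx)) + 1  # first row of each new tree
    return np.split(nodes_df[column].to_numpy(), boundaries)

# illustrative data: two trees with 2 and 3 nodes
nodes_df = pd.DataFrame({"tree_idx": [0, 0, 1, 1, 1], "left_child": [1, -1, 1, 2, -1]})
print(split_by_tree(nodes_df, "left_child"))  # [array([ 1, -1]), array([ 1,  2, -1])]
```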

@YYYasin19 (Contributor, Author) commented:

I seem to be missing some values that are required for inference (e.g. num_cat is still missing), hence all the 0.0 prediction values. Otherwise, this should work fine.

@pavelzw (Member) left a comment:

Looking good! Here are a few comments.

Don't forget to delete benchmark.py.lprof before merging.

(Resolved review threads on benchmark.py, environment.yml, slim_trees/lgbm_booster.py, slim_trees/utils.py, tests/test_lgbm_compression.py, and model_parser.py.)
@pavelzw (Member) commented Mar 20, 2023

LGBM gbdt large: 3.05x -> 7.15x
LGBM gbdt large LZMA: 2.29x -> 2.54x

While this is a great performance improvement when no LZMA is applied at the end, the improvement is not that big when LZMA is used. Maybe it makes sense to add an option that skips the pyarrow path and uses the old implementation instead if pyarrow is not installed? (A sketch of this fallback follows below.)
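
A hedged sketch of what such an optional-dependency fallback could look like; the helper names are hypothetical, not slim-trees' actual API:

```python
try:
    import pyarrow  # noqa: F401
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False

def _dump_parquet(state: dict) -> bytes:
    ...  # hypothetical: new pyarrow/Parquet-based encoding

def _dump_legacy(state: dict) -> bytes:
    ...  # hypothetical: old string-based implementation

def dump_booster(state: dict) -> bytes:
    # prefer the Parquet path only when pyarrow is importable
    return _dump_parquet(state) if HAVE_PYARROW else _dump_legacy(state)
```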

But I suggest moving this discussion to when the PR is almost ready to merge.

@pavelzw mentioned this pull request Mar 20, 2023
@YYYasin19 (Contributor, Author) commented Mar 24, 2023

The table suggests that the compression factor improves especially for larger models. Unfortunately, we have lost some performance again; I seem to remember ratios of up to 11x at some point.

```python
stream = pa.BufferOutputStream()
pq.write_table(
    table, stream, compression="lz4"
)  # TODO: investigate different effects of compression
```
(Member) commented on the snippet above:

Let's also test some compression methods other than lz4. During my sklearn tests I noticed that "no" (i.e. no compression) also had very nice performance when lzma is used afterwards.

(Contributor, Author) replied:

I checked out zstd and, randomly, it performed better here. 🤷
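
For reference, a minimal sketch of how such a codec comparison could be run; the table contents and codec list are illustrative, not taken from this PR:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# stand-in for one of the PR's tree dataframes
table = pa.table({"tree_idx": [0, 0, 1], "threshold": [0.5, 1.5, 2.5]})

for codec in ["none", "snappy", "lz4", "zstd", "brotli", "gzip"]:
    stream = pa.BufferOutputStream()
    pq.write_table(table, stream, compression=codec)
    print(f"{codec:>8}: {stream.getvalue().size} bytes")
```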

@pavelzw (Member) commented Mar 28, 2023

Old (all cells: base results / our results / change):

| Model | Size | Dump time | Load time |
| --- | --- | --- | --- |
| lgbm gbdt 2M | 2.6 MiB / 1.0 MiB / 2.78x | 0.09 s / 0.25 s / 2.78x | 0.01 s / 0.15 s / 12.14x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.90x | 1.31 s / 0.51 s / 0.39x | 0.08 s / 0.22 s / 2.73x |
| lgbm gbdt 5M | 5.3 MiB / 1.9 MiB / 2.81x | 0.18 s / 0.51 s / 2.82x | 0.03 s / 0.32 s / 12.54x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.8 MiB / 1.96x | 3.12 s / 1.07 s / 0.34x | 0.15 s / 0.38 s / 2.55x |
| lgbm gbdt 20M | 22.7 MiB / 7.6 MiB / 3.00x | 0.72 s / 2.11 s / 2.95x | 0.11 s / 1.32 s / 11.83x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.0 MiB / 2.09x | 16.68 s / 4.86 s / 0.29x | 0.59 s / 1.54 s / 2.61x |
| lgbm gbdt 100M | 101.1 MiB / 33.0 MiB / 3.06x | 3.23 s / 9.39 s / 2.91x | 0.53 s / 116.98 s / 219.18x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.6 MiB / 2.41x | 83.14 s / 23.69 s / 0.28x | 2.44 s / 98.06 s / 40.16x |
| lgbm rf 10M | 10.9 MiB / 3.2 MiB / 3.46x | 0.36 s / 0.69 s / 1.92x | 0.04 s / 0.56 s / 12.54x |
| lgbm rf 10M lzma | 0.7 MiB / 0.4 MiB / 1.85x | 1.87 s / 0.95 s / 0.51x | 0.12 s / 0.58 s / 4.84x |

New (all cells: base results / our results / change):

| Model | Size | Dump time | Load time |
| --- | --- | --- | --- |
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02x | 0.09 s / 0.37 s / 4.08x | 0.01 s / 0.30 s / 24.32x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72x | 1.33 s / 0.41 s / 0.31x | 0.08 s / 0.26 s / 3.31x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37x | 0.18 s / 0.56 s / 3.15x | 0.02 s / 0.45 s / 17.93x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81x | 3.26 s / 0.78 s / 0.24x | 0.15 s / 0.50 s / 3.38x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06x | 0.71 s / 2.16 s / 3.03x | 0.11 s / 1.84 s / 16.68x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05x | 17.84 s / 3.02 s / 0.17x | 0.59 s / 1.89 s / 3.19x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61x | 3.17 s / 9.45 s / 2.98x | 0.51 s / 7.94 s / 15.59x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55x | 81.52 s / 13.03 s / 0.16x | 2.43 s / 8.11 s / 3.34x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56x | 0.36 s / 0.88 s / 2.45x | 0.04 s / 0.46 s / 10.66x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88x | 1.91 s / 1.09 s / 0.57x | 0.12 s / 0.51 s / 4.25x |

@jonashaag (Contributor) commented:

As discussed with @pavelzw today, how about splitting the non-Parquet parts into a separate PR? It's annoying that Parquet doesn't improve the compression ratio despite the effort that you have put into this, @YYYasin19. But let's make the best of it by keeping the other parts!

```python
stream = pa.BufferOutputStream()
pq.write_table(
    table, stream, compression="zstd", compression_level=8
)
```
(Contributor, Author) commented on the snippet above:

One idea for the future might be to try out different values here. I just checked and found that disabling zstd here allows lzma to compress more later, resulting in a slight overall improvement of around 5% for the trees_df; this might generalize to the other dataframes as well. (See the sketch below.)
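
A sketch of that two-stage measurement (Parquet codec first, then lzma on the resulting bytes); the table contents are illustrative:

```python
import lzma

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"tree_idx": [0, 0, 1], "leaf_value": [0.1, 0.2, 0.3]})

def parquet_then_lzma(table: pa.Table, codec: str) -> int:
    """Size in bytes after writing Parquet with `codec` and lzma-compressing it."""
    stream = pa.BufferOutputStream()
    pq.write_table(table, stream, compression=codec)
    return len(lzma.compress(stream.getvalue().to_pybytes()))

print("zstd + lzma:", parquet_then_lzma(table, "zstd"))
print("none + lzma:", parquet_then_lzma(table, "none"))
```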

@pavelzw marked this pull request as draft August 4, 2023 15:00
Labels: enhancement (new feature or request) · 3 participants