
parquet-compression for lgbm #15

Draft · wants to merge 42 commits into main

Conversation

@YYYasin19 (Contributor) commented Mar 5, 2023

Initial try at a Parquet compression implementation.

This is a draft!

core idea:

  • create a dataframe / table-like structure from the tree information
  • encode it with a custom pyarrow schema
  • write the Arrow Table into a Parquet file, piggybacking (🐷) off pyarrow/Parquet's compression tools and efficient data format
  • instead of writing to an actual file, write into a bytes object that then gets put into the pickle (see the sketch below)
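
A minimal sketch of that dump/load round trip, assuming pandas and pyarrow; the columns are illustrative, not the actual schema used here:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

trees_df = pd.DataFrame({"tree_idx": [0, 1], "num_leaves": [31, 31]})

# dump: DataFrame -> Arrow Table -> Parquet bytes, entirely in memory
table = pa.Table.from_pandas(trees_df)
stream = pa.BufferOutputStream()
pq.write_table(table, stream)
parquet_bytes = stream.getvalue().to_pybytes()  # this goes into the pickle

# load: Parquet bytes -> DataFrame
restored = pq.read_table(pa.BufferReader(parquet_bytes)).to_pandas()
```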

todos:

  • implement dumping
  • implement loading
  • remove pandas inter-dependency
  • fix tests
  • validate effect of compression on the table (i.e. not on the pickle, though that should help as well)
  • speeeeeeed

@github-actions bot commented Mar 6, 2023

(benchmark 4547923027 / attempt 1)

All cells: base results / our results / change.

| Model | Size | Dump time | Load time |
| --- | --- | --- | --- |
| sklearn rf 20M | 20.8 MiB / 3.0 MiB / 6.87x | 0.02 s / 0.05 s / 2.22x | 0.02 s / 0.04 s / 1.99x |
| sklearn rf 20M lzma | 6.5 MiB / 2.0 MiB / 3.26x | 13.29 s / 1.37 s / 0.10x | 0.63 s / 0.23 s / 0.36x |
| sklearn rf 200M | 212.3 MiB / 30.6 MiB / 6.94x | 0.18 s / 0.41 s / 2.21x | 0.21 s / 0.40 s / 1.91x |
| sklearn rf 200M lzma | 47.4 MiB / 14.6 MiB / 3.24x | 112.79 s / 21.01 s / 0.19x | 4.52 s / 1.62 s / 0.36x |
| sklearn rf 1G | 1157.5 MiB / 166.8 MiB / 6.94x | 1.28 s / 2.06 s / 1.61x | 1.25 s / 1.82 s / 1.46x |
| sklearn rf 1G lzma | 258.1 MiB / 98.1 MiB / 2.63x | 587.17 s / 130.46 s / 0.22x | 26.33 s / 9.61 s / 0.36x |
| sklearn gb 2M | 2.2 MiB / 1.1 MiB / 2.08x | 0.04 s / 0.47 s / 11.31x | 0.07 s / 0.22 s / 3.25x |
| sklearn gb 2M lzma | 0.6 MiB / 0.2 MiB / 3.81x | 1.14 s / 0.64 s / 0.56x | 0.13 s / 0.28 s / 2.18x |
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02x | 0.12 s / 0.47 s / 3.95x | 0.01 s / 0.50 s / 34.10x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72x | 1.77 s / 0.49 s / 0.28x | 0.09 s / 0.37 s / 4.23x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37x | 0.22 s / 0.74 s / 3.33x | 0.03 s / 0.69 s / 25.00x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81x | 4.33 s / 0.97 s / 0.22x | 0.17 s / 0.74 s / 4.46x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06x | 0.90 s / 2.78 s / 3.07x | 0.12 s / 2.71 s / 22.19x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05x | 23.85 s / 3.83 s / 0.16x | 0.67 s / 2.66 s / 3.95x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61x | 3.96 s / 12.34 s / 3.12x | 0.65 s / 11.50 s / 17.75x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55x | 107.94 s / 17.34 s / 0.16x | 2.81 s / 12.62 s / 4.49x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56x | 0.45 s / 1.10 s / 2.42x | 0.05 s / 0.69 s / 13.89x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88x | 2.45 s / 1.34 s / 0.55x | 0.14 s / 0.72 s / 5.22x |

@YYYasin19 (Contributor, Author) commented:

Current state of performance:

```
Timer unit: 1e-06 s

Total time: 4.75943 s
File: /Users/ytatar/projects/2302-pickle-compression/slim_trees/lgbm_booster.py
Function: _decompress_booster_handle at line 178

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   178                                           # @profile
   179                                           def _decompress_booster_handle(
   180                                                   compressed_state: Tuple[str, bytes, bytes, bytes, str]
   181                                           ) -> str:
   182        10          4.0      0.4      0.0      (
   183        10          0.0      0.0      0.0          front_str,
   184        10          0.0      0.0      0.0          trees_df_bytes,
   185        10          0.0      0.0      0.0          nodes_df_bytes,
   186        10          0.0      0.0      0.0          leaf_value_bytes,
   187        10          0.0      0.0      0.0          back_str,
   188        10          2.0      0.2      0.0      ) = compressed_state
   189        10          4.0      0.4      0.0      assert type(front_str) == str
   190        10          4.0      0.4      0.0      assert type(back_str) == str
   191                                           
   192        10      34970.0   3497.0      0.7      trees_df = pq_bytes_to_df(trees_df_bytes)
   193        10    2020509.0 202050.9     42.5      nodes_df = pq_bytes_to_df(nodes_df_bytes).groupby("tree_idx").agg(lambda x: list(x))
   194        10          2.0      0.2      0.0      leaf_values_df = (
   195        10     357533.0  35753.3      7.5          pq_bytes_to_df(leaf_value_bytes).groupby("tree_idx")["leaf_value"].apply(list)
   196                                               )
   197                                               # merge trees_df, nodes_df, and leaf_values_df on tree_idx
   198        10      11741.0   1174.1      0.2      trees_df = trees_df.merge(nodes_df, on="tree_idx")
   199        10       9704.0    970.4      0.2      trees_df = trees_df.merge(leaf_values_df, on="tree_idx")
   200                                           
   201                                               # handle = front_str
   202                                           
   203        10          3.0      0.3      0.0      tree_strings = [front_str]
   204                                           
   205                                               # TODO: directly go over trees and nodes
   206     20000     620494.0     31.0     13.0      for i, tree in trees_df.iterrows():

   222                                           
   223     20000      79986.0      4.0      1.7          num_leaves = int(tree["num_leaves"])
   224     20000       2578.0      0.1      0.1          num_nodes = num_leaves - 1
   225                                           
   226     20000      26030.0      1.3      0.5          tree_strings.append(f"""Tree={i}
   227     20000      61575.0      3.1      1.3  num_leaves={int(tree["num_leaves"])}
   228     20000      59082.0      3.0      1.2  num_cat={tree['num_cat']}
   229     20000     148784.0      7.4      3.1  split_feature={' '.join([str(x) for x in tree["split_feature"]])}
   230     20000       4304.0      0.2      0.1  split_gain={("0" * num_nodes)[:-1]}
   231     20000     357808.0     17.9      7.5  threshold={' '.join([str(x) for x in tree['threshold']])}
   232     20000     149148.0      7.5      3.1  decision_type={' '.join([str(x) for x in tree["decision_type"]])}
   233     20000     151402.0      7.6      3.2  left_child={" ".join([str(x) for x in tree["left_child"]])}
   234     20000     151562.0      7.6      3.2  right_child={" ".join([str(x) for x in tree["right_child"]])}
   235     20000     369298.0     18.5      7.8  leaf_value={" ".join([str(x) for x in tree["leaf_value"]])}
   236     20000       4519.0      0.2      0.1  leaf_weight={("0 " * num_leaves)[:-1]}
   237     20000       4008.0      0.2      0.1  leaf_count={("0 " * num_leaves)[:-1]}
   238     20000       3483.0      0.2      0.1  internal_value={("0 " * num_nodes)[:-1]}
   239     20000       3363.0      0.2      0.1  internal_weight={("0 " * num_nodes)[:-1]}
   240     20000       3289.0      0.2      0.1  internal_count={("0 " * num_nodes)[:-1]}
   241     20000      61646.0      3.1      1.3  is_linear={tree['is_linear']}
   242     20000      57860.0      2.9      1.2  shrinkage={tree['shrinkage']}
   243                                           
   244                                           
  
   276                                           
   277        10          1.0      0.1      0.0      tree_strings.append(back_str)
   278                                           
   279                                               # handle += back_str
   280        10       4732.0    473.2      0.1      return "".join(tree_strings)
```
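
The profile shows that ~42% of the load time goes into `groupby("tree_idx").agg(lambda x: list(x))`. As a hedged aside (a sketch, not what this PR currently does): if the node rows are stored tree by tree, the per-tree lists can be recovered with `np.split` on the group boundaries, avoiding the Python-level lambda:

```python
import numpy as np
import pandas as pd

def split_by_tree(nodes_df: pd.DataFrame, column: str) -> list:
    """One array per tree for `column`; assumes nodes_df is sorted by tree_idx."""
    tree_idx = nodes_df["tree_idx"].to_numpy()
    boundaries = np.flatnonzero(np.diff(tree_idx)) + 1  # first row of each new tree
    return np.split(nodes_df[column].to_numpy(), boundaries)

# illustrative data: two trees with 2 and 3 nodes
nodes_df = pd.DataFrame({"tree_idx": [0, 0, 1, 1, 1], "left_child": [1, -1, 1, 2, -1]})
print(split_by_tree(nodes_df, "left_child"))  # [array([ 1, -1]), array([ 1,  2, -1])]
```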

@YYYasin19 (Contributor, Author) commented:

I seem to be missing some values that are required for inference (e.g. num_cat is still missing), hence all the 0.0 prediction values. Otherwise, this should work fine.

@pavelzw (Member) left a comment:

Looking good! Here are a few comments.

Don't forget to delete benchmark.py.lprof before merging.

(Resolved review threads on benchmark.py, environment.yml, slim_trees/lgbm_booster.py, slim_trees/utils.py, tests/test_lgbm_compression.py, and model_parser.py.)
@pavelzw (Member) commented Mar 20, 2023

LGBM gbdt large: 3.05x -> 7.15x
LGBM gbdt large LZMA: 2.29x -> 2.54x

While this is a great performance improvement when no LZMA is applied at the end, the improvement is not that big when LZMA is used. Maybe it makes sense to add an option that skips the pyarrow path and uses the old implementation instead if pyarrow is not installed? (A sketch of this fallback follows below.)
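
A hedged sketch of what such an optional-dependency fallback could look like; the helper names are hypothetical, not slim-trees' actual API:

```python
try:
    import pyarrow  # noqa: F401
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False

def _dump_parquet(state: dict) -> bytes:
    ...  # hypothetical: new pyarrow/Parquet-based encoding

def _dump_legacy(state: dict) -> bytes:
    ...  # hypothetical: old string-based implementation

def dump_booster(state: dict) -> bytes:
    # prefer the Parquet path only when pyarrow is importable
    return _dump_parquet(state) if HAVE_PYARROW else _dump_legacy(state)
```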

But I suggest moving this discussion to when the PR is almost ready to merge.

@pavelzw mentioned this pull request Mar 20, 2023
@YYYasin19 (Contributor, Author) commented Mar 24, 2023

The table suggests that the compression factor improves especially for larger models. Unfortunately, we have lost some performance again; I seem to remember ratios of up to 11x at some point.

```python
stream = pa.BufferOutputStream()
pq.write_table(
    table, stream, compression="lz4"
)  # TODO: investigate different effects of compression
```
(Member) commented on the snippet above:

Let's also test some compression methods other than lz4. During my sklearn tests I noticed that "no" (i.e. no compression) also had very nice performance when lzma is used afterwards.

(Contributor, Author) replied:

I checked out zstd and, randomly, it performed better here. 🤷
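
For reference, a minimal sketch of how such a codec comparison could be run; the table contents and codec list are illustrative, not taken from this PR:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# stand-in for one of the PR's tree dataframes
table = pa.table({"tree_idx": [0, 0, 1], "threshold": [0.5, 1.5, 2.5]})

for codec in ["none", "snappy", "lz4", "zstd", "brotli", "gzip"]:
    stream = pa.BufferOutputStream()
    pq.write_table(table, stream, compression=codec)
    print(f"{codec:>8}: {stream.getvalue().size} bytes")
```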

@pavelzw (Member) commented Mar 28, 2023

Old (all cells: base results / our results / change):

| Model | Size | Dump time | Load time |
| --- | --- | --- | --- |
| lgbm gbdt 2M | 2.6 MiB / 1.0 MiB / 2.78x | 0.09 s / 0.25 s / 2.78x | 0.01 s / 0.15 s / 12.14x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.90x | 1.31 s / 0.51 s / 0.39x | 0.08 s / 0.22 s / 2.73x |
| lgbm gbdt 5M | 5.3 MiB / 1.9 MiB / 2.81x | 0.18 s / 0.51 s / 2.82x | 0.03 s / 0.32 s / 12.54x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.8 MiB / 1.96x | 3.12 s / 1.07 s / 0.34x | 0.15 s / 0.38 s / 2.55x |
| lgbm gbdt 20M | 22.7 MiB / 7.6 MiB / 3.00x | 0.72 s / 2.11 s / 2.95x | 0.11 s / 1.32 s / 11.83x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.0 MiB / 2.09x | 16.68 s / 4.86 s / 0.29x | 0.59 s / 1.54 s / 2.61x |
| lgbm gbdt 100M | 101.1 MiB / 33.0 MiB / 3.06x | 3.23 s / 9.39 s / 2.91x | 0.53 s / 116.98 s / 219.18x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.6 MiB / 2.41x | 83.14 s / 23.69 s / 0.28x | 2.44 s / 98.06 s / 40.16x |
| lgbm rf 10M | 10.9 MiB / 3.2 MiB / 3.46x | 0.36 s / 0.69 s / 1.92x | 0.04 s / 0.56 s / 12.54x |
| lgbm rf 10M lzma | 0.7 MiB / 0.4 MiB / 1.85x | 1.87 s / 0.95 s / 0.51x | 0.12 s / 0.58 s / 4.84x |

New (all cells: base results / our results / change):

| Model | Size | Dump time | Load time |
| --- | --- | --- | --- |
| lgbm gbdt 2M | 2.6 MiB / 0.5 MiB / 5.02x | 0.09 s / 0.37 s / 4.08x | 0.01 s / 0.30 s / 24.32x |
| lgbm gbdt 2M lzma | 0.9 MiB / 0.5 MiB / 1.72x | 1.33 s / 0.41 s / 0.31x | 0.08 s / 0.26 s / 3.31x |
| lgbm gbdt 5M | 5.3 MiB / 1.0 MiB / 5.37x | 0.18 s / 0.56 s / 3.15x | 0.02 s / 0.45 s / 17.93x |
| lgbm gbdt 5M lzma | 1.7 MiB / 0.9 MiB / 1.81x | 3.26 s / 0.78 s / 0.24x | 0.15 s / 0.50 s / 3.38x |
| lgbm gbdt 20M | 22.7 MiB / 3.2 MiB / 7.06x | 0.71 s / 2.16 s / 3.03x | 0.11 s / 1.84 s / 16.68x |
| lgbm gbdt 20M lzma | 6.3 MiB / 3.1 MiB / 2.05x | 17.84 s / 3.02 s / 0.17x | 0.59 s / 1.89 s / 3.19x |
| lgbm gbdt 100M | 101.1 MiB / 10.5 MiB / 9.61x | 3.17 s / 9.45 s / 2.98x | 0.51 s / 7.94 s / 15.59x |
| lgbm gbdt 100M lzma | 25.6 MiB / 10.0 MiB / 2.55x | 81.52 s / 13.03 s / 0.16x | 2.43 s / 8.11 s / 3.34x |
| lgbm rf 10M | 10.9 MiB / 0.9 MiB / 12.56x | 0.36 s / 0.88 s / 2.45x | 0.04 s / 0.46 s / 10.66x |
| lgbm rf 10M lzma | 0.7 MiB / 0.8 MiB / 0.88x | 1.91 s / 1.09 s / 0.57x | 0.12 s / 0.51 s / 4.25x |

@jonashaag (Contributor) commented:

As discussed with @pavelzw today, how about splitting the non-Parquet parts into a separate PR? It's annoying that Parquet doesn't improve the compression ratio despite the effort that you have put into this, @YYYasin19. But let's make the best of it by keeping the other parts!

```python
stream = pa.BufferOutputStream()
pq.write_table(
    table, stream, compression="zstd", compression_level=8
)
```
(Contributor, Author) commented on the snippet above:

One idea for the future might be to try out different values here. I just checked and found that disabling zstd here allows lzma to compress more later, resulting in a slight overall improvement of around 5% for the trees_df; this might generalize to the other dataframes as well. (See the sketch below.)
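
A sketch of that two-stage measurement (Parquet codec first, then lzma on the resulting bytes); the table contents are illustrative:

```python
import lzma

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"tree_idx": [0, 0, 1], "leaf_value": [0.1, 0.2, 0.3]})

def parquet_then_lzma(table: pa.Table, codec: str) -> int:
    """Size in bytes after writing Parquet with `codec` and lzma-compressing it."""
    stream = pa.BufferOutputStream()
    pq.write_table(table, stream, compression=codec)
    return len(lzma.compress(stream.getvalue().to_pybytes()))

print("zstd + lzma:", parquet_then_lzma(table, "zstd"))
print("none + lzma:", parquet_then_lzma(table, "none"))
```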

@pavelzw marked this pull request as draft August 4, 2023 15:00
Labels: enhancement (new feature or request) · 3 participants