Conversation

@CallMeMSL

I created the first version of the new API and would like some feedback. The booster is built with the TypeState pattern, so logic errors made while building the booster are caught by the compiler.
I also added add_params as a builder method, so that you can have different params or datasets after duplicating a builder.
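For context, the intended happy path looks roughly like this (dataset and params construction elided; this is a sketch, not final API):

    // Leaving out add_train_data or add_params means fit() doesn't exist
    // on the resulting builder type, so the mistake fails to compile.
    let booster = Booster::builder()
        .add_train_data(dataset)
        .add_params(params)
        .fit()?;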

FFI code and tests are still missing.
You can also ignore the changes in old_booster.rs and old_dataset.rs; the refactoring got a little messy.

@CallMeMSL requested a review from leofidus on May 18, 2023 at 20:24

@leofidus left a comment


I think this is definitely on the right track.

Comment on lines 20 to 29
/// Builder for the Booster.
///
/// Uses TypeState Pattern to make sure that Training Data is added
/// so that Validation can be synced properly and params are present for training.
#[derive(Default, Clone)]
pub struct BoosterBuilder<T: Clone, P: Clone> {
train_data: T,
val_data: Vec<DataSet>,
params: P, // after #3 should this be a struct
}


It's a neat pattern, and it looks like Rustdoc doesn't have issues with it either.
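For anyone unfamiliar with the pattern: each builder method consumes self and returns a builder with a different type parameter, so the set of available methods changes as you go. A minimal sketch of the transition (marker types from the diff, exact body assumed):

    impl<P: Clone> BoosterBuilder<TrainDataNotAdded, P> {
        // Consuming self and returning a differently-typed builder is what
        // lets the compiler track "training data has been added".
        pub fn add_train_data(self, train: DataSet) -> BoosterBuilder<TrainDataAdded, P> {
            BoosterBuilder {
                train_data: TrainDataAdded(train),
                val_data: self.val_data,
                params: self.params,
            }
        }
    }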

Comment on lines 61 to 66
impl<P: Clone> BoosterBuilder<TrainDataAdded, P> {
pub fn add_val_data(mut self, val: DataSet) -> Self {
self.val_data.push(val);
self
}
}


People tend to add validation data after training data, but is there a reason this isn't just implemented on BoosterBuilder<T, P>?

I guess restricting it helps with validation.

@CallMeMSL (Author)

My reason is that I'm not sure whether loading the datasets in fit() is the best approach. Restricting it like this keeps open the possibility of loading the dataset directly at the add_val_data call, which would also make duplicate() a lot more efficient.
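To make that concrete, a purely hypothetical eager variant (load_with_reference is a made-up helper, and val_data would have to become Vec<LoadedDataSet>):

    impl<P: Clone> BoosterBuilder<TrainDataAdded, P> {
        pub fn add_val_data(mut self, val: DataSet) -> Result<Self, LgbmError> {
            // The training set is guaranteed to exist in this state, so the
            // validation set can be loaded against it right away, keeping
            // the bin mappers in sync.
            let loaded = val.load_with_reference(&self.train_data.0)?;
            self.val_data.push(loaded);
            Ok(self)
        }
    }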

Comment on lines 14 to 18
pub struct Booster {
handle: lightgbm_sys::BoosterHandle,
train_data: DataSet,
validation_data: Vec<DataSet>,
}


Are these meant to be LoadedDatasets? I would assume fit() loads the data.

What about Boosters that were trained in advance and are loaded from file? What would their train_data and validation_data be? (And does it even make sense to hold onto these potentially huge datasets?)

@CallMeMSL (Author)

Yes, you're right. This is a refactoring artifact and should just be the Dataset pointers. Now that you mention it, it probably doesn't make sense to add them to the booster if we build them in fit() anyway.

Comment on lines 24 to 37
pub enum DataFormat {
File {
path: String,
},
Vecs {
x: InputMatrix,
y: OutputVec,
},
#[cfg(feature = "dataframe")]
DataFrame {
df: DataFrame,
y_column: String,
},
}


I guess this is to make Datasets clonable?

It feels like it makes datasets and error handling a bit more complicated compared to just loading them directly. It would also prevent future load_* functions that only take a reference to properly laid-out data (maybe loading nalgebra arrays, if they support that?).

I suspect you can implement Clone on Dataset by calling LGBM_DatasetCreateByReference(h_old, rows, &mut h_new) followed by LGBM_DatasetAddFeaturesFrom(h_new, h_old). But maybe that's completely wrong; the documentation is incredibly vague.
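Rough shape of what I mean (entirely untested; num_rows() is an assumed helper, and the field layout follows the current diff):

    impl Clone for LoadedDataSet {
        fn clone(&self) -> Self {
            // Create an empty dataset sharing the old one's bin mappers, then
            // pull the feature data across. Return codes are ignored here; a
            // real impl would have to check them (Clone can't return Result).
            let mut new_handle = std::ptr::null_mut();
            unsafe {
                lightgbm_sys::LGBM_DatasetCreateByReference(
                    self.handle,
                    self.num_rows(),
                    &mut new_handle,
                );
                lightgbm_sys::LGBM_DatasetAddFeaturesFrom(new_handle, self.handle);
            }
            LoadedDataSet {
                handle: new_handle,
                dataset: self.dataset.clone(),
            }
        }
    }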


@matthiasvedder left a comment


Looks very promising overall.

#[derive(Clone)]
pub struct TrainDataAdded(DataSet); // this should not implement default, so it can safely be used for construction
#[derive(Default, Clone)]
pub struct TrainDataNotAdded;


Thinking about the resulting API, I was wondering whether these structs could have names where the important part stands out more, like WithTrainData and NoTrainData? Then the difference is the very first word, instead of a Not added in the middle of a fairly long type name.

Same for Params.
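i.e. roughly this, payloads unchanged (the Params pair is assumed analogously):

    #[derive(Clone)]
    pub struct WithTrainData(DataSet);
    #[derive(Default, Clone)]
    pub struct NoTrainData;

    // The error message would then read: method not found in
    // `BoosterBuilder<NoTrainData, NoParams>`.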

@CallMeMSL (Author)

The problem with the TypeState pattern is that you can't set custom error messages. However, the structs do appear in the error message when you try to call a method from a different implementation. Example:

25  | pub struct BoosterBuilder<T: Clone, P: Clone> {
    | --------------------------------------------- method `fit` not found for this struct
...
109 |         let builder = Booster::builder().fit();
    |                                          ^^^ method not found in `BoosterBuilder<TrainDataNotAdded, ParamsNotAdded>`
    |
    = note: the method was found for
            - `BoosterBuilder<TrainDataAdded, ParamsAdded>`

Since this is the only place where the user actually encounters these structs, I named them so that the error message reads naturally.

But your suggestion would work as well.

/// Returns the Builder and a clone from it. Useful if you want to train 2 models with
/// only a couple differences
pub fn duplicate(self) -> (Self, Self) {
(self.clone(), self)


Can you elaborate why this function returns two instances of Self and why self is the second one?

Calling this like let (other, me) = me.duplicate(); feels a bit weird at first glance.

@CallMeMSL (Author)

(self, self.clone()) would work as well.

I added this function so that you can call it after defining everything the two boosters have in common, and then add the differences, like this:

        let (bst_low_lr, bst_high_lr) = Booster::builder()
            .add_train_data(dataset)
            .add_val_data(another_dataset)
            .add_val_data(also_a_dataset)
            .duplicate();
        let bst_low_lr = bst_low_lr.add_params(params_a).fit()?;
        let bst_high_lr = bst_high_lr.add_params(params_b).fit()?;


Your example does look clean.

It would restrict us to two boosters, though. I don't know if comparing three or more boosters makes any sense.

Eventually, examples like these should be part of the docs; they help readers understand the API better.


If you want more than two boosters, you'd probably use clone again. For example, if you have a Vec of params you want to test, you could do:

let src_bst = Booster::builder()
    .add_train_data(dataset)
    .add_val_data(another_dataset)
    .add_val_data(also_a_dataset);
let boosters: Vec<_> = params
    .into_iter()
    .map(|p| src_bst.clone().add_params(p).fit())
    .filter_map(|booster| booster.ok())
    .collect();

duplicate() is a bit of a special case, but I think it's nice to have.

@CallMeMSL (Author)

I think the rewrite is far enough along that we can accept this PR. Any feedback?

validation_data: Vec<LoadedDataSet>,
}

// exchange params method as well? does this make sense?


Leftover comment from the development stage?

/// # Ok(())}
/// ```
pub fn predict(&self, x: &Matrixf64) -> Result<Matrixf64, LgbmError> {
let prediction_params = ""; // do we need this?


do we?

.collect())
}

/// this should take &mut self, because it changes the model


Not really a doc comment.

/// This should not reset the already existing submodels.
/// Pass an empty array as validation data, if you don't want to validate the train results.
/// TODO validate this after implemented
pub fn finetune(


What should happen with this code? Delete it?

/// The DatasetHandle is returned by the lightgbm ffi.
pub struct LoadedDataSet {
pub(crate) handle: DatasetHandle,
dataset: DataSet, // this can maybe be removed


Since clippy warns about it, it should be removed.

@CallMeMSL changed the title from "created first version of api. ffi calls and example missing" to "created first version of api. ffi calls and example missing CU-861n1bn07" on Jul 24, 2023
@leofidus

Task linked: CU-861n1bn07 LightGBM API Rewrite
