too many repeats breaks the empirical_cdf algorithm? #19

@mikoontz

I get an error when running the mltools::empirical_cdf() function on a wild-caught data set. I think it is related to repeated values in the columns, though I can't always reproduce the error when some repeats are present.

Is this a fundamental limitation of the algorithm? Or is there something else more insidious that might be at work?

Here's a reproducible example (and the location in the function code where it seems to break):

library(data.table)
library(mltools)

set.seed(123)
data <- as.matrix(data.frame(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1), z = rnorm(n = 5)))
dt <- data.table(data)

The data look like this:

> dt
   x y           z
1: 0 0 -0.56047565
2: 0 0 -0.23017749
3: 0 4  1.55870831
4: 0 2  0.07050839
5: 1 1  0.12928774

Implementing the ecdf looks like this:

(mltools_package <- empirical_cdf(dt, ubounds = dt)$CDF)

And the error (from {data.table}) looks like this:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 17 rows; more than 10 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

This appears to break during the rolling join:

binned <- uboundDT[binned, on=col, roll=-Inf, nomatch=0]
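One possible reading of the "17 rows" in the error message, sketched in base R. It assumes the failing join is on column x and that each row on one side is paired with every row on the other side holding the same key, as in a plain equi-join. This is an assumption about what happens inside the function, not something verified against the package code, and the rolling join may match rows differently. The names x_keys, i_keys, join_rows and row_limit are made up for illustration.

```r
# Keys from column x of the example; assumed to appear on both sides of the join.
x_keys <- c(0, 0, 0, 0, 1)
i_keys <- x_keys

# Assumption: each row of i is paired with every row of x that holds the same key.
matches_per_i_row <- vapply(i_keys, function(k) sum(x_keys == k), integer(1))

join_rows <- sum(matches_per_i_row)            # 4 * 4 + 1 * 1
row_limit <- length(x_keys) + length(i_keys)   # nrow(x) + nrow(i) from the message

join_rows
row_limit
```

Under that assumption the four repeated zeros alone contribute 16 rows, which matches the count in the message. By the same counting, column y (two repeated zeros) would give 2 * 2 + 3 = 7 rows and stay under the limit, which might explain why some repeats do not trigger the error.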

Copying your function code into my script and (naively!) adding allow.cartesian=TRUE to the rolling join line gives CDF values greater than 1, so it doesn't seem to be a super simple fix.
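For checking any candidate fix, here is a minimal brute-force reference in base R. It is not the mltools algorithm. It applies the usual definition of a multivariate empirical CDF directly: for each upper-bound row, the fraction of observations that are less than or equal to it in every column. The function name brute_force_ecdf is hypothetical, and the approach takes O(n * m) row comparisons, so it is only meant for small sanity checks.

```r
# Brute-force multivariate ECDF (reference only, not the mltools implementation).
# For each row u of ubounds, return the share of rows of x with all coordinates <= u.
brute_force_ecdf <- function(x, ubounds) {
  x <- as.matrix(x)
  ubounds <- as.matrix(ubounds)
  vapply(seq_len(nrow(ubounds)), function(i) {
    u <- ubounds[i, ]
    mean(apply(x, 1, function(obs) all(obs <= u)))
  }, numeric(1))
}

# Same data as the reproducible example above.
set.seed(123)
m <- cbind(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1), z = rnorm(n = 5))

reference_cdf <- brute_force_ecdf(m, m)
reference_cdf
```

Because each value is a mean of TRUE/FALSE flags, the result always lies between 0 and 1. For this data set the reference values are 0.2, 0.4, 0.8, 0.6 and 0.6, so any output above 1 from a modified join indicates rows being counted more than once.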

Is this just a limitation of this particular algorithm?
