Column names of spectra matrix - consistent behaviour of as.data.frame()/model.matrix() and as.matrix(x)/x[[]] and as.wide.df() #232

@cbeleites


For some models, e.g. prcomp(), hyperSpec objects can be used to fit the model, but not for prediction.

The underlying "mechanism" of the issue is that prcomp() calls as.data.frame() and then model.matrix() to extract the relevant columns of the training data, whereas predict.prcomp() expects a data.frame or matrix.

PCA <- prcomp (~ spc, flu)

Prediction needs a data.frame or matrix whose column names have the form spc<wl>, matching the row names of the rotation matrix:

head(rownames(PCA$rotation))
#> [1] "spc405"   "spc405.5" "spc406"   "spc406.5" "spc407"   "spc407.5"
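
Where the spc<wl> names come from can be reproduced without hyperSpec. The following is a minimal sketch in base R, assuming a toy data.frame with a matrix column named spc as a stand-in for the hyperSpec object (the toy data and the names toy, wl and toy_pca are illustrative, not part of the package):

```r
# Stand-in for a hyperSpec object: a data.frame with a matrix column `spc`
# whose column names are the wavelengths (toy values, not the flu data).
set.seed(1)
wl  <- c("405", "405.5", "406", "406.5")
toy <- data.frame(conc = seq_len(6))
toy$spc <- matrix(rnorm(6 * length(wl)), nrow = 6,
                  dimnames = list(NULL, wl))

# model.matrix() expands the matrix term and prefixes each column
# with the term name.
mm <- model.matrix(~ spc - 1, data = toy)
colnames(mm)
#> [1] "spc405"   "spc405.5" "spc406"   "spc406.5"

# The formula interface of prcomp() goes through model.matrix(),
# so the fitted model stores the prefixed names.
toy_pca <- prcomp(~ spc, data = toy)
rownames(toy_pca$rotation)
#> [1] "spc405"   "spc405.5" "spc406"   "spc406.5"
```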

Thus, passing the hyperSpec object directly fails:

predict (PCA, flu)
#> Error in predict.prcomp(PCA, flu): 'newdata' must be a matrix or data frame

Neither as.data.frame() nor as.matrix() works:

predict(PCA, as.data.frame(flu))
#> Error in predict.prcomp(PCA, as.data.frame(flu)): 'newdata' does not have named columns matching one or more of the original columns
predict(PCA, as.matrix(flu))
#> Error in predict.prcomp(PCA, as.matrix(flu)): 'newdata' does not have named columns matching one or more of the original columns
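
The mismatch can be made visible by comparing the names the model expects with the names each candidate newdata offers. This is a sketch on a base-R stand-in, assuming that a data.frame with a matrix column spc and a matrix with bare wavelength column names behave like as.data.frame(flu) and as.matrix(flu) respectively; missing_cols() is a hypothetical helper, not a hyperSpec function:

```r
# Toy stand-in data (not the flu data set).
set.seed(1)
wl  <- c("405", "405.5", "406", "406.5")
toy <- data.frame(conc = seq_len(6))
toy$spc <- matrix(rnorm(6 * length(wl)), nrow = 6,
                  dimnames = list(NULL, wl))
toy_pca <- prcomp(~ spc, data = toy)

# Column names predict.prcomp() will look for in newdata.
needed <- rownames(toy_pca$rotation)

# Hypothetical helper: which required columns does newdata lack?
missing_cols <- function(newdata) setdiff(needed, colnames(newdata))

# data.frame with a matrix column: the only names are "conc" and "spc".
missing_cols(toy)

# Plain spectra matrix: names are bare wavelengths without the prefix.
missing_cols(toy$spc)

# Either way predict() stops; capture the message instead of aborting.
msg <- tryCatch(predict(toy_pca, toy$spc),
                error = function(e) conditionMessage(e))
msg
```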

as.wide.df() can produce the right type of data.frame, though:

predict(PCA, as.wide.df(flu, wl.prefix = "spc"))
#>          PC1         PC2       PC3        PC4         PC5           PC6
#> 1 -2981.2355   3.3656437 -3.036947   5.186403   9.1582295 -3.301036e-13
#> 2 -1847.6202   0.1474497 -6.973606   1.617421 -11.5188178  4.123699e-13
#> 3  -572.9198   4.3060755  6.771751 -15.374293   0.5490714  4.745072e-13
#> 4   613.5676  -5.4921238 15.612317   8.056681  -1.9075938  7.245697e-14
#> 5  1758.9646 -18.2575816 -7.749809  -2.473730   2.9604848  1.027175e-15
#> 6  3029.2434  15.9305364 -4.623706   2.987518   0.7586259 -2.891355e-14

Slightly different column names for as.matrix() would make that work as well:

matrix_flu <- flu[[]]
colnames(matrix_flu) <- paste0("spc", colnames(matrix_flu))
predict (PCA, matrix_flu)
#>             PC1         PC2       PC3        PC4         PC5           PC6
#> [1,] -2981.2355   3.3656437 -3.036947   5.186403   9.1582295 -3.301036e-13
#> [2,] -1847.6202   0.1474497 -6.973606   1.617421 -11.5188178  4.123699e-13
#> [3,]  -572.9198   4.3060755  6.771751 -15.374293   0.5490714  4.745072e-13
#> [4,]   613.5676  -5.4921238 15.612317   8.056681  -1.9075938  7.245697e-14
#> [5,]  1758.9646 -18.2575816 -7.749809  -2.473730   2.9604848  1.027175e-15
#> [6,]  3029.2434  15.9305364 -4.623706   2.987518   0.7586259 -2.891355e-14
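
Until the defaults change, the renaming step can be wrapped in a small helper. The sketch below uses base R only; prefix_colnames() is a hypothetical name, and the toy matrix stands in for flu[[]]:

```r
# Hypothetical helper: prepend a prefix to the column names of a matrix.
prefix_colnames <- function(x, prefix = "spc") {
  stopifnot(is.matrix(x), !is.null(colnames(x)))
  colnames(x) <- paste0(prefix, colnames(x))
  x
}

# Toy stand-in data (not the flu data set).
set.seed(1)
wl  <- c("405", "405.5", "406", "406.5")
toy <- data.frame(conc = seq_len(6))
toy$spc <- matrix(rnorm(6 * length(wl)), nrow = 6,
                  dimnames = list(NULL, wl))
toy_pca <- prcomp(~ spc, data = toy)

# With prefixed names, predict() accepts the spectra matrix.
scores <- predict(toy_pca, prefix_colnames(toy$spc))

# Predicting on the training spectra should reproduce the stored scores.
same_scores <- isTRUE(all.equal(unname(scores), unname(toy_pca$x)))
same_scores
```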

What to do?

  1. Should we change the default in as.wide.df() to wl.prefix = "spc", so that predict(PCA, as.wide.df(flu)) works by default?
  2. For models that are trained only on the spectra matrix, making as.matrix() and x[[]] return column names prefixed with spc would make predict(PCA, flu[[]]) work.
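
One point relevant to option 1: predict.prcomp() selects the columns it needs by name, so extra columns in a wide data.frame should not get in the way. The following base-R sketch assumes a wide layout with one spc-prefixed column per wavelength plus an extra column, roughly what as.wide.df(flu, wl.prefix = "spc") is shown to produce above; the exact layout of the real function's output is an assumption here:

```r
# Toy stand-in data (not the flu data set).
set.seed(1)
wl  <- c("405", "405.5", "406", "406.5")
toy <- data.frame(conc = seq_len(6))
toy$spc <- matrix(rnorm(6 * length(wl)), nrow = 6,
                  dimnames = list(NULL, wl))
toy_pca <- prcomp(~ spc, data = toy)

# Wide data.frame: one column per wavelength, prefixed with "spc",
# plus an extra (non-spectral) column.
wide <- as.data.frame(toy$spc)
names(wide) <- paste0("spc", wl)
wide$conc <- toy$conc

# predict.prcomp() picks the columns named in the rotation matrix
# and ignores `conc`.
scores_wide <- predict(toy_pca, wide)
ok <- isTRUE(all.equal(unname(scores_wide), unname(toy_pca$x)))
ok
```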
