Conversation
```lua
:add(nn.GPU(nn.Linear(10000,10000), 1))
:add(nn.GPU(nn.Linear(10000,10000), 2))
:add(nn.GPU(nn.Linear(10000,10000), 3))
:add(nn.GPU(nn.Linear(10000,10000), 4, cutorch.getDevice()))
```
I am wondering if this line can run in CPU-only environments.
The cutorch.getDevice() call will not. But it isn't mandatory to use cutorch.getDevice(); this is just an example.
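A CPU-safe way to write that last line (a sketch, not part of the PR; `ok`, `outdevice`, and `mlp` are hypothetical names for illustration) would guard the cutorch call behind a `pcall`:

```lua
-- probe for cutorch without failing on CPU-only machines
local ok, cutorch = pcall(require, 'cutorch')
-- omit the optional outdevice argument (nil) when cutorch is absent
local outdevice = ok and cutorch.getDevice() or nil
mlp:add(nn.GPU(nn.Linear(10000,10000), 4, outdevice))
```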
Force-pushed from b1d68d6 to dba06e5.
I have tested the nn.GPU implementation on a language model task using 4 GPUs and it works.
This is cool, but the name and the order of parameters is a little confusing.

```lua
:add(nn.OnGPU(1, nn.Linear(10000,10000)))
:add(nn.OnGPU(2, nn.Linear(10000,10000)))
:add(nn.OnGPU(3, nn.Linear(10000,10000)))
:add(nn.OnGPU(4, cutorch.getDevice(), nn.Linear(10000,10000)))
```

may be clearer than

```lua
:add(nn.GPU(nn.Linear(10000,10000), 1))
:add(nn.GPU(nn.Linear(10000,10000), 2))
:add(nn.GPU(nn.Linear(10000,10000), 3))
:add(nn.GPU(nn.Linear(10000,10000), 4, cutorch.getDevice()))
```

What do you think?
@iamalbert I prefer nn.GPU, but I like your order. Except for the optional outdevice argument, which I think should still be last (as it is optional):

```lua
:add(nn.GPU(1, nn.Linear(10000,10000)))
:add(nn.GPU(2, nn.Linear(10000,10000)))
:add(nn.GPU(3, nn.Linear(10000,10000)))
:add(nn.GPU(4, nn.Linear(10000,10000), cutorch.getDevice()))
```
The order is …
@szagoruyko Good point. The argument of backwards compatibility always wins (plus I am lazy). I will leave it as it is.
@nicholas-leonard there are a couple of places like …
@szagoruyko how does it break tensor sharing (I don't know anyone that shares outputs)? As for optnet, what does it do that cannot support …
Hi, sorry for the delay in replying. The current underlying requirement in …

```lua
function MySelectModule:updateOutput(input)
   self.output = input[1] -- creates a new tensor which shares the storage
   return self.output
end
```
About this PR, as it's a generic module that supports both …
So I guess this PR is ready to merge then :)
```lua
self.modules[1] = module

if module:type() == 'torch.CudaTensor' then
   self:cuda()
end
```
This :cuda() is not executing in the context of "device"; needs a fix.
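One way to address this (a sketch, assuming the wrapper stores its target device in a `self.device` field) is to run the conversion inside `cutorch.withDevice`, which executes a function with the given device active and restores the previous device afterwards:

```lua
if module:type() == 'torch.CudaTensor' then
   -- run :cuda() with self.device active so the converted
   -- tensors are allocated on the wrapped module's device
   cutorch.withDevice(self.device, function() self:cuda() end)
end
```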
Thanks for your patience, Nicholas!
Just realized, we need to add support for other CUDA data types than float.
He seems to be checking for Cuda*Tensor, is that not sufficient?
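A Cuda*Tensor check along these lines (a sketch; the actual test in the PR may differ, and `isCudaType` is a hypothetical helper name) matches all CUDA tensor types by Lua string pattern instead of comparing against torch.CudaTensor only:

```lua
-- matches torch.CudaTensor, torch.CudaHalfTensor, torch.CudaDoubleTensor, etc.
-- (%. escapes the dot; .* covers the optional datatype infix)
local function isCudaType(module)
   return module:type():match('^torch%.Cuda.*Tensor$') ~= nil
end
```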

This PR adds `nn.GPU`, which can be used to distribute modules across multiple devices.

OMG why is it in nn and not in cunn?

As discussed with @soumith, putting it in nn means that it can be used in CPU-only environments, which is common for production.

The unit tests are located in cunn to avoid a `pcall(function() require 'cunn' end)` in the nn unit tests (see PR torch/cunn#282).