While I still think there is work to do on making the CUDA code faster, the core and bindings are probably mature enough to turn into a proper Python package.