-
Notifications
You must be signed in to change notification settings - Fork 212
Only support dumping named, non-ivar Data #764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
JRuby does not expose the `initialize` method for a given Data instance to native code, so there's no equivalent to calling `rb_struct_initialize`. This change moves that initialization to a Ruby `__send__` and removes the native stub method `init_data`. Fixes broken JRuby `Data` logic after ruby#692.
|
cc @nevans @tenderlove @eregon for discussion |
|
This is actually quite similar to the first approach & commit in #692: 788b844. That was refined in later commits, I believe due to trying to support more cases of Data classes.
Foo = Data.define(:width, :height, :area) do
def initialize(width:, height:)
super(width: width, height: height, area: width * height)
end
endThis wouldn't work when deserializing with
@nevans I believe you have more background on those and maybe other issues, could you share them here? |
But we do support loading them: see This may not be the best use case, but I've personally used this as a strategy to quickly import old or external data into a different context, by simply scrubbing the class names from the tags. Obviously there are other approaches, with other tradeoffs, but this works. |
This is only for legacy support. We should not use this to justify adding support for dumping anonymous Data now.
Such data should be imported with an older, compatible version that supports it, and then re-dumped as the new supported form. We can't endlessly support old data formats when they become fully incompatible with modern dump output. |
|
I don't have much to add to @eregon's comments (and the comment he linked to). Obviously there are always some caveats and issues when dealing with serialization and deserialization of any custom objects. But, IMO, deserialization should at least be able to cope with custom I doubt that setting ivars on data objects is very common. But it should be supported (just as it is for Struct or any other object). |
No other type of object is born frozen, so this argument is moot. Since Data must be initialized once and immediately frozen, and there's no protocol to hook into this and set up instance variables, Data with instance variables cannot be properly dumped and loaded. |
There's no deprecation warning in the code. Changing something like this deserves to have a deprecation period, IMO. |
Since there is no way to generically invoke
Same argument applies here. Custom initialize logic that hooks into Data's initialize chain cannot be generically re-invoked during loading, so such objects cannot be dumped. |
If you mean deprecation warning for Data with instance variables, I would argue it should never have been supported or released. |
Maybe it can, maybe it can't. Psych will dump any regular old ruby object you give it. And it'll handle ivars on them in almost the exact same way it does with all the core object types: by using |
Because we cannot hook into the initialize sequence from the Psych loading code without invoking that sequence during allocation, and custom initialize methods that do more than just pass through Data fields cannot be generically invoked for all cases. |
|
Is there some particular philosophical reason why YAML shouldn't be able to support what Marshal does (this includes ivars and bypassing |
Marshal has custom protocols built into each runtime for accessing and initializing the internal state of objects during allocation. Psych does not. Until that changes, Psych should only support "simple" Data types. |
|
Here's another way to look at this: The contract of Data is that it is constructed, initialized, and frozen in one step when calling Psych cannot hook into this sequence because there's no protocol for it to do so. The previous logic relied on lifecycle hacks, primarily from native code, to manipulate an uninitialized Data object before it could actually initialize itself. This breaks the contract of Data. We must not:
A possible workaround for this would be to implement a Psych protocol for custom Data subclasses that want to be able to dump and load, similar to what Marshal provides for other custom Ruby types. class MyData << Data.define(:foo, :bar)
def initialize(foo, baz)
@baz = baz
@this = self
super(foo, foo + 1)
end
def psych_dump
{foo: foo, baz: @baz}
end
def self.psych_load(hash)
MyData.new(hash[:foo], hash[:baz])
end
end(Edit: added more logic to show incoming ivars as part of the protocol and move the ivar initialization above the This would be equivalent to how many other classes use custom Marshal logic when their custom initialize logic cannot be replicated in a simple dump/load sequence. |
|
Also, just so my biases are laid bare:
Data should have had the same constraints as a Record in Java, and then we wouldn't have to debate how to support those weird edge cases. |
|
I'm not sure how or why Data's "contract" is any different from any other object in this regard? It's not normal to bypass Now I have the problem that I can't update any data members with any existing ruby method... but there is a method that's part of Ruby's public C API that can do it. |
Because Data is supposed to be born frozen, while no other object has this as part of their contract (except for objects that don't allow instance variable state, which also supports my argument). Your example never invokes the D object's initialize, so it is not valid. And because you cannot invoke D's initialize in a generic way, there's no way to proceed from here. Hacking around in the C code to call CRuby-specific internal methods that initialize D without calling rb_hash_foreach(argv[0], struct_hash_set_i, (VALUE)&arg);
// Freeze early before potentially raising, so that we don't leave an
// unfrozen copy on the heap, which could get exposed via ObjectSpace.
OBJ_FREEZE(self);
if (arg.unknown_keywords != Qnil) {
rb_exc_raise(rb_keyword_error_new("unknown", arg.unknown_keywords));
}An unfrozen Data object should never be referenced or manipulated from Ruby, and therefore there's no way to support initializing instance variables in a generic way during loading. |
|
Additional argument against using In JRuby, Data and Struct are completely disjoint types, including at the native level, and there is no equivalent hack to early-initialize a Data object's fields, because that violates the contract of Data. |
|
Further justification: the reason you cannot initialize a Data yourself is because it was designed that way. static VALUE
rb_data_initialize_m(int argc, const VALUE *argv, VALUE self)The initialization sequence of Data was never designed to be invoked from outside of this C code, and therefore there's no way to initialize a Data object other than to follow the existing protocol. That protocol does not support initializing instance variables on an already-visibile-but-not-yet-initialized Data object unless you are inside a custom initialize. Since we can't generically invoke all possible custom initialize methods, and can't consistently pass through instance variables to allow such initializaiton, those use cases cannot be supported. CRuby could certainly change the Data initialization protocol to support generic object serialization frameworks, but as it exists today we cannot do this without violating the explicitly defined lifecycle of Data. |
class C
attr_reader :a
def initialize
@a = "a"
freeze
end
endThis class has it as part of its contract, in the exact same way as Data: it's the last thing done in the initializer. I agree that using Wait a minute... I could've sworn I'd tried this before. |
Nothing suprising there. Just calling I was pretty certain I'd tried that when first implementing #692, and got exceptions. |
|
First off, I think it's a bug that a Data class has a visible allocate method, and I would argue that should be fixed. That explicitly violates the contract mentioned in the C code I pasted above. An unfrozen Data should never be allowed "into the wild" because it can leak out to other consumers and end up with state it was never designed to have. If we assume that it's not a bug and won't be removed, then your hack of using the In the interest of getting this working on JRuby, I'd be fine with you pushing a PR to remove all custom C code related to this feature and instead using the Method trick. I don't feel great about it, with regards to the Data contract, but until there's a formal way to support serialization of Data (other than Marshal) it's perhaps good enough. So:
I still believe we should not introduce new problems by allowing anonymous Data to be dumped (or loaded) because that clearly is not valid; the resulting obj.class from one anonymous Data instance in the yaml stream will never be
This is the gross case and should only be used when an object has a custom initialize method (since that's the only way they could set instance variables). So you need to compare
Plain old Data objects that do not have custom initializers and do not set instance variables should not be penalized by those that do. Just construct them normally, without registering during loading. |
|
I justed tested on 3.2.6, so I must've misremembered some experiment or made some other mistake when I first implemented #692. With this data class definition, demonstrating ivars and different kwargs from super: irb(main):001* D = Data.define(:foo, :bar, :baz) do
irb(main):002* attr_reader :ivar
irb(main):003* def initialize(foo:, bar: foo * 2)
irb(main):004* @ivar = "wat"
irb(main):005* super(foo:, bar:, baz: "etc")
irb(main):006* end
irb(main):007> endTo demonstrate the "normal" way of using it: And to demonstrate that explicitly calling As for any special "contracts" may have that set it apart from other classes and objects, I still don't see why this should be any different from e.g other core classes like Array or String with instance variables, or any other custom class that calls "freeze" at the end of their initialize method. |
Why make the comparison when |
|
FWIW the hack doesn't work on JRuby either, since we never expected users to be able to call I will look into fixing this on the JRuby side. |
The performance of |
|
I suppose the question is how much is the performance for this: data = klass.allocate
ivars&.each do |ivar, v|
data.instance_variable_set ivar, v
end
DATA_INITIALIZE.bind_call(o, **members)vs this: if (ivars.nil? || ivars.empty?) && DATA_INITIALIZE.equal?(klass.instance_method(:initialize))
klass.new(**members)
else
data = klass.allocate
ivars.each do |ivar, v|
data.instance_variable_set ivar, v
end
DATA_INITIALIZE.bind_call(data, **members)
data
endAnd how significant is that performance difference? Without running benchmarks, I'm inclined to create the simpler PR. And I need to punt on the performance benchmarks for now (because I have other things I was supposed to be working on today!) |
|
Maybe the |
Actually I think you're right, assuming the method hack will never be used for "simple" Data types. Data types without instance variables dump as Data types that have instance variables should get dumped as |
|
@eregon Right. Maybe the TruffleRuby case was what I was remembering? I've definitely looked at TruffleRuby's Data implementation.
IMO, TruffleRuby should have a (de-optimized) fallback for |
I don't particularly like the hack but I'd be willing to add this if necessary.
You can't generically invoke a subclass's custom |
|
@headius Actually, I think that calling |
Actually this wasn't quite accurate. JRuby does define an So your example works fine on JRuby if I only use the |
This commit removes support for the following features: * Dumping an anonymous Data. This is unsupported for the same reasons as dumping anonymous modules or classes: there's no way to reference the proper Data class, and defining a new anonymous Data for every such element would not round-trip properly. * Dumping a Data with instance variables. Instance variables on a Data can only be set during the initialize chain when first constructed, and such a chain is custom for each such Data class. As there is no way to modify a Data after constructed and no way to accurately invoke the custom initialize method, all Data with instance variables are rejected. Data is also no longer registered while loading, since it's not possible to have a self-referential Data instance without a custom initialize that sets an instance variable to self. The combination of these removed features simplifies Data dumping and loading and eliminates all allocate+initialize hacks.
|
I repushed with the removed tests I forgot to include before, but it sounds like this PR may be rejected in favor of a less hacky version of the old logic. @nevans I can put that together if you are busy with other things today. |
|
JRuby doesn't actually even support Data subclasses with instance variables right now: This is a bug. I have the rest of the implementation passing on CRuby, though. |
|
FWIW I believe we still need buy-in from @eregon to go forward with any solution, so I'm not closing this PR just yet. I will push a separate PR that uses the method hack. |
|
Sorry, was in a meeting. Just pushed my own branch with very minimal change. It's maybe not much different from yours. But we'd need some way to make it work for TruffleRuby too. |
@headius I think this contract is something you would like, but I don't think any such contract is established.
> Data.define(:a,:b) { def initialize(**); @a = 42; end }.new.frozen?
=> false(and that could use In many comments you say there is no way to initialize a Data instance fields, but there is, it's You also mention that the Data subclass array = [1, 2]
d = MyData.new(x: array)
array << 3 # perfectly valid, and it would be surprising if that state is lost
reloaded = Psych.load Psych.dump([array, d])
reloaded[0].should.equal?(reloaded[1].x)
reloaded[0].should == [1, 2, 3]As @nevans says, for regular objects, whether they freeze in constructor or not, it's the same, it's not OK to call TLDR: I see your POV, but I don't think it's good to limit ourselves to such a small subset of Data classes. The current behavior (+ #761) works for AFAIK all Data classes, and that certainly seems nice from the POV of a Psych user. |
|
@eregon I think you missed all of the discussion above where I concede several of these points. |
Data.instance_method(:initialize).bind_call(d, {foo: 1})Aside: interesting that that works, it probably shouldn't because it should only accept keyword arguments. |
|
Mmh, OK, I'll add For context there is truffleruby/truffleruby#3892 which I think is too bad of a performance a trade-off. But I think perf-wise it is likely quite a bit worse than #761 which does not use a Hash and does not do Hash lookups but just 2 arrays (members & values). |
|
Let's merge #765 instead. |
This PR is meant to be applied on top of #763. It could be reformulated to skip that commit altogether.
In this PR I remove the following features:
The previous logic supported this case by defining a new Data class every time an anonymous instance was encountered in the YAML stream. As this is clearly not round-tripping the object back to its original class, and would not even deserialize two of the same anonymous data objects as the same class, it is not valid.
This change has precedent: we do not support dumping singleton objects or anonymous Structs, since there's no way to restore the reference to the original class.
The previous logic attempted to support this case by allocating an uninitialized Data object and later calling
initializevia internal mechanisms (either a C call torb_struct_initializeor a reflective call via__send__. Because these hacks violate the contract of Data (initialized once and immediately frozen) and because a custom Data subclass with its owninitializeand instance variable setup can't be generically constructed, such Datas cannot be dumped.Removal of these two features greatly simplifies Data dumping and loading:
I have removed the related tests and this passes on all implementations.