Hi all,
First of all, nice tool, congrats!
I've been trying it out recently and ran into some issues when using it with PyArrow >= 2.0.0. Take the serialization of a random array:
import fleetfmt
import pathlib
import numpy as np
aaa = np.random.rand(10000,10000)
print("Writing numpy content to a new Fleet file.")
with pathlib.Path("test.fleet").open('wb') as fhandle, fleetfmt.FileWriter(fhandle) as writer:
    for i, value in enumerate(aaa):
        writer.append(i, value)
print("Done.")
This raises warnings related to pyarrow.serialize and pyarrow.deserialize:
FutureWarning: 'pyarrow.serialize' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
This can be solved by making a few small changes in the code. In writer.py:
import pickle
[...]
hbuf = SCHEMA_HEAD_SERDES.to_bytes(len(sbuf)) # Line 97
[...]
buf = pickle.dumps(record,protocol=5) # Line 106
head = RECORD_HEAD_SERDES.to_bytes(len(buf)) # Line 107
[...]
kbuf = pickle.dumps(self._keymap,protocol=5) # Line 119
In reader.py:
import pickle
[...]
rec = pickle.loads(buf) # Line 83
[...]
self._schema = pa.ipc.read_schema(wrap) # Line 96
And in base.py:
import pickle
[...]
keymapdes = pickle.loads(keymapser) # Line 63
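To illustrate the pattern these changes rely on, here is a minimal, self-contained sketch of a length-prefixed pickle round trip. It is not fleetfmt's actual code: HEAD, write_record and read_record are hypothetical names, and the 8-byte header is an assumption standing in for whatever RECORD_HEAD_SERDES really encodes. Note that pickle protocol 5 requires Python 3.8 or newer.

```python
import io
import pickle
import struct

# Hypothetical 8-byte little-endian length header; the real
# RECORD_HEAD_SERDES layout in fleetfmt may differ.
HEAD = struct.Struct("<Q")


def write_record(stream, record):
    """Pickle one record and write it with a length prefix."""
    buf = pickle.dumps(record, protocol=5)
    stream.write(HEAD.pack(len(buf)))
    stream.write(buf)


def read_record(stream):
    """Read one length-prefixed record, or return None at end of stream."""
    head = stream.read(HEAD.size)
    if len(head) < HEAD.size:
        return None
    (size,) = HEAD.unpack(head)
    return pickle.loads(stream.read(size))


# Write two (key, value) records, then read them back in order.
stream = io.BytesIO()
rows = [(0, [0.1, 0.2]), (1, [0.3, 0.4])]
for row in rows:
    write_record(stream, row)

stream.seek(0)
restored = []
while (rec := read_record(stream)) is not None:
    restored.append(rec)
```

The same functions should work unchanged on a file opened in binary mode, and on NumPy rows, since NumPy arrays are picklable.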
The change also improves performance. Running timeit on the 10000x10000 array above, pickle takes roughly half the time of pyarrow:
315 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # With pyarrow
161 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # With pickle protocol 5
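For anyone who wants to reproduce a comparison like this, a rough sketch of a timing harness follows. It is an assumption about how one might measure it, not the exact command used for the numbers above: time_serializer is a hypothetical helper, the payload is a plain list of floats rather than the NumPy array, and it reports the best run instead of mean ± std. dev. Only the pickle side is shown; another serializer can be timed by passing its own dumps/loads pair.

```python
import pickle
import timeit


def time_serializer(dumps, loads, payload, repeat=3, number=5):
    """Return the best per-call round-trip time in seconds."""

    def round_trip():
        loads(dumps(payload))

    # timeit.repeat returns the total time of `number` calls for each run.
    runs = timeit.repeat(round_trip, repeat=repeat, number=number)
    return min(runs) / number


# Stand-in payload; swap in the real data to get comparable numbers.
payload = [float(i) for i in range(100_000)]

best = time_serializer(
    lambda obj: pickle.dumps(obj, protocol=5),
    pickle.loads,
    payload,
)
print(f"pickle protocol 5: {best * 1e3:.3f} ms per round trip")
```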