seeding and workers #347
Conversation
for more information, see https://pre-commit.ci
ChristopherMancuso
left a comment
I checked this out using the API and the CLI, and both work for me. To test, I used:
from pecanpy import pecanpy as node2vec
import numpy as np

emb_results = {}
for idx, num_workers in enumerate([1, 1, 10, 10]):
    # initialize node2vec object, similarly for SparseOTF and DenseOTF
    g = node2vec.PreComp(p=0.5, q=1, workers=num_workers, verbose=True)
    # alternatively, can specify ``extend=True`` for using node2vec+
    edge_fp = "demo/BIOGRID.el"
    # load graph from edgelist file
    g.read_edg(edge_fp, weighted=False, directed=False)
    # precompute and save 2nd order transition probs (for PreComp only)
    g.preprocess_transition_probs()
    # generate random walks, which could then be used to train w2v
    walks = g.simulate_walks(num_walks=10, walk_length=80)
    # alternatively, generate the embeddings directly using ``embed``
    emd = g.embed()
    print(emd)
    emb_results[f"run_{idx}"] = emd

print(emb_results["run_0"])
print(np.array_equal(emb_results["run_0"], emb_results["run_1"]))
print(np.array_equal(emb_results["run_0"], emb_results["run_2"]))
print(np.array_equal(emb_results["run_0"], emb_results["run_3"]))
print(np.array_equal(emb_results["run_1"], emb_results["run_2"]))
print(np.array_equal(emb_results["run_1"], emb_results["run_3"]))
print(np.array_equal(emb_results["run_2"], emb_results["run_3"]))
Then I ran:
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID1_r1.emb --mode SparseOTF --workers 1
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID1_r2.emb --mode SparseOTF --workers 1
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID10_r1.emb --mode SparseOTF --workers 10
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID10_r2.emb --mode SparseOTF --workers 10
Then I ran:
import numpy as np
data1 = np.loadtxt("demo/BIOGRID1_r1.emb",skiprows=1)
data2 = np.loadtxt("demo/BIOGRID1_r2.emb",skiprows=1)
data3 = np.loadtxt("demo/BIOGRID10_r1.emb",skiprows=1)
data4 = np.loadtxt("demo/BIOGRID10_r2.emb",skiprows=1)
print(np.array_equal(data1,data2))
print(np.array_equal(data1,data3))
print(np.array_equal(data1,data4))
print(np.array_equal(data2,data3))
print(np.array_equal(data2,data4))
print(np.array_equal(data3,data4))
Only when workers was 1 were the arrays equal to each other.
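The six hand-written comparisons above can be generated in one pass. This is a hypothetical helper (not part of pecanpy) that compares every pair of runs at once; the toy arrays below stand in for the real embedding results:

```python
from itertools import combinations
import numpy as np

def pairwise_equal(results):
    """Return {(name_a, name_b): bool} for every pair of named arrays."""
    return {
        (a, b): np.array_equal(results[a], results[b])
        for a, b in combinations(sorted(results), 2)
    }

# Toy stand-ins for the four runs above: run_0 and run_1 match, the rest differ.
runs = {
    "run_0": np.zeros((3, 2)),
    "run_1": np.zeros((3, 2)),
    "run_2": np.ones((3, 2)),
    "run_3": np.arange(6.0).reshape(3, 2),
}
print(pairwise_equal(runs))
```

With four runs this yields all six pairs, so a run-count change does not require editing the comparison code.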
See #70 (Line 86 in 71dd988).
@RemyLau I used
@RemyLau I tested the changes @kmanpearl made as well, and as long as workers=1 the embeddings are now reproducible, which was not possible before, so her changes help in that regard. I still believe these changes should be merged soon so she can use this for her project.
That's interesting. We should add this to the test in that case. I've previously checked and confirmed reproducibility when running under a single thread using the karate network (see the test script): the resulting embeddings from two runs are identical. If, for some reason, this is not true in other cases, then I think there is a deeper issue we need to figure out. My guess is that previously, when you ran the program twice for the reproducibility check, you did not explicitly set the
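The reproducibility point above can be illustrated without pecanpy. In this sketch, fake_walks is a hypothetical stand-in for a walk generator: two runs are only guaranteed identical when the seed is set explicitly.

```python
import numpy as np

def fake_walks(seed=None, n=5):
    """Hypothetical stand-in for a random-walk generator."""
    rng = np.random.default_rng(seed)  # explicit seed -> deterministic stream
    return rng.integers(0, 100, size=n)

seeded_a = fake_walks(seed=42)
seeded_b = fake_walks(seed=42)
unseeded_a = fake_walks()
unseeded_b = fake_walks()

print(np.array_equal(seeded_a, seeded_b))      # seeded runs match
print(np.array_equal(unseeded_a, unseeded_b))  # unseeded runs almost surely differ
```

Without an explicit seed, default_rng pulls fresh OS entropy on every call, so two runs of the same program are expected to produce different streams.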
@RemyLau I did use a
What: ensure that when workers=1 the run is deterministic.
Why: reproducibility.
How: set numba workers and generate/use random seeds.
Implementation changes:
- numba.get_thread_id, because it is not deterministic
- _get_random_seeds(), which uses the random_state attribute to create an array of seeds that are used in the _random_walks() method
- set_num_threads(workers), to ensure that numba parallelization uses workers
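The seeding scheme described above can be sketched in pure numpy. This is an illustrative sketch, not pecanpy's actual implementation: the names get_random_seeds and random_walks mirror the _get_random_seeds() and _random_walks() methods mentioned in the description, under the assumption that random_state spawns one independent seed per walk.

```python
import numpy as np

def get_random_seeds(random_state, num_walks):
    # mirrors the described _get_random_seeds(): the random_state attribute
    # deterministically produces one seed per walk
    rng = np.random.default_rng(random_state)
    return rng.integers(0, 2**32 - 1, size=num_walks)

def random_walks(seeds, walk_length=5, num_nodes=10):
    # each walk draws from its own seeded generator instead of a shared,
    # thread-dependent stream (the numba.get_thread_id issue noted above)
    walks = []
    for seed in seeds:
        rng = np.random.default_rng(int(seed))
        walks.append(rng.integers(0, num_nodes, size=walk_length))
    return np.stack(walks)

seeds = get_random_seeds(random_state=0, num_walks=4)
run1 = random_walks(seeds)
run2 = random_walks(get_random_seeds(random_state=0, num_walks=4))
print(np.array_equal(run1, run2))  # identical seeds -> identical walks
```

Because the seed array depends only on random_state, two single-worker runs with the same random_state replay the same per-walk generators and therefore produce identical walks.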