@kmanpearl

What: ensure that when workers=1 the run is deterministic.

Why: reproducibility

How: set numba workers and generate/use random seeds

Implementation changes:

  1. Removed numba.get_thread_id because it is not deterministic.
  2. Replaced it with _get_random_seeds(), which uses the random_state attribute to create an array of seeds that are used in the _random_walks() method.
  3. Added set_num_threads(workers) to ensure that numba parallelization uses the requested number of workers.
  4. Changed the default value of random_seed because a value is required for this fix to work.
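
The seeding idea in items 1 and 2 can be sketched without numba. This is a minimal illustration under stated assumptions, not the code from this PR: `get_random_seeds`, `toy_walk`, and `simulate` are hypothetical stand-ins, and the walk runs on a toy complete graph. Each walk draws only from its own seed, so its result does not depend on which thread runs it or in what order the walks execute.

```python
import numpy as np


def get_random_seeds(random_state, num_walks):
    """Hypothetical helper: derive one integer seed per walk from a master seed."""
    rng = np.random.RandomState(random_state)
    return rng.randint(0, np.iinfo(np.int32).max, size=num_walks)


def toy_walk(start, walk_length, num_nodes, seed):
    """Toy uniform walk on a complete graph, driven only by its own seed."""
    rng = np.random.RandomState(seed)
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(int(rng.randint(num_nodes)))
    return walk


def simulate(random_state, order):
    """Run the walks in the given execution order and return them by walk index."""
    seeds = get_random_seeds(random_state, len(order))
    walks = {}
    for i in order:  # the order stands in for an arbitrary thread schedule
        walks[i] = toy_walk(i % 5, 8, 5, int(seeds[i]))
    return [walks[i] for i in sorted(walks)]


# Same master seed, different execution order: the walks are identical.
print(simulate(42, [0, 1, 2, 3]) == simulate(42, [3, 2, 1, 0]))
```

Indexing seeds by walk rather than by thread is what makes the output independent of scheduling; a seed looked up by thread ID would change whenever the thread-to-walk assignment changes.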

@ChristopherMancuso ChristopherMancuso left a comment

I checked this out using the API and the CLI, and both work for me. To test, I used:

from pecanpy import pecanpy as node2vec
import numpy as np

emb_results = {}
for idx, num_workers in enumerate([1,1,10,10]):
    # initialize node2vec object, similarly for SparseOTF and DenseOTF
    g = node2vec.PreComp(p=0.5, q=1, workers=num_workers, verbose=True)
    # alternatively, can specify ``extend=True`` for using node2vec+

    edge_fp = "demo/BIOGRID.el"

    # load graph from edgelist file
    g.read_edg(edge_fp, weighted=False, directed=False)
    # precompute and save 2nd order transition probs (for PreComp only)
    g.preprocess_transition_probs()

    # generate random walks, which could then be used to train w2v
    walks = g.simulate_walks(num_walks=10, walk_length=80)

    # alternatively, generate the embeddings directly using ``embed``
    emd = g.embed()
    print(emd)
    emb_results[f"run_{idx}"] = emd

print(emb_results["run_0"])

print(np.array_equal(emb_results["run_0"],emb_results["run_1"]))
print(np.array_equal(emb_results["run_0"],emb_results["run_2"]))
print(np.array_equal(emb_results["run_0"],emb_results["run_3"]))
print(np.array_equal(emb_results["run_1"],emb_results["run_2"]))
print(np.array_equal(emb_results["run_1"],emb_results["run_3"]))
print(np.array_equal(emb_results["run_2"],emb_results["run_3"]))

Then I ran:

pecanpy --input demo/BIOGRID.el --output demo/BIOGRID1_r1.emb --mode SparseOTF --workers 1
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID1_r2.emb --mode SparseOTF --workers 1
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID10_r1.emb --mode SparseOTF --workers 10
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID10_r2.emb --mode SparseOTF --workers 10

Then I ran:

import numpy as np

data1 = np.loadtxt("demo/BIOGRID1_r1.emb",skiprows=1)
data2 = np.loadtxt("demo/BIOGRID1_r2.emb",skiprows=1)
data3 = np.loadtxt("demo/BIOGRID10_r1.emb",skiprows=1)
data4 = np.loadtxt("demo/BIOGRID10_r2.emb",skiprows=1)

print(np.array_equal(data1,data2))
print(np.array_equal(data1,data3))
print(np.array_equal(data1,data4))
print(np.array_equal(data2,data3))
print(np.array_equal(data2,data4))
print(np.array_equal(data3,data4))

Only when workers was 1 were the arrays equal to each other.
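
The six pairwise checks above can be written more compactly. The sketch below is one possible way to do it, not part of the original test: `sort_by_node_id` and `pairwise_equal` are hypothetical helpers, and the sorting step assumes the first column of each embedding array is a numeric node ID. Sorting is an extra precaution that only matters if the row order can differ between runs.

```python
from itertools import combinations

import numpy as np


def sort_by_node_id(emb):
    """Sort rows by the first column (assumed to hold the node ID)."""
    return emb[np.argsort(emb[:, 0], kind="stable")]


def pairwise_equal(embeddings):
    """Map each pair of run names to whether their embeddings match exactly."""
    results = {}
    for (name_a, emb_a), (name_b, emb_b) in combinations(embeddings.items(), 2):
        results[(name_a, name_b)] = np.array_equal(
            sort_by_node_id(emb_a), sort_by_node_id(emb_b)
        )
    return results


# Small in-memory example; the first column plays the role of the node ID.
run_a = np.array([[1, 0.1, 0.2], [2, 0.3, 0.4], [3, 0.5, 0.6]])
run_b = run_a[[2, 0, 1]]  # same rows, different order
run_c = run_a.copy()
run_c[0, 1] += 1e-3       # one value differs

for pair, equal in pairwise_equal({"a": run_a, "b": run_b, "c": run_c}).items():
    print(pair, equal)
```

For the CLI outputs, each array could be loaded with `np.loadtxt(path, skiprows=1)` as in the script above and passed in under its file name.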

@RemyLau added the duplicate label May 1, 2025
@RemyLau
Contributor

RemyLau commented May 1, 2025

See #70

Adding _get_random_seeds won't help with getting the embedding results to be deterministic, as the randomness comes from gensim's word2vec (#70 (comment)). The random walk generation is already reproducible given a random state as tested by a unit test:

class TestWalk(unittest.TestCase):

@kmanpearl
Author

kmanpearl commented May 1, 2025

@RemyLau I used random_state and set PYTHONHASHSEED, but I could not get the same results twice. With the changes I implemented, I now do, as long as workers=1.
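
For context on PYTHONHASHSEED: it fixes Python's per-process string hash randomization, which gensim's documentation lists, together with a fixed seed and a single worker thread, as a requirement for reproducible Word2Vec runs across interpreter launches. The sketch below only demonstrates the hash behaviour in fresh interpreters. It does not call PecanPy or gensim, and the helper name and the `node_42` string are made up for illustration.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed):
    """Launch a new Python process with PYTHONHASHSEED set and return hash('node_42')."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    completed = subprocess.run(
        [sys.executable, "-c", "print(hash('node_42'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout.strip())


# Two separate interpreter launches with the same hash seed agree.
first = string_hash_in_fresh_interpreter(0)
second = string_hash_in_fresh_interpreter(0)
print(first == second)
```

The variable has to be in the environment before the interpreter starts; assigning to `os.environ` inside an already running script does not change that process's string hashes.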

@ChristopherMancuso

@RemyLau I tested the changes @kmanpearl made as well, and as long as workers=1 the embeddings are now reproducible. There was no way to get that before, so whatever changes she made help in that regard. I still believe these changes should be merged soon so she can use this for her project.

@RemyLau
Contributor

RemyLau commented May 3, 2025

> @RemyLau I tested the changes @kmanpearl made as well, and as long as workers=1 the embeddings are now reproducible. There was no way to get that before, so whatever changes she made help in that regard. I still believe these changes should be merged soon so she can use this for her project.

That's interesting. We should add this to the tests in that case. I previously checked and confirmed reproducibility when running under a single thread using the karate network (see test script): the resulting embeddings from two runs are identical. If, for some reason, this does not hold in other cases, then I think there is a deeper issue we need to figure out.

My guess is that when you previously ran the program twice for the reproducibility check, you did not explicitly set the --random_state parameter. Is that true? By default, this option is set to None, which does not ensure that the same random seed is used across runs. This default is intentional, so that the embeddings are not deterministic unless requested. However, if reproducibility is desired, the random state can be set explicitly (along with setting the number of threads to one) to achieve that.
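
The difference between the default and an explicit seed can be illustrated with NumPy alone. This is a sketch of general seeding behaviour, not PecanPy's internals, and `draw` is a hypothetical helper.

```python
import numpy as np


def draw(random_state, n=5):
    """Draw n integers from a generator seeded with random_state."""
    # Passing None makes NumPy seed the generator from fresh OS entropy.
    rng = np.random.RandomState(random_state)
    return rng.randint(0, 1000, size=n)


# An explicit seed gives the same draws every time.
print(np.array_equal(draw(0), draw(0)))

# With None, each generator is seeded independently, so the two draws
# are not guaranteed to match (and in practice almost never do).
print(np.array_equal(draw(None), draw(None)))
```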

@kmanpearl
Author

@RemyLau I did use a random_seed and also set PYTHONHASHSEED before running the script. I basically just ran all the lines in demo/reporoducibility.sh individually and my two embedding matrices were not the same.
