seeding and workers #347

kmanpearl · 2025-04-23T17:59:26Z

What: ensure that when workers=1 the run is deterministic.

Why: reproducibility

How: set numba workers and generate/use random seeds

Changes implemented:

Removed numba.get_thread_id because it is not deterministic.
Replaced it with _get_random_seeds(), which uses the random_state attribute to create an array of seeds that are used in the _random_walks() method.
Added set_num_threads(workers) to ensure that numba parallelization uses the requested number of workers.
Changed the default value of random_seed because a value is required for this fix to work.
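
The idea behind replacing thread IDs with per-walk seeds can be sketched without numba. The snippet below is only an illustration under stated assumptions: get_random_seeds, random_walk, and simulate_walks are hypothetical stand-ins (not the PecanPy methods), the walk is a plain first-order uniform walk rather than node2vec's second-order walk, and the `order` argument merely imitates an arbitrary thread schedule.

```python
import random


def get_random_seeds(random_state, num_seeds):
    """Derive one integer seed per walk from a single master seed."""
    master = random.Random(random_state)
    return [master.randrange(2**32) for _ in range(num_seeds)]


def random_walk(adj, start, walk_length, seed):
    """Uniform first-order walk driven by its own seeded generator."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def simulate_walks(adj, start_nodes, walk_length, random_state, order=None):
    """Generate one walk per start node; `order` mimics a thread schedule."""
    seeds = get_random_seeds(random_state, len(start_nodes))
    walks = [None] * len(start_nodes)
    indices = order if order is not None else range(len(start_nodes))
    for i in indices:
        # The seed is tied to the walk index, not to whichever worker runs it.
        walks[i] = random_walk(adj, start_nodes[i], walk_length, seeds[i])
    return walks


# Toy graph as an adjacency list (made-up data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
starts = [0, 1, 2, 3]

a = simulate_walks(adj, starts, 5, random_state=42)
b = simulate_walks(adj, starts, 5, random_state=42, order=[3, 1, 0, 2])
same_walks = a == b
print(same_walks)
```

Because each walk owns its seed, the result does not depend on the order in which the walks are processed, which is the property a thread-ID-based seed cannot guarantee.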

ChristopherMancuso

I checked this out using the API and the CLI, and both work for me. To test, I used

from pecanpy import pecanpy as node2vec
import numpy as np

emb_results = {}
for idx, num_workers in enumerate([1,1,10,10]):
    # initialize node2vec object, similarly for SparseOTF and DenseOTF
    g = node2vec.PreComp(p=0.5, q=1, workers=num_workers, verbose=True)
    # alternatively, can specify ``extend=True`` for using node2vec+

    edge_fp = "demo/BIOGRID.el"

    # load graph from edgelist file
    g.read_edg(edge_fp, weighted=False, directed=False)
    # precompute and save 2nd order transition probs (for PreComp only)
    g.preprocess_transition_probs()

    # generate random walks, which could then be used to train w2v
    walks = g.simulate_walks(num_walks=10, walk_length=80)

    # alternatively, generate the embeddings directly using ``embed``
    emd = g.embed()
    print(emd)
    emb_results[f"run_{idx}"] = emd

print(emb_results["run_0"])

print(np.array_equal(emb_results["run_0"],emb_results["run_1"]))
print(np.array_equal(emb_results["run_0"],emb_results["run_2"]))
print(np.array_equal(emb_results["run_0"],emb_results["run_3"]))
print(np.array_equal(emb_results["run_1"],emb_results["run_2"]))
print(np.array_equal(emb_results["run_1"],emb_results["run_3"]))
print(np.array_equal(emb_results["run_2"],emb_results["run_3"]))

then ran

pecanpy --input demo/BIOGRID.el --output demo/BIOGRID1_r1.emb --mode SparseOTF --workers 1
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID1_r2.emb --mode SparseOTF --workers 1
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID10_r1.emb --mode SparseOTF --workers 10
pecanpy --input demo/BIOGRID.el --output demo/BIOGRID10_r2.emb --mode SparseOTF --workers 10

then ran

import numpy as np

data1 = np.loadtxt("demo/BIOGRID1_r1.emb",skiprows=1)
data2 = np.loadtxt("demo/BIOGRID1_r2.emb",skiprows=1)
data3 = np.loadtxt("demo/BIOGRID10_r1.emb",skiprows=1)
data4 = np.loadtxt("demo/BIOGRID10_r2.emb",skiprows=1)

print(np.array_equal(data1,data2))
print(np.array_equal(data1,data3))
print(np.array_equal(data1,data4))
print(np.array_equal(data2,data3))
print(np.array_equal(data2,data4))
print(np.array_equal(data3,data4))

Only when workers was 1 were the arrays equal to each other.
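
The six pairwise checks above could also be generated rather than written out by hand. The following is a rough sketch, not part of the PR: load_emb and compare_all are hypothetical helpers, the file layout (one header line, then "<node_id> <v1> <v2> ..." rows) is assumed from the skiprows=1 call above, and the tiny files are made-up stand-ins for the real .emb outputs.

```python
import os
import tempfile
from itertools import combinations


def load_emb(path):
    """Read a word2vec-style text file into {node_id: vector}.

    Assumes one header line followed by rows of
    "<node_id> <v1> <v2> ...".
    """
    with open(path) as fh:
        next(fh)  # skip the header line, like skiprows=1
        rows = (line.split() for line in fh if line.strip())
        return {parts[0]: tuple(float(x) for x in parts[1:]) for parts in rows}


def compare_all(paths):
    """Return {(path_a, path_b): bool} for every unordered pair of files."""
    embs = {p: load_emb(p) for p in paths}
    return {(a, b): embs[a] == embs[b] for a, b in combinations(paths, 2)}


# Tiny made-up files standing in for the real .emb outputs.
tmpdir = tempfile.mkdtemp()
contents = {
    "w1_r1.emb": "2 2\nA 0.1 0.2\nB 0.3 0.4\n",
    "w1_r2.emb": "2 2\nA 0.1 0.2\nB 0.3 0.4\n",
    "w10_r1.emb": "2 2\nA 0.1 0.2\nB 0.3 0.5\n",
}
paths = []
for name, text in contents.items():
    path = os.path.join(tmpdir, name)
    with open(path, "w") as fh:
        fh.write(text)
    paths.append(path)

results = compare_all(paths)
for (a, b), equal in results.items():
    print(os.path.basename(a), os.path.basename(b), equal)
```

One difference from np.array_equal on the raw arrays: keying by node ID makes the comparison insensitive to row order, so it only reports whether each node received exactly the same vector.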

RemyLau · 2025-05-01T13:06:26Z

See #70

Adding _get_random_seeds won't help with getting the embedding results to be deterministic, as the randomness comes from gensim's word2vec (#70 (comment)). The random walk generation is already reproducible given a random state as tested by a unit test:

PecanPy/test/test_walk.py

Line 86 in 71dd988

class TestWalk(unittest.TestCase):

kmanpearl · 2025-05-01T13:25:46Z

@RemyLau I used random_state and set PYTHONHASHSEED, but I could not get the same results twice. With the changes I implemented, I now do, as long as workers=1.
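
One detail worth checking when PYTHONHASHSEED is involved: Python reads it at interpreter startup, so it has to be set in the environment before the process is launched, and assigning os.environ["PYTHONHASHSEED"] inside a running script does not change that process's string hashing. A small sketch of how to verify the setting takes effect (string_hash_in_fresh_interpreter is a hypothetical helper, and the hashed string is arbitrary):

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(hash_seed=None):
    """Start a new Python process and return hash('node_42') from it."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        # Must be in the environment *before* the interpreter starts.
        env["PYTHONHASHSEED"] = str(hash_seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('node_42'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout)


# Same PYTHONHASHSEED in two separate launches gives the same string hash.
fixed_a = string_hash_in_fresh_interpreter(0)
fixed_b = string_hash_in_fresh_interpreter(0)
print(fixed_a == fixed_b)
```

Without the variable, string hashes are randomized per launch by default, so two runs will usually disagree.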

ChristopherMancuso · 2025-05-01T13:35:26Z

@RemyLau I tested the changes @kmanpearl made as well, and as long as workers=1 the embeddings are now reproducible. There was no way to make that happen before, so whatever changes she made help in that regard. I still believe these changes should be merged soon so she can use this for her project.

RemyLau · 2025-05-03T14:06:03Z

That's interesting. We should add this to the test in that case. I've previously checked and confirmed reproducibility when running under a single thread using the karate network (see test script): the resulting embeddings from two runs are identical. If, for some reason, this is not true in other cases, then I think there is a deeper issue we need to figure out.

My guess is that previously when you ran the program twice for the reproducibility check, you did not explicitly set the --random_state parameter. Is that true? This option, by default, is set to None, which does not ensure the same random seed is used across runs. This choice is made intentionally to not make the embeddings deterministic. However, if reproducibility is desired, the random state could be explicitly set (along with setting the number of threads to one) to achieve that.
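
The effect of a None default can be illustrated with a stand-alone sketch. This is not PecanPy's CLI: parse_args and draw are hypothetical, and only the --random_state and --workers option names are taken from the discussion above.

```python
import argparse
import random


def parse_args(argv):
    """Hypothetical CLI with an optional integer random state."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--random_state", type=int, default=None)
    parser.add_argument("--workers", type=int, default=1)
    return parser.parse_args(argv)


def draw(args, n=5):
    """Draw n numbers; a seed of None means 'seed from system entropy'."""
    rng = random.Random(args.random_state)
    return [rng.random() for _ in range(n)]


# An explicit seed reproduces the same stream on every call.
seeded_1 = draw(parse_args(["--random_state", "0"]))
seeded_2 = draw(parse_args(["--random_state", "0"]))
print(seeded_1 == seeded_2)

# Omitting the option leaves random_state as None, so each call is
# seeded differently and the streams are not expected to match.
unseeded = draw(parse_args([]))
```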

kmanpearl · 2025-05-05T14:44:13Z

@RemyLau I did use a random_seed and also set PYTHONHASHSEED before running the script. I basically just ran all the lines in demo/reproducibility.sh individually, and my two embedding matrices were not the same.

seeding and workers

33f5d4c

kmanpearl requested a review from ChristopherMancuso April 23, 2025 17:59

pre-commit-ci bot and others added 8 commits April 23, 2025 17:59

[pre-commit.ci] auto fixes from pre-commit.com hooks

29eb4f9

for more information, see https://pre-commit.ci

changed cli random state default

faceca6

removed comment

767fa5e

removed comments

3d235a3

[pre-commit.ci] auto fixes from pre-commit.com hooks

60fa12f

temp print for version control

4514569

merge

e8d924b

[pre-commit.ci] auto fixes from pre-commit.com hooks

c15c83c

ChristopherMancuso approved these changes Apr 29, 2025

RemyLau added the duplicate label (This issue or pull request already exists) May 1, 2025

kmanpearl and others added 2 commits May 5, 2025 08:55

removed default seed

3b3915f

[pre-commit.ci] auto fixes from pre-commit.com hooks

6908fe7
