
Conversation

@brian-arnold
Collaborator

Here are two additional ways to specify initial input data, from a DataFrame or from a list. I tried to have the structure and code mimic GlobSource as closely as possible, but as a result some code appears to be duplicated.

I'm also not sure whether the hashing will work as intended. For instance, how do we know whether the elements in the lists are PathLike, so that the corresponding files should be hashed?

In any case, this code passes the simple tests I've run, which I can formalize at some point.
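One way to answer the PathLike question above: only treat an element as a file reference when it implements os.PathLike (a bare string does not) and the path actually exists on disk. This is a minimal sketch, not the project's actual hashing logic; hash_element is a hypothetical name.

```python
import hashlib
import os
from pathlib import Path


def hash_element(element: object) -> str:
    # Treat an element as a file reference only when it is an
    # os.PathLike object (a bare string is NOT os.PathLike) and the
    # path actually points at a file on disk.
    if isinstance(element, os.PathLike):
        path = Path(element)
        if path.is_file():
            return hashlib.sha256(path.read_bytes()).hexdigest()
    # Otherwise fall back to hashing the element's repr().
    return hashlib.sha256(repr(element).encode()).hexdigest()
```

This keeps plain strings out of the file-hashing path, so `["a", "b"]` hashes as values while `[Path("data/a.csv")]` hashes file contents.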

@brian-arnold brian-arnold requested a review from eywalker July 1, 2025 00:04
Contributor

@eywalker left a comment


Only a few minor changes requested -- otherwise it looks amazing!

raise ValueError(f"Columns not found in DataFrame: {missing_columns}")

if tag_function is None:
tag_function = self.__class__.default_tag_function

In this case, it would make sense for expected_tag_keys to be set to row_index.

"It generates its own stream from the DataFrame."
)
# Claim uniqueness only if the default tag function is used
if self.tag_function == self.__class__.default_tag_function:

Awesome!

# Convert DataFrame to hashable representation
df_subset = self.dataframe[self.columns]
df_content = df_subset.to_dict('records')
df_hashable = tuple(tuple(sorted(record.items())) for record in df_content)

Very nice -- amazing reproducibility on the data frame!
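For reference, the records-to-sorted-tuples conversion from the snippet above yields a representation whose repr() is deterministic, so it can feed a stable digest. A sketch with made-up example data:

```python
import hashlib

import pandas as pd

df = pd.DataFrame({"subject": ["a", "b"], "session": [1, 2]})

# Same conversion as in the snippet above: records -> sorted item tuples.
records = df[["subject", "session"]].to_dict("records")
df_hashable = tuple(tuple(sorted(record.items())) for record in records)

# repr() of the nested tuples is deterministic, so the digest is stable
# for identical frame content regardless of the frame's column order.
digest = hashlib.sha256(repr(df_hashable).encode()).hexdigest()
```

Because each record's items are sorted by key, two frames with the same content but different column order produce the same digest.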

in the packet, with the corresponding row values as the packet values.
data : pd.DataFrame
The pandas DataFrame to source data from
tag_function : Callable[[pd.Series, int], Tag] | None, default=None

Would you mind adding a feature where, if a list of strings is passed, the strings are interpreted as column names whose values should be used as the tags?
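One way to support this would be to build a tag function from the column names. This is a hedged sketch of the requested feature; make_column_tag_function is a hypothetical helper, and the tag is assumed to be a plain dict of column-value pairs:

```python
import pandas as pd


def make_column_tag_function(columns):
    # Hypothetical helper: interpret a list of column names as the
    # columns whose row values should become the tag.
    def tag_function(row: pd.Series, row_index: int):
        return {col: row[col] for col in columns}

    return tag_function


df = pd.DataFrame({"subject": ["a", "b"], "session": [1, 2]})
tag_fn = make_column_tag_function(["subject"])
```

The constructor could then normalize its tag_function argument: a callable is used as-is, while a list of strings goes through a helper like this one.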

self.expected_tag_keys = expected_tag_keys

if tag_function is None:
tag_function = self.__class__.default_tag_function

If using the default tag function, let's have expected_tag_keys be updated to element_index.
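The same pattern as the earlier row_index suggestion applies here, just with element_index for the list source. A compact illustrative sketch (names beyond expected_tag_keys and default_tag_function are assumptions):

```python
class ListSourceSketch:
    # Illustrative sketch only, not the project's actual class.

    @staticmethod
    def default_tag_function(element, element_index):
        # Assumption: the default tags each element by its index.
        return {"element_index": element_index}

    def __init__(self, tag_function=None):
        self.expected_tag_keys = None
        if tag_function is None:
            tag_function = self.default_tag_function
            # Default tag keys are known up front, so update
            # expected_tag_keys to element_index as suggested.
            self.expected_tag_keys = ("element_index",)
        self.tag_function = tag_function
```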

def claims_unique_tags(
self, *streams: "SyncStream", trigger_run: bool = True
) -> bool | None:
if len(streams) != 0:

Note to self: we should probably extract the stream input check into a separate function, since this check is repeated not only here but in many places throughout operators and sources. Perhaps a pre-forward check should be formalized as a step.
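The extracted check might look like this sketch; check_source_inputs is a hypothetical name, and the error wording is illustrative:

```python
def check_source_inputs(streams, source_name):
    # Sketch of the shared pre-forward check suggested above: sources
    # generate their own stream, so any upstream streams are an error.
    if len(streams) != 0:
        raise ValueError(
            f"{source_name} is a source and takes no input streams; "
            f"received {len(streams)}"
        )
```

Each source's claims_unique_tags (and similar methods) could then call this helper instead of repeating the len(streams) != 0 check inline.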

