Skip to content

Conversation

@mlell
Copy link

@mlell mlell commented Jan 4, 2026

While CSVReader has numerous options to skip header or footer lines, or skip lines based on prefixes, it is cumbersome to read lines with varying lines before or after the actual statements. Users would need to override CSVReader.read(). Simply extending using super().read(...) is possible but requires creating a temporary file because read() needs a physical file path. Simply adding an open(str) -> file-like method to CSVReader allows to keep the useful functionality of read() while still being able to preprocess the file, for example, skipping a variable number of lines based on a pattern.

This would fix #196.

@johannesjh
Copy link
Contributor

I second the proposition. It would also help with #197

Instead of strictly requiring a file-like object, the importer could allow more general Iterable[String] type of objects. Users could still pass file-like objects, but also generator expressions. Generator expressions could be a convenient and idiomatic way of manipulating / filtering CSV lines.

This method uses the class members header, footer, names, dialect,
comments, and order. Overriding this method causes those members
to be ignored unless the overriding method explicitly uses them.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change does not belong to this commit or this PR. I also find it unnecessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale was that there would be two ways to customize importer behaviour which partially overlap. Overriding "open()" would render the member "encoding" non-functional, for example. Will remove it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in a5561e3

"""Open the CSV file for reading.
This method can be overridden in subclasses to customize raw file reading,
for example to skip lines before processing or to handle special file formats.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "special file formats" mean here? This class supports a declarative way to define an importer for CSV files. Why would you want to use it for something else?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is unclearly phrased. I wanted to refer to the fact that CSV has a large number of strange variants.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there are grey areas I think, like compressed CSV or XLSX which are not strictly CSV but are conceptually similar and can easily be converted

Comment on lines 465 to 479
class CustomReader(CSVReader):
first = Column("First")
second = Column("Second")

def open(self, filepath):
"""Skip lines until we find the column headers."""
fd = super().open(filepath)
# Read lines until we find one containing "First"
for line in fd:
if "First" in line:
# Create a new file-like object with the header line and remaining content
remaining = fd.read()
fd.close()
return io.StringIO(line + remaining)
return fd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the iterlines() idea exposed above, this would become simply:

from itertools import dropwhile

class Reader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def iterlines(self, fd):
        return dropwhile(lambda line: "First" not in line, fd)

which seems much better to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reader class is now:

        class Reader(CSVReader):
            first = Column("First")
            second = Column("Second")


            def open(self, filepath):
                """Skip lines until we find the column headers."""
                lines = super().open(filepath)
                return dropwhile(lambda line: "First" not in line, lines)

@mlell mlell force-pushed the dev_override_open branch from 113d4f3 to fd4b7bf Compare January 7, 2026 11:19
This allows subclasses to customize raw file reading.
@mlell mlell force-pushed the dev_override_open branch from fd4b7bf to a5561e3 Compare January 7, 2026 11:26
@mlell
Copy link
Author

mlell commented Jan 7, 2026

  • Changed the return type of open() to be an iterable (e.g., a generator) to allow for e.g. generator expressions
  • Removed the change to the docstring of read()
  • Updated the unit test to use the dropwhile approach

While it is true that this is a reader for CSV data only, I felt that there might be some formats, like compressed data or XLSX, where it might be useful to first open in binary mode, preprocess and then hand over to read(). This would not be possible with the iterlines(). Approach. However, you might have experience with more different bank statement formats that I have, if you think noone is going to need this, the iterlines() approach would be simpler.

Copy link
Collaborator

@dnicolodi dnicolodi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better but not needs some tweaks.

Comment on lines +280 to +288
"""Open the CSV file for reading.
This method can be overridden in subclasses to customize raw file reading,
for example to pre-proceess text lines before import. Note that to skip
a fixed number of lines at the file beginning or end, setting the class
members "header" or "footer" is the easier approach.
This method uses the class member 'encoding'. Overriding this method causes
that member to be ignored unless the overriding method explicitly uses it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Open the CSV file for reading.
This method can be overridden in subclasses to customize raw file reading,
for example to pre-proceess text lines before import. Note that to skip
a fixed number of lines at the file beginning or end, setting the class
members "header" or "footer" is the easier approach.
This method uses the class member 'encoding'. Overriding this method causes
that member to be ignored unless the overriding method explicitly uses it.
"""Open filepath and return an iterable yielding lines of CSV-formatted text.
This method can be overridden in subclasses to customize file reading, for
example to pre-proceess text lines before import. To simply skip a fixed
number of lines at the beginning or end, consider setting the ``header``
or ``footer`` class variables instead.

# warnings.warn('skiplines is deprecated, use header instead', DeprecationWarning)
self.header = self.skiplines

def open(self, filepath):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def open(self, filepath):
def open(self, filepath, encoding):

Returns:
An iterable providing lines of CSV-formatted text.
"""
with open(filepath, encoding = self.encoding) as fd:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with open(filepath, encoding = self.encoding) as fd:
with open(filepath, encoding=encoding) as fd:

# Return data rows.
for x in reader:
yield row(x)
lines = self.open(filepath)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lines = self.open(filepath)
lines = self.open(filepath, self.encoding)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Preprocessing a file for csvbase.Importer requires acrobatics with temporary files

3 participants