Allow to customize raw CSV file reading #199
I second the proposition. It would also help with #197. Instead of strictly requiring a file-like object, the importer could allow more general |
beangulp/importers/csvbase.py
Outdated
        This method uses the class members header, footer, names, dialect,
        comments, and order. Overriding this method causes those members
        to be ignored unless the overriding method explicitly uses them.
This change does not belong to this commit or this PR. I also find it unnecessary.
The rationale was that there would be two ways to customize importer behaviour which partially overlap. Overriding "open()" would render the member "encoding" non-functional, for example. Will remove it.
done in a5561e3
beangulp/importers/csvbase.py
Outdated
        """Open the CSV file for reading.
        This method can be overridden in subclasses to customize raw file reading,
        for example to skip lines before processing or to handle special file formats.
What does "special file formats" mean here? This class supports a declarative way to define an importer for CSV files. Why would you want to use it for something else?
I see that this is unclearly phrased. I wanted to refer to the fact that CSV has a large number of strange variants.
Also, I think there are grey areas, like compressed CSV or XLSX, which are not strictly CSV but are conceptually similar and can easily be converted.
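To make the compressed-CSV case concrete, here is a minimal sketch of what such a pre-processing step could look like using only the standard library. It does not touch beangulp: `open_gzip_csv` and `read_rows` are hypothetical helper names, and the assumption is that whatever hook the importer ends up exposing only needs an iterable of decoded text lines.

```python
import csv
import gzip


def open_gzip_csv(filepath, encoding="utf-8"):
    """Return a text stream over a gzip-compressed CSV file (hypothetical helper)."""
    # Mode "rt" makes gzip.open() decode bytes to text; newline="" leaves
    # line endings untouched, which is what the csv module expects.
    return gzip.open(filepath, mode="rt", encoding=encoding, newline="")


def read_rows(lines):
    """Parse any iterable of CSV text lines into a list of dicts."""
    return list(csv.DictReader(lines))
```

The point of the sketch is that the CSV parsing side stays unchanged; only the step that turns a path into lines of text differs.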
beangulp/importers/csvbase_test.py
Outdated
class CustomReader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def open(self, filepath):
        """Skip lines until we find the column headers."""
        fd = super().open(filepath)
        # Read lines until we find one containing "First"
        for line in fd:
            if "First" in line:
                # Create a new file-like object with the header line and remaining content
                remaining = fd.read()
                fd.close()
                return io.StringIO(line + remaining)
        return fd
With the iterlines() idea exposed above, this would become simply:

from itertools import dropwhile

class Reader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def iterlines(self, fd):
        return dropwhile(lambda line: "First" not in line, fd)

which seems much better to me.
Reader class is now:

class Reader(CSVReader):
    first = Column("First")
    second = Column("Second")

    def open(self, filepath):
        """Skip lines until we find the column headers."""
        lines = super().open(filepath)
        return dropwhile(lambda line: "First" not in line, lines)

113d4f3 to fd4b7bf
This allows subclasses to customize raw file reading.
fd4b7bf to a5561e3
While it is true that this is a reader for CSV data only, I felt that there might be some formats, like compressed data or XLSX, where it might be useful to first open in binary mode, preprocess and then hand over to |
dnicolodi
left a comment
Better, but it needs some tweaks.
        """Open the CSV file for reading.
        This method can be overridden in subclasses to customize raw file reading,
        for example to pre-process text lines before import. Note that to skip
        a fixed number of lines at the file beginning or end, setting the class
        members "header" or "footer" is the easier approach.
        This method uses the class member 'encoding'. Overriding this method causes
        that member to be ignored unless the overriding method explicitly uses it.
-        """Open the CSV file for reading.
-        This method can be overridden in subclasses to customize raw file reading,
-        for example to pre-process text lines before import. Note that to skip
-        a fixed number of lines at the file beginning or end, setting the class
-        members "header" or "footer" is the easier approach.
-        This method uses the class member 'encoding'. Overriding this method causes
-        that member to be ignored unless the overriding method explicitly uses it.
+        """Open filepath and return an iterable yielding lines of CSV-formatted text.
+        This method can be overridden in subclasses to customize file reading, for
+        example to pre-process text lines before import. To simply skip a fixed
+        number of lines at the beginning or end, consider setting the ``header``
+        or ``footer`` class variables instead.
            # warnings.warn('skiplines is deprecated, use header instead', DeprecationWarning)
            self.header = self.skiplines

    def open(self, filepath):
-    def open(self, filepath):
+    def open(self, filepath, encoding):
        Returns:
          An iterable providing lines of CSV-formatted text.
        """
        with open(filepath, encoding = self.encoding) as fd:
-        with open(filepath, encoding = self.encoding) as fd:
+        with open(filepath, encoding=encoding) as fd:
            # Return data rows.
            for x in reader:
                yield row(x)
        lines = self.open(filepath)
-        lines = self.open(filepath)
+        lines = self.open(filepath, self.encoding)
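Taken together, the three suggestions above amount to a small hook pattern: read() passes the class-level encoding into open(), so an override receives it as an argument instead of having to remember to use self.encoding. The following is a rough, self-contained sketch of that split. `MiniReader` and `SkippingReader` are hypothetical stand-ins written for illustration only; they are not the beangulp classes and omit everything else CSVReader does.

```python
import csv
from itertools import dropwhile


class MiniReader:
    """Hypothetical stand-in for CSVReader, reduced to the open()/read() split."""

    encoding = "utf-8"

    def open(self, filepath, encoding):
        # Default hook: yield lines of text from the file on disk.
        with open(filepath, encoding=encoding, newline="") as fd:
            yield from fd

    def read(self, filepath):
        # read() hands the class-level encoding to the hook, so an
        # overriding open() gets it as a parameter.
        lines = self.open(filepath, self.encoding)
        yield from csv.DictReader(lines)


class SkippingReader(MiniReader):
    def open(self, filepath, encoding):
        # Reuse the default hook, then drop everything before the header row.
        lines = super().open(filepath, encoding)
        return dropwhile(lambda line: "First" not in line, lines)
```

With this shape, a subclass that only filters lines never needs to know how the file was opened, and a subclass that opens the file differently still has the configured encoding at hand.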
CSVReader has numerous options for skipping header or footer lines, or for skipping lines based on prefixes, but it is cumbersome to read files that have a varying number of lines before or after the actual statements. Users would need to override CSVReader.read(). Extending it by calling super().read(...) is possible, but requires creating a temporary file because read() needs a physical file path. Adding an open(str) -> file-like method to CSVReader keeps the useful functionality of read() while still allowing the file to be preprocessed, for example to skip a variable number of lines based on a pattern.
This would fix #196.
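As an illustration of the kind of preprocessing meant here, the sketch below trims a variable-length preamble and trailer from a stream of lines before handing it to the csv module. It is independent of beangulp: `statement_lines` is a hypothetical helper, and the sample data and the "blank line ends the table" rule are made up for the example.

```python
import csv
import io
from itertools import dropwhile, takewhile

# Made-up statement with a preamble and a trailer of unknown length.
RAW = """\
Statement for account 1234
Generated 2024-01-31

Date,Amount
2024-01-02,10.00
2024-01-05,-3.50

End of statement
"""


def statement_lines(fd):
    """Drop lines before the header row and stop at the first blank line after it."""
    from_header = dropwhile(lambda line: not line.startswith("Date,"), fd)
    return takewhile(lambda line: line.strip(), from_header)


rows = list(csv.DictReader(statement_lines(io.StringIO(RAW))))
print(rows)
```

Because both filters work on any iterable of lines, no temporary file is needed.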