@sebkur sebkur commented Aug 11, 2020

I've been looking into using your library, but when I tried it on a recent Wikipedia dump, it didn't work.

Turns out that

<text xml:space="preserve">

looked like this there:

<text bytes="57805" xml:space="preserve">

So I made the code that looks for the start of the actual wiki text a bit more robust by starting at <text and then seeking forward to the next >.

Instead of requiring the <text ...> opening tag to match
'<text xml:space="preserve">' exactly, make this step a bit more robust by
looking for the next '>' following the '<text ' opening.
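
A minimal sketch of the idea, in Python for illustration only. The function name `find_text_start` is hypothetical and this is not the library's actual code; it just assumes the page XML is available as a string.

```python
def find_text_start(page_xml: str) -> int:
    """Return the index of the first character after the <text ...> opening tag.

    Returns -1 if no complete '<text ...>' opening tag is found.
    Hypothetical helper, not part of the library.
    """
    # Locate the start of the opening tag without assuming its attributes.
    open_pos = page_xml.find("<text ")
    if open_pos == -1:
        return -1

    # Seek forward to the '>' that closes the opening tag.
    close_pos = page_xml.find(">", open_pos)
    if close_pos == -1:
        return -1

    # The wiki text begins right after that '>'.
    return close_pos + 1


old_style = '<text xml:space="preserve">Hello'
new_style = '<text bytes="57805" xml:space="preserve">Hello'

print(old_style[find_text_start(old_style):])
print(new_style[find_text_start(new_style):])
```

Both tag variants yield the same starting point for the wiki text. The sketch does not treat a self-closing `<text ... />` element specially, so a real implementation would likely need to handle that case as well.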

sebkur commented Aug 12, 2020

Inspecting the code further, I just realized that there is already some code for handling the bytes=… part; it seems it just didn't work in my case. I'll push some changes and also add a test case to make sure things work.


sebkur commented Aug 12, 2020

Also, I hadn't noticed that #14 already mentions this problem.
