Character encoding problems in table names and field values #26

@moonhouse

Description

When trying to access data from the Svenskt Pressregister 1903-1911 dataset (direct download link: Dataset (8991 kB)), I get some unexpected results.

  • One table is called "Konsert" when I open it in Microsoft Office Access 2007, but when parsed by access_parser it is called "Konsert춢".
  • Some columns of type varchar fail to decode the characters … and ’. They instead become \0& (0x00 0x26) and \0\x19 \0 (0x00 0x19 0x20 0x00). In Microsoft Access they have the expected values of … and ’ respectively.
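
Taking the replacement bytes as 0, 38 and 0, 25, 32, 0 in decimal, they look like the UTF-16LE code units of U+2026 (…) and U+2019 (’) wrapped in 0x00 markers. That would fit the "compressed Unicode" text convention described in the mdbtools project's format notes, where a value starts with 0xFF 0xFE and a 0x00 byte toggles between one-byte and two-byte characters. The sketch below is only my guess at the cause, not confirmed against access_parser's code; `decode_compressed_unicode` is a hypothetical helper, not part of the library.

```python
def decode_compressed_unicode(raw: bytes) -> str:
    """Decode text under the ASSUMED 'compressed Unicode' convention.

    Assumption: a value starting with 0xFF 0xFE is compressed, and each
    0x00 byte toggles between one-byte and two-byte (UTF-16LE) characters.
    """
    if not raw.startswith(b"\xff\xfe"):
        # No marker: treat the value as plain UTF-16LE.
        return raw.decode("utf-16-le", errors="replace")

    expanded = bytearray()
    compressed = True
    i = 2  # skip the 0xFF 0xFE marker
    while i < len(raw):
        byte = raw[i]
        if byte == 0:
            # Toggle marker: switch between one-byte and two-byte mode.
            compressed = not compressed
            i += 1
        elif compressed:
            # One-byte character: widen it to a UTF-16LE code unit.
            expanded += bytes((byte, 0))
            i += 1
        elif i + 1 < len(raw):
            # Two-byte character: copy the code unit unchanged.
            expanded += raw[i:i + 2]
            i += 2
        else:
            break  # dangling byte at the end of the value
    return expanded.decode("utf-16-le", errors="replace")


# Hand-built sample bytes (not read from SVEPDB.accdb).
apostrophe_sample = b"\xff\xfeTell \x00\x19\x20\x00n Stanialus"
ellipsis_sample = b"\xff\xfeStockholm \x00\x26\x20"

print(decode_compressed_unicode(apostrophe_sample))
print(decode_compressed_unicode(ellipsis_sample))
# Reading the same bytes one character per byte leaves the stray "\x19 ".
print(repr(apostrophe_sample[2:].decode("latin-1")))
```

If this guess is right, the values come out wrong because the toggle markers inside the string are not handled, so the two-byte characters are read as two separate one-byte characters.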

The following code

from access_parser import AccessParser

db_path = "SVEPDB.accdb"
db = AccessParser(db_path)
tables = db.catalog.keys()

concert_table = [x for x in tables if x.startswith("Konsert")][0]
if concert_table != "Konsert":
    print(f"Can't find Konsert table, found {concert_table}")
else:
    print("Found Konsert table")

table = db.parse_table("nskon")
ellipsis_value = table['ntit'][4338]
apostrophe_value = table['ntit'][14986]

if ellipsis_value != 'Det "fula" Stockholm …':
    print(f"Ellipsis not decoded correctly, got: {ellipsis_value}")
else:
    print("Ellipsis decoded correctly")

if apostrophe_value != 'Landsmålsbref. Tell ’n Stanialus':
    print(f"Apostrophe not decoded correctly, got: {apostrophe_value}")
else:
    print("Apostrophe decoded correctly")

outputs

WARNING:Could not find overflow record data page overflow pointer: 27
WARNING:Could not find overflow record data page overflow pointer: 27
Can't find Konsert table, found Konsert춢
Ellipsis not decoded correctly, got: Det "fula" Stockholm & 
Apostrophe not decoded correctly, got: Landsmålsbref. Tell  n Stanialus

instead of the expected

Found Konsert table
Ellipsis decoded correctly
Apostrophe decoded correctly
