Parquet serializer #511

@runejo

The Parquet serializer has two issues, illustrated by the following API query:

https://data.ssb.no/api/pxwebapi/v2/tables/14216/data/?valuecodes[TettSted]=0801&valuecodes[ContentsCode]=Areal,Bosatte&valuecodes[Tid]=2025,2024&outputFormat=parquet

Resulting Parquet

| år | timestamp | tettsted | ContentsCode_Areal | ContentsCode_Areal_symbol | ContentsCode_Bosatte | ContentsCode_Bosatte_symbol |
| --- | --- | --- | --- | --- | --- | --- |
| 2024 | 2024-01-01T00:00:00.000 | 0801 | 275,87 | | 1110887 | |
| 2024 | 2024-01-01T00:00:00.000 | 0801 | 1110887 | | 276,3 | |
| 2025 | 2025-01-01T00:00:00.000 | 0801 | 276,3 | | 1098061 | |
| 2025 | 2025-01-01T00:00:00.000 | 0801 | 1098061 | | 1098061 | |

  1. Selecting two or more contents (Areal and Bosatte) creates too many rows in the resulting Parquet file. In this case there should have been two rows.
  2. Selecting years 2025,2024 is not the same as selecting 2024,2025. In this case the first row actually holds the 2025 figures. The reason is that the Parquet serializer uses TIMEVAL, and the px output below shows that TIMEVAL is the same when the years are swapped. The API does not sort any valuecodes. This is intentional in the new API.
$ curl "https://data.ssb.no/api/pxwebapi/v2/tables/14216/data/?valuecodes%5bTettSted%5d=0801&valuecodes%5bContentsCode%5d=Areal,Bosatte&valuecodes%5bTid%5d=2025,2024&outputFormat=px" -s -i | grep -E '(TIMEVAL|CODES|VALUES)'
VALUES("tettsted")="Oslo";
VALUES("statistikkvariabel")="Areal av tettsted (km?)","Bosatte";
VALUES("år")="2025","2024";
TIMEVAL("år")=TLIST(A1),"2024","2025";
CODES("tettsted")="0801";
CODES("statistikkvariabel")="Areal","Bosatte";
CODES("år")="2025","2024";
$ curl "https://data.ssb.no/api/pxwebapi/v2/tables/14216/data/?valuecodes%5bTettSted%5d=0801&valuecodes%5bContentsCode%5d=Areal,Bosatte&valuecodes%5bTid%5d=2024,2025&outputFormat=px" -s -i | grep -E '(TIMEVAL|CODES|VALUES)'
VALUES("tettsted")="Oslo";
VALUES("statistikkvariabel")="Areal av tettsted (km?)","Bosatte";
VALUES("år")="2024","2025";
TIMEVAL("år")=TLIST(A1),"2024","2025";
CODES("tettsted")="0801";
CODES("statistikkvariabel")="Areal","Bosatte";
CODES("år")="2024","2025";
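The mismatch between the two keywords can be checked mechanically. The following is a minimal Python sketch, not part of the serializer: `parse_px_keywords` is a hypothetical helper that only handles the simple one-line `KEYWORD("variable")=...;` form shown above, and the input is the relevant lines from the first curl output.

```python
import re


def parse_px_keywords(text):
    """Map (keyword, variable) to the list of quoted values on that line.

    Hypothetical helper: it only understands single-line entries of the
    form KEYWORD("variable")=...; as printed by the grep above.
    """
    result = {}
    for line in text.strip().splitlines():
        match = re.match(r'(\w+)\("([^"]+)"\)=(.*);$', line.strip())
        if not match:
            continue
        keyword, variable, rhs = match.groups()
        # Unquoted tokens such as TLIST(A1) are skipped; only quoted values are kept.
        result[(keyword, variable)] = re.findall(r'"([^"]*)"', rhs)
    return result


# Lines taken from the px output for valuecodes[Tid]=2025,2024.
px_output = '''
VALUES("år")="2025","2024";
TIMEVAL("år")=TLIST(A1),"2024","2025";
CODES("år")="2025","2024";
'''

keywords = parse_px_keywords(px_output)
codes = keywords[("CODES", "år")]
timeval = keywords[("TIMEVAL", "år")]
order_matches = codes == timeval

print("CODES:  ", codes)
print("TIMEVAL:", timeval)
print("same order:", order_matches)
```

For this request the two lists hold the same years in different order, so a consumer that labels rows from TIMEVAL while the data follows CODES will attach the wrong year to each row.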

The first issue, too many rows, is a clear bug. I have changed the tests and will try to fix the bug in PxTools/PCAxis.Serializers#181
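To make the expected shape concrete, here is a small Python sketch of the output I would expect: one row per combination of the non-content variables, with one column per content code. It is an illustration only, not the serializer's code. The function name `expected_rows` is hypothetical, the flat value order (content code varying fastest) is an assumption, and the values are read off the Parquet output above in the order they first appear.

```python
from itertools import product


def expected_rows(time_codes, region_codes, content_codes, flat_values):
    """Build one row per (time, region) with one column per content code.

    Assumption: flat_values is ordered with the content code varying
    fastest, then region, then time, following the request order.
    """
    n = len(content_codes)
    rows = []
    for i, (tid, region) in enumerate(product(time_codes, region_codes)):
        chunk = flat_values[i * n:(i + 1) * n]
        row = {"Tid": tid, "TettSted": region}
        row.update(zip(content_codes, chunk))
        rows.append(row)
    return rows


rows = expected_rows(
    time_codes=["2025", "2024"],        # request order, not sorted
    region_codes=["0801"],
    content_codes=["Areal", "Bosatte"],
    flat_values=[275.87, 1110887, 276.3, 1098061],
)

for row in rows:
    print(row)
```

With two years, one tettsted and two contents this gives two rows, compared with the four rows in the current output.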

For the second issue, it is not clear whether the bug is in the Parquet serializer or in PxWebApi for not sorting time in ascending order.

Is this a valid PX file according to the TIMEVAL documentation?
