diff --git a/di/analytics/analytics.md b/di/analytics/analytics.md
new file mode 100644
index 0000000..c241e5e
--- /dev/null
+++ b/di/analytics/analytics.md
@@ -0,0 +1,213 @@
+# Analytics Functions Library
+
+A set of analytical utilities that streamline common data manipulation operations in kdb+/q.
+
+The library provides specialized functions for typical analytical workflows, including forward filling missing values, creating custom time intervals, pivoting tables, and generating cross-product expansions. Each function takes a dictionary of parameters (`ffill` also accepts a table directly) and validates its input, signalling informative error messages.
+
+---
+
+## Overview
+
+- **`ffill`** – Forward fill missing values within columns (optionally by group).
+- **`ffillzero`** – Treat zeros as missing and forward fill with the last non-zero value.
+- **`intervals`** – Generate custom time/value intervals with configurable step and rounding.
+- **`pivot`** – Transform tables into cross-tab (wide) format using a pivot column.
+- **`rack`** – Build cross products of key columns, optionally with time intervals and base tables.
+
+---
+
+## Functions
+
+### ⚙️ `ffill`
+
+**Description**
+
+Forward fills null values in specified columns with the most recent non-null observation. Supports both table-level operations and granular control through dictionary parameters.
+
+**Parameters**
+- Input can be either a table or a dictionary.
+- When the argument is a table, the whole table is forward filled (same as calling `fills`).
+- When using dictionary format:
+  - `table`: The table to process (**required**)
+  - `keycols`: Column(s) to fill (optional – defaults to all columns)
+  - `by`: Grouping column(s) for segmented filling (optional)
+
+**Behaviour**
+- Processes columns independently, preserving data types.
+- Handles both typed columns and mixed-type columns.
+- When `by` is specified, filling occurs within each group.
+- `by` and `keycols` can be combined for targeted group-wise filling.
+
+**Examples**
+```q
+// Fill all columns in a table
+filledTable: ffill[table]
+
+// Fill specific columns
+ffill[`table`keycols!(myTable; `ask`bid)]
+
+// Group-wise filling by symbol
+ffill[`table`by`keycols!(myTable; `sym; `price`size)]
+
+// Combined grouping and column selection
+ffill[`table`by`keycols!(myTable; `sym; `ask`bid)]
+```
+
+---
+
+
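+The group-wise behaviour described for `ffill` can be approximated with q's built-in `fills` inside an `update ... by`. This is a minimal sketch of the idea, not the library's implementation; the table `t` and its values are made up for illustration.
+
+```q
+// hypothetical sample data (not taken from the library or its tests)
+t:([]sym:`a`b`a`b`a;price:1 0N 0N 4 5)
+
+// fill across the whole column, ignoring groups
+update price:fills price from t          // price becomes 1 1 1 4 5
+
+// fill within each sym; row order is preserved
+update price:fills price by sym from t   // price becomes 1 0N 1 4 5
+```
+
+In the grouped case the first `b` row stays null because there is no earlier `b` value to carry forward.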
+
+
+### ⚙️ `ffillzero`
+
+**Description**
+
+Extends forward-fill functionality to handle zero values by treating them as missing data points before applying the fill operation.
+
+**Parameters**
+- Dictionary with:
+  - `table`: Source table (**required**)
+  - `keycols`: Columns where zeros should be filled (**required**)
+  - `by`: Optional grouping column(s)
+
+**Behaviour**
+1. Converts zero values to null in specified columns.
+2. Applies `ffill` logic.
+3. Returns a table with zeros replaced by previous non-zero values.
+
+**Examples**
+```q
+// Replace zeros with last non-zero value
+ffillzero[`table`keycols!(priceData; `bid`ask)]
+
+// Group-wise zero filling
+ffillzero[`table`by`keycols!(priceData; `sym; `price)]
+```
+
+---
+
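+The two steps listed under Behaviour for `ffillzero` (zeros to nulls, then forward fill) can be sketched with built-ins as follows. This only illustrates the idea on made-up data and is not the library's code. In particular, the description does not say how `ffillzero` treats a zero with no earlier non-zero value in its group; in this sketch such a value ends up null.
+
+```q
+// hypothetical sample data
+t:([]sym:`a`a`b`b;bid:10 0 0 7f)
+
+// step 1: replace zeros with the float null 0n; step 2: fill forward within each sym
+update bid:fills ?[0=bid;0n;bid] by sym from t   // bid becomes 10 10 0n 7
+```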
+
+### ⚙️ `intervals`
+
+**Description**
+
+Generates custom time or numeric interval sequences with configurable start, end, and increment parameters. Supports multiple temporal data types with optional rounding to interval boundaries.
+
+**Parameters**
+- Dictionary containing:
+  - `start`: Beginning of interval range (**required**)
+  - `end`: End of interval range (**required**)
+  - `interval`: Step size between successive intervals (**required**)
+  - `round`: Boolean flag for rounding `start` down to the nearest interval boundary (optional, default: `1b`)
+
+**Behaviour**
+- Supports multiple data types: `minute`, `second`, `time`, `timespan`, `timestamp`, `month`, `date`, `int`, `long`, `short`, `byte`.
+- `start` and `end` must have matching data types.
+- When `round` is `1b` or omitted, `start` is rounded down to the nearest interval boundary; when `round` is `0b`, the sequence begins at `start` exactly.
+- The sequence stops at the last point that does not exceed `end`, so `end` itself is included only when it falls on a step.
+- Date/month intervals: `interval` must be int or long (fractional dates/months not permitted).
+- Timestamp intervals: `interval` accepts minute, second, timespan, int, or long. When using numeric types (int/long), values represent nanoseconds; take care to avoid overflow.
+
+**Examples**
+```q
+// Generate 15-minute intervals for trading day
+intervals[`start`end`interval!(09:30:00.000; 16:00:00.000; 00:15:00.000)]
+
+// Daily intervals without rounding
+intervals[`start`end`interval`round!(2024.01.01; 2024.12.31; 1; 0b)]
+
+// Hourly timestamps with automatic rounding
+intervals[`start`end`interval!(2024.01.01D09:00:00; 2024.01.01D17:00:00; 01:00:00)]
+```
+
+---
+
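+The rounding and stepping rules for `intervals` can be sketched with integer arithmetic on the underlying values. This is an illustrative sketch for the `minute` type only, using a hypothetical helper `ivl`; it is not the library's `intervals` implementation and skips all validation.
+
+```q
+// hypothetical helper: st = start, en = end, stp = step, rnd = round flag
+ivl:{[st;en;stp;rnd]
+  s:`long$st; e:`long$en; n:`long$stp;   // minutes since midnight as longs
+  r:$[rnd;n*s div n;s];                  // optionally round start down to a multiple of the step
+  `minute$r+n*til 1+(e-r) div n}         // points r, r+n, ... that do not exceed end
+
+ivl[09:32;12:00;00:30;1b]   // 09:30 10:00 10:30 11:00 11:30 12:00
+ivl[09:32;12:00;00:30;0b]   // 09:32 10:02 10:32 11:02 11:32
+```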
+
+### ⚙️ `pivot`
+
+**Description**
+
+Reorganizes tabular data by transforming unique values from a pivot column into individual columns, with aggregated values at intersections. Creates a cross-tabular representation suitable for reporting and analysis.
+
+**Parameters**
+- Dictionary with:
+  - `table`: Source table (**required**)
+  - `by`: Row grouping column(s) (**required**)
+  - `piv`: Column whose distinct values become new columns (**required**)
+  - `var`: Value column(s) to aggregate (**required**)
+  - `f`: Column naming function (optional – defaults to concatenation with underscore, e.g. `size_A`)
+  - `g`: Column ordering function (optional – defaults to keeping `by` columns followed by sorted pivot columns)
+
+**Behaviour**
+- Groups data by `by` columns to form rows.
+- Groups by `piv` columns to determine new column structure.
+- Aggregates `var` values at each intersection.
+- Applies naming and ordering functions (`f`, `g`) to the final result.
+
+**Examples**
+```q
+// Basic pivot: levels become columns
+pivot[`table`by`piv`var!(quotes; `date`sym`time; `level; `price)]
+
+// Multiple aggregation columns
+pivot[`table`by`piv`var!(trades; `date`sym; `exchange; `price`volume)]
+
+// Custom column naming
+pivot[`table`by`piv`var`f!(data; `date; `category; `value; {[v;P] `$"_" sv' string v,'P})]
+```
+
+---
+
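+For intuition, the core reshaping step of `pivot` resembles the well-known q pivot idiom, in which each group builds a dictionary from pivot values to data values. The sketch below uses made-up data and plain pivot values as column names. The library's `pivot` adds validation, multiple `var` columns, and the `f`/`g` naming and ordering hooks, so this should not be read as its implementation.
+
+```q
+// hypothetical sample data
+t:([]sym:`a`a`b`b;side:`A`B`A`B;size:10 20 30 40)
+
+// distinct pivot values, sorted, become the new column names
+P:asc exec distinct side from t
+
+// one row per sym; columns A and B hold the matching size values
+exec P#(side!size) by sym:sym from t
+```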
+
+### ⚙️ `rack`
+
+**Description**
+
+Constructs a cross product of distinct column values, creating all possible combinations. Optionally integrates time series intervals and/or base table expansion for comprehensive data frameworks.
+
+**Parameters**
+- Dictionary containing:
+  - `table`: Source table (**required**)
+  - `keycols`: Columns to cross-product (**required**)
+  - `base`: Additional table to cross with result (optional)
+  - `timeseries`: Dictionary for interval generation (optional, uses `intervals` function)
+  - `fullexpansion`: Boolean for complete Cartesian product of key columns (optional, default: `0b`)
+
+**Behaviour**
+- Standard mode preserves existing row-wise combinations in `keycols`.
+- Full expansion mode (`fullexpansion` = `1b`) generates all possible combinations across `keycols`.
+- Can integrate with time series intervals for temporal expansion.
+- Supports base table cross-product for additional dimensionality.
+
+**Examples**
+```q
+// Generate all symbol combinations from table
+rack[`table`keycols`fullexpansion!(trades; `sym; 1b)]
+
+// Rack with time intervals
+rack[`table`keycols`timeseries!(trades; `sym; `start`end`interval!(09:30; 16:00; 00:15))]
+
+// Combine base table with rack and intervals
+rack[`table`keycols`base`timeseries!(trades; `sym; baseData; intervalDict)]
+
+// Preserve existing combinations without expansion
+rack[`table`keycols!(quotes; `sym`exchange)]
+```
+
+---
+
+## Error Handling
+
+The functions implement comprehensive validation with descriptive error messages:
+- **Type validation** – Ensures input parameters match expected types.
+- **Structure validation** – Verifies the dictionary contains required keys.
+- **Column validation** – Confirms specified columns exist in target tables.
+- **Data type consistency** – Validates matching types across related parameters.
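+
+Because these checks signal errors, a caller that prefers to recover rather than fail can use q's protected evaluation (trap). A minimal sketch, assuming a hypothetical stand-in function `chk` that signals one of the messages used in the library:
+
+```q
+// hypothetical stand-in that mimics a dictionary check; not part of the library
+chk:{[d] $[99h<>type d;'`$"input should be a dictionary";d]}
+
+// trap: on error the handler is called with the error message
+@[chk;42;{"caught: ",x}]
+
+// a valid dictionary passes through unchanged
+@[chk;`a`b!1 2;{"caught: ",x}]
+```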
+
+**Example error messages**
+```q
+'Input parameter must be a dictionary with keys-(table, keycols, by), or a table to fill
+'Input parameter must be a dictionary with at least three keys (an optional key round):-start-end-interval
+'some columns provided do not exist in the table
+'interval start and end data type mismatch
+```
diff --git a/di/analytics/analytics.q b/di/analytics/analytics.q
new file mode 100644
index 0000000..565b849
--- /dev/null
+++ b/di/analytics/analytics.q
@@ -0,0 +1,101 @@
+ffill:{[arg]
+    / forward fills null values in specified columns (or all columns) with the last non-null value, optionally grouped by key columns. The function is the single point of entry for different input types: dictionary or table.
+    :$[.Q.qt arg;filltable[arg];
+        99h=type arg;filldict[arg];
+        '`$"Input parameter must be a dictionary with keys-(table, keycols, by), or a table to fill"];
+    }
+
+/ forward fill a column in a table, handle both typed and mixed columns
+fillcol: {$[0h=type x; x maxs (til count x)*(0 type d; '`$"input should be a dictionary";
+    not all `start`end`interval in fkey:key[d];'`$"Input parameter must be a dictionary with at least three keys (an optional key round):\n\t-",sv["\n\t-";string `start`end`interval];
+    any not (itype:.Q.ty'[d`start`end`interval`round]) in ("MmuUiIjJhHNnVvDdPptTB");'`$("One or more of inputs are of an invalid type.");
+    1 type d; '`$"input should be a dictionary";
+    not all `table`by`piv`var in fkey:key[d];'`$"Input parameter must be a dictionary with at least four keys (with optional keys f and g):\n\t-",sv["\n\t-";string `table`by`piv`var];
+    any not itype:.Q.ty'[d`table`by`piv`var] in (" sS");'`$("One or more of inputs are of an invalid type.")];
+
+    if[(any/) not d[`by`piv`var] in cols [d`table];'`$"some columns provided do not exist in the table"];
+
+    t:d`table;
+    k:(),d`by;
+    p:(),d`piv;
+    v:(),d`var;
+    f:$[`f in fkey;d`f;{[v;P] `$"_" sv' string (v,()) cross P}];
+    g:$[`g in fkey;d`g;{[k;c] k,asc c}];
+    G:group flip k!(t:.Q.v t)k;
+    F:group flip p!t p;
+
+    count[k]!g[k;C]xcols 0!key[G]!flip(C:f[v]P:flip value flip key F)!raze
+        {[i;j;k;x;y]
+            a:count[x]#x 0N;
+            a[y]:x y;
+            b:count[x]#0b;
+            b[y]:1b;
+            c:a i;
+            c[k]:first'[a[j]@'where'[b j]];
+            c}[I[;0];I J;J:where 1<>count'[I:value G]]/:\:[t v;value F]}
+
+rack:{[d]
+    / Creates a cross product (rack) of distinct column values, optionally with time series intervals and/or base table expansion
+    $[99h<> type d; '`$"input should be a dictionary";
+        not all `table`keycols in fkey:key[d];'`$"Input parameter must be a dictionary with at least two keys (with optional keys base, timeseries, fullexpansion):\n\t-",sv["\n\t-";string `table`keycols]];
+    if[any not d[`keycols] in cols [d`table];'`$"some of the key columns provided do not exist in the table"];
+
+    tab:d`table;
+    keycol:d`keycols;
+    fullexp:$[`fullexpansion in fkey;d`fullexpansion;0b];
+    rackkeycol:$[fullexp;flip keycol!flip (cross/) distinct@/:(0!tab)[keycol];flip keycol!(0!tab)[keycol]];
+    if[`timeseries in fkey;
+        timeinterval:flip (enlist `interval)!enlist intervals[d`timeseries];
+        :$[`base in fkey; (cross/)(d`base;rackkeycol;timeinterval); (cross/)(rackkeycol;timeinterval)]];
+    :$[`base in fkey; (cross/)(d`base;rackkeycol); rackkeycol];
+    }
+
+
+
+
diff --git a/di/analytics/init.q b/di/analytics/init.q
new file mode 100644
index 0000000..befb656
--- /dev/null
+++ b/di/analytics/init.q
@@ -0,0 +1,3 @@
+\l ::analytics.q
+
+export:([ffill:ffill;ffillzero:ffillzero;intervals:intervals;pivot:pivot;rack:rack])
diff --git a/di/analytics/test.csv b/di/analytics/test.csv
new file mode 100644
index 0000000..f9e483e
--- /dev/null
+++ b/di/analytics/test.csv
@@ -0,0 +1,101 @@
+action,ms,bytes,lang,code,repeat,minver,comment
+before,0,0,q,analytics:use`analytics,1,1,load module into session
+run,0,0,q,N:20;prob:0.2,1,,initialize testing parameters
+run,0,0,q,zerotable:table:`time xasc ([]time:N?.z.P;sym:N?`AMD`AAPL`MSFT`IBM;ask:N?100f;bid:N?100f;asize:N?500i;bsize:N?500i;ex:N?`NYSE`CME`LSE),1,,define testing table
+run,0,0,q,update ask:?[prob>N?1f;0n;ask] from `table,1,,add null values in testing table
+run,0,0,q,update bid:?[prob>N?1f;0n;bid] from `table,1,,add null values in testing table
+run,0,0,q,update asize:?[prob>N?1f;0n;asize] from `table,1,,add null values in testing table
+run,0,0,q,update bsize:?[prob>N?1f;0n;bsize] from `table,1,,add null values in testing table
+run,0,0,q,update ex:?[prob>N?1f;`;ex] from `table,1,,add null values in testing table
+run,0,0,q,update ask:?[prob>N?1f;0;ask] from `zerotable,1,,add zero values in testing table
+run,0,0,q,update bid:?[prob>N?1f;0;bid] from `zerotable,1,,add zero values in testing table
+run,0,0,q,update asize:?[prob>N?1f;0;asize] from `zerotable,1,,add zero values in testing table
+run,0,0,q,update bsize:?[prob>N?1f;0;bsize] from `zerotable,1,,add zero values in testing table
+
+run,0,0,q,t1:analytics.ffill[`table`keycols!(table;`ask`bid)],1,,fill ask and bid columns
+true,0,0,q,0=(sum/)(null 1_t1`ask;null 1_t1`bid),1,,verify no missing values in the columns except first row
+true,0,0,q,(sum 1_null table`ask)=(count t1`ask)-(count distinct t1`ask),1,,verify number of filled matches with missing values
+true,0,0,q,(sum 1_null table`bid)=(count t1`bid)-(count distinct t1`bid),1,,verify number of filled matches with missing values
+
+run,0,0,q,t2:analytics.ffill[`table`by!(table; `sym)],1,,fill by sym
+run,0,0,q,nullcols:cols[t2] where any each null t2 cols[t2],1,,get null columns
+run,0,0,q,"symlist:differ exec sym from t:`sym xasc ?[`t2;();0b;(`sym,nullcols)!(`sym,nullcols)]",1,,get sorted symbol list
+true,0,0,q,"symlist~symlist | (|/) null t (nullcols except `sym)",1,,verify all nulls appears on the first occurrence of each sym value, i.e. no prior value to forward fill with
+
+run,0,0,q,t3:analytics.ffill[`table`by`keycols!(table; `sym;`asize)],1,,fill asize column only by sym
+run,0,0,q,symlist:differ exec sym from t:`sym xasc ?[`t3;();0b;(`sym`asize)!(`sym`asize)],1,,get sorted symbol list
+true,0,0,q,symlist~symlist | null t `asize,1,,verify all nulls appears on the first occurrence of each sym value, i.e. no prior value to forward fill with
+
+run,0,0,q,t4:analytics.ffill[table],1,,fill whole table(note first row will not get filled as no prior value)
+true,0,0,q,0b~(any/) null 1_t4,1,,verify all null values filled
+
+run,0,0,q,nestedtable:([]sym:`a`b`c`a;exch:("LDN";"";"HK";"");v: (12 39;9 9;0N;1 2)),1,,define testing table with nested values
+run,0,0,q,t4:analytics.ffill[nestedtable],1,,fill whole table
+true,0,0,q,0b~1b in (raze/) value flip null t4,1,, verify all null values filled
+true,0,0,q,(exec v from t4 where sym=`b)~(exec v from t4 where sym=`c),1,,verify nested values is filled forward
+true,0,0,q,(raze exec exch from t4)~("LDNLDNHKHK"),1,,verify nested values is filled forward
+
+run,0,0,q,t5:analytics.ffill[`table`by`keycols!(nestedtable; `sym;`exch)],1,,fill by sym
+true,0,0,q,(raze (exec distinct exch from t5 where sym=`a;exec distinct exch from t5 where sym=`b))~("LDN";""),1,,verify exch column filled where sym=`a and not filled for sym=`b since no prior value
+true,0,0,q,null first exec v from t5 where sym=`c,1,,verify v column is not filled as keycols is exch only
+
+run,0,0,q,t6:analytics.ffillzero[`table`by`keycols!(zerotable; `sym;`bid`ask)],1,,fill zero values by sym
+run,0,0,q,symlist:differ exec sym from t:`sym xasc ?[`t6;();0b;(`sym`bid`ask)!(`sym`bid`ask)],1,,get sorted symbol list
+true,0,0,q,symlist~symlist | 0= t `bid,1,,verify all zeros appears on the first occurrence of each sym value, i.e. no prior value to forward fill with
+true,0,0,q,symlist~symlist | 0= t `ask,1,,verify all zeros appears on the first occurrence of each sym value, i.e. no prior value to forward fill with
+
+run,0,0,q,params:`start`end`interval`round!(09:32;12:00;00:30;0b),1,,define input dictionary
+true,0,0,q,(09:32 10:02 10:32 11:02 11:32)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,params:`start`end`interval`round!(09:32;12:00;00:30;1b),1,,define input dictionary
+true,0,0,q,(09:30 10:00 10:30 11:00 11:30 12:00)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,params:`start`end`interval!(09:32;12:00;00:30),1,,define interval dictionary
+true,0,0,q,(09:30 10:00 10:30 11:00 11:30 12:00)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,params:`start`end`interval!(2001.04.07;2001.05.01;5),1,,define input dictionary
+true,0,0,q,(2001.04.05 2001.04.10 2001.04.15 2001.04.20 2001.04.25 2001.04.30)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,params:`start`end`interval`round!(2001.04.07;2001.05.01;5;0b),1,,define input dictionary
+true,0,0,q,(2001.04.07 2001.04.12 2001.04.17 2001.04.22 2001.04.27)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,params:`start`end`interval`round!(2001.04.07;2001.05.01;5;0b),1,,define input dictionary
+true,0,0,q,(2001.04.07 2001.04.12 2001.04.17 2001.04.22 2001.04.27)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,params:`start`end`interval!(00:01:00.000000007;00:05:00.000000001;50000000000),1,,define input dictionary
+true,0,0,q,(0D00:00:50.000000000 0D00:01:40.000000000 0D00:02:30.000000000 0D00:03:20.000000000 0D00:04:10.000000000 0D00:05:00.000000000)~analytics.intervals[params],1,,,verify intervals generated as specified
+
+run,0,0,q,t:([]sym:`a`b`a;exch:`nyse`nyse`cme;price:1 2 3),1,,define testing table
+run,0,0,q,dic:`table`keycols!(t;`sym`exch),1,,define input dictionary
+true,0,0,q,([]sym:`a`b`a;exch:`nyse`nyse`cme)~analytics.rack[dic],1,,verify table is unaltered with input dictionary
+
+run,0,0,q,dic:`table`keycols`timeseries`fullexpansion!(t;`sym`exch;`start`end`interval!09:00 12:00 01:00;1b),1,,define input dictionary
+run,0,0,q,res:analytics.rack[dic],1,,call rack function on input
+true,0,0,q,(16~count res)& (98h=type res),1,,verify output as a table with expected number of rows
+true,0,0,q,(&/) raze (`a`b in res`sym ; (`nyse`cme) in res`exch; (09:00 10:00 11:00 12:00) in res`interval),1,, verify output contains all possible values
+
+run,0,0,q,dic:`table`keycols`timeseries`base`fullexpansion!(t;`sym`exch;`start`end`interval!00:00:00 02:00:00 00:30:00;flip (enlist`base)!enlist `buy`sell`buy`sell;1b),1,,define input dictionary
+run,0,0,q,res:analytics.rack[dic],1,,call rack function on input
+true,0,0,q,(80~count res)& (98h=type res),1,,verify output as a table with expected number of rows
+true,0,0,q,(&/) raze (`a`b in res`sym ; (`nyse`cme) in res`exch; (00:00:00 00:30:00 01:00:00 01:30:00 02:00:00) in res`interval),1,, verify output contains all possible values
+
+run,0,0,q,dic:`table`keycols`timeseries!(t;`sym`exch;`start`end`interval!00:00:00 02:00:00 00:35:00),1,,define input dictionary without fullexpansion parameter
+run,0,0,q,res:analytics.rack[dic],1,,call rack function on input
+true,0,0,q,(12~count res)& (98h=type res),1,,verify output as a table with expected number of rows
+true,0,0,q,(&/) raze (`a`b in res`sym ; (`nyse`cme) in res`exch; (00:00:00 00:35:00 01:10:00 01:45:00) in res`interval),1,, verify output contains all possible values without fullexpansion which has default value 0b
+
+run,0,0,q,quote:([]date:2001.01.05;sym:N?`6;time:N?.z.t;side:N?`A`B;level:N?(0;2;1;4);price:N?100f;size:N?100i),1,,define table for pivot
+run,0,0,q,args:(`table`by`piv`var)!(quote;`sym;`side;`size),1,,define input dictionary
+run,0,0,q,res:analytics.pivot args,1,,call pivot function on input
+true,0,0,q,"(99h=type res)&(((),args`by)~cols key res)&((`size_A`size_B)~cols value res)",1,,verify output in the correct format
+true,0,0,q,(asc (raze value flip value res) except 0n)~(asc raze quote args`var),1,,verify pivoted values conform to original data
+
+run,0,0,q,args:(`table`by`piv`var)!(quote;`date`sym`time;`level;`price),1,,define input dictionary
+run,0,0,q,res:analytics.pivot args,1,,call pivot function on input
+true,0,0,q,"(99h=type res)&(((),args`by)~cols key res)&(`price_0`price_1`price_2`price_4~cols value res)",1,,verify output in the correct format
+true,0,0,q,(asc (raze value flip value res) except 0n)~(asc raze quote args`var),1,,verify pivoted values conform to original data
+
+run,0,0,q,args:(`table`by`piv`var)!(quote;`date`sym;`side`level;`price`size),1,,define input dictionary
+run,0,0,q,res:analytics.pivot args,1,,call pivot function on input
+true,0,0,q,"(99h=type res)&(((),args`by)~cols key res)&(((count distinct (,/') flip string quote args`piv)*count args`by)~count cols value res)",1,,verify output in the correct format
+true,0,0,q,(asc (raze value flip value res) except 0n)~(asc raze quote args`var),1,,verify pivoted values conform to original data