I'm still playing around with my pseudo-database. Basically it can now load data in from its disk persistence, run a query ("select __rownum from table where col1 = 'GBP'") and then use the results of that first subquery to give a useful response ("select custname from table where __rownum in (x, y, z)").
Columns currently have two types - integer or string - and string columns can have a specified length. This allows me to emulate most of the datatypes supported by SQL. Each type of column can currently 'decide' how to store its data, and that's (still) where things get complicated.
In order to do lookups, I basically need two functions per column type - getRownumsByValue and getValuesByRownum. In other words, I need to be able to do lookups by both key and value. Lookup in an array by id (rownum) is easy to code - it's just array[id] - and the result is found very fast.
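As a minimal sketch (the function names mirror the ones above, but the bodies are my assumption about the plain-array layout, where the rownum is just the list index):

```python
def get_values_by_rownum(column, rownums):
    """Rownum -> value: direct indexing, cheap per row."""
    return [column[r] for r in rownums]

def get_rownums_by_value(column, value):
    """Value -> rownums: has to walk the whole array."""
    return [r for r, v in enumerate(column) if v == value]

column = ["GBP", "USD", "GBP", "EUR"]
print(get_rownums_by_value(column, "GBP"))   # -> [0, 2]
print(get_values_by_rownum(column, [0, 2]))  # -> ['GBP', 'GBP']
```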
However, doing lookups by value this way means walking the whole array, which doesn't scale. I can store the information differently: a dictionary where each key is a string value, and the associated value is a list of rownums. In effect I convert the list of key=value pairs into value=keys. Doing a lookup by value is now very fast (a dictionary is a hash table), but doing a lookup by rownum is now slow again...
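A sketch of that inverted layout (again, the names are mine, not the actual implementation):

```python
def invert(column):
    """Build a dict of value -> list of rownums from a plain array."""
    index = {}
    for rownum, value in enumerate(column):
        index.setdefault(value, []).append(rownum)
    return index

index = invert(["GBP", "USD", "GBP", "EUR"])
print(index["GBP"])  # -> [0, 2]: lookup by value is a single hash probe

# But lookup by rownum now means scanning every list in the dict:
def value_of(index, rownum):
    for value, rownums in index.items():
        if rownum in rownums:
            return value
```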
Dictionaries do have the advantage of compressing the data (by interning the strings, as it were). It is, however, rather impressive how much memory a list of row numbers can take up: once you have more than 65536 rows in your column, each rownumber takes 4 bytes, which adds up quickly. Even with the compression of the data, a string column ends up using around twice as much memory as the original file of strings.
Of course, the fastest lookups would mean storing the data in both key=value and value=key form.
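A sketch of what that dual layout could look like (the class and method names are hypothetical, echoing the two functions mentioned earlier):

```python
class StringColumn:
    """Keeps both layouts so both directions of lookup are fast,
    at the cost of storing the data (roughly) twice."""

    def __init__(self, values):
        self.by_rownum = list(values)   # rownum -> value
        self.by_value = {}              # value -> [rownums]
        for rownum, value in enumerate(values):
            self.by_value.setdefault(value, []).append(rownum)

    def get_rownums_by_value(self, value):
        return self.by_value.get(value, [])

    def get_values_by_rownum(self, rownums):
        return [self.by_rownum[r] for r in rownums]

# The two-step query from the top of the post, in miniature:
currency = StringColumn(["GBP", "USD", "GBP", "EUR"])
custname = StringColumn(["Smith", "Jones", "Brown", "Green"])
rows = currency.get_rownums_by_value("GBP")
print(custname.get_values_by_rownum(rows))  # -> ['Smith', 'Brown']
```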
Not the most memory-efficient way of doing things, but very fast: a lookup by value in 1000000 lines takes 0ms to find 333333 matching lines, and looking up those 333333 rownums to fetch another 333333 values takes 16ms. In other words, "select custname from table where currency = 'GBP'" runs in just under 20ms on a table of 1M lines... I'm quite happy with that.