2017-01-30-1242Z


it always happens. once I get into Programming Mode, I start getting all kinds of wild ideas about how to optimize stuff. one of these is related to pipelining CSV output between programs: I'm thinking about ways to reduce the overhead of packing and unpacking the data while keeping each value's type (string, float, integer, etc.) specified explicitly in the encoding.

not that I've done any research to see if the cost of CSV packing and unpacking is even significant... bleah.

one idea is using VLQ (variable-length quantity) encoding, with a tag byte before each quantity indicating its type. table headers would have their own tag byte, as would table rows, so I could probably leave off the type bytes within the row altogether by putting them in the table header, just after each column name. for example, let's say the table header tag is 0xf0 and the table row tag is 0xf1. for column types, string (NUL-terminated) is 0, int is 1, and a float with a scale of 2, a common financial spreadsheet format, is 2; it gets stored as an integer count of hundredths. here's some sample CSV:

id,dept,amount
1,shoes,1.37

and its equivalent in this encoding (thanks to the Wikipedia article on VLQ for giving me the encoding of 137, which is \x81\x09):

\xf0id\x00\x01dept\x00\x00amount\x00\x02\x00  # final \x00 marks end of header
\xf1\x01shoes\x00\x81\x09
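
the byte strings above can be reproduced with a short Python sketch. this is only an illustration of the idea, not a worked-out format: the function and constant names are made up, it assumes non-negative numbers, UTF-8 names and strings with no embedded NUL bytes, and it truncates anything past two decimal places.

```python
# hypothetical sketch of the encoding described in this post; all names are invented
HEADER, ROW = 0xF0, 0xF1          # tag bytes chosen in the post
STRING, INT, FIXED2 = 0, 1, 2     # column type codes chosen in the post

def vlq(n):
    """Encode a non-negative integer as a big-endian VLQ (7 data bits per byte)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = [n & 0x7F]                    # last byte: continuation bit clear
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)   # earlier bytes: continuation bit set
        n >>= 7
    return bytes(reversed(out))

def encode_header(columns):
    """columns is a list of (name, type_code) pairs."""
    out = bytearray([HEADER])
    for name, type_code in columns:
        out += name.encode("utf-8") + b"\x00" + bytes([type_code])
    out += b"\x00"                      # an empty name marks the end of the header
    return bytes(out)

def encode_row(columns, values):
    """values are the CSV fields as text, in column order."""
    out = bytearray([ROW])
    for (_, type_code), text in zip(columns, values):
        if type_code == STRING:
            out += text.encode("utf-8") + b"\x00"
        elif type_code == INT:
            out += vlq(int(text))
        elif type_code == FIXED2:
            # avoid float rounding: build the count of hundredths from the text
            whole, _, frac = text.partition(".")
            out += vlq(int(whole or "0") * 100 + int((frac + "00")[:2]))
        else:
            raise ValueError("unknown type code %r" % type_code)
    return bytes(out)

COLUMNS = [("id", INT), ("dept", STRING), ("amount", FIXED2)]
header = encode_header(COLUMNS)                    # the \xf0... line above
row = encode_row(COLUMNS, ["1", "shoes", "1.37"])  # the \xf1... line above
```

building the scaled amount from the text rather than from float("1.37") * 100 sidesteps binary floating point rounding, which seems in keeping with the financial use case.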

rows need no end marker because the number and types of their fields are known from the header, so a reader always knows where the row stops. a final \xff byte can mark the end of the table.
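
to check that a reader really can find the row boundaries without an end marker, here's a rough decoder sketch in Python. again the names are invented, and it assumes the same conventions as the example: NUL-terminated UTF-8 strings, big-endian VLQ for numbers, and type code 2 meaning an integer count of hundredths.

```python
import io

def read_vlq(stream):
    """Read one big-endian VLQ integer from a binary stream."""
    n = 0
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("truncated VLQ")
        n = (n << 7) | (b[0] & 0x7F)
        if not b[0] & 0x80:          # continuation bit clear: last byte
            return n

def read_cstring(stream):
    """Read a NUL-terminated UTF-8 string from a binary stream."""
    out = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            raise EOFError("truncated string")
        if b == b"\x00":
            return out.decode("utf-8")
        out += b

def decode_table(data):
    """Return (columns, rows) from one encoded table (hypothetical helper)."""
    stream = io.BytesIO(data)
    if stream.read(1) != b"\xf0":
        raise ValueError("expected header tag 0xf0")
    columns = []
    while True:
        name = read_cstring(stream)
        if name == "":               # empty name: end of header
            break
        type_byte = stream.read(1)
        if not type_byte:
            raise EOFError("truncated header")
        columns.append((name, type_byte[0]))
    rows = []
    while True:
        tag = stream.read(1)
        if tag == b"\xff":           # end of table
            return columns, rows
        if tag != b"\xf1":
            raise ValueError("expected row tag 0xf1 or end tag 0xff")
        row = []
        for _, type_code in columns: # the header says how many fields to read
            if type_code == 0:
                row.append(read_cstring(stream))
            elif type_code == 1:
                row.append(read_vlq(stream))
            elif type_code == 2:
                row.append("%d.%02d" % divmod(read_vlq(stream), 100))
            else:
                raise ValueError("unknown type code %r" % type_code)
        rows.append(row)

data = (b"\xf0id\x00\x01dept\x00\x00amount\x00\x02\x00"
        b"\xf1\x01shoes\x00\x81\x09"
        b"\xff")
columns, rows = decode_table(data)   # rows == [[1, "shoes", "1.37"]]
```

one consequence of using an empty name as the header terminator is that a column can't have an empty name, which seems like an acceptable restriction.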

last updated 2017-01-30 16:47:42. served from tektonic.jcomeau.com