I don't know if I was fired or not, but I don't see much hope of getting paid, so just for shits and grins I decided to abandon Spark, and even Pandas, for a couple of days and see if I could duplicate the client's scripts with some simple Python using the csv module. Turns out it wasn't that difficult. It's not fast yet, but its output is equivalent to that of the Pandas scripts it was based on, except that it doesn't add the gigabytes of duplicated rows the original scripts do (a result of duplicated rows in the input data).
And it's not a memory hog.
I believe I can make it fast, too. I haven't really started optimizing yet.
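For anyone curious what the general idea looks like, here's a minimal sketch, not the actual client code. It assumes the duplicates are exact repeats of whole rows, and the name dedupe_rows and the sample data are made up for illustration. It streams the input one row at a time with the csv module and keeps only the set of rows already seen, so memory grows with the number of unique rows rather than with the size of the whole file.

```python
import csv
import io


def dedupe_rows(infile, outfile):
    """Copy CSV rows from infile to outfile, skipping exact duplicates.

    Hypothetical helper: returns the number of unique rows written.
    """
    reader = csv.reader(infile)
    writer = csv.writer(outfile, lineterminator="\n")
    seen = set()  # holds one tuple per unique row seen so far
    for row in reader:
        key = tuple(row)  # lists aren't hashable, tuples are
        if key in seen:
            continue  # exact duplicate of an earlier row, skip it
        seen.add(key)
        writer.writerow(row)
    return len(seen)


# Made-up sample data; real use would pass open file objects instead.
source = io.StringIO("id,name\n1,ann\n2,bob\n1,ann\n")
target = io.StringIO()
count = dedupe_rows(source, target)
print(count)
print(target.getvalue(), end="")
```

With real files, open them with newline="" as the csv module documentation recommends. If the set of unique rows itself gets too big, storing a hash of each row instead of the row would cut memory further, at the cost of a tiny chance of collisions.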
last updated 2017-03-16 16:01:09. served from tektonic.jcomeau.com