Parsing dirty data in Common Lisp
I came across a bit of an issue today when I was building a parser in Common Lisp. Basically when I saved my archive folder in Outlook to a giant text file, I would like to parse through each individual email (there are over 8000), and save them to an individual file, on the hard disk. Later I would integrate this with an option with fetchmail to pull in those emails as well into the same folder structure which would then be indexed by DevonThink as an external folder. A bit of a long story, and I plan to write more about it – but for now, what about trying to parse dirty data in common lisp?
There are a few options about how to clean up dirty data. I came across two differnet options with what I was doing:
1. Just pull the readable data – using NIL for the other bits of data
The advantage of doing this is that it’s really fast. You just make a little different call to (with-open-file) and you effectively skip the data that’s not readable. There are a lot of disadvantages to this approach, mainly your data isn’t going to be near like you may have wanted it to be originally. Bullet points, for example, could be translated to a -. This method, though, will make it NIL, or empty. For my case this was OK, I didn’t really care about the translation of this bit of data – I was interested more in the overall theme of the email rather than the specific formatting.
To accomplish this, you can make use something like the following:
(with-open-file (stream parse-file :external-format :latin1)
….)
Thanks to nikodemus on Freenode for this information.
2. Clean up the data
Emacs gives a fairly nice way of handling this, well kinda. When you load the questionable file, you can type C-x RET f, and set the file encoding. I used utf-8-unix at first. Form that, save the file. You should be presented with a warning saying that some stuff can’t be encoded with that file system, blah blah, blah. You can see a listing, a minor list anyways, of characters that it’s complaining about. Cancel the save with C-g, switch to the warning buffer with C-x o and copy each individual character (C-space right arrow C-w). You can either hit enter at this point to view the first occurance of that character, or you can go to the original buffer and search. Once you made your determination of what that encoding char should be, simply hit M-< to go to the beginning, type M-x string-replace, C-y to insert that character, hit enter, then your substitution character of your wish. It’ll replace all occurances in that buffer with what you want. From there, rinse and repeat for the others.
The obvious disadvantage of this is it’ll take much longer to accomplish the task. The advantage is that you’ll end up with a sane file in the end. I started with this method, but went with method 1 in the end.
The one part I couldn’t figure out how to do, and I’ll likely post an update once I get this answered is when you’re trying to save the buffer with overriding the encoding system – it saves it as raw-text-unix, regardless of what I picked. Given an override, the warning states that I’d just lose those characters, which I was OK with. I’ll try to find out more and post later.





