Archive for the ‘lisp’ Category

Using the email library Mel with CL

Tuesday, March 30th, 2010

The mel library, if you aren’t familiar with it, is an extremely nice library for reading and parsing email from various sources. I started using it as part of a project to pull emails into my central search store. I was surprised by how well done this library is, although I admit the documentation is a bit rough around the edges.

So first, if you want to install this you can either install via asdf, likely clbuild, or just grab the darcs repository. Personally I had problems with the asdf install so I went with the stable darcs repo instead. For my install I’ve done the following:

cd /usr/lib/sbcl/site
darcs clone http://common-lisp.net/project/mel-base/darcs/mel-base/
cd ../site-systems
ln -s /usr/lib/sbcl/site/mel-base/mel-base.asd .

The one problem I’ve found is that there is little documentation out there, but we can build the manual that ships with the source to get a starting point at the very least.

cd /usr/lib/sbcl/site/mel-base/docs/manual
texi2pdf mel.texinfo

You’ll get a mel.pdf if you have texinfo installed.

I will concentrate on Maildir, since that’s what I was starting with. You connect to a Maildir directory with the following:

(defvar *inbox* (make-maildir-folder "/home/dthole/Maildir/"))

From there, you can run quite a few things. The most notable functions are:

messages
map-messages

messages itself, as the docs describe it, returns all the messages. This is fairly helpful, but the real power comes from map-messages. With map-messages, we pass a function that is called on each message as it is processed. Here is some sample code I’ve used to count emails from and to specific addresses:

(defun mapMessageTest ()
  "Count messages from foo@bar.com and messages whose first recipient is bar@foo.com."
  ;; scan is assumed to be cl-ppcre:scan; the other calls come from mel.
  (let ((num-ui-froms 0)
        (num-to-greg 0))
    (map-messages #'(lambda (x)
                      ;; x is a mel message object that we can call methods on.
                      (when (scan "foo@bar.com" (address-spec (from x)))
                        (incf num-ui-froms))
                      ;; (to x) may be a single address or a list of addresses.
                      (when (and (to x)
                                 (scan "bar@foo.com"
                                       (address-spec (if (listp (to x))
                                                         (car (to x))
                                                         (to x)))))
                        (incf num-to-greg)))
                  *inbox*)
    (format t "Num from foo@bar.com: ~a~%" num-ui-froms)
    (format t "Num to bar@foo.com: ~a~%" num-to-greg)))

So what I’m doing here is passing an anonymous function that performs two conditional checks. x in this case is a message object (a CLOS instance) from mel. We can call (from), (to), (message-string) and so on. Many of these aren’t documented, but the code is there. Anyway, the one curious part people may notice is the (address-spec (if ..)) section. The to field can contain a list of elements, since an email can have multiple recipients. I check whether it’s a list; if so I take the first element, otherwise I use the value as it is. All the function does is keep two counters and output the results at the end.
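The list-or-single-value check can be pulled out into a small helper. This is only a sketch in plain Common Lisp: the name first-or-self is made up for illustration and is not part of mel.

```lisp
;; Hypothetical helper, not part of mel: normalizes a value that may be
;; either a single item or a list of items.
(defun first-or-self (value)
  "Return the first element when VALUE is a list, otherwise VALUE itself.
NIL is returned unchanged, because (car nil) is NIL."
  (if (listp value)
      (car value)
      value))

;; Inside the lambda above it could be used like this:
;;   (address-spec (first-or-self (to x)))
```

Note that this still only looks at the first recipient, just like the original code.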

Another function I mentioned briefly a bit ago is (message-string). There are a lot of message-* functions out there; this one returns the text of the email.

One big reason why I enjoy this library is the caching it does. In my testing, going through 14k emails in my Maildir folder took about 10 seconds or so. Subsequent calls to (mapMessageTest) were MUCH quicker, taking less than a second. There’s some interesting caching in the library that’s really helpful while in the REPL. I haven’t looked at the code for this yet, but I’m excited to do it.

My reason for using this library is to do some recognition on the emails and copy them to another area on disk. That discussion, covering documentation and search, will come in a later blog article.

Parsing dirty data in Common Lisp

Saturday, March 27th, 2010

I came across a bit of an issue today while building a parser in Common Lisp. I had saved my archive folder in Outlook to one giant text file, and I wanted to parse out each individual email (there are over 8000) and save each one to its own file on the hard disk. Later I would integrate this with fetchmail to pull new emails into the same folder structure, which would then be indexed by DevonThink as an external folder. It’s a bit of a long story, and I plan to write more about it, but for now, what about trying to parse dirty data in Common Lisp?

There are a few options for cleaning up dirty data. I came across two different options for what I was doing:

1. Just pull the readable data – using NIL for the other bits of data
The advantage of doing this is that it’s really fast. You make a slightly different call to (with-open-file) and effectively skip the data that’s not readable. The main disadvantage is that your data won’t be as close to the original as you may have wanted. Bullet points, for example, could have been translated to a -. This method, though, leaves them as NIL, or empty. For my case this was OK: I didn’t really care about the translation of this bit of data, since I was interested more in the overall theme of the email than in the specific formatting.

To accomplish this, you can use something like the following:

(with-open-file (stream parse-file :external-format :latin1)
….)

Thanks to nikodemus on Freenode for this information.
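To make that a bit more concrete, here is a minimal sketch of reading a whole file line by line with that external format. The function name read-lines-latin1 is made up for this example, and I’m assuming SBCL, where :latin1 is an accepted external format name. Latin-1 assigns a character to every byte value, so the read itself should not stop on odd bytes; what those bytes look like afterwards depends on the file.

```lisp
;; Hypothetical helper for illustration; assumes SBCL, which accepts
;; :latin1 as an external format name.
(defun read-lines-latin1 (path)
  "Return the lines of the file at PATH as a list of strings, decoded as Latin-1."
  (with-open-file (stream path :external-format :latin1)
    ;; read-line returns NIL at end of file because of the (nil nil) arguments.
    (loop for line = (read-line stream nil nil)
          while line
          collect line)))

;; Usage, with a path of your own:
;;   (length (read-lines-latin1 "/path/to/archive.txt"))
```

Collecting every line into a list is fine for a one-off job, but for a really large file you may prefer to process each line inside the loop instead.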

2. Clean up the data

Emacs gives a fairly nice way of handling this, well, kinda. When you load the questionable file, you can type C-x RET f and set the file encoding. I used utf-8-unix at first. From there, save the file. You should be presented with a warning saying that some stuff can’t be encoded with that coding system, blah, blah, blah. You can see a listing, a partial one anyway, of the characters it’s complaining about. Cancel the save with C-g, switch to the warning buffer with C-x o and copy each individual character (C-space, right arrow, M-w). You can either hit enter at this point to view the first occurrence of that character, or you can go to the original buffer and search. Once you’ve decided what that character should become, simply hit M-< to go to the beginning, type M-x replace-string, hit C-y to insert the character, hit enter, then type the substitution character of your choice. It’ll replace all occurrences in that buffer with what you want. From there, rinse and repeat for the others.

The obvious disadvantage of this is it’ll take much longer to accomplish the task. The advantage is that you’ll end up with a sane file in the end. I started with this method, but went with method 1 in the end.
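If you would rather do the same substitutions from Lisp than by hand in Emacs, a small replacement table works too. This is only a sketch: the table contents and the names *replacements* and clean-string are made up for the example, and the character codes assume the file was written as Windows-1252 (where byte #x95 is a bullet) and read in as Latin-1.

```lisp
;; Hypothetical replacement table: each entry maps an unwanted character
;; to the character that should take its place.
(defparameter *replacements*
  (list (cons (code-char #x95) #\-)       ; Windows-1252 bullet read as Latin-1
        (cons (code-char #xA0) #\Space))  ; non-breaking space
  "Alist of (bad-character . replacement-character) pairs.")

(defun clean-string (string &optional (table *replacements*))
  "Return a copy of STRING with every character found in TABLE replaced."
  (map 'string
       (lambda (ch)
         (let ((entry (assoc ch table)))
           (if entry (cdr entry) ch)))
       string))

;; Usage: (clean-string some-line) on each line read from the file.
```

You would still need to look at the data once to decide what belongs in the table, which is the slow part of this method either way.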

The one part I couldn’t figure out, and I’ll likely post an update once I get this answered, is what happens when you try to save the buffer while overriding the coding system: it gets saved as raw-text-unix, regardless of what I picked. Given an override, the warning states that I’d just lose those characters, which I was OK with. I’ll try to find out more and post later.

CL-SQL fun with wordnet

Friday, November 27th, 2009

Common Lisp surprises me more and more each and every day. I never thought of using it for SQL interaction, but the adventures of CL-SQL for my linguistics class lately really opened my eyes.

CL-SQL is an ORM, like ActiveRecord, Django’s ORM, and so on. It leans a little more toward the SQLAlchemy side of things, though. I will likely provide a much more detailed tutorial later once I get a general understanding of this library, but for now I’ll provide a few links on where to find more information:

CLSQL Main Site: http://clsql.b9.com/
Script I learned a lot from: http://www.boundp.net/files/weblog.lisp
Github post of my wordnet stuff: http://gist.github.com/244360

Now for a bit of information about wordnet. I’m currently in a computational linguistics course at the university, and part of the class is to develop a project of sorts. Mine relates to the use of wordnet, which is a fairly complicated database. Basically, wordnet was developed at Princeton and holds a fairly large set of words along with their definitions, synonyms, hypernyms, hyponyms, and so on. In short, it shows a lot of the ways that one word relates to another. I found a web site that had the MSSQL files for Wordnet, which I downloaded and attached. Along with that came a number of example views, whose code I modified and added to my own listing of views. Since I do my development, most of the time, on Linux, I decided to get ODBC set up to the MSSQL server. CL-SQL works beautifully with this setup. The gist above shows the details.

For the SQL Server database I got, you can view the information here: http://opensource.ebswift.com/WordNetSQLServer/

I will likely provide more information as I go with this, as I’m finding wordnet more and more fun to mess with.