Parsing dirty data in Common Lisp

March 27th, 2010

I ran into a bit of an issue today while building a parser in Common Lisp. I had saved my Outlook archive folder to one giant text file, and I wanted to parse out each individual email (there are over 8,000) and save each one to its own file on the hard disk. Later I would like to integrate this with fetchmail so that new mail is pulled into the same folder structure, which DevonThink would then index as an external folder. It’s a bit of a long story, and I plan to write more about it, but for now: what about parsing dirty data in Common Lisp?

There are a few ways to clean up dirty data. I came across two different options for what I was doing:

1. Just pull the readable data and give up on the rest
The advantage of doing this is that it’s really fast. You make a slightly different call to with-open-file and the reader stops choking on the bytes it couldn’t decode. The disadvantage is that your data won’t come out the way you may originally have wanted. A bullet point, for example, could have been translated to a “-”. With this method it isn’t translated at all, so for practical purposes that character is lost. In my case this was OK. I didn’t care about the translation of that bit of data, since I was more interested in the overall theme of each email than in its specific formatting.

To accomplish this, you can use something like the following:

(with-open-file (stream parse-file :external-format :latin1)
  ...)

Thanks to nikodemus on Freenode for this information.
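
To see the trade-off outside of Lisp, here is a small Python sketch of the same idea. The sample bytes are made up, and treating the stray byte as a Windows-1252 bullet is an assumption about what an Outlook export might contain. Strictly speaking, a Latin-1 decoder does not skip anything: it maps every byte value to some character, which is why the read can’t fail, but bytes that really belonged to another encoding come through as the wrong characters.

```python
# Made-up sample: ASCII text with a Windows-1252 bullet byte (0x95) in it.
raw = b"Agenda:\n\x95 budget\n\x95 hiring\n"

# Strict UTF-8 decoding refuses the stray byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Latin-1 has a character for every byte value, so this can never fail.
# The bullet survives only as an unprintable control character.
as_latin1 = raw.decode("latin-1")

# Alternative: decode as UTF-8 and silently drop whatever does not decode.
as_dropped = raw.decode("utf-8", errors="ignore")

# The "proper" route, if the real encoding is known (assumed cp1252 here):
# decode correctly, then translate the bullet to a dash.
as_cp1252 = raw.decode("cp1252").replace("\u2022", "-")
```

Whether the mangled characters matter depends on the job. For skimming the general content of a pile of emails they mostly don’t.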

2. Clean up the data

Emacs gives a fairly nice way of handling this, well, kinda. Load the questionable file, type C-x RET f, and set the file’s coding system. I used utf-8-unix at first. From there, save the file. You should be presented with a warning saying that some characters can’t be encoded with that coding system, along with a (short) list of the characters it’s complaining about. Cancel the save with C-g, switch to the warning buffer with C-x o, and copy one of the characters (C-space, right arrow, M-w). You can either hit enter on the character to jump to its first occurrence, or go to the original buffer and search for it. Once you’ve decided what that character should become, hit M-< to go to the beginning of the buffer, type M-x replace-string, press C-y to insert the character, hit enter, then type the substitute you want and hit enter again. That replaces every occurrence in the buffer. From there, rinse and repeat for the others.

The obvious disadvantage of this is it’ll take much longer to accomplish the task. The advantage is that you’ll end up with a sane file in the end. I started with this method, but went with method 1 in the end.
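
The Emacs loop above (list the offending characters, pick a replacement for each, replace everywhere) can also be scripted. This is only a Python sketch with a made-up sample string and replacement table; the function name non_ascii_report is my own, and a real run would read the exported file instead.

```python
from collections import Counter

def non_ascii_report(text):
    """Count each non-ASCII character, much like the list in the Emacs warning."""
    return Counter(ch for ch in text if ord(ch) > 127)

# Made-up sample containing an accented letter, two bullets and an en dash.
text = "caf\u00e9 \u2022 one \u2022 two \u2013 done"

# Step 1: see which characters are causing trouble, and how often.
report = non_ascii_report(text)

# Step 2: decide on a substitute for each one (hand-picked, as in Emacs).
replacements = {"\u2022": "-", "\u2013": "-", "\u00e9": "e"}

# Step 3: replace every occurrence in one pass.
cleaned = text.translate(str.maketrans(replacements))
```

Any character left out of the table stays in the text, so rerunning the report on the cleaned string shows what still needs a decision.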

The one part I couldn’t figure out (and I’ll likely post an update once I get an answer) is that when I tried to save the buffer while overriding the coding system, it was saved as raw-text-unix regardless of what I picked. The warning for the override states that I’d simply lose those characters, which I was OK with. I’ll try to find out more and post later.

Compiling Ruby on Arch

March 24th, 2010

There are a few ways of dealing with Ruby/Rails on Arch Linux. There is an AUR entry for Ruby 1.8, while the default in pacman is 1.9. At my office we were stuck with 1.8 for the most part, so I decided to install from source. (AUR may have been easier, but I don’t know it very well, so I didn’t use it this time, though I may later. I do like being frozen at this version for work purposes…)

Installing it the way I did is fairly simple. All you need to do is:

wget ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.7-p249.tar.bz2
tar -xjvf ruby-1.8.7-p249.tar.bz2
cd ruby-1.8.7-p249
./configure --prefix=/usr --enable-shared --enable-pthread
make && make install

For the prefix, I know it’s customary to keep things in /usr/local. I didn’t do that because I already had a lot of gems installed (as well as rubygems itself), and they were in that location from the Arch install that I did for 1.9. If you prefer /usr/local, just change the --prefix option to configure above.

The tricky part, which I ran into over and over, was the --enable-shared and --enable-pthread options. Without those, our rake test task would randomly die with an error such as:

note: ruby[5998] exited with preempt_count 1
BUG: scheduling while atomic: ruby/5998/0x10000002
Modules linked in: ipv6 vmsync vmmemctl vmblock vmhgfs ext2 mbcache snd_seq_dummy fan snd_seq_oss snd_seq_midi_event snd_seq snd_ens1371 snd_pcm_oss gameport snd_rawmidi snd_seq_device snd_ac97_codec ac97_bus snd_pcm snd_mixer_oss parport_pc uhci_hcd pcnet32 snd_timer battery vmxnet ppdev ehci_hcd snd soundcore snd_page_alloc mii usbcore container ac shpchp pci_hotplug i2c_piix4 intel_agp lp i2c_core processor button thermal psmouse pcspkr parport serio_raw evdev sg rtc_cmos rtc_core rtc_lib reiserfs sr_mod cdrom pata_acpi ata_generic sd_mod floppy ata_piix libata mptspi mptscsih mptbase scsi_transport_spi scsi_mod
Pid: 5998, comm: ruby Tainted: G D W 2.6.32-ARCH #1
Call Trace:
[] ? thread_return+0x666/0x7ae
[] ? __cond_resched+0x1d/0x30
[] ? _cond_resched+0x2e/0x40
[] ? unmap_vmas+0x8e1/0xaa0
[] ? vt_console_print+0x7c/0x330
[] ? exit_mmap+0xc6/0x1d0
[] ? mmput+0x32/0xf0
[] ? exit_mm+0xfa/0x140
[] ? do_exit+0x136/0x7c0
[] ? printk+0x40/0x45
[] ? release_console_sem+0x1b0/0x200
[] ? oops_end+0xa3/0xf0
[] ? no_context+0xfa/0x260
[] ? HgfsDirOpen+0x0/0x30 [vmhgfs]
[] ? page_fault+0x25/0x30
[] ? task_rq_lock+0x3a/0xa0
[] ? try_to_wake_up+0x58/0x330
[] ? __mutex_unlock_slowpath+0xa1/0x150
[] ? HgfsDirLlseek+0x98/0xe0 [vmhgfs]
[] ? sys_lseek+0x6e/0x90
[] ? system_call_fastpath+0x16/0x1b
stack segment: 0000 [#10] PREEMPT SMP
...

The line that convinced me of the issue was “scheduling while atomic”, which I took to mean that Ruby was trying to spawn off another process or thread of sorts. I went hunting around the AUR repository and found compile flags that worked: http://aur.archlinux.org/packages/ruby1.8/ruby1.8/PKGBUILD

Overall I was pretty happy that this fix worked. I’m very curious whether this would have been an issue with only one CPU/core, but I’m not willing to disable the extra cores on my virtual machine to find out.

CouchDB on Arch Linux

March 18th, 2010

CouchDB on Arch was a bit of a pain to get working. I wanted to share my thoughts on how I got it to work and hopefully that’ll help solve a few people’s similar problems.

First, I couldn’t find a prebuilt package for Arch (via pacman), so I had to build CouchDB entirely from source to make this work. You can do it with the following steps:

1. Install Pacman Dependencies
pacman -S gcc make erlang extra/icu spidermonkey automake autoconf curl
2. Download the source code from the web site:

http://www.apache.org/dyn/closer.cgi?path=/couchdb/0.10.1/apache-couchdb-0.10.1.tar.gz

3. Unpack the source code: tar -xzvf apache-couchdb-0.10.1.tar.gz
4. cd into the directory and run ./configure --prefix=/
5. Run make and make install

What you’ll notice is that if you leave out the prefix above, things get installed into strange places such as /usr/local/rc.d, which complicates matters when it comes to finding where everything went. If you’ve already tried to install it that way, cd back into the source directory from step 4 and run “make uninstall” to clean it all out first, then rerun configure, make and make install as listed above.

You’ll notice that if you run /etc/rc.d/couchdb start, it won’t exactly work. This is because it wants a couchdb user, so let’s create that:

useradd -s /bin/sh couchdb

But…it’s still not happy! Well, now we have permission issues. First we should fix the permissions:

chown -R couchdb:root /var/log/couchdb
chown -R couchdb:root /var/lib/couchdb

This should get you up and running smoothly, at least with the basics. There was one more thing I ran into: when running ps aux, the paths now have double forward slashes in front of everything. This is harmless as far as I understand, but you can edit /etc/rc.d/couchdb and remove the extra slashes as you see fit.
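
The doubled slashes most likely come from the build joining the prefix / with paths that already start with a slash, and on Linux a path like //etc/couchdb resolves to the same place as /etc/couchdb. Here is a small Python sketch of both points. The helper name collapse_slashes is made up for illustration, and the same-directory check uses a temporary directory rather than the real CouchDB paths.

```python
import os
import re
import tempfile

prefix = "/"                      # the value given to ./configure --prefix
joined = prefix + "/etc/couchdb"  # naive join of the prefix and "/etc/couchdb"

def collapse_slashes(path):
    """Replace any run of slashes with a single slash."""
    return re.sub(r"/{2,}", "/", path)

# On Linux the doubled form names the same directory as the single form.
with tempfile.TemporaryDirectory() as d:
    same = os.path.samefile("/" + d, d)
```

Since both spellings point at the same files, cleaning them up in the rc script is cosmetic.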

Now you should be able to run /etc/rc.d/couchdb start successfully. You should see output similar to the following. If you see only one process and a sleep, then there’s a problem:

[root@tdtdev lib]# ps aux | grep -i 'couchdb'
couchdb 1942 0.0 0.1 13544 1748 pts/3 S 16:38 0:00 /bin/sh -e /bin/couchdb -a \"//etc/couchdb/default.ini\" -a \"//etc/couchdb/local.ini\" -b -r 5 -p //var/run/couchdb/couchdb.pid -o /dev/null -e /dev/null -R
couchdb 1959 0.0 0.0 13544 1012 pts/3 S 16:38 0:00 /bin/sh -e /bin/couchdb -a \"//etc/couchdb/default.ini\" -a \"//etc/couchdb/local.ini\" -b -r 5 -p //var/run/couchdb/couchdb.pid -o /dev/null -e /dev/null -R
couchdb 1960 0.0 1.3 169508 13708 pts/3 Sl 16:38 0:00 /usr/lib/erlang/erts-5.7.3/bin/beam.smp -Bd -K true -- -root /usr/lib/erlang -progname erl -- -home /home/couchdb -noshell -noinput -smp auto -sasl errlog_type error -pa //lib/couchdb/erlang/lib/couch-0.10.1/ebin //lib/couchdb/erlang/lib/mochiweb-r97/ebin //lib/couchdb/erlang/lib/ibrowse-1.5.2/ebin //lib/couchdb/erlang/lib/erlang-oauth/ebin -eval application:load(ibrowse) -eval application:load(oauth) -eval application:load(crypto) -eval application:load(couch) -eval crypto:start() -eval ssl:start() -eval ibrowse:start() -eval couch_server:start([ "//etc/couchdb/default.ini", "//etc/couchdb/local.ini", "//etc/couchdb/default.ini", "//etc/couchdb/local.ini"]), receive done -> done end. -pidfile //var/run/couchdb/couchdb.pid -heart
couchdb 1969 0.0 0.0 3668 480 ? Ss 16:38 0:00 heart -pid 1960 -ht 11

So a total of 4 processes. You should also be able to visit your local URL too:

http://127.0.0.1:5984/
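
If you’d rather check from a script than a browser, something like this Python sketch should do. It assumes CouchDB answers on the default address above and that the root URL returns a small JSON document with couchdb and version keys, which is what I’d expect from 0.10.x. The function names are my own.

```python
import json
from urllib.request import urlopen

def parse_welcome(body):
    """Pull the greeting and version out of CouchDB's root JSON document."""
    doc = json.loads(body)
    return doc.get("couchdb"), doc.get("version")

def check_couchdb(url="http://127.0.0.1:5984/", timeout=5):
    """Fetch the root URL of a running CouchDB and return (greeting, version)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_welcome(resp.read().decode("utf-8"))

# With the server running, call it like this:
#   greeting, version = check_couchdb()
```

If the server isn’t up, urlopen raises an error instead of returning, which is a quick way to tell a startup problem from a configuration problem.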

If you’re still having problems after this, a couple of things I’ve done are to edit /etc/rc.d/couchdb as mentioned above, and to edit /etc/couchdb/default.ini to remove the double slashes there as well.

I hope this helps. It took a while to get this far.

Beginning Clojure

March 17th, 2010

I’ve recently begun to dabble in Clojure, a rising project with a fair bit of drive behind it. In a lot of ways it kinda reminds me of the drive behind Rails, in that the project itself is changing quite rapidly and new people are joining the ranks fairly quickly. First, it’s worth mentioning what Clojure is, in case you haven’t heard of it before. Clojure can be described as a Lisp-like language that sits on top of the JVM. Its purpose is to combine the power Java has in terms of concurrency and libraries with the best features of Common Lisp, without including the “ugly aspects” of Lisp’s history (extremely subjective…)

Common Lisp was pretty much the first language I ever used that I actually fell in love with. There are many things that Common Lisp supports, including a REPL for interactive programming, extremely flexible error handling (restart-case, handler-case), macros, a multi-paradigm style, and so on. I’ve found it extremely difficult to really think in a functional way when programming in PHP, Ruby, Python, and so on. Most of my programming, therefore, was done in Common Lisp because it feels cleaner and simpler. Clojure, though, I’m finding strikes a balance that Lisp has had difficulty reaching when it comes to being approachable for the masses. It includes many of the benefits of Lisp but also gives you the option to call Java code. Clojure also feels a little more on the functional side than Common Lisp in a lot of respects, or at least its presentation lends itself more toward the functional style.

Clojure’s project page can be located here:
http://clojure.org/

Using Clojure is fairly simple and straightforward. There are currently three books (that I’ve seen) for this language that cover syntax, setup, and so on. The one I decided on was Programming Clojure, from the Pragmatic Bookshelf. The book is fairly cheap and of decent quality. You can find it here:

http://www.pragprog.com/titles/shcloj/programming-clojure

Many editors support Clojure development through plugins, and each of them is set up differently. Personally, I settled on Emacs and Slime because that’s what I use for my Common Lisp development, and not having to change development environments is really helpful for me. Emacs and Slime may be a bit of a jump for many people, but there are plenty of other options. NetBeans with Enclojure came well recommended in my searches. TextMate also has a Lisp mode, but it doesn’t have a REPL, so I really recommend against using it for serious development.

Clojure is also fairly simple to set up, as long as you find a good guide. Unfortunately there are many ways to set up Clojure, and at a minimum you need a JDK, ant, and so on, but this may differ significantly depending on your editor of choice. Emacs 23.x with Slime offers an automated install through ELPA. To be honest, and I really hated this option, setting everything up through ELPA was probably the simplest route. Your jar files are downloaded automagically to the ~/.swank-clojure directory and things are magically set up. If you use Emacs 23.x and don’t currently do Common Lisp development, this will work well. If you want to build things manually and aren’t using Emacs, your best bet is to get Maven and build your Clojure .jar files yourself. You can find detailed directions for building on Ubuntu here:

http://riddell.us/ClojureOnUbuntu.html

If you decide to use Emacs, this video has been helpful from my perspective:

http://www.bestinclass.dk/index.php/2009/12/clojure-101-getting-clojure-slime-installed/

For general Clojure development, one tool I find extremely helpful is Leiningen. Essentially, it gives you a working project directory that makes Clojure development a bit easier. There’s little I can say that the project page doesn’t say better, so you should visit it:

http://github.com/technomancy/leiningen

One last thing I’ll add is for those who do Common Lisp development as well. If you do, you may sink some time into figuring out how to make the two work together. Following the directions in the video, you’d be left with an (eval-after-load "slime" ...) block. In there, add something like (add-to-list 'slime-lisp-implementations '(sbcl ("/usr/bin/sbcl"))), then run Slime by hitting M-- (hold Meta and hit dash), then M-x, and typing slime. You should get a prompt asking which Lisp to use; type “sbcl” and hit enter.

The other issue I want to bring up is that the CVS version of swank/slime doesn’t work with Clojure. You should get git://github.com/technomancy/slime.git instead. Note that if you use ELPA to install Clojure and its dependencies, it will already install parts of a compatible Slime. Running M-- M-x slime with sbcl will then generate errors about not being able to find the source .lisp files. You can fix this by going into the .emacs.d/elpa/slime-../ directory and copying all the .lisp files from the technomancy Slime checkout into it. If you find a cleaner way of accomplishing this, without separate .emacs conf files, please add a comment.

Clojure is a bit of a handful to get started with, but if you find yourself having problems I’d check the Clojure Google Group and #clojure on Freenode IRC.

CL-SQL fun with wordnet

November 27th, 2009

Common Lisp surprises me more and more each and every day. I never thought of using it for SQL interaction, but the adventures of CL-SQL for my linguistics class lately really opened my eyes.

CL-SQL is an ORM, like ActiveRecord, Django’s ORM, and so on. It leans a little more toward the SQLAlchemy side of things, though. I will likely write a much more detailed tutorial once I have a better understanding of this library, but for now here are a few links with more information:

CLSQL Main Site: http://clsql.b9.com/
Script I learned a lot from: http://www.boundp.net/files/weblog.lisp
Github post of my wordnet stuff: http://gist.github.com/244360

Now for a bit of information about WordNet. I’m currently in a computational linguistics course at the university, and part of the class is to develop a project of sorts. Mine relates to WordNet, which is a fairly complicated database. WordNet was developed at Princeton and contains a fairly large set of words along with their definitions, synonyms, hypernyms, hyponyms, and so on. In short, it shows many of the ways one word relates to another. I found a web site with the MSSQL files for WordNet, which I downloaded and attached, along with some example view code that I modified and added to the list of views. Since I do my development on Linux most of the time, I decided to get ODBC set up to talk to the MSSQL server. CL-SQL works beautifully with this setup. The gist above shows the details.
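
To give a feel for the kind of relationship WordNet stores, here is a toy sketch in Python with SQLite. The two tables and three rows are invented for illustration and are much simpler than the real WordNet SQL Server schema. The point is only how a hypernym (“is a kind of”) chain can be walked with a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE synset (id INTEGER PRIMARY KEY, word TEXT, gloss TEXT);
CREATE TABLE hypernym (child INTEGER, parent INTEGER);
INSERT INTO synset VALUES (1, 'dog', 'a domesticated canine'),
                          (2, 'canine', 'a member of the dog family'),
                          (3, 'mammal', 'a warm-blooded vertebrate');
INSERT INTO hypernym VALUES (1, 2), (2, 3);
""")

def hypernym_chain(word):
    """Return the more general terms above `word`, nearest first."""
    rows = conn.execute("""
        WITH RECURSIVE up(id, depth) AS (
            SELECT id, 0 FROM synset WHERE word = ?
            UNION ALL
            SELECT h.parent, up.depth + 1
            FROM hypernym h JOIN up ON h.child = up.id
        )
        SELECT s.word FROM up JOIN synset s ON s.id = up.id
        WHERE up.depth > 0
        ORDER BY up.depth
    """, (word,))
    return [row[0] for row in rows]

chain = hypernym_chain("dog")
```

Hyponyms are the same link followed in the other direction, from parent to child.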

For the SQL Server database I got, you can view the information here: http://opensource.ebswift.com/WordNetSQLServer/

I will likely provide more information as I go with this, as I’m finding wordnet more and more fun to mess with.
