Someone commented that it was not a good idea to use Ruby for Bayesian filtering of things like forum posts. (Bayesian filtering is one of the primary algorithms used for determining whether an email or forum post is spam.)
They made some performance claims which seemed exceedingly slow, but also made some statements that made me suspect their application design was not all that it could be, so I thought I would see if Ruby was the guilty party.
This post is not primarily about bayesian filtering but about performance testing; it is probably most helpful to low to intermediate ruby developers.
In this post I will write about:
If you are here for the performance analysis stuff with Stackprof then you can skip down to “The Analysing” section below.
So first we need to identify the claims made against ruby, this will determine what our goals are.
The poster made the following claims (Paraphrased slightly):
Now, if you are a Ruby person the third claim might have been a red flag for you (it was for me). “All the Array Structures”… Now when it comes to searching, Arrays are slow, dog slow. A quick look at our favourite Big-O Notation Cheatsheet site shows that searching an array is an O(N) operation. This means that search time grows in step with the size of the array.
The other thing is that Arrays shouldn’t really feature that much in Bayesian filtering as the algorithm doesn’t really care if a word appears once or a hundred times.
These were the things that made me wonder if the poster was correct in blaming Ruby for the slowness. I also happened to know that there are some Bayesian filter gems in the Ruby ecosystem, which wouldn’t really make sense if Ruby were as slow as the poster claims.
When I pointed out the fact that it might be the poster’s design they didn’t take it very well. They stated the challenge to create a userland ruby application that processes 1000 unique variable length posts a second. Their training data set was 100,000 posts.
So, now we know the claims we need the goal.
Right sports fans! The first thing we need to do is write a Bayesian filter in Ruby… Bwahahahahah, I make myself laugh. I am a great believer in standing on the shoulders of giants so instead of writing a filter let’s find some likely Ruby gems.
Some searching on GitHub and “The Ruby Toolbox” led me to the Classifier gem; however, the last commit was a bit old and there was this issue which called Classifier’s accuracy into question and which, at the time, was not fixed.
Now, at this point I should confess, I only have a vague overview level of knowledge on how bayesian filtering works and I was hoping to spare my brain-power by not learning the nitty gritty mathematical details (One of the reasons I was looking for a Gem in the first place)
However, the poster of that issue wrote their own gem instead called Ankusa which looks pretty good. So let’s go with that.
So now we have our filter we just need a data-set. The original poster said that they used a training data set of 100,000 posts. I am going to assume a roughly 50/50 split here and say they had 50,000 known good posts (from now on called ‘ham’) and 50,000 spam posts.
After some DuckDuckGo‘ing I found some likely sources of curated spam/ham data-sets. It looked like at least one good thing came out of the Enron Collapse which is that their emails were made public as part of the discovery process.
Some researchers then put them up for download. While the numbers aren’t quite up to the 100,000 posts the original poster said they are in the same ballpark.
For a grand-total of around 50,000 emails. Due to the way the filter is written I don’t see there being much difference in speed between 50,000 and 100,000 items in the training set.
NOTE: If you are reading this and know of a bigger training set that is freely available I would love to hear about it.
Wow, 750 odd words into this article and we haven’t even gotten to the beginning of the code stuff yet.
Before I continue with the actual code stuff I am also going to have to make some assumptions on how the filter was actually used. (At least this is probably how I would have done it) I hope you will agree that these are logical assumptions:
A dedicated Bayesian machine: If you are processing 1000 forum posts / second I am going to assume that your operation is large enough to warrant a dedicated Bayesian filtering server (probably at the end of a REST API or message queue called by your front-end machines).
No online updating of the training data-set: I will assume the training set is batch-updated at some point (Maybe a daily / weekly thing).
The training data-set is held in memory: After implementing the training code using the Enron data-set I found that the training data only took up about 17MB of disk-space. Since this is pretty small and we assumed we have a dedicated server (see Assumption #1), we will hold it in memory for maximum performance rather than in a database.
Before we can benchmark Ankusa we need some sort of test-runner code. To that end I present to you “Don’t Bayes Me Bro” (DBMB: Well it made me chuckle when I named it and I like the idea of spammers being tazered).
All code samples from here-on-out will come from either Ankusa or DBMB.
WARNING: DBMB code is messy and not TDD’d and likely to make your “Beautiful Code” gland rupture.
The first thing we need to do is create our training data-set.
The actual code is here
DBMB has a “training” folder into which we dumped the Enron emails from earlier. All we are doing in this code is recursively reading files from the “spam” and “ham” sub-directories and training Ankusa with them.
We then save the data to a file called “corpus”. Note that since we are doing an operation that uses file I/O we can speed things up by creating one thread for “spam” and one for “ham”. The GIL gets in the way a bit but we still get a performance speed-up in this case.
Since the original poster was talking about “Forum Threads” I decided to just parse the email body rather than everything including headers etc.
This is kicked off from a Rake task.
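The one-thread-per-category idea can be sketched roughly as follows. This is not the DBMB code: `CountingTrainer` is a hypothetical stand-in that only counts words (a real run would hand the text to the classifier instead), and `train_directory` / `train_all` are names I made up for illustration.

```ruby
# Hypothetical stand-in for a real classifier: it only counts words per label.
class CountingTrainer
  attr_reader :counts

  def initialize
    @counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    @lock = Mutex.new
  end

  # Guard the shared hash so both threads can train safely.
  def train(label, text)
    words = text.downcase.scan(/\w+/)
    @lock.synchronize { words.each { |word| @counts[label][word] += 1 } }
  end
end

# Recursively read every file under root/<label> and feed it to the trainer.
def train_directory(trainer, root, label)
  Dir.glob(File.join(root, label.to_s, "**", "*")).each do |path|
    next unless File.file?(path)
    # scrub replaces invalid bytes, which old email dumps tend to contain.
    trainer.train(label, File.read(path).scrub)
  end
end

# One thread per category; the file reads can overlap even with the GIL.
def train_all(trainer, root, labels = [:spam, :ham])
  labels.map { |label| Thread.new { train_directory(trainer, root, label) } }
        .each(&:join)
  trainer
end
```

Calling `train_all(CountingTrainer.new, "training")` would walk `training/spam` and `training/ham` in parallel. The mutex matters because both threads write to the same structure.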
Benchmarking is done here and is started from another Rake task which allows us to test 1,000 to 30,000 emails using the original data-set from which the corpus came.
To run the benchmark we insert the required number of email bodies into a queue. We then pop from the queue and run the filter. Why a queue? Well I was thinking down the road when multi-threading might make an appearance.
A few things to note. To avoid skewing the benchmarks we pre-initialize two variables which are otherwise lazy loaded by Ankusa.
To keep things deterministic we save our queue data to a file for future reading; this also saves time over parsing thousands of emails each time. We also avoid emails with VERY short bodies (less than 100 characters) to avoid getting a corpus with an overly short average body length.
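The queue-and-pop loop described above might look something like this minimal sketch. `run_benchmark`, `MIN_BODY_LENGTH` and the lambda standing in for the classifier are all assumptions of mine for illustration, not the DBMB implementation.

```ruby
require "benchmark"

# Fill a queue with jobs, drain it through the filter, and report throughput.
# `classifier` is any object responding to #call; it is a hypothetical stand-in.
def run_benchmark(bodies, classifier)
  queue = Queue.new
  bodies.each { |body| queue << body }

  processed = 0
  elapsed = Benchmark.realtime do
    until queue.empty?
      classifier.call(queue.pop)
      processed += 1
    end
  end

  { jobs: processed, seconds: elapsed, jobs_per_second: processed / elapsed }
end

# Skip very short bodies, mirroring the 100 character cut-off mentioned above.
MIN_BODY_LENGTH = 100
bodies = ["short", "x" * 150, "spam " * 40].reject { |b| b.length < MIN_BODY_LENGTH }

result = run_benchmark(bodies, ->(text) { text.split.size })
puts format("%d jobs in %.4fs", result[:jobs], result[:seconds])
```

Using a `Queue` costs almost nothing single-threaded and means worker threads could be bolted on later without changing how jobs are fed in.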
Ok, here it is, the moment we have all been waiting for. After lots of waffle about setting this up, it is time for the main event, the actual benchmarking, the thing you are actually reading this blog post about probably, the thing that is at the end of this overly long sentence which is driving you insane!
Note: In the interest of brevity (In a post this long?! HAH) I have removed a lot of unnecessary output from the commands.
Test Machine:
Oook then, roughly 159 jobs per second. About a sixth of what we need. This is not the end of the world but it was a bit slower than I was hoping for.
So, first step is to analyse the code. For this we will use the godly, amazing, idiot-proof Stackprof gem by tmm1. Stackprof is introduced by tmm1 here and I highly recommend reading it.
So let us run the test again with stackprof enabled and see what we can see.
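For readers who have not used it before, wrapping a piece of code in Stackprof can be sketched like this. The `workload` method and the dump path are placeholders of my own, and the sketch falls back to a plain run if the gem is not installed; it is not how DBMB wires profiling in.

```ruby
require "tmpdir"

# Stand-in workload; in this post it would be the benchmark run.
def workload
  (1..50_000).map { |n| n.to_s.reverse }.uniq.size
end

dump_path = File.join(Dir.tmpdir, "stackprof-cpu-sketch.dump")
result = nil

begin
  require "stackprof"
  # Sample the CPU while the block runs and write the profile to dump_path.
  StackProf.run(mode: :cpu, out: dump_path) { result = workload }
  puts "Profile written to #{dump_path}"
rescue LoadError
  # Fall back to a plain run when the stackprof gem is not installed.
  result = workload
end

puts "Workload result: #{result}"
```

The dump can then be inspected from the command line with something like `stackprof <dump file> --text --limit 4`, which prints the methods with the most samples.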
Here I have told stackprof to identify the methods which were taking up the most runtime and limit it to the longest 4 methods. And holy moly!
From reading line 17 we can tell that the vast VAST VAST majority of the runtime is taken up with a single method! This is good and bad. Good, in that if we can optimise this then it will be a huge win, bad in that if we cannot, we are screwed. So let us look at the offending code.
Hunh, ok, not that complicated. But the first part of this stands out: .include? is used on enumerables. Enumerables like…. Arrays? Let us check.
I have removed all the words from STOPWORDS but let me tell you it had 544 entries. So what we have here is a 544 entry Array that is searched for every.. SINGLE… WORD! Remember what we said about searching arrays? O(N) average complexity: as the size of the array increases so does the time it takes to search. We can show this using a micro-benchmark.
The following code searches a set of arrays 5000 times.
You can see how the time taken to run .include? increases with the size of the array.
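If you want to reproduce the effect yourself, here is a minimal sketch of such a micro-benchmark. Only the 5000 searches come from the text above; the array sizes, the `word#{i}` entries and the `time_lookups` helper are my own assumptions.

```ruby
require "benchmark"

SEARCHES = 5000

# Time SEARCHES lookups of a value that is absent, forcing a full scan.
def time_lookups(collection, missing)
  Benchmark.realtime { SEARCHES.times { collection.include?(missing) } }
end

[10, 100, 1_000, 10_000].each do |size|
  words = Array.new(size) { |i| "word#{i}" }
  seconds = time_lookups(words, "not-in-the-list")
  puts format("Array of %5d entries: %.4fs", size, seconds)
end
```

Searching for a missing value is the worst case for an Array, because every element has to be compared before Ruby can answer "no".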
So what are we to doooooo!? Well, when we have an array for the sole purpose of calling include? on it we do not care about duplicate values. Therefore we can use another, underused Ruby data-structure.
Tadaaaa! Set to the rescue! Sets are similar to Arrays with a few key differences.
But the big one, the BIG one, is that, unlike Arrays, Sets use the same “Hash Table” data-structure as Ruby Hashes to store their data. What does this mean? Well, another visit to the Big-O Notation Cheatsheet tells us that Hash Tables are much MUCH better for searching, with an average complexity of O(1) and a worst-case of O(N).
This means that, even as the size increases the search-time remains relatively constant. Let us test this again with our micro-benchmark.
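A minimal sketch of the same kind of experiment with a Set alongside an Array is shown here. As before, the sizes, entries and the `time_lookups` helper are assumptions of mine, not the original listing.

```ruby
require "benchmark"
require "set"

LOOKUPS = 5000

# Time LOOKUPS membership checks for a value that is not present.
def time_lookups(collection, missing)
  Benchmark.realtime { LOOKUPS.times { collection.include?(missing) } }
end

[10, 100, 1_000, 10_000].each do |size|
  words = Array.new(size) { |i| "word#{i}" }
  word_set = words.to_set # same entries, stored in a hash table

  array_time = time_lookups(words, "missing")
  set_time = time_lookups(word_set, "missing")

  puts format("%5d entries: Array %.4fs, Set %.4fs", size, array_time, set_time)
end
```

Because `Set#include?` has the same name as `Array#include?`, swapping one for the other usually needs no change at the call site; `to_set` on the existing array is enough.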
Sweeeeeeeeeet. So let us replace the Ankusa::STOPWORDS Array with a Set
and see what happens.
Wow, it just goes to show how a tiny, simple change can have such a huge impact. A running time reduced from about 6 seconds to 2 seconds, we are now halfway to our goal!
However halfway is not all the way. Improving performance is a simple cycle
We have fixed our first bottleneck, so let us find the next one using trusty stackprof again.
So the next longest action is actually a method on String which is checking if the string is numeric or not, and there is some rescue action going on in there which means it is taking up the top two spots. Now String is a Ruby core class and it does not have a numeric? method by default so this looks like something Ankusa has monkey-patched in.
A quick look through the source-code and we see this is the case (in the appropriately named extensions.rb).
So what is wrong with this method? Well, nothing is wrong with it per se, it is one of the standard ways to see if a String is numeric or not.
The problem with it is that it is slow; it is even slower if the string is not numeric because then it raises an exception which has to be rescued (sloooooooooooow). This is one of the reasons you see people saying, “Don’t use exceptions for flow-control!”
The problem is that we cannot really change this because every other way of checking if a string is numeric or not has edge-cases where they fail. This is the only bullet-proof way of making sure if a String is numeric or not.
But do we need bullet-proofness? Let’s have a nose around and see if there is any other option.
If we go up the call chain a bit we can see that each word is processed in add_text:
To create a “word” it first atomises any text passed to it. The comment looks very interesting: “Replace dashes with spaces”… well, that would remove negative numbers for starters. Let’s have a look at the atomize method.
Hmm, this looks interesting: this code basically strips all dashes and replaces anything that is not a word or whitespace character with a space. Let’s assume our regexp knowledge is fuzzy and we are not sure what a “word” (\w) is; we can fire up IRB and do some testing:
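A rough sketch of that kind of experiment, which can be pasted into IRB, is below. `atomize_sketch` is a hypothetical re-creation based only on the description and the quoted comment (dashes become spaces, then everything that is not a word or whitespace character becomes a space); it is not copied from Ankusa.

```ruby
# Hypothetical tokenizer following the description above: dashes become
# spaces, then anything that is not a word or whitespace character does too.
def atomize_sketch(text)
  text.downcase.tr("-", " ").gsub(/[^\w\s]/, " ").split
end

["3.14", "-42", "1,000", "1.05e16", "snake_case-word!"].each do |sample|
  puts "#{sample.inspect} => #{atomize_sketch(sample).inspect}"
end
```

Note that \w covers letters, digits and the underscore, so the decimal point, the comma and the minus sign all act as separators while `snake_case` survives in one piece.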
Well, with some experimentation it looks like any kind of number will always be split into a bunch of integers. This means that we don’t really need the edge-case surety of Float(string). Let’s see how much faster a simple regex is.
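To make the comparison concrete, here is a minimal sketch of the two approaches side by side. The method names, the digits-only pattern and the word list are my own assumptions; the regex in the real change may well differ.

```ruby
require "benchmark"

# The thorough approach: let Float() decide, rescuing when parsing fails.
def numeric_via_float?(string)
  Float(string)
  true
rescue ArgumentError
  false
end

# The cheap approach: after tokenizing, numbers are plain runs of digits.
def numeric_via_regex?(string)
  string.match?(/\A\d+\z/)
end

# Mostly non-numeric words, so the Float() version raises most of the time.
words = %w[hello 42 world 2014 spam 7] * 20_000

float_time = Benchmark.realtime { words.each { |w| numeric_via_float?(w) } }
regex_time = Benchmark.realtime { words.each { |w| numeric_via_regex?(w) } }
puts format("Float(): %.3fs, regex: %.3fs", float_time, regex_time)
```

The digits-only pattern deliberately treats a token such as "1e05" as a word rather than a number, which only makes sense because the tokenizer has already broken real numbers into integer pieces.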
And the result:
Great success! Another simple change, another massive speed-up.
Eagle-eyed mathematicians might notice that we DO have an edge-case that we are no longer covering, in that numbers like 1.05e16 will end up as ["1","05e16"]. However, by the time we check if a String is numeric this number has already been mashed up, and checking for [\d.]+(?:e?\d+) could result in us ignoring words that we would prefer to check. All in all I think it is safer to not ignore a string like “1e05”.
Others may cringe at having such a method as a monkey-patch on String but do not worry, in the real PR I also moved it out of there, as evidenced here.
We are now close enough to 1000 jobs per second that I am going to call time on this post. There are other optimisations we could probably do but we have already done the easy stuff as evidenced by stackprof once again.
As we can see from this (full) output there is no horrendously slow bottleneck that we can fix for a big win.
(A confession here: I got the “Need for Speed” bug at this point and did some more tweaking that got us to about 970-980 jobs per second; you can see the full list of changes here.)
So we had a challenge and I think we met it. We didn’t do much in the way of long-run tests and our setup might have differed from theirs, but I think I showed that Ruby can have respectable results. This means that Ruby is perfect, right?
Well no, while I do believe the original poster was wrong to blame ruby for his application’s slowness there are a few issues here.
First is that using Ankusa in this manner is massively CPU bound and we are stuck here using a single thread. This operation would benefit hugely from effective multi-threading but Ruby’s GIL prevents us from doing so since we are not doing much in the way of I/O.
JRuby to the rescue! I did actually try testing on JRuby. Ankusa actually uses a C Extension for the word-stemming and there is a JRuby drop-in equivalent but when I ran the tests on JRuby it was horrendously slow (Something like 40 times slower) and at that point I was not really up for trying to figure out why.
There is always Rubinius, I have never used it to be honest, but it does sound ideal for this case, maybe I will write a part 2 (RBX Redux!).
What I hoped I demonstrated was that improving performance is not the black-magic beginners might think it is. There are tools that make it dead simple to do so I highly recommend you give it a go.
This is just as true for Ruby as for any other language so here are the resources I like to use to keep up to date.
These are besides things like API sites and the ruby on rails guides etc.
Made by Ryan Bates, Railscasts is a great site with webcasts exploring new and interesting things happening in the Ruby world. As the name suggests it is more focused on Web Development and Ruby on Rails but it also explores technology that can be useful to other types of Ruby developers.
There are free and paid-for episodes available (the latter for US$9 per month) and every episode has source code available on GitHub. Most episodes also have full transcripts, including code, which can be read if you are not a video-watching type.
If you do not have money to travel the world attending conferences and your company does not stump up the cash then Confreaks is a gem of a site. They record presentations at a lot of major development conferences.
Confreaks is very popular with Ruby conferences so you can find a wealth of interesting talks there. (Most of the talks are about 40-60 minutes long, which is perfect for filling up a lunch break while eating or for watching on the daily commute.)
Hacker News is a (mostly) news aggregation site that has reasonably high standards for submissions and comments. Like all aggregation sites it has plenty of stories that do not interest me directly but there is a lot of good technical news / startup news there.
Requires a bit more mental filtering but still worth it. Just be aware that it is quite start-up focused which can cause a bit of an echo chamber effect. A healthy dose of cynicism (realism?) is useful.
Ah, Reddit, the “Front page of the internet” and a hell of a good time-waster. Beware of this site in that regard.
Reddit has various subforums related to specialised topics. For the Ruby developer about town the following Reddits may prove useful:
Ruby5 is a podcast from the gents at Envy Labs which covers interesting ruby related things. They are relatively short and to the point (About 5 minutes)
To be honest, I do not listen to the podcast that often but they list each tech they talk about on the page for that episode which is what I usually have a look at.
In the same manner as Ruby5 above this podcast talks about interesting things going on in the ruby world. The podcasts from Rubyshow are a lot longer than ruby5.
Again like Ruby5 I mainly use this site for the list of technologies they have in the transcript rather than listening to the podcast itself. (Aren’t I an old fashioned duffer…)
While I am mainly a Ruby dev I am also responsible for other aspects such as server administration etc. (Not my main job but hey.)
Webpulp.tv is basically a series of interviews with employees of quite famous tech related companies.
The interviews usually focus on the technology used by the company, challenges they have faced and how they got around them etc.
It is exceedingly useful to see what others are using in their stacks and a good way to learn about new and interesting software / techniques that might be useful for you.
Unfortunately webpulp does not update very often but what the hey.
As well as the above I have blogs by various Ruby-related companies on an RSS feed, for example:
You should find companies doing stuff that interests you and sign up to their blogs.
This is a handy little site which lets you write and test regular expressions as you write them. It is my go-to site when I need to use regexes (which is not that common, hence why it is such a useful site)
If you have read this far then I thank you for your attention and would like to use it to remind you of one thing.
Everything above is optional. What you should be doing anyway is being signed up to the security mailing lists of all the major components in your stack.
For example, if you use Ruby on Rails with PostgreSQL and Redis datastores and a Varnish cache running on CentOS then you should be signed up to the security mailing lists of all of these.
I have listed a reasonable number of resources above. This list is no-where near exhaustive and you should be building up your own list of go-to resources. As well as this you need to beware of something I call “Learner’s Paralysis”.
Following all of the above sites (plus ones you find yourself) can take up a significant chunk of your time if you let it.
Do not let it take up so much that you end up reading / learning a lot more than doing. This is a problem I suffer from: I find learning about these things so interesting (at a superficial level) that I don’t actually get round to doing anything. (Work on side-projects, think about that bootstrapped business I want to start etc.)
Get out there and put the stuff to use rather than just thinking “Hey, I learned something” and leaving it at that.
Luckily with Kubuntu 12.04 getting set up with Japanese input has become a very simple process.
I am going to make a few assumptions before we get going. I am going to assume that:
The reason for 1. is simply that I have not tested this on any earlier versions. Feel free to give it a go and report back, but no guarantees.
The reason for 2. is again, because I have not tested it with other languages. Again feel free to give it a go and report back, but no guarantees.
The reason for 3. is because, historically, attempting to get Japanese input working with one method would interfere with other attempts using different methods. If you have already tried to set up Japanese input on your install then by all means try the below method, but if it doesn’t work I am afraid that I cannot help you. In this case the best (albeit annoying) option is to reinstall.
Right, with these caveats out of the way let’s proceed.
Run this command from the konsole. It will install the ibus-mozc software, this has worked for me without issues. I used to use some different software but this seems to be the way to go for 12.04 and up.
gnome-icon-theme is required by ibus-mozc (yes, even on KDE) but due to a missing dependency it is not installed so we need to specify it here.
After running the above command there is a good chance that you will need to restart, if prompted then please do so.
After rebooting go to your system-settings and choose the locale option (it is at the top under the “Common Appearance and Behavior” section).
In the left-hand list select “System Languages” and then in the right hand window choose the “Set System Language” option.
In my case I choose “English (United Kingdom)” (Rule Britannia!) and then in the bottom right of the window set the “Keyboard input method” to “ibus”
Clicking apply will require you to enter your root password.
Now we need to start ibus. One option is to reboot but the quicker way is to bring up the app launcher by pressing ALT+F2 then typing in “ibus”. Doing so will bring up the “IBus Input Method Framework” option, please click this.
Clicking the above will result in a little keyboard icon coming into life in your taskbar bottom right of the screen.
Right click on this icon and select “Preferences”. On the window that appears select the “Input Method” tab.
Check the “Customize active input methods” checkbox. Click the “Select an input method”, click the little arrow next to the greyed out “Japanese” text then click on the very orange icon with the “Mozc” text. Once that is done click “Add”.
The orange icon with “Japanese - Mozc” should appear in the input method list. At this point you can close the window.
You should now be able to input Japanese! Open a text editor like Kate or a browser. Select an area where you can enter text then use CTRL+Space, the keyboard icon bottom right should switch to the orange icon and you can enter Japanese. よし!
Warning: This article was imported from an old site and is therefore itself rather old. It may not still be accurate for current versions of RedCloth.
Textile has a +filter_html+ option which I thought would do the trick but that only filters what HTML RedCloth allows users to enter. It doesn’t filter the HTML created by RedCloth itself when a user uses textile tags.
So how to filter the textile tags?
First, assuming you are using Rails 2.3 or later create the following file. For other frameworks please use the recommended method for adding start-up code to that framework.
This file will be run during the rails initialization and will contain the code we want to override (monkey-patch). Now paste the following code into the file.
ALLOWED_TAGS is a hash of tags that you want to allow. You can take the BASIC_TAGS to use as a base and strip tags you don’t want to allow from the hash and add other ones if you want to.
So we have defined the tags that we want to allow. Now we need to actually do some stripping. This is where the after_transform method comes in. This is called by RedCloth as standard after initial modification. So what we can do is override the method and tell RedCloth to clean_html again with the HTML string it has just created. To give you a list of steps.
At this point the HTML’ised string is usually returned; however we do some overriding so that:
Thinking about it you don’t even need +filter_html+ since it will all be filtered the second time around explicitly by our code. However I feel a little more secure by stripping all the user generated HTML cruft first using +filter_html+ before stripping our textile generated HTML ourselves.
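The general pattern (let the original hook run, then filter the HTML it produced against an allow-list) can be illustrated with a small self-contained sketch. Everything here is hypothetical: `ToyFormatter` merely stands in for a formatter with an after_transform hook, `ALLOWED_TAGS_SKETCH` and `strip_disallowed_tags` are names I invented, and the sketch uses Module#prepend, which needs a much newer Ruby than the Rails 2.3 era this article dates from. It only checks tag names with a regex and ignores attributes, so treat it as an illustration of the idea and not as a robust sanitizer or as RedCloth's own code.

```ruby
# Hypothetical allow-list keyed by tag name. This sketch checks names only.
ALLOWED_TAGS_SKETCH = { "p" => [], "strong" => [], "em" => [], "a" => ["href"] }.freeze

# Drop any tag whose name is not in the allow-list; keep the text inside it.
def strip_disallowed_tags(html, allowed = ALLOWED_TAGS_SKETCH)
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    allowed.key?(Regexp.last_match(1).downcase) ? tag : ""
  end
end

# Toy stand-in for a formatter with a post-processing hook.
class ToyFormatter
  def after_transform(text)
    text.strip
  end
end

# Wrap the hook: run the original, then filter the HTML it produced.
module FilterGeneratedHtml
  def after_transform(text)
    strip_disallowed_tags(super)
  end
end

ToyFormatter.prepend(FilterGeneratedHtml)

puts ToyFormatter.new.after_transform(" <p>Hi <script>x()</script><strong>there</strong></p> ")
```

The point of wrapping the hook is that the filter sees the final generated markup, so it does not matter whether a tag came from raw user HTML or from textile syntax.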
Enjoy
If, on the other hand, you have been directed to this page by someone else and quickly want to find out what this is about, read on.
In short:
For the purposes of this article I will introduce “Terry” and “Gonad” (Of “Zero punctuation” fame). Terry shall be our long suffering helper in the “Spiffy” project’s IRC channel and Gonad shall be the person asking for help (badly).
This one drives people up the wall and many channels have special bots that will print out an entire spiel about not asking to ask, so let’s get it out of the way first.
Let us look at the obvious first. The entire channel is dedicated to helping people with Spiffy; this is usually hinted at in the channel name and stated outright in the topic. Do you think they have a personal grudge against you that is going to stop them answering? (They might possibly later, but right now it is a blank slate.) Of course they can help you. Just ask the question straight out.
What Gonad should have done was:
Which leads us nicely onto the next point
Let us follow on from the previous point and continue Gonad’s sentence.
Moral of the story, “Helpers are not psychic”. When you post a problem we need something to go on. Preferably some or all of the following:
A note about including error pages or output. Because these things are traditionally huge, pasting them directly into the channel will make you as well liked as a clown at a funeral. Instead use a “paste” website like “Pastie”, “Pastebin” or “Gist” and then paste the URL into the channel. These sites also offer syntax highlighting which is very useful.
If the problem is complicated, or the steps very detailed, then consider posting a summary in IRC and more detail in your linked paste.
So once more let’s look at what Gonad should have done.
And they all lived happily ever after.
I hang around in some programming channels and every now and then we get a question illustrated by the following:
Let us take a real-world equivalent. Your town has a local club of burly men, covered in engine oil, who like messing around with car engines (tuning, fixing and the like) and they meet every night. Now imagine someone walks into their workshop and asks “Hey guys, I want to design and build an engine, can you quickly tell me how to do that? I have my notepad and everything”
Be thankful that at least online you have a certain amount of anonymity and are not vulnerable to immediate physical retribution.
We are a help channel, this means we usually answer technical questions. We sometimes answer non-technical questions but don’t ask anything that will require an entire business/technical specification or 4 year education course to answer
Further reading on this subject can be found at the “Help Vampire” site (which is very funny and better illustrated than this site). Make sure you don’t end up as one.
Common sense dictates that when you put yourself in a position where you are relying on the kindness of strangers it behoves you to be polite to those strangers.
Remember how much you paid for the support contract on the software? Exactly. 99.99% of the helpers are not being paid, they are volunteers and generally nice people doing this because they like the warm fuzzy feeling they get from helping. Acting like a prat really discourages them from continuing this noble endeavour.
Examples of bad behaviour:
This is the internet, land of many timezones. Therefore questions might not be answered immediately. Let us look at Gonad again.
Sometimes answers can take hours/days in a relatively quiet channel, state your question and wait. Most people will type your nick when replying so make sure your IRC client is set to alert you when your nick is typed.
If you have read this far and actually read everything then congratulations. You should now know enough to not be a prat and get ignored whenever you ask a question in IRC. However this is only the first step; I still recommend reading Eric S. Raymond’s “How to ask questions the smart way” for a more thorough explanation of everything… go on, don’t be adequate, go and be clued up.
Fun competition time. Can you see which rules are being broken in the following real-life examples? (Names changed to protect the… innocent.)