Ruby String Processing Overhead


Just prior to joining BreakingPoint I taught myself enough Ruby to implement, with considerable help from HD and Matt Miller, a proof-of-concept in the Metasploit Framework of the research that I was to present at ToorCon 9. Shortly after that I joined BreakingPoint Labs and was thrown head-first into the world of Ruby. One thing I noticed early on was that strings were frequently written with both single and double quotes, without much apparent reason for choosing one over the other. Once I moved past the simple tutorials and began getting into more complex code, I found out that in Ruby, strings contained in double quotes are processed for escape sequences and variable interpolation, whereas strings contained in single quotes are taken literally (only \\ and \' are recognized as escapes). For example, if you have a variable in Ruby, you can have its value interpolated into a string:

irb(main):001:0> name = 'Dustin'
=> "Dustin"
irb(main):002:0> puts "Hi, my name is #{name}.\n" # processed string
Hi, my name is Dustin.
=> nil
irb(main):004:0> puts 'Hi, my name is #{name}.\n' # literal string
Hi, my name is #{name}.\n
=> nil

Being primarily a C programmer, I had a preconceived notion about double versus single quotes, since in C double quotes are used for a string and single quotes for a character. What originally confused me, however, was that I would often see strings in double quotes that contained nothing that would trigger interpolation or escape processing, essentially creating a string literal with the interpolating form of the syntax. I asked HD about this, and he went on a rant about how, when migrating Metasploit from Perl to Ruby for version 3.0, one of the other Metasploit developers would always complain about his use of double quotes where they were not needed and claim it was a performance hit. HD maintained that there was no significant difference in overhead between the two forms, but the other developer's point makes logical sense: if a string doesn't need to be processed for interpolation, you would expect fewer instructions to be executed to handle it. Since that seemed logical, I made it a point to use single quotes for strings unless I actually needed the string's content to be processed, but lately this question has been nagging at the back of my mind and the scientist in me demanded proof.

To find out whether there really was a detectable and, more importantly, significant performance hit, I wrote a little Ruby script to test this. The script first defines a string constant to use for the tests. It then builds two test cases using the String object constructor, one for each form (literal versus processed), consisting of Ruby code that simply creates a String object using the constant for initialization. The script uses the better-benchmark wrapper for rsruby (a Ruby interface to R) to measure the amount of time it takes to execute each test case via the eval() function. The test performs 10 passes of 200,000 instances of each test case and analyzes the timings using the Wilcoxon signed-rank test. To run these tests as accurately as possible, I found the most under-utilized and idle system that I could, so that other processes executing on the same system would have less influence on the tests and timings. This happened to be a freshly installed Ubuntu 8.04 system with very little running on it.

This script produced completely unexpected results; the literal string initializations were actually slightly less efficient than their processed counterparts by about 2.6%, and R deemed this difference to be statistically significant.
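As a rough illustration of the measurement described above, here is a minimal sketch that times the two test cases through eval(). It is not the original script: it swaps in Ruby's standard Benchmark module for better-benchmark and rsruby, it performs no Wilcoxon signed-rank test, and the names INIT_VALUE, LITERAL_CASE, PROCESSED_CASE and time_passes are hypothetical. The iteration count is reduced from the 200,000 used in the post so that the sketch runs quickly.

```ruby
require 'benchmark'

# Hypothetical stand-ins for the two test cases: Ruby source text that
# builds a String from a single-quoted versus a double-quoted literal.
INIT_VALUE     = 'a'
LITERAL_CASE   = "String.new('#{INIT_VALUE}')"   # String.new('a')
PROCESSED_CASE = "String.new(\"#{INIT_VALUE}\")" # String.new("a")

PASSES     = 10
ITERATIONS = 20_000 # the post used 200,000; reduced here (assumption)

# Run `iterations` evals of `source` once per pass and return the
# wall-clock time of each pass in seconds.
def time_passes(source, passes, iterations)
  Array.new(passes) do
    Benchmark.realtime { iterations.times { eval(source) } }
  end
end

literal_times   = time_passes(LITERAL_CASE, PASSES, ITERATIONS)
processed_times = time_passes(PROCESSED_CASE, PASSES, ITERATIONS)

# Report the mean time per pass; no significance test is attempted here.
mean = ->(times) { times.sum / times.size }
puts format('literal:   %.4fs per pass', mean.call(literal_times))
puts format('processed: %.4fs per pass', mean.call(processed_times))
```

Whatever this prints on a given machine is not the data reported in this post, and a plain comparison of means says nothing about statistical significance.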

These initial measurements were taken using a string of constant length, the smallest possible (one character), in an attempt to directly measure the combined overhead of the object constructor itself and the difference in processing overhead between the two initialization methods. Because the initial results were the opposite of what I had expected, I also wanted to test whether the results changed with the length of the initialization string, so I wrote a second script. The second script is similar to the first, but instead of using the same one-character initialization string for each test, the initialization strings increase in length, from 1 to 2000 bytes in 100-byte increments.

The second script produced slightly more expected results. Processed strings still had a statistically significant performance benefit for strings up to about 600 bytes. The difference was statistically insignificant only at around 600 to 700 bytes, and it became significant again, this time in the literal method's favor, somewhere between 700 and 800 bytes and beyond.
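The length sweep can be sketched in the same spirit. Again this is not the original script: it uses the standard Benchmark module, runs a single pass per length with a reduced iteration count, and does no significance testing. The exact length schedule is an assumption (1 byte, then 100 to 2000 bytes in steps of 100), and LENGTHS, SWEEP_ITERATIONS, test_case_source and time_source are hypothetical names.

```ruby
require 'benchmark'

# Assumed schedule: 1 byte, then 100..2000 bytes in 100-byte steps.
LENGTHS = [1] + 100.step(2000, 100).to_a

# Reduced from the post's 200,000 so the sketch finishes quickly.
SWEEP_ITERATIONS = 2_000

# Build the Ruby source for one test case. `quote` is "'" for the
# literal form or '"' for the processed form.
def test_case_source(length, quote)
  "String.new(#{quote}#{'a' * length}#{quote})"
end

# Wall-clock seconds needed to eval `source` `iterations` times.
def time_source(source, iterations)
  Benchmark.realtime { iterations.times { eval(source) } }
end

results = LENGTHS.map do |length|
  literal   = time_source(test_case_source(length, "'"), SWEEP_ITERATIONS)
  processed = time_source(test_case_source(length, '"'), SWEEP_ITERATIONS)
  [length, literal, processed]
end

results.each do |length, literal, processed|
  puts format('%5d bytes  literal %.4fs  processed %.4fs',
              length, literal, processed)
end
```

A single pass per length is noisy, so a crossover point like the one reported here should not be read off one run of this sketch.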

The important point to note is that initializing String objects with literal strings only seems to provide a performance benefit on longer strings, while using processed strings seems to provide the benefit on shorter strings. This means that there is a measurable string length threshold at which the choice between the two initialization methods makes a statistically significant difference, and a window of lengths within which it doesn't really matter all that much which method you use. Since most programmers are unlikely to have strings of any significant length directly in their code, it would appear that using double quotes in normal practice is the preferred method for achieving better performance.

For some further reading on the performance difference between string variable interpolation versus using string append and concatenation methods, I found this blog post to be interesting.
