Saturday, December 04, 2010

A more efficient piece of code

CMIE's firm databases use a fine-grained product code to identify each product. Each firm is also allocated to a product code based on its predominant activities. I like to reconstruct a coarse classification out of this that suits my tastes. I do this using this R function:

cmie.14.industries <- function(s) {
  values.8 <- c("Food","Textiles",
                 "Chemicals","NonMetalMin",
                "Metals","Machinery",
                 "TransportEq","MiscManuf",
                "Diversified","Serv.IT")
  names(values.8) <- c("01010101", "01010102",
                       "01010103", "01010104",
                       "01010105", "01010106",
                       "01010107", "01010108",
                       "01010109","01010408")
  values.6 <- c("Serv.Construction","Serv.Other",
                 "Mining","Electricity")
  names(values.6) <- c("010106","010104","010102",
                       "010103")

  if (is.na(s)) {return(NA)}

  leading8 <- substr(s, 1, 8)
  attempt <- values.8[leading8]
  if (!is.na(attempt)) {return(attempt)}

  leading6 <- substr(s, 1, 6)
  attempt <- values.6[leading6]
  if (!is.na(attempt)) {return(attempt)}

  leading4 <- substr(s, 1, 4)
  if (leading4 == "0102") {return("Serv.Finance")}

  return("MISTAKE")
}

This maps each firm into one of 14 coarse categories. Here are some examples of this in action:

> cmie.14.industries("0102090000000000")
"Serv.Finance"
> cmie.14.industries("0101041502000000")
"Serv.Other" 
> cmie.14.industries("0101010601010000")
"Machinery"

So in short, the function cmie.14.industries() maps a string like "0101010601010000" into a set of 14 broad industry names such as "Machinery".

Faced with a file with roughly 48,000 firm-years, at first blush, it seems that this function has to be run 48,000 times. For a given firm, this classification could change over time, so it isn't just a matter of doing this once for each firm. Here is one simple way to do it:

badway <- function(task) {
  result <- rep("", length(task))
  for (i in 1:length(task)) {
    result[i] <- cmie.14.industries(task[i])
  }
  result
}

This is just a loop that runs over everything in the supplied vector and calls cmie.14.industries() for each element. The only concession to efficiency is that the empty vector `result' is allocated ahead of time.

This proves to be quite slow. None of the standard R vectorisation ideas offer much relief.

The key idea for obtaining a leap in performance was that while I had to run through 48,000 firm-years, the industry codes actually attain only a modest list of possibilities. This makes possible a table lookup:

goodway <- function(task) {
  possibilities <- unique(task)
  values <- rep("", length(possibilities))
  for (i in 1:length(possibilities)) {
    values[i] <- cmie.14.industries(possibilities[i])
  }
  names(values) <- possibilities
  values[task]
}

For a problem of size 1000, this works out to be 13.5 times faster:

> load("task.rda")
> length(task)
[1] 1000
> system.time(res1 <- badway(task))
   user  system elapsed 
  0.030   0.000   0.031 
> system.time(res2 <- goodway(task))
   user  system elapsed 
  0.002   0.000   0.002 

This is just a demo with a 1000-sized task. In my production situation, the performance difference is even greater, since badway() calls cmie.14.industries() 48,000 times while goodway() only calls it a few hundred times.

2 comments:

  1. Why was it important to put in all that effort to reduce the time of the program from 0.31 to 0.02 seconds? Perhaps this is a subroutine in a larger program.

    ReplyDelete
  2. It's from 0.03 to 0.002 seconds for a 1000-sized problem. In production I'm doing 48,000 firm-years so it's a bit bigger than that.

    It was a neat idea, one that actually occurs more often than we think. We want to compute f(x) where x is a vector, but many values in x are repeated so there's no need to evaluate f() at every element of x. Hence I thought it's useful to write about it.

    ReplyDelete

Please note: Comments are moderated. Only civilised conversation is permitted on this blog. Criticism is perfectly okay; uncivilised language is not. We delete any comment which is spam, has personal attacks against anyone, or uses foul language. We delete any comment which does not contribute to the intellectual discussion about the blog article in question.

LaTeX mathematics works. This means that if you want to say $10 you have to say \$10.