CMIE's firm databases use a fine-grained product code to identify each product. Each firm is also allocated a product code based on its predominant activities. I like to reconstruct from this a coarse classification that suits my tastes, using this R function:
cmie.14.industries <- function(s) {
  ## Industries identified by the first 8 digits of the code
  values.8 <- c("Food", "Textiles", "Chemicals", "NonMetalMin",
                "Metals", "Machinery", "TransportEq", "MiscManuf",
                "Diversified", "Serv.IT")
  names(values.8) <- c("01010101", "01010102", "01010103", "01010104",
                       "01010105", "01010106", "01010107", "01010108",
                       "01010109", "01010408")
  ## Industries identified by the first 6 digits
  values.6 <- c("Serv.Construction", "Serv.Other", "Mining", "Electricity")
  names(values.6) <- c("010106", "010104", "010102", "010103")

  if (is.na(s)) {return(NA)}

  ## Try the 8-digit prefix first, then fall back to the 6-digit prefix
  leading8 <- substr(s, 1, 8)
  attempt <- values.8[leading8]
  if (!is.na(attempt)) {return(attempt)}

  leading6 <- substr(s, 1, 6)
  attempt <- values.6[leading6]
  if (!is.na(attempt)) {return(attempt)}

  ## Finance is everything under the 4-digit prefix 0102
  leading4 <- substr(s, 1, 4)
  if (leading4 == "0102") {return("Serv.Finance")}

  return("MISTAKE")
}
This maps each firm into one of 14 coarse categories. Here are some examples of this in action:
> cmie.14.industries("0102090000000000") "Serv.Finance" > cmie.14.industries("0101041502000000") "Serv.Other" > cmie.14.industries("0101010601010000") "Machinery"
So in short, the function cmie.14.industries() maps a string like "0101010601010000" into one of 14 broad industry names, such as "Machinery".
Faced with a file of roughly 48,000 firm-years, it seems at first blush that this function has to be run 48,000 times. For a given firm, this classification can change over time, so it isn't just a matter of doing this once per firm. Here is one simple way to do it:
badway <- function(task) {
  result <- rep("", length(task))
  for (i in 1:length(task)) {
    result[i] <- cmie.14.industries(task[i])
  }
  result
}
This is just a loop that runs over everything in the supplied vector and calls cmie.14.industries() for each element. The only concession to efficiency is that the output vector result is allocated ahead of time rather than grown inside the loop.
This proves to be quite slow. None of the standard R vectorisation ideas offer much relief.
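To be concrete about what does not help: handing the loop over to sapply(), or wrapping the function in Vectorize(), tidies the code but leaves the cost unchanged, since cmie.14.industries() is still called once per element. A sketch of what I mean (stillbad() is just a name used for illustration):

stillbad <- function(task) {
  ## Looks vectorised, but sapply() still calls
  ## cmie.14.industries() once for every element of task
  sapply(task, cmie.14.industries, USE.NAMES = FALSE)
}

## Vectorize() is the same element-by-element loop in disguise
cmie.14.industries.v <- Vectorize(cmie.14.industries, USE.NAMES = FALSE)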
The key idea for obtaining a leap in performance was that while I had to run through 48,000 firm-years, the industry codes take on only a modest number of distinct values. This makes a table lookup possible:
goodway <- function(task) {
  ## Call the expensive function only once per distinct code
  possibilities <- unique(task)
  values <- rep("", length(possibilities))
  for (i in 1:length(possibilities)) {
    values[i] <- cmie.14.industries(possibilities[i])
  }
  ## With the codes as names, values[task] fans the answers
  ## back out over the full vector in one indexing operation
  names(values) <- possibilities
  values[task]
}
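As an aside, the fan-out at the end can equally be done with match() instead of the names() trick; match() gives integer positions into the table of unique codes, so the final lookup is by position rather than by name. A sketch of this equivalent formulation (goodway.2() is a made-up name, not what I ran):

goodway.2 <- function(task) {
  possibilities <- unique(task)
  ## One call to cmie.14.industries() per distinct code
  values <- sapply(possibilities, cmie.14.industries, USE.NAMES = FALSE)
  ## match() maps every element of task to its position in possibilities
  values[match(task, possibilities)]
}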
For a problem of size 1000, this works out to be roughly 15 times faster:
> load("task.rda") > length(task) [1] 1000 > system.time(res1 <- badway(task)) user system elapsed 0.030 0.000 0.031 > system.time(res2 <- goodway(task)) user system elapsed 0.002 0.000 0.002
This is just a demo with a 1000-sized task. In my production situation, the performance difference is even greater, since badway() calls cmie.14.industries() 48,000 times while goodway() only calls it a few hundred times.
Why was it important to put in all that effort to reduce the time of the program from 0.31 to 0.02 seconds? Perhaps this is a subroutine in a larger program.
It's from 0.03 to 0.002 seconds for a 1000-sized problem. In production I'm doing 48,000 firm-years, so it's a bit bigger than that.
It was a neat idea, one that actually occurs more often than we think. We want to compute f(x) where x is a vector, but many values in x are repeated, so there's no need to evaluate f() at every element of x. Hence I thought it would be useful to write about it.
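The trick is easy to package generically. A minimal sketch, assuming f() returns a scalar (unique.apply() is a name invented for illustration):

unique.apply <- function(f, x) {
  u <- unique(x)          # the distinct values only
  vals <- sapply(u, f)    # one call to f() per distinct value
  vals[match(x, u)]       # fan the answers back out over x
}

## e.g. res <- unique.apply(cmie.14.industries, task)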