Python Rpy R data processing optimization
  • Posted: 2010-07-14 01:00:41
  • Tags: python, r, rpy2

I am writing a data processing program in Python and R, bridged with Rpy2.

The input data are binary, so I use Python to read them out and pass them to R, then collect the results for output.

The data are organized into pieces, each around 100 bytes (1 byte per value × 100 values).

Everything works now, but the speed is very low. Here are some of my tests on 1 GB (that is, 10^7 pieces) of data:

If I disable the Rpy2 calls to make a dry run, it takes about 90 min for Python to loop through everything on an Intel(R) Xeon(TM) CPU 3.06 GHz using a single thread.

If I enable full functionality with multithreading on that dual-core Xeon, the program will (by estimation) take ~200 hrs to finish.

I killed the Python program several times; the call stack almost always points to the Rpy2 function interface. I also did profiling, which gives similar results.

All these observations indicate that the R code called through Rpy2 is the bottleneck. So I profiled a standalone version of my R program, but the profiling summary points to "Anonymous". I am still pushing my way to see which part of my R script is the most time-consuming one. **Updated, see my edit below.**

There are two suspicious candidates, though: one is a continuous wavelet transform (CWT) plus wavelet transform modulus maxima (WTMM) using wmtsa from CRAN[1]; the other is a non-linear fit of an ex-Gaussian curve.

What comes to my mind:

  1. For the fitting, could I substitute the R routine with inline C code? There are many fitting libraries available in C and Fortran... (idea from the net; I have never done that; unsure)

  2. For the wavelet algorithms... I would have to analyze the wmtsa package to rewrite the hotspots in C? Reimplementing the entire wmtsa package in C or Fortran would be very non-trivial for me; I do not have much programming experience.

  3. Each data piece in the file occupies 20 consecutive bytes, which I could map directly to a C-like char* array? At present my Python program reads one byte at a time and appends it to a list, which is slow. This part of the code takes 1.5 hrs vs. ~200 hrs for R, so it is not that urgent.
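A minimal sketch of the char*-style mapping in point 3, assuming the values are unsigned single-byte integers and using NumPy (an assumption; the filename and record layout are placeholders for the real file):

```python
# Sketch: parse the whole binary blob as fixed 20-byte records in one shot,
# instead of appending byte-by-byte to a Python list.
import numpy as np

def read_pieces(raw: bytes, piece_size: int = 20) -> np.ndarray:
    """Return an (n_pieces, piece_size) view over the raw bytes."""
    arr = np.frombuffer(raw, dtype=np.uint8)  # zero-copy view of the buffer
    return arr.reshape(-1, piece_size)        # one row per piece

raw = bytes(range(40))            # stand-in for open("data.bin", "rb").read()
pieces = read_pieces(raw)
print(pieces.shape)               # (2, 20)
```

Each row of `pieces` can then be handed to R in one call rather than value by value.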

This is the first time I have had to deal with program efficiency in a real problem. I searched the web and felt overwhelmed by the information. Please give me some advice on what to do next, and how.

Cheers!

footnotes:

  1. http://cran.r-project.org/web/packages/wmtsa/index.html

* Update *

Thanks to proftools from CRAN, I managed to create a call-stack graph, and I can see that ~56% of the time is spent in wmtsa; the code snippet looks like:

W <- wavCWT(s, wavelet="gaussian1", variance=1/p)  # ~1/4 of the wmtsa time
W.tree <- wavCWTTree(W)                            # ~1/2
holderSpectrum(W.tree)                             # ~1/4

~28% of the time is spent in nls:

nls(y ~ Q * dexGAUS(x, m, abs(s), abs(n)) + B, start = list(Q = 1000, m = h$time[i], s = 3, n = 8, B = 0), algorithm="default", trace=FALSE)

where evaluating dexGAUS from the gamlss.dist package takes the majority of the time.

The remaining ~10% of the R time is spent on data passing/splitting/aggregation/subsetting.
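Since dexGAUS dominates the nls time, one variant of option 1 would be to move the whole fit to compiled code via SciPy, where the ex-Gaussian density is available as scipy.stats.exponnorm. This is only a sketch under assumptions: the parameter names mirror the R call (Q, m, s, n, B), but exponnorm's parameterization (K = tau/sigma, loc = mu, scale = sigma) may not match dexGAUS exactly, so the mapping `K = n / s` should be verified against gamlss.dist before trusting the results:

```python
# Sketch: ex-Gaussian curve fit with scipy instead of R's nls + dexGAUS.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import exponnorm

def exgauss_model(x, Q, m, s, n, B):
    # amplitude * ex-Gaussian pdf + baseline; K = n/s is an assumed mapping
    return Q * exponnorm.pdf(x, K=n / s, loc=m, scale=s) + B

# synthetic data, just to show the call shape
rng = np.random.default_rng(0)
x = np.linspace(0, 50, 200)
y = exgauss_model(x, 1000, 20, 3, 8, 0) + rng.normal(0, 0.1, x.size)

popt, _ = curve_fit(exgauss_model, x, y, p0=[1000, 20, 3, 8, 0])
```

curve_fit uses compiled least-squares internally, so the per-piece Python overhead is mostly the model evaluations.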

Answers

For option 3, getting your data in efficiently: read it all in as one long str in Python with a single read from the file. Let's assume it's called myStr.

import array
myNums = array.array('B', myStr)

Now myNums is an array with each byte easily converted... see help(array.array). In fact, looking at that, it seems you can fill the array directly from a file.

That should shave about 1.4 hours off your data reading.
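The direct-from-file variant mentioned above can be sketched like this (the file object here is an in-memory stand-in for the real data file):

```python
# Sketch: fill an array.array straight from a file object, skipping the
# intermediate string entirely.
import array
import io

buf = io.BytesIO(bytes(range(100)))   # stand-in for open("data.bin", "rb")
myNums = array.array('B')
myNums.fromfile(buf, 100)             # read 100 unsigned bytes in one call
print(myNums[:5])                     # array('B', [0, 1, 2, 3, 4])
```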

My understanding is that you have:

  • python code that uses rpy2 in places

  • performance issues that can be traced to calls to rpy2

  • the performance issues do not appear to have much to do with rpy2 itself, as the underlying R code is largely responsible for the running time

  • a part of your code was reading bytes one at a time and appending them to a list, which you can improve as suggested

It is somewhat hard to help without seeing the actual code, but you may want to consider:

  • a buffering strategy for reading bytes (as already answered by John);

  • working on optimizing your R code;

  • trivial parallelization (and eventually renting compute capacity in the cloud).




