Python Rpy R data processing optimization
  • Posted: 2010-07-14 01:00:41
  • Tags: python, r, rpy2

I am writing a data processing program in Python and R, bridged with Rpy2.

The input data are binary, so I use Python to read them out and pass them to R, then collect the results for output.

The data are organized into pieces, each around 100 bytes (1 byte per value × 100 values).

Everything works now, but the speed is very low. Here are some of my tests on 1 GB (that is, 10^7 pieces) of data:

If I disable the Rpy2 calls to make a dry run, it takes about 90 min for Python to loop through everything on an Intel(R) Xeon(TM) CPU 3.06 GHz using a single thread.

If I enable full functionality with multithreading on that dual-core Xeon, the program will (by estimation) take ~200 hrs to finish.

I killed the Python program several times; the call stack almost always points to the Rpy2 function interface. I also did profiling, which gives similar results.

All these observations indicate that the R code called through Rpy2 is the bottleneck. So I profiled a standalone version of my R program, but the profiling summary points to "Anonymous". I am still pushing my way to see which part of my R script is the most time-consuming one. **Updated, see my edit below.**

There are two suspicious candidates, though: one is a continuous wavelet transform (CWT) plus wavelet transform modulus maxima (WTMM) using wmtsa from CRAN[1]; the other is a non-linear fit of an ex-Gaussian curve.

What comes to my mind:

  1. For the fitting, could I substitute the R routine with inline C code? There are many fitting libraries available in C and Fortran... (idea from the net; I have never done that; unsure)

  2. For the wavelet algorithms... I would have to analyze the wmtsa package to rewrite the hotspots in C? Reimplementing the entire wmtsa package in C or Fortran would be very non-trivial for me; I do not have much programming experience.

  3. Each data piece in the file occupies 20 consecutive bytes, which I could map directly to a C-like char* array? At present my Python program reads one byte at a time and appends it to a list, which is slow. This part of the code takes 1.5 hrs vs. ~200 hrs for R, so it is not that urgent.
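A minimal sketch of the char*-style mapping in point 3, assuming the values are unsigned single-byte integers and using NumPy (an assumption; the filename and record layout are placeholders for the real file):

```python
# Sketch: parse the whole binary blob as fixed 20-byte records in one shot,
# instead of appending byte-by-byte to a Python list.
import numpy as np

def read_pieces(raw: bytes, piece_size: int = 20) -> np.ndarray:
    """Return an (n_pieces, piece_size) view over the raw bytes."""
    arr = np.frombuffer(raw, dtype=np.uint8)  # zero-copy view of the buffer
    return arr.reshape(-1, piece_size)        # one row per piece

raw = bytes(range(40))            # stand-in for open("data.bin", "rb").read()
pieces = read_pieces(raw)
print(pieces.shape)               # (2, 20)
```

Each row of `pieces` can then be handed to R in one call rather than value by value.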

This is the first time I have had to deal with program efficiency in a real problem. I searched the web and felt overwhelmed by the information. Please give me some advice on what to do next, and how.

Cheers!

footnotes:

  1. http://cran.r-project.org/web/packages/wmtsa/index.html

* Update *

Thanks to proftools from CRAN, I managed to create a call-stack graph, and I can see that ~56% of the time is spent in wmtsa; the code snippet looks like:

W <- wavCWT(s, wavelet="gaussian1", variance=1/p)  # ~1/4 of the wmtsa time
W.tree <- wavCWTTree(W)                            # ~1/2
holderSpectrum(W.tree)                             # ~1/4

~28% of the time is spent in nls:

nls(y ~ Q * dexGAUS(x, m, abs(s), abs(n)) + B, start = list(Q = 1000, m = h$time[i], s = 3, n = 8, B = 0), algorithm="default", trace=FALSE)

where evaluating dexGAUS from the gamlss.dist package takes the majority of the time.

The remaining ~10% of the R time is spent on data passing/splitting/aggregation/subsetting.
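Since dexGAUS dominates the nls time, one variant of option 1 would be to move the whole fit to compiled code via SciPy, where the ex-Gaussian density is available as scipy.stats.exponnorm. This is only a sketch under assumptions: the parameter names mirror the R call (Q, m, s, n, B), but exponnorm's parameterization (K = tau/sigma, loc = mu, scale = sigma) may not match dexGAUS exactly, so the mapping `K = n / s` should be verified against gamlss.dist before trusting the results:

```python
# Sketch: ex-Gaussian curve fit with scipy instead of R's nls + dexGAUS.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import exponnorm

def exgauss_model(x, Q, m, s, n, B):
    # amplitude * ex-Gaussian pdf + baseline; K = n/s is an assumed mapping
    return Q * exponnorm.pdf(x, K=n / s, loc=m, scale=s) + B

# synthetic data, just to show the call shape
rng = np.random.default_rng(0)
x = np.linspace(0, 50, 200)
y = exgauss_model(x, 1000, 20, 3, 8, 0) + rng.normal(0, 0.1, x.size)

popt, _ = curve_fit(exgauss_model, x, y, p0=[1000, 20, 3, 8, 0])
```

curve_fit uses compiled least-squares internally, so the per-piece Python overhead is mostly the model evaluations.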

Answers

For option 3, getting your data in efficiently: read it all in as one long str in Python with a single read from the file. Let's assume it's called myStr.

import array
myNums = array.array('B', myStr)

Now myNums is an array with each byte easily converted... see help(array.array). In fact, looking at that, it seems you can fill the array directly from a file.

That should shave about 1.4 hours off your data reading.
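The direct-from-file variant mentioned above can be sketched like this (the file object here is an in-memory stand-in for the real data file):

```python
# Sketch: fill an array.array straight from a file object, skipping the
# intermediate string entirely.
import array
import io

buf = io.BytesIO(bytes(range(100)))   # stand-in for open("data.bin", "rb")
myNums = array.array('B')
myNums.fromfile(buf, 100)             # read 100 unsigned bytes in one call
print(myNums[:5])                     # array('B', [0, 1, 2, 3, 4])
```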

My understanding is that you have:

  • python code that uses rpy2 in places

  • performance issues that can be traced to calls to rpy2

  • the performance issues do not appear to have much to do with rpy2 itself, as the underlying R code is largely responsible for the running time

  • a part of your code was reading bytes one at a time and appending them to a list, which you can improve as suggested

It is somewhat hard to help without seeing the actual code, but you may want to consider:

  • a buffering strategy for reading bytes (as already answered by John);

  • working on optimizing your R code;

  • trivial parallelization (and eventually renting compute capacity in the cloud).




