English 中文(简体)
Given a time series for many unique IDs, I need the top 100 deltas for each period
原标题:
  • 时间:2009-12-23 03:50:58
  •  标签:
  • r
  • analytics

I have a time series of data in TSV like so:

ID 	 Date 	 Value
-------------------------------
1234567 	 2009-01-01T00:00:00.000Z 	 121
12131 	 2009-06-01T00:00:00.000Z 	 151
12131 	 2009-07-01T00:00:00.000Z 	 15153
...

It easily fits in RAM, but is too big for Excel.

There is one value per month per ID, but not all IDs have entries for all 12 months.

The data spans 12 months, but not all IDs have all 12 months. I want to go through the data for each ID, and if there is an entry for the previous month, take the current month minus the previous month and store it in a new column to get a delta. If there is no entry for the previous month, then return 0. Then, for each month, I want the top 100 positive and negative of those deltas, along with the ID.

I d like to do this in R, because it s hard in Excel and it keeps crashing. I have R, Rattle, etc. installed and I ve worked through basic examples, but ... the learning curve is steep. I would really appreciate some help :)

最佳回答

Start by adding in all missing months:

all_combs <- expand.grid(
  ID = unique(data$ID),
  Date = unique(data$Date))

data <- merge(data, all_combs, by = c("ID", "Date"), all = T)
# Ensure data ordered by date
data <- data[with(data, order(ID, Date)), ]

Then add a column of deltas (calculated with diff)

library(plyr)
data <- ddply(data, "ID", transform, delta = c(NA, diff(Value)))

Finally, remove missing deltas, order by their value and extract the top and bottom 10 within each group.

changed <- subset(data, !is.na(delta))
changed <- changed[with(changed, order(ID, delta)), ]

# Select top 100 for each
top10 <- ddply(changed, "ID", function(df) {
 rbind(head(df, 10), tail(df, 10))
})
问题回答

Ok, first some code to generate some test data. This makes 100 random IDs and for each one chooses 20 months from a 2 year period along with random values. The order is then shuffled for extra fun.

## Generate some IDs
ids <- sample(1000, 100)

## Generate the data
data <- do.call(rbind,
                lapply(ids,
                       function(id)
                       data.frame(ID = id,
                                  Date = sample(as.Date(paste(rep(c(2008:2009), each=12),
                                    1:12, 1, sep="-")),
                                    20),
                                  Value = sample(1000, 20))))

## Shuffle
data <- data[sample(nrow(data), nrow(data)),]

Here s what it looks like for me:

> head(data)
      ID       Date Value
1007 205 2008-07-01   235
1391 840 2008-12-01   509
918  278 2009-12-01   951
1213 945 2009-03-01   842
1369 766 2009-07-01   555
798  662 2008-12-01   531

Ok, now let s iterate over IDs and find the diff for each month for each ID. Before that, let s convert the month to a number so it ll be easier to take differences (this is a bit unclean, does anyone know a better way to do arithmetic on Date objects?). This just does year * 12 + month so that normal arithmetic works:

data$Month <- as.POSIXlt(data$Date)$mon + as.POSIXlt(data$Date)$year * 12

Now compute the differences:

by.id <- by(data, data$ID, function(x) {
  ## Sort by month.
  x <- x[order(x$Month),]
  ## Compute the month and value differences, taking care to pad the edge case.
  data.frame(ID=x$ID,
             Date = x$Date,             
             Month.diff=c(0, diff(x$Month)),
             Value.diff=c(0,diff(x$Value)))
})
by.id <- do.call(rbind, by.id)

Here s what the result looks like:

> head(by.id)
    ID       Date Month.diff Value.diff
4.1  4 2008-02-01          0          0
4.2  4 2008-03-01          1        123
4.3  4 2008-05-01          2        -94
4.4  4 2008-06-01          1       -243
4.5  4 2008-08-01          2       -327
4.6  4 2008-10-01          2        656

If the difference between consecutive months was greater than 1, then the months were not adjacent and we should set their values to zero.

by.id$Value.diff <- ifelse(by.id$Month.diff == 1,
                           by.id$Value.diff,
                           0)

Finally, we iterate by month and take the top and bottom N differences (I ll set N to 10 here rather than 100 since my test data set is rather small).

by.month <- by(by.id, by.id$Date, function(x) {
  ## Sort the data in each month
  x <- x[order(x$Value.diff),]
  ## Take the top and bottom and label them accordingly.
  cbind(rbind(head(x, 10), tail(x, 10)),
        type=rep(c("min", "max"), each=10))
})

And there we have it. Here s an example result:

> by.month[[24]]
        ID       Date Month.diff Value.diff type
130.20 130 2009-12-01          1       -951  min
415.20 415 2009-12-01          1       -895  min
662.20 662 2009-12-01          1       -878  min
107.20 107 2009-12-01          1       -744  min
824.20 824 2009-12-01          1       -731  min
170.20 170 2009-12-01          1       -719  min
502.20 502 2009-12-01          1       -714  min
247.20 247 2009-12-01          1       -697  min
789.20 789 2009-12-01          1       -667  min
132.20 132 2009-12-01          1       -653  min
64.20   64 2009-12-01          1        622  max
82.20   82 2009-12-01          1        647  max
381.20 381 2009-12-01          1        698  max
303.20 303 2009-12-01          1        700  max
131.20 131 2009-12-01          1        751  max
221.20 221 2009-12-01          1        765  max
833.20 833 2009-12-01          1        791  max
806.20 806 2009-12-01          1        806  max
780.20 780 2009-12-01          1        843  max
912.20 912 2009-12-01          1        929  max

Pseudo Code to start with:

For Each ID
  If Previous month data Exists 
    compute Diff
  Else diff = 0
return diff

For Each Month
  Max 100 (Positive)
  Min 100 (Negative)

#Realish Code
dataset$diff <- lappply(dataset,function(ID,month,value){IF dataset[month-1] = TRUE{value-(value[month-1]})})
#This gets tricky since you need to know the month and what the previous month is in a format you can test




相关问题
How to plot fitted model over observed time series

This is a really really simple question to which I seem to be entirely unable to get a solution. I would like to do a scatter plot of an observed time series in R, and over this I want to plot the ...

REvolution for R

since the latest Ubuntu release (karmic koala), I noticed that the internal R package advertises on start-up the REvolution package. It seems to be a library collection for high-performance matrix ...

R - capturing elements of R output into text files

I am trying to run an analysis by invoking R through the command line as follows: R --no-save < SampleProgram.R > SampleProgram.opt For example, consider the simple R program below: mydata =...

R statistical package: wrapping GOFrame objects

I m trying to generate GOFrame objects to generate a gene ontology mapping in R for unsupported organisms (see http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/...

Changing the order of dodged bars in ggplot2 barplot

I have a dataframe df.all and I m plotting it in a bar plot with ggplot2 using the code below. I d like to make it so that the order of the dodged bars is flipped. That is, so that the bars labeled "...

Strange error when using sparse matrices and glmnet

I m getting a weird error when training a glmnet regression. invalid class "dgCMatrix" object: length(Dimnames[[2]]) must match Dim[2] It only happens occasionally, and perhaps only under larger ...

Generating non-duplicate combination pairs in R

Sorry for the non-descriptive title but I don t know whether there s a word for what I m trying to achieve. Let s assume that I have a list of names of different classes like c( 1 , 2 , 3 , 4 ) ...

Per panel smoothing in ggplot2

I m plotting a group of curves, using facet in ggplot2. I d like to have a smoother applied to plots where there are enough points to smooth, but not on plots with very few points. In particular I d ...

热门标签