Estimating/forecasting download completion time

We've all poked fun at the "X minutes remaining" dialog, which seems too simplistic, but how can we improve it?

Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like "20-25 mins remaining" with some Y% confidence interval.

Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?

Or is there some open source code already out there?

Edit: Summarising:

  1. Improve the estimated completion time via a better algorithm/filter, etc.
  2. Provide an interval instead of a single time ("1h45m-2h30m"), or just limit the precision ("about 2 hours").
  3. Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps "about 2 hours, currently stalled".
Answers

More generally, I think you are looking for a way to give an instant measure of the transfer speed, which is generally obtained by averaging over a small period.

The problem is that in order to be reactive, the period is usually extremely small, which leads to the yo-yo effect.

I would propose a very simple scheme; let's model it.

Think of a curve speed (y) over time (x).

  1. the Instant Speed is just reading y at the current x (x0).

  2. the Average Speed, is no more than Integral(f(x), x in [x0-T,x0]) / T

  3. the scheme I propose is to apply a filter that gives more weight to the most recent moments while still taking past moments into account.

It can be easily implemented as g(x, x0, T) = (2/T) * (x - x0) + 2, a simple triangle over [x0-T, x0] (0 at x0-T, 2 at x0) whose integral over the window is T.

And now you can compute Integral(f(x) * g(x, x0, T), x in [x0-T, x0]) / T, which works because both functions are always positive.

Of course you could use a different g, as long as it is always positive on the interval and its integral over the interval is T (so that its own average weight is exactly 1).

The advantage of this method is that because you give more weight to immediate events, you can remain pretty reactive even if you consider larger time intervals (so that the average is more precise, and less susceptible to hiccups).
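A sketch of the triangular weighting described above, applied to discrete speed samples (the function and variable names here are my own illustrative choices, not from the answer):

```python
def weighted_speed(samples, now, T):
    """Average speed over [now - T, now], weighting recent samples more.

    Uses the triangular weight g(x) = (2/T) * (x - now) + 2, which is 0
    at now - T and 2 at now, so recent samples count the most while the
    average weight over the window is exactly 1.
    """
    total = 0.0
    weight_sum = 0.0
    for t, speed in samples:
        if t < now - T:
            continue  # sample is outside the averaging window
        w = (2.0 / T) * (t - now) + 2.0
        total += w * speed
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

# A steady 100 KB/s with one recent spike to 300 KB/s: the spike pulls
# the weighted average up more than a plain mean would (~136.4 vs 120).
samples = [(t, 100.0) for t in range(9)] + [(9, 300.0)]
print(weighted_speed(samples, now=9.0, T=10.0))
```

Because the weights are normalised over the actual samples in the window, the estimate stays correct even when samples arrive at irregular intervals.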

Also, what I have rarely seen but think would provide more precise estimates would be to correlate the time used for computing the average to the estimated remaining time:

  • if I download a 5 kB file, it's going to be loaded in an instant; no need to estimate
  • if I download a 15 MB file, it's going to take roughly 2 minutes, so I would like estimates, say... every 5 seconds?
  • if I download a 1.5 GB file, it's going to take... well, around 200 minutes (at the same speed), which is to say 3h20m... perhaps an estimate every minute would be sufficient?

So, the longer the download is going to take, the less reactive I need to be, and the more I can average out. In general, I would say that a window could cover 2% of the total time (except perhaps for the first few estimates, because people appreciate immediate feedback). Also, indicating progress by whole percents at a time is sufficient. If the task is long, I was prepared to wait anyway.
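The 2%-of-total-time rule of thumb could be sketched like this (the function name and the clamp values are illustrative choices, not from the answer):

```python
def update_interval(estimated_total_seconds, floor=1.0, ceiling=60.0):
    """Averaging window / refresh period as ~2% of the estimated total
    transfer time, clamped so short downloads still get immediate
    feedback and long ones are not needlessly jumpy."""
    return min(max(0.02 * estimated_total_seconds, floor), ceiling)

print(update_interval(10))     # tiny file: clamped up to 1.0 s
print(update_interval(120))    # ~2 minute download: every 2.4 s
print(update_interval(12000))  # ~3h20m download: clamped down to 60.0 s
```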

I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?

Basically, you predict the future by looking at your current model, and update the model at each time step to reflect changes in the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of the battery, etc.

See http://en.wikipedia.org/wiki/Kalman_filter for a more in-depth description of the algorithm.

The filter also gives a variance measure, which could be used to indicate your confidence in the estimate (although, as other answers mention, it might not be the best idea to show this to the end user).

Does anyone know if this is actually used somewhere for download (or file copy) estimation?
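A minimal sketch of the idea, assuming a one-dimensional filter whose state is just the transfer rate, modeled as a random walk (the class name and the noise values q and r are illustrative and would need tuning):

```python
class RateKalman:
    """1-D Kalman filter tracking transfer rate as a random walk.

    q: process noise - how quickly the true rate is expected to drift.
    r: measurement noise - how jumpy individual speed samples are.
    """

    def __init__(self, initial_rate, q=1.0, r=100.0):
        self.x = initial_rate  # current rate estimate
        self.p = r             # variance of the estimate
        self.q = q
        self.r = r

    def update(self, measured_rate):
        # Predict: under a random-walk model the estimate carries over,
        # but its uncertainty grows.
        self.p += self.q
        # Update: blend prediction and measurement by their variances.
        k = self.p / (self.p + self.r)  # Kalman gain in [0, 1]
        self.x += k * (measured_rate - self.x)
        self.p *= 1.0 - k
        return self.x

kf = RateKalman(initial_rate=100.0)
for speed in [120.0, 80.0, 110.0, 90.0, 400.0, 100.0]:  # 400 is a spike
    kf.update(speed)
# kf.x is the smoothed rate; kf.p is the variance mentioned above, which
# could back a confidence display. The spike moves kf.x far less than a
# raw reading would.
```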

Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.

Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of exciting new games. The server may be heavily loaded right now. If you carry your laptop to a cafe, the results will be different than at home. So you probably can't rely on historical data to predict the future of download speeds.

If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.

If you know how much data must be downloaded, you can provide % completed progress.

If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even though you don't know how much time remains.

Improving the estimated time itself: intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow down or speed up.

One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.

However, if the speed fluctuates randomly, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter) to avoid too much fluctuation.

So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes), using it to generate a Gaussian filter that is applied to the inputs, and then limiting the quoted precision using that standard deviation.

How, though, would you limit the standard deviation calculation to the last N minutes? How do you know what N to use?

Alternatively, there are pattern recognition possibilities for detecting whether we've hit a stable speed.
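Pulling the pieces of this answer together as a sketch: exponential weighting for the mean, a bounded window for the standard deviation, and quoted precision limited by the noise level. The values of alpha, window, and the noise thresholds are illustrative guesses, not values from the answer:

```python
import math
from collections import deque

class SpeedEstimator:
    def __init__(self, alpha=0.2, window=60):
        self.alpha = alpha                  # weight given to newest sample
        self.mean = None                    # exponentially weighted mean speed
        self.recent = deque(maxlen=window)  # bounds the "last N" std window

    def add(self, speed):
        self.mean = speed if self.mean is None else (
            self.alpha * speed + (1 - self.alpha) * self.mean)
        self.recent.append(speed)

    def std(self):
        if len(self.recent) < 2:
            return 0.0
        m = sum(self.recent) / len(self.recent)
        return math.sqrt(sum((s - m) ** 2 for s in self.recent) / len(self.recent))

    def eta_text(self, bytes_left):
        if not self.mean:
            return "estimating..."
        eta = bytes_left / self.mean
        noise = self.std() / self.mean  # relative noisiness of the link
        # Noisier link -> coarser quote, per the precision-limiting idea.
        step = 60 if noise > 0.3 else 10 if noise > 0.1 else 1
        return "about %d s" % (round(eta / step) * step)

est = SpeedEstimator()
for s in [100.0] * 10:           # perfectly steady link
    est.add(s)
print(est.eta_text(1000.0))      # -> about 10 s
```

The deque with `maxlen` gives the "last N samples" limit that the question above asks about; choosing N (and the units, samples vs minutes) remains a judgment call.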

I've considered this off and on myself. I think the answer starts with being conservative when computing the current (and thus future) transfer rate, and includes averaging over longer periods to get more stable estimates. Perhaps also low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.

I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.
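That low-pass filtering of the displayed time could be as simple as an exponential smoother on the raw estimate (the function name and alpha value are illustrative):

```python
def smooth_eta(displayed, raw_estimate, alpha=0.1):
    """Drift the displayed remaining time toward the raw estimate instead
    of jumping straight to it, so a momentary stall does not flip the
    dialog from '2 minutes' to '2 days' and back."""
    if displayed is None:
        return raw_estimate  # nothing shown yet: start from the raw value
    return displayed + alpha * (raw_estimate - displayed)

shown = None
for raw in [120.0, 118.0, 100000.0, 110.0, 105.0]:  # one wild outlier
    shown = smooth_eta(shown, raw)
# `shown` moved toward the outlier but never reached it, and is already
# drifting back down toward the sane values.
```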




