Segmented downloading: why the big deal?

Well, it has come to my attention (and has been for quite some time) that people dislike segmented downloading for some of the most ridiculous reasons, so I thought I would write this up to debunk the two main complaints I have seen most often.

#1) “Segmented downloading kills hard drives”

OK, so first off: if you use any kind of torrenting program (uTorrent, Transmission, etc.) and you are complaining about segmented downloading in DC++ (or any other DC client), you really can't expect to be taken seriously. If you were to monitor the I/O usage of your torrent app of choice and compare it to the I/O usage of your DC app, the torrent app is more than likely doing a lot more I/O transfers (obviously, like anything, it depends on how many files you are downloading in each app), but more than likely the torrent app is working your hard drive a lot harder.

Secondly, hard drives today (ca. 2010/2011) have MTBFs of over 1,000,000 hours, which comes to roughly 114 years, so that is a pretty weak argument against segmented downloading.

Also, consider logging for a moment (yes, I know you are probably wondering why I would mention logging, but bear with me). For everyone who logs main chats and PMs or anything else in your DC app of choice: every time you receive a message (main chat or private message), your client writes it to a log file that hits the hard disk immediately (although, if I recall correctly, some clients may hold the messages in a buffer to be written every x minutes, which is silly in and of itself, but I digress). That means a whole lot more writes to the hard disk than, say, segmented downloads. I hope you see where I'm going with this.

#2) “Segmented takes more slots and leaves less for other users”

Well, consider this: the majority of internet connections are asymmetrical (download speed is faster than upload speed), so in reality downloading from one user at a time is likely to take MORE time (and thus hold up other users waiting to download the file). If you have an 8 Mbit download, does it make more sense to download from one person with a 1 Mbit upload, or from 8 users each with a 1 Mbit upload? I don't think anyone would opt for the slower download in the end, so what's the complaint? I don't see any major valid reason not to use segmented downloading.
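To put rough numbers on that (ignoring protocol overhead and assuming each uploader can actually sustain 1 Mbit/s): a 700 MiB file at 1 Mbit/s (~122 KiB/s) takes on the order of 100 minutes from a single source, while the same file pulled from 8 such sources in parallel (~8 Mbit/s, ~975 KiB/s) finishes in roughly 12 minutes, and all the upload slots involved are freed that much sooner.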

Well that’s that on this subject, but I do encourage (and greatly appreciate) any feedback or discussion on this topic.

I won't comment on #2 … I would just rather ask: does anyone here think BitTorrent is slow? And that is all about downloading from as many sources as possible.


Regarding #1: this was already discussed in the Dev hub. I gave a possible explanation that fits and nobody has yet been able to debunk it.
The idea is that the implementation of DC++ fights against the caching algorithm of the OS, e.g.:
1) a 1 MiB segment is uploaded…
2) the OS assumes more will be read and caches the next 20 (or x) MiB,
3) DC++ closes the file, which the OS takes as a hint that it will no longer be needed, so it discards the cached data,
4) go back to 1).
In the end the HDD has to put up with 20/x times more strain than with a normal upload (a rough sketch of this per-segment pattern is shown below).
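For illustration only, here is a minimal C++ sketch of the per-segment open/read/close pattern the hypothesis describes (this is not DC++'s actual upload code; the function name and the 1 MiB segment size are just assumptions for the example):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// One call per requested segment: open the file, read one chunk,
// then let the destructor close the handle again.
std::vector<char> readSegment(const std::string& path, uint64_t offset, size_t size) {
    std::ifstream file(path, std::ios::binary);
    file.seekg(static_cast<std::streamoff>(offset));
    std::vector<char> buf(size);
    file.read(buf.data(), static_cast<std::streamsize>(size));
    buf.resize(static_cast<size_t>(file.gcount()));
    return buf;
}   // file closed here; per the hypothesis, the OS may now drop whatever
    // it had read ahead beyond this segment, only to read it again next time
```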

Solutions: 1. change the implementation of DC++ so it does not close the file, or
2. enlarge the segment size…

MTBF hours are really not worth much to us… I would go by experience instead.
In a 30-40 user hub I hear about 3-4 dead HDDs per year,
which amounts to, let's say, a 5-10 year lifetime under filesharing conditions…
probably less, as some might replace HDDs earlier…
If 10 times more strain were put on the HDD, that should be pretty noticeable, though I doubt that 10 times more reads/writes will lead to hardware failure 10 times sooner.
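(Rough arithmetic behind that estimate, purely as an illustration: assuming about one drive per user, 30-40 drives with 3-4 failures a year implies a mean time to failure somewhere around 30/4 ≈ 7.5 to 40/3 ≈ 13 years; more drives per user would stretch the figure, early replacements would shrink it.)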

DC++ always uses the FILE_FLAG_SEQUENTIAL_SCAN flag on Windows to tell the OS that the file will be processed sequentially. As a second option, Windows supports the FILE_FLAG_RANDOM_ACCESS flag to hint random access. The problem is that no other OS supports it.

Or there could be the possibility to disable caching completely with FILE_FLAG_NO_BUFFERING and handle it in our own way (I'm still not sure about OSes other than Windows)? Or could memory-mapped files help with this?
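For reference, a minimal Windows-only sketch of how such hints are passed to CreateFile (the wrapper function here is invented for illustration; DC++'s own File class wraps this differently):

```cpp
#include <windows.h>

// Open a file for upload with either a sequential-scan or random-access
// cache hint. FILE_FLAG_NO_BUFFERING would bypass the cache entirely, but
// then every read must be sector-aligned and buffered by the application.
HANDLE openForUpload(const wchar_t* path, bool randomAccess) {
    DWORD hint = randomAccess ? FILE_FLAG_RANDOM_ACCESS : FILE_FLAG_SEQUENTIAL_SCAN;
    return CreateFileW(path, GENERIC_READ,
                       FILE_SHARE_READ | FILE_SHARE_WRITE,
                       nullptr, OPEN_EXISTING, hint, nullptr);
}
```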

The solution of not closing the file runs into the same problem as logging finished uploads: when is the file really finished, so it can be closed? Or do we leave the file open for the whole session? That's not correct behaviour, because it locks the file completely.

It reminds me that RevConnect was using memory-mapped files for segmented downloading. Does that make any sense?

QS: could you please back up your claim that the file is removed from the cache when closed? I see no reason why the OS should do this, modulo memory constraints, which usually don't apply… for example, create a large file and an app that reads it, closes it, then rereads it, and post your timings…
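Something like the following sketch would do for that experiment (file name from the command line; the 1 MiB read buffer is an arbitrary choice): if the second pass is not noticeably faster than the first, closing the file really did evict it from the cache.

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

// Read the whole file through a 1 MiB buffer and return the elapsed seconds.
static double readWhole(const char* path) {
    auto start = std::chrono::steady_clock::now();
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<char> buf(1 << 20);
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0) { /* discard data */ }
    }   // stream closed here
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: reread <file>\n"; return 1; }
    std::cout << "first pass:  " << readWhole(argv[1]) << " s\n";
    std::cout << "second pass: " << readWhole(argv[1]) << " s\n";
}
```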

Post hoc ergo propter hoc? Really?

I have tested solution one and a dynamic segment size in FlowLib (thanks Hackward for giving me some pointers).
Solution one gave me a huge performance improvement :slight_smile:

What I do (I don't know if this solution is checked into SVN) is have a global file handler.
When a user-to-user connection starts, I set the segment size to 1 MiB.
When I receive content to save, I call Write on the global file handler.
If the file is not already open, I open the file and add an object (including the file handle and a last-used timestamp) to a list.
Then I lock the specific section of the file I want to write to and write that part.
Then I update the last-used timestamp.
The global file handler has a thread that tries to close unused files (not used for X seconds).
I also have a trigger on file completion (yes, I know when I have all the content of a file) which forces the file handle closed when the file is completed. (A rough sketch of this pattern follows below.)
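Purely as an illustration of that pattern (FlowLib itself is C#, so everything below, including the class and method names, is an invented C++ approximation rather than the actual code):

```cpp
#include <chrono>
#include <cstdint>
#include <fstream>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// One shared handle per file, reused across segments, closed by a reaper
// thread when idle or immediately when the download reports completion.
class SharedFileWriter {
    struct Entry {
        std::fstream stream;
        std::chrono::steady_clock::time_point lastUsed;
    };
    std::map<std::string, std::shared_ptr<Entry>> files;
    std::mutex lock;   // FlowLib locks per byte range; a single mutex keeps the sketch short

public:
    // Write one segment; the file is opened only if no handle is cached yet.
    // Assumes the target file was pre-allocated when the download was queued.
    void write(const std::string& path, uint64_t offset, const char* data, size_t len) {
        std::lock_guard<std::mutex> g(lock);
        auto& e = files[path];
        if (!e) {
            e = std::make_shared<Entry>();
            e->stream.open(path, std::ios::binary | std::ios::in | std::ios::out);
        }
        e->stream.seekp(static_cast<std::streamoff>(offset));
        e->stream.write(data, static_cast<std::streamsize>(len));
        e->lastUsed = std::chrono::steady_clock::now();
    }

    // Called periodically by a background thread: drop handles idle for too long.
    void closeIdle(std::chrono::seconds maxIdle) {
        std::lock_guard<std::mutex> g(lock);
        auto now = std::chrono::steady_clock::now();
        for (auto it = files.begin(); it != files.end();) {
            if (now - it->second->lastUsed > maxIdle) it = files.erase(it);
            else ++it;
        }
    }

    // Called from the completion trigger: close the handle right away.
    void closeFinished(const std::string& path) {
        std::lock_guard<std::mutex> g(lock);
        files.erase(path);
    }
};
```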

About the segment size: I have a function that is called after every successful segment.
This function works out whether I could download more if the segment size were bigger (more or less by looking at the time it took to download X bytes).
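A minimal sketch of that idea (the thresholds, the cap, and the 15-second target are all invented for illustration, not FlowLib's actual numbers):

```cpp
#include <algorithm>
#include <cstdint>

// After each finished segment, grow or shrink the next request depending on
// how long the last one took compared to a target duration.
uint64_t nextSegmentSize(uint64_t currentSize, double secondsTaken) {
    const uint64_t minSize = 1ull << 20;     // 1 MiB, the starting size
    const uint64_t maxSize = 64ull << 20;    // arbitrary 64 MiB cap
    const double targetSeconds = 15.0;       // aim for roughly this long per segment

    if (secondsTaken < targetSeconds / 2)    // finished quickly: double the size
        return std::min(currentSize * 2, maxSize);
    if (secondsTaken > targetSeconds * 2)    // too slow: back off
        return std::max(currentSize / 2, minSize);
    return currentSize;
}
```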

FlowLib uses non-sequential writes for writing files and should work on all platforms supporting it (it is part of .NET, so it might work in Mono :slight_smile:)

@arnetheduck and cologic

This is a hypothesis I put out. As long as nobody comes up with a tool that records the actual reads hitting the HDD (or some documentation of the caching algorithm Windows uses), I doubt we will get much further here.
It's really pure guesswork to provide a plausible hypothesis for the seemingly increased HDD failures. The primary point is: stay open-minded that the reports of higher HDD failure rates may be real and not just users' imagination.

Open/close operations are very expensive in terms of operating system time. They require setup/teardown of tables in the kernel, as well as ensuring that write buffers are flushed (including any journaling). From that point of view, any code which avoids needlessly opening and closing files will see a performance increase, irrespective of any other contributing factors.

(@iceman50: whether or not an application itself does write buffering, the operating system certainly will. It is not a stupid idea at all. It follows exactly the same logic as an L1/L2 cache, by combining many small seek/write/seek-back operations into a single seek/write-sector/seek-back.) I have lost the link to the article, but when this was introduced to Linux it resulted in 40% less HD activity under 'average' load.

As for hard drive failures, don't fall into the trap of assuming that more reports of failure imply more failures. (In the past, before the days of SMART etc., you wouldn't know about hard drive failures until a failure hit a key file, at which point chances were your computer wouldn't boot. Then you just blamed the computer and got a new one, without identifying the underlying cause.) On the other hand, the consumer market these days is constantly trying to sell products made in a cheaper way or with cheaper materials, which itself can have negative effects on longevity.

My personal opinion is that there are far worse things that happen to disks than segmented downloading.

~Andrew

@andyhhp: I agree 100%, and like cologic said, they [users] will blame it on the first thing they can. I.e. a hard drive fails while up/downloading with segmented downloading, so they default to blaming that and don't actually dig deeper to find the real cause of the failure (and this is assuming, a strong assumption I might add, that it isn't the segmented downloading causing it). And as human nature goes, it spreads like wildfire: one user tells another that segmented downloading is evil and causes drive failure, which leads one user to tell ten, and so on and so forth. Thankfully we are having a lot of quality posts on this subject to show that maybe, just maybe, segmented downloading isn't such a horribly awful thing. =)

Actually, you don't need any "special" tools. Just plot the timings of a read or write loop to see how long it takes to process a file with and without segments, or to reread a file multiple times… no rocket science there. Then write another that opens and closes said file in a loop and plot how many iterations you can do in a minute (try looking at a clock for a full 60 seconds to get a good feeling of just how long that is…) to measure how "expensive" that operation is. Hypothesis is a strong word here; I'd say it's closer to relaying rumor or hearsay, like telling a good story or reciting the Bible.
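A sketch of that second measurement, as a companion to the reread-timing example above (again taking the file name from the command line):

```cpp
#include <chrono>
#include <cstdio>
#include <iostream>

// Count how many open/close cycles of the same file fit into one minute.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: opencycles <file>\n"; return 1; }
    auto end = std::chrono::steady_clock::now() + std::chrono::minutes(1);
    long cycles = 0;
    while (std::chrono::steady_clock::now() < end) {
        if (FILE* f = std::fopen(argv[1], "rb")) {
            std::fclose(f);
            ++cycles;
        }
    }
    std::cout << cycles << " open/close cycles in 60 s\n";
}
```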

@andyhhp I would disagree with you here. Doing the job of the OS in a program is the inner-platform effect antipattern.
This is something that should never be done without good reason! Though there might be a good reason here…


@arnetheduck
Hypothesis is exactly the right word for an unproven idea. I avoided the word theory as that would have been too strong, but hypothesis seems exactly right. I even stated the two necessary preconditions:

  1. an implementation that closes the file.
  2. a caching algorithm that reacts to the file being closed in a plausible way.

That sounds perfectly fine for a hypothesis…

The measurements you are talking about seem to require modifying the source code, so I see this as a job for you or any of the other DC++ devs. I just wanted to provide a plausible explanation of how segmented downloading could be bad for HDDs, if a flawed implementation and a bad caching algorithm came together. I see segmented downloads as a must for any client out there, with no way around them. And now please stop trying to stultify me with comparisons to the Bible and rumor relaying.

But it's not even very plausible… why would it evict perfectly good data from the cache when it doesn't need to? In fact, a file that's been opened and closed is very likely to be opened again soon… consider DLLs, consider source code files being recompiled, consider MP3s on repeat, browser cache files (the same images being reloaded)… etc. etc.

Hearsay, because you're repeating what others are saying without any additional facts… and you're certainly capable of doing better, for example by doing said experiment; you know how to program just as well as I do…

@Quicksilver: I didn't mean to imply that I thought combining application buffering and operating-system-level buffering was a good idea. I just wanted to state that write buffering itself (irrespective of whether it is implemented at the application or OS level) is a good idea. I would completely agree that buffering at both the application and OS level is a bad idea.

Windows at least allows you to pass flags to specify what sort of buffering you would like the operating system to do, including "don't do any buffering for me; I will do it myself". The C standard uses the setvbuf/setbuf functions in stdio.h to alter FILE* buffering. (I am not certain, but I believe the winapi flags just use the stdio functions behind the scenes; it has been a long time since I researched the topic.)
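For example, a minimal use of that stdio knob (the file name is just a placeholder):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<char> buf(1 << 20);        // 1 MiB user-supplied buffer
    FILE* f = std::fopen("upload.dat", "rb");
    if (!f) return 1;
    // Must be called before any other operation on the stream:
    // _IOFBF = full buffering through our buffer; _IONBF would disable buffering.
    std::setvbuf(f, buf.data(), _IOFBF, buf.size());
    // ... reads on f now go through the 1 MiB buffer ...
    std::fclose(f);
    return 0;
}
```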

@arnetheduck: The problem with your argument is the definition of "when it doesn't need to". The argument of temporal locality of data does not work for file handles. If I call close() on a file handle, I am telling the operating system that I am truly and utterly done with the file. It is perfectly reasonable for the OS to use this as an indication to free up the cache. Also, how much do you expect the OS to cache of a file? There is no way that an OS is going to cache all of a 700 MB file in memory, even on a machine with 4 GB of RAM. The OS is constantly looking for any excuse to free up areas of its cache so the memory can be given to other applications without them taking as many page faults.

As for your examples:

DLLs: the common DLLs are resident in memory for all processes and are demand-paged into the process address space (which is trivial kernel overhead and no disk activity).

Source code files: this is why GCC only outputs the intermediate files if you specifically request them. Otherwise, they are just kept in internal buffers in memory (taking the "let me buffer it myself" approach).

MP3s: it is the job of the media application to cache the file handles, especially if the track is on repeat. That way, any caching won't be flushed. The same argument applies to browsers with cached content, except that this content tends to get cached in memory. This is one reason why browsers have huge memory usage compared to other applications.

~Andrew

It could be very similar when using memory-mapped files. You open one global file-mapping handle and then only open/close views for each segment. The global file handle is closed when the file is finished. But this is mainly about downloading. What about the situation on the side of the uploader?
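As an illustration of that scheme on Windows (the class and method names are invented for the example; note that view offsets must be multiples of the allocation granularity, which 1 MiB segments satisfy):

```cpp
#include <windows.h>
#include <cstring>

// One file handle and one mapping kept for the whole download,
// one short-lived view per segment.
class MappedDownload {
    HANDLE file = INVALID_HANDLE_VALUE;
    HANDLE mapping = nullptr;

public:
    bool open(const wchar_t* path, ULONGLONG fileSize) {
        file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ,
                           nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return false;
        mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                     static_cast<DWORD>(fileSize >> 32),
                                     static_cast<DWORD>(fileSize & 0xFFFFFFFF), nullptr);
        return mapping != nullptr;
    }

    // Map only the segment being written, copy the data in, unmap again.
    bool writeSegment(ULONGLONG offset, const char* data, SIZE_T len) {
        void* view = MapViewOfFile(mapping, FILE_MAP_WRITE,
                                   static_cast<DWORD>(offset >> 32),
                                   static_cast<DWORD>(offset & 0xFFFFFFFF), len);
        if (!view) return false;
        std::memcpy(view, data, len);
        return UnmapViewOfFile(view) != 0;
    }

    // Only called once the whole file is finished.
    void close() {
        if (mapping) { CloseHandle(mapping); mapping = nullptr; }
        if (file != INVALID_HANDLE_VALUE) { CloseHandle(file); file = INVALID_HANDLE_VALUE; }
    }
};
```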

As I'm checking the RevConnect code now, memory-mapped files were used only in its first versions. Later they were replaced with its own implementation of SharedFileStream (i.e. only one global handle shared by all segments).

I have been trying to give an explanation for what others are saying. I.e. taking as given the claim that segmented downloads are bad for HDDs, can we think up an explanation for it?

And no, arne, I don't know how to program C++, or DC++ specifically, as well as you do. For me it would be a journey of hours just to find the responsible part of the source and set up an environment that can compile DC++. It's definitely not worth that to me.
I thought about how such a problem might occur and coded defensively in my own implementation. I have to, as I want my code to work well under any OS, and like andyhhp I find it rather plausible that a cache evicts closed media files.

I don't know if this is perfect, but the global file handler does the same here.
It keeps the file handle alive for X time unless it is told otherwise.

You could probably add an upper limit on the number of file handles to keep, but I'm not :slight_smile:

What is your evidence for this? If you’re going to repeat your previous assertions on the topic, see my previous, Latin reply.

I will bring SharedFileStream back into StrongDC++. It could improve performance at the beginning and end of segments. But I don't think it can be used correctly for uploads, because it's not possible to say when the downloader will stop requesting segments, i.e. when the file can be closed.

RevConnect’s code for SharedFileStream can be seen here - http://reverseconnect.cvs.sourceforge.net/viewvc/reverseconnect/RevConnect/client/SharedFileStream.cpp?revision=1.1.1.1&content-type=text%2Fplain

I was also thinking about that FILE_FLAG_RANDOM_ACCESS flag. Does it make sense to use it? Segments are still read sequentially, and it can't be said whether the next segment will follow the current one or not, but there is a higher probability that it will.

There is no proof for that… the only evidence is users complaining about more failed HDDs. So the evidence is slim, but it exists.
You can call this explanation a Gedankenexperiment. The whole point is: if users complain about something they perceive, don't dismiss it as impossible when we can come up with a reasonable explanation for such a perception.


Also, to come back to and repeat the point about HDD MTBF hours: MTBF figures are presented by the manufacturer without preconditions.
We know, though, that our users are filesharers, who potentially put more strain on an HDD than the manufacturer's average user. I know my math there was a rough estimation… but given data from more and larger hubs, I imagine we could come up with an MTTF figure for filesharers that would be much more meaningful to us than anything a manufacturer provides. By counting hubs which forbid segmenting separately from normal hubs, we might even get an empirical check on the hypothesis. But of course this seems like a lot of work compared to measuring the caching with a modified DC++ version.
Though we know that our users are filesharers which have potentially more strain on a hdd than an average user of that manufacturer. I know that my Math there was rough estimation … though given data from more and larger hubs I imagine we could come up with a MTTF time for filesharers that would be much more meaningful for us than anything a manufacturer provides. Counting in hubs which forbid segmenting and normal hubs… we might even get empirical check for the hypothesis. But of couse this seems like a lot of work compared to measureing the caching with a modified dc++ version.