How to maximize throughput when processing many files
Say you want to process many files as quickly as possible, where processing time > file read time.
- Will reading multiple files using a thread pool increase throughput, or does it just cause more disk contention?
- If a thread pool does help, what determines how many threads are needed to achieve the maximum? Can this be calculated based on the target system?
- For a single core, will a loop reading and processing asynchronously via threads be faster than doing it synchronously? I assume it would be, since disk latency is so high. But maybe if the file read time is much, much smaller than the processing time, it is better to let the processing step finish uninterrupted, without context switches.
Also, do you have any other tips for maximizing disk throughput?
multithreading optimization operating-system filesystems disk
any other tips for maximizing disk throughput? Buy faster disks and be done with the problem for all time, without worrying about bugs in your processing algorithms, spending time and money writing code, and having to maintain all that code in the future.
– Andrew Henle
Mar 7 at 10:22
It depends on a lot of factors: OS, available CPUs, available memory, disk performance, and so on. When the files are small (< 1 MB) and there are just a few hundred of them, then maybe reading them all into memory and then processing them can be faster than alternating reading and processing. But I believe you have to test and profile things on your own.
– user743414
Mar 7 at 13:54
@AndrewHenle That's always good to keep in mind. Though, if the software is intended to run on a variety of different hardware/OS configurations, like a framework, you still would want to employ some software-based techniques as well.
– Azmisov
Mar 15 at 0:44
@user743414 Despite the numerous possible configurations, I suspect the internals of reading data from disk are implemented pretty similarly across the board for various motherboards, CPUs, RAM, etc. I was hoping someone with more expertise on the internals could describe the general principles, without having to benchmark across many rigs.
– Azmisov
Mar 15 at 0:48
asked Mar 7 at 3:23
Azmisov
1 Answer
I did some benchmarking to come up with some general guidelines. I tested with ~500k smallish (~14 KB) files. I think the results should be similar for medium-sized files; for larger files, I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations for why some things are faster than others.
I tested on a machine with 16 virtual cores (8 physical), dual-channel RAM, and Linux kernel 4.18.
Do multiple threads increase read throughput?
The answer is yes. I think this could be due to either 1) a hardware bandwidth limitation for single-threaded applications, or 2) the OS's disk request queue being better utilized when many threads are making requests. The best performance is with virtual_cores*2 threads. Throughput slowly degrades beyond that, perhaps because of increased disk contention. If the pages happen to be cached in RAM, then it is better to have a thread pool of size virtual_cores. If, however, < 50% of pages are cached (which I think is the more common case), then virtual_cores*2 will do just fine.
I think the reason why virtual_cores*2 is better than just virtual_cores is that a file read also includes some non-disk-related latency: system calls, decoding, etc. So perhaps the processor can interleave the threads more effectively: while one thread is waiting on the disk, a second can be executing the non-disk-related parts of a file read. (Could it also be due to the fact that the RAM is dual channel?)
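To illustrate how such a read-throughput test can be set up, here is a minimal sketch using a thread pool; the corpus of temp files, their sizes, and the pool sizes tried are illustrative stand-ins, not the answer's actual benchmark:

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    # read the whole file; return the number of bytes read
    with open(path, "rb") as f:
        return len(f.read())

def benchmark(paths, n_threads):
    # time how long a pool of n_threads takes to read every file
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start

# generate a small illustrative corpus in a temp directory
tmp = tempfile.mkdtemp()
paths = []
for i in range(64):
    p = os.path.join(tmp, f"f{i}.bin")
    with open(p, "wb") as f:
        f.write(b"x" * 14 * 1024)  # ~14 KB, like the files in the tests above
    paths.append(p)

# sweep the pool size; on a real corpus you would look for the knee
for n in (1, 4, 8, 16):
    total, elapsed = benchmark(paths, n)
```

Note that a tiny cached corpus like this will not show the disk effects described above; the point is only the measurement harness.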
I tested reading files in random order versus sequentially (by looking up each file's physical block location on the storage device and ordering the requests accordingly). Sequential access gives a pretty significant improvement with HDDs, which is to be expected. If the limiting factor in your application is file read time, as opposed to the processing of said files, I suggest you reorder the requests for sequential access to get a boost.
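As a rough sketch of the reordering idea: querying true physical block addresses requires something like the Linux FIEMAP ioctl, but sorting by inode number is a cheap heuristic that often correlates with on-disk layout. The helper name and demo files below are mine, not from the answer:

```python
import os
import tempfile

def order_for_sequential_access(paths):
    # cheap heuristic: sort by inode number, which often correlates with
    # on-disk layout; a real implementation would query physical block
    # addresses (e.g. via the FIEMAP ioctl on Linux) and sort by those
    return sorted(paths, key=lambda p: os.stat(p).st_ino)

# demo on a few temp files
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, name) for name in ("c.bin", "a.bin", "b.bin")]
for p in paths:
    open(p, "wb").close()
ordered = order_for_sequential_access(paths)
```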
Asynchronous disk IO could be used instead of a thread pool. However, from my reading it appears there is not yet a portable way to do it (see this reddit thread). Also, libuv, which powers NodeJS, uses a thread pool to handle its file IO.
Balancing read vs processing throughput
Ideally, reading and processing happen in separate threads: while we process the first file, another thread queues up the next one. But the more threads we allocate to reading files, the more CPU contention with the processing threads. The solution is to give the faster operation (reading or processing) the fewest threads that still leave zero processing delay between files. This formula seemed to give good results in my tests:
import math

prop = read_time / process_time
if prop > 1:
    # reading is the bottleneck: double the virtual core count gives the
    # fastest reads, as per the tests above
    read_threads = virtual_cores * 2
    process_threads = math.ceil(read_threads / (2 * prop))
else:
    process_threads = virtual_cores
    # double the read thread pool so the CPU can interleave better, as
    # mentioned above
    read_threads = 2 * math.ceil(process_threads * prop)
For example: Read = 2s, Process = 10s; so have 2 reading threads for every 5 processing threads
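Wrapping the formula above in a small helper (the function name is mine) makes the example concrete. With read = 2 s, process = 10 s on the 16-virtual-core test machine, it yields 8 read threads and 16 process threads:

```python
import math

def pool_sizes(read_time, process_time, virtual_cores):
    # the thread-count formula above, wrapped in a function for reuse
    prop = read_time / process_time
    if prop > 1:
        # reading is the bottleneck
        read_threads = virtual_cores * 2
        process_threads = math.ceil(read_threads / (2 * prop))
    else:
        # processing is the bottleneck
        process_threads = virtual_cores
        read_threads = 2 * math.ceil(process_threads * prop)
    return read_threads, process_threads

print(pool_sizes(2, 10, 16))  # → (8, 16)
```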
In my tests, there is only about a 1-1.5% performance penalty for having extra reading threads. For a prop close to zero, 1 read + 16 process threads had nearly the same throughput as 32 read + 16 process threads. Modern threads should be pretty lightweight, and the read threads should be sleeping anyway if the files aren't being consumed fast enough. (The same should be true of process threads when prop is very large.)
On the other hand, too few reading threads has a much more significant impact (my third original question). For example, for a very large prop, 1 read + 16 process threads was 36% slower than 1 read + 15 process threads. Since the process threads occupy all the benchmark computer's cores, the lone read thread suffers so much CPU contention that it fails 36% of the time to queue up the next file to be processed. So my recommendation is to err in favor of too many read threads; doubling the read thread pool size as in the formula above should accomplish this.
Side note: you can limit the CPU resources your application consumes by setting virtual_cores to a smaller percentage of the available cores. You may also choose to forgo the doubling, since CPU contention may be less of an issue when a spare core or more is free of the more intensive processing threads.
Summary
Based on my test results, a thread pool of virtual_cores*2 file-reading threads plus virtual_cores file-processing threads will give you good performance across a variety of timing scenarios. This configuration should get you within ~2% of the maximal throughput without having to spend lots of time benchmarking.
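As a concrete illustration of this read-pool/process-pool split, here is a minimal pipeline sketch: reader threads fill a bounded queue and processing threads drain it. The queue bound, demo corpus, and process_fn are illustrative choices of mine, not from the answer:

```python
import os
import queue
import tempfile
import threading

def run_pipeline(paths, read_threads, process_threads, process_fn):
    # bounded queue so readers can't race arbitrarily far ahead of processors
    work = queue.Queue(maxsize=read_threads * 2)
    results = []
    results_lock = threading.Lock()
    path_iter = iter(paths)
    iter_lock = threading.Lock()

    def reader():
        while True:
            with iter_lock:
                path = next(path_iter, None)
            if path is None:
                break
            with open(path, "rb") as f:
                work.put(f.read())

    def processor():
        while True:
            data = work.get()
            if data is None:  # sentinel: no more work
                break
            out = process_fn(data)
            with results_lock:
                results.append(out)

    readers = [threading.Thread(target=reader) for _ in range(read_threads)]
    procs = [threading.Thread(target=processor) for _ in range(process_threads)]
    for t in readers + procs:
        t.start()
    for t in readers:
        t.join()
    for _ in procs:
        work.put(None)  # one sentinel per processing thread
    for t in procs:
        t.join()
    return results

# demo: 20 temp files of sizes 1..20 bytes, processed by len()
tmp = tempfile.mkdtemp()
paths = []
for i in range(20):
    p = os.path.join(tmp, f"f{i}.txt")
    with open(p, "wb") as f:
        f.write(b"a" * (i + 1))
    paths.append(p)
sizes = run_pipeline(paths, read_threads=2, process_threads=4, process_fn=len)
```

Results arrive in nondeterministic order, so consumers that care about order would need to carry an index through the queue.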
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55035572%2fhow-to-maximize-throughput-when-processing-many-files%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
I did some benchmarking to come up with some general guidelines. I tested with about ~500k smallish (~14kb) files. I think the results should be similar for medium sized files; but for larger files, I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations for why some things are faster than others.
I tested with a 16 virtual core (8 physical) computer with dual channel RAM and Linux kernel 4.18.
Do multiple threads increase read throughput?
The answer is yes. I think this could be either due to 1) a hardware bandwidth limitation for single threaded applications or 2) the OS's disk request queue is better utilized when many threads are making requests. The best performance is with virtual_cores*2
threads. Throughput slowly degrades beyond that, perhaps because of increased disk contention. If the pages happen to be cached in RAM, then it is better to have a thread pool of size virtual_cores
. If however < 50% of pages are cached (which I think is the more common case), then virtual_cores*2
will do just fine.
I think the reason why virtual_cores*2
is better than just virtual_cores
is that a file read also includes some non-disk related latency like system calls, decoding, etc. So perhaps the processor can interleave the threads more effectively: while one is waiting on the disk, a second can be executing the non-disk related file read operations. (Could it also be due to the fact that the RAM is dual channel?)
I tested reading random files vs sequentially (by looking up the files' physical block location in storage, and ordering the requests by this). Sequential access gives a pretty significant improvement with HDDs, which is to be expected. If the limiting factor in your application is file read time, as opposed to the processing of said files, I suggest you reorder the requests for sequential access to get a boost.
There is the possibility to use asynchronous disk IO, instead of a thread pool. However, from my readings it appears there is not a portable way to do it yet (see this reddit thread). Also, libuv which powers NodeJS uses a thread pool to handle its file IO.
Balancing read vs processing throughput
Ideally, we could have reading and processing in separate threads. While we are processing the first file, we can be queuing up the next one in another thread. But the more threads we allocate for reading files, the more CPU contention with the processing threads. The solution is to give the faster operation (reading vs processing) the fewest number of threads while still giving zero processing delay between files. This formula seemed to give good results in my tests:
prop = read_time/process_time
if prop > 1:
# double virtual core count gives fastest reads, as per tests above
read_threads = virtual_cores*2
process_threads = ceil(read_threads/(2*prop))
else:
process_threads = virtual_cores
# double read thread pool so CPU can interleave better, as mentioned above
read_threads = 2*ceil(process_threads*prop)
For example: Read = 2s, Process = 10s; so have 2 reading threads for every 5 processing threads
In my tests, there is only about a 1-1.5% performance penalty for having extra reading threads. In my tests, for a prop
close to zero, 1 read + 16 process threads had nearly the same throughput as 32 read + 16 process threads. Modern threads should be pretty lightweight, and the read threads should be sleeping anyways if the files aren't being consumed fast enough. (The same should be true of process threads when prop
is very large)
On the other hand, too few reading threads has a much more significant impact (my third original question). For example, for a very large prop
, 1 read + 16 process threads was 36% slower than 1 read + 15 process threads. Since the process threads are occupying all the benchmark computer's cores, the read thread has too much CPU contention and fails 36% of the time to queue up the next file to be processed. So, my recommendation is to err in favor of too many read threads. Doubling the read thread pool size as in my formula above should accomplish this.
Side note: You can limit the CPU resources your application consumes by setting virtual_cores
to be a smaller percentage of the available cores. You may also choose to forego doubling, since CPU contention may be less of an issue when there is a spare core or more that is not executing the more intensive processing threads.
Summary
Based on my test results, using a thread pool with virtual_cores*2
file reading threads + virtual_cores
file processing threads, will give you good performance for a variety of different timing scenarios. This configuration should give you within ~2% of the maximal throughput, without having to spend lots of time benchmarking.
add a comment |
I did some benchmarking to come up with some general guidelines. I tested with about ~500k smallish (~14kb) files. I think the results should be similar for medium sized files; but for larger files, I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations for why some things are faster than others.
I tested with a 16 virtual core (8 physical) computer with dual channel RAM and Linux kernel 4.18.
Do multiple threads increase read throughput?
The answer is yes. I think this could be either due to 1) a hardware bandwidth limitation for single threaded applications or 2) the OS's disk request queue is better utilized when many threads are making requests. The best performance is with virtual_cores*2
threads. Throughput slowly degrades beyond that, perhaps because of increased disk contention. If the pages happen to be cached in RAM, then it is better to have a thread pool of size virtual_cores
. If however < 50% of pages are cached (which I think is the more common case), then virtual_cores*2
will do just fine.
I think the reason why virtual_cores*2
is better than just virtual_cores
is that a file read also includes some non-disk related latency like system calls, decoding, etc. So perhaps the processor can interleave the threads more effectively: while one is waiting on the disk, a second can be executing the non-disk related file read operations. (Could it also be due to the fact that the RAM is dual channel?)
I tested reading random files vs sequentially (by looking up the files' physical block location in storage, and ordering the requests by this). Sequential access gives a pretty significant improvement with HDDs, which is to be expected. If the limiting factor in your application is file read time, as opposed to the processing of said files, I suggest you reorder the requests for sequential access to get a boost.
There is the possibility to use asynchronous disk IO, instead of a thread pool. However, from my readings it appears there is not a portable way to do it yet (see this reddit thread). Also, libuv which powers NodeJS uses a thread pool to handle its file IO.
Balancing read vs processing throughput
Ideally, we could have reading and processing in separate threads. While we are processing the first file, we can be queuing up the next one in another thread. But the more threads we allocate for reading files, the more CPU contention with the processing threads. The solution is to give the faster operation (reading vs processing) the fewest number of threads while still giving zero processing delay between files. This formula seemed to give good results in my tests:
prop = read_time/process_time
if prop > 1:
# double virtual core count gives fastest reads, as per tests above
read_threads = virtual_cores*2
process_threads = ceil(read_threads/(2*prop))
else:
process_threads = virtual_cores
# double read thread pool so CPU can interleave better, as mentioned above
read_threads = 2*ceil(process_threads*prop)
For example: Read = 2s, Process = 10s; so have 2 reading threads for every 5 processing threads
In my tests, there is only about a 1-1.5% performance penalty for having extra reading threads. In my tests, for a prop
close to zero, 1 read + 16 process threads had nearly the same throughput as 32 read + 16 process threads. Modern threads should be pretty lightweight, and the read threads should be sleeping anyways if the files aren't being consumed fast enough. (The same should be true of process threads when prop
is very large)
On the other hand, too few reading threads has a much more significant impact (my third original question). For example, for a very large prop
, 1 read + 16 process threads was 36% slower than 1 read + 15 process threads. Since the process threads are occupying all the benchmark computer's cores, the read thread has too much CPU contention and fails 36% of the time to queue up the next file to be processed. So, my recommendation is to err in favor of too many read threads. Doubling the read thread pool size as in my formula above should accomplish this.
Side note: You can limit the CPU resources your application consumes by setting virtual_cores
to be a smaller percentage of the available cores. You may also choose to forego doubling, since CPU contention may be less of an issue when there is a spare core or more that is not executing the more intensive processing threads.
Summary
Based on my test results, using a thread pool with virtual_cores*2
file reading threads + virtual_cores
file processing threads, will give you good performance for a variety of different timing scenarios. This configuration should give you within ~2% of the maximal throughput, without having to spend lots of time benchmarking.
add a comment |
I did some benchmarking to come up with some general guidelines. I tested with about ~500k smallish (~14kb) files. I think the results should be similar for medium sized files; but for larger files, I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations for why some things are faster than others.
I tested with a 16 virtual core (8 physical) computer with dual channel RAM and Linux kernel 4.18.
Do multiple threads increase read throughput?
The answer is yes. I think this could be either due to 1) a hardware bandwidth limitation for single threaded applications or 2) the OS's disk request queue is better utilized when many threads are making requests. The best performance is with virtual_cores*2
I did some benchmarking to come up with some general guidelines. I tested with about 500k smallish (~14 KB) files. I think the results should be similar for medium-sized files; but for larger files, I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations for why some things are faster than others.
I tested with a 16 virtual core (8 physical) computer with dual channel RAM and Linux kernel 4.18.
Do multiple threads increase read throughput?
The answer is yes. I think this could be due to either 1) a hardware bandwidth limitation for single-threaded applications or 2) better utilization of the OS's disk request queue when many threads are making requests. The best performance is with virtual_cores*2 threads. Throughput slowly degrades beyond that, perhaps because of increased disk contention. If the pages happen to be cached in RAM, then it is better to have a thread pool of size virtual_cores. If, however, less than 50% of pages are cached (which I think is the more common case), then virtual_cores*2 will do just fine.
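As a concrete illustration, here is a minimal Python sketch of reading many files with a virtual_cores*2 pool (the function name read_all is mine, not from the benchmark code):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_all(paths, workers=None):
    """Read many files concurrently; the pool size defaults to
    virtual_cores*2, the size that performed best in the tests above."""
    workers = workers or (os.cpu_count() or 1) * 2

    def read_one(path):
        with open(path, "rb") as f:
            return f.read()

    # map() preserves the order of `paths` in the returned results
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_one, paths))
```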
I think the reason why virtual_cores*2 is better than just virtual_cores is that a file read also includes some non-disk-related latency like system calls, decoding, etc. So perhaps the processor can interleave the threads more effectively: while one is waiting on the disk, a second can be executing the non-disk-related file read operations. (Could it also be due to the fact that the RAM is dual channel?)
I tested reading files in random order vs. sequentially (by looking up the files' physical block locations on the storage device and ordering the requests accordingly). Sequential access gives a pretty significant improvement with HDDs, which is to be expected. If the limiting factor in your application is file read time, as opposed to the processing of said files, I suggest reordering the requests for sequential access to get a boost.
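An exact physical-block lookup needs something like the Linux FIEMAP ioctl, which typically requires elevated privileges. A cheap approximation, sketched below, is to sort by inode number, which on many filesystems correlates loosely with on-disk placement (the function name is mine):

```python
import os

def sorted_for_sequential_read(paths):
    # Sort by inode number as a rough proxy for on-disk order;
    # this is a heuristic, not a true physical-block ordering.
    return sorted(paths, key=lambda p: os.stat(p).st_ino)
```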
It is possible to use asynchronous disk IO instead of a thread pool. However, from my reading it appears there is no portable way to do it yet (see this reddit thread). Also, libuv, which powers NodeJS, uses a thread pool to handle its file IO.
Balancing read vs processing throughput
Ideally, we would have reading and processing in separate threads: while we process the first file, another thread can be queuing up the next one. But the more threads we allocate to reading files, the more CPU contention with the processing threads. The solution is to give the faster operation (reading or processing) as few threads as possible while still keeping zero processing delay between files. This formula seemed to give good results in my tests:
prop = read_time/process_time
if prop > 1:
    # double virtual core count gives fastest reads, as per tests above
    read_threads = virtual_cores*2
    process_threads = ceil(read_threads/(2*prop))
else:
    process_threads = virtual_cores
    # double read thread pool so CPU can interleave better, as mentioned above
    read_threads = 2*ceil(process_threads*prop)
For example: Read = 2 s, Process = 10 s gives prop = 0.2, so have roughly 2 reading threads for every 5 processing threads (before rounding).
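The heuristic above can be wrapped in a small runnable function (the name thread_counts and the default for virtual_cores are my own choices):

```python
import os
from math import ceil

def thread_counts(read_time, process_time, virtual_cores=None):
    """Return (read_threads, process_threads) per the heuristic above."""
    virtual_cores = virtual_cores or os.cpu_count() or 1
    prop = read_time / process_time
    if prop > 1:
        # reading dominates: doubling the core count gives the fastest reads
        read_threads = virtual_cores * 2
        process_threads = ceil(read_threads / (2 * prop))
    else:
        # processing dominates: saturate the cores with processing threads
        process_threads = virtual_cores
        # doubled read pool so the CPU can interleave disk waits
        read_threads = 2 * ceil(process_threads * prop)
    return read_threads, process_threads
```

For the example above (read = 2 s, process = 10 s, 16 virtual cores), this returns (8, 16): 1 reader for every 2 processors, close to the 2:5 ratio once rounding is applied.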
In my tests, there is only about a 1-1.5% performance penalty for having extra reading threads: for a prop close to zero, 1 read + 16 process threads had nearly the same throughput as 32 read + 16 process threads. Modern threads should be pretty lightweight, and the read threads should be sleeping anyway if the files aren't being consumed fast enough. (The same should be true of process threads when prop is very large.)
On the other hand, having too few reading threads has a much more significant impact (my third original question). For example, for a very large prop, 1 read + 16 process threads was 36% slower than 1 read + 15 process threads. Since the process threads occupy all the benchmark computer's cores, the read thread has too much CPU contention and fails 36% of the time to queue up the next file to be processed. So my recommendation is to err in favor of too many read threads. Doubling the read thread pool size, as in my formula above, should accomplish this.
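Putting the two pools together, here is a hedged producer/consumer sketch (function and variable names are illustrative; error handling omitted). The bounded queue makes surplus readers sleep instead of racing ahead of the processors:

```python
import queue
import threading

def run_pipeline(paths, process, read_threads=2, process_threads=4):
    """Readers enqueue (path, data) pairs; workers consume and process them."""
    q = queue.Queue(maxsize=read_threads * 2)  # bound so readers block when ahead
    path_iter = iter(paths)
    iter_lock = threading.Lock()

    def reader():
        while True:
            with iter_lock:
                path = next(path_iter, None)
            if path is None:
                return
            with open(path, "rb") as f:
                q.put((path, f.read()))

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: no more files
                return
            process(*item)

    readers = [threading.Thread(target=reader) for _ in range(read_threads)]
    workers = [threading.Thread(target=worker) for _ in range(process_threads)]
    for t in readers + workers:
        t.start()
    for t in readers:
        t.join()
    for _ in workers:
        q.put(None)  # one sentinel per worker
    for t in workers:
        t.join()
```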
Side note: You can limit the CPU resources your application consumes by setting virtual_cores to a smaller percentage of the available cores. You may also choose to forgo the doubling, since CPU contention may be less of an issue when a spare core or more is not executing the more intensive processing threads.
Summary
Based on my test results, using a thread pool with virtual_cores*2 file-reading threads + virtual_cores file-processing threads will give you good performance for a variety of timing scenarios. This configuration should get you within ~2% of the maximal throughput without having to spend lots of time benchmarking.
edited Mar 15 at 0:49
answered Mar 15 at 0:41
Azmisov
any other tips for maximizing disk throughput? Buy faster disks and be done with the problem for all time, without worrying about bugs in your processing algorithms, spending time and money writing code, and having to maintain all that code in the future.
– Andrew Henle
Mar 7 at 10:22
It depends on a lot of factors: OS, available CPUs, available memory, disk performance, and so on. When the files are small (< 1 MB) and there are just a few hundred of them, then maybe reading them all into memory and then processing them can be faster than alternating reading and processing. But I believe you have to test and profile things on your own.
– user743414
Mar 7 at 13:54
@AndrewHenle That's always good to keep in mind. Though, if the software is intended to run on a variety of different hardware/OS configurations, like a framework, you still would want to employ some software-based techniques as well.
– Azmisov
Mar 15 at 0:44
@user743414 Despite the numerous possible configurations, I suspect the internals of reading data from disk are implemented pretty similarly across the board for various motherboards, CPUs, RAM, etc. I was hoping someone with more expertise on the internals could describe the general principles, without my having to benchmark across many rigs.
– Azmisov
Mar 15 at 0:48