



How to maximize throughput when processing many files


Say you want to process many files as quickly as possible, where processing time > file read time.



  • Will reading multiple files using a thread pool increase throughput, or does it just cause more disk contention?

  • If a thread pool does help, what determines how many threads are needed to achieve the maximum? Can this be calculated from the target system's specs?

  • For a single core, will a loop reading and processing asynchronously via threads be faster than doing it synchronously? I assume it would be, since disk latency is so high. But if the file read time is much smaller than the processing time, it may be better to let the processing step finish uninterrupted, without context switches.

Also, do you have any other tips for maximizing disk throughput?
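For concreteness, here is a rough sketch (Python, with placeholder names) of the two variants I'm comparing; the questions above are essentially whether B beats A, and how to pick n_threads:

from concurrent.futures import ThreadPoolExecutor

def process(data: bytes):
    ...  # placeholder: the CPU-bound step, slower than the read itself

# Variant A: plain synchronous loop
def run_sync(paths):
    for path in paths:
        with open(path, "rb") as f:
            process(f.read())

# Variant B: read files with a thread pool of size n_threads,
# processing each result as it becomes available
def run_threaded(paths, n_threads):
    def read(path):
        with open(path, "rb") as f:
            return f.read()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for data in pool.map(read, paths):
            process(data)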










  • "any other tips for maximizing disk throughput?" Buy faster disks and be done with the problem for all time, without worrying about bugs in your processing algorithms, spending time and money writing code, and having to maintain all that code in the future.

    – Andrew Henle
    Mar 7 at 10:22












  • It depends on a lot of factors: OS, available CPUs, available memory, disk performance, and so on. When the files are small (< 1 MB) and there are just a few hundred of them, then reading them all into memory and then processing them may be faster than alternating between reading and processing. But I believe you have to test and profile things on your own.

    – user743414
    Mar 7 at 13:54












  • @AndrewHenle That's always good to keep in mind. Though, if the software is intended to run on a variety of different hardware/OS configurations, like a framework, you would still want to employ some software-based techniques as well.

    – Azmisov
    Mar 15 at 0:44











  • @user743414 Despite the numerous possible configurations, I suspect the internals of reading data from disk are implemented pretty similarly across the board for various motherboards, CPUs, RAM, etc. I was hoping someone with more expertise on the internals could describe the general principles, without having to benchmark across many rigs.

    – Azmisov
    Mar 15 at 0:48















Tags: multithreading · optimization · operating-system · filesystems · disk






asked Mar 7 at 3:23









Azmisov

1 Answer
I did some benchmarking to come up with some general guidelines. I tested with ~500k smallish (~14 KB) files. I think the results should be similar for medium-sized files, but for larger files I suspect disk contention becomes more significant. It would be appreciated if someone with deeper knowledge of OS/hardware internals could supplement this answer with more concrete explanations of why some things are faster than others.



I tested on a machine with 16 virtual cores (8 physical), dual-channel RAM, and Linux kernel 4.18.



Do multiple threads increase read throughput?



The answer is yes. I think this is either because 1) there is a hardware bandwidth limitation for single-threaded applications, or 2) the OS's disk request queue is better utilized when many threads are making requests. The best performance was with virtual_cores*2 threads. Throughput slowly degrades beyond that, perhaps because of increased disk contention. If the pages happen to be cached in RAM, it is better to have a thread pool of size virtual_cores. If, however, < 50% of pages are cached (which I think is the more common case), then virtual_cores*2 does just fine.



I think the reason virtual_cores*2 is better than just virtual_cores is that a file read also includes some non-disk-related latency: system calls, decoding, etc. So perhaps the processor can interleave the threads more effectively: while one is waiting on the disk, a second can be executing the non-disk parts of a file read. (Could it also be because the RAM is dual channel?)
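To give an idea of what was measured, here is a minimal sketch of a read-throughput benchmark along these lines (not my exact benchmark code; the file list and thread counts are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    with open(path, "rb") as f:
        return len(f.read())

def measure_read_throughput(paths, n_threads):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed  # bytes per second

# e.g. sweep thread counts to find the knee of the curve:
# for n in (1, 2, 4, 8, 16, 32, 64):
#     print(n, measure_read_throughput(paths, n))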



I tested reading files in random order vs. sequentially (by looking up each file's physical block location on disk and ordering the requests by it). Sequential access gives a pretty significant improvement with HDDs, which is to be expected. If the limiting factor in your application is file read time rather than processing, I suggest reordering the requests for sequential access to get a boost.
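As a rough illustration (not what I actually benchmarked with; I used the real block locations, which is more accurate but OS-specific), sorting by inode number is a cheap proxy that often correlates with on-disk order:

import os

def order_for_sequential_access(paths):
    # Rough proxy: inode order often correlates with on-disk layout,
    # so reads issued in this order tend to be closer to sequential.
    return sorted(paths, key=lambda p: os.stat(p).st_ino)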



[Figure: read throughput vs. thread count]



It is possible to use asynchronous disk IO instead of a thread pool. However, from my reading it appears there is no portable way to do it yet (see this reddit thread). Also, libuv, which powers Node.js, uses a thread pool to handle its file IO.
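To illustrate the point with Python: asyncio itself has no non-blocking file read either, so the usual pattern just offloads reads to a thread pool (a minimal sketch; the paths are placeholders):

import asyncio
from pathlib import Path

async def read_file(path: Path) -> bytes:
    # There is no native async file read in asyncio, so the blocking read
    # is delegated to the event loop's default thread pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, path.read_bytes)

async def read_all(paths):
    return await asyncio.gather(*(read_file(p) for p in paths))

# asyncio.run(read_all([Path("a.bin"), Path("b.bin")]))  # placeholder paths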



Balancing read vs processing throughput



Ideally, reading and processing would happen in separate threads: while we are processing the first file, another thread can be queuing up the next one. But the more threads we allocate for reading files, the more CPU contention there is with the processing threads. The solution is to give the faster operation (reading vs. processing) the fewest threads while still keeping the delay between files at zero. This formula seemed to give good results in my tests:



from math import ceil

prop = read_time / process_time  # read_time, process_time, virtual_cores measured beforehand
if prop > 1:
    # double virtual core count gives fastest reads, as per tests above
    read_threads = virtual_cores * 2
    process_threads = ceil(read_threads / (2 * prop))
else:
    process_threads = virtual_cores
    # double read thread pool so CPU can interleave better, as mentioned above
    read_threads = 2 * ceil(process_threads * prop)


For example: read = 2 s, process = 10 s (prop = 0.2), so have roughly 2 reading threads for every 5 processing threads.



In my tests, there is only about a 1-1.5% performance penalty for having extra reading threads: for a prop close to zero, 1 read + 16 process threads had nearly the same throughput as 32 read + 16 process threads. Modern threads should be pretty lightweight, and the read threads should be sleeping anyway if the files aren't being consumed fast enough. (The same should be true of process threads when prop is very large.)



On the other hand, having too few reading threads has a much more significant impact (my third original question). For example, for a very large prop, 1 read + 16 process threads was 36% slower than 1 read + 15 process threads. Since the process threads occupy all of the benchmark machine's cores, the read thread suffers so much CPU contention that it fails 36% of the time to queue up the next file to be processed. So my recommendation is to err in favor of too many read threads. Doubling the read thread pool size, as in my formula above, should accomplish this.



Side note: you can limit the CPU resources your application consumes by setting virtual_cores to a smaller fraction of the available cores. You may also choose to forgo the doubling, since CPU contention may be less of an issue when there are one or more spare cores not executing the more intensive processing threads.



Summary



Based on my test results, using a thread pool with virtual_cores*2 file-reading threads and virtual_cores file-processing threads will give you good performance across a variety of timing scenarios. This configuration should get you within ~2% of the maximal throughput without having to spend lots of time benchmarking.
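To make the recommendation concrete, here is a minimal sketch of the two-pool pipeline I mean, assuming Python's concurrent.futures with a bounded queue between the pools (process_file and the thread counts are placeholders, and error handling is omitted):

import os
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

VIRTUAL_CORES = os.cpu_count() or 1
READ_THREADS = VIRTUAL_CORES * 2   # reader pool size, per the recommendation above
PROCESS_THREADS = VIRTUAL_CORES    # processing pool size
_SENTINEL = None                   # signals "no more files" to a worker

def process_file(data: bytes) -> int:
    return len(data)               # placeholder for the CPU-heavy step

def run_pipeline(paths):
    # Bounded queue keeps the readers from buffering the whole data set in RAM.
    queue: Queue = Queue(maxsize=READ_THREADS * 2)

    def read_one(path):
        with open(path, "rb") as f:
            queue.put(f.read())

    def process_many():
        results = []
        while (data := queue.get()) is not _SENTINEL:
            results.append(process_file(data))
        return results

    # NOTE: error handling omitted for brevity; a failure inside process_file
    # would need to be surfaced so the readers don't block on a full queue.
    with ThreadPoolExecutor(READ_THREADS) as readers, \
         ThreadPoolExecutor(PROCESS_THREADS) as workers:
        worker_futures = [workers.submit(process_many) for _ in range(PROCESS_THREADS)]
        for f in [readers.submit(read_one, p) for p in paths]:
            f.result()                     # wait for (and surface) read errors
        for _ in range(PROCESS_THREADS):
            queue.put(_SENTINEL)           # one sentinel per worker so they all exit
        return [r for f in worker_futures for r in f.result()]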






        answered Mar 15 at 0:41









Azmisov

            Add ONERROR event to image from jsp tldHow to add an image to a JPanel?Saving image from PHP URLHTML img scalingCheck if an image is loaded (no errors) with jQueryHow to force an <img> to take up width, even if the image is not loadedHow do I populate hidden form field with a value set in Spring ControllerStyling Raw elements Generated from JSP tagds with Jquery MobileLimit resizing of images with explicitly set width and height attributeserror TLD use in a jsp fileJsp tld files cannot be resolved