Not enough CPU when plenty of CPU

Thanks, I’ll play around with it, I just find it counterintuitive as someone coming from analog pedal boards and trying to recreate that familiarity. I did go back after reading all the comments and info I could find on the message board and see the mention of this in the manual but it definitely slows down the creative process. Thanks for the reply though!

2 Likes

No worries, my philosophy is: don’t worry about DSP usage until you have to, and in those instances utilize additional rows if needed (rows 1/2 share two cores and rows 3/4 share the other two). I’d imagine future updates will make the FX blocks more efficient. Some FX blocks have already been updated to be more efficient, and that process will continue.

Your perspective on this is very understandable, because we’ve gotten used to the marketing hype “more cores more better!” throughout tech, especially when it comes to PCs and smartphones.

But the reality is that multithreading and scheduling and all that good stuff is actually quite difficult. Despite all the magic, what’s really happening is that each core, in ANY device, is doing its own distinct task on its own schedule, not being thrown into a single aggregate pool of CPU power. When it comes to near-realtime things, like processing audio, trying to multithread a task USUALLY leads to worse performance.

We’ve all gotten so used to the insane performance in modern computing that we forget that in most cases, a single core is just a dumb unit chewing through a backlog of commands linearly. Adding more cores doesn’t make a single command faster; it just makes that backlog of commands shorter. More cores means scaling horizontally; faster cores means scaling vertically. The goal is to find the right balance for the specific use case.

There’s that old saying, “You can’t produce a baby in one month by getting nine women pregnant,” and it applies here.

1 Like

That’s handy to know. At least I’m learning something. I know there’s a learning curve with modelers, but having used Fractal units a little in the past, I was hoping for a much simpler experience with the QC. In some ways it certainly is; I like the interface a lot. But I’m definitely looking forward to it improving as users provide feedback and updates come out. Thanks for explaining the nature of multi-core processors; I didn’t know how that worked. I thought they somehow assigned themselves automatically to whichever task was most processor-intensive and switched automatically, rather than having dedicated lanes.

2 Likes

The only reason I didn’t reply to the OP’s comment initially is that, despite the fact that he has a really good point, lack of core/DSP management by the firmware/OS seems to be the current industry status quo. It is the current state of affairs with other modelers as well, but it shouldn’t be. Core management should be transparent to the user. There should be an abstraction layer, a “middle tier” if you will, that is invisible to the user, handles the low-level grunt work, and provides an optimized distribution of the blocks and other modeler functions such as Global EQ across the cores.

This has been painfully obvious for years on other modelers as well as @MP_Mod pointed out. They have the same issue with the same result. Users have to carefully allot blocks to the correct routes. I don’t know why under-the-covers core management has not been added yet “industry-wide”. Maybe it would demand considerable development resources and/or slow down future updates. Worse yet would be if it somehow added latency although that seems like it would be a surmountable challenge. I can only speculate.

In any case, when we stop requiring users to manage block distribution across multiple cores/DSPs, it will be a welcome innovation to modeling in general and one that would probably be taken for granted in subsequent devices that include it.

4 Likes

On a separate note, the ability to add a greyed-out block to your path as the OP documents, would definitely be a bug!

Most operating systems already provide some sort of abstraction for this; macOS has Grand Central Dispatch, for instance. But multithreading is generally not well suited to synchronous processes and near-realtime requirements. I think even if one were to fashion some sort of custom scheduler for this purpose, the overhead combined with blocked threads (and deadlocks) would make execution both slower and less reliable than a single-threaded application.

But, who knows- someone much smarter than me is probs inventing it right now!

I suppose multithreading could be involved but I was more so referring to multiprocessing. The user is manually guiding the multiprocessing or at least directing processor affinity to some extent when they assign blocks to path pairs. On the Helix or QC for example, depending on where they are placed in the signal path, blocks on the first two rows will be assigned to run on the first set of cores/DSP, blocks on the third and fourth rows run on the second set. Other system processes as well as features such as Global EQ, tend to get automatically assigned to the first set of cores.

Currently the user has to figure out how to distribute the blocks to prevent running out of CPU/DSP on the first pair of rows (or the second). This could be managed transparently to the user such that you just set up your path without regard to which rows you placed your blocks on, as long as you do not exceed the total aggregate amount of CPU required to split them across all available cores (or perhaps more accurately, at least as of now, sets of cores).

The idea is to let the system handle the block distribution. Essentially having the system do the same thing you are currently doing manually. With this kind of functionality, you just set up your path without regard to what row you have something on and let the OS figure out the details.

I don’t think this necessarily needs to be a more dynamic or “near-realtime” process than the current manual method, other than perhaps during the actual design of the preset. Once the assignment of blocks to cores is completed as calculated by the system rather than the user, and the preset saved, the block-to-core affinity would be locked in for that preset. Or perhaps it would be recalculated and locked in when initially switching to the preset.

It must be more challenging than it appears to pull something like this off though, or I would think it would have been done by now. Complaints about having to strategize manually splitting blocks across paths have been showing up on various forums for years.

I might be misunderstanding what you’re talking about, but it seems like you’re describing the same concept, with the same set of problems, with different terminology.

No doubt guilty of some repetition in that last post but I was trying my best to differentiate between multithreading and multiprocessing because I think multiprocessing might be more pertinent in this context when you start talking about assigning different modeler tasks and blocks to different cores. I am reminded of how we would assign/configure multiple CPUs to enterprise database servers and let the database server (using this as an analogy to the QC) determine which CPU would handle which task or set of tasks.

When the discussion turns to how many threads are spawned to support a single task or specific set of tasks, I believe the topic would lean more towards multithreading. Those threads might be spread across multiple cores or only one. Multithreading does not require multiprocessing (multiple cores/CPUs) although that is certainly an intelligent way to load balance and leverage it. And I would be the first to acknowledge that there is likely some of that going on as well. But I have zero visibility into that even more granular behavior whereas the multiprocessing is far more obvious because QC users have to manually distribute blocks across cores.

Anyway, I am far from an expert on the subject and just trying my best to describe the issue. For all I know your supposition is correct regarding multithreading. There certainly seems to be some obstacle to easily making this a non-manual process.

Maybe some developers on the forum could help shed some light on why this ability to automatically manage block distribution across cores has not made its way into high-end modelers yet.

Btw, I can’t help but feel like we are wildly overcomplicating things. There must be issues we haven’t considered. Why can’t CorOS simply calculate when one set of cores’ capacity has been reached and automatically assign the remaining blocks to the second set? It could use an internal lookup table of each block’s CPU consumption and sum as it goes, or even dynamically calculate usage during preset design. LOL, sounds so simple; the devil’s in the details.
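To make the lookup-table idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the block names, the per-block CPU costs, and the two-core-pair model are all invented for illustration and are not Neural’s actual numbers or implementation.

```python
# Hypothetical per-block CPU costs, as a percentage of one core pair's
# capacity (all numbers invented for illustration).
BLOCK_CPU_COST = {
    "drive": 8, "amp": 30, "cab": 10, "delay": 15, "reverb": 25, "eq": 5,
}

CORE_PAIR_CAPACITY = 100  # percent per core pair

def assign_blocks(chain):
    """Walk the signal chain in order, filling core pair 0 until it is
    full, then spilling the remaining blocks onto core pair 1 -- roughly
    what users currently do by hand when moving blocks to rows 3/4."""
    usage = [0, 0]       # percent used on each core pair
    assignment = []      # core pair index per block, in signal order
    pair = 0
    for block in chain:
        cost = BLOCK_CPU_COST[block]
        if pair == 0 and usage[0] + cost > CORE_PAIR_CAPACITY:
            pair = 1     # spill over while preserving signal order
        if usage[pair] + cost > CORE_PAIR_CAPACITY:
            raise ValueError(f"preset exceeds total DSP budget at {block!r}")
        usage[pair] += cost
        assignment.append(pair)
    return assignment, usage
```

For a chain like drive → amp → cab → delay → reverb → reverb → delay → eq, the first five blocks fill pair 0 and the rest spill to pair 1, keeping a single hand-off point just like the manual row split does.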

Haha, I’m genuinely trying not to sound rude, but it’s not a complex issue; it just has a complex solution. Basically, it comes down to this:

In any context, if you want to use more than one core, you need more than one thread. Multithreading an inherently synchronous and linear thing like processing near real-time FX is a world of pain without huge benefit.

Say for example you have three blocks: A is a distortion, B is a delay, C is a reverb. You can’t calculate the individual blocks out of order, because you have to know the result of the prior block’s calculations. You could very easily spread that across multiple cores: say we have three cores and assign one block to each. What would happen is this:

Core 0 starts working on A.

Core 1 can’t start working on B until Core 0 finishes working on A, and it shouldn’t do anything else, because the near-realtime requirement means it has to be ready to roll on B as soon as Core 0 is done with A.

More or less the same story with Core 2.

So instead of 1 single core powering through A, B, and C and the other cores ready to work on other tasks, you wind up with 3 cores blocked, and you still need another thread going to actually manage the scheduling.

It’s just not a task well suited for general purpose processor architectures.
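To make the dependency concrete, here’s a toy Python sketch of the A → B → C chain (the block maths are made up and wildly simplified; the point is only that each function’s input is the previous function’s output, so there is nothing for a second core to start on early):

```python
def distortion(frame):          # block A: hard clip (toy version)
    return [max(-1.0, min(1.0, s * 10)) for s in frame]

def delay(frame):               # block B: mix in a 2-sample echo (toy)
    return [s + 0.5 * (frame[i - 2] if i >= 2 else 0.0)
            for i, s in enumerate(frame)]

def reverb(frame):              # block C: crude one-pole smoothing (toy)
    out, acc = [], 0.0
    for s in frame:
        acc = 0.7 * acc + 0.3 * s
        out.append(acc)
    return out

def process_frame(frame):
    # Strictly serial: C needs B's output, which needs A's output.
    # Handing each function to its own core gains nothing -- two of the
    # three cores would just sit waiting for their input to exist.
    return reverb(delay(distortion(frame)))
```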

1 Like

Thanks for chiming in on this. Also not trying to be rude. Your logic seems unassailable regarding the necessity of serializing the blocks’ output, unless they have some method, once all the blocks in a preset are loaded, of calculating the sum total effect they will have on a tone. Not sure how you would do that when you have a dynamic and constantly changing input signal. Wow, it just occurred to me that I have no clue as to how modeling works. :upside_down_face:

We can see via the CPU monitor that the blocks on the two pairs of paths are being assigned to two different sets of cores. The prevailing assumption on modelers featuring single or multiple cores/DSPs has been that the preset must be preloaded with all blocks and processes that will be running. You can’t just load blocks up dynamically as needed without incurring a performance penalty in the form of an audible gap. This is why the CPU/DSP for every block is reserved and consumed upon preset load, whether or not a block is bypassed. It also explains why there is a gap between presets but not snapshots.

One would think that at some point there is some kind of handoff from the first set of cores to the second if the execution of all the blocks in the path have to be done serially. Are you saying this does not qualify as multiprocessing because the processing is being done sequentially/serially rather than simultaneously? Maybe it doesn’t, not my area of expertise. Just trying to wrap my head around how they are currently employing all of the available cores and the correct terminology to reference it.

Clear to me that I am way out of my depth regarding my understanding of how all of this is done. Just can’t escape the notion that if a user can split the blocks manually across two sets of cores, then the device should be able to calculate and do the same operation automatically. If it were that straightforward though, I would think they would have done it already.

As ckd said, near real-time audio processing is basically a linear process, hence it can only be done ‘one core at a time’ for a single path.

Multi-core DSP is needed/used when you have several ‘parallel’ paths:

  • 1 input on each row, straight to a single output: each row can be calculated in parallel, as they don’t depend on each other (but of course, in the end, each signal must be ‘mixed’ to a specific output, so some synchronization occurs at the end of the path).
  • When you split/merge a row: yes, rows 1 & 2 share the same DSP unit, but each DSP unit is dual-core, so each part of the split can be processed by its own core, and the synchronization occurs at the ‘merge’ point.
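The split/merge case could be sketched like this in Python (a hedged illustration using threads; the leg functions and gains are invented, and the QC’s actual scheduler is of course not public):

```python
from concurrent.futures import ThreadPoolExecutor

def wet_leg(frame):   # one leg of the split (toy gain stage)
    return [0.8 * s for s in frame]

def dry_leg(frame):   # the other leg of the split (toy gain stage)
    return [0.5 * s for s in frame]

def process_split(frame):
    # The two legs don't depend on each other, so each could run on its
    # own core; the only synchronization point is the merge (mix).
    with ThreadPoolExecutor(max_workers=2) as pool:
        wet = pool.submit(wet_leg, frame)  # can run concurrently...
        dry = pool.submit(dry_leg, frame)  # ...with this one
        a, b = wet.result(), dry.result()  # sync only at the merge point
    return [x + y for x, y in zip(a, b)]   # mix the legs back together
```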

To sum up :

  • If you only use a single row, you can’t spread the work across several DSPs/cores. The whole work has to be done sequentially (i.e., you can’t calculate a delay placed after the amp before you’ve calculated the amp signal).
  • If row 1/2 has its output set to ‘Row 3/4’, the 2 DSPs are used, but sequentially, as the second core has to wait for the first core to finish its job (btw, that’s why several latency tests showed that latency increases when you use rows 1/2 + 3/4).
  • If you choose to use the 4 rows, but each one has its own ‘input’ (and no split etc.), each row could be processed independently from the others (one row on each core of each DSP chip).
    • btw: I did not see any latency measurement of this scenario (4 independent rows, each with its own input and output); my bet is that the latency should be nearly the same as the ‘Row 1 + 2 only’ test. Any ‘latency Nerd*’ here?

So, it would not be an easy task to add an ‘abstraction layer’ to automatically choose the correct DSP/core for a block on the grid, but it would greatly simplify the user experience. The abstraction layer’s decision logic could run on each block add/delete/move and be persisted when you save the preset (or redone after each firmware upgrade if they fine-tune the abstraction layer), but I don’t think we will see this on the QC…

* ‘Nerd’ not used as a pejorative term here :dove:

3 Likes

Great writeup and follow-up to @ckd 's post! So… as per the paragraph above. When multiple cores or CPUs are used sequentially rather than simultaneously, can this still technically be called multiprocessing?

Multiprocessing isn’t really a meaningful term in this context. It usually just means a system has multiple CPUs and/or cores (which the Quad Cortex does, but so do most systems these days).

1 Like

I agree with ckd.

TLDR; :

Multiple DSPs/cores can only apply to ‘true parallel processing’, i.e., when several pieces of data don’t directly depend on each other’s results in the same time frame. Add to this that multi-DSP/core processing is not free: there has to be some ‘synchronization’ between DSPs/cores that consumes some CPU time, and it could end up less efficient than serial processing…

Long version:

‘Classical’ multitasking/multiprocessing doesn’t apply in near real-time audio, as there are specific constraints we don’t have in, say, multiple database requests or multiple I/O operations. Take downloading several files from a website: each file can be downloaded completely independently, as they don’t depend on each other. Better yet, a single file can be downloaded in multiple independent blocks; at a very high level this is the way the BitTorrent protocol works (download multiple file blocks out of order from multiple sources and ‘glue’ them together in the correct order to produce the final file, because one of the limiting/slowing factors is the upload speed of the machines that ‘serve’ the desired file).

Having multiple DSPs/cores still has its benefits, of course (multiple parallel paths, etc.), but for a ‘simple’ one-row, one-path, no-splitter preset on the QC there’s no easy way to efficiently spread the workload across multiple cores/DSPs (*).

* : In theory, some block types could be computed in parallel (and so use other DSPs/cores).
Take for example a delay: the aim of this type of block is to repeat an audio clip (let’s call it the ‘delayed frame’) that will be mixed in and heard several milliseconds later than the ‘main’ audio (let’s call it the ‘current frame’). The delay block has to ‘buffer’ the current frame in order to process it later. Several ms later, when we hear the final produced result, we hear two things: the ‘current frame’ (the ‘real-time’ note we played) plus a mix of the ‘delayed frame’. At this point, the ‘delayed frame’ is an old ‘current frame’ altered by whatever specific computation the delay block has done on it. So, in theory, the computation of a delay block could be done in parallel with other blocks, as it doesn’t depend on the ‘current frame’, but on a buffered ‘previous frame’.

Of course, this is purely theoretical; practically there are still a lot of challenges (synchronization, frame dropout, convolution, etc.), and perhaps the QC is already doing it like this?
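As a hedged illustration of that footnote (not how the QC actually implements it, and with all parameters invented), a toy delay whose wet output depends only on buffered history:

```python
from collections import deque

class DelayBlock:
    """Toy delay: the wet output depends only on PAST samples, never the
    current one, which is what makes it a candidate for parallel
    computation in theory."""

    def __init__(self, delay_samples, mix=0.5):
        # Ring buffer pre-filled with silence (the 'previous frames')
        self.buf = deque([0.0] * delay_samples, maxlen=delay_samples)
        self.mix = mix

    def wet(self):
        # Reads only buffered history -> could be computed ahead of time
        # on another core while other blocks chew on the current frame.
        return self.buf[0]

    def process(self, sample):
        wet = self.wet()
        self.buf.append(sample)   # buffer the current frame for later
        return sample + self.mix * wet
```

Feeding in an impulse shows the echo arriving `delay_samples` later at half volume, computed purely from the buffer.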

The best way to solve the original problem (not enough CPU) would be to optimize the low-level code of each block in order to reduce its CPU consumption. By the way, this is what they did circa firmware 1.1: the reverb blocks were optimized and you could add a LOT more reverbs than in the previous firmware (see: CorOS 1.1.0 is now available - Neural DSP).

We naturally tend to think that the more CPUs/cores we have, the faster everything will be, but in reality that is not always true. It depends on the kind of work/processing the CPUs have to do, and we must also think in terms of ‘synchronization’ between the CPUs/cores. This is not free; the synchronization itself has a cost. In the end, using multiple CPUs/cores + sync could achieve a worse result than simple serial processing, or show no real gain…

So, we could sum up this discussion on multi DSPs/Core with my favorite answer : ‘It depends’ :man_shrugging:

2 Likes

Another great post! Welcome to Neural’s newest product offering, ‘Quantum Cortex’ - MSRP $250,000.00 :grinning:

Welcome to Neural’s newest product offering, ‘Quantum Cortex’ - MSRP $250,000.00

does it come with an editor?

3 Likes

LOL! Don’t know but it’s coming “soon”! :wink:

2 Likes