I managed to get the desired result by using two AVPlayers that play simultaneously. One AVPlayer gets input with the averaged audio data on the left channel and silence on the right; the other AVPlayer gets the reverse. The effect is then applied to only one of the AVPlayer instances.
Since applying the desired effect to a single AVPlayer instance was trivial, the biggest obstacle was preparing the channel-split stereo input.
I found a couple of related questions (Panning a mono signal using MultiChannelMixer and MTAudioProcessingTap, Playing AVPlayer single-channel stereo → mono sound) and a tutorial (Processing AVPlayer's audio using MTAudioProcessingTap, which was referenced by almost every other tutorial I could google), all of which indicated that the solution probably lies in MTAudioProcessingTap.
Unfortunately, the official documentation for MTAudioProcessingTap (or any other part of MediaToolbox, for that matter) is more or less nonexistent: all I could find was some sample code online and the header comments (MTAudioProcessingTap.h) viewed through Xcode. But with the tutorial above I managed to get it working.
To make things not too easy for myself, I decided to use Swift rather than Objective-C, which the existing tutorials were written in. Converting the calls is not that bad, and I even found an almost ready-made example of creating an MTAudioProcessingTap in Swift 2. I managed to hook up the processing taps and lightly manipulate the audio with them (well, I could pass the stream through as-is and, at the very least, zero it out completely). Averaging the channels, however, was a job for the Accelerate framework, namely vDSP.
However, using pointer-heavy C APIs (case in point: vDSP) extensively from Swift gets cumbersome pretty quickly, at least compared to how it is done in Objective-C. This was also an issue when I originally wrote the MTAudioProcessingTap in Swift: I could not pass the AudioTapContext around without problems (in Obj-C, getting the context is as simple as AudioTapContext *context = (AudioTapContext *)MTAudioProcessingTapGetStorage(tap);), and all the UnsafeMutablePointers made me feel that Swift is not the right tool for the job.
So, for the processing class, I dropped Swift and rewrote it in Objective-C.
And, as mentioned earlier, I use two AVPlayers; so in AudioPlayerController.swift I have:
```swift
var left = AudioTap.create(TapType.L)
var right = AudioTap.create(TapType.R)

asset = AVAsset(URL: audioList[index].assetURL!) // audioList is [MPMediaItem]. asset is a class property

let leftItem = AVPlayerItem(asset: asset)
let rightItem = AVPlayerItem(asset: asset)

var leftTap: Unmanaged<MTAudioProcessingTapRef>?
var rightTap: Unmanaged<MTAudioProcessingTapRef>?

MTAudioProcessingTapCreate(kCFAllocatorDefault, &left, kMTAudioProcessingTapCreationFlag_PreEffects, &leftTap)
MTAudioProcessingTapCreate(kCFAllocatorDefault, &right, kMTAudioProcessingTapCreationFlag_PreEffects, &rightTap)

let leftParams = AVMutableAudioMixInputParameters(track: asset.tracks[0])
let rightParams = AVMutableAudioMixInputParameters(track: asset.tracks[0])

leftParams.audioTapProcessor = leftTap?.takeUnretainedValue()
rightParams.audioTapProcessor = rightTap?.takeUnretainedValue()

let leftAudioMix = AVMutableAudioMix()
let rightAudioMix = AVMutableAudioMix()
leftAudioMix.inputParameters = [leftParams]
rightAudioMix.inputParameters = [rightParams]

leftItem.audioMix = leftAudioMix
rightItem.audioMix = rightAudioMix

// leftPlayer & rightPlayer are class properties
leftPlayer = AVPlayer(playerItem: leftItem)
rightPlayer = AVPlayer(playerItem: rightItem)

leftPlayer.play()
rightPlayer.play()
```
I use "TapType" to highlight channels and is defined (in Objective-C) as simple as:
```objc
typedef NS_ENUM(NSUInteger, TapType) {
    TapTypeL = 0,
    TapTypeR = 1
};
```
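The AudioTap interface itself isn't shown in this post. For orientation, here is a minimal sketch of what a header exposed to Swift (through the project's bridging header) could look like; the class and method names are inferred from the Swift call site above, everything else is an assumption:

```objc
// AudioTap.h -- hypothetical sketch, not the actual class from the project
#import <Foundation/Foundation.h>
#import <MediaToolbox/MediaToolbox.h>

// the TapType enum from above lives here so Swift can see it
typedef NS_ENUM(NSUInteger, TapType) {
    TapTypeL = 0,
    TapTypeR = 1
};

@interface AudioTap : NSObject

// Returns a fully populated MTAudioProcessingTapCallbacks struct for the given output channel.
+ (MTAudioProcessingTapCallbacks)create:(TapType)type;

@end
```

Adding `#import "AudioTap.h"` to the bridging header is what would make `AudioTap.create(TapType.L)` callable from AudioPlayerController.swift.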
The MTAudioProcessingTap callbacks are created in much the same way as in the tutorial. When creating the tap, however, I store the TapType in the context, so I can check it later in the process callback:
```objc
static void tap_InitLeftCallback(MTAudioProcessingTapRef tap, void *clientInfo, void **tapStorageOut) {
    AudioTapContext *context = calloc(1, sizeof(AudioTapContext));
    context->channel = TapTypeL;
    *tapStorageOut = context;
}
```
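The post doesn't show the rest of the tap plumbing: the context struct itself, the right-channel init callback, a finalize callback to free the calloc'd context, and the create: method that wires everything into an MTAudioProcessingTapCallbacks struct. A minimal sketch of those missing pieces, under the same naming assumptions as above (the actual AudioTap.m may well differ):

```objc
// AudioTap.m -- hypothetical sketch of the pieces not shown in the post
#import "AudioTap.h"

typedef struct AudioTapContext {
    TapType channel;   // which channel this tap instance keeps
} AudioTapContext;

// callbacks shown elsewhere in this post
static void tap_InitLeftCallback(MTAudioProcessingTapRef tap, void *clientInfo, void **tapStorageOut);
static void tap_ProcessCallback(MTAudioProcessingTapRef tap, CMItemCount numberFrames,
                                MTAudioProcessingTapFlags flags, AudioBufferList *bufferListInOut,
                                CMItemCount *numberFramesOut, MTAudioProcessingTapFlags *flagsOut);

// same as the left version, but stores TapTypeR
static void tap_InitRightCallback(MTAudioProcessingTapRef tap, void *clientInfo, void **tapStorageOut) {
    AudioTapContext *context = calloc(1, sizeof(AudioTapContext));
    context->channel = TapTypeR;
    *tapStorageOut = context;
}

// free the context allocated in the init callback
static void tap_FinalizeCallback(MTAudioProcessingTapRef tap) {
    AudioTapContext *context = (AudioTapContext *)MTAudioProcessingTapGetStorage(tap);
    free(context);
}

@implementation AudioTap

+ (MTAudioProcessingTapCallbacks)create:(TapType)type {
    MTAudioProcessingTapCallbacks callbacks;
    callbacks.version = kMTAudioProcessingTapCallbacksVersion_0;
    callbacks.clientInfo = NULL;
    callbacks.init = (type == TapTypeL) ? tap_InitLeftCallback : tap_InitRightCallback;
    callbacks.prepare = NULL;
    callbacks.process = tap_ProcessCallback;
    callbacks.unprepare = NULL;
    callbacks.finalize = tap_FinalizeCallback;
    return callbacks;
}

@end
```

An alternative would be a single init callback that reads the channel from clientInfo; selecting one of two init callbacks simply keeps the callbacks stateless, matching the left-channel version shown above.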
And finally, the actual heavy lifting happens in the process callback, using vDSP functions:
```objc
static void tap_ProcessCallback(MTAudioProcessingTapRef tap, CMItemCount numberFrames, MTAudioProcessingTapFlags flags, AudioBufferList *bufferListInOut, CMItemCount *numberFramesOut, MTAudioProcessingTapFlags *flagsOut) {
    // the output channel is saved in context->channel
    AudioTapContext *context = (AudioTapContext *)MTAudioProcessingTapGetStorage(tap);

    // this fetches the audio for processing (and for output)
    OSStatus status;
    status = MTAudioProcessingTapGetSourceAudio(tap, numberFrames, bufferListInOut, flagsOut, NULL, numberFramesOut);

    // NB: we assume both channels arrive in a single buffer (mBuffers has length 1),
    // with the data alternating between L and R in `size`-float blocks.
    // If the audio were non-interleaved, L would be in mBuffers[0] and R in mBuffers[1].
    uint size = bufferListInOut->mBuffers[0].mDataByteSize / sizeof(float);
    float *left = bufferListInOut->mBuffers[0].mData;
    float *right = left + size;

    // this is where we equalize the stereo:
    // basically L = (L + R) / 2 and R = (L + R) / 2, which is the same as (L + R) * 0.5
    // "vasm" = add two vectors (L & R), then multiply by a scalar (0.5)
    //
    // We only average the channel this tap keeps (averaging both in place would make the
    // second average read already-modified data) and zero out (multiply by 0) the other,
    // because each player is responsible for exactly one output channel.
    float div = 0.5;
    float zero = 0;
    if (context->channel == TapTypeL) {
        vDSP_vasm(left, 1, right, 1, &div, left, 1, size);   // L = (L + R) * 0.5
        vDSP_vsmul(right, 1, &zero, right, 1, size);         // silence R
    } else {
        vDSP_vasm(left, 1, right, 1, &div, right, 1, size);  // R = (L + R) * 0.5
        vDSP_vsmul(left, 1, &zero, left, 1, size);           // silence L
    }
}
```
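If the vDSP calls look opaque, their semantics are easy to check in isolation. A tiny standalone sanity check (my own sketch, not part of the player code), compiled against Accelerate:

```c
// Standalone demo of the two vDSP calls used above.
// Compile with: clang demo.c -framework Accelerate -o demo
#include <stdio.h>
#include <Accelerate/Accelerate.h>

int main(void) {
    float left[4]  = {0.2f, 0.4f, 0.6f, 0.8f};
    float right[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float half = 0.5f, zero = 0.0f;

    // left = (left + right) * 0.5  -> the per-sample average of both channels
    vDSP_vasm(left, 1, right, 1, &half, left, 1, 4);

    // right = right * 0  -> silence
    vDSP_vsmul(right, 1, &zero, right, 1, 4);

    for (int i = 0; i < 4; i++) {
        printf("L=%.2f R=%.2f\n", left[i], right[i]);  // L: 0.60 0.70 0.80 0.90, R: 0.00
    }
    return 0;
}
```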