Of course, a REALLY sophisticated pirate could multiply the entire audio track by his own slowly varying factor of 0.99-1.01, but even here there are tricks to stop it. You put in brief stretches where your waveform varies faster; those wiggles will stand out from the pirate's slowly varying wobble. If the pirate instead uses a rapidly varying waveform for the entire piece, he will degrade the sound quality. (It has to be the entire piece because he doesn't know where your checkpoints are, and by using an error-correcting code you can make sure he has to find nearly all of them.) Even then, the segments where you vary slowly will still be detectable because his wiggles will average out, so you can still recover your 4 bytes.
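To make the amplitude trick concrete, here is a rough Python sketch, not a real watermarking scheme. It assumes things the post doesn't specify: the detector still has the original track to compare against, the "audio" is just seeded random noise, each of the 32 bits gets its own equal-length segment, and the names `embed`, `extract`, `slow_wobble`, `DEPTH` and `CYCLES` are made up for illustration. Each bit nudges the gain up or down by at most 1% with a fast wiggle, and the detector correlates the difference against that same wiggle, so a pirate's slow 0.99-1.01 wobble mostly averages out.

```python
import math
import random

DEPTH = 0.01   # gain stays within 0.99-1.01, as in the post
CYCLES = 8     # wiggles per segment (hypothetical choice)


def bits_of(payload):
    """Unpack bytes into a list of bits, most significant bit first."""
    return [(byte >> k) & 1 for byte in payload for k in range(7, -1, -1)]


def bytes_of(bits):
    """Pack a list of bits (MSB first) back into bytes."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for b in bits[i:i + 8]:
            value = (value << 1) | b
        out.append(value)
    return bytes(out)


def wiggle(i, seg_len):
    """The fast gain pattern used inside every segment."""
    return math.sin(2 * math.pi * CYCLES * i / seg_len)


def embed(samples, payload):
    """Scale each segment by 1 +/- DEPTH * wiggle, the sign carrying one bit."""
    bits = bits_of(payload)
    seg_len = len(samples) // len(bits)
    marked = list(samples)
    for n, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for i in range(seg_len):
            j = n * seg_len + i
            marked[j] = samples[j] * (1.0 + DEPTH * sign * wiggle(i, seg_len))
    return marked


def extract(original, suspect, n_bytes):
    """Assumes access to the original: correlate the difference with the wiggle."""
    n_bits = 8 * n_bytes
    seg_len = len(original) // n_bits
    bits = []
    for n in range(n_bits):
        score = 0.0
        for i in range(seg_len):
            j = n * seg_len + i
            score += (suspect[j] - original[j]) * original[j] * wiggle(i, seg_len)
        bits.append(1 if score > 0 else 0)
    return bytes_of(bits)


def slow_wobble(samples, period):
    """The pirate's attack: a slowly varying gain between 0.99 and 1.01."""
    return [x * (1.0 + DEPTH * math.sin(2 * math.pi * i / period))
            for i, x in enumerate(samples)]


rng = random.Random(0)
track = [rng.gauss(0.0, 1.0) for _ in range(64000)]   # stand-in for audio
marked = embed(track, b"\xde\xad\xbe\xef")
pirated = slow_wobble(marked, period=20000)
recovered = extract(track, pirated, 4)
print(recovered.hex())
```

The sketch leaves out the error-correcting code and the hidden checkpoint positions, which are what would force the pirate to attack the whole piece.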
And that's just amplitude modulation. You can also do tiny amounts of frequency modulation, which is even harder to distort (though your leeway might be a bit less than 1% if you want to avoid it being detectable to the trained musical ear).
The basic principle is simple -- you are hiding 4 bytes of information amidst 400,000,000 bytes of information. Steganography works in practice with much denser hidden info than that -- you just have to code it in a holistic way.
I think I still see a hole in it. Wanna put it to the test? ;)