Testing for Numerical Stability, and Can AI Help Fix It?
Can LLMs fix numerical stability problems in your code? Let’s design a little experiment to find out. And how much would you like a way to validate numerical stability across CPU generations and architectures?
I think I’m bored with trying to vibe code an HPC scheduler. I do have another, probably more useful and potentially more feasible use case for your little AI friend though. Numerical instability.
I don’t think it’s possible to specify the test cases in sufficient detail (even if you vibe code the tests themselves) once a system becomes as complex as the HPC compute and data scheduling system I was attempting. Without those test cases it’s impossible to have the validation step in the AI generation loop, and without that validation step you’ll never have output that works. Plus, I just keep running out of tokens!
Skill issue? Maybe. Mythos required? We’ll see. I don’t think I’m on Anthropic’s special list though. Oh yea, all I need to do is change the model name in the API call anyway 🤣
I haven’t totally given up; I think I will come back to this when I have a little more time and perhaps money to blow on the experiment. Till then, I have something that might actually work.
If your little LLM pet can scan your code for security vulnerabilities, I wonder if it can also scan your code for numerical stability issues.
I’ve created a test case using QuantLib to price a digital barrier equity option with a Monte Carlo engine. It’s stable across multiple runs on the same machine (i.e. MC stable), but it does give me small yet significant differences when run across different CPU architectures.
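The actual QuantLib test case isn’t reproduced here, but the shape of the problem can be sketched with a stdlib-only toy. To be clear about assumptions: every parameter value below is invented, and the real test uses QuantLib’s pricing machinery rather than this hand-rolled loop.

```python
import math
import random

def price_digital_barrier(spot=100.0, strike=100.0, barrier=120.0,
                          rate=0.03, vol=0.20, maturity=1.0,
                          payout=10.0, steps=100, paths=5_000, seed=42):
    """Up-and-in cash-or-nothing call priced by plain Monte Carlo.

    A fixed seed makes repeated runs on the SAME machine bit-identical
    ("MC stable"), yet the floating-point work below can still drift by
    a few ulps across CPU architectures: different libm implementations,
    FMA contraction, SIMD reduction order. That cross-architecture
    drift is exactly what we want a harness to detect.
    """
    rng = random.Random(seed)
    dt = maturity / steps
    drift = (rate - 0.5 * vol * vol) * dt
    shock = vol * math.sqrt(dt)
    payoff_sum = 0.0
    for _ in range(paths):
        s = spot
        knocked_in = False
        for _ in range(steps):
            # Geometric Brownian motion step under the risk-neutral measure.
            s *= math.exp(drift + shock * rng.gauss(0.0, 1.0))
            if s >= barrier:
                knocked_in = True
        if knocked_in and s > strike:
            payoff_sum += payout
    return math.exp(-rate * maturity) * payoff_sum / paths

# Same seed, same machine: the two prices are bit-identical.
assert price_digital_barrier() == price_digital_barrier()
```

The fixed seed is what makes the single-machine comparison meaningful: any difference between two machines is then down to the floating-point arithmetic, not the random draws.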
The next step is to create that all-important automated verification. I have a couple of choices here and I’d like a little input from you. I could just have a couple of machines that I let Claude run wild on with SSH access and compare the numbers coming out the end. Cheap to do. Not very useful once it’s done. Or I could take the engine behind FLOPx that runs all our cloud benchmarks and change it ever so slightly so that it runs a numerical stability test instead. That would give me a reusable way to plug in any code, run it across multiple cloud VMs, and then validate that the answers from each are the same.
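Whichever route wins, the validation core itself is small. Here is a sketch of what it might look like; the machine names, the result values, and the `max_ulps` knob are all invented for illustration, and the real harness would live inside the FLOPx engine.

```python
import struct

def ordered(x: float) -> int:
    """Map a double to an integer such that consecutive representable
    floats map to consecutive integers (the classic bit-twiddling trick,
    which also makes +0.0 and -0.0 compare equal)."""
    i = struct.unpack('<q', struct.pack('<d', x))[0]
    return i if i >= 0 else (-2**63) - i

def ulp_diff(a: float, b: float) -> int:
    """How many representable doubles lie between a and b."""
    return abs(ordered(a) - ordered(b))

def validate(results: dict, max_ulps: int = 0) -> dict:
    """Compare every VM's answer against the first one. max_ulps=0
    demands bit-identical output; a small positive value tolerates
    last-digit wobble while still flagging real divergence."""
    ref = next(iter(results.values()))
    return {name: ulp_diff(ref, v) for name, v in results.items()
            if ulp_diff(ref, v) > max_ulps}

# Hypothetical prices reported by the same binary on three VM types:
runs = {"x86-icelake":   5.2341796875,
        "x86-genoa":     5.2341796875,
        "arm-graviton3": 5.2341796875000009}
bad = validate(runs)  # flags any VM whose answer drifts from the reference
```

Reporting differences in ulps rather than as an absolute tolerance makes the check scale-free, which matters when you want one harness to validate arbitrary plugged-in code.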
In the same way that we can now confidently and consistently benchmark cloud VMs at scale, this would allow us to validate numerical stability across multiple CPU generations and architectures. For the cost of running a few spot VMs.
In either case, I can then create a harness that attempts to find (and fix) the numerical stability problem – perhaps I can create a skill that allows it to use our Intel PIN tool extensions to help – using the aforementioned validation in a Karpathy-style auto research loop, and let it get on with it while I sleep. Might still need a mountain of tokens though.
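The loop itself is the easy part; the hard work hides inside the validator and the patch proposer. A skeleton of the idea, with plain callables standing in for the real pieces: `validate` would wrap the cross-VM stability check, and `propose_fix` would be an LLM call (Claude, via whatever model name is in the API call that week). Both names are placeholders, not real APIs.

```python
def auto_fix_loop(source, validate, propose_fix, max_iters=20):
    """Karpathy-style auto research loop (a sketch, not a product):
    propose a change, re-run the validation, keep iterating until the
    answers agree across architectures or the budget runs out.

    validate(source) -> (ok, report): stand-in for the cross-VM
        numerical-stability harness.
    propose_fix(source, report) -> new_source: stand-in for an LLM
        call that rewrites the code given the divergence report.
    """
    for _ in range(max_iters):
        ok, report = validate(source)
        if ok:
            return source   # stable across architectures: done
        source = propose_fix(source, report)
    return None             # token mountain exhausted

# Toy demo: "source" is just a number that counts as stable at 0.
fixed = auto_fix_loop(3,
                      validate=lambda s: (s == 0, f"off by {s}"),
                      propose_fix=lambda s, report: s - 1)
assert fixed == 0
```

The `max_iters` cap is the overnight token budget in disguise: without it, an LLM that keeps proposing non-fixes would burn tokens until morning.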
My question to you is: do you care about having a numerical stability verification system? Or shall I just write a bash script comparing two numbers and let Claude run rampant on two or three sandboxes?
