• SSUPII
      3 days ago

      Yeah, because this is something heavily influenced by the training data we naturally provide.

      Remember that AI is not magic, and will generate “average” code at best. Higher-than-average code might be possible, but it would require insane filtering of the training data, drastically shrinking the dataset and making the model less generally capable.

      • @[email protected]
        2 days ago

        This is only true for the basic pre-training of the base model. The later-stage fine-tuning (I used to call it RLHF, but I think many different techniques exist now) is what teaches the model the basic level of expectation. Despite having 4chan in their training set, you will never see modern LLMs spontaneously generate edgy racist shit, not because they can’t, but because they learnt that this is not the expected output.

        Similarly with code: base models would produce average code by default, but fine-tuning makes them understand that only the highest standard should be generated. I can guarantee you that the code LLMs produce is of much higher quality (at a superficial level) than the average code on GitHub: documentation on all functions, error codes and exceptions handled correctly, special cases handled whenever they are identified…

  • ☆ Yσɠƚԋσʂ ☆
    3 days ago

    Using LLMs does not obviate the need for the human user to understand what the code is doing. I’ve also found that, as with any tool, it takes time to actually learn to use LLMs effectively.

    In particular, I find it’s really important to understand the problem being solved, and then come up with the solution yourself. One approach I’ve found to be effective is to stub out the functions myself, and have the agent fill in the blanks for me. This helps focus the LLM and prevent it from going off into the weeds.
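
    As a minimal sketch of what I mean (function names and signatures here are hypothetical examples, not from any real project), the skeleton I hand to the agent might look like this, with only the bodies left for it to fill in:

    ```python
    # Hypothetical stub-first skeleton: I write the types, signatures,
    # and docstrings; the agent only replaces the NotImplementedError bodies.
    from dataclasses import dataclass


    @dataclass
    class Order:
        item_id: str
        quantity: int
        unit_price: float


    def validate_order(order: Order) -> bool:
        """Return True if the order has a positive quantity and price."""
        raise NotImplementedError  # left for the agent


    def order_total(order: Order) -> float:
        """Total cost of a validated order (quantity * unit_price)."""
        raise NotImplementedError  # left for the agent
    ```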

    Another trick I find handy is to ask the agent to first write a plan for the solution. I can then review the plan and have the agent adjust it as needed before implementing. Agents are also pretty good at writing tests, and tests are much easier to evaluate for correctness, because good tests are just independent functions that do one thing and don’t have a deep call stack. My current approach is to get the LLM to write the plan, add tests, and then focus on making sure I understand the tests and that they pass. At that point I have a fairly high degree of confidence that the code is indeed doing what’s needed. The tests act as a contract for the agent to fulfill and as a specification for the intended functionality.
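
    For example, building on the hypothetical stubs above, the tests I’d review as the contract might look like this (pytest-style, names still made up):

    ```python
    # Hypothetical contract tests for the stubs above: each one is an
    # independent function that checks a single behaviour.
    import pytest


    def test_validate_order_rejects_zero_quantity():
        assert validate_order(Order("sku-1", 0, 9.99)) is False


    def test_order_total_multiplies_quantity_by_price():
        assert order_total(Order("sku-1", 3, 2.50)) == pytest.approx(7.50)
    ```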

    I suspect that programming languages might start shifting in the direction of contracts in general. I can see stuff like this becoming the norm, where you specify the signature for the function, and you could also specify parameters like computational complexity and memory usage. The agent could then try to figure out how to fulfill the contract you’ve defined. It would be akin to a genetic-algorithm approach, where the agent could converge on a solution over time. If that’s the direction things are moving in, then current skills could be akin to being able to write assembly by hand: useful in some niche situations, but not necessary the vast majority of the time.
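
    Nothing like this exists yet as far as I know, but a purely hypothetical sketch of such a contract (the @contract decorator below is made up for illustration) could look like:

    ```python
    # Made-up @contract decorator: attaches declared performance bounds
    # to a function signature for an agent to satisfy. Purely illustrative.
    from typing import Callable, TypeVar

    F = TypeVar("F", bound=Callable)


    def contract(time: str, memory: str) -> Callable[[F], F]:
        """Record declared complexity bounds as metadata on the function."""
        def wrap(fn: F) -> F:
            fn.__contract__ = {"time": time, "memory": memory}
            return fn
        return wrap


    @contract(time="O(n log n)", memory="O(n)")
    def sort_events(events: list[tuple[int, str]]) -> list[tuple[int, str]]:
        """Sort events by timestamp; the agent must meet the declared bounds."""
        raise NotImplementedError  # the agent's job is to satisfy the contract
    ```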

    Finally, it’s very helpful to structure things using small components that can be tested independently and composed together to build bigger things. As long as a component functions in the intended way, I don’t necessarily care about the quality of the code internally. I can treat components as black boxes as long as they’re doing what’s expected at the surface level. This is already the approach we take with libraries: we don’t audit every line of code in a library we include in a project, we just look at the surface-level API it provides.

    Incidentally, I’m noticing that a functional style seems to work really well here. Having an assembly line of pure functions naturally breaks a problem up into small building blocks that you can reason about in isolation. It’s kind of like putting Lego blocks together. The advantage over stuff like microservices is that you don’t have to deal with the complexity of orchestration and communication between services.
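
    A toy sketch of that assembly line (all names illustrative): each step is a pure function you can test on its own, and the pipeline just threads a value through them.

    ```python
    # Pure-function pipeline: small, independently testable steps
    # composed left to right.
    from functools import reduce


    def normalize(text: str) -> str:
        """Lowercase and strip surrounding whitespace."""
        return text.strip().lower()


    def tokenize(text: str) -> list[str]:
        """Split normalized text into words."""
        return text.split()


    def drop_stopwords(words: list[str]) -> list[str]:
        """Remove a few common words (toy stopword list)."""
        stopwords = {"the", "a", "an"}
        return [w for w in words if w not in stopwords]


    def pipeline(value, *steps):
        """Thread a value through a sequence of functions, left to right."""
        return reduce(lambda acc, step: step(acc), steps, value)


    print(pipeline("  The quick brown fox  ", normalize, tokenize, drop_stopwords))
    # -> ['quick', 'brown', 'fox']
    ```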