Monday, August 11, 2025

Malicious AI training data and the lack of judgement

So I just had an AI spit out a code snippet that was so egregiously bad that it boggled me for a while.

function updateBlockSpan(sectionID, blockname) {
    // Use jQuery to find the block_span in the section
    // Will dynamically add the span if it doesn't exist yet
    const $section = $('#' + sectionID);

    if ($section.length) {
        // Look for existing block_span
        let $blockSpan = $section.find('.block_span');

        // If no block_span exists, create one
        if ($blockSpan.length === 0) {
            // Create and add a debug span to the section
            $blockSpan = $('<span class="block_span debug_display">Block: </span>');
            $section.prepend($blockSpan);
        }

        // Update the content
        $blockSpan.html('Block: ' + blockname);
    } else {
        console.log('Section not found: ' + sectionID);
    }
}

Without knowing the structure of the software I was working on, it's difficult to see how shit this is.

The key points are that the code is checking to see if the block_span tag exists... and if not, it tries to create one... and then PREPENDS it to the section.  

So as my UI works by showing and hiding lots of sections, this code would get called each time a section was displayed.  So each time a section was displayed it would create and inject a tag that would persist outside of my sections and not get cleaned up... so the errors would stack up outside my UI area... and keep on stacking the longer it went on.


This is so many levels of bad ideas that it took me a while to articulate just how much I hate it. 

1) Hiding the failure of a foundational assumption (that the block_span existed in the first place) 

2) Injecting DOM elements into random places.... 

3) The stacking behaviour as a consequence of not understanding the context where the function was being used. 
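For contrast, here is a minimal sketch of the fail-fast version I would have expected as a first pass. It uses the same function signature and markup assumptions as the snippet above; the choice to throw rather than log, and the use of .text() over .html(), are just my preferences:

```javascript
// A fail-fast version: update the span if it's there, and complain
// loudly if the page structure doesn't match the assumption.
function updateBlockSpanStrict(sectionID, blockname) {
    const $section = $('#' + sectionID);
    if (!$section.length) {
        // A bad sectionID is a broken assumption, not a condition to paper over
        throw new Error('Section not found: ' + sectionID);
    }
    const $blockSpan = $section.find('.block_span');
    if (!$blockSpan.length) {
        // The span is part of the page's static structure; if it's missing,
        // something upstream is wrong and I want to hear about it
        throw new Error('No .block_span in section: ' + sectionID);
    }
    $blockSpan.text('Block: ' + blockname);
}
```

Surfacing the broken assumption at the call site is the whole point; the original version hides it and then compounds it by mutating the DOM.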




The larger context that I think is the problem is that the AI presented this as its first attempt to solve the problem.  This was not the result of some vibe coding evolving towards a bad solution... this was a simple first pass at replacing a 2-line function that was not selecting a tag correctly. 

This raised the question in my mind of how bad the training data must be for this to be generated as a first best guess.  How much rotten, horrible code must have gone into training the AI that it has zero compunction about presenting this type of bad idea? 

I think this is a result of scraping the worst of the worst of the free internet and treating it all equally.  The AIs have not been trained on the best of the best code bases written by the most experienced and seasoned programmers... but on the output of the most amateur keyboard cowboys who had the time to shit-post on the web. 

I would even go so far as to say that this is beyond a dark pattern... it's actively malevolent.  And that can only have come from enough training data to generate persistently bad patterns in generative results.  


What's the solution to this?  

Very obviously the AI was not dealing with any context.  They are still taking in the minimum amount of code for the prompt and not considering the larger context the function will be used in.  The tiny context windows... and the questionable way the AI actually uses the context... create the problem that I call "sword fighting through a keyhole".   

But on top of this context we would expect "judgement": the ability to differentiate good from bad coding practices.  I still question whether that exists, or can exist, for AIs that have been trained on a broad diet of random shit code from the internet.  
They have no means to learn judgement.  I can only assume that the code was not checked by seasoned programmers, line by line, before it was fed into the training data pool.  I think the evidence is building that any attempt to give these AIs judgement has failed woefully. 


The other thought that occurred to me is that the vast majority of "high quality" code in many languages is probably hidden behind "commercial confidence".  There is going to be a lot of good quality code floating around in repos on GitHub... passion projects by seasoned devs... but the code that has been built to generate money... and battle-tested by teams of seasoned devs... is mostly going to be protected, and will not appear in the training data sets that were scraped or stolen for first-generation AIs.  

Perhaps we will need to evolve the quality and metadata of the training data sets before we can evolve better AI code generators.