Extra note: I appreciate that model collapse in practice doesn’t really occur, because of curated datasets and the effort that goes into ensuring good training data. But indulge me.
This post is rotting, and will soon become AI slop. What follows is a love letter to an internet that will never exist again; it is a self-fulfilling obituary. Ashes to ashes, dust to dust, slop… to slop.
The dead internet theory states that, sooner rather than later, genuine activity on the internet will be in the minority and the majority of traffic will come from bots. Content will be produced by bots, and it will be engaged with by other bots. Every day it becomes less of a theory, and the ‘dead’ content it predicts is generally characterised as slop.
In the beginning was the Word, and then the next Word and the next Word, and then after enough words, someone claimed that a large-language model could think, and the Word was God1.
1 John 1:1, creatively embellished, some would say blasphemously.
Content traditionally produced by marketers, copywriters, and journalists is slowly being replaced with generative content from LLMs. But what happens when the next generation of LLMs is trained on this pseudo-data scraped from the internet?
The equivalent of the dead internet theory for LLMs, model collapse describes a scenario where datasets become so poisoned with generated text that it becomes impossible to ever train a new LLM effectively2.
2 This is why AI companies enter into multi-million-pound contracts with news organisations that hold a wealth of verifiably human, proofread, and well-written content.
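If you want the cartoon version of model collapse, here is a toy sketch (a Gaussian standing in for the LLM, numbers chosen purely for illustration): each generation is fitted to data sampled entirely from the previous generation’s fit, and with no fresh human data the fitted spread tends to wither away.

import numpy as np

# Toy model collapse: fit a Gaussian, sample a new dataset from the fit, refit, repeat.
# With no fresh human data, estimation error compounds and the spread tends to collapse.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=20)   # the original 'human-written' dataset

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()          # 'train' this generation's model
    data = rng.normal(mu, sigma, size=20)        # next dataset is purely synthetic
    if generation % 20 == 0:
        print(f'generation {generation:3d}: fitted sigma = {sigma:.4f}')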
Like low-background steel smelted before the Trinity test3, pre-large-language-model content will become sought after. Humanity will see a return to the handwritten word; literature originating before ChatGPT will be considered sacred, evidence of genuine human achievement.
And here, we, go…4
4
In London there is a Raspberry Pi running a cronjob. It has access to the source code of this very post and is also loaded with a small, local BERT-based model5.
5 Specifically, google-bert/bert-base-uncased
Twice a day, a random sentence from this post will be selected, and a random word will be omitted. The small, local language model will then be prompted to infer the missing word. This word will be replaced, my site re-rendered, the changes committed to the Git repository, and reflected here6. Additionally, I will prompt the model to add a word a day to the bottom of this post.
6 A stunning way to increase my GitHub contributions.
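The heart of the whole thing is a single fill-mask call. A minimal sketch of that step, using the model named above (the example sentence is just this post’s opening line):

from transformers import pipeline

# Mask one word and let BERT guess it back; preds is a list of candidate fills.
unmasker = pipeline('fill-mask', model='google-bert/bert-base-uncased')
preds = unmasker('This post is [MASK], and will soon become AI slop.')
print(preds[0]['token_str'])   # the highest-scoring guess for the missing word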
View Script & LLM Inference Code
#!/bin/bash
# Activate the Python environment that holds the transformers install
source mini_llama_env/bin/activate
# Rot the post: mask a word in the source and let BERT fill it back in
python script.py
# Re-render the site and publish the decayed post
cd ../website
quarto render
git add -A
git commit -m "post continues to rot"
git push
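The schedule itself is just a crontab entry along these lines; the times and paths below are illustrative, only the twice-a-day cadence is real.

# illustrative crontab entry: run the rot script twice a day
0 6,18 * * * /home/pi/rot/run.sh >> /home/pi/rot/rot.log 2>&1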
from transformers import pipeline
import numpy as np
import time


def process_file(filename):
    # Read file content
    with open(filename, 'r') as file:
        data = file.readlines()

    # Find lines with 'changeable_text'
    changeable_lines = []
    for i in range(len(data)):
        if '''FLAG''' in data[i]:
            changeable_lines.append(i + 1)
            break

    random_line = np.random.choice(changeable_lines)
    text = data[random_line]

    # Process footnote if present
    if '^[' in text:
        footnote_index = text.index('^[')
        end_footnote = text.index('\\x00]')
        pre_footnote, post_footnote = text.split('^[')[0], text.split("\\x00]")[1]
        footnote = text[footnote_index:end_footnote + 5]
        footnote_word_index = text[:footnote_index].count(' ')
        print('FOOTNOTE WORD INDEX:', footnote_word_index)
        data[random_line] = pre_footnote + post_footnote

    # Replace random word with [MASK]
    line = data[random_line].split()
    random_word_index = np.random.randint(0, len(line))
    print('REMOVED WORD', line[random_word_index])
    rem_word_store = line[random_word_index]
    line[random_word_index] = '[MASK]'
    line = ' '.join(line)

    # Use BERT to fill [MASK]
    unmasker = pipeline('fill-mask', model='bert-base-uncased')
    res = unmasker(line)
    res.append({'token_str': rem_word_store})
    word_index = 0
    while res[word_index]['token_str'] in ['.', ',', '!', '?']:
        word_index += 1
    replacement_word = res[word_index]['token_str']
    print('REPLACEMENT WORD:', replacement_word)
    line = line.replace('[MASK]', replacement_word)

    # Reinsert footnote if it existed
    if 'footnote' in locals():
        line_words = line.split()
        line_words.insert(footnote_word_index, footnote)
        line = ' '.join(line_words) + '\n'

    line = line.replace(':::', '\n:::\n')
    data[random_line] = line + '\n'

    # Process 'extra' lines
    extra_lines = [i + 1 for i, line in enumerate(data) if 'FLAG' in line]
    if extra_lines:
        generated = data[extra_lines[0]].strip()
        hypothesised = generated + ' [MASK]' + '. END OF STATEMENT.'
        res = unmasker(hypothesised)
        print('ADDED WORD:', res[0]['token_str'])
        data[extra_lines[0]] = f"{generated} {res[0]['token_str']}\n"

    # Update timestamp
    utc_str = time.asctime(time.gmtime())
    print('UTC:', utc_str)
    data[-1] = f'''Updated: {utc_str}. Replaced {rem_word_store} with {replacement_word}. Added {res[0]['token_str']} to the end of the generated text.'''

    # Write updated content back to file
    with open(filename, 'w') as file:
        file.writelines(data)


process_file('../website/posts/dead/index.qmd')
Token to token, slop to slop, all good things must come to an end.
I .
Updated: Fri Aug 23 10:55:41 2024. Replaced Content with content. Added . to the end of the generated text.