Extra note: I appreciate that model collapse in practice doesn’t really occur, because of curated datasets and the effort that goes into ensuring good training data. But indulge me.
This post is rotting, and will soon become AI slop. What follows is a love letter to an internet that will never exist again; it is a self-fulfilling obituary. Ashes to ashes, dust to dust, slop… to slop.
The dead internet theory states that, sooner rather than later, genuine activity on the internet will be in the minority and the majority of traffic will come from bots. Content will be produced by bots, and it will be engaged with by other bots. Every day it becomes less of a theory, and the ‘dead’ content it predicts is generally characterised as slop.
In the beginning was the Word, and then the next Word and the next Word, and then after enough words, someone claimed that a large-language model could think, and the Word was God1.
1 John 1:1, creatively embellished, some would say blasphemously.
Content traditionally produced by marketers, copywriters, and journalists is slowly being replaced with generative content from LLMs. But what happens when the next generation of LLMs is trained on this pseudo-data scraped from the internet?
The equivalent of the dead internet theory for LLMs, model collapse describes a scenario where datasets become so poisoned with generated text that it becomes impossible to ever train a new LLM effectively2.
2 This is why AI companies enter into multi-million-pound contracts with news organisations that hold a wealth of verifiably human, proofread, and well-written content.
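If you want the cartoon version of model collapse, here is a toy sketch (a Gaussian standing in for the LLM, numbers chosen purely for illustration): each generation is fitted to data sampled entirely from the previous generation’s fit, and with no fresh human data the fitted spread tends to wither away.

import numpy as np

# Toy model collapse: fit a Gaussian, sample a new dataset from the fit, refit, repeat.
# With no fresh human data, estimation error compounds and the spread tends to collapse.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=20)   # the original 'human-written' dataset

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()          # 'train' this generation's model
    data = rng.normal(mu, sigma, size=20)        # next dataset is purely synthetic
    if generation % 20 == 0:
        print(f'generation {generation:3d}: fitted sigma = {sigma:.4f}')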
Like low-background steel smelted before the Trinity test3, pre-large-language-model content will become sought after. Humanity will see a return to the handwritten word; literature originating before ChatGPT will be considered sacred, evidence of genuine human achievement.
And here, we, go…4
4
In London there is a Raspberry Pi running a cronjob. It has access to the source code of this very post and is also loaded with a small, local BERT-based model5.
5 Specifically, google-bert/bert-base-uncased
Twice a day, a random sentence from this post will be selected, and a random word will be omitted. The small, local language model will then be prompted to infer the missing word. This word will be replaced, my site re-rendered, the changes committed to the Git repository, and reflected here6. Additionally, I will prompt the model to add a word a day to the bottom of this post.
6 A stunning way to increase my GitHub contributions.
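The heart of the whole thing is a single fill-mask call. A minimal sketch of that step, using the model named above (the example sentence is just this post’s opening line):

from transformers import pipeline

# Mask one word and let BERT guess it back; preds is a list of candidate fills.
unmasker = pipeline('fill-mask', model='google-bert/bert-base-uncased')
preds = unmasker('This post is [MASK], and will soon become AI slop.')
print(preds[0]['token_str'])   # the highest-scoring guess for the missing word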
View Script & LLM Inference Code
#!/bin/bash
# Activate the Python environment that holds the transformers install
source mini_llama_env/bin/activate
# Rot the post: mask a word in the source and let BERT fill it back in
python script.py
# Re-render the site and publish the decayed post
cd ../website
quarto render
git add -A
git commit -m "post continues to rot"
git push
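The schedule itself is just a crontab entry along these lines; the times and paths below are illustrative, only the twice-a-day cadence is real.

# illustrative crontab entry: run the rot script twice a day
0 6,18 * * * /home/pi/rot/run.sh >> /home/pi/rot/rot.log 2>&1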
from transformers import pipeline
import numpy as np
import time


def process_file(filename):
    # Read file content
    with open(filename, 'r') as file:
        data = file.readlines()

    # Find lines with 'changeable_text'
    changeable_lines = []
    for i in range(len(data)):
        if '''FLAG''' in data[i]:
            changeable_lines.append(i + 1)
            break

    random_line = np.random.choice(changeable_lines)
    text = data[random_line]

    # Process footnote if present
    if '^[' in text:
        footnote_index = text.index('^[')
        end_footnote = text.index('\\x00]')
        pre_footnote, post_footnote = text.split('^[')[0], text.split("\\x00]")[1]
        footnote = text[footnote_index:end_footnote + 5]
        footnote_word_index = text[:footnote_index].count(' ')
        print('FOOTNOTE WORD INDEX:', footnote_word_index)
        data[random_line] = pre_footnote + post_footnote

    # Replace random word with [MASK]
    line = data[random_line].split()
    random_word_index = np.random.randint(0, len(line))
    print('REMOVED WORD', line[random_word_index])
    rem_word_store = line[random_word_index]
    line[random_word_index] = '[MASK]'
    line = ' '.join(line)

    # Use BERT to fill [MASK]
    unmasker = pipeline('fill-mask', model='bert-base-uncased')
    res = unmasker(line)
    res.append({'token_str': rem_word_store})
    word_index = 0
    while res[word_index]['token_str'] in ['.', ',', '!', '?']:
        word_index += 1
    replacement_word = res[word_index]['token_str']
    print('REPLACEMENT WORD:', replacement_word)
    line = line.replace('[MASK]', replacement_word)

    # Reinsert footnote if it existed
    if 'footnote' in locals():
        line_words = line.split()
        line_words.insert(footnote_word_index, footnote)
        line = ' '.join(line_words) + '\n'

    line = line.replace(':::', '\n:::\n')
    data[random_line] = line + '\n'

    # Process 'extra' lines
    extra_lines = [i + 1 for i, line in enumerate(data) if 'FLAG' in line]
    if extra_lines:
        generated = data[extra_lines[0]].strip()
        hypothesised = generated + ' [MASK]' + '. END OF STATEMENT.'
        res = unmasker(hypothesised)
        print('ADDED WORD:', res[0]['token_str'])
        data[extra_lines[0]] = f"{generated} {res[0]['token_str']}\n"

    # Update timestamp
    utc_str = time.asctime(time.gmtime())
    print('UTC:', utc_str)
    data[-1] = f'''Updated: {utc_str}. Replaced {rem_word_store} with {replacement_word}. Added {res[0]['token_str']} to the end of the generated text.'''

    # Write updated content back to file
    with open(filename, 'w') as file:
        file.writelines(data)


process_file('../website/posts/dead/index.qmd')
Token to token, slop to slop, all good things must come to an end.
I .
Updated: Fri Aug 23 10:55:41 2024. Replaced Content with content. Added . to the end of the generated text.