BREAKING NEWS
LATEST POSTS
-
Magic Carpet by artist Daniel Wurtzel
https://www.youtube.com/watch?v=1C_40B9m4tI http://www.danielwurtzel.com
-
FEATURED POSTS
-
AI and the Law – Copyright Traps for Large Language Models – This new tool can tell you whether AI has stolen your work
https://github.com/computationalprivacy/copyright-traps
Copyright traps (see Meeus et al. (ICML 2024)) are unique, synthetically generated sequences who have been included into the training dataset of CroissantLLM. This dataset allows for the evaluation of Membership Inference Attacks (MIAs) using CroissantLLM as target model, where the goal is to infer whether a certain trap sequence was either included in or excluded from the training data.
This dataset contains non-member (
label=0
) and member (label=1
) trap sequences, which have been generated using this code and by sampling text from LLaMA-2 7B while controlling for sequence length and perplexity. The dataset contains splits according toseq_len_{XX}_n_rep_{YY}
where sequences ofXX={25,50,100}
tokens are considered andYY={10, 100, 1000}
number of repetitions for member sequences. Each dataset also contains the ‘perplexity bucket’ for each trap sequence, where the original paper showed that higher perplexity sequences tend to be more vulnerable.Note that for a fixed sequence length, and across various number of repetitions, each split contains the same set of non-member sequences (
n_rep=0
). Also additional non-members generated in exactly the same way are provided here, which might be required for some MIA methodologies making additional assumptions for the attacker.