How does AI vocal removal actually work?

AI vocal removers use neural networks trained on thousands of pairs of full songs and their isolated stems. The model learns the spectral signature of human vocals versus drums, bass, and other instruments, then applies that knowledge to separate them in a new track. Demucs v4, the model we run, uses a hybrid transformer architecture that works in both the time domain and the frequency domain, and it posted state-of-the-art scores on the MusDB18 benchmark when it was released.

Is the quality good enough to use professionally?

For most modern productions, yes. Demucs v4 produces stems clean enough to use as remix material, karaoke backings, or sample sources. You will notice some bleed on tracks with heavy backing vocals or unusual instrumentation, but on standard pop, hip-hop, and electronic productions the result is genuinely impressive.

Why is this free when other services charge?

Demucs is open-source. We self-host it on our own server, so the only costs are electricity and disk space, and a few ads on the page cover the infrastructure. There is no watermark and no monthly cap, only a daily per-IP limit.

Can I use the separated vocals or instrumental commercially?

The technical separation is fine. The legal use depends on the rights to the original song. Separating someone else's track and releasing it commercially generally requires their permission or a sample-clearance license. Educational, personal-use, and karaoke applications are normally fine. When in doubt, get clearance.

Will this work on a single instrument other than vocals?

This page does two-stem separation: vocals versus instrumental. Demucs can also produce four stems (drums, bass, vocals, other), which we may add later if there is demand.

What audio formats can I upload?

mp3, wav, m4a, and flac. Up to 150 MB per file. Output is always 192 kbps mp3 to keep download sizes small.

AI vocal remover

Remove vocals from any song.

Upload an audio file. We separate the vocals from the instrumental using self-hosted Demucs v4, the open-source model Facebook AI Research released in late 2022. The whole track gets processed, not a clip. Free. No watermark. No signup.

How AI vocal removal actually works.

Until 2020, removing vocals from a finished mix was nearly impossible. Studios used phase-cancellation tricks that worked only on stereo recordings where the vocal sat dead-centre, and even then the result was muddy. The breakthrough came when researchers stopped trying to subtract the vocal mathematically and started training neural networks on thousands of pairs: a full mix and its isolated stems.
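
The phase-cancellation trick described above is simple enough to sketch. The snippet below is a minimal illustration in plain Python, with made-up sample values and a hypothetical function name. It subtracts the right channel from the left, which cancels anything panned dead-centre.

```python
def cancel_centre(left, right):
    """Return the 'side' signal (L - R) / 2; centre-panned content cancels."""
    return [(l - r) / 2 for l, r in zip(left, right)]


# Made-up samples: vocal and bass are centred (identical in both channels),
# the guitar is panned hard left.
vocal = [0.5, -0.25, 0.75]
bass = [0.25, 0.25, -0.5]
guitar = [0.5, 0.0, -0.5]

left = [v + b + g for v, b, g in zip(vocal, bass, guitar)]
right = [v + b for v, b in zip(vocal, bass)]

side = cancel_centre(left, right)  # only the off-centre guitar survives
```

The vocal disappears, but so does the centred bass, and the output collapses to a single channel. That is one reason this trick rarely gave a clean instrumental.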

Demucs v4, the model running on this page, takes that further. It works in both the time domain and the frequency domain simultaneously. The time-domain branch listens to the raw waveform like a person would. The frequency-domain branch looks at the spectrogram, the visual representation of which frequencies are active when. The two views feed into a transformer that decides, for each tiny slice of audio, what proportion belongs to vocals versus everything else.
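
As a toy illustration of the "what proportion belongs to vocals" idea, the sketch below applies a soft ratio mask to one time slice of a spectrogram. This is the classic masking approach from older spectrogram-based separators, not Demucs's actual internals. The magnitude values and function names are invented for the example.

```python
def vocal_share(vocal_mag, other_mag, eps=1e-9):
    """Fraction of each frequency bin attributed to vocals, between 0 and 1."""
    return [v / (v + o + eps) for v, o in zip(vocal_mag, other_mag)]


def apply_mask(mixture_mag, mask):
    """Scale each bin of the mixture by its vocal share."""
    return [m * w for m, w in zip(mixture_mag, mask)]


# Hypothetical magnitude estimates for four frequency bins in one time slice.
vocal_est = [0.0, 3.0, 1.0, 0.0]
other_est = [2.0, 1.0, 1.0, 4.0]
mixture = [2.0, 4.0, 2.0, 4.0]

mask = vocal_share(vocal_est, other_est)
vocals = apply_mask(mixture, mask)
```

Bins where the estimate says "mostly voice" pass through nearly untouched, and bins dominated by instruments are attenuated towards zero.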

The model was trained on the MusDB18 dataset (150 full songs with isolated stems) plus 800 additional songs. It does not know your song specifically. It knows what vocals tend to look like in a spectrogram, what drums look like, what a bass line looks like, and it applies that pattern to whatever you give it.

What people actually use this for.

Karaoke tracks

The most popular use. Strip the vocal off any song, sing over the instrumental yourself. Quality is genuinely good enough that the result feels professional.

Remix material

Producers extract vocals from a track to remix or mash up over their own beat. Note: releasing a remix commercially still requires sample clearance from the rights holder.

Sample hunting

Hip-hop and electronic producers isolate instrumental hooks from older recordings to flip into new productions. Separation can surface details that were buried in the full mix.

Music education

Vocal coaches isolate a singer's performance to study technique, phrasing, breath control. Producers isolate drum stems to study programming and groove.

Film + content creation

Content creators strip vocals from music they have licensed so the lyric does not compete with dialogue or narration. Removing the vocal does not remove the copyright: you still need a license for the recording and the underlying composition.

A&R screening

Labels and curators isolate vocal performances when judging a demo, to evaluate the singer separately from the production polish.

Why we self-host (and you get this for free).

Most vocal-removal services charge $10–$30 per month for a few separations. We ran the numbers: Demucs is open-source, the only real cost is the CPU time on our server, and the page is small enough to support itself with a couple of ads. So we run it on our own machine and keep it free, with no watermark and no subscription.

The trade-off is processing time. We don't have a GPU, so a 3-minute song takes about 2 to 3 minutes of real-world CPU time. The page tells you exactly when your job is queued, processing, and done. You can leave it open and come back, or close it and come back via the same URL. One running job at a time site-wide to keep the server happy. If there's a queue ahead of you, you'll see it.

You get 15 separations per IP per day. The 150 MB file size cap covers full albums and extended DJ sets in compressed formats such as mp3. If you have a real production workflow that needs more than that, get in touch.
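
A per-IP daily limit like this can be enforced with a small counter. The sketch below is one possible way to do it, not a description of the code running on this site. The class name, the in-memory storage, and the example IP address are all assumptions; only the limit of 15 comes from the page.

```python
from collections import defaultdict
from datetime import date

DAILY_LIMIT = 15  # from the page; everything else here is hypothetical


class DailyQuota:
    def __init__(self, limit=DAILY_LIMIT):
        self.limit = limit
        self._counts = defaultdict(int)  # (ip, day) -> jobs used that day

    def try_consume(self, ip, today=None):
        """Record one job for this IP; return False once the limit is hit."""
        key = (ip, today or date.today())
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


quota = DailyQuota()
day_one = date(2024, 1, 1)
results = [quota.try_consume("203.0.113.7", day_one) for _ in range(16)]
```

Keying the counter on the calendar day means the allowance resets at midnight without any cleanup job, at the cost of old keys piling up in memory until the process restarts.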

Common questions

How does the AI know what's a vocal and what's an instrument?

It was trained on thousands of pairs of full songs and their isolated stems. The model learns the spectral fingerprint of the human voice and recognises it in new tracks. It doesn't know your specific song, just the general patterns of voice versus instruments.

Is the quality good enough for professional use?

For most modern productions, yes. Demucs v4 ranks at or near the top of public separation benchmarks such as MusDB18. You'll notice some bleed on tracks with heavy backing vocals or unusual instrumentation, but on standard pop, hip-hop, and electronic productions the result is clean enough for karaoke, remix material, and most sampling work.

How long does processing take?

Roughly real-time on our server. A 3-minute song takes 2 to 3 minutes. A 6-minute extended mix takes 5 to 6 minutes. The page polls every few seconds and updates the progress bar so you can see exactly where your job is.
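
A client-side polling loop of the kind described above might look like the sketch below. The `fetch_status` callable, the state names as strings, and the 3-second interval are assumptions for illustration; the page's real endpoint and payload are not documented here.

```python
import time


def wait_for_job(fetch_status, interval_s=3.0, sleep=time.sleep, max_polls=1000):
    """Poll until the job reports 'done'; return the distinct states seen."""
    seen = []
    for _ in range(max_polls):
        state = fetch_status()  # e.g. "queued", "processing", "done"
        if not seen or seen[-1] != state:
            seen.append(state)
        if state == "done":
            return seen
        sleep(interval_s)
    raise TimeoutError("job did not finish within max_polls")


# Simulated server responses, with sleeping disabled so the example is instant.
fake_states = iter(["queued", "queued", "processing", "processing", "done"])
history = wait_for_job(lambda: next(fake_states), sleep=lambda seconds: None)
```

Passing `fetch_status` and `sleep` in as arguments keeps the loop testable without a network connection or real delays.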

How long can the audio file be?

There's no explicit length cap. The 150 MB file size is the natural limit, which is around 100 minutes of 192 kbps mp3 or about 14 minutes of 16-bit, 44.1 kHz stereo wav. Demucs processes audio in 7.8-second chunks internally, so memory stays bounded regardless of total length. Long files just take proportionally longer to process, roughly real-time on our CPU.
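
The relationship between file size and duration is plain arithmetic, sketched below. The assumptions are labelled in the code: 1 MB is taken as 10^6 bytes, the mp3 is constant-bitrate, the wav is 44.1 kHz stereo, and headers and overlap between chunks are ignored. The function names are hypothetical.

```python
import math

CAP_BYTES = 150 * 1_000_000  # 150 MB cap, taking 1 MB = 10**6 bytes (assumption)
CHUNK_SECONDS = 7.8          # chunk length quoted on this page


def minutes_of_mp3(cap_bytes, kbps):
    """Minutes of constant-bitrate mp3 that fit in cap_bytes."""
    bytes_per_second = kbps * 1000 / 8
    return cap_bytes / bytes_per_second / 60


def minutes_of_wav(cap_bytes, sample_rate=44_100, bit_depth=16, channels=2):
    """Minutes of uncompressed PCM audio that fit in cap_bytes (header ignored)."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return cap_bytes / bytes_per_second / 60


def chunk_count(duration_seconds, chunk_seconds=CHUNK_SECONDS):
    """Number of chunks needed to cover a track, ignoring any overlap."""
    return math.ceil(duration_seconds / chunk_seconds)
```

Under these assumptions the cap works out to roughly 104 minutes of 192 kbps mp3 or about 14 minutes of CD-quality stereo wav, and a 3-minute song spans 24 chunks.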

Will my audio be stored or shared?

Input audio is deleted as soon as Demucs finishes. Output stems sit on the server for one hour so you can download them, then auto-delete. We don't fingerprint, log content, or analyse what you send us.

Can I use the separated stems commercially?

The technical separation is fine. The legal use depends on the rights to the original song. Separating someone else's track and releasing the result requires clearance from the rights holder, the same as any sample. Educational, personal-use, and karaoke applications are normally fine.

Why two stems and not four?

Demucs supports 4-stem separation (drums, bass, vocals, other) but it takes longer and uses more memory. For v1 we ship 2-stem (vocals + instrumental) which covers the most common use case. If there's demand we'll add the 4-stem option.
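
For readers who want to reproduce a two-stem split locally, the Demucs command-line tool has a `--two-stems` option. The sketch below builds such a command from Python. The wrapper function names, the output directory, and the choice of flags are assumptions for illustration, not this site's actual configuration.

```python
import subprocess


def build_demucs_command(input_path, out_dir="separated", bitrate_kbps=192):
    """Build a Demucs CLI call for vocals-versus-rest separation on CPU."""
    return [
        "demucs",
        "-n", "htdemucs",         # Demucs v4 (Hybrid Transformer) model
        "--two-stems", "vocals",  # write vocals and no_vocals only
        "-d", "cpu",              # run without a GPU
        "--mp3", "--mp3-bitrate", str(bitrate_kbps),
        "-o", out_dir,
        input_path,
    ]


def separate(input_path):
    """Run the separation; requires the `demucs` package installed and on PATH."""
    subprocess.run(build_demucs_command(input_path), check=True)


command = build_demucs_command("song.mp3")
```

Dropping the `--two-stems` pair from the list would make the same command write all four stems instead.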
