Transcription

Clipthesis can transcribe the spoken audio in your videos locally on your Mac and make the words searchable across your whole library or inside a single clip. No audio ever leaves your machine — transcription runs on your own hardware using the Whisper model.

This is especially useful for interview, vlog, or b-roll footage where you want to find the moment someone said a specific phrase without having to scrub through the timeline.

Getting started

1. Open transcription settings

Go to Settings → Library, which holds the Transcription settings.

2. Pick a model

The first time you transcribe a clip, Clipthesis downloads a Whisper model file (once — it's cached afterwards). Larger models are more accurate but slower:

Model    | Size    | Best for
tiny.en  | ~75 MB  | Very fast, English-only, rough drafts
base     | ~150 MB | Fast, decent quality
small    | ~470 MB | Good balance (default)
medium   | ~1.5 GB | High quality, slower
large-v3 | ~3.1 GB | Best quality, noticeably slower

Click Download next to a model to fetch it. You can switch models at any time; existing transcripts aren't re-transcribed automatically (use the ↻ button on a clip's transcript panel to regenerate).
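
To make the size trade-off in the table concrete, here is a small Python sketch that picks the largest model fitting a given disk budget. The sizes are the approximate figures listed above; the function name and the idea of choosing by disk budget are assumptions for illustration, not something Clipthesis does for you.

```python
# Approximate download sizes in MB, copied from the model table.
MODEL_SIZES_MB = {
    "tiny.en": 75,
    "base": 150,
    "small": 470,
    "medium": 1500,
    "large-v3": 3100,
}


def largest_model_within(budget_mb):
    """Return the name of the largest model that fits in budget_mb, or None."""
    fitting = [(size, name) for name, size in MODEL_SIZES_MB.items() if size <= budget_mb]
    # max() compares the (size, name) tuples by size first.
    return max(fitting)[1] if fitting else None
```

With a budget of 500 MB this returns small, which is also the default model.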

3. Language

Pick Auto-detect to let Whisper figure out the language per clip, or lock to a specific language (e.g. English) if you know all your footage is in the same language. Locking can produce slightly better results on accented or noisy speech.

4. Auto-transcribe on import

Toggle Auto-transcribe new videos to have Clipthesis queue a transcription job for every video added during an import. Leave it off if you only want to transcribe specific clips on demand.

5. Transcribe existing clips

Open any video in the detail modal. The Transcript section in the right sidebar shows one of these states:

  • Not yet transcribed — click Transcribe now to start.
  • Transcribing… 42% — a job is in progress. You can cancel from here.
  • No audio — the clip has no audio track; nothing to transcribe.
  • Failed — Retry — something went wrong (often a missing model file). Click Retry.
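
The four states above can be thought of as a small state model. The Python sketch below is only an illustration of that idea: the enum, the function names, and the internal representation are hypothetical, and only the labels and button names come from the list above.

```python
from enum import Enum


class TranscriptState(Enum):
    # Hypothetical internal names for the four states listed above.
    NOT_TRANSCRIBED = "not_transcribed"
    TRANSCRIBING = "transcribing"
    NO_AUDIO = "no_audio"
    FAILED = "failed"


def sidebar_label(state, progress=0.0):
    """Return the text shown in the Transcript section for a state."""
    if state is TranscriptState.TRANSCRIBING:
        # progress is a fraction between 0.0 and 1.0.
        return f"Transcribing… {round(progress * 100)}%"
    labels = {
        TranscriptState.NOT_TRANSCRIBED: "Not yet transcribed",
        TranscriptState.NO_AUDIO: "No audio",
        TranscriptState.FAILED: "Failed",
    }
    return labels[state]


def available_action(state):
    """Return the button offered in a state, or None when there is nothing to do."""
    actions = {
        TranscriptState.NOT_TRANSCRIBED: "Transcribe now",
        TranscriptState.TRANSCRIBING: "Cancel",
        TranscriptState.FAILED: "Retry",
    }
    return actions.get(state)
```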

You can also bulk-transcribe everything in your library from Settings → Library → Transcribe all.

Searching transcripts globally

The main search bar at the top of the Library page runs a single unified search across tags, filenames, and transcripts at once.

  1. Click the search bar (or press ⌘F).
  2. Type your phrase.
  3. Press Enter.

The grid updates to show every clip that matches — including clips whose transcript contains the phrase. Click a result to open the clip, then use the in-clip transcript panel to jump to the exact moment.

Transcript matching uses full-text indexing (SQLite FTS5), so it's fast even across thousands of clips. It's case-insensitive and tolerates diacritics, so a search for cafe should also find café. See Unified Search for the full behaviour and how it combines with other filters.
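
For readers curious how FTS5 gives case-insensitive, diacritic-tolerant phrase matching, here is a minimal Python sketch using the standard sqlite3 module. The table layout, column names, and sample rows are assumptions for illustration and are not Clipthesis's actual schema. It also assumes your Python's SQLite build includes FTS5, which most do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per transcript segment.
# unicode61 is FTS5's default tokenizer; it folds case and, by default,
# strips diacritics from Latin-script characters.
# The clip column is UNINDEXED so searches only look at the spoken text.
conn.execute(
    "CREATE VIRTUAL TABLE segments USING fts5(clip UNINDEXED, text, tokenize = 'unicode61')"
)
conn.executemany(
    "INSERT INTO segments (clip, text) VALUES (?, ?)",
    [
        ("interview.mov", "We met at the Café on Fifth Street"),
        ("vlog.mp4", "Today I am testing the new camera"),
    ],
)


def search(phrase):
    """Return the names of clips whose transcript contains the phrase."""
    # Wrap the input in double quotes so FTS5 treats it as a single phrase;
    # embedded double quotes are escaped by doubling them.
    query = '"' + phrase.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT DISTINCT clip FROM segments WHERE segments MATCH ? ORDER BY clip",
        (query,),
    )
    return [clip for (clip,) in rows]
```

In this sketch, search("cafe on fifth") finds interview.mov even though the transcript says "Café", and search("CAMERA") finds vlog.mp4.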

Filtering to clips that have speech

In the filter sidebar, toggle Has speech to hide clips without spoken content (including clips with no audio track at all). Combine this with tag, drive, or date filters as usual.

Searching inside a single clip

Once you've opened a clip in the detail modal, the Transcript panel in the right sidebar lists every spoken segment with a timestamp. For long-form clips like interviews, scrolling through hundreds of segments to find a specific quote is slow. Use the in-panel filter:

  1. Click the Filter transcript… input at the top of the transcript panel.
  2. Start typing. The list narrows to segments containing your phrase (case-insensitive), with each match highlighted.
  3. A counter above the list shows K of N matches.
  4. Click any filtered segment to jump playback to that moment.
  5. Clear the filter with the × button or by pressing Escape while the input is focused.

While a filter is active, playback auto-scroll is paused so the active segment doesn't jump around under your cursor.
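
The narrowing behaviour described in the steps above amounts to a case-insensitive substring match over the segment list. A minimal Python sketch follows, assuming each segment is a (start_seconds, text) pair; that shape and the function name are hypothetical.

```python
def filter_segments(segments, phrase):
    """Return the segments whose text contains phrase, ignoring case."""
    needle = phrase.casefold()
    if not needle:
        # An empty filter shows every segment.
        return list(segments)
    return [(start, text) for start, text in segments if needle in text.casefold()]


segments = [
    (0.0, "Thanks for joining us today."),
    (4.2, "Let's talk about the budget."),
    (9.8, "The BUDGET was approved in March."),
]
matches = filter_segments(segments, "budget")
```

Here matches holds the second and third segments, and the start time of whichever one you click is where playback would jump to.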

Exporting a transcript

From the transcript panel toolbar:

  • .srt — standard subtitle file with timecodes, usable in any NLE
  • .txt — plain-text transcript with timestamps
  • Copy — copy the full transcript to the clipboard as prose (no timestamps)
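
For reference, an .srt file is a sequence of numbered cues, each with a start and end timecode in HH:MM:SS,mmm form, separated by blank lines. The Python sketch below shows how timestamped segments map onto that format. The (start_seconds, end_seconds, text) segment shape and the function names are assumptions for illustration, not Clipthesis's exporter.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timecode, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: its number, the time range, then the spoken text.
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```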

Privacy & offline

Transcription is 100% local. The audio is piped to a bundled Whisper model inside the app and never leaves your machine. Models are downloaded directly from Hugging Face the first time you need them; after that everything runs offline.

Troubleshooting

"The model isn't downloaded yet" — go to Settings → Library and click Download next to the model you've selected.

A transcription job is stuck "pending" — quit and relaunch Clipthesis. On startup, any jobs that were running when the app closed are reset to pending and picked up by the queue again.
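
The startup recovery described above can be pictured as a single update over a job table. The Python sketch below assumes a hypothetical jobs table with a status column; the real storage layout is not documented here.

```python
import sqlite3


def reset_interrupted_jobs(conn):
    """Set jobs left in 'running' back to 'pending'; return how many were reset."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE jobs SET status = 'pending' WHERE status = 'running'"
        )
    return cursor.rowcount


# Hypothetical job table, with one job that was mid-run when the app closed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (clip TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("a.mov", "running"), ("b.mov", "pending"), ("c.mov", "done")],
)
reset_count = reset_interrupted_jobs(conn)
```

After the call, a.mov is pending again and the queue can pick it up; finished jobs are left untouched.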

The transcript text looks wrong or garbled — try a larger model (small → medium). If your clip has heavy background music or multiple overlapping voices, Whisper will struggle. Clipthesis doesn't currently do speaker diarization.

Searching for a phrase I know is there returns nothing — make sure the clip has finished transcribing (the transcript panel shows the segments once it's done) and that the selected model and language setting match the language spoken in the clip. For example, tiny.en is English-only.

Released under the MIT License.