Description
[MacBook Pro M4 Max (128 GB), macOS Sequoia 15.7.2, Python 3.10.15, mlx_whisper 0.4.3]
Hi, I'm using the mlx_whisper Python package, for which I've written a wrapper. When I transcribe audio recorded from a Discord conversation with the Craig bot, I get spurious text coming through in the transcript. I've made some attempts to reduce the noise in the recording, but mlx_whisper still transcribes the word "you" over and over even though the amplitude of the audio is continuously 0.0 (below -60 dB).
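For context, by "silence" I mean the measured level of the track, not just how it sounds: the RMS over those stretches sits well below -60 dBFS. A minimal sketch of the kind of check I mean (the helper and file names here are placeholders, not code from my wrapper):

import numpy as np
import soundfile as sf  # assumption: any loader that returns float samples would do

def rms_dbfs(path, start_s, dur_s):
    """RMS level of a slice of the file, in dB relative to full scale."""
    audio, sr = sf.read(path, dtype="float32")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # down-mix to mono
    chunk = audio[int(start_s * sr):int((start_s + dur_s) * sr)]
    rms = float(np.sqrt(np.mean(np.square(chunk)))) + 1e-12  # avoid log(0)
    return 20.0 * np.log10(rms)

# e.g. the window behind the first spurious "you" segment
print(rms_dbfs("craig_track.flac", 0.0, 30.0))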
Below (at the end of the post) is a sample of the output I'm getting. Note that at the 13-minute mark the audio sounds more like "ummm ummm" than "Thank you", and the speech doesn't really start until the 20-minute mark.
In my wrapper, I'm passing the following parameters to mlx_whisper.transcribe:
--language en --no_speech_threshold=0.85 --hallucination-silence-threshold 1.5 --logprob_threshold -0.8 --compression_ratio_threshold 2.0 --condition_on_previous_text False --output-format md --verbose False
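For clarity, those flags amount to roughly the following call into the library (a sketch: the audio path and model repo are placeholders, and language is forwarded through **decode_options rather than being a named parameter of transcribe()):

import mlx_whisper

result = mlx_whisper.transcribe(
    "craig_track.flac",                                     # placeholder path
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",   # placeholder model
    verbose=False,
    language="en",                                          # goes into **decode_options
    no_speech_threshold=0.85,
    logprob_threshold=-0.8,
    compression_ratio_threshold=2.0,
    condition_on_previous_text=False,
    hallucination_silence_threshold=1.5,
)
print(result["text"])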
I've had very good results transcribing several YouTube videos with the above parameters, but it's not working nearly as well with Craig bot recordings from Discord. If anyone can suggest parameter settings that would help me get a better result, I would greatly appreciate it.
(My wrapper script re-formats vtt output from mlx_whisper into the markdown I've posted at the end, hence the unusual --output-format parameter.)
My wrapper script uses the following default values (which are overridden by the parameters given on the command line above):
"mlx_whisper.transcribe": {
"audio": None, # path or waveform, filled by caller
"path_or_hf_repo": self.path_or_hf_repo,
"verbose": None, # None | bool
"temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
"compression_ratio_threshold": 2.4,
"logprob_threshold": -1.0,
"no_speech_threshold": 0.6,
"condition_on_previous_text": True,
"initial_prompt": None,
"word_timestamps": False,
"prepend_punctuations": "\"'“¿([{-",
"append_punctuations": "\"'.。,,!!??::”)]}、",
"clip_timestamps": "0", # or List[float]
"hallucination_silence_threshold": None,
# Extra decoding options passed via **decode_options:
# (these mirror DecodingOptions below)
"decode_options": {
"task": "transcribe",
"language": None,
"temperature": 0.0,
"sample_len": None,
"best_of": None,
"beam_size": None,
"patience": None,
"length_penalty": None,
"prompt": None,
"prefix": None,
"suppress_tokens": "-1",
"suppress_blank": True,
"without_timestamps": False,
"max_initial_timestamp": 1.0,
"fp16": True,
},
},
# DecodingOptions dataclass (decoding.DecodingOptions)
"DecodingOptions": {
"task": "transcribe",
"language": None, # Optional[str]
"temperature": 0.0,
"sample_len": None,
"best_of": None,
"beam_size": None,
"patience": None,
"length_penalty": None,
"prompt": None, # Optional[str | List[int]]
"prefix": None, # Optional[str | List[int]]
"suppress_tokens": "-1", # Optional[str | Iterable[int]]
"suppress_blank": True,
"without_timestamps": False,
"max_initial_timestamp": 1.0,
"fp16": True,
},
# decoding.decode() helper
"decoding.decode": {
"model": None, # Whisper instance, set by caller
"mel": None, # mx.array
"options": "DecodingOptions()", # symbolic reference
# Any of the DecodingOptions fields can also be passed via **kwargs
},
# Core Whisper model constructor (whisper.Whisper)
"Whisper.__init__": {
"dims": "ModelDimensions(...)", # typically created by load_model
"dtype": "mx.float16",
},
# Language detection (decoding.detect_language)
"decoding.detect_language": {
"model": None, # Whisper instance
"mel": None, # mx.array
"tokenizer": None, # optional Tokenizer
},
# CLI-level defaults for the mlx_whisper command-line tool.
# These are only honored in non-Craig runs; in Craig mode,
# speechToText.py owns output-dir / output-format so the
# meeting-transcript pipeline remains stable.
"mlx_whisper.cli": {
"output_dir": "transcripts-mlx_whisper",
"output_name": None,
"output_format": "vtt",
# Writer-related options (used when word-timestamps are enabled).
"highlight_words": False,
"max_line_width": None,
"max_line_count": None,
"max_words_per_line": None,
},
}

Sample output:

[00:00.000 --> 00:00.620] you
[00:30.000 --> 00:30.620] you
[01:00.000 --> 01:00.620] you
[01:30.000 --> 01:30.620] you
[02:00.000 --> 02:00.620] you
[02:30.000 --> 02:30.620] you
[03:00.000 --> 03:00.620] you
[03:30.000 --> 03:30.620] you
[04:00.000 --> 04:00.620] you
[04:30.000 --> 04:30.620] you
[05:00.000 --> 05:00.620] you
[05:30.000 --> 05:30.620] you
[06:00.000 --> 06:00.620] you
[06:30.000 --> 06:30.620] you
[07:00.000 --> 07:00.620] you
[07:30.000 --> 07:30.620] you
[08:00.000 --> 08:00.620] you
[08:30.000 --> 08:30.620] you
[09:00.000 --> 09:00.620] you
[09:30.000 --> 09:30.620] you
[10:00.000 --> 10:00.620] you
[10:30.000 --> 10:30.620] you
[11:00.000 --> 11:00.620] you
[11:30.000 --> 11:30.620] you
[12:00.000 --> 12:00.620] you
[12:30.000 --> 12:30.620] you
[13:00.000 --> 13:17.440] Thank you.
[13:30.000 --> 13:30.620] you
[14:00.000 --> 14:00.620] you
[14:30.000 --> 14:30.620] you
[15:00.000 --> 15:00.620] you
[15:30.000 --> 15:30.620] you
[16:00.000 --> 16:00.620] you
[16:30.000 --> 16:30.620] you
[17:00.000 --> 17:00.620] you
[17:30.000 --> 17:30.620] you
[18:00.000 --> 18:00.620] you
[18:30.000 --> 18:30.620] you
[19:00.000 --> 19:00.620] you
[19:30.000 --> 19:30.620] you
[20:00.000 --> 20:26.340] This is still a good thought exercise though