C1Tech/whisper_base_persian

C1Tech/whisper_base_persian is a Persian automatic speech recognition (ASR) model based on the Whisper architecture and fine-tuned on a large-scale custom Persian dataset.

With only 74 million parameters, this model achieves state-of-the-art performance on Persian ASR tasks, outperforming larger models such as OpenAI Whisper Large-v3 (1550M parameters) and Meta Wav2Vec2-XLSR (300M parameters).

Benchmark Performance

We evaluated the model on multiple Persian ASR benchmarks, including Common Voice and FLEURS. The results show that our model outperforms popular models such as Vosk, FastConformer, and OpenAI's Whisper on these benchmarks:

[Figures: Model Image 1, Model Image 2]

The benchmark results highlight the model's efficiency and accuracy, showing that high-quality Persian ASR is achievable even with a compact model.

For more detailed evaluation and comparison with other models, please refer to the Open Persian ASR Leaderboard.
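ASR benchmarks are commonly scored with word error rate (WER). The card does not state which metric was used for the results above, so the following is only a minimal sketch of how WER is typically computed, assuming whitespace tokenization and no text normalization. The function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word is missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("امروز هوا خوب است", "امروز هوا خوب"))  # 0.25
```

In practice, Persian text is usually normalized (for example, unifying character variants) before scoring; that step is omitted here.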

Usage

Whisper base is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library.

pip install --upgrade pip
pip install --upgrade transformers

The model can be used with the pipeline class to transcribe audio of arbitrary length:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the first GPU with half precision when available, otherwise CPU with full precision.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "C1Tech/whisper_base_persian"

# Load the fine-tuned Whisper weights from the Hugging Face Hub.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

# Build the speech recognition pipeline around the model and processor.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by specifying them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
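When a list of files is passed, the pipeline returns one result per input, in the same order, each a dictionary with a "text" key. The following sketch pairs each path with its transcript; the helper name `transcripts_by_file` is hypothetical, and the results in the usage example are hand-written stand-ins for pipeline output.

```python
def transcripts_by_file(paths, results):
    """Map each input path to the stripped transcript text of its result."""
    if len(paths) != len(results):
        raise ValueError("expected exactly one result per input path")
    return {str(path): result["text"].strip() for path, result in zip(paths, results)}


# Stand-in for: results = pipe(paths, batch_size=2)
paths = ["audio_1.mp3", "audio_2.mp3"]
results = [{"text": " سلام"}, {"text": " خداحافظ"}]

for path, text in transcripts_by_file(paths, results).items():
    print(f"{path}: {text}")
```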

Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and conditioning on previous tokens. The following example demonstrates how to configure these heuristics:

generate_kwargs = {
    "num_beams": 3,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
    "language": "fa"
}

result = pipe("audio.mp3", generate_kwargs=generate_kwargs)
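To make the temperature fallback idea concrete, here is a simplified sketch of the heuristic: decode at the lowest temperature first, and retry at the next one whenever the output looks too repetitive (high compression ratio) or too uncertain (low average log-probability). This is not the Transformers implementation. The `decode` callable is hypothetical, the ratio is computed on UTF-8 text rather than in token space, and the text-space threshold of 2.4 is an assumption for illustration, not the token-space value 1.35 used above.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length; high means repetitive."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def needs_fallback(text, avg_logprob,
                   compression_ratio_threshold=2.4,  # assumed text-space value
                   logprob_threshold=-1.0):
    """Return True when a decoding attempt should be retried at a higher temperature."""
    too_repetitive = compression_ratio(text) > compression_ratio_threshold
    too_uncertain = avg_logprob < logprob_threshold
    return too_repetitive or too_uncertain


def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Try each temperature in order and keep the first acceptable attempt.

    `decode` is a hypothetical callable: temperature -> (text, avg_logprob).
    If no attempt passes, the result of the last temperature is returned.
    """
    for temperature in temperatures:
        text, avg_logprob = decode(temperature)
        if not needs_fallback(text, avg_logprob):
            break
    return text, temperature


# Toy decoder: loops on one word at temperature 0.0, recovers at 0.2.
def fake_decode(temperature):
    if temperature == 0.0:
        return "سلام " * 50, -0.3
    return "امروز هوا خوب است", -0.3


text, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature, text)
```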

Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the return_timestamps argument:

result = pipe("audio.mp3", return_timestamps=True)
print(result["chunks"])

And for word-level timestamps:

result = pipe("audio.mp3", return_timestamps="word")
print(result["chunks"])
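The timestamped chunks can be turned into subtitles. The sketch below assumes each chunk is a dictionary with a "timestamp" pair of (start, end) in seconds and a "text" string, and that the end time of the final chunk may be None. The helper names are illustrative, and the chunks in the usage example are hand-written stand-ins for pipeline output.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks) -> str:
    """Render pipeline-style chunks as numbered SRT subtitle blocks."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            end = start  # no end time available; fall back to the start time
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Stand-in for: chunks = pipe("audio.mp3", return_timestamps=True)["chunks"]
chunks = [
    {"timestamp": (0.0, 2.5), "text": " سلام"},
    {"timestamp": (2.5, None), "text": " دنیا"},
]
print(chunks_to_srt(chunks))
```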

For further information, get in touch: info@c1tech.group

Safetensors

Model size: 72.6M params
Tensor type: F32