Neyshekar is an open, community-driven Persian speech dataset collected via a web-based crowdsourcing platform at https://ney.shekar.io. It is designed to support research and development in text-to-speech (TTS), automatic speech recognition (ASR), speech representation learning, and other downstream Persian speech applications.
The recordings are provided by a combination of volunteer contributors and paid voice actors, all of whom are native Persian speakers.
Neyshekar is released incrementally. Each release is a stable snapshot of the dataset at the time of publication, which enables reproducible research and consistent benchmarking.
v3 (2026-03-23)
- Total samples: 30,019
- Total duration (hours): 45.71
- Average clip duration (seconds): 5.48
- Total tokens: 331,714
- Vocab size: 23,972
Earlier release (20,020 samples)
- Total samples: 20,020
- Total duration (hours): 29.08
- Average clip duration (seconds): 5.23
- Total tokens: 208,472
- Vocab size: 20,853
Earlier release (10,044 samples)
- Total samples: 10,044
- Total duration (hours): 14.42
- Average clip duration (seconds): 5.17
- Total tokens: 103,757
- Vocab size: 15,224
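The average clip duration in each list above follows from the total duration and the sample count. The sketch below is a sanity check only and is not part of any Neyshekar tooling. The `RELEASES` table and both helper function names are hypothetical, and the snapshots are keyed by sample count. It recomputes the averages from the published figures and also derives tokens per clip, a figure the lists do not report.

```python
# Published figures copied from the release lists above.
# Keys are sample counts; this table is illustrative, not an official manifest.
RELEASES = {
    30019: {"hours": 45.71, "tokens": 331714, "reported_avg_seconds": 5.48},
    20020: {"hours": 29.08, "tokens": 208472, "reported_avg_seconds": 5.23},
    10044: {"hours": 14.42, "tokens": 103757, "reported_avg_seconds": 5.17},
}


def average_clip_seconds(samples: int, hours: float) -> float:
    """Mean clip length in seconds: total duration divided by clip count."""
    return hours * 3600 / samples


def tokens_per_clip(samples: int, tokens: int) -> float:
    """Mean number of text tokens per recorded clip."""
    return tokens / samples


for samples, stats in RELEASES.items():
    avg = average_clip_seconds(samples, stats["hours"])
    per_clip = tokens_per_clip(samples, stats["tokens"])
    # Compare at the two-decimal precision used in the published lists.
    matches = round(avg, 2) == stats["reported_avg_seconds"]
    print(f"{samples} samples: {avg:.2f} s/clip, "
          f"{per_clip:.2f} tokens/clip, matches reported: {matches}")
```

Rounded to two decimals, the recomputed averages agree with the reported values for all three snapshots.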
Any attempt to identify or uncover the identity of speakers in the Neyshekar datasets is strictly prohibited.
This dataset is released under the CC0 1.0 Universal license.
It may be used, modified, and redistributed for any purpose without restriction.