amirivojdan/neyshekar

Neyshekar

Neyshekar is an open, community-driven Persian speech dataset collected via a web-based crowdsourcing platform at https://ney.shekar.io. It is designed to support research and development in text-to-speech (TTS), automatic speech recognition (ASR), speech representation learning, and other downstream Persian speech applications.

The recordings are provided by a combination of volunteer contributors and paid voice actors, all of whom are native Persian speakers. Each release represents a stable snapshot of the dataset, enabling reproducible research and consistent benchmarking.

Dataset Releases

Neyshekar is released incrementally. Summary statistics for each release are listed below, newest first.

v3 — 2026-03-23 (download)

  • Total samples: 30,019
  • Total duration: 45.71 hours
  • Average clip duration: 5.48 seconds
  • Total tokens: 331,714
  • Vocabulary size: 23,972

v2 — 2026-01-15

  • Total samples: 20,020
  • Total duration: 29.08 hours
  • Average clip duration: 5.23 seconds
  • Total tokens: 208,472
  • Vocabulary size: 20,853

v1 — 2025-12-29

  • Total samples: 10,044
  • Total duration: 14.42 hours
  • Average clip duration: 5.17 seconds
  • Total tokens: 103,757
  • Vocabulary size: 15,224
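
The figures for each release are related: the average clip duration should equal the total duration divided by the number of samples. The sketch below checks this for the three releases using only the numbers listed above. It is an illustrative sanity check, not part of the dataset tooling; the names `RELEASES`, `mean_clip_seconds`, and `check_release`, and the 0.01-second tolerance (chosen to absorb rounding in the published values), are assumptions made for this example.

```python
# Figures copied from the release summaries in this README.
RELEASES = {
    "v1": {"samples": 10_044, "hours": 14.42, "avg_clip_s": 5.17, "tokens": 103_757},
    "v2": {"samples": 20_020, "hours": 29.08, "avg_clip_s": 5.23, "tokens": 208_472},
    "v3": {"samples": 30_019, "hours": 45.71, "avg_clip_s": 5.48, "tokens": 331_714},
}


def mean_clip_seconds(samples: int, hours: float) -> float:
    """Average clip length in seconds, derived from total hours and clip count."""
    return hours * 3600 / samples


def check_release(stats: dict, tolerance: float = 0.01) -> bool:
    """True if the published average matches the derived one within `tolerance` seconds.

    The tolerance is an assumption: the published values are rounded to two decimals.
    """
    derived = mean_clip_seconds(stats["samples"], stats["hours"])
    return abs(derived - stats["avg_clip_s"]) <= tolerance


if __name__ == "__main__":
    for name, stats in RELEASES.items():
        derived = mean_clip_seconds(stats["samples"], stats["hours"])
        tokens_per_clip = stats["tokens"] / stats["samples"]
        status = "consistent" if check_release(stats) else "MISMATCH"
        print(f"{name}: derived avg {derived:.2f} s, "
              f"{tokens_per_clip:.1f} tokens per clip, {status}")
```

The same check could be run against a downloaded release by replacing the hard-coded figures with values computed from the actual audio files and transcripts.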

Terms of Use

Any attempt to identify or uncover the identity of speakers in the Neyshekar datasets is strictly prohibited.

License

This dataset is released under the CC0 1.0 Universal license.
It may be used, modified, and redistributed for any purpose without restriction.