I am developing a speech recognition model using TensorFlow, and I have the following directory structure for my dataset. Within the ./dataset directory, I have subdirectories for validation, training, and testing (dev-clean, test-clean, train-clean-100). Each of these subdirectories contains several directories, each named with a unique ID, and inside each of those there are two files: audio.wav and text.txt.
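For illustration, the layout looks roughly like this (the numeric IDs are placeholders, not my actual folder names):

./dataset/
    dev-clean/
        84/
            audio.wav
            text.txt
        ...
    test-clean/
        ...
    train-clean-100/
        19/
            audio.wav
            text.txt
        ...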
I want to perform preprocessing on the audio and text data on the fly while loading the dataset. Here is my current dataset loading function:
# Imports used by the snippets below.
from typing import Any
import os

import tensorflow as tf


def build_dataset(self, path: str = "./dataset/train-clean-100") -> Any:
    # Collect the file paths only; the actual data is read later in the map() calls.
    audio_files = []
    text_files = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file == "audio.wav":
                audio_files.append(os.path.join(root, file))
            elif file == "text.txt":
                text_files.append(os.path.join(root, file))

    audio_dataset = tf.data.Dataset.from_tensor_slices(audio_files)
    audio_dataset = audio_dataset.map(
        self.preprocess_audio, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    text_dataset = tf.data.Dataset.from_tensor_slices(text_files)
    text_dataset = text_dataset.map(
        self.preprocess_text, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    # Pair each audio example with its transcript.
    dataset = tf.data.Dataset.zip((audio_dataset, text_dataset))
    return dataset

def preprocess_audio(self, audio_path: str) -> tf.Tensor:
    # Read and decode a single WAV file to a fixed 20000 samples.
    audio = tf.io.read_file(audio_path)
    audio, _ = tf.audio.decode_wav(audio, desired_samples=20000)
    return audio

def preprocess_text(self, text_path: tf.Tensor) -> Any:
    # Read the transcript and tokenize it with the vectorizer.
    texts = tf.io.read_file(text_path)
    tokens = self.vectorizer(texts)
    return tokens
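For context, self.vectorizer is my text tokenizer, built elsewhere in the class. The exact setup shouldn't matter for this question; a simplified character-level stand-in would look something like this:

# Simplified stand-in for the tokenizer referenced above as self.vectorizer;
# my real vocabulary and settings differ.
self.vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower",   # lowercase transcripts before tokenizing
    split="character",     # character-level tokens
    output_mode="int",     # map each character to an integer id
)
self.vectorizer.adapt(["example transcript used to build the vocabulary"])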
This implementation works fine when I don't use batching (loading a single item at a time). However, when I call .batch() on the dataset, my development environment (VS Code) and Python both crash. I suspect this is because the entire dataset is being loaded into memory rather than just the file names.
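For reference, the crash happens as soon as I chain a batch call onto the returned dataset, roughly like this (the batch size of 32 is arbitrary):

dataset = self.build_dataset("./dataset/train-clean-100")
dataset = dataset.batch(32)  # this is the call that triggers the crash
for audio, tokens in dataset.take(1):
    print(audio.shape, tokens.shape)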
How can I efficiently batch this dataset while still loading the data on the fly, rather than loading everything into memory at once?