How can I batch a TensorFlow dataset without loading all the data into memory simultaneously?

I am developing a speech recognition model using TensorFlow. Within the ./dataset directory, I have subdirectories for validation, training, and testing (dev-clean, train-clean-100, and test-clean respectively). Each of these contains several directories, each named with a unique ID, and each ID directory holds two files: audio.wav and text.txt.
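For concreteness, the layout looks roughly like this (the ID directory names here are made up):

./dataset
├── train-clean-100
│   ├── 1034
│   │   ├── audio.wav
│   │   └── text.txt
│   └── 2145
│       ├── audio.wav
│       └── text.txt
├── dev-clean
│   └── ...
└── test-clean
    └── ...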

I want to preprocess the audio and text data on the fly while loading the dataset. Here is my current dataset-loading code:

import os
from typing import Any

import tensorflow as tf

def build_dataset(self, path: str = "./dataset/train-clean-100") -> Any:
    # Collect only the file paths here; the file contents are read lazily in map().
    audio_files = []
    text_files = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file == "audio.wav":
                audio_files.append(os.path.join(root, file))
            elif file == "text.txt":
                text_files.append(os.path.join(root, file))
    audio_dataset = tf.data.Dataset.from_tensor_slices(audio_files)
    audio_dataset = audio_dataset.map(
        self.preprocess_audio, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    text_dataset = tf.data.Dataset.from_tensor_slices(text_files)
    text_dataset = text_dataset.map(
        self.preprocess_text, num_parallel_calls=tf.data.experimental.AUTOTUNE)

    dataset = tf.data.Dataset.zip((audio_dataset, text_dataset))
    return dataset

def preprocess_audio(self, audio_path: tf.Tensor) -> tf.Tensor:
    # Read and decode one WAV file, requesting a fixed 20,000 samples.
    audio = tf.io.read_file(audio_path)
    audio, _ = tf.audio.decode_wav(audio, desired_samples=20000)
    return audio

def preprocess_text(self, text_path: tf.Tensor) -> Any:
    # Read one transcript and tokenize it with the model's vectorizer.
    texts = tf.io.read_file(text_path)
    tokens = self.vectorizer(texts)
    return tokens

This implementation works fine when I load a single item without batching. However, when I call .batch() on the dataset, my development environment (VSCode) and Python crash. I suspect the entire dataset is being loaded into memory instead of just the file names.
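For reference, this is roughly how I consume the dataset when the crash happens (the batch size of 32 is just an example):

dataset = self.build_dataset("./dataset/train-clean-100")
dataset = dataset.batch(32)  # this is the call that crashes

for audio_batch, token_batch in dataset.take(1):
    print(audio_batch.shape, token_batch.shape)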

How can I efficiently batch this dataset while still loading the data on the fly, rather than loading everything into memory at once?
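Would restructuring the pipeline so that the file paths are zipped first and both files are preprocessed in a single map() be a reasonable direction? Something like this untested sketch (load_pair, the batch size of 32, and the use of padded_batch with its default padding are placeholders of mine, not working code):

paths = tf.data.Dataset.from_tensor_slices((audio_files, text_files))

def load_pair(audio_path, text_path):
    # Both reads happen lazily, per element, inside the tf.data pipeline.
    return self.preprocess_audio(audio_path), self.preprocess_text(text_path)

dataset = (paths
           .map(load_pair, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .padded_batch(32)   # pads variable-length token sequences per batch (TF >= 2.2 defaults)
           .prefetch(tf.data.experimental.AUTOTUNE))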

Answers

No answers yet.



