Code explanation about data preprocessing #11

@sinlin0908

Hello, thank you for open-sourcing this project. I am trying to understand your code, but the data preprocessing in data.py confuses me.

In the vocabulary-building step:

print("Load corpus with train size %d, valid size %d, "
              "test size %d raw vocab size %d vocab size %d at cut_off %d OOV rate %f"
              % (len(self.train_corpus), len(self.valid_corpus), len(self.test_corpus),
                 raw_vocab_size, len(vocab_count), vocab_count[-1][1], float(discard_wc) / len(all_words)))

What do the train size, valid size, and test size mean?
All three values are 2, because each corpus is a tuple of length 2.
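
A minimal sketch of the behaviour described above, using made-up data rather than the repository's code. The `(dialogs, meta)` layout and the `describe_split` helper are assumptions for illustration only: calling `len()` on a 2-tuple always gives 2, no matter how many dialogues it holds.

```python
# Hypothetical corpus layout (an assumption, not taken from data.py):
# each split is a 2-tuple of (dialogs, meta).
def describe_split(split):
    """Return (length of the tuple itself, number of dialogs inside it)."""
    dialogs, _meta = split
    return len(split), len(dialogs)


train_corpus = (
    [["hello", "hi"], ["how are you", "fine"], ["bye", "bye"]],  # dialogs
    [{"id": 0}, {"id": 1}, {"id": 2}],                           # meta
)

tuple_len, n_dialogs = describe_split(train_corpus)
print(tuple_len, n_dialogs)
```

If the printed "size" is meant to be the number of dialogues, the length of the first element would be the value to report, under the layout assumed here.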

Do you mean that the vocabulary is built from the training, validation, and test data together?
In the code, however, only the training data is used to build the vocabulary.
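
For reference, here is a rough sketch of how the quantities named in the print statement could be derived from the training utterances alone. This is an assumption based only on the variable names (`raw_vocab_size`, `vocab_count`, `discard_wc`, `all_words`), not a copy of data.py; `build_vocab` and `max_vocab_size` are hypothetical names.

```python
from collections import Counter


def build_vocab(train_utterances, max_vocab_size):
    """Count words in the training utterances only and keep the most frequent ones."""
    # Flatten every tokenized training utterance into one word list.
    all_words = [word for utt in train_utterances for word in utt]

    # (word, count) pairs sorted from most to least frequent.
    vocab_count = Counter(all_words).most_common()
    raw_vocab_size = len(vocab_count)

    # Occurrences of words that fall outside the kept vocabulary.
    discard_wc = sum(count for _, count in vocab_count[max_vocab_size:])
    vocab_count = vocab_count[:max_vocab_size]

    cut_off = vocab_count[-1][1]  # count of the rarest word that is kept
    oov_rate = discard_wc / len(all_words)
    return vocab_count, raw_vocab_size, cut_off, oov_rate


train = [["a", "b", "a"], ["a", "c", "b"]]
vocab_count, raw_vocab_size, cut_off, oov_rate = build_vocab(train, max_vocab_size=2)
print(vocab_count, raw_vocab_size, cut_off, oov_rate)
```

Under this reading, words that appear only in the validation or test data would never enter the vocabulary and would be treated as out-of-vocabulary at evaluation time.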

In the dialogue-formatting step:
Is it essential to add [<s>,<d>,</s>] at the start of each dialogue?
Can I leave it out?

Thank you.
