Code explanation about data preprocessing

Hello, thank you for open-sourcing your code. I am trying to understand it, but the data preprocessing in data.py is confusing to me.

When building the vocabulary,
 
```python
print("Load corpus with train size %d, valid size %d, "
              "test size %d raw vocab size %d vocab size %d at cut_off %d OOV rate %f"
              % (len(self.train_corpus), len(self.valid_corpus), len(self.test_corpus),
                 raw_vocab_size, len(vocab_count), vocab_count[-1][1], float(discard_wc) / len(all_words)))
```

What do the train size, valid size, and test size mean?
All three values are 2, since each corpus is a tuple of length 2.
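
A minimal sketch of what I am seeing, assuming each split is a `(dialogs, metadata)` tuple. The variable contents below are hypothetical and not taken from data.py; they only show why `len()` on the tuple reports 2 no matter how many dialogues it holds.

```python
# Hypothetical stand-in for one processed split: a 2-tuple of (dialogs, metadata).
dialogs = [["hello", "hi"], ["how are you"], ["bye", "see you", "later"]]
meta = [{"topic": "greeting"}, {"topic": "smalltalk"}, {"topic": "farewell"}]
train_corpus = (dialogs, meta)

# len() on the tuple counts its two members, not the dialogues inside it.
tuple_len = len(train_corpus)

# The number of dialogues is the length of the first member.
num_dialogs = len(train_corpus[0])

print("tuple length: %d, number of dialogues: %d" % (tuple_len, num_dialogs))
```

If that assumption holds, printing `len(self.train_corpus[0])` instead would report the number of dialogues.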

Do you mean that the vocabulary is built from the training, validation, and test data together?
In the code, however, only the training data is used to build the vocabulary.
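
For reference, this is how I understand the vocabulary statistics in the print statement, written as a small sketch. It assumes a frequency-ranked vocabulary truncated to a maximum size; `train_utterances` and `max_vocab_cnt` are hypothetical names, and the other names only mirror those in the print statement above.

```python
from collections import Counter

# Hypothetical tokenized training utterances (training split only).
train_utterances = [
    ["hello", "how", "are", "you"],
    ["hello", "i", "am", "fine"],
    ["are", "you", "fine"],
]
max_vocab_cnt = 4  # hypothetical cap on the vocabulary size

# Flatten the training data into one list of words and count them.
all_words = [w for utt in train_utterances for w in utt]
vocab_count = Counter(all_words).most_common()
raw_vocab_size = len(vocab_count)

# Words ranked below the cap are discarded and become out-of-vocabulary.
discard_wc = sum(c for _, c in vocab_count[max_vocab_cnt:])
vocab_count = vocab_count[:max_vocab_cnt]

cut_off = vocab_count[-1][1]  # frequency of the rarest kept word
oov_rate = float(discard_wc) / len(all_words)

print("raw vocab size %d vocab size %d at cut_off %d OOV rate %f"
      % (raw_vocab_size, len(vocab_count), cut_off, oov_rate))
```

In this sketch the OOV rate is measured on the training words only, which is why I am asking whether the validation and test data are meant to contribute as well.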

When formatting a dialogue,
is it essential to add [\<s\>,\<d\>,\</s\>] at the start of the dialogue?
Can I leave it out?
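
To make the question concrete, here is a rough sketch of the formatting as I understand it. The function name `format_dialog` and the sample data are hypothetical, and my guess that the extra utterance serves as a placeholder context for the first real turn is an assumption, not something I found stated in the code.

```python
def format_dialog(raw_dialog, add_start_utt=True):
    """Wrap each tokenized utterance in <s> ... </s>.

    If add_start_utt is True, prepend a dummy utterance [<s>, <d>, </s>]
    so that the first real utterance has a non-empty context.
    """
    formatted = [["<s>"] + tokens + ["</s>"] for tokens in raw_dialog]
    if add_start_utt:
        formatted = [["<s>", "<d>", "</s>"]] + formatted
    return formatted


# Hypothetical two-turn dialogue, already tokenized.
raw = [["hello", "there"], ["hi"]]

with_start = format_dialog(raw)
without_start = format_dialog(raw, add_start_utt=False)

print(with_start)
print(without_start)
```

Without the dummy utterance, the first turn would have an empty context, so I would like to know whether the model depends on it.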

Thank you.



