This project uses Recurrent Neural Networks (RNNs) to predict the language or country of origin of a name from its character sequence. Given a dataset of names from various countries, the model learns spelling and structural patterns to classify names (e.g., English, Arabic, Persian). The pipeline preprocesses names into character sequences, learns character embeddings, builds an LSTM/GRU model, trains on the encoded sequences, and predicts origins for new names.
Implemented in a Jupyter Notebook (`new_born.ipynb`) with Keras/TensorFlow, the project handles variable-length sequences with padding and writes predictions to `submission.csv` and `y_pred.npy`.
- Predict Name Origin: Classify a name's language/country from its characters.
- Sequence Modeling: Use RNNs (LSTM/GRU) for character-level classification.
- Handle Variable Lengths: Pad sequences to fixed length for batch processing.
- Generate Submission: Predict labels for test names and package for evaluation.
- Data Preprocessing (see the preprocessing sketch after this list):
  - Convert names to lowercase and one-hot encode each character (one vector position per character, covering a-z plus special characters).
  - Pad/truncate sequences to a fixed maximum length (e.g., 20 characters).
  - Build the vocabulary from the unique characters in the dataset.
- RNN Model:
  - Embedding layer for characters.
  - LSTM/GRU layers for sequence processing.
  - Dense output layer with softmax for multi-class prediction.
- Training (see the training sketch after this list):
  - Categorical cross-entropy loss, Adam optimizer.
  - Batch training with a validation split.
- Evaluation:
  - Accuracy and loss curves.
  - Predictions saved as a NumPy array and a CSV.
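A minimal preprocessing sketch; the names, labels, and `max_len` below are illustrative, not taken from the notebook. It integer-encodes characters for an Embedding layer, which is equivalent in spirit to the one-hot encoding described above:

```python
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

names = ["smith", "rahimi", "haddad"]  # illustrative lowercased names
labels = [0, 2, 1]                     # illustrative class ids
max_len = 20

# Build the character vocabulary; index 0 is reserved for padding.
chars = sorted(set("".join(names)))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}
vocab_size = len(char_to_idx) + 1

# Encode each name as a sequence of character indices, then pad to a fixed length.
sequences = [[char_to_idx[c] for c in name] for name in names]
X = pad_sequences(sequences, maxlen=max_len, padding="post")

# One-hot targets for categorical cross-entropy.
y = to_categorical(labels, num_classes=3)
```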
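And a training sketch, assuming `X` and `y` from the preprocessing step and a `model` compiled with `metrics=['accuracy']` as in the model snippet below; epochs and batch size are illustrative:

```python
import matplotlib.pyplot as plt

# Batch training with a held-out validation split to monitor overfitting.
history = model.fit(X, y, epochs=20, batch_size=64, validation_split=0.2)

# Accuracy and loss curves over epochs.
plt.plot(history.history["accuracy"], label="train acc")
plt.plot(history.history["val_accuracy"], label="val acc")
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.legend()
plt.show()
```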
- Open `new_born.ipynb` in Jupyter/Colab.
- Run cells:
  - Load data, build the character vocabulary, preprocess sequences.
  - Define the RNN model, compile, and train.
  - Predict on the test set and save outputs.
- Output:
  - `submission.csv`: Predicted labels.
  - `y_pred.npy`: Raw predictions.
  - `result.zip`: Zipped files.
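A minimal packaging sketch for the final zip step, assuming both output files already exist in the working directory (the notebook may zip them differently):

```python
import zipfile

# Bundle the prediction files into result.zip for submission.
with zipfile.ZipFile("result.zip", "w") as zf:
    zf.write("submission.csv")
    zf.write("y_pred.npy")
```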
- Preprocessing:

  ```python
  from keras.preprocessing.sequence import pad_sequences

  # One-hot encoded character sequences, padded to max_len
  X = pad_sequences(one_hot_sequences, maxlen=max_len)
  ```
- Model:

  ```python
  from keras.models import Sequential
  from keras.layers import LSTM, Dense, Embedding

  model = Sequential([
      Embedding(vocab_size, 64),                 # character embeddings
      LSTM(128),                                 # sequence encoder
      Dense(num_classes, activation='softmax'),  # per-class probabilities
  ])
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  ```
- Submission:

  ```python
  import numpy as np
  import pandas as pd

  y_pred = model.predict(X_test)
  np.save("y_pred.npy", y_pred)  # raw class probabilities
  submission = pd.DataFrame({'id': test_ids, 'label': np.argmax(y_pred, axis=1)})
  submission.to_csv('submission.csv', index=False)
  ```
- Cells 1-3: Imports, load data, build vocabulary/preprocess.
- Cells 4-6: Model definition, compilation, training.
- Cells 7-9: Evaluation, test predictions.
- Cell 10: Save outputs, zip submission.
- Metrics: Accuracy, top-k accuracy, confusion matrix (see the evaluation sketch after this list).
- Validation: Train-val split for monitoring overfitting.
- Sequence Handling: A bidirectional LSTM reads names in both directions for better context (see the masking sketch after this list).
- Character-Level: Names are treated as pure character sequences; no word-level features are used.
- Padding: Shorter names are zero-padded; masking lets the LSTM ignore the padded timesteps.
- Vocab Size: ~50-100 unique chars across languages.
- Improvements: Bidirectional RNNs, attention, multilingual embeddings.
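A sketch of the evaluation metrics using scikit-learn; `X_val` and `y_val` are assumed hold-out arrays (illustrative names, not from the notebook):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, top_k_accuracy_score

y_true = np.argmax(y_val, axis=1)  # integer labels from one-hot targets
proba = model.predict(X_val)       # per-class probabilities
pred = np.argmax(proba, axis=1)

print(confusion_matrix(y_true, pred))
print(top_k_accuracy_score(y_true, proba, k=3))
```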
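And a sketch of the bidirectional variant with padding masked out, assuming the same `vocab_size` and `num_classes` as in the model snippet above:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    # mask_zero=True tells downstream layers to skip zero-padded timesteps
    Embedding(vocab_size, 64, mask_zero=True),
    Bidirectional(LSTM(128)),  # reads the name left-to-right and right-to-left
    Dense(num_classes, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```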