This project uses Recurrent Neural Networks (RNNs) to predict the language or country of origin of a name from its character sequence. Given a dataset of names from various countries, the model learns spelling and structural patterns to classify names (e.g., English, Arabic, Persian). The pipeline preprocesses names into character sequences, learns character embeddings, builds an LSTM/GRU model, trains on the encoded sequences, and predicts origins for new names.
Implemented in a Jupyter Notebook (`new_born.ipynb`) with Keras/TensorFlow, the project handles variable-length sequences with padding and writes predictions to `submission.csv` and `y_pred.npy`.
- Predict Name Origin: Classify a name's language/country from its characters.
- Sequence Modeling: Use RNNs (LSTM/GRU) for character-level classification.
- Handle Variable Lengths: Pad sequences to fixed length for batch processing.
- Generate Submission: Predict labels for test names and package for evaluation.
- Data Preprocessing (see the preprocessing sketch after this list):
  - Convert names to lowercase and one-hot encode each character (one vector position per character, covering a-z plus special characters).
  - Pad/truncate sequences to a fixed maximum length (e.g., 20 characters).
  - Build the vocabulary from the unique characters in the dataset.
- RNN Model:
  - Embedding layer for characters.
  - LSTM/GRU layers for sequence processing.
  - Dense output layer with softmax for multi-class prediction.
- Training (see the training sketch after this list):
  - Categorical cross-entropy loss, Adam optimizer.
  - Batch training with a validation split.
- Evaluation:
  - Accuracy and loss curves.
  - Predictions saved as a NumPy array and a CSV.
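A minimal preprocessing sketch; the names, labels, and `max_len` below are illustrative, not taken from the notebook. It integer-encodes characters for an Embedding layer, which is equivalent in spirit to the one-hot encoding described above:

```python
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

names = ["smith", "rahimi", "haddad"]  # illustrative lowercased names
labels = [0, 2, 1]                     # illustrative class ids
max_len = 20

# Build the character vocabulary; index 0 is reserved for padding.
chars = sorted(set("".join(names)))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}
vocab_size = len(char_to_idx) + 1

# Encode each name as a sequence of character indices, then pad to a fixed length.
sequences = [[char_to_idx[c] for c in name] for name in names]
X = pad_sequences(sequences, maxlen=max_len, padding="post")

# One-hot targets for categorical cross-entropy.
y = to_categorical(labels, num_classes=3)
```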
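And a training sketch, assuming `X` and `y` from the preprocessing step and a `model` compiled with `metrics=['accuracy']` as in the model snippet below; epochs and batch size are illustrative:

```python
import matplotlib.pyplot as plt

# Batch training with a held-out validation split to monitor overfitting.
history = model.fit(X, y, epochs=20, batch_size=64, validation_split=0.2)

# Accuracy and loss curves over epochs.
plt.plot(history.history["accuracy"], label="train acc")
plt.plot(history.history["val_accuracy"], label="val acc")
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.legend()
plt.show()
```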
- Open `new_born.ipynb` in Jupyter/Colab.
- Run cells:
  - Load data, build the character vocabulary, preprocess sequences.
  - Define the RNN model, compile, and train.
  - Predict on the test set and save outputs.
- Output:
  - `submission.csv`: Predicted labels.
  - `y_pred.npy`: Raw predictions.
  - `result.zip`: Zipped files.
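A minimal packaging sketch for the final zip step, assuming both output files already exist in the working directory (the notebook may zip them differently):

```python
import zipfile

# Bundle the prediction files into result.zip for submission.
with zipfile.ZipFile("result.zip", "w") as zf:
    zf.write("submission.csv")
    zf.write("y_pred.npy")
```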
- Preprocessing:

  ```python
  from keras.preprocessing.sequence import pad_sequences

  # One-hot encoded character sequences, padded to max_len
  X = pad_sequences(one_hot_sequences, maxlen=max_len)
  ```
- Model:

  ```python
  from keras.models import Sequential
  from keras.layers import LSTM, Dense, Embedding

  model = Sequential([
      Embedding(vocab_size, 64),                 # character embeddings
      LSTM(128),                                 # sequence encoder
      Dense(num_classes, activation='softmax'),  # per-class probabilities
  ])
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  ```
- Submission:

  ```python
  import numpy as np
  import pandas as pd

  y_pred = model.predict(X_test)
  np.save("y_pred.npy", y_pred)  # raw class probabilities
  submission = pd.DataFrame({'id': test_ids, 'label': np.argmax(y_pred, axis=1)})
  submission.to_csv('submission.csv', index=False)
  ```
- Cells 1-3: Imports, load data, build vocabulary/preprocess.
- Cells 4-6: Model definition, compilation, training.
- Cells 7-9: Evaluation, test predictions.
- Cell 10: Save outputs, zip submission.
- Metrics: Accuracy, top-k accuracy, confusion matrix (see the evaluation sketch after this list).
- Validation: Train-val split for monitoring overfitting.
- Sequence Handling: A bidirectional LSTM reads names in both directions for better context (see the masking sketch after this list).
- Character-Level: Names are treated as pure character sequences; no word-level features are used.
- Padding: Shorter names are zero-padded; masking lets the LSTM ignore the padded timesteps.
- Vocab Size: ~50-100 unique chars across languages.
- Improvements: Bidirectional RNNs, attention, multilingual embeddings.
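A sketch of the evaluation metrics using scikit-learn; `X_val` and `y_val` are assumed hold-out arrays (illustrative names, not from the notebook):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, top_k_accuracy_score

y_true = np.argmax(y_val, axis=1)  # integer labels from one-hot targets
proba = model.predict(X_val)       # per-class probabilities
pred = np.argmax(proba, axis=1)

print(confusion_matrix(y_true, pred))
print(top_k_accuracy_score(y_true, proba, k=3))
```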
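And a sketch of the bidirectional variant with padding masked out, assuming the same `vocab_size` and `num_classes` as in the model snippet above:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    # mask_zero=True tells downstream layers to skip zero-padded timesteps
    Embedding(vocab_size, 64, mask_zero=True),
    Bidirectional(LSTM(128)),  # reads the name left-to-right and right-to-left
    Dense(num_classes, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```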