This project contains a Python script that parses a Kịrịkẹ-English dictionary PDF and converts it into structured formats (CSV, JSON, and Markdown).
- Python 3.7 or higher
- pip (Python package installer)
-
Clone this repository:
git clone https://github.com/seanwillex/Python-Script-For-PDF-Data-Processing.git cd kirike-english-dictionary-parser -
(Optional) Create and activate a virtual environment:
python -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate` -
Install the required packages:
pip install -r requirements.txt
-
Place your Kịrịkẹ-English dictionary PDF file in the same directory as the script.
-
Open
process_dictionary.pyand update theinput_filevariable with your PDF filename:input_file = "your-dictionary-file.pdf"
-
Run the script:
python process_dictionary.py -
The script will generate three output files in the same directory:
kirike-english-dictionary.csv: CSV version of the dictionarykirike-english-dictionary.json: JSON version of the dictionarykirike-english-dictionary.md: Markdown version of the dictionary
To use this script with a different PDF dictionary:
-
Update the
input_filevariable inprocess_dictionary.pywith your new PDF filename. -
If your dictionary starts on a different page, modify the
read_pdf_filefunction:def read_pdf_file(file_path): # Change '15' to the page number where your dictionary starts (0-indexed) text = extract_text(file_path, page_numbers=range(15, 1000)) text = unicodedata.normalize('NFKD', text) return text
-
If your dictionary has a different structure, update the regular expression in the
parse_dictionaryfunction:match = re.match(r'^(\S+)\s+(\S+\.)\s+(\S+)\s+(.*)', line)
Modify this pattern to match your dictionary's entry format.
- The CSV file can be opened with spreadsheet software like Microsoft Excel or Google Sheets.
- The JSON file can be used in various programming languages and web applications.
- The Markdown file can be viewed with any Markdown viewer or text editor.
-
If you encounter encoding issues, try different normalization methods in the
read_pdf_filefunction:text = unicodedata.normalize('NFKC', text) # or 'NFC', or 'NFD'
-
If the script is not correctly parsing entries, check the regular expression in the
parse_dictionaryfunction and adjust it to match your PDF's structure. -
For PDFs with complex layouts or scanned images, you might need to use OCR software to convert the PDF to text before processing.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This README provides a comprehensive guide for users to understand, install, use, and customize your script. It also includes sections on troubleshooting and contributing, which are helpful for open-source projects.
Remember to update the repository URL in the Installation section with your actual GitHub repository URL once you've created it.
To add these files to your GitHub repository:
1. Create a new repository on GitHub.
2. Initialize a Git repository in your local project folder (if you haven't already):
git init
3. Add all the files:
git add .
4. Commit the files:
git commit -m "Initial commit"
5. Link your local repository to the GitHub repository:
git remote add origin https://github.com/seanwillex/Python-Script-For-PDF-Data-Processing.git
6. Push the files to GitHub:
git push -u origin master
This will set up your GitHub repository with the script, additional files, and a detailed README.md, making it easy for others to understand and use your Kịrịkẹ-English Dictionary Parser.