This is the backing repo for ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ, a project that aims to make the Coptic language more learnable.
We use:
- GitHub for our code base.
- GitHub Pages for our website.
- AWS Route 53 for domain registration and DNS.
- Google Drive and Google Cloud for cloud storage.
- Google Analytics and Google Search Console for traffic tracking and analysis.
- Clone the repo with `--depth=1`, because the history is huge, even though many of the outrageously large files have since been cleaned up.
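  For example (the repo URL below is hypothetical; substitute the real one):

  ```bash
  git clone --depth=1 https://github.com/${OWNER}/${REPO}.git
  ```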
- Setting up the environment is necessary for a lot of the pipelines to work. In general, you should run this at the beginning of each development session:

  ```bash
  source .env
  ```

  Equivalently:

  ```bash
  . ./.env
  ```

  This sets up the Python virtual environment, and exports many environment variables and helpers, some of which are used by the pipelines, and some of which are simply intended for developer convenience.
  Alternatively, you can define a hook that sources it automatically once you `cd` into the directory. If you use ZSH, you can add the following to your `.zshrc` (replacing `${PATH_TO_COPTIC_REPO}` with the path to this repo):

  ```zsh
  coptic_source_hook() {
    if [[ $PWD == "${PATH_TO_COPTIC_REPO}" ]]; then
      source ./.env
      chpwd_functions[(Ie)$0]=()  # Remove ourselves from the array.
    fi
  }
  chpwd_functions+=(coptic_source_hook)
  ```
  For Bash, add this to your `.bashrc` (replacing `${PATH_TO_COPTIC_REPO}` appropriately):

  ```bash
  coptic_source_hook() {
    if [[ "$PWD" == "$PATH_TO_COPTIC_REPO" ]]; then
      source ./.env
      # Remove ourselves from PROMPT_COMMAND.
      PROMPT_COMMAND=${PROMPT_COMMAND//coptic_source_hook; /}
    fi
  }
  PROMPT_COMMAND="coptic_source_hook; $PROMPT_COMMAND"
  ```
  Keep in mind that the Python `venv` will remain activated, and the environment variables will still be set, for as long as you're in the same shell session. You can deactivate the `venv` by running `deactivate`. Alternatively, you can just exit the shell window and start a new one.
- Running `make install` should take care of most of the installations. Sourcing `.env` is necessary for this to work. Note that `make install` only needs to be run once, while `.env` needs to be sourced for each session.

  If there are missing binaries that you need to download, `make install` will let you know. You may also need to log in with `gh`.
- Our pipelines are defined in `Makefile`, though some of them are only used during development and testing, and are not relevant for output regeneration.
- Keep in mind that parameters are written with the assumption that scripts are invoked from the repo's root directory, rather than from the directory where the script lives. You should do most of your development in the root directory.
- This file is the only `README.md` in the repo (and this is enforced by a pre-commit hook). Technical documentation is intentionally centralized. Besides this file, docs can be found in:
  - In-code comments
  - The planning framework
  - Commit messages (albeit less significantly)

  User-facing documentation shouldn't live in the repo; it should go on the website instead.
- We use pre-commit hooks extensively. They have helped us discover a lot of bugs and issues with our code, and they keep our repo organized. They are not optional. Their installation should be covered by `make install`. They are defined in `.pre-commit-config.yaml`, and they run automatically before a commit. You can execute the following to appease them (keep running them and applying their changes until they all pass), though keep in mind that `make test` runs `git add --all`:

  ```bash
  make test
  ```

  Our pipelines currently have minimal interdependencies. For a pair of dependent pipelines (where a downstream pipeline consumes the output of an upstream pipeline), the downstream will fare well even if the pre-commits haven't been executed on the output of the upstream pipeline. If this were to change, reopen #120.
We use GitHub to track our plans and TODOs.
This list of components helps us group our work into a number of well-defined focus areas. Milestones usually concern themselves with one of the components, and issues and commit messages should be prefixed with a component name between square brackets.
- Crum: Crum's dictionary
- KELLIA: KELLIA's dictionary
- Andreas: Andreas's dictionary
- Dawoud: Dawoud's dictionary
- Bible: The Coptic Bible
- Lexicon: ⲡⲓⲖⲉⲝⲓⲕⲟⲛ
- Site: Our website
- Morphology: Our morphological analysis pipelines
- Platform: The development platform and tooling
- Community: The community of contributors and users
- Milestones represent long-term, complex goals or deliverables. They help us draw our project path, and define what it is that we're trying to achieve in the long run. Milestones are a translation of the project's mission.
- Besides the more specific milestones that represent concrete goals, we have `(Backlog)` milestones, which represent miscellaneous pending improvements, technical debt, optimizations, or desired changes that don't block the achievement of one of the project's main goals.
- Milestone priorities are assigned using due dates.
- The number of milestones should remain under control.
- When work on a milestone is good enough, it's closed, the achievement is celebrated, and its remaining issues move to an appropriate backlog milestone.
- As much as possible, each milestone should be concerned with a single component.
- Every issue must belong to a milestone.
- Issues need to be as specific and isolated as possible. Most of the time, they span a single component and involve a local change or set of local changes. Sometimes they work mainly in one component and spill into others, and sometimes they're generic and span one aspect of multiple components (such as the conventions set for the whole repo).
- High-priority issues are marked in a number of ways:
  - The `favorable` label
  - Assignment to a developer
  - Belonging to a high-priority milestone
- Add `TODO`s to the code whenever appropriate, always following `TODO` with a colon, a space, and an issue number (with the pound sign) surrounded by parentheses. This format is enforced by a pre-commit hook, though the hook only picks up a `TODO` if it's immediately followed by `:`. If the `TODO` is low-priority, and isn't worth an associated issue, you can assign it to the pseudo-issue `#0`.
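  For instance (the issue number and comment text below are hypothetical):

  ```python
  # TODO: (#123) Handle entries that specify no dialect.
  # TODO: (#0) Rename this variable for clarity.
  ```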
Wherever possible, use labels to help track and organize issues. Issues mostly have exactly one How, and usually one Why.
Refer to the labels themselves for the most recent definitions, but they should belong to the following categories:
- How: How can the task be achieved?
  - `architect`: Planning and design.
  - `diplomacy`: Diplomacy, connections, and outreach.
  - `documentation`: Writing documentation.
  - `labor`: Manual data collection.
  - `code`: There is no `code` label, because that would include most tasks. A task that doesn't have another How label is probably a code task.
- Who: Is the issue user-facing or developer-oriented?
  - `user`: A user-oriented improvement.
  - `dev`: A developer-oriented, not user-visible, improvement.
- Why: What is the purpose of this issue?
  - `data`: Expand the data that we own.
  - `rigor`: Improve the rigor (particularly when it comes to issues such as parsing, or inflection generation).
  - `UI`: Improve the user interface.
  - `bug`: Fix a bug.
  - `community`: Grow the ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ community.
- What: A generic set of labels:
  - `favorable`: Nice to do soon.
  - `backlog`: Low-impact / low-priority.
  - `reports`: User reports.
The project page offers alternative views of the issues, which can come in handy for planning purposes.
- Use the following format for the first line of the commit message:

  ```
  [#${ISSUE}][${COMPONENT}/${SUBCOMPONENT}] ${DESCRIPTION}
  ```
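  For example (the issue number, component, and description here are hypothetical):

  ```
  [#123][Crum/img] Resize the newly collected images.
  ```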
- Use proper punctuation and capitalization.
- The subcomponent is optional.
- Use `fix #${ISSUE}` to automatically close an issue with the commit.
- Besides the description line, include more details in the body of the commit message, though make sure that the more important docs live in the code.
- Add excessive in-code assertions, and validate your assumptions whenever possible. This is our first line of defense, and has been the champion when it comes to ensuring correctness and catching bugs.
- When it comes to error checking:
  - Employ assertions for sanity checks, such as catching logic errors, or situations that are impossible if your code is correct.
  - Employ exceptions for errors that may genuinely occur, such as potential typos in the input data.

  Exceptions tend to have error messages, which may be helpful. Assertions tend to simply crash without context. Use exceptions when the presence of an error message may be helpful.
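  As a minimal illustration of this convention (the function and data below are hypothetical):

  ```python
  def parse_dialect(code: str) -> str:
      """Map a dialect code to its full name."""
      dialects = {"S": "Sahidic", "B": "Bohairic"}
      if code not in dialects:
          # An exception: a typo in the input data is a real possibility,
          # and the message tells the maintainer exactly what to fix.
          raise ValueError(f"Unknown dialect code: {code!r}")
      name = dialects[code]
      # An assertion: a sanity check that can't fail if the code above
      # is correct.
      assert name, "Dialect names must be nonempty."
      return name
  ```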
- Use our `utils` packages where appropriate.
- Use our `paths` packages to store (1) the project's internal structure, including paths to other components' subdirectories, and (2) external dependencies.
- Document the code extensively.
- Use type hints extensively.
- Minimize dependence on HTML, and implement behaviours in TypeScript when possible.
- Avoid using a generic `utils` package. It can easily become a catch-all for unrelated logic, grow excessively large, and lose clear purpose. Instead, organize utilities into purpose-specific packages based on functionality.
- Some of our projects have a `data` subdirectory. Pay attention to the following distinction:
  - `raw/`: Data that is copied from elsewhere. This would, for example, include the Marcion SQL tables, copied as is, unmodified. The contents of this directory remain true to the original source.
  - `input/`: Data that we either modified or created. If we want to fix typos in data that we copied, we don't touch the data under `raw/`; we take the liberty to modify the copies that live under `input/`.
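  As an illustration (the file names below are hypothetical):

  ```
  data/
    raw/
      marcion.sql        # Copied verbatim from the source; never edited.
    input/
      marcion_fixed.tsv  # Our copy, with typo fixes and additions.
  ```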
- It has been helpful to be able to tell, from a quick glance at a TypeScript file:
  - What the classes used are.
  - What listeners are registered.
  - What elements are retrieved from the document.

  Therefore, whenever possible, try to abide by the following:
  - Group all classes in a `CLS` enum.
  - Group event listener registrations in one `addEventListeners` function (or use a function name that starts with this prefix, so it's easy to find in search).
  - Prefer this syntax:

    ```typescript
    element.addEventListener('click', () => {});
    ```

    over this:

    ```typescript
    element.onclick = () => {};
    ```

  - Use `querySelector` or `querySelectorAll` instead of such methods as `getElementsByClassName` or `getElementsByTagName`. The only exception is when retrieving an element by ID, in which case we enforce `getElementById`.
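  A minimal sketch of these conventions working together (the class name and element ID below are hypothetical):

  ```typescript
  // All class names used in this file, grouped in one place.
  enum CLS {
    HIDDEN = 'hidden',
  }

  // All listener registrations for this file, grouped in one function.
  function addEventListeners(): void {
    const button = document.getElementById('toggle');
    button?.addEventListener('click', () => {
      // Reveal every element that is currently hidden.
      document.querySelectorAll(`.${CLS.HIDDEN}`).forEach((el) => {
        el.classList.remove(CLS.HIDDEN);
      });
    });
  }

  addEventListeners();
  ```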
- Our pipelines are primarily written in Python. There is minimal logic in Bash.
- We have a strong bias for Python over Bash. Use Bash only if you expect an equivalent Python implementation to be significantly longer.
- We use TypeScript for static site logic, which gets transpiled to JavaScript by running `make transpile`. We don't write JavaScript directly.
- We expect to make a similar platform-specific expansion into another territory for the app.
- In the past, we voluntarily used Java (for an archived project). It won't happen again! We also used VBA and JS for Microsoft Excel and Google Sheets macros (also archived at the moment) because they were required by the platform.
- It is desirable to strike a balance between the benefits of focusing on a small number of languages, and the different powers that different languages can uniquely exhibit. We won't compromise the latter for the former. Use the right language for a task. When two languages can do a job equally well, uncompromisingly choose the one that is more familiar.
- We collect extensive stats, and we remind you of them using a pre-commit hook. The primary targets of our statistics are:
  - The size of our code (represented by the number of lines of code). We also collect this stat for each subproject or pipeline step independently.
  - The number of data items we've collected for data collection tasks.

  We also record the number of commits and the number of contributors.
This directory contains the data and logic for processing our dictionaries.
There are many reasons we have decided to add pictures to our dictionary, and have invested heavily in the image pipeline. They have become one of the integral pieces of our dictionary framework.
- The meaning of a word is much more strongly and concretely conveyed by an image than by a word. Learning is not about knowing vocabulary or grammar. Learning is ultimately about creating the neural pathways that enable language to flow out of you naturally. A given word needs to settle and connect with nodes in your associative memory in order for you to be able to use it. If our goal is to create or strengthen the neural pathways between a Coptic word and related nodes in your brain, then it aids the learning process to achieve as much neural activation as possible during learning. This is much better achieved by an image than by a mere translation, given the way human brains work. After all, the visual processing areas of our brains are bigger, faster, and far more ancient and primordial (even reptiles can see) compared to the language processing areas. You will often find that, when you learn a new word, the associated images pop up in your brain more readily than the translation. Thus the use of images essentially revolutionizes the language learning process.
- Oftentimes, words describe an entity or concept that is unfamiliar to many users: ancient crafts, plant or fish species, farmers' tools, and the like. Showing a user the English translation of such a word doesn't suffice for the user to understand what it is, and they would often look up images themselves in order to find out what the word actually means. By embedding the pictures in the dictionary, we save users time, so they don't have to look the images up themselves.
- Translations are often taken lightly by users. Pictures are not. When a dictionary author translates a given Coptic word into several different English words, for example, the extra translations are often seen by users as auxiliary: tokens added to convey a meaning that the dictionary author couldn't convey using fewer words.

  That's not the case for pictures. Pictures are taken seriously by users, and are more readily accepted as bearing a true, authentic, independent meaning of the word. Listing images (especially now that we have started ascribing each image to a sense that the word conveys) is a way to recognize and legitimize the different senses and meanings that a word possesses.

  It's for this reason that images must be deeply contemplated, and a word must be digested well, before we add explanatory images for it. Collecting images is tantamount to authoring a dictionary.
Our experience collecting images has taught us a few lessons. We follow these guidelines when we search for pictures:
- Each image ends up being resized to a width of 300 pixels, with the height scaled proportionally (for example, a 600×800 original becomes 300×400). We prefer images with a minimum width of 300 pixels, though widths down to 200 pixels are acceptable.
- As for image height, short images are rarely ugly, but long images usually are. So we set a generously low lower bound of 100 pixels on the resized height, but a stricter upper bound of 500 pixels, although we prefer the height to fall within the range of 200 to 400 pixels.
- Collecting sources is mandatory. We always record the URL that an image was retrieved from. Our `img_helper` script, which we use to process images, can be supplied with a URL; it will download the image and store the source (and also resize the image to its final version). This simplifies the process.
- We make extensive use of icons. They can capture the meaning of a word in situations where it's otherwise hard to describe the word using an image (example).
- This hasn't been fully contemplated, but when given a choice, prefer an ancient Egyptian explanatory image, followed by an old (not necessarily Egyptian) image, followed by a modern image (example). We prefer to keep the images as close as possible to their reflections in the mind of a native speaker. We also want to stress the fact that these Coptic words can equally be used to refer to entities from other cultures, or to modern entities.

  This could be revisited later.
Some entries have no dialect specified in Crum, so they get treated as belonging to all dialects. More information at #237.
The following entries are absent from Crum's dictionary. They were added to our database from other sources:
`copticocc_org/` contains a digital scan of Moawad Dawoud's dictionary.
TLA data:
The TLA data, which comprises the core of the dictionary, is retrieved from the Comprehensive Coptic Lexicon: Including Loanwords from Ancient Greek (v1.2).
- 84c104 integrates some changes made by Coptic Scriptorium to CDO's copy of the XML.
- We may have made some changes afterwards. Use `git log` or `git diff` to find them.
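  For example (the file path below is hypothetical; substitute the actual path of our copy of the XML):

  ```bash
  git log -p -- dictionary/tla/data/raw/lexicon.xml
  ```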
Supplemental forms:
Coptic Scriptorium has attempted to grow the TLA by adding supplemental forms. As of the time of writing, CDO is capable of expanding an entry by adding variant forms, but it can't add any new entries that lack a TLA ID.
- Bohairic supplemental forms are retrieved directly from the sheet maintained by Coptic Scriptorium. The data that CDO actually uses is unavailable to us, but it's derived from the sheet.
- Sahidic supplemental forms were snapshotted from the CDO's `inflections.tab` in October 2025. As of the time of writing, they remain in the CDO's `dev` branch.
Supplemental forms have been, well, problematic! They seem to be poorly maintained by Coptic Scriptorium. As of October 2025, besides the above issues with accessing the latest data or a stable snapshot, their processing code seems to suffer from at least the following:
- Parts-of-speech of supplemental forms are completely ignored.
- Markers of prenominal (`-`), pronominal (`⸗`), and qualitative (`†`) forms are omitted.
- For a given entry, Bohairic supplemental forms are taken on an all-or-nothing basis. No deduplication or merging is performed.
As of the time of writing, Lexicon often doesn't show the same set of supplemental forms that CDO shows. CDO doesn't seem to be under active development at the moment, and the above issues aren't expected to be resolved. We are considering reverting the addition of supplemental forms, and relying only on the TLA data.
Code:
We based our TLA processing logic on the CDO's `dictionary_reader.py`. Parts of the logic, particularly those pertaining to supplemental forms, are derived from pieces that, as of October 2025, live in the `dev` version of the file.
The original code is very badly written and completely unmaintainable, and it has several (small) bugs. Our code has since diverged significantly from the original, and there is little overlap left.
There are TLA and CDO artifacts that we chose to ignore in our own pipeline, such as Egyptian etymologies, entity types, and `oRef` tags.
This directory contains the data and logic for processing the Bible corpus.
There are several published versions of the Coptic Bible. The most recent, and most complete, is that of St. Shenouda the Archimandrite Coptic Society. It is the Coptic Bible project most worthy of investment at the moment.
This directory contains the data and logic for processing dictionaries into flashcards and Lexicon. It is named as such because our first use case was a flashcard app, although our use of the dictionaries has since become more versatile.
This directory contains the data and logic for generating the morphological dictionaries (to support inflections).
This directory contains the static data for our website.
We need data collectors. Data collection tasks bear the `labor` label. The `data` label is related, but is more generic.
Ⲉ̀ϣⲱⲡ ⲁⲓϣⲁⲛⲉⲣⲡⲉⲱⲃϣ Ⲓⲗ̅ⲏ̅ⲙ̅, ⲉⲓⲉ̀ⲉⲣⲡⲱⲃϣ ⲛ̀ⲧⲁⲟⲩⲓⲛⲁⲙ: Ⲡⲁⲗⲁⲥ ⲉϥⲉ̀ϫⲱⲗϫ ⲉ̀ⲧⲁϣ̀ⲃⲱⲃⲓ ⲉ̀ϣⲱⲡ ⲁⲓϣ̀ⲧⲉⲙⲉⲣⲡⲉⲙⲉⲩⲓ.

("If I forget you, O Jerusalem, may I forget my right hand; may my tongue cleave to my throat, if I do not remember you." Psalm 137:5-6)