pishoyg/coptic


ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ

This is the backing repo for ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ, a project that aims to make the Coptic language more learnable.

Technical Docs

Hosting

We use:

Getting started

  1. Clone the repo with --depth=1. The history is huge, even though many of the outrageously large files have been cleaned up.

  2. Set up the environment; this is necessary for a lot of pipelines to work.

    In general, you should run this at the beginning of each development session:

    source .env

    Equivalently:

    . ./.env

    This sets up the Python virtual environment and exports many environment variables and helpers, some of which are used by the pipelines, while others are simply intended for developer convenience.

    Alternatively, you can define a hook that would source it automatically once you cd into the directory. If you use ZSH, you can add the following to your .zshrc (replacing ${PATH_TO_COPTIC_REPO} with the path to this repo):

    coptic_source_hook() {
      if [[ $PWD == "${PATH_TO_COPTIC_REPO}" ]]; then
        source ./.env
        chpwd_functions[(Ie)$0]=() # remove ourselves from the array
      fi
    }
    chpwd_functions+=(coptic_source_hook)

    For Bash, add this to your .bashrc (replacing ${PATH_TO_COPTIC_REPO} appropriately):

    coptic_source_hook() {
      if [[ "$PWD" == "${PATH_TO_COPTIC_REPO}" ]]; then
        source ./.env
        # Remove ourselves from PROMPT_COMMAND once sourced.
        PROMPT_COMMAND="${PROMPT_COMMAND//coptic_source_hook; /}"
      fi
    }

    PROMPT_COMMAND="coptic_source_hook; $PROMPT_COMMAND"

    Keep in mind that the Python venv will continue to be activated afterwards, and the environment variables will still be set, as long as you're in the same shell session. You can deactivate the Python venv by running deactivate. Alternatively, you can just exit the shell window and start a new one.

  3. Running make install should take care of most of the installations. Sourcing .env is necessary for this to work. Note that make install only needs to be run once, while .env needs to be sourced in each session.

    If there are missing binaries that you need to download, make install will let you know. You may also need to log in with gh.

  4. Our pipelines are defined in Makefile. Some pipelines there are only used during development and testing, and are not relevant for output regeneration.

  5. Keep in mind that parameters are written with the assumption that scripts are being invoked from the repo's root directory, rather than from the directory where the script lives. You should do most of your development in the root directory.

  6. This file is the only README.md in the repo (and this is enforced by a pre-commit hook). Technical documentation is intentionally centralized. Besides this file, docs can be found in:

    User-facing documentation shouldn't live on the repo, but should go on the website instead.

  7. We use pre-commit hooks extensively. They have helped us discover a lot of bugs and issues in our code, and they keep our repo organized. They are not optional. Their installation is covered by make install, and they are defined in .pre-commit-config.yaml. They run automatically before each commit. To appease them, keep running the following and applying their changes until they all pass (keep in mind that make test runs git add --all):

    make test

    Our pipelines currently have minimal dependencies. For a pair of dependent pipelines (where a downstream pipeline consumes the output of an upstream one), the downstream will fare well even if the pre-commits haven't been executed on the upstream's output. If this were to change, reopen #120.

    You can also run the hooks directly:

    pre-commit

Planning

We use GitHub to track our plans and TODO's.

Components

This list of components helps us group our work into a number of well-defined focus areas. Milestones usually concern themselves with one of the components, and issues and commit messages should be prefixed with a component name between square brackets.

  1. Crum: Crum's dictionary.
  2. KELLIA: KELLIA's dictionary.
  3. Andreas: Andreas's dictionary.
  4. Dawoud: Dawoud's dictionary.
  5. Bible: The Coptic Bible.
  6. Lexicon: ⲡⲓⲖⲉⲝⲓⲕⲟⲛ.
  7. Site: Our website.
  8. Morphology: Our morphological analysis pipelines.
  9. Platform: The development platform and tooling.
  10. Community: The community of contributors and users.
  • Milestones represent long-term, complex goals or deliverables. They help us chart our project path and define what we're trying to achieve in the long run. Milestones are a translation of the project's mission.

  • Besides the more specific milestones that represent concrete goals, we have (Backlog) milestones, which represent miscellaneous pending improvements, technical debt, optimizations, or desired changes that don't block the achievement of the project's main goals.

  • Milestone priorities are assigned using due dates.

  • The number of milestones should remain under control.

  • When work on a milestone is good enough, it's closed, the achievement is celebrated, and its remaining issues move to an appropriate backlog milestone.

  • As much as possible, each milestone should be concerned with a given component.

  • Every issue must belong to a milestone.

  • Issues need to be as specific and isolated as possible. Most of the time, they span a single component and involve a local change or set of local changes. Sometimes they work mainly in one component and spill over into others, and sometimes they're generic and span one aspect of multiple components (such as the conventions set for the whole repo).

  • High-priority issues are marked in a number of ways:

    • The favorable label.
    • Assignment to a developer.
    • Belonging to a high-priority milestone.
  • Add TODOs to the code whenever appropriate, always following TODO with a colon, a space, and an issue number (with the pound sign) surrounded by parentheses. This format is enforced by a pre-commit hook, though the hook only picks up a TODO if it's immediately followed by :. If the TODO is low-priority and isn't worth an associated issue, you can assign it to the pseudo-issue #0.
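The TODO convention above can be sketched as a small checker. This is a hypothetical illustration, not the actual pre-commit hook; the function name and regex below are invented here.

```python
import re

# Hypothetical sketch of the TODO convention; the real hook may differ.
TODO_PATTERN = re.compile(r"TODO: \(#\d+\)")


def todos_well_formed(text: str) -> bool:
    """Check every TODO the hook would pick up (i.e. one immediately
    followed by a colon) against the `TODO: (#N)` format."""
    return all(
        TODO_PATTERN.match(text, m.start())
        for m in re.finditer(r"TODO:", text)
    )
```

For example, `# TODO: (#123) Handle undialected entries.` conforms, while `# TODO: handle this later` does not; `(#0)` marks a low-priority TODO with no associated issue.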

Wherever possible, use labels to help track and organize issues. Issues mostly have exactly one How, and usually one Why.

Refer to labels for the most recent definitions, but they should belong to the following categories:

  • How
    • How can the task be achieved?
      • architect: Planning and design.
      • diplomacy: Diplomacy, connections, and reachout.
      • documentation: Writing documentation.
      • labor: Manual data collection.
      • code: There is no code label, because code covers most tasks. A task that doesn't have another How label is probably a code task.
  • Who
    • Is the issue user-facing or developer-oriented?
      • user: A user-oriented improvement.
      • dev: A developer-oriented, not user-visible, improvement.
  • Why
    • What is the purpose of this issue?
      • data: Expand the data that we own.
      • rigor: Improve the rigor (particularly for such issues as parsing or inflection generation).
      • UI: Improve the user interface.
      • bug: Fix a bug.
      • community: Grow the ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ community.
  • What: A generic set of labels:
    • favorable: Nice to do soon.
    • backlog: Low-impact / low-priority.
    • reports: User reports.

The project page offers alternative views of the issues, which can come in handy for planning purposes.

  • Use the following format for the first line of the commit message:

    [#${ISSUE}][${COMPONENT}/${SUBCOMPONENT}] ${DESCRIPTION}
    
  • Use proper punctuation and capitalization.

  • The subcomponent is optional.

  • Use fix #${ISSUE} to automatically close an issue with the commit.

  • Besides the description line, include more details in the body of the commit message, though make sure that the more important docs live in the code.
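As an illustration, the first-line convention could be validated with a regex like the following. The issue number and components in the examples are hypothetical, and the repo doesn't necessarily enforce this exact pattern:

```python
import re

# Illustrative checker for the commit-message first-line convention.
FIRST_LINE = re.compile(
    r"\[#\d+\]"          # issue number, e.g. [#123]
    r"\[\w+(?:/\w+)?\]"  # component, with an optional subcomponent
    r" [A-Z].*[.!?]$"    # capitalized, properly punctuated description
)


def first_line_ok(line: str) -> bool:
    """Return True if the commit message's first line follows the format."""
    return FIRST_LINE.fullmatch(line) is not None
```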

Guidelines

  1. Add excessive in-code assertions, and validate your assumptions whenever possible. This is our first line of defense, and has been the champion when it comes to ensuring correctness and catching bugs.

  2. When it comes to error checking:

    • Employ assertions for sanity checks, such as catching logic errors, or situations that are impossible if your code is correct.
    • Employ exceptions for errors that may occur – such as potential typos in the input data.

    Exceptions carry error messages, which assertions lack: assertions tend to simply crash without context. Prefer exceptions when the presence of an error message may be helpful.
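The distinction can be illustrated with a short sketch; the dialect tags and function below are invented for illustration:

```python
# Hypothetical input-validation data, invented for this example.
KNOWN_DIALECTS = {"S", "B", "A", "F"}


def normalize_dialect(tag: str) -> str:
    """Exceptions for plausible input errors, assertions for conditions
    that can't fail if the code is correct."""
    normalized = tag.strip().upper()
    # Exception: the input data may genuinely contain a typo, so fail
    # with a message that helps locate it.
    if normalized not in KNOWN_DIALECTS:
        raise ValueError(f"unknown dialect tag: {tag!r}")
    # Assertion: a sanity check on our own logic; if it fires, the code
    # above is broken, not the input.
    assert normalized == normalized.strip()
    return normalized
```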

  3. Use our utils packages where appropriate:

  4. Use our paths packages to store (1) the project's internal structure, including subdirectories to other components, and (2) external dependencies:

  5. Document the code extensively.

  6. Use type hints extensively.

  7. Minimize dependence on HTML, and implement behaviours in TypeScript when possible.

  8. Avoid using a generic utils package. It can easily become a catch-all for unrelated logic, grow excessively large, and lose clear purpose. Instead, organize utilities into purpose-specific packages based on functionality.

  9. Some of our projects have a data subdirectory. Pay attention to the following distinction:

    • raw/: Data that is copied from elsewhere. This would, for example, include the Marcion SQL tables copied as is, unmodified. The contents of this directory remain true to the original source.

    • input/: Data that we either modified or created. If we want to fix typos to data that we copied, we don't touch the data under raw/, but we take the liberty to modify the copies that live under input/.

  10. It has been helpful to be able to know, from a quick glance at a TypeScript file:

    1. What the classes used are.
    2. What listeners are registered.
    3. What elements are retrieved from the document.

    Therefore, whenever possible, try to abide by the following:

    1. Group all classes in a CLS enum.
    2. Group event listener registrations into one addEventListeners function (or use a function name that starts with this prefix, so it's easy to find in search).
    3. Also prefer the following syntax:
      element.addEventListener('click', () => {});
      over this:
      element.onclick = () => {};
    4. Use querySelector or querySelectorAll instead of such methods as getElementsByClassName or getElementsByTagName. The only exception is when retrieving an element by ID, in which case we enforce getElementById.

Languages

  • Our pipelines are primarily written in Python. There is minimal logic in Bash.

  • We have a strong bias for Python over Bash. Use Bash only if you expect the equivalent Python to require significantly more lines of code.

  • We use TypeScript for static site logic. It then gets transpiled to JavaScript by running make transpile. We don't write JavaScript directly.

  • We expect to make a similar platform-specific expansion into another territory for the app.

  • In the past, we voluntarily used Java (for an archived project). Won't happen again! We also used VBA and JS for Microsoft Excel and Google Sheet macros (also archived at the moment) because they were required by the platform.

  • It is desirable to strike a balance between the benefits of focusing on a small number of languages and the different powers that different languages can uniquely exhibit. We won't compromise the latter for the former. Use the right language for the task. When two languages can do a job equally well, uncompromisingly choose the one that is more familiar.

  • We collect extensive stats, and a pre-commit hook reminds you of them. The primary targets of our statistics are:
    • The size of our code (represented by the number of lines of code). We also collect this stat for each subproject or pipeline step independently.
    • The number of data items we've collected for data collection tasks.
    • The number of commits and the number of contributors.

Project-specific

This directory contains the data and logic for processing our dictionaries.

Image Collection

Why?

There are many reasons we decided to add pictures to our dictionary and to invest heavily in the image pipeline. Images have become one of the integral pieces of our dictionary framework.

  1. The meaning of a word is much more strongly and concretely conveyed by an image than by a word. Learning is not about knowing vocabulary or grammar. Learning is ultimately about creating the neural pathways that enable language to flow out of you naturally. A given word needs to settle and connect with nodes in your associative memory in order for you to be able to use it. If our goal is to create or strengthen the neural pathways between a Coptic word and related nodes in your brain, then it aids the learning process to achieve as much neural activation as possible during learning. This is much better achieved by an image than by a mere translation, given the way human brains work. After all, the visual processing areas of our brains are bigger, faster, and far more ancient and primordial (even reptiles can see) compared to the language processing areas. You will often find that, when you learn a new word, the associated images pop up in your brain more readily than the translation. Thus the use of images essentially revolutionizes the language learning process.

  2. Oftentimes, words describe an entity or concept that is unfamiliar to many users: ancient crafts, plant or fish species, farmers' tools, and the like. Showing a user the English translation of such a word doesn't suffice for them to understand what it is, and they would often look up images themselves to find out what the word actually means. By embedding the pictures in the dictionary, we save users the trouble of looking them up themselves.

  3. Translations are often taken lightly by users. Pictures are not. When a dictionary author translates a given Coptic word into different English words, for example, the extra translations are often seen by users as auxiliary - tokens added there to convey a meaning that the dictionary author couldn't convey using fewer words.

    That's not the case for pictures. Pictures are taken seriously by users, and are more readily accepted as bearing a true, authentic, independent meaning of the word. Listing images (especially after we have started ascribing each image to a sense that the word conveys) is a way to recognize and legitimize those different senses and meanings that a word possesses.

    It's for this reason that images must be deeply contemplated, and a word must be digested well, before we add explanatory images for it. Collecting images is tantamount to authoring a dictionary.

Technical Guidelines

Our experience collecting images has taught us a few lessons. We tend to follow the following guidelines when we search for pictures:

  1. Each image ends up being resized to a width of 300 pixels and a height proportional to the original. We prefer images with a minimum width of 300 pixels, though widths down to 200 are acceptable.

  2. As for image height, short images are rarely ugly, but long images usually are. So we set a generously low lower bound of 100 pixels on the resized height, but a stricter upper bound of 500 pixels; we prefer the height to fall within a range of 200 to 400 pixels.

  3. Collecting sources is mandatory. We always record the URL that an image is retrieved from. Our img_helper script, which we use to process images, can be supplied with a URL; it will download the image, store the source, and resize the image to the final version. This simplifies the process.

  4. We make extensive use of icons. They can capture the meaning of a word in situations when it's otherwise hard to describe a word using an image (example).

  5. This hasn't been contemplated, but when given a choice, prefer an ancient Egyptian explanatory image, followed by an old (not necessarily Egyptian) image, followed by a modern image (example). We prefer to keep the images as close as possible to their reflections in the mind of a native speaker. We also want to stress the fact that those Coptic words can be equally used to refer to entities from other cultures, or modern entities.

    This could be revisited later.
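The numeric constraints above can be expressed as a small sketch. The helper names are hypothetical, and the actual img_helper logic may differ:

```python
TARGET_WIDTH = 300  # final width, per the guideline above


def resized_height(width: int, height: int) -> int:
    """Each image is resized to a 300-pixel width, height scaled
    proportionally."""
    return round(height * TARGET_WIDTH / width)


def acceptable(width: int, height: int) -> bool:
    """Width: prefer >= 300 pixels, accept down to 200.
    Resized height: must fall within [100, 500]; 200 to 400 is preferred."""
    return width >= 200 and 100 <= resized_height(width, height) <= 500
```

For example, a 600x400 original resizes to 300x200, which is acceptable, while a 300x600 original keeps its 600-pixel height and exceeds the upper bound.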

Undialected Entries

Some entries have no dialect specified in Crum, so they get treated as belonging to all dialects. More information at #237.

Entries that are Absent in Crum

The following entries are absent from Crum's dictionary. They were added to our database from other sources:

  1. 3380
  2. 3381
  3. 3382
  4. 3385

copticocc_org/ contains a digital scan of Moawad Dawoud's dictionary.

TLA data:

The TLA data, which comprises the core of the dictionary, is retrieved from Comprehensive Coptic Lexicon: Including Loanwords from Ancient Greek v 1.2.

  • 84c104 integrates some changes made by Coptic Scriptorium to CDO's copy of the XML.
  • We may have made some changes afterwards. Use git log or git diff to find them.

Supplemental forms:

Coptic Scriptorium has attempted to grow the TLA by adding supplemental forms. As of the time of writing, CDO is capable of expanding an entry by adding variant forms, but it can't add any new entries that lack a TLA ID.

  1. Bohairic supplemental forms are retrieved directly from the sheet maintained by Coptic Scriptorium. The data that CDO actually uses is unavailable to us, but it's derived from the sheet.

  2. Sahidic supplemental forms have been snapshotted from the CDO's inflections.tab in October 2025. As of the time of writing, they remain in the CDO's dev branch.

Supplemental forms have been, well, problematic! They seem to be poorly maintained by Coptic Scriptorium. As of October 2025, besides the issues above with accessing the latest data or a stable snapshot, their processing code also seems to suffer from at least the following:

  • Parts-of-speech of supplemental forms are completely ignored.
  • Markers of prenominal (-), pronominal (), and qualitative () forms, are omitted.
  • For a given entry, Bohairic supplemental forms are taken on an all-or-nothing basis. No deduplication or merging is performed.

As of the time of writing, Lexicon often doesn't show the same set of supplemental forms that CDO shows. CDO doesn't seem to be under active development at the moment, and the above issues aren't expected to be resolved. We are considering reverting the addition of supplemental forms, and relying only on the TLA data.

Code:

We based our TLA processing logic on the CDO's dictionary_reader.py. Parts of the logic, particularly those pertaining to supplemental forms, are derived from pieces that, as of October 2025, live in the dev version of the file.

The original code is very badly written, completely unmaintainable, and has several (small) bugs. Our code has since diverged significantly from the original, and little overlap is left.

There are TLA and CDO artifacts that we chose to ignore in our own pipeline, such as Egyptian Etymologies, entity types, and oRef tags.

This directory contains the data and logic for processing the Bible corpus.

There are several published versions of the Coptic Bible. The most recent, and most complete, is that of the St. Shenouda the Archimandrite Coptic Society. It is the Coptic Bible project most worthy of investment at the moment.

This directory contains the data and logic for processing dictionaries into flashcards and Lexicon. It is named as such because our first use case was a flashcard app, although our use of the dictionaries has since become more versatile.

This directory contains the data and logic for generating the morphological dictionaries (to support inflections).

This directory contains the static data for our website.

Data Collection

We need data collectors. Data collection tasks bear the labor label. The data label is related, but is more generic.


Ⲉ̀ϣⲱⲡ ⲁⲓϣⲁⲛⲉⲣⲡⲉⲱⲃϣ Ⲓⲗ̅ⲏ̅ⲙ̅, ⲉⲓⲉ̀ⲉⲣⲡⲱⲃϣ ⲛ̀ⲧⲁⲟⲩⲓⲛⲁⲙ: Ⲡⲁⲗⲁⲥ ⲉϥⲉ̀ϫⲱⲗϫ ⲉ̀ⲧⲁϣ̀ⲃⲱⲃⲓ ⲉ̀ϣⲱⲡ ⲁⲓϣ̀ⲧⲉⲙⲉⲣⲡⲉⲙⲉⲩⲓ.

("If I forget you, O Jerusalem, let my right hand be forgotten; let my tongue cleave to my throat if I do not remember you." Psalm 137:5-6)
