Questions generator & validator #446
Open: kiyro7 wants to merge 13 commits into master from questions_generator
Commits (13)
835b5ce initial commit (kiyro7)
39d54cf first prototype (kiyro7)
d813c88 added LLM questions marker (kiyro7)
8e42c8e removed methodology (kiyro7)
48ed43c requirements.txt added versions (kiyro7)
1bcf046 simplified docker (kiyro7)
8a54af1 heuristic patterns update (kiyro7)
d7a57d7 updated questions ranking and added examples (kiyro7)
e20a3e0 docker-compose finally done (kiyro7)
5694ae7 interactive mode (kiyro7)
6ec4877 logging added (kiyro7)
0b28da7 logging update (kiyro7)
39f7626 docker fix (builds aprox 40 mins) (kiyro7)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| FROM python:3.10-slim | ||
|
|
||
| RUN apt-get update && apt-get install -y --no-install-recommends \ | ||
| git wget gcc g++ \ | ||
| libprotobuf-dev protobuf-compiler \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
|
|
||
| WORKDIR /app | ||
|
|
||
| ENV PIP_DISABLE_PIP_VERSION_CHECK=1 \ | ||
| PIP_DEFAULT_TIMEOUT=120 | ||
|
|
||
| COPY requirements.txt ./ | ||
| RUN pip install --no-cache-dir -r requirements.txt \ | ||
| "huggingface_hub[cli]" | ||
|
|
||
| COPY . . | ||
|
|
||
| COPY --chmod=755 docker/init-volumes.sh /usr/local/bin/init-volumes.sh | ||
|
|
||
| CMD ["bash"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,97 @@ | ||
| ## Running (the container keeps running indefinitely) | ||
| `docker-compose up` - IMPORTANT: the first build takes a VERY long time (30-40 minutes)! | ||
|
|
||
| ## Usage (interactive) | ||
| `docker compose exec app python run.py /app/vkr_examples/VKR1.docx --no-overflow-logs` - the `vkr_examples` folder is local and sits next to the compose file | ||
|
|
||
| ## Example of questions generated from a VKR (graduation thesis) text | ||
|
|
||
| [✔ OK] Как цель и задачи, сформулированные во введении, отражены в итоговых выводах заключения? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Какие термины и подходы из обзора предметной области легли в основу формальной постановки задачи? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] В каких требованиях к решению, указанных в постановке задачи, находят отражение цели работы? | ||
| - relevance: False | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Какие количественные или качественные свойства решения подтверждены в разделе «Исследования» и как они связаны с задачами введения? | ||
| - relevance: True | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Какие дополнительные материалы из приложений необходимы для проверки воспроизводимости результатов? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Как практическая значимость работы следует из задач и результатов исследования? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Какие ограничения метода решения указаны в тексте и как они влияют на достижение цели? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] --- rut5-base-multitask вопросы --- | ||
| - relevance: False | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Что такое ЛЭТИ? | ||
| - relevance: False | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Что является целью работы в веб-приложении? | ||
| - relevance: True | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Что было проведено в конце работы? | ||
| - relevance: False | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Что могут изменять объекты, располагаемые на карте? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Что представляет собой создание набора программных средств для отображения объектов на карте? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Сформировать требования к набору программных средств? | ||
| - relevance: True | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Что является объектом исследования? | ||
| - relevance: True | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] Что существует уже давно? | ||
| - relevance: True | ||
| - clarity: False | ||
| - difficulty:False | ||
|
|
||
| [✔ OK] Что можно дать в контексте набора программных средств? | ||
| - relevance: True | ||
| - clarity: True | ||
| - difficulty:False | ||
|
|
||
| [✖ FAIL] ГИС является интегрированной информационной системой? | ||
| - relevance: True | ||
| - clarity: False | ||
| - difficulty:False |
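The sample output above has a regular shape: a verdict line (`[✔ OK]` or `[✖ FAIL]`) followed by three flag lines. As a rough illustration of how such a report could be consumed by another tool, here is a minimal parsing sketch. It assumes the layout shown in the README holds for every entry; `QuestionVerdict` and `parse_report` are hypothetical names and are not part of this PR.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed layout, copied from the README sample:
#   [✔ OK] <question>
#   - relevance: True
#   - clarity: True
#   - difficulty:False
HEADER_RE = re.compile(r"^\[(?:✔ OK|✖ FAIL)\]\s*(?P<question>.+)$")
FLAG_RE = re.compile(r"^-\s*(?P<name>\w+):\s*(?P<value>True|False)$")


@dataclass
class QuestionVerdict:
    """One report entry: the question, its verdict, and the per-criterion flags."""
    question: str
    ok: bool
    flags: Dict[str, bool] = field(default_factory=dict)


def parse_report(text: str) -> List[QuestionVerdict]:
    """Parse validator output in the README layout into structured entries."""
    items: List[QuestionVerdict] = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            items.append(QuestionVerdict(header["question"], line.startswith("[✔ OK]")))
            continue
        flag = FLAG_RE.match(line)
        if flag and items:
            # Flag lines belong to the most recent verdict line.
            items[-1].flags[flag["name"]] = flag["value"] == "True"
    return items
```

In the sample, every `[✔ OK]` entry has both `relevance` and `clarity` set to `True`, but the actual verdict rule is not visible in this diff, so the sketch reads the verdict from the header line rather than recomputing it.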
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| services: | ||
| init: | ||
| build: . | ||
| entrypoint: ["/usr/local/bin/init-volumes.sh"] | ||
| volumes: | ||
| - rut5_model:/app/question_generator/rut5-base | ||
| - nltk_data:/nltk_data | ||
| restart: "no" | ||
|
|
||
| app: | ||
| build: . | ||
| depends_on: | ||
| init: | ||
| condition: service_completed_successfully | ||
| stdin_open: true | ||
| tty: true | ||
| volumes: | ||
| - rut5_model:/app/question_generator/rut5-base | ||
| - nltk_data:/nltk_data | ||
| - ./vkr_examples:/app/vkr_examples | ||
| command: ["bash", "-lc", "sleep infinity"] | ||
|
|
||
| volumes: | ||
| rut5_model: | ||
| nltk_data: |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| #!/usr/bin/env bash | ||
| set -e | ||
|
|
||
| MODEL_DIR="/app/question_generator/rut5-base" | ||
|
|
||
| echo "MODEL_DIR=${MODEL_DIR}" | ||
|
|
||
| mkdir -p "$MODEL_DIR" | ||
|
|
||
| if [ -z "$(ls -A "$MODEL_DIR" 2>/dev/null)" ]; then | ||
| echo "Model directory is empty. Downloading model to $MODEL_DIR..." | ||
| huggingface-cli download \ | ||
| cointegrated/rut5-base-multitask \ | ||
| --local-dir "$MODEL_DIR" \ | ||
| --local-dir-use-symlinks False | ||
| echo "Model downloaded." | ||
| else | ||
| echo "Model directory is not empty, skipping download." | ||
| fi |
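For comparison, the same "download only if the directory is empty" logic can be sketched in Python. This is not what the PR does (it uses the `huggingface-cli` shell call above); `ensure_model` and `_hf_download` are hypothetical names, and the downloader is injectable so that the emptiness check can be exercised without network access.

```python
from pathlib import Path
from typing import Callable

# Same repository the shell script downloads.
MODEL_REPO = "cointegrated/rut5-base-multitask"


def _hf_download(repo_id: str, target: Path) -> None:
    """Default downloader; imported lazily so the helper works without huggingface_hub."""
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo_id, local_dir=str(target))


def ensure_model(
    model_dir: Path,
    download: Callable[[str, Path], None] = _hf_download,
) -> bool:
    """Download the model only if model_dir is empty; return True if a download ran."""
    model_dir.mkdir(parents=True, exist_ok=True)
    # Like `ls -A`, iterdir() also sees dotfiles, so any leftover file skips the download.
    if any(model_dir.iterdir()):
        return False
    download(MODEL_REPO, model_dir)
    return True
```

As in the shell version, a partially downloaded model leaves the directory non-empty, so a failed first run would not be retried automatically.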
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,203 @@ | ||
| import re | ||
| import logging | ||
| import time | ||
| from contextlib import contextmanager | ||
| from typing import List, Dict | ||
|
|
||
| from nltk.tokenize import sent_tokenize, word_tokenize | ||
| from nltk.corpus import stopwords | ||
| from transformers import AutoTokenizer, AutoModelForSeq2SeqLM | ||
|
|
||
|
|
||
| @contextmanager | ||
| def timed(logger: logging.Logger, operation: str, level: int = logging.INFO, **extra): | ||
| start = time.perf_counter() | ||
| logger.log(level, "START %s %s", operation, (extra if extra else "")) | ||
| try: | ||
| yield | ||
| finally: | ||
| elapsed_ms = (time.perf_counter() - start) * 1000.0 | ||
| logger.log(level, "END %s | %.2f ms %s", operation, elapsed_ms, (extra if extra else "")) | ||
|
|
||
|
|
||
| class VkrQuestionGenerator: | ||
| """Hybrid question generator for VKR (graduation thesis) texts: NLTK + rut5-base-multitask.""" | ||
|
|
||
| SECTION_PATTERNS: Dict[str, str] = { | ||
| "Введение": r"Введение.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| "Обзор предметной области": r"Обзор предметной области.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| "Постановка задачи": r"Постановка задачи.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| "Метод решения": r"Метод решения.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| "Исследования": r"Исследования.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| "Заключение": r"Заключение.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| "Приложения": r"Приложения.*?(?=\n[A-ZА-Я][^\n]*\n)", | ||
| } | ||
|
|
||
| def __init__(self, vkr_text: str, model_path: str = "cointegrated/rut5-base-multitask"): | ||
| self.logger = logging.getLogger(__name__) | ||
|
|
||
| with timed(self.logger, "generator_init"): | ||
| self.vkr_text = vkr_text | ||
|
|
||
| with timed(self.logger, "sent_tokenize"): | ||
| self.sentences = sent_tokenize(vkr_text, language="russian") | ||
|
|
||
| with timed(self.logger, "load_stopwords"): | ||
| self.stopwords = set(stopwords.words("russian")) | ||
|
|
||
| # rut5-base-multitask model, used to phrase the questions | ||
| with timed(self.logger, "load_tokenizer", model_path=model_path): | ||
| self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False) | ||
|
|
||
| with timed(self.logger, "load_model", model_path=model_path): | ||
| self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path) | ||
|
|
||
| self.logger.info( | ||
| "Generator ready: sentences=%d stopwords=%d model_path=%s", | ||
| len(self.sentences), len(self.stopwords), model_path | ||
| ) | ||
|
|
||
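| def extract_section(self, title: str) -> str: | ||
| """Extracts a section by its heading pattern.""" | ||
| # Escape the title in the fallback pattern so regex metacharacters in it match literally. | ||
| pattern = self.SECTION_PATTERNS.get(title, rf"{re.escape(title)}.*?(?=\n[A-ZА-Я][^\n]*\n)") | ||
| # Note: re.IGNORECASE also applies to [A-ZА-Я], so the lookahead accepts any following line that starts with a letter. | ||
| m = re.search(pattern, self.vkr_text, re.DOTALL | re.IGNORECASE) | ||
| return m.group(0) if m else "" | ||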
|
|
||
| def extract_intro(self) -> str: | ||
| return self.extract_section("Введение") | ||
|
|
||
| def extract_conclusion(self) -> str: | ||
| return self.extract_section("Заключение") | ||
|
|
||
| def extract_keywords(self, text: str) -> List[str]: | ||
| """Extracts keywords from the text.""" | ||
| with timed(self.logger, "extract_keywords", text_len=len(text)): | ||
| tokens = word_tokenize(text.lower(), language="russian") | ||
| result = [ | ||
| t for t in tokens | ||
| if t.isalnum() and t not in self.stopwords and len(t) > 4 | ||
| ] | ||
| self.logger.info("Keywords extracted: %d", len(result)) | ||
| return result | ||
|
|
||
| def llm_generate_question(self, text_fragment: str) -> str: | ||
| """Generates a question wording via the rut5 `ask` task.""" | ||
| prompt = f"ask: {text_fragment}" | ||
| with timed(self.logger, "llm_generate_question", fragment_len=len(text_fragment)): | ||
| enc = self.tokenizer(prompt, return_tensors="pt", truncation=True) | ||
| out = self.model.generate( | ||
| **enc, | ||
| max_length=64, | ||
| num_beams=5, | ||
| early_stopping=True, | ||
| ) | ||
| decoded = self.tokenizer.decode(out[0], skip_special_tokens=True) | ||
| return decoded | ||
|
|
||
| def heuristic_questions(self) -> List[str]: | ||
| """Heuristics tied to the structure of a VKR.""" | ||
| with timed(self.logger, "heuristic_questions_total"): | ||
| intro = self.extract_intro() | ||
| overview = self.extract_section("Обзор предметной области") | ||
| objectives = self.extract_section("Постановка задачи") | ||
| method = self.extract_section("Метод решения") | ||
| research = self.extract_section("Исследования") | ||
| conc = self.extract_conclusion() | ||
| apps = self.extract_section("Приложения") | ||
|
|
||
| q: List[str] = [] | ||
|
|
||
| # Introduction ↔ Conclusion | ||
| if intro and conc: | ||
| q.append( | ||
| "Как цель и задачи, сформулированные во введении, отражены в итоговых выводах заключения?" | ||
| ) | ||
|
|
||
| # Domain overview | ||
| if overview: | ||
| q.append( | ||
| "Какие термины и подходы из обзора предметной области легли в основу формальной постановки задачи?" | ||
| ) | ||
|
|
||
| # Problem statement | ||
| if objectives: | ||
| q.append( | ||
| "В каких требованиях к решению, указанных в постановке задачи, находят отражение цели работы?" | ||
| ) | ||
|
|
||
| # Solution method | ||
| if method: | ||
| q.append( | ||
| "Как архитектура и алгоритмы, описанные в разделе «Метод решения», обеспечивают достижение поставленных требований?" | ||
| ) | ||
|
|
||
| # Research | ||
| if research: | ||
| q.append( | ||
| "Какие количественные или качественные свойства решения подтверждены в разделе «Исследования» и как они связаны с задачами введения?" | ||
| ) | ||
|
|
||
| # Appendices | ||
| if apps: | ||
| q.append( | ||
| "Какие дополнительные материалы из приложений необходимы для проверки воспроизводимости результатов?" | ||
| ) | ||
|
|
||
| # Mandatory general questions | ||
| q.extend([ | ||
| "Как практическая значимость работы следует из задач и результатов исследования?", | ||
| "Какие ограничения метода решения указаны в тексте и как они влияют на достижение цели?", | ||
| ]) | ||
|
|
||
| self.logger.info("Heuristic questions created: %d", len(q)) | ||
| return q | ||
|
|
||
| def generate_llm_questions(self, count: int = 5) -> List[str]: | ||
| """Generates N questions via rut5 from key fragments of the document.""" | ||
| q: List[str] = [] | ||
| fragments = self.sentences[:40] | ||
| step = max(1, len(fragments) // count) | ||
|
|
||
| self.logger.info("LLM generation setup: count=%d fragments=%d step=%d", count, len(fragments), step) | ||
|
|
||
| with timed(self.logger, "generate_llm_questions_total", count=count): | ||
| for i in range(0, len(fragments), step): | ||
| frag = fragments[i] | ||
| try: | ||
| # Requirement: for the LLM, log the time taken by each question | ||
| with timed(self.logger, "llm_question_item", fragment_index=i): | ||
| llm_q = self.llm_generate_question(frag) | ||
|
|
||
| if len(llm_q) > 10: | ||
| q.append(llm_q) | ||
| self.logger.info("LLM question accepted: idx=%d len=%d", len(q), len(llm_q)) | ||
| else: | ||
| self.logger.info("LLM question rejected (too short): len=%d", len(llm_q)) | ||
|
|
||
| except Exception as e: # noqa: BLE001 | ||
| self.logger.exception("LLM generation failed at fragment_index=%d: %s", i, e) | ||
| continue | ||
|
|
||
| if len(q) >= count: | ||
| break | ||
|
|
||
| self.logger.info("LLM questions created: %d", len(q)) | ||
| return q | ||
|
|
||
| def generate_all(self) -> List[str]: | ||
| """Generates the full question set: heuristics + LLM.""" | ||
| with timed(self.logger, "generate_all_total"): | ||
| result: List[str] = [] | ||
| # Requirement: for heuristic generation, timing the creation of all questions together is sufficient | ||
| with timed(self.logger, "generate_heuristic_block"): | ||
| result.extend(self.heuristic_questions()) | ||
|
|
||
| result.extend(["--- rut5-base-multitask вопросы ---"]) | ||
|
|
||
| with timed(self.logger, "generate_llm_block"): | ||
| result.extend(self.generate_llm_questions(count=10)) | ||
|
|
||
| deduped = list(dict.fromkeys(result)) | ||
|
|
||
| self.logger.info("generate_all done: raw=%d deduped=%d", len(result), len(deduped)) | ||
| return deduped |
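One point that may be worth checking in `extract_section`: the lookahead `(?=\n[A-ZА-Я][^\n]*\n)` is satisfied by any following line that starts with a letter and ends with a newline, so the lazy match can stop right after the heading, and a section that runs to the end of the text has no such line to find. The sketch below reproduces both effects on a toy text and shows one possible alternative, assuming each section heading sits on a line of its own and is one of the known titles. `extract_section_original` copies the pattern shape from the diff; `extract_section_by_headings`, `SECTION_TITLES` and `TEXT` are hypothetical and not part of this PR.

```python
import re
from typing import Sequence

SECTION_TITLES = (
    "Введение",
    "Обзор предметной области",
    "Постановка задачи",
    "Метод решения",
    "Исследования",
    "Заключение",
    "Приложения",
)

# Toy document; note there is no newline after the last line.
TEXT = (
    "Введение\n"
    "Первая строка введения.\n"
    "вторая строка введения.\n"
    "Заключение\n"
    "Итоги работы."
)


def extract_section_original(text: str, title: str) -> str:
    """Same pattern shape and flags as in the diff."""
    pattern = rf"{title}.*?(?=\n[A-ZА-Я][^\n]*\n)"
    m = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    return m.group(0) if m else ""


def extract_section_by_headings(
    text: str, title: str, titles: Sequence[str] = SECTION_TITLES
) -> str:
    """Hypothetical variant: a section runs to the next known heading line or the end of the text."""
    heading = "|".join(re.escape(t) for t in titles)
    pattern = (
        rf"^[ \t]*{re.escape(title)}[ \t]*$"     # requested heading on its own line
        r".*?"                                    # section body (lazy)
        rf"(?=^[ \t]*(?:{heading})[ \t]*$|\Z)"   # next known heading, or end of text
    )
    m = re.search(pattern, text, re.DOTALL | re.MULTILINE | re.IGNORECASE)
    return m.group(0).strip() if m else ""
```

On the toy text, the original-shaped pattern returns only the heading for `Введение` and an empty string for `Заключение`, while the variant returns the heading plus its body in both cases. Whether real VKR texts extracted from `.docx` keep headings on separate lines is an assumption that would need checking against the example files.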
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| transformers==4.57.3 | ||
| sentencepiece==0.2.1 | ||
| nltk==3.9.2 | ||
| huggingface_hub==0.36.0 | ||
| python-docx==1.2.0 | ||
| torch==2.5.1 |