Changes from all commits (104 commits)
3b438fd
Add templates for new methods
rvashurin Oct 14, 2024
e8c11cd
PPL + Reweighted
silvimica Oct 14, 2024
aa1f199
PPL upd
silvimica Oct 14, 2024
3ba0f91
PPL upd
silvimica Oct 14, 2024
94b8b18
PPL upd
silvimica Oct 14, 2024
0a437e6
Fix ppl
rvashurin Oct 14, 2024
a6a49ec
Fix yamls
rvashurin Oct 15, 2024
a5d5dc7
Fix small bugs
rvashurin Oct 16, 2024
d8a7278
Remove redundant line
rvashurin Oct 16, 2024
bb25a12
Add distilled sars
rvashurin Oct 25, 2024
0c76449
WiP
rvashurin Oct 25, 2024
d064d90
Add new sar variants
rvashurin Oct 30, 2024
ff39d6e
Fix various errors
rvashurin Oct 31, 2024
af30cb7
Add xsum config, different alignscore versions
rvashurin Oct 31, 2024
3fad16b
Smallfix
rvashurin Nov 8, 2024
b1f0346
Sample entropy
silvimica Nov 9, 2024
73e6a6f
Add entropy-based sentence sar
silvimica Nov 9, 2024
7e210af
Reconcile with remote
rvashurin Nov 11, 2024
d3d6723
New sentences
silvimica Nov 12, 2024
3504d82
Small fix to sample entropy
silvimica Nov 18, 2024
877a3dd
Save some additional stats for samples
rvashurin Nov 19, 2024
8185a35
Merge branch 'sent_sar_variants' of github.com:silvimica/lm-polygraph…
rvashurin Nov 19, 2024
9906efc
Use MTE sar
rvashurin Nov 19, 2024
7c89d11
Add batch iteration in MTE sar
rvashurin Nov 21, 2024
4fcd030
Make xsum work
rvashurin Nov 21, 2024
65c91f0
Prevent generation of newlines only
rvashurin Nov 22, 2024
2007f63
Consistent method set
rvashurin Nov 22, 2024
303d579
Fix lookback procedure
rvashurin Nov 23, 2024
0978d2a
One more stopping criterion fix
rvashurin Nov 23, 2024
c8d06dd
Use stop string criteria from transformers to stop generation early
rvashurin Nov 25, 2024
49fbbc2
Rollback
rvashurin Nov 25, 2024
6f41f7e
Top K entropy
rvashurin Nov 25, 2024
bc45583
Rename stuff
rvashurin Nov 25, 2024
cfeff82
Use renamed methods everywhere
rvashurin Nov 25, 2024
14dfb28
Fix ccp, add consistency between sampling and greedy, and ccpgsu
rvashurin Nov 27, 2024
b37936f
Fix sample entropy
rvashurin Nov 28, 2024
22a3a70
Expand configs
rvashurin Nov 28, 2024
2295028
Use only PRR
rvashurin Nov 28, 2024
ef8258b
Simplify and unify GSU
rvashurin Dec 4, 2024
a79620e
Add missing stats to yamls
rvashurin Dec 4, 2024
8ecc1d5
Add tqdm to ce similarity
rvashurin Dec 4, 2024
6f32205
Add model config for llama
rvashurin Dec 4, 2024
457f94b
Add dtype to load args
rvashurin Dec 5, 2024
1e1a3c8
Remove redundant methods
rvashurin Dec 5, 2024
41a0849
Add possibility of continuing estimation from saved manager, sampled …
rvashurin Dec 10, 2024
f570400
Add quality metrics based off first sample generated
rvashurin Dec 11, 2024
31b38e2
MaxSampledMaximumSequenceProbability
silvimica Dec 11, 2024
fca871d
Add average sample ue baseline, base manager params to all sentsar co…
rvashurin Dec 11, 2024
b9646f0
add MaxSampledPerplexity
silvimica Dec 11, 2024
6faa776
Use common scorers for alignscores and comets
rvashurin Dec 11, 2024
70f13a7
Merge branch 'sent_sar_variants' of github.com:silvimica/lm-polygraph…
rvashurin Dec 11, 2024
52166f5
Do not recalculate dependencies
rvashurin Dec 11, 2024
58c7f6d
Consider sampling-based evaluation in gen metric wrappers
rvashurin Dec 11, 2024
518de01
Correctly handle the case when last batch is not whole
rvashurin Dec 13, 2024
c579591
Use common batch size for full batches
rvashurin Dec 13, 2024
b67933c
Fix MTESAR
rvashurin Dec 13, 2024
19535ee
Import average baselines, correct batch initiation
rvashurin Dec 13, 2024
1b67bbd
Only check input stats for consistency
rvashurin Dec 13, 2024
f034cbd
Lighten the prr calculation, use logs as base for GSU and other new m…
rvashurin Dec 17, 2024
04b3d52
Save first sample texts separately
rvashurin Dec 17, 2024
3f98b59
Add degmat based on CE, log/exp differentiation for semantic methods …
rvashurin Dec 25, 2024
93aecdf
Add sample-based gen metrics from best samples
rvashurin Dec 25, 2024
6577279
Add sample-based gen metrics from best samples
rvashurin Dec 25, 2024
bc991b0
Save new stats in manager
rvashurin Dec 25, 2024
744b108
Small fixes
rvashurin Dec 25, 2024
65f9513
Add Comet against best
rvashurin Dec 26, 2024
2d8fa7e
Fix
rvashurin Dec 26, 2024
c5509ea
Avesimilarity
silvimica Dec 29, 2024
fb601db
UE metric enriched with average dissimilarity
silvimica Jan 2, 2025
ab5f055
Set sample selection strategy for sample-focused methods, add greedy-…
rvashurin Jan 9, 2025
f973da8
Fix class names
rvashurin Jan 9, 2025
5e7df32
Fix naming
rvashurin Jan 10, 2025
7cf569e
Add missing stats to experimental configs
rvashurin Jan 10, 2025
faf3e14
Make experiments work
rvashurin Jan 10, 2025
3074b6a
Add falcon model
rvashurin Jan 10, 2025
edf9ee1
Prevent tokenizer outputting token type ids
rvashurin Jan 10, 2025
e07c62a
Save manager state between evaluation steps
rvashurin Jan 13, 2025
c3c63ad
Fix saving
rvashurin Jan 13, 2025
6f9fee1
Fix stat name issue
rvashurin Jan 14, 2025
3e7b953
Fix tokensars, add greedy-based method to save stat
rvashurin Jan 15, 2025
0bba562
Add rouge sim matrix calculator, some fixes
rvashurin Jan 20, 2025
5701a47
Add align matrix
rvashurin Jan 21, 2025
7ab3691
add tqdm
rvashurin Jan 21, 2025
90eacd1
Fix issue with empty samples
rvashurin Jan 21, 2025
823ba26
Add ablation-related methods
rvashurin Jan 29, 2025
5906266
Final fixes before submit
rvashurin Jan 31, 2025
0ab5eaf
Uncommited changes from cluster
rvashurin Feb 18, 2025
d03d080
Fixed x metric for samples
silvimica Mar 27, 2025
bc52d3e
Merge pull request #6 from silvimica/cocoa_x_metric_fixed
rvashurin Mar 27, 2025
fc90a1e
Add semantic density
rvashurin Mar 27, 2025
3b1bdcb
Remove breakpoint
rvashurin Mar 27, 2025
2510325
Fix some typos
rvashurin Mar 27, 2025
a7bc19c
Gpt as a judge + Fixes to X Metric 24
silvimica Mar 29, 2025
75ad725
Polygraph eval code + remove redundant funcion form x metric
silvimica Mar 29, 2025
fb8082b
Merge branch 'sent_sar_variants' into cocoa_x_metric_fixed
silvimica Mar 29, 2025
aa577fc
Add multiref support without aggregation, some other tweaks
rvashurin Mar 29, 2025
66fc737
Show metricx progress
rvashurin Mar 29, 2025
dae4f26
Fix tqdm
rvashurin Mar 29, 2025
dd6bac4
Fix loading manager with torch 2.6+
rvashurin Mar 30, 2025
af8ce4c
Fix greedy semantic dens
rvashurin Mar 30, 2025
ff96ac9
Turn semantic density around
rvashurin Mar 30, 2025
c652a51
Fix gpt naming
rvashurin Mar 30, 2025
ff1ac61
Merge pull request #7 from silvimica/cocoa_x_metric_fixed
rvashurin Mar 30, 2025
886c881
Merge branch 'sent_sar_variants' of github.com:silvimica/lm-polygraph…
rvashurin Mar 30, 2025
7 changes: 4 additions & 3 deletions examples/configs/model/default_causal.py
@@ -1,9 +1,10 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch


-def load_model(model_path: str, device_map: str):
+def load_model(model_path: str, device_map: str, dtype: str = "float32"):
+    dtype = getattr(torch, dtype)
     model = AutoModelForCausalLM.from_pretrained(
-        model_path, trust_remote_code=True, device_map=device_map
+        model_path, trust_remote_code=True, device_map=device_map, torch_dtype=dtype
     )
     model.eval()

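The dtype change above lets a model YAML select the torch precision as a plain string. A minimal sketch of that resolution step in isolation (not part of the diff; it assumes only that torch is installed):

import torch

def resolve_dtype(name: str = "float32") -> torch.dtype:
    # Mirrors the diff's getattr(torch, dtype): the string coming from the YAML
    # ("bfloat16", "float16", "float32", ...) becomes a torch.dtype object.
    return getattr(torch, name)

assert resolve_dtype("bfloat16") is torch.bfloat16
assert resolve_dtype() is torch.float32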
11 changes: 11 additions & 0 deletions examples/configs/model/falcon3.yaml
@@ -0,0 +1,11 @@
defaults:
- default

path: tiiuae/Falcon3-7B-Base
type: CausalLM
path_to_load_script: model/default_causal.py

load_model_args:
  device_map: balanced_low_0
  dtype: bfloat16
load_tokenizer_args: {}
11 changes: 11 additions & 0 deletions examples/configs/model/llama.yaml
@@ -0,0 +1,11 @@
defaults:
- default

path: meta-llama/Meta-Llama-3.1-8B
type: CausalLM
path_to_load_script: model/default_causal.py

load_model_args:
  device_map: balanced_low_0
  dtype: bfloat16
load_tokenizer_args: {}
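The diff does not show how these new model YAMLs are wired into evaluation, so the following is only a sketch of the apparent intent of the path, path_to_load_script, and load_model_args keys, loading the YAML with OmegaConf directly rather than through the project's Hydra entry point; the file paths and module name are illustrative assumptions:

import importlib.util
from omegaconf import OmegaConf

# Hypothetical path to the config added in this PR.
cfg = OmegaConf.load("examples/configs/model/llama.yaml")

# Load the script referenced by path_to_load_script and fetch its load_model().
spec = importlib.util.spec_from_file_location(
    "model_loader", "examples/configs/model/default_causal.py"
)
loader = importlib.util.module_from_spec(spec)
spec.loader.exec_module(loader)

# device_map="balanced_low_0" and dtype="bfloat16" come from load_model_args.
# Note: this downloads the Llama checkpoint from the Hugging Face Hub.
model = loader.load_model(cfg.path, **cfg.load_model_args)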
162 changes: 162 additions & 0 deletions examples/configs/polygraph_eval_coqa_sentsar.yaml
@@ -0,0 +1,162 @@
hydra:
  run:
    dir: ${cache_path}/coqa/${model.path}/${dataset}/${now:%Y-%m-%d}/${now:%H-%M-%S}

defaults:
- model: bloomz-560m
- _self_

cache_path: ./workdir/output
save_path: '${hydra:run.dir}'

task: qa

base_manager: null
overwrite_base_estimations: false

dataset: coqa
text_column: questions
label_column: answers
description: "The following are stories and questions about them. Each story is followed by a question and answer to a given question.\n\nStory: {story}"
prompt: "Question: {question}\nAnswer:{answer}"
train_split: train
eval_split: validation
max_new_tokens: 20
load_from_disk: false
normalize: true
generation_params:
  generate_until:
    - "\n"
save_stats:
- greedy_tokens
- greedy_log_likelihoods
- greedy_tokens_alternatives
- greedy_sentence_similarity
- token_similarity
- entropy
- sample_tokens
- sample_tokens_alternatives
- sample_texts
- sample_log_probs
- sample_log_likelihoods
- sample_sentence_similarity
- sample_token_similarity
- sample_entropy
- first_sample_texts
- best_sample_texts
- best_sample_text_ids
- best_normalized_sample_texts
- best_normalized_sample_text_ids
entropy_top_k: 50

train_dataset: null
train_test_split: false
test_split_size: 1

background_train_dataset: allenai/c4
background_train_dataset_text_column: text
background_train_dataset_label_column: url
background_train_dataset_data_files: en/c4-train.00000-of-01024.json.gz
background_load_from_disk: false

subsample_background_train_dataset: 1000
subsample_train_dataset: 1000
subsample_eval_dataset: -1

use_density_based_ue: false
use_seq_ue: false
use_tok_ue: false
use_ens_ue: false
generation_metrics: null

additional_estimators:
  - module: lm_polygraph.estimators.monte_carlo_sequence_entropy
    class_name: MonteCarloSequenceEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.monte_carlo_normalized_sequence_entropy
    class_name: MonteCarloNormalizedSequenceEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.semantic_entropy
    class_name: SemanticEntropy
    kwargs: {}

  - module: lm_polygraph.estimators.max_probability
    class_name: MaximumSequenceProbability
    kwargs: {}
  - module: lm_polygraph.estimators.max_probability
    class_name: SampledMaximumSequenceProbability
    kwargs: {}
  - module: lm_polygraph.estimators.sentence_sar
    class_name: SentenceSAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: MaxprobGSU
    kwargs: {}

  - module: lm_polygraph.estimators.token_sar
    class_name: TokenSAR
    kwargs: {}
  - module: lm_polygraph.estimators.token_sar
    class_name: SampledTokenSAR
    kwargs: {}
  - module: lm_polygraph.estimators.sar
    class_name: SAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: TokenSARGSU
    kwargs: {}

  - module: lm_polygraph.estimators.perplexity
    class_name: Perplexity
    kwargs: {}
  - module: lm_polygraph.estimators.perplexity
    class_name: SampledPerplexity
    kwargs: {}
  - module: lm_polygraph.estimators.sentence_sar
    class_name: PPLSAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: PPLGSU
    kwargs: {}

  - module: lm_polygraph.estimators.token_entropy
    class_name: MeanTokenEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.token_entropy
    class_name: SampledMeanTokenEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.sentence_sar
    class_name: MTESAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: MTEGSU
    kwargs: {}

  - module: lm_polygraph.estimators.average_ue
    class_name: AveMaxprob
    kwargs: {}
  - module: lm_polygraph.estimators.average_ue
    class_name: AvePPL
    kwargs: {}
  - module: lm_polygraph.estimators.average_ue
    class_name: AveTokenSAR
    kwargs: {}
  - module: lm_polygraph.estimators.average_ue
    class_name: AveMTE
    kwargs: {}

  - module: lm_polygraph.estimators.semantic_average_ue_average_similarity
    class_name: SemanticAveMaxprobAveSimilarity
    kwargs: {}

  - module: lm_polygraph.estimators.greedy_semantic_average_ue_average_similarity
    class_name: GreedySemanticAveMaxprobAveSimilarity
    kwargs: {}

ignore_exceptions: false

batch_size: 1
deberta_batch_size: 1

seed:
- 1
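Each additional_estimators entry gives a module path, a class name, and constructor kwargs. The actual loader inside lm-polygraph's polygraph_eval script may differ, but a plausible sketch of how such entries become estimator instances is a plain dynamic import:

import importlib

def build_estimators(entries):
    # entries: the parsed additional_estimators list from the YAML above.
    estimators = []
    for entry in entries:
        module = importlib.import_module(entry["module"])
        cls = getattr(module, entry["class_name"])
        estimators.append(cls(**entry.get("kwargs", {})))
    return estimators

# e.g. build_estimators([{"module": "lm_polygraph.estimators.sar",
#                         "class_name": "SAR", "kwargs": {}}])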
166 changes: 166 additions & 0 deletions examples/configs/polygraph_eval_gsm8k_sentsar_cot.yaml
@@ -0,0 +1,166 @@
hydra:
  run:
    dir: ${cache_path}/gsm8k_cot/${model}/${dataset}/${now:%Y-%m-%d}/${now:%H-%M-%S}

defaults:
- model: bloomz-560m
- _self_

cache_path: ./workdir/output
save_path: '${hydra:run.dir}'

task: qa

base_manager: null
overwrite_base_estimations: false

dataset: [gsm8k, main]
text_column: question
label_column: answer
prompt: "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\nQ: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\nQ: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nA: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.\n\nQ: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\nA: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.\n\nQ: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\nA: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\nQ: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\nA: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.\n\nQ: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\nQ: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\nQ: {question}\nA:"
train_split: train
few_shot_split: train
eval_split: test
max_new_tokens: 256
load_from_disk: false
n_shot: 0
normalize: true
generation_params:
  generate_until:
    - "\n"
save_stats:
- greedy_tokens
- greedy_log_likelihoods
- greedy_tokens_alternatives
- greedy_sentence_similarity
- token_similarity
- entropy
- sample_tokens
- sample_tokens_alternatives
- sample_texts
- sample_log_probs
- sample_log_likelihoods
- sample_sentence_similarity
- sample_token_similarity
- sample_entropy
- first_sample_texts
- best_sample_texts
- best_sample_text_ids
- best_normalized_sample_texts
- best_normalized_sample_text_ids
entropy_top_k: 50

target_ignore_regex: "(?s).*#### "
output_ignore_regex: "(?s).*The answer is "

train_dataset: null
train_test_split: false
test_split_size: 1

background_train_dataset: allenai/c4
background_train_dataset_text_column: text
background_train_dataset_label_column: url
background_train_dataset_data_files: en/c4-train.00000-of-01024.json.gz
background_load_from_disk: false

subsample_background_train_dataset: 1000
subsample_train_dataset: 1000
subsample_eval_dataset: -1

use_density_based_ue: false
use_seq_ue: false
use_tok_ue: false
use_ens_ue: false
generation_metrics: null

additional_estimators:
  - module: lm_polygraph.estimators.monte_carlo_sequence_entropy
    class_name: MonteCarloSequenceEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.monte_carlo_normalized_sequence_entropy
    class_name: MonteCarloNormalizedSequenceEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.semantic_entropy
    class_name: SemanticEntropy
    kwargs: {}

  - module: lm_polygraph.estimators.max_probability
    class_name: MaximumSequenceProbability
    kwargs: {}
  - module: lm_polygraph.estimators.max_probability
    class_name: SampledMaximumSequenceProbability
    kwargs: {}
  - module: lm_polygraph.estimators.sentence_sar
    class_name: SentenceSAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: MaxprobGSU
    kwargs: {}

  - module: lm_polygraph.estimators.token_sar
    class_name: TokenSAR
    kwargs: {}
  - module: lm_polygraph.estimators.token_sar
    class_name: SampledTokenSAR
    kwargs: {}
  - module: lm_polygraph.estimators.sar
    class_name: SAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: TokenSARGSU
    kwargs: {}

  - module: lm_polygraph.estimators.perplexity
    class_name: Perplexity
    kwargs: {}
  - module: lm_polygraph.estimators.perplexity
    class_name: SampledPerplexity
    kwargs: {}
  - module: lm_polygraph.estimators.sentence_sar
    class_name: PPLSAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: PPLGSU
    kwargs: {}

  - module: lm_polygraph.estimators.token_entropy
    class_name: MeanTokenEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.token_entropy
    class_name: SampledMeanTokenEntropy
    kwargs: {}
  - module: lm_polygraph.estimators.sentence_sar
    class_name: MTESAR
    kwargs: {}
  - module: lm_polygraph.estimators.gsu
    class_name: MTEGSU
    kwargs: {}

  - module: lm_polygraph.estimators.average_ue
    class_name: AveMaxprob
    kwargs: {}
  - module: lm_polygraph.estimators.average_ue
    class_name: AvePPL
    kwargs: {}
  - module: lm_polygraph.estimators.average_ue
    class_name: AveTokenSAR
    kwargs: {}
  - module: lm_polygraph.estimators.average_ue
    class_name: AveMTE
    kwargs: {}

  - module: lm_polygraph.estimators.semantic_average_ue_average_similarity
    class_name: SemanticAveMaxprobAveSimilarity
    kwargs: {}

  - module: lm_polygraph.estimators.greedy_semantic_average_ue_average_similarity
    class_name: GreedySemanticAveMaxprobAveSimilarity
    kwargs: {}

ignore_exceptions: false

batch_size: 1
deberta_batch_size: 1

seed:
- 1
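The gsm8k config strips chain-of-thought text with target_ignore_regex and output_ignore_regex so that only the final answer is compared. Assuming the patterns are applied with re.sub (the usual way such ignore-regexes are used; not confirmed by this diff), they behave as in this small illustration with made-up strings:

import re

target_ignore_regex = r"(?s).*#### "
output_ignore_regex = r"(?s).*The answer is "

reference = "Natalia sold 48 clips in April and half as many in May.\n#### 72"
generation = "She sold 48 + 24 = 72 clips. The answer is 72."

# The greedy (?s).* consumes everything up to and including the final marker,
# so only the answer that follows it survives.
print(re.sub(target_ignore_regex, "", reference))   # -> "72"
print(re.sub(output_ignore_regex, "", generation))  # -> "72."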