Added confidence interval calculation #6
by giannor - opened

README.md CHANGED
@@ -43,7 +43,9 @@ and two models from the Pleias family ([Pleias-350M-Preview](https://huggingface
 All comparison models were trained exclusively on open data, either in the public domain or under a permissive license.
 
 The following tables show the performance on each dataset.
-For each, we report the respective main metric from EuroEval and the confidence interval.
+For each, we report the respective main metric from EuroEval and the confidence interval.
+The latter is calculated as the mean of the metric scores across all evaluation runs ± 1.96 times the standard error of the mean:
+$$\hat{\mu} \pm 1.96 \times SEM \quad \textrm{where} \quad SEM = \frac{s}{\sqrt{n}} \quad \textrm{and} \quad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{\mu})^2}{n-1}} \quad \textrm{and} \quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
 
 | Model | scala-da (MCC)| dala (MCC) | angry-tweets (MCC) | dansk (Micro F1, No Misc) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) | average |
 | ----------------------------------- | ------------- | ------------- | ------------------ | ------------------------- | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- | ------- |
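
For readers who want to check the formula added in this change, here is a minimal sketch of the calculation in Python. It assumes the per-run metric scores are available as a plain list of floats. The function name `confidence_interval` and the sample scores are hypothetical and are not taken from EuroEval or from the README. The factor 1.96 corresponds to an approximate 95% interval under a normal approximation.

```python
import math
from statistics import mean, stdev


def confidence_interval(scores, z=1.96):
    """Return (mean, half_width) so that the interval is mean ± half_width.

    half_width = z * SEM, where SEM = s / sqrt(n) and s is the sample
    standard deviation (n - 1 in the denominator), as in the formula above.
    """
    n = len(scores)
    if n < 2:
        # With a single run the sample standard deviation is undefined.
        raise ValueError("need at least two runs to estimate the standard error")
    mu = mean(scores)
    s = stdev(scores)  # statistics.stdev uses the n - 1 denominator
    sem = s / math.sqrt(n)
    return mu, z * sem


# Hypothetical scores from five evaluation runs (illustrative values only).
scores = [0.40, 0.42, 0.38, 0.44, 0.36]
mu, half_width = confidence_interval(scores)
print(f"{mu:.2f} ± {half_width:.2f}")
```

Using `statistics.stdev` rather than `statistics.pstdev` matters here: the former divides by n - 1, which matches the definition of s in the added formula.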