LLM360/K2-Chat

@@ -17,31 +17,31 @@ We utilized the following datasets:
 |                         | K2-Chat-060124 | K2-Chat |
 |-------------------------|---------|----------|
 | **Natural Language Benchmarks** |         |          |
-| MMLU (0-shot)           | 63.5    | 69.14    |
-| RACE (0-shot)           | 46.1    | 46.60    |
-| HellaSwag (10-shot)     | 81.7    | 80.80    |
-| PIQA (5-shot)           | 82.3    | 81.34    |
-| ARC-easy (5-shot)       | 84.6    | 79.00    |
-| ARC-challenge (25-shot) | 61.3    | 61.09    |
-| OpenBookQA (5-shot)     | 48.0    | 47.00    |
-| Winogrande (5-shot)     | 79.5    | 78.30    |
-| TruthfulQA (0-shot)     | 44.7    | 57.32    |
-| CrowS-Pairs (0-shot)    | 64.2    | 65.32    |
-| GSM8K (5-shot)          | 60.7    | 77.10    |
-| MathQA (5-shot)         | 44.8    | 43.12    |
-| LogiQA2.0 (0-shot)      | 38.0    | 36.83    |
-| BBH CoT (0-shot)        | 64.9    | 70.37    |
 | **Code Benchmarks**     |         |          |
-| HumanEval (pass@1)      | 47.9    | 71.20    |
 | **Domain Specific (Medical)** |   |          |
-| MedQA (0-shot)          | 53.6    | 52.87    |
-| MedMCQA (5-shot)        | 51.3    | 50.71    |
-| PubMedQA (0-shot)       | 75.0    | 71.20    |
 | **Other**               |         |          |
-| MT-Bench               | 6.87     | 7.55     |
-| JSON-Mode-Eval          | 77.21   | 90.09    |
 | **Overall Average Score**|         |          |
-| Avg Score               | 58.88   | 61.30    |
 # Function Calling

 |                         | K2-Chat-060124 | K2-Chat |
 |-------------------------|---------|----------|
 | **Natural Language Benchmarks** |         |          |
+| MMLU (0-shot)           | 63.5    | **69.14**    |
+| RACE (0-shot)           | 46.1    | **46.60**    |
+| HellaSwag (10-shot)     | **81.7**    | 80.80    |
+| PIQA (5-shot)           | **82.3**    | 81.34    |
+| ARC-easy (5-shot)       | **84.6**    | 79.00    |
+| ARC-challenge (25-shot) | **61.3**    | 61.09    |
+| OpenBookQA (5-shot)     | **48.0**    | 47.00    |
+| Winogrande (5-shot)     | **79.5**    | 78.30    |
+| TruthfulQA (0-shot)     | 44.7    | **57.32**    |
+| CrowS-Pairs (0-shot)    | 64.2    | **65.32**    |
+| GSM8K (5-shot)          | 60.7    | **77.10**    |
+| MathQA (5-shot)         | **44.8**    | 43.12    |
+| LogiQA2.0 (0-shot)      | **38.0**    | 36.83    |
+| BBH CoT (0-shot)        | 64.9    | **70.37**    |
 | **Code Benchmarks**     |         |          |
+| HumanEval (pass@1)      | 47.9    | **71.20**    |
 | **Domain Specific (Medical)** |   |          |
+| MedQA (0-shot)          | **53.6**    | 52.87    |
+| MedMCQA (5-shot)        | **51.3**    | 50.71    |
+| PubMedQA (0-shot)       | **75.0**    | 71.20    |
 | **Other**               |         |          |
+| MT-Bench               | 6.87     | **7.55**     |
+| JSON-Mode-Eval          | 77.21   | **90.09**    |
 | **Overall Average Score**|         |          |
+| Avg Score               | 58.88   | **61.30**    |
 # Function Calling