Jellyfish042 committed
Commit 14e0ea5 · 1 Parent(s): eb54354

brand new version

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. __pycache__/data_manager.cpython-311.pyc +0 -0
  2. __pycache__/title.cpython-311.pyc +0 -0
  3. about.md +0 -10
  4. app.py +196 -258
  5. data/2024-10/7b.xlsx +0 -0
  6. data/2024-10/xb.xlsx +0 -0
  7. data/2025-12/2025-12-21_11-34-39.json +24 -0
  8. data/2025-12/2025-12-21_11-35-15.json +24 -0
  9. data/2025-12/2025-12-21_11-36-04.json +24 -0
  10. data/2025-12/2025-12-21_11-36-44.json +24 -0
  11. data/2025-12/2025-12-21_11-37-00.json +24 -0
  12. data/2025-12/2025-12-21_11-37-31.json +24 -0
  13. data/2025-12/2025-12-21_11-37-59.json +24 -0
  14. data/2025-12/2025-12-21_11-38-27.json +24 -0
  15. data/2025-12/2025-12-21_11-38-57.json +24 -0
  16. data/2025-12/2025-12-21_11-39-11.json +24 -0
  17. data/2025-12/2025-12-21_11-39-42.json +26 -0
  18. data/2025-12/2025-12-21_11-40-01.json +26 -0
  19. data/2025-12/2025-12-21_11-40-26.json +26 -0
  20. data/2025-12/2025-12-21_11-40-48.json +26 -0
  21. data/2025-12/2025-12-21_11-41-02.json +26 -0
  22. data/2025-12/2025-12-21_11-41-20.json +26 -0
  23. data/2025-12/2025-12-21_11-41-38.json +26 -0
  24. data/2025-12/2025-12-21_11-41-55.json +26 -0
  25. data/2025-12/2025-12-21_11-42-12.json +26 -0
  26. data/2025-12/2025-12-21_11-42-26.json +26 -0
  27. data/2025-12/2025-12-21_11-42-49.json +26 -0
  28. data/2025-12/2025-12-21_11-43-05.json +26 -0
  29. data/2025-12/2025-12-21_11-43-28.json +26 -0
  30. data/2025-12/2025-12-21_11-43-47.json +26 -0
  31. data/2025-12/2025-12-21_11-43-58.json +26 -0
  32. data/2025-12/2025-12-21_11-44-14.json +26 -0
  33. data/2025-12/2025-12-21_11-44-30.json +26 -0
  34. data/2025-12/2025-12-21_11-44-45.json +26 -0
  35. data/2025-12/2025-12-21_11-45-01.json +26 -0
  36. data/2025-12/2025-12-21_11-45-11.json +26 -0
  37. data/2025-12/2025-12-21_11-45-38.json +26 -0
  38. data/2025-12/2025-12-21_11-45-55.json +26 -0
  39. data/2025-12/2025-12-21_11-46-17.json +26 -0
  40. data/2025-12/2025-12-21_11-46-35.json +26 -0
  41. data/2025-12/2025-12-21_11-46-50.json +26 -0
  42. data/2025-12/2025-12-21_11-47-06.json +26 -0
  43. data/2025-12/2025-12-21_11-47-21.json +26 -0
  44. data/2025-12/2025-12-21_11-47-36.json +26 -0
  45. data/2025-12/2025-12-21_11-47-52.json +26 -0
  46. data/2025-12/2025-12-21_11-48-04.json +26 -0
  47. data/2025-12/2025-12-21_11-48-25.json +26 -0
  48. data/2025-12/2025-12-21_11-48-37.json +26 -0
  49. data/2025-12/2025-12-21_11-48-52.json +26 -0
  50. data/2025-12/2025-12-21_11-49-05.json +26 -0
__pycache__/data_manager.cpython-311.pyc ADDED
Binary file (14.8 kB)

__pycache__/title.cpython-311.pyc ADDED
Binary file (873 Bytes)

about.md CHANGED
@@ -24,13 +24,3 @@ Therefore, the compression rate of a model can be directly calculated through th

  ### Can Models Using Different Tokenizers Be Directly Compared?
  Yes. When calculating the sum of negative log probabilities, we essentially treat the model + tokenizer as a single entity or system. As long as this system has a high probability of generating real text, we consider it better. From the perspective of compression, you can choose any tokenizer. From the compression rate perspective, we don't care; we only care about whether your system can compress the text more effectively.
-
- ### Is It Really Uncheatable? Can't I train my model on a large number of arXiv papers to improve its test performance on arXiv papers?
- Uncheatable Eval's data sources currently include new arXiv papers, new GitHub projects, BBC news, AO3 fanfictions, and new Wikipedia entries, with more sources to be added in the future. If you genuinely achieve excellent results across these data by training extensively on these sources, I would consider you to have developed a genuinely good language model rather than cheating.
-
- From my test results, accurately modeling these data is very challenging. I believe Uncheatable Eval more accurately reflects the value of every bit of data and computing you invest compared to other benchmarks. Models trained with more data and computing are almost always better, and there are no shortcuts. This is a key strength of Uncheatable Eval.
-
- ### Is This Too "Random"? Why Consider Random Texts from the Internet as Ground Truth?
- This is why we choose rigorous and verified texts such as arXiv papers and news reports, which typically have better quality. Additionally, a round of Uncheatable Eval evaluates a model over millions of tokens, increasing the reliability of the results.
-
- In fact, the model rankings obtained through Uncheatable Eval are very stable. For instance, the model ranked first in January's data is highly likely to remain first in February, March, April, May, and June, indicating that the data obtained through this method is sufficiently representative.
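(The about.md text above scores a model by the summed negative log probability it assigns to fresh text. As a minimal sketch of how that sum relates to the "compression_rate" values stored in the result JSONs added under data/2025-12/ in this commit, assuming the logged sum is in nats, the arithmetic reproduces the stored field:)

```python
import math

# Values copied from data/2025-12/2025-12-21_11-34-39.json (added in this commit).
neg_log_prob_sum = 4649.08   # summed negative log-probability, assumed to be in nats
avg_bytes = 8012.242         # average UTF-8 byte count of the evaluated samples

bits = neg_log_prob_sum / math.log(2)                 # convert nats to bits
compression_rate = 100.0 * bits / (8.0 * avg_bytes)   # compressed bits vs. raw bits
print(f"{compression_rate:.3f} %")                    # ~10.464, matching the stored "compression_rate"
```

(Dividing the same bit count by the "avg character count" field instead of 8 × avg bytes would give bits per character, the leaderboard's BPC metric.)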
 
app.py CHANGED
@@ -1,36 +1,22 @@
+ from operator import is_
  import pandas as pd
  import gradio as gr
  import os
- import re
  import requests
  from dotenv import load_dotenv
  from matplotlib.colors import LinearSegmentedColormap
- import plotly.express as px
  import plotly.graph_objects as go
- # from sklearn.linear_model import LinearRegression
  import numpy as np
  from huggingface_hub import HfApi
  from huggingface_hub.hf_api import HTTPError
  from huggingface_hub.utils import GatedRepoError
  from gradio_rangeslider import RangeSlider
  import datetime
- from gradio.themes.utils.colors import slate
+ from title import css, TITLE_HTML, SUBTITLE_HTML
+ from data_manager import DataManager

  load_dotenv()
  webhook_url = os.environ.get("WEBHOOK_URL")
- file_name_list = [
-     "14b",
-     "9b",
-     "7b",
-     "3b",
-     "1b5",
-     "other",
- ]
- sheet_name_list = [
-     "cr",
-     "bpc",
-     "bpb",
- ]
  metric_list = [
      "Compression Rate (%)",
      "Bits Per Character (BPC)",
@@ -58,92 +44,54 @@ model_size_to_file_name = {
      "Other": "other",
  }

+
  def read_about_md():
-     with open('about.md', 'r', encoding='utf-8') as f:
+     with open("about.md", "r", encoding="utf-8") as f:
          return f.read()

- def rename_columns(df):
-     df.columns = [col.rsplit("_", maxsplit=1)[0] for col in df.columns]
-     return df
-
- def get_folders_matching_format(directory):
-     pattern = re.compile(r"^\d{4}-\d{2}$")
-     folders = []
-     if not os.path.exists(directory):
-         return folders
-     for item in os.listdir(directory):
-         full_path = os.path.join(directory, item)
-         if os.path.isdir(full_path) and pattern.match(item):
-             folders.append(full_path)
-     return folders
-
- def get_unique_column_names(data=None):
-     return [
-         "ao3_\u200benglish",
-         "bbc_\u200bnews",
-         "wikipedia_\u200benglish",
-         "arxiv_\u200bcomputer_\u200bscience",
-         "arxiv_\u200bphysics",
-         "github_\u200bcpp",
-         "github_\u200bpython",
-     ]
-
- def color_cell(value):
-     return "background-color: #fffdd0" if pd.notna(value) else "default"
-
- # def color_cell_themed(value):
- #     return "background-color: rgba(255, 253, 208, 1.0)" if pd.notna(value) else "default"
-
- # --- Key change 1: modify the update_table function ---
- # Add a request: gr.Request = None parameter to receive the theme-mode information
- # The default of None handles the initial load
- def update_table(period: str, models_size: list, metric: str, visible_columns: list, color_columns: list, size_range: list, midpoint: float = 0.5, sort_by: str = "Average (lower=better)", ascending: bool = True, request: gr.Request = None):
-     # Log the call and check the current mode
+
+ def update_table(
+     data_manager: DataManager,
+     period: str,
+     models_size: list,
+     metric: str,
+     visible_columns: list,
+     color_columns: list,
+     size_range: list,
+     midpoint: float = 0.5,
+     ascending: bool = True,
+     request: gr.Request = None,
+ ):
      is_dark_mode = request.is_dark if request else False
-     print(f"Updating - time: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, period: {period}, models: {models_size}, metric: {metric}, visible_columns: {visible_columns}, color_columns: {color_columns}, size_range: {size_range}, sort_by: {sort_by}, ascending: {ascending}, is_dark: {is_dark_mode}\n")
+     print(
+         f"Updating - time: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}, period: {period}, models: {models_size}, metric: {metric}, visible_columns: {visible_columns}, color_columns: {color_columns}, size_range: {size_range}, ascending: {ascending}, is_dark: {is_dark_mode}\n"
+     )

-     if not models_size:
-         return "No data available for the selected models and period."
-
-     target_period_data = all_data[period]
      target_file_name = [model_size_to_file_name[model] for model in models_size]
-     sheet_name = metric_to_sheet[metric]
-     combined_data = pd.concat([df.dropna(axis=1, how="all") for df in [target_period_data[file_name][sheet_name] for file_name in target_file_name]], axis=0)
-
-     if len(combined_data) == 0:
-         return "No data available for the selected models and period."
-
-     combined_data = combined_data[combined_data["Parameters Count (B)"].between(size_range[0], size_range[1])]
-     combined_data.reset_index(drop=True, inplace=True)
-
-     if len(combined_data) == 0:
+     metric_code = metric_to_sheet[metric]
+
+     # Filter out column names that are not available in the current period, to avoid errors
+     if visible_columns:
+         available_columns = data_manager.get_available_columns(period)
+         visible_columns = [col for col in visible_columns if col in available_columns]
+
+     filtered_data = data_manager.query(
+         period=period,
+         metric_code=metric_code,
+         param_range=(size_range[0], size_range[1]),
+         model_groups=target_file_name,
+         visible_columns=visible_columns,
+     )
+
+     if len(filtered_data) == 0:
          return "No data available for the selected models and period."
-
-     combined_data["Name"] = combined_data["Name"].apply(lambda x: x.replace(".pth", ""))
-     ordered_columns = get_unique_column_names()
-     relevant_columns = [col for col in ordered_columns if col in visible_columns and col not in ["Name", "Parameters Count (B)", "Average (The lower the better)"]]
-
-     if len(combined_data) > 0 and relevant_columns:
-         combined_data["Average (The lower the better)"] = round(combined_data[relevant_columns].mean(axis=1), 3)
-
-     combined_data = combined_data.rename(columns={"Parameters Count (B)": "Params (B)", "Average (The lower the better)": "Average (lower=better)"})
-     sorted_data = combined_data.sort_values(by=sort_by, ascending=ascending)
-     visible_columns_final = ["Name", "Params (B)", "Average (lower=better)"] + relevant_columns
-     filtered_data = sorted_data[visible_columns_final]
-     filtered_data.columns = [col.replace("_", " ") for col in filtered_data.columns]
-     formatter = {col: "{:.3f}" for col in filtered_data.columns if filtered_data[col].dtype in ["float64", "float32"]}
-
-     # --- Key change 2: pick a different color scheme depending on the theme mode ---
-     if is_dark_mode:
-         # Dark-mode palette (green -> dark gray -> red)
-         colors = ["#2ca02c", "#2b2b2b", "#d62728"]
-     else:
-         # Light-mode palette (green -> white -> red)
-         colors = ["#63be7b", "#ffffff", "#f8696b"]
-
+
+     colors = ["#2ca02c", "#2b2b2b", "#d62728"] if is_dark_mode else ["#63be7b", "#ffffff", "#f8696b"]
+
      vmin, vmax, vmid = {}, {}, {}
      for column in filtered_data.columns:
-         if column in ["Name", "Params (B)"]: continue
+         if column in ["Name", "Params (B)"]:
+             continue
          col_values = filtered_data[column].dropna()
          if len(col_values) > 1:
              sorted_values = np.sort(col_values)
@@ -152,93 +100,84 @@ def update_table(period: str, models_size: list, metric: str, visible_columns: l
              idx = int(len(sorted_values) * midpoint)
              vmid[column] = sorted_values[idx]

-     # --- Key change 3: modify the style function to force a fixed black font color ---
      def custom_background_gradient(series, cmap, vmin_val, vmax_val, vmid_val):
-         if len(series) == 0: return series
+         if len(series) == 0:
+             return series
+
          def normalize(x):
-             if pd.isna(x): return 0.5  # Neutral for NaN
-             if vmid_val == vmin_val and x <= vmid_val: return 0.0
-             if vmid_val == vmax_val and x >= vmid_val: return 1.0
-             if vmid_val == vmin_val or vmid_val == vmax_val: return 0.5
+             if pd.isna(x):
+                 return 0.5  # Neutral for NaN
+             if vmid_val == vmin_val and x <= vmid_val:
+                 return 0.0
+             if vmid_val == vmax_val and x >= vmid_val:
+                 return 1.0
+             if vmid_val == vmin_val or vmid_val == vmax_val:
+                 return 0.5
              if x <= vmid_val:
                  return 0.5 * (x - vmin_val) / (vmid_val - vmin_val)
              else:
                  return 0.5 + 0.5 * (x - vmid_val) / (vmax_val - vmid_val)
+
          normed = series.apply(normalize)
          cmap_colors = [cmap(x) for x in normed]
-         # Set the font color alongside background-color in the returned CSS
-         return [
-             "background-color: rgba({}, {}, {}, {}); color: black;".format(*[int(255 * c) for c in color[:3]], color[3])
-             for color in cmap_colors
-         ]
+         return ["background-color: rgba({}, {}, {}, {}); color: black;".format(*[int(255 * c) for c in color[:3]], color[3]) for color in cmap_colors]

      target_color_columns = []
-     if "Average" in color_columns: target_color_columns.append("Average (lower=better)")
-     if "Individual Tests" in color_columns: target_color_columns.extend([col for col in filtered_data.columns if col not in ["Name", "Params (B)", "Average (lower=better)"]])
-
+     if "Average" in color_columns:
+         target_color_columns.append("Average (lower=better)")
+     if "Individual Tests" in color_columns:
+         target_color_columns.extend([col for col in filtered_data.columns if col not in ["Name", "Params (B)", "Average (lower=better)"]])
+
      def color_params_column_dynamic(value):
          if not pd.notna(value):
              return "default"
-
-         # 2. Return a different color depending on is_dark_mode
+
          if is_dark_mode:
-             # A soft, muted dark gold for dark mode
-             # The font color is also light to keep enough contrast
              return "background-color: #4b4936; color: #f0f0f0;"
          else:
-             # A bright cream background with a black font for light mode
              return "background-color: #fffdd0; color: black;"
-
-     styler = filtered_data.style.format(formatter).map(color_params_column_dynamic, subset=["Params (B)"])
+
+     formatter = {col: "{:.3f}" for col in filtered_data.columns if filtered_data[col].dtype in ["float64", "float32"]}
+     styler = filtered_data.style.format(formatter)
+     styler = styler.map(color_params_column_dynamic, subset=["Params (B)"])
      for column in target_color_columns:
          if column in vmin:
              custom_cmap = LinearSegmentedColormap.from_list("custom_cmap", colors)
-             styler = styler.apply(custom_background_gradient, cmap=custom_cmap, vmin_val=vmin[column], vmax_val=vmax[column], vmid_val=vmid[column], subset=[column])
-
+             styler = styler.apply(
+                 custom_background_gradient, cmap=custom_cmap, vmin_val=vmin[column], vmax_val=vmax[column], vmid_val=vmid[column], subset=[column]
+             )
+
      styler = styler.hide(axis="index")
-     widths = [300, 150, 150, 100, 100, 100, 100, 100, 100, 100, 100]
-
+     widths = [250, 80, 80, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70]
+
      table_styles = []
-     table_styles.append({"selector": "th", "props": [("background-color", "var(--background-fill-secondary)"), ("color", "var(--body-text-color)"), ("padding", "8px"), ("font-weight", "bold")]})
+     table_styles.append(
+         {
+             "selector": "th",
+             "props": [
+                 ("background-color", "var(--background-fill-secondary)"),
+                 ("color", "var(--body-text-color)"),
+                 ("padding", "8px"),
+                 ("font-weight", "bold"),
+             ],
+         }
+     )
      table_styles.append({"selector": "table", "props": [("border-collapse", "collapse"), ("border", f"1px solid var(--border-color-primary)")]})
      for i, w in enumerate(widths):
-         table_styles.append({"selector": f"th.col{i}, td.col{i}", "props": [("min-width", f"{w}px"), ("max-width", f"{w}px"), ("text-align", "center"), ("border", f"1px solid var(--border-color-primary)")]})
+         table_styles.append(
+             {
+                 "selector": f"th.col{i}, td.col{i}",
+                 "props": [
+                     ("min-width", f"{w}px"),
+                     ("max-width", f"{w}px"),
+                     ("text-align", "center"),
+                     ("border", f"1px solid var(--border-color-primary)"),
+                 ],
+             }
+         )
      styler = styler.set_table_styles(table_styles)
      return styler.to_html()

- def create_world_languages_gdp_chart():
-     languages = ["English", "Chinese", "Spanish", "Japanese", "German", "French", "Arabic", "Italian", "Portuguese", "Korean", "Other"]
-     shares = [27, 18, 8, 6, 5, 4, 3, 2, 2, 2, 23]
-     colors = ["#FF7F7F", "#FFA07A", "#FFDB58", "#90EE90", "#98FB98", "#87CEFA", "#B0C4DE", "#DDA0DD", "#D8BFD8", "#F0E68C", "#E0FFFF"]
-     fig = go.Figure(
-         data=[
-             go.Pie(
-                 labels=languages,
-                 values=shares,
-                 hole=0.3,
-                 marker=dict(colors=colors, line=dict(color="#FFFFFF", width=2)),
-                 textinfo="label+percent",
-                 textposition="outside",
-                 insidetextorientation="radial",
-                 textfont=dict(size=12),
-             )
-         ]
-     )
-     fig.update_layout(
-         title={
-             "text": "World Languages by Share of Global GDP",
-             "y": 0.95,
-             "x": 0.5,
-             "xanchor": "center",
-             "yanchor": "top",
-             "font": dict(size=20, color="black"),
-         },
-         showlegend=False,
-         width=700,
-         height=500,
-         margin=dict(t=80, b=20, l=20, r=20),
-     )
-     return fig

  def check_model_exists(model_id):
      api = HfApi()
@@ -253,6 +192,7 @@ def check_model_exists(model_id):
      else:
          return "Error: " + str(e)

+
  def submit_model(name):
      if "Exists" not in check_model_exists(name):
          return f"# ERROR: Model {name} does not exist on Hugging Face!"
@@ -271,14 +211,24 @@ def submit_model(name):
      except Exception as e:
          print(e)
          return "ERROR: Unexpected error. Please try again later."
- def create_scaling_plot(all_data, period):
-     selected_columns = ["Name", "Parameters Count (B)", "Average (The lower the better)"]
-     target_data = all_data[period]
-     new_df = pd.DataFrame()
-     for size in target_data.keys():
-         new_df = pd.concat([new_df, target_data[size]["cr"].loc[:, selected_columns].dropna(axis=1, how="all")], axis=0)
-     x_values = new_df["Parameters Count (B)"].astype(float).tolist()
-     y_values = new_df["Average (The lower the better)"].astype(float).tolist()
+
+
+ def create_scaling_plot(data_manager: DataManager, period: str):
+     new_df = data_manager.query(
+         period=period,
+         metric_code="cr",
+         param_range=(0, 40),
+         model_groups=None,
+         visible_columns=None,
+     )
+
+     if len(new_df) == 0:
+         fig = go.Figure()
+         fig.update_layout(title={"text": "Compression Rate Scaling Law", "x": 0.5}, width=800, height=600)
+         return fig
+
+     x_values = new_df["Params (B)"].astype(float).tolist()
+     y_values = new_df["Average (lower=better)"].astype(float).tolist()
      names = new_df["Name"].tolist()
      x_min, x_max = np.log10(min(x_values)), np.log10(max(x_values))
      y_min, y_max = np.log10(min(y_values)), np.log10(max(y_values))
@@ -326,100 +276,88 @@ def create_scaling_plot(all_data, period):
      )
      return fig

- def read_all_data(folder_name):
-     all_data = {}
-     time_list = []
-     for folder in get_folders_matching_format(folder_name):
-         folder_name = os.path.basename(folder)
-         time_list.append(folder_name)
-         if all_data.get(folder) is None:
-             all_data[folder_name] = {}
-         for file_name in file_name_list:
-             if all_data.get(file_name) is None:
-                 all_data[folder_name][file_name] = {}
-             for sheet_name in sheet_name_list:
-                 final_file_name = os.path.join(folder, file_name)
-                 all_data[folder_name][file_name][sheet_name] = rename_columns(pd.read_excel(final_file_name + ".xlsx", sheet_name=sheet_name))
-     return all_data, time_list
-
- all_data, time_list = read_all_data("data")
- time_list.sort()
- last_period = time_list[-1]
- initial_fig = create_scaling_plot(all_data, last_period)
- initial_metric = metric_list[0]
- initial_columns = get_unique_column_names(all_data)
- initial_colors = ["Average", "Individual Tests"]
- initial_size_range = [0, 40]
- # On the initial call to update_table, the request parameter is the default None
- initial_data = update_table(last_period, model_size_list, initial_metric, initial_columns, initial_colors, initial_size_range)
- css = """
- .gradio-container {
-     max-width: 95% !important;
-     margin: 0 auto;
- }
- .tab-buttons button {
-     font-size: 1.3em;
- }
- .gr-dataframe th {
-     white-space: normal;
-     word-break: break-word;
- }
- table {
-     margin-left: auto !important;
-     margin-right: auto !important;
-     width: 100% !important;
- }
- """
- TITLE_HTML = '<h1 style="text-align:center"><span style="font-size:1.3em">🏆 LLM Compression Leaderboard</span></h1>'
- SUBTITLE_HTML = "<h1 style='text-align:center'><span style='font-size:0.8em'>Welcome to Uncheatable Eval LLM Compression Leaderboard, where fancy fine-tuning and cheating won't work 🚫; only compute 💻, data 📊, and real innovation 🔥 can prevail!</span></h1>"
- # theme = gr.themes.Default(primary_hue=slate, secondary_hue=slate)
- theme = gr.themes.Default()
- with gr.Blocks(theme=theme, css=css) as demo:
-     gr.HTML(TITLE_HTML)
-     gr.HTML(SUBTITLE_HTML)
-     with gr.Tabs() as tabs:
-         with gr.Tab("🏆 Leaderboard"):
-             with gr.Row():
-                 with gr.Column():
-                     period_selector = gr.Dropdown(label="Period", choices=time_list, value=last_period)
-                     model_selector = gr.CheckboxGroup(label="Model Size", choices=model_size_list, value=model_size_list)
-                     size_range_slider = RangeSlider(minimum=0, maximum=40, value=[0, 40], step=0.1, label="Model Size Range")
-                     metric_selector = gr.Dropdown(label="Metric", choices=metric_list, value=initial_metric)
-                 with gr.Column():
-                     midpoint_slider = gr.Slider(minimum=0.1, maximum=0.9, value=0.5, step=0.01, label="Color Gradient Midpoint")
-                     color_selector = gr.CheckboxGroup(label="Colored Columns", choices=["Average", "Individual Tests"], value=initial_colors)
-                     colfilter = gr.CheckboxGroup(label="Data Source", choices=get_unique_column_names(all_data), value=initial_columns)
-             table = gr.HTML(initial_data)
-
-             # --- Key change 4: update all .change() events, adding gr.Request() ---
-             # Define a shared input list to avoid repetition
-             shared_inputs = [period_selector, model_selector, metric_selector, colfilter, color_selector, size_range_slider, midpoint_slider]
-
-             period_selector.change(update_table, inputs=shared_inputs, outputs=table)
-             model_selector.change(update_table, inputs=shared_inputs, outputs=table)
-             metric_selector.change(update_table, inputs=shared_inputs, outputs=table)
-             colfilter.change(update_table, inputs=shared_inputs, outputs=table)
-             color_selector.change(update_table, inputs=shared_inputs, outputs=table)
-             size_range_slider.change(update_table, inputs=shared_inputs, outputs=table)
-             midpoint_slider.change(update_table, inputs=shared_inputs, outputs=table)
-
-         with gr.Tab("🌍 MultiLang"):
-             gr.Markdown("## Coming soon...")
-             # world_languages_plot = gr.Plot(create_world_languages_gdp_chart())
-         with gr.Tab("📈 Scaling Law"):
-             period_selector_2 = gr.Dropdown(label="Period", choices=time_list, value=last_period)
-             def update_plot(period):
-                 new_fig = create_scaling_plot(all_data, period)
-                 return new_fig
-             plot = gr.Plot(initial_fig)
-             period_selector_2.change(update_plot, inputs=period_selector_2, outputs=plot)
-         with gr.Tab("ℹ️ About"):
-             gr.Markdown(read_about_md())
-         with gr.Tab("🚀 Submit"):
-             with gr.Group():
+
+ if __name__ == "__main__":
+     data_manager = DataManager("data")
+     time_list = data_manager.get_available_periods()
+     last_period = time_list[-1]
+
+     initial_fig = create_scaling_plot(data_manager, last_period) if last_period else go.Figure()
+     initial_metric = metric_list[0]
+     initial_columns = data_manager.get_available_columns(last_period)
+     initial_colors = ["Average", "Individual Tests"]
+     initial_size_range = [0, 40]
+     initial_data = update_table(data_manager, last_period, model_size_list, initial_metric, initial_columns, initial_colors, initial_size_range)
+
+     theme = gr.themes.Default()
+     with gr.Blocks(theme=theme, css=css) as demo:
+         gr.HTML(TITLE_HTML)
+         gr.HTML(SUBTITLE_HTML)
+         with gr.Tabs() as tabs:
+             with gr.Tab("🏆 Leaderboard"):
                  with gr.Row():
-                 model_name = gr.Textbox(max_lines=1, placeholder="Enter model name...", show_label=False, scale=4)
-                 submit = gr.Button("Submit", variant="primary", scale=0)
-             output = gr.Markdown("# Enter a public HF repo id, then hit Submit to add it to the evaluation queue.")
-             submit.click(fn=submit_model, inputs=model_name, outputs=output)
- demo.launch(share=False)
+                     with gr.Column():
+                         period_selector = gr.Dropdown(label="Period", choices=time_list, value=last_period)
+                         model_selector = gr.CheckboxGroup(label="Model Size", choices=model_size_list, value=model_size_list)
+                         size_range_slider = RangeSlider(minimum=0, maximum=40, value=[0, 40], step=0.1, label="Model Size Range")
+                         metric_selector = gr.Dropdown(label="Metric", choices=metric_list, value=initial_metric)
+                     with gr.Column():
+                         midpoint_slider = gr.Slider(minimum=0.1, maximum=0.9, value=0.5, step=0.01, label="Color Gradient Midpoint")
+                         color_selector = gr.CheckboxGroup(label="Colored Columns", choices=["Average", "Individual Tests"], value=initial_colors)
+                         colfilter = gr.CheckboxGroup(label="Data Source", choices=initial_columns, value=initial_columns)
+                 table = gr.HTML(initial_data)
+
+                 def update_table_wrapper(period, models_size, metric, visible_columns, color_columns, size_range, midpoint):
+                     return update_table(data_manager, period, models_size, metric, visible_columns, color_columns, size_range, midpoint)
+
+                 def update_column_choices(period, current_selected):
+                     if not period:
+                         return gr.update(choices=[], value=[])
+                     columns = data_manager.get_available_columns(period)
+                     # Keep only the selected values that still exist in the new choices
+                     if current_selected:
+                         valid_selected = [col for col in current_selected if col in columns]
+                         # If nothing survives the filter, fall back to selecting all columns (the default behaviour)
+                         if not valid_selected:
+                             valid_selected = columns
+                     else:
+                         # If there is no current selection, select all columns by default (the default behaviour)
+                         valid_selected = columns
+                     return gr.update(choices=columns, value=valid_selected)
+
+                 shared_inputs = [period_selector, model_selector, metric_selector, colfilter, color_selector, size_range_slider, midpoint_slider]
+
+                 period_selector.change(update_column_choices, inputs=[period_selector, colfilter], outputs=colfilter)
+                 period_selector.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+                 model_selector.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+                 metric_selector.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+                 colfilter.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+                 color_selector.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+                 size_range_slider.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+                 midpoint_slider.change(update_table_wrapper, inputs=shared_inputs, outputs=table)
+
+             with gr.Tab("📚 Long Context"):
+                 gr.Markdown("## Coming soon...")
+
+             with gr.Tab("📈 Scaling Law"):
+                 period_selector_2 = gr.Dropdown(label="Period", choices=time_list, value=last_period)
+
+                 def update_plot(period):
+                     new_fig = create_scaling_plot(data_manager, period)
+                     return new_fig
+
+                 plot = gr.Plot(initial_fig)
+                 period_selector_2.change(update_plot, inputs=period_selector_2, outputs=plot)
+
+             with gr.Tab("ℹ️ About"):
+                 gr.Markdown(read_about_md())
+
+             with gr.Tab("🚀 Submit"):
+                 with gr.Group():
+                     with gr.Row():
+                         model_name = gr.Textbox(max_lines=1, placeholder="Enter model name...", show_label=False, scale=4)
+                         submit = gr.Button("Submit", variant="primary", scale=0)
+                 output = gr.Markdown("# Enter a public HF repo id, then hit Submit to add it to the evaluation queue.")
+                 submit.click(fn=submit_model, inputs=model_name, outputs=output)
+
+     demo.launch(share=False)
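(The new app.py delegates all data loading to DataManager from data_manager, which appears in this truncated view only as a compiled __pycache__/data_manager.cpython-311.pyc. The sketch below is a hypothetical stand-in reconstructed from the call sites above — DataManager("data"), get_available_periods(), get_available_columns(period), and query(...). The class name DataManagerSketch and the helper _records are invented for illustration; the real implementation may differ.)

```python
import json
import math
import os
import re

import pandas as pd


class DataManagerSketch:
    """Hypothetical stand-in for the DataManager interface that app.py assumes."""

    def __init__(self, folder: str):
        self.folder = folder

    def get_available_periods(self) -> list:
        # Periods are the sub-folders named like "2025-12", in ascending order.
        pattern = re.compile(r"^\d{4}-\d{2}$")
        if not os.path.isdir(self.folder):
            return []
        return sorted(d for d in os.listdir(self.folder) if pattern.match(d))

    def _records(self, period: str) -> list:
        # Every *.json file under data/<period>/ is one (model, data source) result.
        period_dir = os.path.join(self.folder, period)
        records = []
        for name in sorted(os.listdir(period_dir)):
            if name.endswith(".json"):
                with open(os.path.join(period_dir, name), "r", encoding="utf-8") as f:
                    records.append(json.load(f))
        return records

    def get_available_columns(self, period: str) -> list:
        # "Jellyfish042/UncheatableEval-2025-12-ao3_english" -> "ao3_english"
        return sorted({r["data_path"].split(f"-{period}-")[-1] for r in self._records(period)})

    def query(self, period, metric_code, param_range, model_groups=None, visible_columns=None) -> pd.DataFrame:
        # model_groups (the size buckets) is accepted but ignored in this sketch.
        rows = {}
        for r in self._records(period):
            name = os.path.basename(r["model_name_or_path"]).replace(".pth", "")
            source = r["data_path"].split(f"-{period}-")[-1]
            bits = r["neg_log_prob_sum"] / math.log(2)  # assume the logged sum is in nats
            if metric_code == "cr":       # compression rate (%)
                value = 100.0 * bits / (8.0 * r["avg bytes"])
            elif metric_code == "bpc":    # bits per character
                value = bits / r["avg character count"]
            else:                         # "bpb": bits per byte
                value = bits / r["avg bytes"]
            row = rows.setdefault(name, {"Name": name, "Params (B)": r["parameters count"]})
            row[source] = round(value, 3)
        if not rows:
            return pd.DataFrame()
        df = pd.DataFrame(list(rows.values()))
        df = df[df["Params (B)"].between(*param_range)].copy()
        sources = visible_columns or [c for c in df.columns if c not in ("Name", "Params (B)")]
        df["Average (lower=better)"] = df[sources].mean(axis=1).round(3)
        return df.sort_values("Average (lower=better)").reset_index(drop=True)
```

(The real class presumably uses model_groups to map the size buckets — "14b", "7b", "1b5", "other", ... — onto specific models; that mapping is omitted here because it is not visible in this diff.)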
data/2024-10/7b.xlsx CHANGED
Binary files a/data/2024-10/7b.xlsx and b/data/2024-10/7b.xlsx differ
 
data/2024-10/xb.xlsx DELETED
Binary file (9.4 kB)
 
data/2025-12/2025-12-21_11-34-39.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 4649.08,
+   "avg tokens": 1909.12,
+   "avg character count": 7857.404,
+   "parameters count": 1.527404544,
+   "avg bytes": 8012.242,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-ao3_english",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 10.463994754364728,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-35-15.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 4283.474,
+   "avg tokens": 2095.926,
+   "avg character count": 9964.74,
+   "parameters count": 1.527404544,
+   "avg bytes": 9994.128,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_cs",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.729221971112452,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-36-04.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 4036.0446875,
+   "avg tokens": 2925.354,
+   "avg character count": 9913.284,
+   "parameters count": 1.527404544,
+   "avg bytes": 9918.674,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_math",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.338155351540054,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-36-44.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 4376.222,
+   "avg tokens": 2448.906,
+   "avg character count": 9946.974,
+   "parameters count": 1.527404544,
+   "avg bytes": 9952.8,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_physics",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.9293688424729485,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-37-00.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 1719.608,
+   "avg tokens": 739.35,
+   "avg character count": 3394.84,
+   "parameters count": 1.527404544,
+   "avg bytes": 3396.996,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-bbc_news",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 9.128911006492899,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-37-31.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 1347.243,
+   "avg tokens": 1773.934,
+   "avg character count": 5773.33,
+   "parameters count": 1.527404544,
+   "avg bytes": 5853.154,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_cpp",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 4.150883427491335,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-37-59.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 1377.357875,
+   "avg tokens": 1654.562,
+   "avg character count": 5774.754,
+   "parameters count": 1.527404544,
+   "avg bytes": 5870.628,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_javascript",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 4.231036645040064,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-38-27.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 2226.4415625,
+   "avg tokens": 1598.294,
+   "avg character count": 5024.17,
+   "parameters count": 1.527404544,
+   "avg bytes": 5522.098,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_markdown",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.270959789757048,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-38-57.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 1621.03725,
+   "avg tokens": 1791.012,
+   "avg character count": 6339.622,
+   "parameters count": 1.527404544,
+   "avg bytes": 6497.474,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_python",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 4.49917614458958,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-39-11.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "neg_log_prob_sum": 1502.122,
+   "avg tokens": 718.362,
+   "avg character count": 3043.39,
+   "parameters count": 1.527404544,
+   "avg bytes": 3062.292,
+   "sample_count": 500,
+   "model_name_or_path": "/mnt/Public/rwkv_models/rwkv7-g1b-1.5b-20251202-ctx8192.pth",
+   "tokenizer_name": "rwkv_vocab_v20230424",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-wikipedia_english",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 8.845923087226053,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-39-42.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 5066.424,
+   "avg tokens": 1833.724,
+   "avg character count": 7857.404,
+   "parameters count": 1.720574976,
+   "avg bytes": 8012.242,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-ao3_english",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 11.403338759364772,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-40-01.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 4186.624,
+   "avg tokens": 2071.622,
+   "avg character count": 9964.74,
+   "parameters count": 1.720574976,
+   "avg bytes": 9994.128,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_cs",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.554463084306498,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-40-26.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 3646.42,
+   "avg tokens": 3000.148,
+   "avg character count": 9913.284,
+   "parameters count": 1.720574976,
+   "avg bytes": 9918.674,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_math",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 6.6297572273752685,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-40-48.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 4222.864,
+   "avg tokens": 2501.464,
+   "avg character count": 9946.974,
+   "parameters count": 1.720574976,
+   "avg bytes": 9952.8,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_physics",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.651496251241524,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-41-02.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1826.096,
+   "avg tokens": 720.27,
+   "avg character count": 3394.84,
+   "parameters count": 1.720574976,
+   "avg bytes": 3396.996,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-bbc_news",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 9.69422558705976,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-41-20.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1175.4235,
+   "avg tokens": 1617.712,
+   "avg character count": 5773.33,
+   "parameters count": 1.720574976,
+   "avg bytes": 5853.154,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_cpp",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 3.6215040096210274,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-41-38.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1212.134,
+   "avg tokens": 1498.248,
+   "avg character count": 5774.754,
+   "parameters count": 1.720574976,
+   "avg bytes": 5870.628,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_javascript",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 3.723493701808611,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-41-55.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 2129.001,
+   "avg tokens": 1446.138,
+   "avg character count": 5024.17,
+   "parameters count": 1.720574976,
+   "avg bytes": 5522.098,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_markdown",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 6.952745099660591,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-42-12.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1409.987,
+   "avg tokens": 1585.12,
+   "avg character count": 6339.622,
+   "parameters count": 1.720574976,
+   "avg bytes": 6497.474,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_python",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 3.9134078347560655,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-42-26.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1590.208,
+   "avg tokens": 750.442,
+   "avg character count": 3043.39,
+   "parameters count": 1.720574976,
+   "avg bytes": 3062.292,
+   "sample_count": 500,
+   "model_name_or_path": "Qwen/Qwen3-1.7B-Base",
+   "tokenizer_name": "Qwen/Qwen3-1.7B-Base",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-wikipedia_english",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 9.36465723868738,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-42-49.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 4847.352,
+   "avg tokens": 1949.908,
+   "avg character count": 7857.404,
+   "parameters count": 1.711376384,
+   "avg bytes": 8012.242,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-ao3_english",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 10.910258782503071,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-43-05.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 4517.79,
+   "avg tokens": 2182.888,
+   "avg character count": 9964.74,
+   "parameters count": 1.711376384,
+   "avg bytes": 9994.128,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_cs",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 8.152028407052809,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-43-28.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 4149.150625,
+   "avg tokens": 3143.934,
+   "avg character count": 9913.284,
+   "parameters count": 1.711376384,
+   "avg bytes": 9918.674,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_math",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 7.543799491984567,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-43-47.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 4609.628,
+   "avg tokens": 2602.328,
+   "avg character count": 9946.974,
+   "parameters count": 1.711376384,
+   "avg bytes": 9952.8,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_physics",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 8.352282091400047,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-43-58.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1738.216,
+   "avg tokens": 755.956,
+   "avg character count": 3394.84,
+   "parameters count": 1.711376384,
+   "avg bytes": 3396.996,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-bbc_news",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 9.227695599265683,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-44-14.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1344.355,
+   "avg tokens": 1998.214,
+   "avg character count": 5773.33,
+   "parameters count": 1.711376384,
+   "avg bytes": 5853.154,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_cpp",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 4.141985440017216,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-44-30.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "neg_log_prob_sum": 1449.103,
+   "avg tokens": 1865.214,
+   "avg character count": 5774.754,
+   "parameters count": 1.711376384,
+   "avg bytes": 5870.628,
+   "sample_count": 500,
+   "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+   "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+   "data_path": "Jellyfish042/UncheatableEval-2025-12-github_javascript",
+   "chunk_size": 4000,
+   "ensure_bos_token": true,
+   "model_args": {
+     "device_map": "auto",
+     "trust_remote_code": true,
+     "attn_implementation": "flash_attention_2",
+     "torch_dtype": "torch.bfloat16"
+   },
+   "tokenizer_args": {
+     "trust_remote_code": true
+   },
+   "requirements": [],
+   "batch_size": 1,
+   "compression_rate": 4.4514268998080775,
+   "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-44-45.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 2527.254,
+ "avg tokens": 1888.098,
+ "avg character count": 5024.17,
+ "parameters count": 1.711376384,
+ "avg bytes": 5522.098,
+ "sample_count": 500,
+ "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+ "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-github_markdown",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 8.25333236766804,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-45-01.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1684.25,
+ "avg tokens": 1931.562,
+ "avg character count": 6339.622,
+ "parameters count": 1.711376384,
+ "avg bytes": 6497.474,
+ "sample_count": 500,
+ "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+ "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-github_python",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 4.674622635306498,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-45-11.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1584.196,
+ "avg tokens": 779.642,
+ "avg character count": 3043.39,
+ "parameters count": 1.711376384,
+ "avg bytes": 3062.292,
+ "sample_count": 500,
+ "model_name_or_path": "HuggingFaceTB/SmolLM2-1.7B",
+ "tokenizer_name": "HuggingFaceTB/SmolLM2-1.7B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-wikipedia_english",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 9.32925286434202,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-45-38.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 5079.304,
+ "avg tokens": 1833.724,
+ "avg character count": 7857.404,
+ "parameters count": 1.543714304,
+ "avg bytes": 8012.242,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-ao3_english",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 11.432328635305005,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-45-55.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 4373.472,
+ "avg tokens": 2071.622,
+ "avg character count": 9964.74,
+ "parameters count": 1.543714304,
+ "avg bytes": 9994.128,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_cs",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 7.891616914785782,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-46-17.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 3793.949,
+ "avg tokens": 3000.148,
+ "avg character count": 9913.284,
+ "parameters count": 1.543714304,
+ "avg bytes": 9918.674,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_math",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 6.897987835477859,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-46-35.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 4389.584,
+ "avg tokens": 2501.464,
+ "avg character count": 9946.974,
+ "parameters count": 1.543714304,
+ "avg bytes": 9952.8,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_physics",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 7.9535797317909775,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-46-50.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1785.08,
+ "avg tokens": 720.27,
+ "avg character count": 3394.84,
+ "parameters count": 1.543714304,
+ "avg bytes": 3396.996,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-bbc_news",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 9.476483279602297,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-47-06.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1258.625,
+ "avg tokens": 1617.712,
+ "avg character count": 5773.33,
+ "parameters count": 1.543714304,
+ "avg bytes": 5853.154,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-github_cpp",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 3.877849544533749,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-47-21.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1324.075,
+ "avg tokens": 1498.248,
+ "avg character count": 5774.754,
+ "parameters count": 1.543714304,
+ "avg bytes": 5870.628,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-github_javascript",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 4.067359651014027,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-47-36.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 2284.521,
+ "avg tokens": 1446.14,
+ "avg character count": 5024.17,
+ "parameters count": 1.543714304,
+ "avg bytes": 5522.098,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-github_markdown",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 7.4606316238563135,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-47-52.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1501.532,
+ "avg tokens": 1585.12,
+ "avg character count": 6339.622,
+ "parameters count": 1.543714304,
+ "avg bytes": 6497.474,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-github_python",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 4.1674902626314605,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-48-04.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 1596.3,
+ "avg tokens": 750.442,
+ "avg character count": 3043.39,
+ "parameters count": 1.543714304,
+ "avg bytes": 3062.292,
+ "sample_count": 500,
+ "model_name_or_path": "Qwen/Qwen2.5-1.5B",
+ "tokenizer_name": "Qwen/Qwen2.5-1.5B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-wikipedia_english",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 9.400532729125164,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-48-25.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 5037.112,
+ "avg tokens": 1832.424,
+ "avg character count": 7857.404,
+ "parameters count": 1.2358144,
+ "avg bytes": 8012.242,
+ "sample_count": 500,
+ "model_name_or_path": "meta-llama/Llama-3.2-1B",
+ "tokenizer_name": "meta-llama/Llama-3.2-1B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-ao3_english",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 11.337364283933086,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-48-37.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 4519.312,
+ "avg tokens": 2045.48,
+ "avg character count": 9964.74,
+ "parameters count": 1.2358144,
+ "avg bytes": 9994.128,
+ "sample_count": 500,
+ "model_name_or_path": "meta-llama/Llama-3.2-1B",
+ "tokenizer_name": "meta-llama/Llama-3.2-1B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_cs",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 8.154774747018926,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-48-52.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 4072.908,
+ "avg tokens": 2984.08,
+ "avg character count": 9913.284,
+ "parameters count": 1.2358144,
+ "avg bytes": 9918.674,
+ "sample_count": 500,
+ "model_name_or_path": "meta-llama/Llama-3.2-1B",
+ "tokenizer_name": "meta-llama/Llama-3.2-1B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_math",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 7.405178572252937,
+ "track_byte_wise_data": false
+ }
data/2025-12/2025-12-21_11-49-05.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "neg_log_prob_sum": 4462.048,
+ "avg tokens": 2454.32,
+ "avg character count": 9946.974,
+ "parameters count": 1.2358144,
+ "avg bytes": 9952.8,
+ "sample_count": 500,
+ "model_name_or_path": "meta-llama/Llama-3.2-1B",
+ "tokenizer_name": "meta-llama/Llama-3.2-1B",
+ "data_path": "Jellyfish042/UncheatableEval-2025-12-arxiv_physics",
+ "chunk_size": 4000,
+ "ensure_bos_token": true,
+ "model_args": {
+ "device_map": "auto",
+ "trust_remote_code": true,
+ "attn_implementation": "flash_attention_2",
+ "torch_dtype": "torch.bfloat16"
+ },
+ "tokenizer_args": {
+ "trust_remote_code": true
+ },
+ "requirements": [],
+ "batch_size": 1,
+ "compression_rate": 8.084878780102732,
+ "track_byte_wise_data": false
+ }