# AWDE 0617 CHSIMS English Experiment Report

Date: 2026-06-17  
Code directory: `/root/AWDE/0617-CHSIMS-eng`  
Output directory: `/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/AWDE/0617-CHSIMS-eng`  
PKL: `/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/pkl/0617-CHSIMS-eng/chsims_awde_0617_eng_encoder_raw512_fp16.pkl`

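## 0. Metric Correction Notes

In the previous version of this report, the `Mult_acc_3 / Mult_acc_5 / Mult_acc_7` columns were identical. That was a metric-definition error, not a model behavior.

The cause is that the training script inherited MOSEI's integer rounding/clipping metrics, which are defined on `[-3, 3]`: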

```text
Mult_acc_3: clip to [-1, 1], then round
Mult_acc_5: clip to [-2, 2], then round
Mult_acc_7: clip to [-3, 3], then round
```

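However, CHSIMS/SIMS labels lie in `[-1, 1]`, and the actual label levels are `-1.0, -0.8, ..., 0.8, 1.0`. After MOSEI-style `np.round`, only three classes remain (`-1/0/1`), so the 5-class and 7-class metrics degenerate and the three columns are necessarily identical.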
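
To make the collapse concrete, here is a minimal Python sketch that applies the clip-then-round rule to the eleven label levels listed above. The helper name `mosei_classes` is hypothetical and is not taken from the training script.

```python
import numpy as np

# The eleven label levels stated above: -1.0, -0.8, ..., 0.8, 1.0.
levels = np.round(np.linspace(-1.0, 1.0, 11), 1)

def mosei_classes(values, bound):
    """MOSEI-style bucketing: clip to [-bound, bound], then round to an integer."""
    # Adding 0.0 turns -0.0 into 0.0 so the class ids print cleanly.
    return np.round(np.clip(values, -bound, bound)) + 0.0

# Distinct class ids produced for the 3-, 5-, and 7-class variants.
distinct = {bound: np.unique(mosei_classes(levels, bound)).tolist() for bound in (1, 2, 3)}
for bound, classes in distinct.items():
    print(f"clip to [-{bound}, {bound}] -> classes {classes}")
```

All three bounds yield the same three class ids, so the accuracies built on them cannot differ as long as predictions also stay inside `(-1.5, 1.5)`.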

This report re-evaluates the same saved best checkpoints using the standard MMSA SIMS definitions:

```text
Mult_acc_2: [-1.0, 0.0], (0.0, 1.0]
Mult_acc_3: [-1.0, -0.1], (-0.1, 0.1], (0.1, 1.0]
Mult_acc_5: [-1.0, -0.7], (-0.7, -0.1], (-0.1, 0.1], (0.1, 0.7], (0.7, 1.0]
```
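
The intervals above can be sketched in Python as follows. This illustrates the listed bucket boundaries only; it is not the MMSA implementation or the project's re-evaluation script. The names `SIMS_INNER_EDGES`, `sims_bucket`, and `sims_mult_acc` are hypothetical, and clipping values to `[-1, 1]` before bucketing is an assumption.

```python
import numpy as np

# Interior bucket edges for each class count, read off the intervals above.
SIMS_INNER_EDGES = {
    2: [0.0],
    3: [-0.1, 0.1],
    5: [-0.7, -0.1, 0.1, 0.7],
}

def sims_bucket(values, n_classes):
    """Map regression values to class ids using (low, high] intervals."""
    clipped = np.clip(np.asarray(values, dtype=float), -1.0, 1.0)
    # right=True means a value equal to an edge falls into the lower bucket.
    return np.digitize(clipped, SIMS_INNER_EDGES[n_classes], right=True)

def sims_mult_acc(preds, labels, n_classes):
    """Fraction of samples whose prediction and label share a bucket."""
    same = sims_bucket(preds, n_classes) == sims_bucket(labels, n_classes)
    return float(np.mean(same))

# Toy values for illustration only (not from the report).
toy_preds = [0.05, -0.3, 0.9, 0.15]
toy_labels = [0.0, -0.8, 1.0, 0.2]
toy_acc = {n: sims_mult_acc(toy_preds, toy_labels, n) for n in (2, 3, 5)}
```

Unlike the rounding rule, the 5-class edges split the eleven label levels into five non-empty buckets, so `Mult_acc_5` carries information that `Mult_acc_3` does not.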

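SIMS does not define `Mult_acc_7`, so this corrected report drops that column. The training script has also been fixed, so future CHSIMS runs will no longer emit a spurious `Mult_acc_7`.

Note: the existing checkpoints were saved under the old composite score, in which the old `Mult_acc_5` had actually degenerated into the 3-class rounding metric. The tables below are therefore a correct SIMS re-evaluation of the saved checkpoints. For validation selection to use the SIMS `Mult_acc_5` strictly, the models must be retrained with the fixed script.

## 1. Data Construction

Data sources: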

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/data/CH-SIMS
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/CHSIMS-Encoder-FRA
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputChsims
```

| Item | Value |
| --- | --- |
| Dataset | CHSIMS |
| Language | English text |
| Samples | 2281 |
| Split | train 1368 / valid 456 / test 457 |
| PKL size | 1.588 GiB |
| Missing rows | 0 |
| Missing English text | 0 |
| Missing explanation/final/weights | 0 |

| Modality | Feature directory | Shape per sample | Dtype |
| --- | --- | ---: | --- |
| text | `Baichuan-13B-Base-langeng-FRA-50` | `(50, 5120)` | float16 |
| audio | `chinese-hubert-large-FRA-50` | `(50, 1024)` | float16 |
| vision | `clip-vit-large-patch14-FRA-50` | `(50, 768)` | float16 |
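
As a rough sanity check on the PKL size, the raw feature payload implied by the two tables above can be computed directly. The sketch below uses only the shapes, dtype, and sample counts from the tables; how the PKL stores its other fields is not described in this report, so the result is a lower bound on the file size, not a breakdown of it.

```python
# Shapes and dtype are taken from the modality table above (float16 = 2 bytes).
FEATURE_SHAPES = {"text": (50, 5120), "audio": (50, 1024), "vision": (50, 768)}
BYTES_PER_VALUE = 2
NUM_SAMPLES = 2281
SPLIT = {"train": 1368, "valid": 456, "test": 457}

# The three splits should add up to the total sample count.
assert sum(SPLIT.values()) == NUM_SAMPLES

per_sample_bytes = sum(t * d for t, d in FEATURE_SHAPES.values()) * BYTES_PER_VALUE
feature_bytes = per_sample_bytes * NUM_SAMPLES
feature_gib = feature_bytes / 2**30

print(f"{per_sample_bytes} bytes/sample, {feature_gib:.3f} GiB of raw features")
```

The features alone come to about 1.468 GiB. The gap to the reported 1.588 GiB would have to come from the remaining fields (labels, text, explanation/final/weights) and serialization overhead, which is an assumption rather than something measured here.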

## 2. Training and Re-evaluation Setup

Settings shared by the original training runs:

```text
Encoder-FRA features + feature_layers=2 + align_layers=2
+ pre-align FD micro gate + Directed EATS
+ floor-bounded prior SMoE + SmoothL1(beta=0.25)
+ EMA, ema_decay=0.997, ema_start_epoch=4
```

Original selection score:

```text
valid Corr - 0.50 * valid MAE + 0.20 * old-valid Mult_acc_5
```
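
A minimal sketch of how such a composite could drive checkpoint selection is shown below. The function `selection_score` and the per-epoch numbers in `history` are hypothetical illustrations, not values or code from `train_awde.py`; only the weights `0.50` and `0.20` come from the formula above.

```python
def selection_score(valid_corr, valid_mae, valid_mult_acc_5):
    """Composite from the formula above: higher is better."""
    return valid_corr - 0.50 * valid_mae + 0.20 * valid_mult_acc_5

# Hypothetical per-epoch validation metrics, for illustration only.
history = {
    6: {"corr": 0.78, "mae": 0.31, "acc5": 0.70},
    7: {"corr": 0.80, "mae": 0.29, "acc5": 0.72},
    8: {"corr": 0.79, "mae": 0.30, "acc5": 0.71},
}

# Keep the epoch whose validation composite is highest.
best_epoch = max(
    history,
    key=lambda e: selection_score(history[e]["corr"], history[e]["mae"], history[e]["acc5"]),
)
print(best_epoch, round(selection_score(0.80, 0.29, 0.72), 4))
```

Because the third term enters with weight `0.20`, a degenerate `Mult_acc_5` shifts which epoch wins without changing the form of the score.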

Re-evaluation script:

```text
/root/AWDE/0617-CHSIMS-eng/scripts/reeval_chsims_sims_metrics.py
```

Re-evaluation output:

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/AWDE/0617-CHSIMS-eng/chsims_sims_metric_reeval.json
```

## 3. Search1 SIMS Re-evaluation Results

All four Search1 runs use seed `20261700`. The table below is the standard SIMS re-evaluation of each saved best checkpoint on the test split.

| Run | Best | Source | Old composite | Mult_acc_2 | F1_score | Non0_acc_2 | Non0_F1_score | Mult_acc_3 | Mult_acc_5 | MAE | Corr | Zero_recall | Zero_precision | Zero_F1 | Zero_pred_rate | Router [T,A,V] |
| --- | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| `chsims_eng_h128_lr1e4_b12_d12` | 13 | EMA | 0.80776 | 0.8403 | 0.8421 | 0.8763 | 0.8750 | 0.7593 | 0.5777 | 0.2861 | 0.7922 | 0.3478 | 0.4211 | 0.3810 | 0.1247 | [0.386454, 0.364212, 0.249335] |
| `chsims_eng_h128_lr1e4_b8_d12` | 8 | EMA | 0.80330 | 0.8621 | 0.8644 | 0.9046 | 0.9041 | 0.7724 | 0.5602 | 0.2820 | 0.8133 | 0.3188 | 0.4490 | 0.3729 | 0.1072 | [0.368937, 0.354252, 0.276811] |
| `chsims_eng_h128_lr5e5_b8_d15` | 8 | EMA | 0.77228 | 0.8403 | 0.8421 | 0.8840 | 0.8824 | 0.7462 | 0.5317 | 0.2972 | 0.7990 | 0.2464 | 0.3542 | 0.2906 | 0.1050 | [0.376313, 0.356440, 0.267247] |
| `chsims_eng_h160_lr5e5_b8_d15` | 8 | EMA | 0.76890 | 0.8578 | 0.8601 | 0.8918 | 0.8914 | 0.7549 | 0.5339 | 0.2983 | 0.7981 | 0.2754 | 0.3958 | 0.3248 | 0.1050 | [0.366587, 0.346516, 0.286897] |
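
As a quick consistency check on the table above, the reported `Zero_F1` values can be recomputed from the two neighboring columns. This assumes `Zero_F1` is the usual harmonic mean of `Zero_precision` and `Zero_recall`, which the report does not state explicitly; the helper `f1` is a hypothetical name.

```python
# (Zero_recall, Zero_precision, reported Zero_F1) copied from the Search1 table.
rows = {
    "chsims_eng_h128_lr1e4_b12_d12": (0.3478, 0.4211, 0.3810),
    "chsims_eng_h128_lr1e4_b8_d12": (0.3188, 0.4490, 0.3729),
    "chsims_eng_h128_lr5e5_b8_d15": (0.2464, 0.3542, 0.2906),
    "chsims_eng_h160_lr5e5_b8_d15": (0.2754, 0.3958, 0.3248),
}

def f1(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Largest gap between the recomputed and the reported F1 over the four runs.
max_gap = max(abs(f1(r, p) - reported) for r, p, reported in rows.values())
print(f"max |recomputed - reported| = {max_gap:.5f}")
```

The gaps stay within the rounding of the four-decimal table entries, which is consistent with the harmonic-mean reading.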

Search1 mean/std (std is the sample std over the 4 runs):

| Metric | Mean | Std |
| --- | ---: | ---: |
| Mult_acc_2 | 0.8501 | 0.0115 |
| F1_score | 0.8522 | 0.0118 |
| Non0_acc_2 | 0.8892 | 0.0121 |
| Non0_F1_score | 0.8882 | 0.0125 |
| Mult_acc_3 | 0.7582 | 0.0109 |
| Mult_acc_5 | 0.5509 | 0.0221 |
| MAE | 0.2909 | 0.0081 |
| Corr | 0.8007 | 0.0090 |
| Zero_recall | 0.2971 | 0.0450 |
| Zero_precision | 0.4050 | 0.0403 |
| Zero_F1 | 0.3423 | 0.0425 |
| Zero_pred_rate | 0.1105 | 0.0095 |
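
The mean and sample standard deviation (divisor `n - 1`) can be reproduced from the per-run table with the Python standard library. The sketch below recomputes two rows from the rounded test values in the Search1 table; small last-digit differences are possible for other rows because the inputs here are already rounded to four decimals.

```python
from statistics import mean, stdev

# Test-split values for the four Search1 runs, copied from the table above.
search1 = {
    "Mult_acc_2": [0.8403, 0.8621, 0.8403, 0.8578],
    "MAE": [0.2861, 0.2820, 0.2972, 0.2983],
}

# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
summary = {name: (mean(vals), stdev(vals)) for name, vals in search1.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.4f} std={s:.4f}")
```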

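Search1 conclusion: `chsims_eng_h128_lr1e4_b8_d12` remains the strongest run on the main classification metrics and on correlation, with `Mult_acc_2=0.8621`, `F1=0.8644`, `Non0_acc_2=0.9046`, `Mult_acc_3=0.7724`, and `Corr=0.8133`. The best `Mult_acc_5`, however, comes from the `b12` run (`chsims_eng_h128_lr1e4_b12_d12`) at `0.5777`.

## 4. Search2 SIMS Re-evaluation Results

Based on the Search1 results, Search2 adds 4 targeted variants.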

| Run | Purpose |
| --- | --- |
| `chsims_eng_s2_h128_lr1e4_b8_d12_seed1` | Rerun the winner with a different seed to check stability |
| `chsims_eng_s2_h112_lr1e4_b8_d12` | Reduce capacity to check continuous-regression calibration |
| `chsims_eng_s2_h128_lr1e4_b8_d20` | Raise dropout to 0.20 |
| `chsims_eng_s2_h128_lr1e4_b8_d12_rkl002` | Add `route_kl_weight=0.02` |

| Run | Best | Source | Old composite | Mult_acc_2 | F1_score | Non0_acc_2 | Non0_F1_score | Mult_acc_3 | Mult_acc_5 | MAE | Corr | Zero_recall | Zero_precision | Zero_F1 | Zero_pred_rate | Router [T,A,V] |
| --- | ---: | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| `chsims_eng_s2_h112_lr1e4_b8_d12` | 16 | EMA | 0.79522 | 0.8490 | 0.8517 | 0.8892 | 0.8887 | 0.7615 | 0.5711 | 0.2775 | 0.8089 | 0.4058 | 0.4118 | 0.4088 | 0.1488 | [0.408392, 0.352634, 0.238974] |
| `chsims_eng_s2_h128_lr1e4_b8_d12_rkl002` | 11 | EMA | 0.81935 | 0.8468 | 0.8492 | 0.8789 | 0.8784 | 0.7593 | 0.5602 | 0.2799 | 0.8097 | 0.3188 | 0.4231 | 0.3636 | 0.1138 | [0.347068, 0.382365, 0.270567] |
| `chsims_eng_s2_h128_lr1e4_b8_d12_seed1` | 11 | EMA | 0.77564 | 0.8446 | 0.8478 | 0.8840 | 0.8839 | 0.7681 | 0.5339 | 0.2967 | 0.7885 | 0.3333 | 0.4340 | 0.3770 | 0.1160 | [0.428329, 0.311010, 0.260661] |
| `chsims_eng_s2_h128_lr1e4_b8_d20` | 8 | EMA | 0.79476 | 0.8556 | 0.8572 | 0.8892 | 0.8881 | 0.7484 | 0.5492 | 0.2939 | 0.7913 | 0.3188 | 0.4151 | 0.3607 | 0.1160 | [0.374855, 0.345430, 0.279715] |

Search2 mean/std (std is the sample std over the 4 runs):

| Metric | Mean | Std |
| --- | ---: | ---: |
| Mult_acc_2 | 0.8490 | 0.0048 |
| F1_score | 0.8515 | 0.0041 |
| Non0_acc_2 | 0.8853 | 0.0049 |
| Non0_F1_score | 0.8848 | 0.0048 |
| Mult_acc_3 | 0.7593 | 0.0082 |
| Mult_acc_5 | 0.5536 | 0.0159 |
| MAE | 0.2870 | 0.0097 |
| Corr | 0.7996 | 0.0113 |
| Zero_recall | 0.3442 | 0.0416 |
| Zero_precision | 0.4210 | 0.0099 |
| Zero_F1 | 0.3775 | 0.0220 |
| Zero_pred_rate | 0.1236 | 0.0168 |

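Search2 conclusion: no Search2 run beats the Search1 winner on `Mult_acc_2/F1/Non0/Corr`, but Search2 improves MAE and the neutral bucket. The `h112/lr1e-4/b8/d0.12` run reaches `MAE=0.2775` and `Zero_F1=0.4088`, the best values across all 8 runs.

## 5. Summary over All 8 Runs

All-8 mean/std (std is the sample std over the 8 runs):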

| Metric | Mean | Std |
| --- | ---: | ---: |
| Mult_acc_2 | 0.8496 | 0.0082 |
| F1_score | 0.8518 | 0.0082 |
| Non0_acc_2 | 0.8872 | 0.0088 |
| Non0_F1_score | 0.8865 | 0.0090 |
| Mult_acc_3 | 0.7588 | 0.0090 |
| Mult_acc_5 | 0.5522 | 0.0179 |
| MAE | 0.2889 | 0.0085 |
| Corr | 0.8001 | 0.0094 |
| Zero_recall | 0.3206 | 0.0474 |
| Zero_precision | 0.4130 | 0.0284 |
| Zero_F1 | 0.3599 | 0.0365 |
| Zero_pred_rate | 0.1171 | 0.0145 |

Best run per test metric (lowest value for MAE, highest value for every other metric, including `Zero_pred_rate`):

| Metric | Best run | Value |
| --- | --- | ---: |
| Mult_acc_2 | `chsims_eng_h128_lr1e4_b8_d12` | 0.8621 |
| F1_score | `chsims_eng_h128_lr1e4_b8_d12` | 0.8644 |
| Non0_acc_2 | `chsims_eng_h128_lr1e4_b8_d12` | 0.9046 |
| Non0_F1_score | `chsims_eng_h128_lr1e4_b8_d12` | 0.9041 |
| Mult_acc_3 | `chsims_eng_h128_lr1e4_b8_d12` | 0.7724 |
| Mult_acc_5 | `chsims_eng_h128_lr1e4_b12_d12` | 0.5777 |
| MAE | `chsims_eng_s2_h112_lr1e4_b8_d12` | 0.2775 |
| Corr | `chsims_eng_h128_lr1e4_b8_d12` | 0.8133 |
| Zero_recall | `chsims_eng_s2_h112_lr1e4_b8_d12` | 0.4058 |
| Zero_precision | `chsims_eng_h128_lr1e4_b8_d12` | 0.4490 |
| Zero_F1 | `chsims_eng_s2_h112_lr1e4_b8_d12` | 0.4088 |
| Zero_pred_rate | `chsims_eng_s2_h112_lr1e4_b8_d12` | 0.1488 |
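
Picking the best run per metric needs a direction for each metric. The sketch below shows one way to do that on a three-run, three-metric subset of the test tables above; `results`, `LOWER_IS_BETTER`, and `best_run` are hypothetical names, and the actual layout of `chsims_sims_metric_reeval.json` is not assumed here.

```python
# Test-split values copied from the Search1/Search2 tables (subset for brevity).
results = {
    "chsims_eng_h128_lr1e4_b8_d12": {"Mult_acc_2": 0.8621, "Mult_acc_5": 0.5602, "MAE": 0.2820},
    "chsims_eng_h128_lr1e4_b12_d12": {"Mult_acc_2": 0.8403, "Mult_acc_5": 0.5777, "MAE": 0.2861},
    "chsims_eng_s2_h112_lr1e4_b8_d12": {"Mult_acc_2": 0.8490, "Mult_acc_5": 0.5711, "MAE": 0.2775},
}

# Error metrics are minimized; accuracy-style metrics are maximized.
LOWER_IS_BETTER = {"MAE"}

def best_run(metric):
    """Return the run name with the best value for the given metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(results, key=lambda run: results[run][metric])

for metric in ("Mult_acc_2", "Mult_acc_5", "MAE"):
    run = best_run(metric)
    print(f"{metric}: {run} ({results[run][metric]:.4f})")
```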

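## 6. Validation-Side Re-evaluation

The table below is the standard SIMS re-evaluation of the same selected checkpoints on the valid split. It shows that the valid composite of `rkl002` remains strong (this run has the best valid `MAE`, `Corr`, and `Mult_acc_5` of the eight), yet on the test side it does not beat the Search1 winner.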

| Run | Mult_acc_2 | F1_score | Non0_acc_2 | Non0_F1_score | Mult_acc_3 | Mult_acc_5 | MAE | Corr | Zero_F1 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `chsims_eng_h128_lr1e4_b12_d12` | 0.8070 | 0.8109 | 0.8605 | 0.8592 | 0.7237 | 0.5307 | 0.2891 | 0.8080 | 0.2459 |
| `chsims_eng_h128_lr1e4_b8_d12` | 0.7917 | 0.7957 | 0.8398 | 0.8384 | 0.7281 | 0.5307 | 0.2905 | 0.8034 | 0.3167 |
| `chsims_eng_h128_lr5e5_b8_d15` | 0.7829 | 0.7889 | 0.8346 | 0.8349 | 0.6996 | 0.5022 | 0.3094 | 0.7823 | 0.1930 |
| `chsims_eng_h160_lr5e5_b8_d15` | 0.7807 | 0.7859 | 0.8398 | 0.8387 | 0.7039 | 0.4583 | 0.3167 | 0.7860 | 0.2124 |
| `chsims_eng_s2_h112_lr1e4_b8_d12` | 0.7829 | 0.7882 | 0.8450 | 0.8439 | 0.7171 | 0.5088 | 0.2986 | 0.7995 | 0.2764 |
| `chsims_eng_s2_h128_lr1e4_b8_d12_rkl002` | 0.8048 | 0.8075 | 0.8605 | 0.8578 | 0.7281 | 0.5417 | 0.2797 | 0.8105 | 0.3140 |
| `chsims_eng_s2_h128_lr1e4_b8_d12_seed1` | 0.8004 | 0.8043 | 0.8475 | 0.8464 | 0.7149 | 0.4956 | 0.3070 | 0.7858 | 0.2586 |
| `chsims_eng_s2_h128_lr1e4_b8_d20` | 0.8092 | 0.8129 | 0.8630 | 0.8617 | 0.7018 | 0.5022 | 0.2980 | 0.7971 | 0.2017 |

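## 7. Conclusions

1. The identical `Mult_acc_3/5/7` columns in the previous report were a bug: CHSIMS should not use MOSEI's `[-3,3]` rounding metrics.

2. After the correction, the winner on the main classification metrics and on correlation is still `chsims_eng_h128_lr1e4_b8_d12`: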

```text
Mult_acc_2 = 0.8621
F1_score = 0.8644
Non0_acc_2 = 0.9046
Non0_F1_score = 0.9041
Mult_acc_3 = 0.7724
Mult_acc_5 = 0.5602
MAE = 0.2820
Corr = 0.8133
```

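3. For fine-grained `Mult_acc_5`, Search1's `b12` run is stronger (`Mult_acc_5=0.5777`); for MAE and the neutral bucket, Search2's `h112` run is stronger (`MAE=0.2775`, `Zero_F1=0.4088`).

4. `route_kl_weight=0.02` is strong on valid, but its main test metrics do not beat the winner; it is currently not recommended as the main configuration.

5. Because the current checkpoints were saved under the old selection score, strict paper-grade reporting calls for rerunning with the fixed `train_awde.py` on 4 GPUs, with the selection score changed to: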

```text
valid Corr - 0.50 * valid MAE + 0.20 * valid SIMS Mult_acc_5
```

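Alternatively, follow the fixed CH-SIMS/MERBench protocol: select hyperparameters by validation WAF/F1 first, then rerun with multiple seeds.

## 8. Paths for Verification

Fixed training script: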

```text
/root/AWDE/0617-CHSIMS-eng/scripts/train_awde.py
```

Re-evaluation script and output:

```text
/root/AWDE/0617-CHSIMS-eng/scripts/reeval_chsims_sims_metrics.py
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/AWDE/0617-CHSIMS-eng/chsims_sims_metric_reeval.json
```

Core checkpoints:

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/AWDE/0617-CHSIMS-eng/*/best_awde_no_bert.pt
```
