# 大模型输出提示词汇总

本文件根据 `store` 下现有输出文件、输出记录里的 `prompt_file` 字段，以及对应生成脚本反查整理。

重点链路是：

1. 原始视频/音频解释：`store/outputs2/*raw_explanations.jsonl`
2. 三模态权重生成：`store/outputs2/mosei_qwen3_omni_modal_fusion_weights.jsonl`
3. 最终综合描述生成：`store/outputE/mosei_qwen3_omni_final_descriptions.jsonl`

另附 `store/outputs` 旧版 VideoLLaMA3 + Qwen refinement 链路，因为它也是用大模型生成解释。

## 1. Qwen3-Omni 视频解释

输出文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs2/mosei_qwen3_omni_video_raw_explanations.jsonl
```

生成脚本：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/scripts/generate_mosei_qwen3_omni_video_explanations.py
```

脚本默认提示词文件：

```text
/root/exp/test/video.txt
```

说明：当前这个 `/root/exp/test/video.txt` 文件已经不存在；但是输出 JSONL 里的 `prompt_file` 字段明确记录了该路径。根据 VS Code 历史文件恢复到最匹配当前输出标题的版本：

```text
/root/.vscode-server/data/User/History/-2b02f40f/4kO7.txt
```

该版本和输出中的 `[Scene and Situational Context]`、`[Visual Affective Summary]` 等标题一致。

调用方式：脚本把原始视频作为 `video` 输入，把下面文本作为同一个 user message 里的 `text` 输入；`use_audio_in_video=False`，所以这是 video-only 运行。

提示词内容：

```text
For the given video clip, act as an affective visual evidence extractor for multimodal sentiment analysis.

Your task is not to simply caption the video, and not to make a final sentiment classification. Instead, generate a detailed, evidence-grounded explanation of the visual cues that may be useful for downstream sentiment or emotion analysis.

Focus only on visually observable information from the video. Separate what is directly visible from what is only a possible emotional interpretation.

Please analyze the clip using the following aspects:

1. Scene and situational context
   - Describe the background, environment, and social setting only when they are relevant to emotional interpretation.
   - Note whether the scene appears formal, casual, tense, isolated, crowded, celebratory, confrontational, or otherwise emotionally meaningful.
   - Do not over-rely on the scene alone to infer emotion.

2. Interaction context
   - Describe whether the person is speaking to someone, reacting to others, performing, listening, arguing, waiting, or acting alone.
   - Note whether the visual behavior appears responsive to another person or event in the scene.

3. Facial expression cues
   - Describe visible facial affective signals such as smiling, frowning, brow furrowing, eyebrow raising, eye widening, narrowed eyes, crying, tears, lip compression, mouth tension, or emotionally flat expression.
   - Mention whether the expression appears strong, subtle, natural, restrained, forced, or ambiguous.
   - If the face is unclear, partially occluded, or too small, state that explicitly.

4. Gaze, eye behavior, and head movement
   - Describe eye contact, gaze aversion, downward gaze, eye rolling, restless gaze, staring, or lack of visible eye cues.
   - Describe head movements such as nodding, shaking the head, lowering the head, tilting, turning away, or repeated head motion.
   - Explain how these cues may contribute to visual affective interpretation when appropriate.

5. Body posture and global body state
   - Describe posture such as upright, slumped, leaning forward, leaning backward, withdrawn, rigid, relaxed, closed, or open.
   - Note body orientation relative to other people or objects.
   - Mention whether the posture suggests engagement, avoidance, fatigue, tension, passivity, confidence, or uncertainty, but keep the interpretation grounded in visible evidence.

6. Hand, arm, and gesture cues
   - Describe meaningful gestures such as waving, pointing, shrugging, open-palmed gestures, clenched fists, crossing arms, emphatic hand movements, weak or minimal gestures, or dismissive movements.
   - Note whether gestures are expansive, restrained, repetitive, abrupt, slow, or emotionally expressive.

7. Self-adaptor and tension-regulation behaviors
   - Identify visually observable behaviors such as touching the face, scratching the head, rubbing the hands, covering the mouth, holding the forehead, adjusting posture repeatedly, pulling clothing, or other self-directed movements that may indicate discomfort, hesitation, nervousness, embarrassment, or emotional regulation.
   - Only mention these if they are clearly visible.

8. Temporal emotional dynamics
   - Describe how the visual cues change over time within the clip.
   - Note whether the person becomes more relaxed, more tense, more expressive, more withdrawn, or remains visually stable.
   - Mention any transition such as “initially neutral, later visibly frustrated” or “brief smile followed by a tense expression,” if supported by the clip.

9. Visual consistency, conflict, or ambiguity
   - Assess whether the visual cues are mutually consistent or mixed.
   - Note possible contradictions, such as smiling with tense posture, cheerful setting with withdrawn body language, or restrained facial expression with agitated gestures.
   - If the visual evidence is weak or conflicting, state that clearly instead of forcing a conclusion.

10. Visual affective summary
   - Summarize the strongest emotionally relevant visual evidence in 2–4 sentences.
   - State the likely visual affective tendency as one of:
     [visually positive / visually negative / visually neutral / mixed or ambiguous / insufficient visual evidence]
   - This should be a visual-only tendency, not the final multimodal sentiment label.

11. Uncertainty and visibility limitations
   - Explicitly mention if the clip is low-resolution, occluded, poorly lit, too short, or lacks clear emotional visual evidence.
   - Distinguish strong evidence from weak speculation.

Please output the analysis in the following structured format:

[Scene and Situational Context]
...

[Interaction Context]
...

[Facial Expression Cues]
...

[Gaze, Eye Behavior, and Head Movement]
...

[Body Posture and Global Body State]
...

[Hand, Arm, and Gesture Cues]
...

[Self-Adaptor and Tension-Regulation Behaviors]
...

[Temporal Emotional Dynamics]
...

[Visual Consistency, Conflict, or Ambiguity]
...

[Visual Affective Summary]
...

[Uncertainty and Visibility Limitations]
...
```

## 2. Qwen3-Omni 音频解释

输出文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs2/mosei_qwen3_omni_audio_loudnorm_raw_explanations.jsonl
```

生成脚本：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/scripts/qwen3_omni_audio_only/generate_mosei_qwen3_omni_audio_explanations.py
```

提示词文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs2/audio.txt
```

调用方式：脚本把 loudness-normalized 音频作为 `audio` 输入，把下面文本作为同一个 user message 里的 `text` 输入。

提示词内容：

```text
For the given audio segment, perform a comprehensive paralinguistic and affective analysis rather than only classifying the emotion.
Focus on extracting emotionally relevant acoustic, prosodic, vocal, conversational, and psychological cues that may contribute to sentiment and emotion understanding.
Analyze the audio from the following perspectives:
1. Prosodic Characteristics
- Pitch level, variability, instability, sudden changes, monotonicity
- Speaking rate, hesitation, acceleration, fragmentation
- Loudness and energy dynamics
- Rhythm, pause frequency, pause duration, broken speech flow
- Emotional intensity and temporal variation
2. Voice Quality and Vocal Texture
- Timbre characteristics (warm, cold, tense, soft, nasal, breathy, husky, metallic, resonant, compressed, relaxed)
- Voice stability (shaky, trembling, cracking, emotionally restrained)
- Breathing-related cues (heavy breathing, shaky breathing, sighing, suppressed breathing, exhalation emphasis)
- Signs of vocal fatigue, tension, or emotional suppression
3. Non-verbal and Paralinguistic Events
- Laughter and its possible type (genuine, nervous, sarcastic, forced, restrained, mocking)
- Hesitation sounds, fillers, hums, scoffs, clicks, grunts, gasps, crying-related cues
- Signs of sarcasm, contempt, embarrassment, defensiveness, passive aggression, uncertainty, or discomfort
4. Recording Quality and Acoustic Reliability
- Background noise level and type, such as air conditioner noise, traffic noise, music, crowd noise, echo, wind, or electronic noise.
- Recording clarity, microphone distance, reverberation, clipping, distortion, or low-volume speech.
- Whether the acoustic condition makes emotional cues unreliable or ambiguous.
5. Emotional and Psychological Interpretation(If any of the following entries does not exist in the original audio, please do not mention them in outputs)
- Primary and secondary emotions
- Mixed or conflicting emotions
- Emotional transitions
- Emotional authenticity (genuine, masked, exaggerated, socially controlled)
- Possible psychological states such as anxiety, confidence, frustration, fear, emotional fatigue, defensiveness, empathy, nervousness, or emotional suppression
6. Ambiguity and Reliability
- Mention uncertainty when the emotional cues are ambiguous
- Provide confidence estimation for major emotional interpretations
- Avoid forcing a single emotion label when evidence is insufficient
Do not summarize the transcript semantics unless they directly contribute to emotional interpretation.
Focus primarily on affective and paralinguistic information conveyed by the audio itself.
Please output the analysis in the following structured format:

[Prosodic Characteristics]
...

[Voice Quality and Vocal Texture]
...

[Non-verbal and Paralinguistic Events]
...

[Recording Quality and Acoustic Reliability]
...

[Emotional and Psychological Interpretation]
...

[Ambiguity and Reliability]
...
```

## 3. Qwen3-Omni 三模态权重生成

输出文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs2/mosei_qwen3_omni_modal_fusion_weights.jsonl
```

生成脚本：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/scripts/qwen3_omni_modal_fusion/generate_mosei_qwen3_omni_modal_fusion.py
```

提示词文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs2/modal.txt
```

调用方式：脚本把原始视频、原始音频、填充后的文本 prompt 一起放进同一个 user message。模板变量填充如下：

```text
{clip_id} -> sample["sample_id"]
{video_or_multimodal_input} -> The attached original MOSEI clip contains raw video frames sampled at 2 fps and its audio track.
{transcript} -> 原始字幕/文本，最多约 3000 字符
{audio_explanation} -> 音频解释的 audio_clue/reason/raw_explanation，最多约 2200 字符
{visual_explanation} -> 视频解释的 visual_clue/reason/raw_explanation，最多约 2200 字符
{optional_model_outputs} -> None.
```

提示词内容：

```text
You are an expert annotator for Multimodal Sentiment Analysis (MSA). 
Your task is to assign fusion weights for three modalities: text, audio, and visual, for a Sparse Mixture-of-Experts (SMoE) fusion module.

The weights must represent the relative importance of each modality for predicting the speaker’s sentiment in this specific clip. 
They are NOT confidence scores, NOT modality quality scores alone, and NOT emotion intensity scores. 
They should reflect how much each modality contributes to the final sentiment prediction after considering raw evidence, modality explanations, reliability, redundancy, and cross-modal conflicts.

Inputs:
- Clip ID: {clip_id}
- Raw video/audio: {video_or_multimodal_input}
- Transcript / text modality: {transcript}
- Audio modality explanation: {audio_explanation}
- Visual modality explanation: {visual_explanation}
- Optional prior model outputs, if any: {optional_model_outputs}

Important principles:
1. Treat the raw video/audio and transcript as primary evidence. Treat modality explanations as auxiliary evidence.
2. Check whether each modality explanation is faithful to the raw modality content. If an explanation is inconsistent with the raw content, reduce the reliability of that modality explanation.
3. Evaluate each modality for:
   - raw-content faithfulness,
   - sentiment evidence strength,
   - reliability or noise level,
   - unique contribution beyond other modalities,
   - temporal salience,
   - consistency or conflict with other modalities.
4. Text often provides explicit semantic sentiment, but do not automatically give text the highest weight. Audio or visual cues may be more important when there is sarcasm, irony, discomfort, fake smile, polite smile, nervous laughter, monotone voice, hesitation, facial masking, or mismatch between literal words and non-verbal cues.
5. If all modalities agree, assign weights according to informativeness and reliability, not necessarily equally.
6. If modalities conflict, assign higher weights to the modalities that better explain the speaker’s true affect in context.
7. If a modality is missing, corrupted, unclear, or emotionally uninformative, its weight should be low but still non-negative.
8. The three weights must be decimals between 0 and 1 and must sum to 1.00 after rounding. A small tolerance is allowed only for external checking: 0.99 <= sum <= 1.01.
9. Before finalizing, compute the weight sum yourself. If the sum is outside [0.99, 1.01], revise the weights and regenerate the JSON for this same clip.
10. Output ONLY valid JSON. Do not include Markdown, comments, or extra text.

Use the following JSON schema exactly:

{
  "clip_id": "string",
  "sentiment_prediction": {
    "label": "positive | negative | neutral | mixed | uncertain",
    "confidence": 0.0,
    "brief_reason": "string"
  },
  "modality_assessment": {
    "text": {
      "raw_content_faithfulness": 0,
      "sentiment_evidence_strength": 0,
      "reliability": 0,
      "unique_contribution": 0,
      "conflict_role": "supports | contradicts | ambiguous | unavailable",
      "brief_reason": "string"
    },
    "audio": {
      "raw_content_faithfulness": 0,
      "sentiment_evidence_strength": 0,
      "reliability": 0,
      "unique_contribution": 0,
      "conflict_role": "supports | contradicts | ambiguous | unavailable",
      "brief_reason": "string"
    },
    "visual": {
      "raw_content_faithfulness": 0,
      "sentiment_evidence_strength": 0,
      "reliability": 0,
      "unique_contribution": 0,
      "conflict_role": "supports | contradicts | ambiguous | unavailable",
      "brief_reason": "string"
    }
  },
  "cross_modal_analysis": {
    "modalities_consistent": true,
    "main_conflict": "string",
    "sarcasm_or_irony_possible": false,
    "fake_or_polite_smile_possible": false,
    "single_modality_risk": "string",
    "overall_explanation": "string"
  },
  "weights": {
    "text": 0.00,
    "audio": 0.00,
    "visual": 0.00
  },
  "weight_sum": 1.00,
  "sum_valid": true
}

Scoring rules for the assessment fields:
- Use integers from 0 to 10.
- raw_content_faithfulness: whether the explanation and observed content match the actual modality.
- sentiment_evidence_strength: how directly this modality indicates sentiment polarity or intensity.
- reliability: clarity and trustworthiness of the modality signal.
- unique_contribution: how much this modality adds beyond the other modalities.
- conflict_role: whether this modality supports, contradicts, is ambiguous, or is unavailable relative to the final sentiment interpretation.

Weighting rules:
- Use exactly two decimal places for each weight.
- Each weight must be between 0.00 and 1.00.
- The sum of text + audio + visual must be 1.00 whenever possible.
- If rounding causes a small deviation, the sum must still be within [0.99, 1.01].
- Do not output sum_valid as true unless the computed sum is within [0.99, 1.01].
```

## 4. Qwen3-Omni 最终综合描述

输出文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputE/mosei_qwen3_omni_final_descriptions.jsonl
```

生成脚本：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/scripts/qwen3_omni_final_description/generate_mosei_qwen3_omni_final_description.py
```

提示词文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputE/提示词.txt
```

调用方式：脚本把原始视频、原始音频、填充后的文本 prompt 一起放进同一个 user message。这个提示词文件本身没有 `{clip_id}` 等占位符，所以脚本实际会在提示词后追加下面字段：

```text
Clip ID: {sample_id}
Transcript: {原始字幕/文本，最多约 3000 字符}
Pre-extracted audio clues: {音频解释，最多约 2200 字符}
Pre-extracted visual clues: {视频解释，最多约 2200 字符}
```

提示词文件内容：

```text
You are given a short multimodal sentiment analysis segment, including the raw transcript, raw audio, raw video, and pre-extracted audio and visual clues.

Write one concise integrated description to support downstream sentiment recognition. Use the transcript as semantic context, but do not produce a separate text-only explanation. Combine the raw modalities with the extracted clues, and describe only the most emotion-relevant evidence from speech content, voice/prosody, pauses, vocal energy, facial expression, gaze, posture, gestures, and temporal dynamics.

Also analyze meaningful cross-modal relations, such as consistency, conflict, ambiguity, sarcasm, irony, fake smile, emotional restraint, solemnity, confidence, tension, or uncertainty, but mention these only when supported by the evidence.

Do not assign a final sentiment label, polarity, score, or prediction conclusion. Do not infer private traits or facts beyond the clip. Do not include headings, bullet points, JSON, or explanations of your reasoning process. Output only the integrated description, within 200 words.
```

实际发送给模型的文本结构：

```text
You are given a short multimodal sentiment analysis segment, including the raw transcript, raw audio, raw video, and pre-extracted audio and visual clues.

Write one concise integrated description to support downstream sentiment recognition. Use the transcript as semantic context, but do not produce a separate text-only explanation. Combine the raw modalities with the extracted clues, and describe only the most emotion-relevant evidence from speech content, voice/prosody, pauses, vocal energy, facial expression, gaze, posture, gestures, and temporal dynamics.

Also analyze meaningful cross-modal relations, such as consistency, conflict, ambiguity, sarcasm, irony, fake smile, emotional restraint, solemnity, confidence, tension, or uncertainty, but mention these only when supported by the evidence.

Do not assign a final sentiment label, polarity, score, or prediction conclusion. Do not infer private traits or facts beyond the clip. Do not include headings, bullet points, JSON, or explanations of your reasoning process. Output only the integrated description, within 200 words.

Clip ID: {sample_id}
Transcript: {transcript}
Pre-extracted audio clues: {audio_explanation}
Pre-extracted visual clues: {visual_explanation}
```

## 5. 旧版 store/outputs：VideoLLaMA3 原始解释

输出文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs/mosei_vl3_raw_explanations.jsonl
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs/mosei_vl3_missing37_raw.jsonl
```

生成脚本：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/scripts/generate_mosei_vl3_explanations.py
```

该链路没有外部 prompt 文件，提示词内联在脚本里。

system message：

```text
You are a careful multimodal sentiment analysis assistant.
```

user text 模板：

```text
Watch the video clip and read the subtitle. Generate a very concise raw explanation for multimodal sentiment analysis in exactly four short lines with these labels: Visual evidence, Audio/spoken-delivery evidence, Subtitle/comment evidence, Integrated rationale. If an acoustic detail is not directly available, describe only what can be inferred from the visible speaking behavior and subtitle. Keep the whole answer under 70 words. Do not output a sentiment score.

Subtitle: {subtitle}
```

## 6. 旧版 store/outputs：Qwen 解释精炼

输出文件：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/outputs/mosei_qwen_refined_explanations.jsonl
```

生成脚本：

```text
/root/siton-data-531cb60d91bd4013b805b412b0be2176/tlw/store/scripts/refine_mosei_qwen_explanations.py
```

该链路没有外部 prompt 文件，提示词内联在脚本里。

精炼 system message：

```text
You output strict JSON for multimodal sentiment explanations.
```

精炼 user prompt 模板：

```text
Convert raw multimodal sentiment-analysis explanations into compact strict JSON. Return a single JSON object with exactly these five string keys: visual_clue, audio_clue, text_clue, reason, clue_summary.

Rules:
1. visual_clue: facial expression, gaze, posture, gesture, setting, or visible speaking behavior.
2. audio_clue: vocal tone, pace, emphasis, affect, or spoken-delivery evidence. If raw audio is not explicit, state the cautious inference from speaking behavior and transcript.
3. text_clue: sentiment evidence from the subtitle/comment words.
4. reason: integrate visual, audio, and text evidence for the likely sentiment.
5. clue_summary: one concise sentence summarizing all clues.
Use one short sentence per value. Do not invent a numeric sentiment score. Do not use markdown fences. Do not include any text outside the JSON object.

Subtitle: {text}

Raw explanation:
{raw_explanation}
```

修复 malformed JSON 的 system message：

```text
You repair malformed JSON and output only valid JSON.
```

修复 user prompt 模板：

```text
Repair the malformed answer into one valid minified JSON object. Use exactly these keys: visual_clue, audio_clue, text_clue, reason, clue_summary. All values must be strings. Return only JSON.

Subtitle: {text}
Raw explanation: {raw_explanation}
Malformed answer: {bad_text}
```

同一个旧版脚本 `generate_mosei_vl3_explanations.py` 里也有一个较短的 Qwen refine prompt：

```text
You refine noisy multimodal sentiment explanations into valid compact JSON. Do not invent labels or numeric sentiment scores. Keep all five fields short and specific.

Required JSON keys: visual_clue, audio_clue, text_clue, reason, clue_summary.

Subtitle: {subtitle}

Raw explanation:
{raw_text}
```

对应 system message：

```text
You refine explanations into strict JSON.
```