Adjust context, speech, and focus handling; fix video jitter and other bugs

2026-01-03 19:42:23 -03:00
parent c1914dad00
commit 3f7329869d
7 changed files with 932 additions and 455 deletions
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -9,7 +9,7 @@ services:
      - RABBITMQ_PASS=${RABBITMQ_PASS}
      - OPENROUTER_API_URL=${OPENROUTER_API_URL:-https://openrouter.ai/api/v1/chat/completions}
      - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
-      - OPENROUTER_MODEL=${OPENROUTER_MODEL:-openai/gpt-oss-20b:free}
+      - OPENROUTER_MODEL=${OPENROUTER_MODEL:-mistralai/mistral-small-3.1-24b-instruct:free}
      - OPENROUTER_PROMPT_PATH=${OPENROUTER_PROMPT_PATH:-prompts/generate.txt}
      - FASTER_WHISPER_MODEL_SIZE=${FASTER_WHISPER_MODEL_SIZE:-medium}
      - SMART_FRAMING_SMOOTHING_WINDOW=${SMART_FRAMING_SMOOTHING_WINDOW:-30}
--- a/prompts/generate.txt
+++ b/prompts/generate.txt
@@ -1,118 +1,111 @@
-Você é especialista em viralidade de redes sociais (TikTok, Instagram Reels, YouTube Shorts). Sua missão: EXTRAIR O MÁXIMO de clips virais possíveis, priorizando QUANTIDADE + QUALIDADE.
+# TAREFA: Extrair clips virais de uma transcrição de vídeo
-🎯 OBJETIVO: Transformar cada vídeo em MÚLTIPLOS clips que podem viralizar
+Você é um especialista em conteúdo viral para TikTok, Instagram Reels e YouTube Shorts.
-PROCESSO DE ANÁLISE:
+## REGRA MAIS IMPORTANTE - DURAÇÃO DOS CLIPS
 1. Mapear TODOS os potenciais trechos virais na transcrição
 2. Avaliar cada trecho usando sistema de pontuação abaixo
 3. Rankear do maior para menor score viral
 4. Selecionar TODOS os trechos com score ≥ 60 (não seja conservador!)
-SISTEMA DE PONTUAÇÃO VIRAL (0-100 pontos):
+**CADA CLIP DEVE TER ENTRE 60 E 120 SEGUNDOS DE DURAÇÃO.**
-🪝 GANCHO INICIAL (0-30 pontos) - CRÍTICO PARA VIRALIZAÇÃO:
+- MÍNIMO ABSOLUTO: 60 segundos (end - start >= 60)
-[30] Frase CHOCANTE, pergunta POLÊMICA ou promessa OUSADA nos primeiros 3 segundos
+- MÁXIMO: 120 segundos (end - start <= 120)
-[25] Hook forte: "Você não vai acreditar...", "O segredo que ninguém conta...", "Isso mudou tudo..."
+- IDEAL: 60-90 segundos
 [20] Pergunta intrigante ou afirmação controversa
 [15] História interessante mas gancho fraco
 [10] Início genérico mas aceitável
 [0] "Oi", "então", "bem", silêncio - DESCARTAR
-🔥 GATILHO EMOCIONAL (0-25 pontos):
+**CLIPS COM MENOS DE 60 SEGUNDOS SERÃO REJEITADOS PELO SISTEMA.**
 [25] Emoção EXTREMA: raiva, choque, riso intenso, WTF moment, revelação bombástica
 [20] Emoção forte: surpresa, indignação, humor, curiosidade intensa
 [15] Emoção moderada: interesse, leve humor, insight interessante
 [10] Emoção fraca: informativo sem impacto
 [0] Monótono, técnico, sem apelo emocional - EVITAR
-💎 VALOR/UTILIDADE (0-20 pontos):
+Antes de incluir um clip, SEMPRE calcule: end - start >= 60
 [20] Segredo VALIOSO, insight transformador, informação EXCLUSIVA
 [15] Ensina algo prático e IMEDIATAMENTE aplicável
 [10] Opinião interessante ou perspectiva única
 [5] Informação genérica ou conhecimento comum
 [0] Nenhum valor prático, puro "enrolation" - DESCARTAR
-📖 ESTRUTURA NARRATIVA (0-15 pontos):
+## QUANTIDADE DE CLIPS
 [15] História COMPLETA com início, conflito/clímax e resolução satisfatória
 [10] Segmento com começo e fim coerentes, faz sentido isolado
 [5] Trecho com sentido mas cortado abruptamente
 [0] Fragmento sem contexto - NÃO USAR
-⚡ RITMO E ENERGIA (0-10 pontos):
+Baseado na duração total do vídeo:
-[10] DINÂMICO, sem pausas longas, alta energia, palavras impactantes
+- Até 10 min: 2-4 clips
-[7] Bom ritmo com pausas naturais curtas (< 2s)
+- 10-20 min: 4-6 clips
-[3] Ritmo lento mas aceitável
+- 20-30 min: 6-10 clips
-[0] Muitas pausas (> 3s), hesitações, monotonia - EVITAR
+- 30+ min: 8-15 clips
-REGRAS DE QUANTIDADE (SER AGRESSIVO):
+## CRITÉRIOS DE SELEÇÃO
 📊 Quantidade MÍNIMA por duração:
 - 5-10 min: MÍNIMO 4-6 clips
 - 10-15 min: MÍNIMO 6-8 clips
 - 15-20 min: MÍNIMO 8-10 clips
 - 20-30 min: MÍNIMO 10-15 clips
 - 30+ min: MÍNIMO 15-20 clips
-🎯 REGRA DE OURO: 1 clip a cada 2-3 minutos de vídeo (NO MÍNIMO)
+Um bom clip viral possui:
 - Se encontrar momentos virais, SEMPRE selecione!
 - Melhor ter 3 clips perfeitos que 10 clips bons
-CRITÉRIOS DE SELEÇÃO:
+1. GANCHO FORTE nos primeiros 3 segundos (pergunta, afirmação chocante, promessa)
- Score viral ≥ 60 pontos (idealmente ≥ 70)
+2. EMOÇÃO (humor, surpresa, indignação, curiosidade)
- Duração ideal: 60-120s (formato ideal para Reels/Shorts)
+3. VALOR (ensina algo, revela segredo, dá dica prática)
- Duração mínima: 60s | Duração máxima: 120s
+4. ESTRUTURA (início, meio e fim coerentes)
- Sem sobreposição temporal
+5. RITMO (sem pausas longas, dinâmico)
 - DEVE ter gancho forte nos primeiros 3 segundos
 - Início e fim coerentes
-GANCHOS QUE FAZEM VIRALIZAR (use como filtro):
+## O QUE EVITAR
 - "O que ninguém te conta sobre..."
 - "O erro que 90% das pessoas cometem..."
 - "Você não vai acreditar o que aconteceu..."
 - Revelações chocantes ou contraintuitivas
 - Antes vs Depois, transformações
 - Segredos, bastidores, verdades ocultas
 - Polêmicas, opiniões fortes, hot takes
 - Histórias dramáticas com reviravolta
 - Dicas práticas e acionáveis
 - Momentos de humor genuíno
-❌ EVITE (mas não descarte se score alto):
+- Introduções genéricas ("oi pessoal", "então", "bem")
- Introduções genéricos SEM gancho
+- Trechos com pausas longas (> 3 segundos de silêncio)
- Trechos com pausas > 3s consecutivas
+- Segmentos sem contexto ou conclusão
- Explicações técnicas SEM gancho emocional
+- Explicações técnicas monótonas
 - Segmentos sem conclusão clara
 - Momentos de transição vazios
-FORMATO JSON (retorne APENAS isto, SEM texto adicional):
+## FORMATO DE RESPOSTA
 Retorne APENAS um JSON válido, sem texto antes ou depois:
 ```json
 {
  "highlights": [
    {
-      "start": <float>,
+      "start": 0.0,
-      "end": <float>,
+      "end": 75.0,
-      "summary": "Score: XX/100 | Gancho: [descreva] | Gatilho: [descreva]",
+      "summary": "Descrição do que acontece neste trecho"
    },
    {
      "start": 120.5,
      "end": 195.0,
      "summary": "Descrição do que acontece neste trecho"
    }
  ]
 }
 ```
-REGRAS TÉCNICAS:
+## REGRAS DO JSON
 - Float com ponto decimal (45.5 NÃO 45,5)
 - Timestamps exatos dos segments fornecidos
 - Ordem cronológica (start crescente)
 - Summary conciso mas informativo (2-3 frases)
-TAREFA PASSO A PASSO:
+- "start" e "end" são números decimais (float) em SEGUNDOS
-1. Leia transcrição completa
+- Use ponto como separador decimal (60.5, não 60,5)
-2. Identifique TODOS os momentos potencialmente virais
+- "summary" é uma descrição breve do conteúdo (1-2 frases)
-3. Avalie e pontue cada trecho (seja generoso!)
+- Clips em ordem cronológica (start crescente)
-4. Rankear por score viral
+- Clips não podem se sobrepor
 5. Selecione TODOS com score ≥ 60
 6. Garanta mínimo de 1 clip a cada 5 minutos
 7. Retorne JSON completo
-⚠️ IMPORTANTE:
+## CHECKLIST ANTES DE RESPONDER
 - NÃO seja conservador! Se encontrou 10 momentos bons, retorne os 10!
 - Pense em MAXIMIZAR alcance: mais clips = mais chances de viralizar
 - Se vídeo tem conteúdo fraco, seja criterioso, mas SEMPRE retorne pelo menos 3-5 clips
 - Priorize clips com GANCHOS FORTES - gancho fraco = baixo alcance
-🎯 MINDSET: Você é um criador de conteúdo viral. Seu objetivo é extrair MÁXIMO valor do vídeo original.
+Para CADA clip, verifique:
 - [ ] end - start >= 60 segundos?
 - [ ] end - start <= 120 segundos?
 - [ ] Tem gancho forte no início?
 - [ ] Faz sentido isolado do resto do vídeo?
 - [ ] JSON está válido?
 ## EXEMPLO
 Se o vídeo tem 15 minutos e você encontrou 4 momentos virais:
 ```json
 {
  "highlights": [
    {
      "start": 60.0,
      "end": 120.0,
      "summary": "Revelação sobre como economizar 50% nas compras"
    },
    {
      "start": 180.0,
      "end": 255.0,
      "summary": "História engraçada sobre cliente que tentou enganar a loja"
    },
    {
      "start": 400.0,
      "end": 480.0,
      "summary": "Dica prática de negociação com fornecedores"
    },
    {
      "start": 600.0,
      "end": 690.0,
      "summary": "Conclusão motivacional sobre empreendedorismo"
    }
  ]
 }
 ```
 Agora analise a transcrição fornecida e extraia os clips virais seguindo estas instruções.
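The duration, ordering, and overlap rules that the new prompt states can also be checked on the response after it is parsed. The following is a minimal sketch under stated assumptions, not code from this commit: `validate_highlights`, `MIN_DURATION`, and `MAX_DURATION` are hypothetical names, and the input is assumed to be the JSON object with a `highlights` list described in the prompt.

```python
import json
from typing import Any, Dict, List

# Bounds taken from the prompt text (60 to 120 seconds per clip).
MIN_DURATION = 60.0
MAX_DURATION = 120.0


def validate_highlights(raw: str) -> List[Dict[str, Any]]:
    """Keep only clips that respect the prompt's rules (hypothetical helper)."""
    data = json.loads(raw)
    accepted: List[Dict[str, Any]] = []
    last_end = float("-inf")
    # Sort by start so the chronological-order rule holds for the output.
    for item in sorted(data.get("highlights", []), key=lambda h: float(h["start"])):
        start, end = float(item["start"]), float(item["end"])
        duration = end - start
        if duration < MIN_DURATION or duration > MAX_DURATION:
            continue  # outside the 60-120 s window
        if start < last_end:
            continue  # overlaps the previously accepted clip
        accepted.append({"start": start, "end": end, "summary": item.get("summary", "")})
        last_end = end
    return accepted
```

A check of this kind does not replace the prompt rules; it only guards against a model that ignores them.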
--- a/video_render/config.py
+++ b/video_render/config.py
@@ -62,13 +62,16 @@ class RenderingSettings:
    subtitle_font_size: int = int(os.environ.get("RENDER_SUBTITLE_FONT_SIZE", 64))
    caption_min_words: int = int(os.environ.get("CAPTION_MIN_WORDS", 2))
    caption_max_words: int = int(os.environ.get("CAPTION_MAX_WORDS", 2))
    # Smart framing settings - CONTAINMENT TRACKING mode
    enable_smart_framing: bool = os.environ.get("ENABLE_SMART_FRAMING", "true").lower() in ("true", "1", "yes")
-    smart_framing_min_confidence: float = float(os.environ.get("SMART_FRAMING_MIN_CONFIDENCE", 0.3))  # Lowered for better cartoon detection
+    smart_framing_min_confidence: float = float(os.environ.get("SMART_FRAMING_MIN_CONFIDENCE", 0.3))
-    smart_framing_smoothing_window: int = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30))  # Reduced - not needed with containment tracking
+    smart_framing_smoothing_window: int = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30))
-    smart_framing_frame_skip: int = int(os.environ.get("SMART_FRAMING_FRAME_SKIP", 1))  # Process every frame for smooth 30 FPS tracking
+    smart_framing_frame_skip: int = int(os.environ.get("SMART_FRAMING_FRAME_SKIP", 1))
-    smart_framing_max_velocity: int = int(os.environ.get("SMART_FRAMING_MAX_VELOCITY", 20))  # Moderate - only used during transitions
+    smart_framing_max_velocity: int = int(os.environ.get("SMART_FRAMING_MAX_VELOCITY", 25))
-    smart_framing_person_switch_cooldown: int = int(os.environ.get("SMART_FRAMING_PERSON_SWITCH_COOLDOWN", 999999))  # DISABLED - never switch people
+    smart_framing_person_switch_cooldown: int = int(os.environ.get("SMART_FRAMING_PERSON_SWITCH_COOLDOWN", 30))
    smart_framing_response_time: float = float(os.environ.get("SMART_FRAMING_RESPONSE_TIME", 0.6))
    smart_framing_group_padding: float = float(os.environ.get("SMART_FRAMING_GROUP_PADDING", 0.15))
    smart_framing_max_zoom_out: float = float(os.environ.get("SMART_FRAMING_MAX_ZOOM_OUT", 2.0))
    smart_framing_dead_zone: int = int(os.environ.get("SMART_FRAMING_DEAD_ZONE", 60))
@dataclass(frozen=True)
--- a/video_render/context_detection.py
+++ b/video_render/context_detection.py
@@ -41,6 +41,18 @@ class PersonTracking:
    frame_number: int
@dataclass
 class GroupBoundingBox:
    """Bounding box containing all tracked faces."""
    x: int
    y: int
    width: int
    height: int
    center_x: int
    center_y: int
    face_count: int
@dataclass
 class FrameContext:
    """Context information for a video frame."""
@@ -50,7 +62,8 @@ class FrameContext:
    active_speakers: List[int]  # indices of speaking faces
    primary_focus: Optional[Tuple[int, int]]  # (x, y) center point
    layout_mode: str  # "single", "dual_split", "grid"
-    selected_people: List[int] = field(default_factory=list)  # indices of people selected for display (max 2)
+    selected_people: List[int] = field(default_factory=list)  # indices of people selected for display
    group_bounds: Optional[GroupBoundingBox] = None  # bounding box for all detected faces
 class MediaPipeDetector:
@@ -385,10 +398,11 @@ class AudioActivityDetector:
 class ContextAnalyzer:
    """Analyzes video context to determine focus and layout."""
-    def __init__(self, person_switch_cooldown: int = 30):
+    def __init__(self, person_switch_cooldown: int = 30, min_face_confidence: float = 0.3):
        self.detector = MediaPipeDetector()
        self.audio_detector = AudioActivityDetector()
        self.previous_faces: List[FaceDetection] = []
        self.min_face_confidence = min_face_confidence
        # Person tracking state
        self.current_selected_people: List[int] = []  # Indices of people currently on screen
@@ -400,9 +414,9 @@ class ContextAnalyzer:
        self.stability_threshold = 20  # Frames needed to confirm a switch (increased for more stability)
        self.last_switched_people: List[int] = []  # People we just switched FROM
        # Focus stability: track recent focus points for temporal smoothing
        self.focus_history: List[Tuple[int, int]] = []
-        self.focus_history_size: int = 5  # Keep last 5 focus points for smoothing
+        self.focus_history_size: int = 20
        self.focus_dead_zone: int = 60
        # Debug logging
        self.frame_log_interval = 30  # Log every N frames
@@ -429,9 +443,11 @@ class ContextAnalyzer:
            FrameContext with detection results
        """
        faces = self.detector.detect_face_landmarks(frame)
        faces = [face for face in faces if face.confidence >= self.min_face_confidence] if faces else []
        if not faces:
            faces = self.detector.detect_faces(frame)
            faces = [face for face in faces if face.confidence >= self.min_face_confidence] if faces else []
        # Determine who is speaking
        active_speakers = []
@@ -440,13 +456,13 @@ class ContextAnalyzer:
        for i, face in enumerate(faces):
            is_speaking = False
-            # Check audio-based speech detection
+            # Prefer visual cues when multiple faces are present.
            if has_audio_speech:
                is_speaking = True
            # Check lip movement (visual speech detection)
            if face.landmarks and len(self.previous_faces) > i:
-                is_speaking = is_speaking or self._detect_lip_movement(face, self.previous_faces[i])
+                is_speaking = self._detect_lip_movement(face, self.previous_faces[i])
            # Audio can confirm speech when there's only one face.
            if has_audio_speech and len(faces) == 1:
                is_speaking = True
            if is_speaking:
                active_speakers.append(i)
@@ -456,6 +472,15 @@ class ContextAnalyzer:
            logger.info(f"Speech detection - Frame {frame_number}: audio_active={has_audio_speech}, "
                       f"speakers={active_speakers}, total_faces={len(faces)}")
        if active_speakers:
            selected_people = active_speakers[:4]
            if len(selected_people) == 1:
                layout_mode = "single"
            elif len(selected_people) == 2:
                layout_mode = "dual_split"
            else:
                layout_mode = "grid"
        else:
            # Select THE person to focus on (always single person)
            # Priority: 1) Who is speaking, 2) Who is most centered
            selected_people = self._select_person_to_focus(
@@ -465,17 +490,23 @@ class ContextAnalyzer:
                frame.shape[1],  # frame width for center calculation
                frame.shape[0]   # frame height for center calculation
            )
        # Always use single-person layout (no split screen)
            layout_mode = "single"
        # Calculate group bounding box for ALL detected faces (multi-person support)
        group_bounds = self._calculate_group_bounding_box(faces)
        # For multi-person mode, use group center as primary focus
        if group_bounds and group_bounds.face_count > 1:
            primary_focus = (group_bounds.center_x, group_bounds.center_y)
        else:
            primary_focus = self._calculate_focus_point(faces, selected_people)
        # Debug logging every N frames
        if frame_number % self.frame_log_interval == 0:
            focus_reason = "speaker" if active_speakers else "no_speech_detected"
            group_info = f", group={group_bounds.face_count} faces" if group_bounds else ""
            logger.info(f"Frame {frame_number}: {len(faces)} faces, "
-                       f"{len(active_speakers)} speakers, focus={selected_people}, reason={focus_reason}")
+                       f"{len(active_speakers)} speakers, focus={selected_people}, reason={focus_reason}{group_info}")
        self.previous_faces = faces
@@ -486,7 +517,8 @@ class ContextAnalyzer:
            active_speakers=active_speakers,
            primary_focus=primary_focus,
            layout_mode=layout_mode,
-            selected_people=selected_people
+            selected_people=selected_people,
            group_bounds=group_bounds
        )
    def _detect_lip_movement(self, current_face: FaceDetection, previous_face: FaceDetection) -> bool:
@@ -543,44 +575,40 @@ class ContextAnalyzer:
            self.current_selected_people = []
            return []
        # If only 1 person, always focus on them
        if len(faces) == 1:
            self.current_selected_people = [0]
            return [0]
        # Check if we can switch people (cooldown period)
        frames_since_last_switch = frame_number - self.last_switch_frame
        can_switch = frames_since_last_switch >= self.person_switch_cooldown
        # Calculate frame center for distance comparison
        frame_center_x = frame_width / 2
        frame_center_y = frame_height / 2
        # ULTRA-STABLE MODE: Select ONE person at start, NEVER switch
        # This completely eliminates switching-related instability
        desired_person_idx = None
-        # If we already have someone selected, ALWAYS KEEP THEM (never switch)
+        if active_speakers:
            if self.current_selected_people and self.current_selected_people[0] in active_speakers:
                desired_person_idx = self.current_selected_people[0]
            else:
                if can_switch or not self.current_selected_people:
                    desired_person_idx = active_speakers[0]
                    if self.current_selected_people and desired_person_idx != self.current_selected_people[0]:
                        logger.info(f"Switching focus to speaker: {desired_person_idx}")
                        self.last_switch_frame = frame_number
                else:
                    desired_person_idx = self.current_selected_people[0] if self.current_selected_people else active_speakers[0]
        else:
            if self.current_selected_people and len(self.current_selected_people) > 0:
                current_idx = self.current_selected_people[0]
                if current_idx < len(faces):
                # Current person still detected - keep them
                    desired_person_idx = current_idx
                else:
                # Current person lost - try to find them again by position/size similarity
                # This handles temporary detection failures
                current_person_found = False
                    if self.previous_faces and current_idx < len(self.previous_faces):
                        prev_face = self.previous_faces[current_idx]
                    # Find most similar face by position and size
                        best_match_idx = None
                        best_match_score = float('inf')
                        for idx, face in enumerate(faces):
                        # Distance between centers
                            dx = face.center_x - prev_face.center_x
                            dy = face.center_y - prev_face.center_y
                            dist = np.sqrt(dx**2 + dy**2)
                        # Size similarity
                            size_diff = abs(face.width - prev_face.width) + abs(face.height - prev_face.height)
                            score = dist + size_diff * 0.5
                            if score < best_match_score:
@@ -589,88 +617,26 @@ class ContextAnalyzer:
                        if best_match_idx is not None and best_match_score < 1000:
                            desired_person_idx = best_match_idx
-                        current_person_found = True
+                        else:
                if not current_person_found:
                    # Really lost - select most confident
                            face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
                            face_confidences.sort(key=lambda x: x[1], reverse=True)
                            desired_person_idx = face_confidences[0][0]
                    logger.warning(f"Current person permanently lost - selecting new: {desired_person_idx}")
                    else:
            # First frame - select most confident person ONCE
                        face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
                        face_confidences.sort(key=lambda x: x[1], reverse=True)
                        desired_person_idx = face_confidences[0][0]
            logger.info(f"INITIAL SELECTION - Person {desired_person_idx} (will be tracked throughout entire video)")
        # IGNORE SPEECH DETECTION - it was causing instability
        # We now track ONE person from start to finish, regardless of who speaks
        # OLD LOGIC (commented out - was causing issues):
        # This logic would switch based on "who is more centered" which caused constant switching
        if False:  # Disabled
            # Calculate distance from center for each face
            center_distances = []
            for idx, face in enumerate(faces):
                # Euclidean distance from frame center
                dx = face.center_x - frame_center_x
                dy = face.center_y - frame_center_y
                distance = np.sqrt(dx**2 + dy**2)
                center_distances.append((idx, distance, face.confidence))
            # Sort by distance (closest first), then by confidence as tiebreaker
            center_distances.sort(key=lambda x: (x[1], -x[2]))
            most_centered_idx = center_distances[0][0]
            most_centered_distance = center_distances[0][1]
            # STICKY BEHAVIOR: If we already have someone selected, only switch if:
            # - New person is SIGNIFICANTLY more centered (30% closer to center)
            # - OR current person is now very far from center (>40% of frame width)
            if self.current_selected_people and len(self.current_selected_people) > 0:
                current_idx = self.current_selected_people[0]
                if current_idx < len(faces):
                    current_face = faces[current_idx]
                    current_dx = current_face.center_x - frame_center_x
                    current_dy = current_face.center_y - frame_center_y
                    current_distance = np.sqrt(current_dx**2 + current_dy**2)
                    # Define "significantly better" threshold
                    max_acceptable_distance = frame_width * 0.4  # 40% of frame width
                    improvement_threshold = 0.7  # New person must be 30% closer (0.7 ratio)
                    # Keep current person if they're still reasonably centered
                    if current_distance < max_acceptable_distance:
                        # Current person is still acceptable - only switch if new is MUCH better
                        if most_centered_distance < current_distance * improvement_threshold:
                            desired_person_idx = most_centered_idx
                            logger.debug(f"Switching: new person MUCH more centered ({most_centered_distance:.0f} vs {current_distance:.0f})")
            else:
-                            desired_person_idx = current_idx  # Keep current
+                face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
-                            logger.debug(f"Keeping current person: still reasonably centered ({current_distance:.0f} px from center)")
+                face_confidences.sort(key=lambda x: x[1], reverse=True)
-                    else:
+                desired_person_idx = face_confidences[0][0]
                        # Current person is too far from center - switch
                        desired_person_idx = most_centered_idx
                        logger.debug(f"Current person too far from center ({current_distance:.0f} px), switching")
                else:
                    # Current selection invalid
                    desired_person_idx = most_centered_idx
            else:
                # First time - select most centered
                desired_person_idx = most_centered_idx
        # Wrap in list for compatibility with existing code
        desired_people = [desired_person_idx] if desired_person_idx is not None else []
        # ULTRA-STABLE MODE: NO SWITCHING LOGIC AT ALL
        # Simply set the person and never change
        if not self.current_selected_people:
            # First time only
            self.current_selected_people = desired_people
            self.last_switch_frame = frame_number
-            logger.info(f"Frame {frame_number}: LOCKED ON person {desired_people} - will never switch")
+            logger.info(f"Frame {frame_number}: Locked on person {desired_people}")
        else:
            # Already have someone - just update to desired (which is same person due to logic above)
            self.current_selected_people = desired_people
        return self.current_selected_people.copy()
@@ -798,24 +764,77 @@ class ContextAnalyzer:
                raw_focus_x = most_confident.center_x
                raw_focus_y = most_confident.center_y
-        # Apply temporal smoothing using focus history
+        if self.focus_history:
            last_x, last_y = self.focus_history[-1]
            dx = abs(raw_focus_x - last_x)
            dy = abs(raw_focus_y - last_y)
            if dx < self.focus_dead_zone and dy < self.focus_dead_zone:
                return self.focus_history[-1]
        self.focus_history.append((raw_focus_x, raw_focus_y))
        if len(self.focus_history) > self.focus_history_size:
            self.focus_history.pop(0)
-        # Calculate smoothed focus as weighted average (more weight to recent frames)
+        if len(self.focus_history) >= 5:
-        if len(self.focus_history) > 1:
+            xs = [x for x, y in self.focus_history]
-            # Exponential weights: recent frames have more influence
+            ys = [y for x, y in self.focus_history]
-            weights = [2 ** i for i in range(len(self.focus_history))]
+            median_x = int(np.median(xs))
-            total_weight = sum(weights)
+            median_y = int(np.median(ys))
-
+            return (median_x, median_y)
            smoothed_x = sum(x * w for (x, y), w in zip(self.focus_history, weights)) / total_weight
            smoothed_y = sum(y * w for (x, y), w in zip(self.focus_history, weights)) / total_weight
            return (int(smoothed_x), int(smoothed_y))
        else:
            return (raw_focus_x, raw_focus_y)
    def _calculate_group_bounding_box(
        self,
        faces: List[FaceDetection],
        padding_percent: float = 0.15,
        max_faces: int = 6
    ) -> Optional[GroupBoundingBox]:
        """
        Calculate bounding box containing all detected faces with padding.
        Args:
            faces: List of detected faces
            padding_percent: Padding around group as percentage of bbox dimensions
            max_faces: Maximum faces to include (use most confident if exceeded)
        Returns:
            GroupBoundingBox or None if no faces
        """
        if not faces:
            return None
        # If too many faces, use most confident ones
        if len(faces) > max_faces:
            faces = sorted(faces, key=lambda f: f.confidence, reverse=True)[:max_faces]
        # Calculate bounding box containing all faces
        min_x = min(f.x for f in faces)
        max_x = max(f.x + f.width for f in faces)
        min_y = min(f.y for f in faces)
        max_y = max(f.y + f.height for f in faces)
        # Add padding
        width = max_x - min_x
        height = max_y - min_y
        pad_x = int(width * padding_percent)
        pad_y = int(height * padding_percent)
        final_x = max(0, min_x - pad_x)
        final_y = max(0, min_y - pad_y)
        final_width = width + 2 * pad_x
        final_height = height + 2 * pad_y
        return GroupBoundingBox(
            x=final_x,
            y=final_y,
            width=final_width,
            height=final_height,
            center_x=final_x + final_width // 2,
            center_y=final_y + final_height // 2,
            face_count=len(faces)
        )
    def close(self):
        """Release resources."""
        self.detector.close()
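The focus-stabilization change above (dead zone plus median over recent focus points) can be read in isolation as the sketch below. It is an illustration under assumptions rather than the committed code: `FocusSmoother` is a hypothetical class, the defaults mirror the values visible in the diff (history of 20, dead zone of 60 px, median once 5 samples exist), and `statistics.median` from the standard library stands in for `np.median`.

```python
from statistics import median
from typing import List, Tuple


class FocusSmoother:
    """Dead-zone plus median smoothing of a focus point (hypothetical sketch)."""

    def __init__(self, history_size: int = 20, dead_zone: int = 60, min_samples: int = 5):
        self.history: List[Tuple[int, int]] = []
        self.history_size = history_size
        self.dead_zone = dead_zone
        self.min_samples = min_samples

    def update(self, raw_x: int, raw_y: int) -> Tuple[int, int]:
        # Small movements inside the dead zone are ignored: keep the last point.
        if self.history:
            last_x, last_y = self.history[-1]
            if abs(raw_x - last_x) < self.dead_zone and abs(raw_y - last_y) < self.dead_zone:
                return self.history[-1]

        # Record the new point and drop the oldest one when the window is full.
        self.history.append((raw_x, raw_y))
        if len(self.history) > self.history_size:
            self.history.pop(0)

        # With enough samples, the per-axis median damps single-frame outliers.
        if len(self.history) >= self.min_samples:
            xs = [x for x, _ in self.history]
            ys = [y for _, y in self.history]
            return (int(median(xs)), int(median(ys)))
        return (raw_x, raw_y)
```

Compared with the removed exponentially weighted average, a median is less sensitive to a single misdetected frame, which is one plausible reason for the switch.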
--- a/video_render/llm.py
+++ b/video_render/llm.py
@@ -137,11 +137,11 @@ class OpenRouterCopywriter:
                    continue
                duration = end - start
-                if duration < 45:
+                if duration < 60:
                    logger.warning(f"Highlight ignorado: muito curto ({duration}s, minimo 60s)")
                    continue
-                if duration > 90:
+                if duration > 120:
                    logger.warning(f"Highlight ignorado: muito longo ({duration}s, maximo 120s)")
                    continue
--- a/video_render/rendering.py
+++ b/video_render/rendering.py
@@ -347,7 +347,12 @@ class VideoRenderer:
            frame_skip=settings.rendering.smart_framing_frame_skip,
            smoothing_window=settings.rendering.smart_framing_smoothing_window,
            max_velocity=settings.rendering.smart_framing_max_velocity,
-            person_switch_cooldown=settings.rendering.smart_framing_person_switch_cooldown
+            person_switch_cooldown=settings.rendering.smart_framing_person_switch_cooldown,
            response_time=settings.rendering.smart_framing_response_time,
            group_padding=settings.rendering.smart_framing_group_padding,
            max_zoom_out=settings.rendering.smart_framing_max_zoom_out,
            dead_zone=settings.rendering.smart_framing_dead_zone,
            min_face_confidence=settings.rendering.smart_framing_min_confidence
        )
    def render(
--- a/video_render/smart_framing.py
+++ b/video_render/smart_framing.py