Adjust context analysis, speech and focus handling; fix video jitter and other bugs

LeoMortari
2026-01-03 19:42:23 -03:00
parent c1914dad00
commit 3f7329869d
7 changed files with 932 additions and 455 deletions

View File

@@ -9,7 +9,7 @@ services:
       - RABBITMQ_PASS=${RABBITMQ_PASS}
       - OPENROUTER_API_URL=${OPENROUTER_API_URL:-https://openrouter.ai/api/v1/chat/completions}
       - OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
-      - OPENROUTER_MODEL=${OPENROUTER_MODEL:-openai/gpt-oss-20b:free}
+      - OPENROUTER_MODEL=${OPENROUTER_MODEL:-mistralai/mistral-small-3.1-24b-instruct:free}
      - OPENROUTER_PROMPT_PATH=${OPENROUTER_PROMPT_PATH:-prompts/generate.txt}
      - FASTER_WHISPER_MODEL_SIZE=${FASTER_WHISPER_MODEL_SIZE:-medium}
      - SMART_FRAMING_SMOOTHING_WINDOW=${SMART_FRAMING_SMOOTHING_WINDOW:-30}
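The `${VAR:-default}` entries give each variable a compose-level fallback, and the application presumably applies its own default again when the variable is missing entirely. A minimal sketch of that two-step resolution, assuming the service simply reads the value from the process environment:

```python
import os

# The container receives OPENROUTER_MODEL from compose (or the ":-" fallback above);
# if it is absent, the application side can fall back to the same default once more.
model = os.environ.get(
    "OPENROUTER_MODEL",
    "mistralai/mistral-small-3.1-24b-instruct:free",  # mirrors the new compose default
)
print(f"Using OpenRouter model: {model}")
```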

View File

@@ -1,118 +1,111 @@
-Você é especialista em viralidade de redes sociais (TikTok, Instagram Reels, YouTube Shorts). Sua missão: EXTRAIR O MÁXIMO de clips virais possíveis, priorizando QUANTIDADE + QUALIDADE.
-🎯 OBJETIVO: Transformar cada vídeo em MÚLTIPLOS clips que podem viralizar
-PROCESSO DE ANÁLISE:
-1. Mapear TODOS os potenciais trechos virais na transcrição
-2. Avaliar cada trecho usando sistema de pontuação abaixo
-3. Rankear do maior para menor score viral
-4. Selecionar TODOS os trechos com score ≥ 60 (não seja conservador!)
-SISTEMA DE PONTUAÇÃO VIRAL (0-100 pontos):
-🪝 GANCHO INICIAL (0-30 pontos) - CRÍTICO PARA VIRALIZAÇÃO:
-[30] Frase CHOCANTE, pergunta POLÊMICA ou promessa OUSADA nos primeiros 3 segundos
-[25] Hook forte: "Você não vai acreditar...", "O segredo que ninguém conta...", "Isso mudou tudo..."
-[20] Pergunta intrigante ou afirmação controversa
-[15] História interessante mas gancho fraco
-[10] Início genérico mas aceitável
-[0] "Oi", "então", "bem", silêncio - DESCARTAR
-🔥 GATILHO EMOCIONAL (0-25 pontos):
-[25] Emoção EXTREMA: raiva, choque, riso intenso, WTF moment, revelação bombástica
-[20] Emoção forte: surpresa, indignação, humor, curiosidade intensa
-[15] Emoção moderada: interesse, leve humor, insight interessante
-[10] Emoção fraca: informativo sem impacto
-[0] Monótono, técnico, sem apelo emocional - EVITAR
-💎 VALOR/UTILIDADE (0-20 pontos):
-[20] Segredo VALIOSO, insight transformador, informação EXCLUSIVA
-[15] Ensina algo prático e IMEDIATAMENTE aplicável
-[10] Opinião interessante ou perspectiva única
-[5] Informação genérica ou conhecimento comum
-[0] Nenhum valor prático, puro "enrolation" - DESCARTAR
-📖 ESTRUTURA NARRATIVA (0-15 pontos):
-[15] História COMPLETA com início, conflito/clímax e resolução satisfatória
-[10] Segmento com começo e fim coerentes, faz sentido isolado
-[5] Trecho com sentido mas cortado abruptamente
-[0] Fragmento sem contexto - NÃO USAR
-⚡ RITMO E ENERGIA (0-10 pontos):
-[10] DINÂMICO, sem pausas longas, alta energia, palavras impactantes
-[7] Bom ritmo com pausas naturais curtas (< 2s)
-[3] Ritmo lento mas aceitável
-[0] Muitas pausas (> 3s), hesitações, monotonia - EVITAR
-REGRAS DE QUANTIDADE (SER AGRESSIVO):
-📊 Quantidade MÍNIMA por duração:
-- 5-10 min: MÍNIMO 4-6 clips
-- 10-15 min: MÍNIMO 6-8 clips
-- 15-20 min: MÍNIMO 8-10 clips
-- 20-30 min: MÍNIMO 10-15 clips
-- 30+ min: MÍNIMO 15-20 clips
-🎯 REGRA DE OURO: 1 clip a cada 2-3 minutos de vídeo (NO MÍNIMO)
-- Se encontrar momentos virais, SEMPRE selecione!
-- Melhor ter 3 clips perfeitos que 10 clips bons
-CRITÉRIOS DE SELEÇÃO:
-- Score viral ≥ 60 pontos (idealmente ≥ 70)
-- Duração ideal: 60-120s (formato ideal para Reels/Shorts)
-- Duração mínima: 60s | Duração máxima: 120s
-- Sem sobreposição temporal
-- DEVE ter gancho forte nos primeiros 3 segundos
-- Início e fim coerentes
-GANCHOS QUE FAZEM VIRALIZAR (use como filtro):
-- "O que ninguém te conta sobre..."
-- "O erro que 90% das pessoas cometem..."
-- "Você não vai acreditar o que aconteceu..."
-- Revelações chocantes ou contraintuitivas
-- Antes vs Depois, transformações
-- Segredos, bastidores, verdades ocultas
-- Polêmicas, opiniões fortes, hot takes
-- Histórias dramáticas com reviravolta
-- Dicas práticas e acionáveis
-- Momentos de humor genuíno
-❌ EVITE (mas não descarte se score alto):
-- Introduções genéricos SEM gancho
-- Trechos com pausas > 3s consecutivas
-- Explicações técnicas SEM gancho emocional
-- Segmentos sem conclusão clara
-- Momentos de transição vazios
-FORMATO JSON (retorne APENAS isto, SEM texto adicional):
-{
-  "highlights": [
-    {
-      "start": <float>,
-      "end": <float>,
-      "summary": "Score: XX/100 | Gancho: [descreva] | Gatilho: [descreva]"
-    }
-  ]
-}
-REGRAS TÉCNICAS:
-- Float com ponto decimal (45.5 NÃO 45,5)
-- Timestamps exatos dos segments fornecidos
-- Ordem cronológica (start crescente)
-- Summary conciso mas informativo (2-3 frases)
-TAREFA PASSO A PASSO:
-1. Leia transcrição completa
-2. Identifique TODOS os momentos potencialmente virais
-3. Avalie e pontue cada trecho (seja generoso!)
-4. Rankear por score viral
-5. Selecione TODOS com score ≥ 60
-6. Garanta mínimo de 1 clip a cada 5 minutos
-7. Retorne JSON completo
-⚠️ IMPORTANTE:
-- NÃO seja conservador! Se encontrou 10 momentos bons, retorne os 10!
-- Pense em MAXIMIZAR alcance: mais clips = mais chances de viralizar
-- Se vídeo tem conteúdo fraco, seja criterioso, mas SEMPRE retorne pelo menos 3-5 clips
-- Priorize clips com GANCHOS FORTES - gancho fraco = baixo alcance
-🎯 MINDSET: Você é um criador de conteúdo viral. Seu objetivo é extrair MÁXIMO valor do vídeo original.
+# TAREFA: Extrair clips virais de uma transcrição de vídeo
+Você é um especialista em conteúdo viral para TikTok, Instagram Reels e YouTube Shorts.
+## REGRA MAIS IMPORTANTE - DURAÇÃO DOS CLIPS
+**CADA CLIP DEVE TER ENTRE 60 E 120 SEGUNDOS DE DURAÇÃO.**
+- MÍNIMO ABSOLUTO: 60 segundos (end - start >= 60)
+- MÁXIMO: 120 segundos (end - start <= 120)
+- IDEAL: 60-90 segundos
+**CLIPS COM MENOS DE 60 SEGUNDOS SERÃO REJEITADOS PELO SISTEMA.**
+Antes de incluir um clip, SEMPRE calcule: end - start >= 60
+## QUANTIDADE DE CLIPS
+Baseado na duração total do vídeo:
+- Até 10 min: 2-4 clips
+- 10-20 min: 4-6 clips
+- 20-30 min: 6-10 clips
+- 30+ min: 8-15 clips
+## CRITÉRIOS DE SELEÇÃO
+Um bom clip viral possui:
+1. GANCHO FORTE nos primeiros 3 segundos (pergunta, afirmação chocante, promessa)
+2. EMOÇÃO (humor, surpresa, indignação, curiosidade)
+3. VALOR (ensina algo, revela segredo, dá dica prática)
+4. ESTRUTURA (início, meio e fim coerentes)
+5. RITMO (sem pausas longas, dinâmico)
+## O QUE EVITAR
+- Introduções genéricas ("oi pessoal", "então", "bem")
+- Trechos com pausas longas (> 3 segundos de silêncio)
+- Segmentos sem contexto ou conclusão
+- Explicações técnicas monótonas
+## FORMATO DE RESPOSTA
+Retorne APENAS um JSON válido, sem texto antes ou depois:
+```json
+{
+  "highlights": [
+    {
+      "start": 0.0,
+      "end": 75.0,
+      "summary": "Descrição do que acontece neste trecho"
+    },
+    {
+      "start": 120.5,
+      "end": 195.0,
+      "summary": "Descrição do que acontece neste trecho"
+    }
+  ]
+}
+```
+## REGRAS DO JSON
+- "start" e "end" são números decimais (float) em SEGUNDOS
+- Use ponto como separador decimal (60.5, não 60,5)
+- "summary" é uma descrição breve do conteúdo (1-2 frases)
+- Clips em ordem cronológica (start crescente)
+- Clips não podem se sobrepor
+## CHECKLIST ANTES DE RESPONDER
+Para CADA clip, verifique:
+- [ ] end - start >= 60 segundos?
+- [ ] end - start <= 120 segundos?
+- [ ] Tem gancho forte no início?
+- [ ] Faz sentido isolado do resto do vídeo?
+- [ ] JSON está válido?
+## EXEMPLO
+Se o vídeo tem 15 minutos e você encontrou 4 momentos virais:
+```json
+{
+  "highlights": [
+    {
+      "start": 60.0,
+      "end": 120.0,
+      "summary": "Revelação sobre como economizar 50% nas compras"
+    },
+    {
+      "start": 180.0,
+      "end": 255.0,
+      "summary": "História engraçada sobre cliente que tentou enganar a loja"
+    },
+    {
+      "start": 400.0,
+      "end": 480.0,
+      "summary": "Dica prática de negociação com fornecedores"
+    },
+    {
+      "start": 600.0,
+      "end": 690.0,
+      "summary": "Conclusão motivacional sobre empreendedorismo"
+    }
+  ]
+}
+```
+Agora analise a transcrição fornecida e extraia os clips virais seguindo estas instruções.
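The rewritten prompt pins down the contract the pipeline expects: 60-120 s clips, chronological order, no overlaps, JSON only. As an illustration (a hypothetical helper, not the project's actual parser), a model response could be checked against that contract like this:

```python
import json

def validate_highlights(raw: str) -> list[dict]:
    """Parse the model response and keep only clips that satisfy the prompt contract."""
    data = json.loads(raw)
    valid = []
    previous_end = 0.0
    for clip in data.get("highlights", []):
        start, end = float(clip["start"]), float(clip["end"])
        duration = end - start
        if not 60 <= duration <= 120:      # duration rule from the prompt
            continue
        if start < previous_end:           # must be chronological and non-overlapping
            continue
        valid.append({"start": start, "end": end, "summary": clip.get("summary", "")})
        previous_end = end
    return valid
```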

View File

@@ -62,13 +62,16 @@ class RenderingSettings:
     subtitle_font_size: int = int(os.environ.get("RENDER_SUBTITLE_FONT_SIZE", 64))
     caption_min_words: int = int(os.environ.get("CAPTION_MIN_WORDS", 2))
     caption_max_words: int = int(os.environ.get("CAPTION_MAX_WORDS", 2))
-    # Smart framing settings - CONTAINMENT TRACKING mode
     enable_smart_framing: bool = os.environ.get("ENABLE_SMART_FRAMING", "true").lower() in ("true", "1", "yes")
-    smart_framing_min_confidence: float = float(os.environ.get("SMART_FRAMING_MIN_CONFIDENCE", 0.3))  # Lowered for better cartoon detection
-    smart_framing_smoothing_window: int = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30))  # Reduced - not needed with containment tracking
-    smart_framing_frame_skip: int = int(os.environ.get("SMART_FRAMING_FRAME_SKIP", 1))  # Process every frame for smooth 30 FPS tracking
-    smart_framing_max_velocity: int = int(os.environ.get("SMART_FRAMING_MAX_VELOCITY", 20))  # Moderate - only used during transitions
-    smart_framing_person_switch_cooldown: int = int(os.environ.get("SMART_FRAMING_PERSON_SWITCH_COOLDOWN", 999999))  # DISABLED - never switch people
+    smart_framing_min_confidence: float = float(os.environ.get("SMART_FRAMING_MIN_CONFIDENCE", 0.3))
+    smart_framing_smoothing_window: int = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30))
+    smart_framing_frame_skip: int = int(os.environ.get("SMART_FRAMING_FRAME_SKIP", 1))
+    smart_framing_max_velocity: int = int(os.environ.get("SMART_FRAMING_MAX_VELOCITY", 25))
+    smart_framing_person_switch_cooldown: int = int(os.environ.get("SMART_FRAMING_PERSON_SWITCH_COOLDOWN", 30))
+    smart_framing_response_time: float = float(os.environ.get("SMART_FRAMING_RESPONSE_TIME", 0.6))
+    smart_framing_group_padding: float = float(os.environ.get("SMART_FRAMING_GROUP_PADDING", 0.15))
+    smart_framing_max_zoom_out: float = float(os.environ.get("SMART_FRAMING_MAX_ZOOM_OUT", 2.0))
+    smart_framing_dead_zone: int = int(os.environ.get("SMART_FRAMING_DEAD_ZONE", 60))

 @dataclass(frozen=True)
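All of the new knobs are read from the environment with a default, so they can be tuned per deployment without touching code. A small sketch, assuming the overrides are exported before the settings module is imported (the field defaults above are evaluated at import time):

```python
import os

# Hypothetical overrides for a jittery source video: widen the dead zone and
# slow the response so the virtual camera moves less often.
os.environ.setdefault("SMART_FRAMING_DEAD_ZONE", "80")
os.environ.setdefault("SMART_FRAMING_RESPONSE_TIME", "0.8")
os.environ.setdefault("SMART_FRAMING_MAX_ZOOM_OUT", "1.8")
```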

View File

@@ -41,6 +41,18 @@ class PersonTracking:
     frame_number: int


+@dataclass
+class GroupBoundingBox:
+    """Bounding box containing all tracked faces."""
+    x: int
+    y: int
+    width: int
+    height: int
+    center_x: int
+    center_y: int
+    face_count: int
+
+
 @dataclass
 class FrameContext:
     """Context information for a video frame."""
@@ -50,7 +62,8 @@ class FrameContext:
     active_speakers: List[int]  # indices of speaking faces
     primary_focus: Optional[Tuple[int, int]]  # (x, y) center point
     layout_mode: str  # "single", "dual_split", "grid"
-    selected_people: List[int] = field(default_factory=list)  # indices of people selected for display (max 2)
+    selected_people: List[int] = field(default_factory=list)  # indices of people selected for display
+    group_bounds: Optional[GroupBoundingBox] = None  # bounding box for all detected faces


 class MediaPipeDetector:
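`GroupBoundingBox` gives downstream code a single rectangle covering every detected face. As an illustration of how such a box might be consumed (a hypothetical helper; the real cropping presumably lives in the smart-framing code whose diff is suppressed below), the smallest crop with a target aspect ratio that still contains the group can be derived like this:

```python
import math

def crop_for_group(box, frame_w: int, frame_h: int, target_aspect: float = 9 / 16):
    """Smallest crop with the target aspect ratio that still contains the group box."""
    # Grow the height until the target aspect ratio can also cover the group's width.
    crop_h = max(box.height, math.ceil(box.width / target_aspect))
    crop_w = math.ceil(crop_h * target_aspect)
    # Centre the crop on the group and clamp its origin to the frame.
    x = min(max(0, box.center_x - crop_w // 2), max(0, frame_w - crop_w))
    y = min(max(0, box.center_y - crop_h // 2), max(0, frame_h - crop_h))
    return x, y, crop_w, crop_h
```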
@@ -385,10 +398,11 @@
 class ContextAnalyzer:
     """Analyzes video context to determine focus and layout."""

-    def __init__(self, person_switch_cooldown: int = 30):
+    def __init__(self, person_switch_cooldown: int = 30, min_face_confidence: float = 0.3):
         self.detector = MediaPipeDetector()
         self.audio_detector = AudioActivityDetector()
         self.previous_faces: List[FaceDetection] = []
+        self.min_face_confidence = min_face_confidence

         # Person tracking state
         self.current_selected_people: List[int] = []  # Indices of people currently on screen
@@ -400,9 +414,9 @@ class ContextAnalyzer:
         self.stability_threshold = 20  # Frames needed to confirm a switch (increased for more stability)
         self.last_switched_people: List[int] = []  # People we just switched FROM

-        # Focus stability: track recent focus points for temporal smoothing
         self.focus_history: List[Tuple[int, int]] = []
-        self.focus_history_size: int = 5  # Keep last 5 focus points for smoothing
+        self.focus_history_size: int = 20
+        self.focus_dead_zone: int = 60

         # Debug logging
         self.frame_log_interval = 30  # Log every N frames
@@ -429,9 +443,11 @@
             FrameContext with detection results
         """
         faces = self.detector.detect_face_landmarks(frame)
+        faces = [face for face in faces if face.confidence >= self.min_face_confidence] if faces else []
         if not faces:
             faces = self.detector.detect_faces(frame)
+            faces = [face for face in faces if face.confidence >= self.min_face_confidence] if faces else []

         # Determine who is speaking
         active_speakers = []
@@ -440,13 +456,13 @@ class ContextAnalyzer:
         for i, face in enumerate(faces):
             is_speaking = False

-            # Check audio-based speech detection
-            if has_audio_speech:
-                is_speaking = True
-
-            # Check lip movement (visual speech detection)
+            # Prefer visual cues when multiple faces are present.
             if face.landmarks and len(self.previous_faces) > i:
-                is_speaking = is_speaking or self._detect_lip_movement(face, self.previous_faces[i])
+                is_speaking = self._detect_lip_movement(face, self.previous_faces[i])
+
+            # Audio can confirm speech when there's only one face.
+            if has_audio_speech and len(faces) == 1:
+                is_speaking = True

             if is_speaking:
                 active_speakers.append(i)
@@ -456,6 +472,15 @@ class ContextAnalyzer:
             logger.info(f"Speech detection - Frame {frame_number}: audio_active={has_audio_speech}, "
                         f"speakers={active_speakers}, total_faces={len(faces)}")

-        # Select THE person to focus on (always single person)
-        # Priority: 1) Who is speaking, 2) Who is most centered
-        selected_people = self._select_person_to_focus(
+        if active_speakers:
+            selected_people = active_speakers[:4]
+            if len(selected_people) == 1:
+                layout_mode = "single"
+            elif len(selected_people) == 2:
+                layout_mode = "dual_split"
+            else:
+                layout_mode = "grid"
+        else:
+            # Priority: 1) Who is speaking, 2) Who is most centered
+            selected_people = self._select_person_to_focus(
@@ -465,17 +490,23 @@
-            frame.shape[1],  # frame width for center calculation
-            frame.shape[0]   # frame height for center calculation
-        )
-        # Always use single-person layout (no split screen)
-        layout_mode = "single"
+                frame.shape[1],  # frame width for center calculation
+                frame.shape[0]   # frame height for center calculation
+            )
+            layout_mode = "single"

-        primary_focus = self._calculate_focus_point(faces, selected_people)
+        # Calculate group bounding box for ALL detected faces (multi-person support)
+        group_bounds = self._calculate_group_bounding_box(faces)
+
+        # For multi-person mode, use group center as primary focus
+        if group_bounds and group_bounds.face_count > 1:
+            primary_focus = (group_bounds.center_x, group_bounds.center_y)
+        else:
+            primary_focus = self._calculate_focus_point(faces, selected_people)

         # Debug logging every N frames
         if frame_number % self.frame_log_interval == 0:
             focus_reason = "speaker" if active_speakers else "no_speech_detected"
+            group_info = f", group={group_bounds.face_count} faces" if group_bounds else ""
             logger.info(f"Frame {frame_number}: {len(faces)} faces, "
-                        f"{len(active_speakers)} speakers, focus={selected_people}, reason={focus_reason}")
+                        f"{len(active_speakers)} speakers, focus={selected_people}, reason={focus_reason}{group_info}")

         self.previous_faces = faces
@@ -486,7 +517,8 @@
             active_speakers=active_speakers,
             primary_focus=primary_focus,
             layout_mode=layout_mode,
-            selected_people=selected_people
+            selected_people=selected_people,
+            group_bounds=group_bounds
         )

     def _detect_lip_movement(self, current_face: FaceDetection, previous_face: FaceDetection) -> bool:
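With the change above, the layout is derived directly from how many confirmed speakers there are, capped at four. The rule in isolation, as a minimal sketch (the empty case falls back to the single-person selection handled elsewhere in the analyzer):

```python
def choose_layout(active_speakers: list[int]) -> tuple[list[int], str]:
    """Mirror of the rule above: cap at 4 speakers and pick a layout by count."""
    selected = active_speakers[:4]          # at most four people on screen
    if len(selected) <= 1:
        return selected, "single"           # empty list is handled by the fallback path
    if len(selected) == 2:
        return selected, "dual_split"
    return selected, "grid"

# choose_layout([0, 2])    -> ([0, 2], "dual_split")
# choose_layout([1, 3, 4]) -> ([1, 3, 4], "grid")
```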
@@ -543,44 +575,40 @@
             self.current_selected_people = []
             return []

-        # If only 1 person, always focus on them
         if len(faces) == 1:
             self.current_selected_people = [0]
             return [0]

-        # Check if we can switch people (cooldown period)
         frames_since_last_switch = frame_number - self.last_switch_frame
         can_switch = frames_since_last_switch >= self.person_switch_cooldown

-        # Calculate frame center for distance comparison
-        frame_center_x = frame_width / 2
-        frame_center_y = frame_height / 2
-
-        # ULTRA-STABLE MODE: Select ONE person at start, NEVER switch
-        # This completely eliminates switching-related instability
         desired_person_idx = None

-        # If we already have someone selected, ALWAYS KEEP THEM (never switch)
+        if active_speakers:
+            if self.current_selected_people and self.current_selected_people[0] in active_speakers:
+                desired_person_idx = self.current_selected_people[0]
+            else:
+                if can_switch or not self.current_selected_people:
+                    desired_person_idx = active_speakers[0]
+                    if self.current_selected_people and desired_person_idx != self.current_selected_people[0]:
+                        logger.info(f"Switching focus to speaker: {desired_person_idx}")
+                        self.last_switch_frame = frame_number
+                else:
+                    desired_person_idx = self.current_selected_people[0] if self.current_selected_people else active_speakers[0]
+        else:
             if self.current_selected_people and len(self.current_selected_people) > 0:
                 current_idx = self.current_selected_people[0]
                 if current_idx < len(faces):
-                    # Current person still detected - keep them
                     desired_person_idx = current_idx
                 else:
-                    # Current person lost - try to find them again by position/size similarity
-                    # This handles temporary detection failures
-                    current_person_found = False
                     if self.previous_faces and current_idx < len(self.previous_faces):
                         prev_face = self.previous_faces[current_idx]
-                        # Find most similar face by position and size
                         best_match_idx = None
                         best_match_score = float('inf')
                         for idx, face in enumerate(faces):
-                            # Distance between centers
                             dx = face.center_x - prev_face.center_x
                             dy = face.center_y - prev_face.center_y
                             dist = np.sqrt(dx**2 + dy**2)
-                            # Size similarity
                             size_diff = abs(face.width - prev_face.width) + abs(face.height - prev_face.height)
                             score = dist + size_diff * 0.5
                             if score < best_match_score:
@@ -589,88 +617,26 @@
                         if best_match_idx is not None and best_match_score < 1000:
                             desired_person_idx = best_match_idx
-                            current_person_found = True
-
-                        if not current_person_found:
-                            # Really lost - select most confident
+                        else:
                             face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
                             face_confidences.sort(key=lambda x: x[1], reverse=True)
                             desired_person_idx = face_confidences[0][0]
-                            logger.warning(f"Current person permanently lost - selecting new: {desired_person_idx}")
                     else:
-                        # First frame - select most confident person ONCE
                         face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
                         face_confidences.sort(key=lambda x: x[1], reverse=True)
                         desired_person_idx = face_confidences[0][0]
-                        logger.info(f"INITIAL SELECTION - Person {desired_person_idx} (will be tracked throughout entire video)")
-
-        # IGNORE SPEECH DETECTION - it was causing instability
-        # We now track ONE person from start to finish, regardless of who speaks
-
-        # OLD LOGIC (commented out - was causing issues):
-        # This logic would switch based on "who is more centered" which caused constant switching
-        if False:  # Disabled
-            # Calculate distance from center for each face
-            center_distances = []
-            for idx, face in enumerate(faces):
-                # Euclidean distance from frame center
-                dx = face.center_x - frame_center_x
-                dy = face.center_y - frame_center_y
-                distance = np.sqrt(dx**2 + dy**2)
-                center_distances.append((idx, distance, face.confidence))
-
-            # Sort by distance (closest first), then by confidence as tiebreaker
-            center_distances.sort(key=lambda x: (x[1], -x[2]))
-            most_centered_idx = center_distances[0][0]
-            most_centered_distance = center_distances[0][1]
-
-            # STICKY BEHAVIOR: If we already have someone selected, only switch if:
-            # - New person is SIGNIFICANTLY more centered (30% closer to center)
-            # - OR current person is now very far from center (>40% of frame width)
-            if self.current_selected_people and len(self.current_selected_people) > 0:
-                current_idx = self.current_selected_people[0]
-                if current_idx < len(faces):
-                    current_face = faces[current_idx]
-                    current_dx = current_face.center_x - frame_center_x
-                    current_dy = current_face.center_y - frame_center_y
-                    current_distance = np.sqrt(current_dx**2 + current_dy**2)
-
-                    # Define "significantly better" threshold
-                    max_acceptable_distance = frame_width * 0.4  # 40% of frame width
-                    improvement_threshold = 0.7  # New person must be 30% closer (0.7 ratio)
-
-                    # Keep current person if they're still reasonably centered
-                    if current_distance < max_acceptable_distance:
-                        # Current person is still acceptable - only switch if new is MUCH better
-                        if most_centered_distance < current_distance * improvement_threshold:
-                            desired_person_idx = most_centered_idx
-                            logger.debug(f"Switching: new person MUCH more centered ({most_centered_distance:.0f} vs {current_distance:.0f})")
-                        else:
-                            desired_person_idx = current_idx  # Keep current
-                            logger.debug(f"Keeping current person: still reasonably centered ({current_distance:.0f} px from center)")
-                    else:
-                        # Current person is too far from center - switch
-                        desired_person_idx = most_centered_idx
-                        logger.debug(f"Current person too far from center ({current_distance:.0f} px), switching")
-                else:
-                    # Current selection invalid
-                    desired_person_idx = most_centered_idx
-            else:
-                # First time - select most centered
-                desired_person_idx = most_centered_idx
-
-        # Wrap in list for compatibility with existing code
+            else:
+                face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
+                face_confidences.sort(key=lambda x: x[1], reverse=True)
+                desired_person_idx = face_confidences[0][0]
+
         desired_people = [desired_person_idx] if desired_person_idx is not None else []

-        # ULTRA-STABLE MODE: NO SWITCHING LOGIC AT ALL
-        # Simply set the person and never change
         if not self.current_selected_people:
-            # First time only
             self.current_selected_people = desired_people
             self.last_switch_frame = frame_number
-            logger.info(f"Frame {frame_number}: LOCKED ON person {desired_people} - will never switch")
+            logger.info(f"Frame {frame_number}: Locked on person {desired_people}")
         else:
-            # Already have someone - just update to desired (which is same person due to logic above)
            self.current_selected_people = desired_people

         return self.current_selected_people.copy()
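The re-acquisition step above scores every candidate face against the last known one (centre distance plus half the size difference) and only accepts a match below a threshold of 1000. The same matching step as a standalone sketch, with a hypothetical face object exposing the fields used above:

```python
import numpy as np

def reacquire(prev_face, faces, max_score: float = 1000.0):
    """Return the index of the face most similar to prev_face, or None if nothing is close enough."""
    best_idx, best_score = None, float("inf")
    for idx, face in enumerate(faces):
        dist = np.hypot(face.center_x - prev_face.center_x,
                        face.center_y - prev_face.center_y)
        size_diff = abs(face.width - prev_face.width) + abs(face.height - prev_face.height)
        score = dist + size_diff * 0.5          # same weighting as the tracker
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score < max_score else None
```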
@@ -798,24 +764,77 @@
         raw_focus_x = most_confident.center_x
         raw_focus_y = most_confident.center_y

-        # Apply temporal smoothing using focus history
+        if self.focus_history:
+            last_x, last_y = self.focus_history[-1]
+            dx = abs(raw_focus_x - last_x)
+            dy = abs(raw_focus_y - last_y)
+            if dx < self.focus_dead_zone and dy < self.focus_dead_zone:
+                return self.focus_history[-1]
+
         self.focus_history.append((raw_focus_x, raw_focus_y))
         if len(self.focus_history) > self.focus_history_size:
             self.focus_history.pop(0)

-        # Calculate smoothed focus as weighted average (more weight to recent frames)
-        if len(self.focus_history) > 1:
-            # Exponential weights: recent frames have more influence
-            weights = [2 ** i for i in range(len(self.focus_history))]
-            total_weight = sum(weights)
-
-            smoothed_x = sum(x * w for (x, y), w in zip(self.focus_history, weights)) / total_weight
-            smoothed_y = sum(y * w for (x, y), w in zip(self.focus_history, weights)) / total_weight
-
-            return (int(smoothed_x), int(smoothed_y))
+        if len(self.focus_history) >= 5:
+            xs = [x for x, y in self.focus_history]
+            ys = [y for x, y in self.focus_history]
+            median_x = int(np.median(xs))
+            median_y = int(np.median(ys))
+            return (median_x, median_y)
         else:
             return (raw_focus_x, raw_focus_y)

+    def _calculate_group_bounding_box(
+        self,
+        faces: List[FaceDetection],
+        padding_percent: float = 0.15,
+        max_faces: int = 6
+    ) -> Optional[GroupBoundingBox]:
+        """
+        Calculate bounding box containing all detected faces with padding.
+
+        Args:
+            faces: List of detected faces
+            padding_percent: Padding around group as percentage of bbox dimensions
+            max_faces: Maximum faces to include (use most confident if exceeded)
+
+        Returns:
+            GroupBoundingBox or None if no faces
+        """
+        if not faces:
+            return None
+
+        # If too many faces, use most confident ones
+        if len(faces) > max_faces:
+            faces = sorted(faces, key=lambda f: f.confidence, reverse=True)[:max_faces]
+
+        # Calculate bounding box containing all faces
+        min_x = min(f.x for f in faces)
+        max_x = max(f.x + f.width for f in faces)
+        min_y = min(f.y for f in faces)
+        max_y = max(f.y + f.height for f in faces)
+
+        # Add padding
+        width = max_x - min_x
+        height = max_y - min_y
+        pad_x = int(width * padding_percent)
+        pad_y = int(height * padding_percent)
+
+        final_x = max(0, min_x - pad_x)
+        final_y = max(0, min_y - pad_y)
+        final_width = width + 2 * pad_x
+        final_height = height + 2 * pad_y
+
+        return GroupBoundingBox(
+            x=final_x,
+            y=final_y,
+            width=final_width,
+            height=final_height,
+            center_x=final_x + final_width // 2,
+            center_y=final_y + final_height // 2,
+            face_count=len(faces)
+        )
+
     def close(self):
         """Release resources."""
         self.detector.close()
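The jitter fix ("tremulação" in the commit message) replaces the exponentially weighted average with a dead zone plus a rolling median over the last 20 focus points: movements under 60 px are ignored outright, and a single bad detection can no longer drag the crop around. The same idea as a self-contained sketch:

```python
from collections import deque
import numpy as np

class FocusSmoother:
    """Dead-zone + rolling-median smoothing, mirroring the logic above."""

    def __init__(self, dead_zone: int = 60, history_size: int = 20):
        self.dead_zone = dead_zone
        self.history = deque(maxlen=history_size)

    def update(self, x: int, y: int) -> tuple[int, int]:
        if self.history:
            last_x, last_y = self.history[-1]
            # Small movements stay inside the dead zone and are ignored entirely.
            if abs(x - last_x) < self.dead_zone and abs(y - last_y) < self.dead_zone:
                return last_x, last_y
        self.history.append((x, y))
        if len(self.history) >= 5:
            xs, ys = zip(*self.history)
            return int(np.median(xs)), int(np.median(ys))
        return x, y
```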

View File

@@ -137,11 +137,11 @@ class OpenRouterCopywriter:
                 continue

             duration = end - start
-            if duration < 45:
+            if duration < 60:
                 logger.warning(f"Highlight ignorado: muito curto ({duration}s, minimo 45s)")
                 continue
-            if duration > 90:
+            if duration > 120:
                 logger.warning(f"Highlight ignorado: muito longo ({duration}s, maximo 90s)")
                 continue

View File

@@ -347,7 +347,12 @@ class VideoRenderer:
             frame_skip=settings.rendering.smart_framing_frame_skip,
             smoothing_window=settings.rendering.smart_framing_smoothing_window,
             max_velocity=settings.rendering.smart_framing_max_velocity,
-            person_switch_cooldown=settings.rendering.smart_framing_person_switch_cooldown
+            person_switch_cooldown=settings.rendering.smart_framing_person_switch_cooldown,
+            response_time=settings.rendering.smart_framing_response_time,
+            group_padding=settings.rendering.smart_framing_group_padding,
+            max_zoom_out=settings.rendering.smart_framing_max_zoom_out,
+            dead_zone=settings.rendering.smart_framing_dead_zone,
+            min_face_confidence=settings.rendering.smart_framing_min_confidence
         )

     def render(
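The renderer now forwards the new tuning values to the smart-framing component, which appears to be the file whose diff is suppressed below. The following is therefore only an assumed illustration (not the project's actual implementation) of how a response time, dead zone and velocity cap could shape the per-frame motion of the crop centre:

```python
def step_crop_center(current: float, target: float, fps: float,
                     response_time: float = 0.6, dead_zone: float = 60.0,
                     max_velocity: float = 25.0) -> float:
    """Move the crop centre toward the target with a dead zone and a velocity cap."""
    error = target - current
    if abs(error) < dead_zone:
        return current                      # inside the dead zone: hold still
    # First-order approach: close roughly 63% of the gap every `response_time` seconds.
    step = error * min(1.0, (1.0 / fps) / response_time)
    step = max(-max_velocity, min(max_velocity, step))   # cap per-frame movement in pixels
    return current + step
```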

File diff suppressed because it is too large