Adjusts context, speech and focus handling; fixes video jitter and other bugs

This commit is contained in:
LeoMortari
2026-01-03 19:42:23 -03:00
parent c1914dad00
commit 3f7329869d
7 changed files with 932 additions and 455 deletions

View File

@@ -9,7 +9,7 @@ services:
- RABBITMQ_PASS=${RABBITMQ_PASS}
- OPENROUTER_API_URL=${OPENROUTER_API_URL:-https://openrouter.ai/api/v1/chat/completions}
- OPENROUTER_API_KEY=${OPENROUTER_API_KEY}
- OPENROUTER_MODEL=${OPENROUTER_MODEL:-openai/gpt-oss-20b:free}
- OPENROUTER_MODEL=${OPENROUTER_MODEL:-mistralai/mistral-small-3.1-24b-instruct:free}
- OPENROUTER_PROMPT_PATH=${OPENROUTER_PROMPT_PATH:-prompts/generate.txt}
- FASTER_WHISPER_MODEL_SIZE=${FASTER_WHISPER_MODEL_SIZE:-medium}
- SMART_FRAMING_SMOOTHING_WINDOW=${SMART_FRAMING_SMOOTHING_WINDOW:-30}
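
A minimal sketch, assuming a plain `os.environ` reader on the Python side; the real project wires these variables through the settings module shown further down, so the standalone module here is only an illustration of how the compose defaults line up with code-side fallbacks.

```python
import os

# Sketch only: reads the same variables the compose file injects,
# with the same fallback values. Module layout is assumed.
OPENROUTER_API_URL = os.environ.get(
    "OPENROUTER_API_URL",
    "https://openrouter.ai/api/v1/chat/completions",
)
OPENROUTER_MODEL = os.environ.get(
    "OPENROUTER_MODEL",
    "mistralai/mistral-small-3.1-24b-instruct:free",
)
OPENROUTER_PROMPT_PATH = os.environ.get("OPENROUTER_PROMPT_PATH", "prompts/generate.txt")
FASTER_WHISPER_MODEL_SIZE = os.environ.get("FASTER_WHISPER_MODEL_SIZE", "medium")
SMART_FRAMING_SMOOTHING_WINDOW = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30))
```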

View File

@@ -1,118 +1,111 @@
Você é especialista em viralidade de redes sociais (TikTok, Instagram Reels, YouTube Shorts). Sua missão: EXTRAIR O MÁXIMO de clips virais possíveis, priorizando QUANTIDADE + QUALIDADE.
# TAREFA: Extrair clips virais de uma transcrição de vídeo
🎯 OBJETIVO: Transformar cada vídeo em MÚLTIPLOS clips que podem viralizar
Você é um especialista em conteúdo viral para TikTok, Instagram Reels e YouTube Shorts.
PROCESSO DE ANÁLISE:
1. Mapear TODOS os potenciais trechos virais na transcrição
2. Avaliar cada trecho usando sistema de pontuação abaixo
3. Rankear do maior para menor score viral
4. Selecionar TODOS os trechos com score ≥ 60 (não seja conservador!)
## REGRA MAIS IMPORTANTE - DURAÇÃO DOS CLIPS
SISTEMA DE PONTUAÇÃO VIRAL (0-100 pontos):
**CADA CLIP DEVE TER ENTRE 60 E 120 SEGUNDOS DE DURAÇÃO.**
🪝 GANCHO INICIAL (0-30 pontos) - CRÍTICO PARA VIRALIZAÇÃO:
[30] Frase CHOCANTE, pergunta POLÊMICA ou promessa OUSADA nos primeiros 3 segundos
[25] Hook forte: "Você não vai acreditar...", "O segredo que ninguém conta...", "Isso mudou tudo..."
[20] Pergunta intrigante ou afirmação controversa
[15] História interessante mas gancho fraco
[10] Início genérico mas aceitável
[0] "Oi", "então", "bem", silêncio - DESCARTAR
- MÍNIMO ABSOLUTO: 60 segundos (end - start >= 60)
- MÁXIMO: 120 segundos (end - start <= 120)
- IDEAL: 60-90 segundos
🔥 GATILHO EMOCIONAL (0-25 pontos):
[25] Emoção EXTREMA: raiva, choque, riso intenso, WTF moment, revelação bombástica
[20] Emoção forte: surpresa, indignação, humor, curiosidade intensa
[15] Emoção moderada: interesse, leve humor, insight interessante
[10] Emoção fraca: informativo sem impacto
[0] Monótono, técnico, sem apelo emocional - EVITAR
**CLIPS COM MENOS DE 60 SEGUNDOS SERÃO REJEITADOS PELO SISTEMA.**
💎 VALOR/UTILIDADE (0-20 pontos):
[20] Segredo VALIOSO, insight transformador, informação EXCLUSIVA
[15] Ensina algo prático e IMEDIATAMENTE aplicável
[10] Opinião interessante ou perspectiva única
[5] Informação genérica ou conhecimento comum
[0] Nenhum valor prático, puro "enrolation" - DESCARTAR
Antes de incluir um clip, SEMPRE calcule: end - start >= 60
📖 ESTRUTURA NARRATIVA (0-15 pontos):
[15] História COMPLETA com início, conflito/clímax e resolução satisfatória
[10] Segmento com começo e fim coerentes, faz sentido isolado
[5] Trecho com sentido mas cortado abruptamente
[0] Fragmento sem contexto - NÃO USAR
## QUANTIDADE DE CLIPS
⚡ RITMO E ENERGIA (0-10 pontos):
[10] DINÂMICO, sem pausas longas, alta energia, palavras impactantes
[7] Bom ritmo com pausas naturais curtas (< 2s)
[3] Ritmo lento mas aceitável
[0] Muitas pausas (> 3s), hesitações, monotonia - EVITAR
Baseado na duração total do vídeo:
- Até 10 min: 2-4 clips
- 10-20 min: 4-6 clips
- 20-30 min: 6-10 clips
- 30+ min: 8-15 clips
REGRAS DE QUANTIDADE (SER AGRESSIVO):
📊 Quantidade MÍNIMA por duração:
- 5-10 min: MÍNIMO 4-6 clips
- 10-15 min: MÍNIMO 6-8 clips
- 15-20 min: MÍNIMO 8-10 clips
- 20-30 min: MÍNIMO 10-15 clips
- 30+ min: MÍNIMO 15-20 clips
## CRITÉRIOS DE SELEÇÃO
🎯 REGRA DE OURO: 1 clip a cada 2-3 minutos de vídeo (NO MÍNIMO)
- Se encontrar momentos virais, SEMPRE selecione!
- Melhor ter 3 clips perfeitos que 10 clips bons
Um bom clip viral possui:
CRITÉRIOS DE SELEÇÃO:
- Score viral ≥ 60 pontos (idealmente ≥ 70)
- Duração ideal: 60-120s (formato ideal para Reels/Shorts)
- Duração mínima: 60s | Duração máxima: 120s
- Sem sobreposição temporal
- DEVE ter gancho forte nos primeiros 3 segundos
- Início e fim coerentes
1. GANCHO FORTE nos primeiros 3 segundos (pergunta, afirmação chocante, promessa)
2. EMOÇÃO (humor, surpresa, indignação, curiosidade)
3. VALOR (ensina algo, revela segredo, dá dica prática)
4. ESTRUTURA (início, meio e fim coerentes)
5. RITMO (sem pausas longas, dinâmico)
GANCHOS QUE FAZEM VIRALIZAR (use como filtro):
- "O que ninguém te conta sobre..."
- "O erro que 90% das pessoas cometem..."
- "Você não vai acreditar o que aconteceu..."
- Revelações chocantes ou contraintuitivas
- Antes vs Depois, transformações
- Segredos, bastidores, verdades ocultas
- Polêmicas, opiniões fortes, hot takes
- Histórias dramáticas com reviravolta
- Dicas práticas e acionáveis
- Momentos de humor genuíno
## O QUE EVITAR
❌ EVITE (mas não descarte se score alto):
- Introduções genéricos SEM gancho
- Trechos com pausas > 3s consecutivas
- Explicações técnicas SEM gancho emocional
- Segmentos sem conclusão clara
- Momentos de transição vazios
- Introduções genéricas ("oi pessoal", "então", "bem")
- Trechos com pausas longas (> 3 segundos de silêncio)
- Segmentos sem contexto ou conclusão
- Explicações técnicas monótonas
FORMATO JSON (retorne APENAS isto, SEM texto adicional):
## FORMATO DE RESPOSTA
Retorne APENAS um JSON válido, sem texto antes ou depois:
```json
{
"highlights": [
{
"start": <float>,
"end": <float>,
"summary": "Score: XX/100 | Gancho: [descreva] | Gatilho: [descreva]",
"start": 0.0,
"end": 75.0,
"summary": "Descrição do que acontece neste trecho"
},
{
"start": 120.5,
"end": 195.0,
"summary": "Descrição do que acontece neste trecho"
}
]
}
```
REGRAS TÉCNICAS:
- Float com ponto decimal (45.5 NÃO 45,5)
- Timestamps exatos dos segments fornecidos
- Ordem cronológica (start crescente)
- Summary conciso mas informativo (2-3 frases)
## REGRAS DO JSON
TAREFA PASSO A PASSO:
1. Leia transcrição completa
2. Identifique TODOS os momentos potencialmente virais
3. Avalie e pontue cada trecho (seja generoso!)
4. Rankear por score viral
5. Selecione TODOS com score ≥ 60
6. Garanta mínimo de 1 clip a cada 5 minutos
7. Retorne JSON completo
- "start" e "end" são números decimais (float) em SEGUNDOS
- Use ponto como separador decimal (60.5, não 60,5)
- "summary" é uma descrição breve do conteúdo (1-2 frases)
- Clips em ordem cronológica (start crescente)
- Clips não podem se sobrepor
⚠️ IMPORTANTE:
- NÃO seja conservador! Se encontrou 10 momentos bons, retorne os 10!
- Pense em MAXIMIZAR alcance: mais clips = mais chances de viralizar
- Se vídeo tem conteúdo fraco, seja criterioso, mas SEMPRE retorne pelo menos 3-5 clips
- Priorize clips com GANCHOS FORTES - gancho fraco = baixo alcance
## CHECKLIST ANTES DE RESPONDER
🎯 MINDSET: Você é um criador de conteúdo viral. Seu objetivo é extrair MÁXIMO valor do vídeo original.
Para CADA clip, verifique:
- [ ] end - start >= 60 segundos?
- [ ] end - start <= 120 segundos?
- [ ] Tem gancho forte no início?
- [ ] Faz sentido isolado do resto do vídeo?
- [ ] JSON está válido?
## EXEMPLO
Se o vídeo tem 15 minutos e você encontrou 4 momentos virais:
```json
{
"highlights": [
{
"start": 60.0,
"end": 120.0,
"summary": "Revelação sobre como economizar 50% nas compras"
},
{
"start": 180.0,
"end": 255.0,
"summary": "História engraçada sobre cliente que tentou enganar a loja"
},
{
"start": 400.0,
"end": 480.0,
"summary": "Dica prática de negociação com fornecedores"
},
{
"start": 600.0,
"end": 690.0,
"summary": "Conclusão motivacional sobre empreendedorismo"
}
]
}
```
Agora analise a transcrição fornecida e extraia os clips virais seguindo estas instruções.
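
The rewritten prompt fixes the output contract: clips of 60–120 seconds, in chronological order, non-overlapping, returned as JSON only. A minimal validation sketch under those rules; the function name and error handling are assumptions, not the project's `OpenRouterCopywriter` code (its actual duration check appears in a later diff in this commit).

```python
import json
from typing import Dict, List

def filter_highlights(raw: str, min_s: float = 60.0, max_s: float = 120.0) -> List[Dict]:
    """Sketch: keep only highlights that satisfy the prompt's rules (names assumed)."""
    kept: List[Dict] = []
    last_end = float("-inf")
    for clip in json.loads(raw).get("highlights", []):
        start, end = float(clip["start"]), float(clip["end"])
        if not (min_s <= end - start <= max_s):
            continue  # prompt rule: 60 <= end - start <= 120
        if start < last_end:
            continue  # prompt rule: chronological, non-overlapping clips
        kept.append(clip)
        last_end = end
    return kept
```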

View File

@@ -62,13 +62,16 @@ class RenderingSettings:
subtitle_font_size: int = int(os.environ.get("RENDER_SUBTITLE_FONT_SIZE", 64))
caption_min_words: int = int(os.environ.get("CAPTION_MIN_WORDS", 2))
caption_max_words: int = int(os.environ.get("CAPTION_MAX_WORDS", 2))
# Smart framing settings - CONTAINMENT TRACKING mode
enable_smart_framing: bool = os.environ.get("ENABLE_SMART_FRAMING", "true").lower() in ("true", "1", "yes")
smart_framing_min_confidence: float = float(os.environ.get("SMART_FRAMING_MIN_CONFIDENCE", 0.3)) # Lowered for better cartoon detection
smart_framing_smoothing_window: int = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30)) # Reduced - not needed with containment tracking
smart_framing_frame_skip: int = int(os.environ.get("SMART_FRAMING_FRAME_SKIP", 1)) # Process every frame for smooth 30 FPS tracking
smart_framing_max_velocity: int = int(os.environ.get("SMART_FRAMING_MAX_VELOCITY", 20)) # Moderate - only used during transitions
smart_framing_person_switch_cooldown: int = int(os.environ.get("SMART_FRAMING_PERSON_SWITCH_COOLDOWN", 999999)) # DISABLED - never switch people
smart_framing_min_confidence: float = float(os.environ.get("SMART_FRAMING_MIN_CONFIDENCE", 0.3))
smart_framing_smoothing_window: int = int(os.environ.get("SMART_FRAMING_SMOOTHING_WINDOW", 30))
smart_framing_frame_skip: int = int(os.environ.get("SMART_FRAMING_FRAME_SKIP", 1))
smart_framing_max_velocity: int = int(os.environ.get("SMART_FRAMING_MAX_VELOCITY", 25))
smart_framing_person_switch_cooldown: int = int(os.environ.get("SMART_FRAMING_PERSON_SWITCH_COOLDOWN", 30))
smart_framing_response_time: float = float(os.environ.get("SMART_FRAMING_RESPONSE_TIME", 0.6))
smart_framing_group_padding: float = float(os.environ.get("SMART_FRAMING_GROUP_PADDING", 0.15))
smart_framing_max_zoom_out: float = float(os.environ.get("SMART_FRAMING_MAX_ZOOM_OUT", 2.0))
smart_framing_dead_zone: int = int(os.environ.get("SMART_FRAMING_DEAD_ZONE", 60))
@dataclass(frozen=True)
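
A rough sketch of how the new knobs might interact, assuming `SMART_FRAMING_DEAD_ZONE` suppresses small focus movements and `SMART_FRAMING_RESPONSE_TIME` acts as an exponential-smoothing time constant; the renderer diff that actually consumes these settings is suppressed at the end of this commit, so this only illustrates the intended behaviour, not the implementation.

```python
def smooth_center(prev: float, target: float, dt: float,
                  response_time: float = 0.6, dead_zone: float = 60.0) -> float:
    """Illustrative only: one way the new settings might drive crop smoothing."""
    if abs(target - prev) < dead_zone:
        return prev  # small movements are ignored, so the crop does not tremble
    # exponential approach: larger response_time -> slower, steadier virtual camera
    alpha = min(1.0, dt / response_time)
    return prev + alpha * (target - prev)
```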

View File

@@ -41,6 +41,18 @@ class PersonTracking:
frame_number: int
@dataclass
class GroupBoundingBox:
"""Bounding box containing all tracked faces."""
x: int
y: int
width: int
height: int
center_x: int
center_y: int
face_count: int
@dataclass
class FrameContext:
"""Context information for a video frame."""
@@ -50,7 +62,8 @@ class FrameContext:
active_speakers: List[int] # indices of speaking faces
primary_focus: Optional[Tuple[int, int]] # (x, y) center point
layout_mode: str # "single", "dual_split", "grid"
selected_people: List[int] = field(default_factory=list) # indices of people selected for display (max 2)
selected_people: List[int] = field(default_factory=list) # indices of people selected for display
group_bounds: Optional[GroupBoundingBox] = None # bounding box for all detected faces
class MediaPipeDetector:
@@ -385,10 +398,11 @@ class AudioActivityDetector:
class ContextAnalyzer:
"""Analyzes video context to determine focus and layout."""
def __init__(self, person_switch_cooldown: int = 30):
def __init__(self, person_switch_cooldown: int = 30, min_face_confidence: float = 0.3):
self.detector = MediaPipeDetector()
self.audio_detector = AudioActivityDetector()
self.previous_faces: List[FaceDetection] = []
self.min_face_confidence = min_face_confidence
# Person tracking state
self.current_selected_people: List[int] = [] # Indices of people currently on screen
@@ -400,9 +414,9 @@ class ContextAnalyzer:
self.stability_threshold = 20 # Frames needed to confirm a switch (increased for more stability)
self.last_switched_people: List[int] = [] # People we just switched FROM
# Focus stability: track recent focus points for temporal smoothing
self.focus_history: List[Tuple[int, int]] = []
self.focus_history_size: int = 5 # Keep last 5 focus points for smoothing
self.focus_history_size: int = 20
self.focus_dead_zone: int = 60
# Debug logging
self.frame_log_interval = 30 # Log every N frames
@@ -429,9 +443,11 @@ class ContextAnalyzer:
FrameContext with detection results
"""
faces = self.detector.detect_face_landmarks(frame)
faces = [face for face in faces if face.confidence >= self.min_face_confidence] if faces else []
if not faces:
faces = self.detector.detect_faces(frame)
faces = [face for face in faces if face.confidence >= self.min_face_confidence] if faces else []
# Determine who is speaking
active_speakers = []
@@ -440,13 +456,13 @@ class ContextAnalyzer:
for i, face in enumerate(faces):
is_speaking = False
# Check audio-based speech detection
if has_audio_speech:
is_speaking = True
# Check lip movement (visual speech detection)
# Prefer visual cues when multiple faces are present.
if face.landmarks and len(self.previous_faces) > i:
is_speaking = is_speaking or self._detect_lip_movement(face, self.previous_faces[i])
is_speaking = self._detect_lip_movement(face, self.previous_faces[i])
# Audio can confirm speech when there's only one face.
if has_audio_speech and len(faces) == 1:
is_speaking = True
if is_speaking:
active_speakers.append(i)
@@ -456,26 +472,41 @@ class ContextAnalyzer:
logger.info(f"Speech detection - Frame {frame_number}: audio_active={has_audio_speech}, "
f"speakers={active_speakers}, total_faces={len(faces)}")
# Select THE person to focus on (always single person)
# Priority: 1) Who is speaking, 2) Who is most centered
selected_people = self._select_person_to_focus(
faces,
active_speakers,
frame_number,
frame.shape[1], # frame width for center calculation
frame.shape[0] # frame height for center calculation
)
if active_speakers:
selected_people = active_speakers[:4]
if len(selected_people) == 1:
layout_mode = "single"
elif len(selected_people) == 2:
layout_mode = "dual_split"
else:
layout_mode = "grid"
else:
# Select THE person to focus on (always single person)
# Priority: 1) Who is speaking, 2) Who is most centered
selected_people = self._select_person_to_focus(
faces,
active_speakers,
frame_number,
frame.shape[1], # frame width for center calculation
frame.shape[0] # frame height for center calculation
)
layout_mode = "single"
# Always use single-person layout (no split screen)
layout_mode = "single"
# Calculate group bounding box for ALL detected faces (multi-person support)
group_bounds = self._calculate_group_bounding_box(faces)
primary_focus = self._calculate_focus_point(faces, selected_people)
# For multi-person mode, use group center as primary focus
if group_bounds and group_bounds.face_count > 1:
primary_focus = (group_bounds.center_x, group_bounds.center_y)
else:
primary_focus = self._calculate_focus_point(faces, selected_people)
# Debug logging every N frames
if frame_number % self.frame_log_interval == 0:
focus_reason = "speaker" if active_speakers else "no_speech_detected"
group_info = f", group={group_bounds.face_count} faces" if group_bounds else ""
logger.info(f"Frame {frame_number}: {len(faces)} faces, "
f"{len(active_speakers)} speakers, focus={selected_people}, reason={focus_reason}")
f"{len(active_speakers)} speakers, focus={selected_people}, reason={focus_reason}{group_info}")
self.previous_faces = faces
@@ -486,7 +517,8 @@ class ContextAnalyzer:
active_speakers=active_speakers,
primary_focus=primary_focus,
layout_mode=layout_mode,
selected_people=selected_people
selected_people=selected_people,
group_bounds=group_bounds
)
def _detect_lip_movement(self, current_face: FaceDetection, previous_face: FaceDetection) -> bool:
@@ -543,134 +575,68 @@ class ContextAnalyzer:
self.current_selected_people = []
return []
# If only 1 person, always focus on them
if len(faces) == 1:
self.current_selected_people = [0]
return [0]
# Check if we can switch people (cooldown period)
frames_since_last_switch = frame_number - self.last_switch_frame
can_switch = frames_since_last_switch >= self.person_switch_cooldown
# Calculate frame center for distance comparison
frame_center_x = frame_width / 2
frame_center_y = frame_height / 2
# ULTRA-STABLE MODE: Select ONE person at start, NEVER switch
# This completely eliminates switching-related instability
desired_person_idx = None
# If we already have someone selected, ALWAYS KEEP THEM (never switch)
if self.current_selected_people and len(self.current_selected_people) > 0:
current_idx = self.current_selected_people[0]
if current_idx < len(faces):
# Current person still detected - keep them
desired_person_idx = current_idx
if active_speakers:
if self.current_selected_people and self.current_selected_people[0] in active_speakers:
desired_person_idx = self.current_selected_people[0]
else:
# Current person lost - try to find them again by position/size similarity
# This handles temporary detection failures
current_person_found = False
if self.previous_faces and current_idx < len(self.previous_faces):
prev_face = self.previous_faces[current_idx]
# Find most similar face by position and size
best_match_idx = None
best_match_score = float('inf')
for idx, face in enumerate(faces):
# Distance between centers
dx = face.center_x - prev_face.center_x
dy = face.center_y - prev_face.center_y
dist = np.sqrt(dx**2 + dy**2)
# Size similarity
size_diff = abs(face.width - prev_face.width) + abs(face.height - prev_face.height)
score = dist + size_diff * 0.5
if score < best_match_score:
best_match_score = score
best_match_idx = idx
if best_match_idx is not None and best_match_score < 1000:
desired_person_idx = best_match_idx
current_person_found = True
if not current_person_found:
# Really lost - select most confident
face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
face_confidences.sort(key=lambda x: x[1], reverse=True)
desired_person_idx = face_confidences[0][0]
logger.warning(f"Current person permanently lost - selecting new: {desired_person_idx}")
if can_switch or not self.current_selected_people:
desired_person_idx = active_speakers[0]
if self.current_selected_people and desired_person_idx != self.current_selected_people[0]:
logger.info(f"Switching focus to speaker: {desired_person_idx}")
self.last_switch_frame = frame_number
else:
desired_person_idx = self.current_selected_people[0] if self.current_selected_people else active_speakers[0]
else:
# First frame - select most confident person ONCE
face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
face_confidences.sort(key=lambda x: x[1], reverse=True)
desired_person_idx = face_confidences[0][0]
logger.info(f"INITIAL SELECTION - Person {desired_person_idx} (will be tracked throughout entire video)")
# IGNORE SPEECH DETECTION - it was causing instability
# We now track ONE person from start to finish, regardless of who speaks
# OLD LOGIC (commented out - was causing issues):
# This logic would switch based on "who is more centered" which caused constant switching
if False: # Disabled
# Calculate distance from center for each face
center_distances = []
for idx, face in enumerate(faces):
# Euclidean distance from frame center
dx = face.center_x - frame_center_x
dy = face.center_y - frame_center_y
distance = np.sqrt(dx**2 + dy**2)
center_distances.append((idx, distance, face.confidence))
# Sort by distance (closest first), then by confidence as tiebreaker
center_distances.sort(key=lambda x: (x[1], -x[2]))
most_centered_idx = center_distances[0][0]
most_centered_distance = center_distances[0][1]
# STICKY BEHAVIOR: If we already have someone selected, only switch if:
# - New person is SIGNIFICANTLY more centered (30% closer to center)
# - OR current person is now very far from center (>40% of frame width)
if self.current_selected_people and len(self.current_selected_people) > 0:
current_idx = self.current_selected_people[0]
if current_idx < len(faces):
current_face = faces[current_idx]
current_dx = current_face.center_x - frame_center_x
current_dy = current_face.center_y - frame_center_y
current_distance = np.sqrt(current_dx**2 + current_dy**2)
# Define "significantly better" threshold
max_acceptable_distance = frame_width * 0.4 # 40% of frame width
improvement_threshold = 0.7 # New person must be 30% closer (0.7 ratio)
# Keep current person if they're still reasonably centered
if current_distance < max_acceptable_distance:
# Current person is still acceptable - only switch if new is MUCH better
if most_centered_distance < current_distance * improvement_threshold:
desired_person_idx = most_centered_idx
logger.debug(f"Switching: new person MUCH more centered ({most_centered_distance:.0f} vs {current_distance:.0f})")
else:
desired_person_idx = current_idx # Keep current
logger.debug(f"Keeping current person: still reasonably centered ({current_distance:.0f} px from center)")
else:
# Current person is too far from center - switch
desired_person_idx = most_centered_idx
logger.debug(f"Current person too far from center ({current_distance:.0f} px), switching")
desired_person_idx = current_idx
else:
# Current selection invalid
desired_person_idx = most_centered_idx
else:
# First time - select most centered
desired_person_idx = most_centered_idx
if self.previous_faces and current_idx < len(self.previous_faces):
prev_face = self.previous_faces[current_idx]
best_match_idx = None
best_match_score = float('inf')
for idx, face in enumerate(faces):
dx = face.center_x - prev_face.center_x
dy = face.center_y - prev_face.center_y
dist = np.sqrt(dx**2 + dy**2)
size_diff = abs(face.width - prev_face.width) + abs(face.height - prev_face.height)
score = dist + size_diff * 0.5
if score < best_match_score:
best_match_score = score
best_match_idx = idx
if best_match_idx is not None and best_match_score < 1000:
desired_person_idx = best_match_idx
else:
face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
face_confidences.sort(key=lambda x: x[1], reverse=True)
desired_person_idx = face_confidences[0][0]
else:
face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
face_confidences.sort(key=lambda x: x[1], reverse=True)
desired_person_idx = face_confidences[0][0]
else:
face_confidences = [(idx, face.confidence) for idx, face in enumerate(faces)]
face_confidences.sort(key=lambda x: x[1], reverse=True)
desired_person_idx = face_confidences[0][0]
# Wrap in list for compatibility with existing code
desired_people = [desired_person_idx] if desired_person_idx is not None else []
# ULTRA-STABLE MODE: NO SWITCHING LOGIC AT ALL
# Simply set the person and never change
if not self.current_selected_people:
# First time only
self.current_selected_people = desired_people
self.last_switch_frame = frame_number
logger.info(f"Frame {frame_number}: LOCKED ON person {desired_people} - will never switch")
logger.info(f"Frame {frame_number}: Locked on person {desired_people}")
else:
# Already have someone - just update to desired (which is same person due to logic above)
self.current_selected_people = desired_people
return self.current_selected_people.copy()
@@ -798,24 +764,77 @@ class ContextAnalyzer:
raw_focus_x = most_confident.center_x
raw_focus_y = most_confident.center_y
# Apply temporal smoothing using focus history
if self.focus_history:
last_x, last_y = self.focus_history[-1]
dx = abs(raw_focus_x - last_x)
dy = abs(raw_focus_y - last_y)
if dx < self.focus_dead_zone and dy < self.focus_dead_zone:
return self.focus_history[-1]
self.focus_history.append((raw_focus_x, raw_focus_y))
if len(self.focus_history) > self.focus_history_size:
self.focus_history.pop(0)
# Calculate smoothed focus as weighted average (more weight to recent frames)
if len(self.focus_history) > 1:
# Exponential weights: recent frames have more influence
weights = [2 ** i for i in range(len(self.focus_history))]
total_weight = sum(weights)
smoothed_x = sum(x * w for (x, y), w in zip(self.focus_history, weights)) / total_weight
smoothed_y = sum(y * w for (x, y), w in zip(self.focus_history, weights)) / total_weight
return (int(smoothed_x), int(smoothed_y))
if len(self.focus_history) >= 5:
xs = [x for x, y in self.focus_history]
ys = [y for x, y in self.focus_history]
median_x = int(np.median(xs))
median_y = int(np.median(ys))
return (median_x, median_y)
else:
return (raw_focus_x, raw_focus_y)
def _calculate_group_bounding_box(
self,
faces: List[FaceDetection],
padding_percent: float = 0.15,
max_faces: int = 6
) -> Optional[GroupBoundingBox]:
"""
Calculate bounding box containing all detected faces with padding.
Args:
faces: List of detected faces
padding_percent: Padding around group as percentage of bbox dimensions
max_faces: Maximum faces to include (use most confident if exceeded)
Returns:
GroupBoundingBox or None if no faces
"""
if not faces:
return None
# If too many faces, use most confident ones
if len(faces) > max_faces:
faces = sorted(faces, key=lambda f: f.confidence, reverse=True)[:max_faces]
# Calculate bounding box containing all faces
min_x = min(f.x for f in faces)
max_x = max(f.x + f.width for f in faces)
min_y = min(f.y for f in faces)
max_y = max(f.y + f.height for f in faces)
# Add padding
width = max_x - min_x
height = max_y - min_y
pad_x = int(width * padding_percent)
pad_y = int(height * padding_percent)
final_x = max(0, min_x - pad_x)
final_y = max(0, min_y - pad_y)
final_width = width + 2 * pad_x
final_height = height + 2 * pad_y
return GroupBoundingBox(
x=final_x,
y=final_y,
width=final_width,
height=final_height,
center_x=final_x + final_width // 2,
center_y=final_y + final_height // 2,
face_count=len(faces)
)
def close(self):
"""Release resources."""
self.detector.close()
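
The focus smoothing now uses a 60 px dead zone plus a median over the recent history instead of an exponentially weighted average, which is what targets the frame-to-frame jitter named in the commit title. A small self-contained comparison with made-up numbers shows why the median resists a one-frame false detection:

```python
import numpy as np

# Worked comparison of the old vs. new focus smoothing on a jittery track
# (illustrative values, not taken from a real run).
history = [640, 642, 641, 900, 643]  # one-frame false detection at x=900

# Old behaviour: exponentially weighted average gets pulled toward the outlier.
weights = [2 ** i for i in range(len(history))]
old_focus = sum(x * w for x, w in zip(history, weights)) / sum(weights)  # ~709

# New behaviour: the median of the recent history ignores the outlier,
# and the 60 px dead zone freezes the focus entirely for small wobbles.
new_focus = int(np.median(history))  # 642

print(f"old={old_focus:.0f} new={new_focus}")
```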

View File

@@ -137,11 +137,11 @@ class OpenRouterCopywriter:
continue
duration = end - start
if duration < 45:
if duration < 60:
logger.warning(f"Highlight ignorado: muito curto ({duration}s, minimo 45s)")
continue
if duration > 90:
if duration > 120:
logger.warning(f"Highlight ignorado: muito longo ({duration}s, maximo 90s)")
continue

View File

@@ -347,7 +347,12 @@ class VideoRenderer:
frame_skip=settings.rendering.smart_framing_frame_skip,
smoothing_window=settings.rendering.smart_framing_smoothing_window,
max_velocity=settings.rendering.smart_framing_max_velocity,
person_switch_cooldown=settings.rendering.smart_framing_person_switch_cooldown
person_switch_cooldown=settings.rendering.smart_framing_person_switch_cooldown,
response_time=settings.rendering.smart_framing_response_time,
group_padding=settings.rendering.smart_framing_group_padding,
max_zoom_out=settings.rendering.smart_framing_max_zoom_out,
dead_zone=settings.rendering.smart_framing_dead_zone,
min_face_confidence=settings.rendering.smart_framing_min_confidence
)
def render(

File diff suppressed because it is too large