
Generated by Google NotebookLM.
This episode explores 14 new research papers at the edge of LLM behavior, safety, collaboration, and reasoning:
Beyond passive replies – CollabLLM rethinks how LLMs interact across turns, training them to uncover user intent and proactively collaborate.
Red teaming, automated – RedCoder weaponizes multi-turn attacks against code models, training autonomous agents to probe for unsafe generations.
Synthesis by simulation – CodeEvo builds training data by pairing coder and reviewer agents in feedback loops, automating high-quality instruction-code generation.
Internal deception – Linear probes and sparse autoencoders (SAEs) reveal how truthfulness features flip when models are instructed to lie; a minimal probe sketch follows this list.
Defense by deflection – SDeflection skips outright refusal and instead steers responses to malicious prompts toward innocuous replies, lowering jailbreak success without hurting helpfulness.
Attack by persona – A genetic algorithm crafts persona prompts that reduce refusal rates and supercharge jailbreaks, especially when stacked with other methods.
Agents with evolving maps – CoEx lets planning agents continually revise their world models, co-adapting structure and strategy over time.
Interfaces for oversight – Magentic-UI powers human-in-the-loop agentic systems with long-term memory, action guards, and collaborative controls.
Measuring long-context reasoning – NeedleChain moves past “needle-in-a-haystack” with tasks that require full semantic integration across long input windows.
Bias as an exploit – CognitiveAttack uncovers how stacking psychological biases in prompts dramatically increases LLM jailbreak success.
Patching with logic – RePaCA guides LLMs to assess bug fixes using chain-of-thought, boosting accuracy and explainability in patch correctness tasks.
Federated fine-tuning at scale – H2Tune handles architectural and task diversity across clients with a novel decomposition and disentanglement scheme.
Multimodal mastery – MoCHA uses sparse MoE connectors and hierarchical attention to align vision with language and reduce hallucinations.
Where demos belong – A detailed analysis of demo position bias finds that where demonstrations sit in a prompt drastically alters LLM accuracy and stability; a small placement experiment is sketched after the summary below.
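For the deception paper, here is a minimal sketch of the linear-probe idea. Real experiments would fit the probe on LLM hidden states collected under honest versus deceptive instructions; the synthetic activations, planted truth_dir direction, and fake_activations helper below are illustrative stand-ins, not the paper's actual setup.

```python
# Minimal linear "truthfulness probe" sketch. Real experiments would fit
# the probe on LLM hidden states gathered under honest vs. deceptive
# prompts; synthetic activations with a planted direction keep this
# self-contained and runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 256                                   # stand-in hidden-state width
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)    # planted "truthfulness" direction

def fake_activations(n, truthful):
    # Truthful runs shift along truth_dir; deceptive runs shift against it.
    sign = 1.0 if truthful else -1.0
    return rng.normal(size=(n, d)) + 2.0 * sign * truth_dir

X = np.vstack([fake_activations(500, True), fake_activations(500, False)])
y = np.array([1] * 500 + [0] * 500)       # 1 = truthful, 0 = deceptive
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")

# The probe's weight vector recovers the planted direction; the paper's
# finding is that this feature flips when the model is told to lie.
w = probe.coef_[0]
print(f"cosine(probe weights, planted direction): {w @ truth_dir / np.linalg.norm(w):.2f}")
```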
Together, these papers uncover the subtle mechanics that shape LLM trustworthiness, the strategies that make or break jailbreak defenses, and the design patterns emerging in agentic interfaces and federated learning.
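And for the demo-position paper, a small harness sketching how one might compare demonstration placements. The demos, the three placements, and the query_llm stub are all hypothetical, standing in for a real few-shot dataset and model client.

```python
# Sketch of a demo-position experiment: identical few-shot demonstrations
# are placed at different points in the prompt, and answers are compared
# across placements. `query_llm` is a placeholder for your model client.
demos = "\n".join([
    "Review: 'Loved it.' -> positive",
    "Review: 'Terrible pacing.' -> negative",
])
instruction = "Classify the review's sentiment as positive or negative."
query = "Review: 'Surprisingly good.' ->"

placements = {
    "demos_before_instruction": f"{demos}\n\n{instruction}\n{query}",
    "demos_after_instruction": f"{instruction}\n\n{demos}\n\n{query}",
    "demos_after_query": f"{instruction}\n{query}\n\nExamples:\n{demos}",
}

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat/completion client here")

for name, prompt in placements.items():
    print(f"--- {name} ---\n{prompt}\n")
    # answer = query_llm(prompt)
    # Tally accuracy per placement over a labeled test set to expose the bias.
```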
Sources:
CollabLLM: arXiv:2406.04425
RedCoder: arXiv:2407.00482
CodeEvo: arXiv:2407.00483
When Truthful Representations Flip Under Deceptive Instructions: arXiv:2407.00495
Strategic Deflection: arXiv:2407.00496
Enhancing Jailbreak Attacks via Persona Prompts: arXiv:2407.00499
CoEx: arXiv:2407.00508
Magentic-UI: arXiv:2407.00510
NeedleChain: arXiv:2407.00518
CognitiveAttack: arXiv:2407.00519
RePaCA: arXiv:2407.00523
H2Tune: arXiv:2407.00529
MoCHA: arXiv:2407.00530
Where to show Demos in Your Prompt: arXiv:2407.00533