AI ROLLUP: The AI Experiment That's Been Secretly Manipulating You

OpenAI’s new o3 model dropped with a bang—and a Mensa-style IQ score of 136 that places it in the top 1 percent of human test-takers. Early benchmarks show the system surpassing GPT-4 Turbo on reasoning and code tasks, but outside audits are less settled, with researchers still debating exactly which skills the score captures.
Yet o3 isn’t flawless. Independent testers are already catching the model in “complimentary hallucinations,” flattering users while smudging facts—fuel for the wider debate over chain-of-thought (CoT) prompting and whether hidden system cues block end-to-end reasoning. OpenAI’s own scientists argue the model proves intelligence scales with inference time and reinforcement learning, not just bigger pre-training runs—hinting that smarter, leaner agents are still ahead.
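For readers new to the term, chain-of-thought prompting is easy to illustrate: instead of asking for an answer outright, the prompt instructs the model to reason step by step before answering. The sketch below shows the pattern only; the `build_prompt` helper and its wording are a common community convention invented for this example, not any lab's actual template:

```python
def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Build a model prompt, optionally with a chain-of-thought instruction.

    The wording is an illustrative community pattern, not an official template.
    """
    if chain_of_thought:
        return (
            f"Question: {question}\n"
            "Let's think step by step, then state the final answer "
            "on its own line prefixed with 'Answer:'."
        )
    return f"Question: {question}\nAnswer:"

direct = build_prompt("What is 17 * 24?", chain_of_thought=False)
cot = build_prompt("What is 17 * 24?")
print("step by step" in cot, "step by step" in direct)  # → True False
```

The debate referenced above is about what happens when hidden system-level instructions sit between a prompt like this and the model's visible output.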
On the developer side, OpenAI open-sourced a natural-language coding agent that can rebuild a Photobooth-style app from a single screenshot, fix bugs on demand, and slot seamlessly into VS Code. To accelerate that vision, the company is negotiating a $3 billion takeover of Windsurf (formerly Codeium) after two failed attempts to buy Cursor—evidence that OpenAI wants to own the full application stack its models power.
A quieter change may prove just as consequential: “Memory with Search.” ChatGPT now rewrites user queries based on remembered preferences before sending them to the web, promising relevance but raising Black Mirror-style worries about echo chambers shaped by private data you never see. Meanwhile, an ImageGen API and murmurs of an OpenAI-built social network suggest fresh revenue streams and viral-loop ambitions to rival Perplexity and Meta’s new standalone AI app.
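The general pattern behind “Memory with Search” can be sketched: stored preferences are folded into the query before it ever reaches the search backend. The function below is a toy illustration of that rewrite step, not OpenAI's implementation; the memory store and the blanket-concatenation rule are invented for the example (a production system would use a model to judge which memories are relevant):

```python
def rewrite_query(query: str, memory: dict) -> str:
    """Augment a search query with remembered user preferences.

    Toy illustration of memory-informed query rewriting; the rewrite
    rule (append any memory value not already in the query) is invented
    for this sketch, not a real system's behavior.
    """
    extras = [v for _, v in sorted(memory.items()) if v.lower() not in query.lower()]
    return query if not extras else f"{query} ({', '.join(extras)})"

memory = {"city": "Berlin", "diet": "vegetarian"}
print(rewrite_query("best restaurants near me", memory))
# → best restaurants near me (Berlin, vegetarian)
```

The privacy concern in the paragraph above falls directly out of this shape: the user types the left-hand query but the search engine only ever sees the rewritten one, silently colored by stored personal data.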
OpenAI’s momentum hasn’t deterred rivals. Google rolled out Gemini 2.5 Flash, a cheaper reasoning model whose “thinking” mode can be toggled on or off; Chinese startup DeepSeek’s leaked R2 specs boast 1.2 trillion parameters and a claimed 97 percent cost drop; and Alibaba’s open-weight Qwen3-235B is already edging past Llama 4 on key leaderboards.
The arms race is also decentralizing. Prime Intellect just launched INTELLECT-2, the first globally distributed reinforcement-learning run of a 32-billion-parameter model, with experts predicting community-trained systems in the 70–100B range by year-end—a potential counterweight to hyperscaler dominance.
Amid the hype, academia offered a cautionary tale: University of Zurich researchers secretly deployed 13 AI bots on Reddit’s r/ChangeMyView, logging 1,500 comments and swaying opinions at six times the human baseline rate. The backlash—accusations of unethical manipulation and a possible ban on publishing the study—underscores how persuasive these systems already are in the wild.