What slowdown? AI models are evolving fast

A month ago, speculation was rife among tech enthusiasts worldwide about whether AI capabilities had hit a slowdown due to diminishing returns in model training techniques. That fear has been soundly put to bed by a host of new AI model releases, one after another, from some of the biggest players in the space. We’ve compiled a list of the notable releases to have come out of the tech world in the last month.

o3 is almost as smart as a human

Sam Altman-led OpenAI has introduced o3, its latest family of reasoning-focused AI models, marking a significant step forward from the earlier o1 model. This new release includes o3 and o3-mini, a smaller, task-specific variant. Designed to enhance reasoning capabilities, o3 employs advanced techniques like reinforcement learning and “private chain of thought” processes to simulate step-by-step logical thinking. Notably, o3 allows for adjustable “reasoning time” settings (low, medium, or high compute), enabling users to balance performance and efficiency. At higher compute levels, o3 demonstrates remarkable results in fields like science, mathematics, and programming. For example, it achieved 87.5% on the ARC-AGI test for acquiring new skills, though it also revealed limitations on simpler tasks, underscoring differences from human intelligence. Despite these advancements, o3 has not yet been widely released. OpenAI is offering previews to safety researchers, with broader availability planned for early 2025.
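
o3 itself is not yet in public hands, but OpenAI’s existing Python SDK hints at what selecting a reasoning level could look like. The sketch below is speculative: the “o3-mini” model name and the reasoning_effort parameter are assumptions modelled on OpenAI’s current chat-completions interface, not a confirmed o3 API.

```python
# Hypothetical sketch of requesting a reasoning level once o3 ships.
# The model name and the reasoning_effort parameter are assumptions,
# not a confirmed public API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",          # assumed model identifier
    reasoning_effort="high",  # assumed values: "low", "medium", "high"
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```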

Alibaba’s Marco-o1 is a competitive model

Not to be left out in the cold, China’s tech giants have stepped up their AI game as well. Marco-o1, developed by the MarcoPolo Team under Alibaba International Digital Commerce, is an advanced large language model designed to excel in reasoning and open-ended problem-solving tasks. The model leverages cutting-edge techniques such as Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), and innovative reasoning strategies. Built on the Qwen2-7B-Instruct architecture, Marco-o1 has been fine-tuned using a combination of open-source CoT datasets, proprietary synthetic data, and customised instruction datasets. The integration of MCTS allows the model to evaluate multiple reasoning paths by assigning confidence scores, while its reasoning action strategies refine problem-solving through granular, step-by-step actions. Marco-o1 demonstrates notable performance improvements, achieving a 6.17% accuracy increase on the MGSM English dataset and a 5.60% improvement on the Chinese version. It also excels in machine translation, accurately interpreting complex phrases and idiomatic expressions. Available on GitHub and Hugging Face, Marco-o1 is accessible for researchers and developers aiming to apply it across domains such as multilingual translation and advanced reasoning. Its strong performance benchmarks suggest that Marco-o1 is a competitive player in the evolving landscape of reasoning-focused AI models.
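
Because the weights are published on Hugging Face, the model can be tried locally with a few lines of Python. This is a minimal sketch using the transformers library; the repository ID “AIDC-AI/Marco-o1” is our assumption of the checkpoint name, so verify it against the model card before running.

```python
# Minimal sketch: running Marco-o1 locally via Hugging Face transformers.
# The repository ID below is an assumption; check the model card for the
# exact name and the team's recommended system prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-o1"  # assumed Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "A farmer has 17 sheep; all but 9 run away. How many are left?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```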

o1 is out now globally

OpenAI has officially released its o1 reasoning model, now accessible globally to ChatGPT Plus and Team users, with availability for Enterprise and Edu users following shortly. This marks a significant upgrade from o1-preview, delivering faster, more accurate reasoning capabilities, particularly in coding and mathematics. Key advancements include the ability to analyse and explain image uploads, enabling applications in visual data interpretation with enhanced detail and precision. Additionally, o1 has been trained to produce more concise reasoning, reducing response times compared to its preview version. Testing reveals a 34% reduction in major errors on complex real-world questions, solidifying o1 as a robust reasoning tool. The model is available through the ChatGPT model selector, replacing o1-preview, and will soon support features like web browsing and file uploads. OpenAI plans to extend o1’s capabilities through an API, introducing features like vision processing, function calling, developer messages, and structured outputs for seamless interaction with external systems. Separately, OpenAI introduced ChatGPT Pro, a $200 subscription tier offering scaled access to o1, GPT-4o, and Advanced Voice. Pro users gain exclusive access to a high compute version of o1 for solving the most challenging problems. To foster societal impact, OpenAI has launched a ChatGPT Pro Grant Program, awarding grants to researchers in medicine and other fields.
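
For developers waiting on the API, a call will most likely resemble OpenAI’s existing chat-completions interface. The sketch below is an assumption rather than confirmed documentation; note the “developer” role, which OpenAI has said replaces the “system” role for its reasoning models.

```python
# Sketch of calling o1 through the OpenAI Python SDK once API access
# rolls out. The "o1" model identifier and the use of a "developer"
# message are assumptions based on OpenAI's announcements.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",  # assumed identifier for the full reasoning model
    messages=[
        {"role": "developer", "content": "You are a careful Python code reviewer."},
        {"role": "user", "content": "Spot the bug: def f(x=[]): x.append(1); return x"},
    ],
)
print(response.choices[0].message.content)
```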

Gemini 2 is Google’s agentic bet

Tech critics have so far been heavily impressed with Google’s latest AI model, Gemini 2. Sundar Pichai-led Google says its latest model represents a major advancement in the evolution of AI. Building on the foundation of Gemini 1.0 and 1.5, this new model is designed for the “agentic era,” enabling AI systems to understand, reason, and act more effectively in a variety of contexts. Gemini 2’s multimodal capabilities allow it to process and generate text, images, video, and audio outputs. The first release in the Gemini 2 family, the experimental 2.0 Flash model, achieves faster performance and better results than its predecessors, such as the popular 1.5 Pro. It supports advanced features, including multimodal inputs and outputs, steerable text-to-speech, and native tool integration for tasks like web searches and code execution. Additionally, Gemini 2 Flash introduces a new Multimodal Live API, enabling real-time audio and video-streaming input for dynamic application development.
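
The experimental Flash model is already reachable through the Gemini API. Here is a minimal sketch using Google’s generative AI Python SDK; the model string “gemini-2.0-flash-exp” is our assumption of the experimental identifier, so check Google’s documentation for the current name.

```python
# Minimal sketch: calling the experimental Gemini 2.0 Flash model with
# Google's Python SDK (pip install google-generativeai). The model name
# is an assumption; confirm the current experimental identifier.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-2.0-flash-exp")  # assumed model name
response = model.generate_content(
    "In two sentences, explain what an agentic AI system is."
)
print(response.text)
```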

Amazon enters AI race with Nova

Amazon has decided that it won’t just sit by and watch as its competitors steal all the AI limelight. Enter Amazon Nova, a collection of advanced AI models introduced by AWS as part of its Bedrock AI platform. Designed to support a wide array of generative AI tasks, the tech giant says the Nova suite offers businesses efficient solutions for text generation, multimodal content creation, and complex reasoning. The Nova lineup includes several models, each catering to specific use cases. Amazon Nova Micro is a text-only model that delivers low-latency responses at an exceptionally low cost, making it ideal for applications requiring fast and efficient text processing. Amazon Nova Lite, a multimodal model, excels in handling text, image, and video inputs with lightning speed while maintaining affordability. Amazon Nova Pro offers a balance of accuracy, speed, and cost, making it highly versatile for a range of tasks. Meanwhile, Amazon Nova Premier, the most advanced model, is designed for complex reasoning and serves as an excellent teacher for distilling custom models. This model is expected to become available in Q1 2025.
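
Since the Nova models sit on Bedrock, they are reachable through the standard AWS SDK. The sketch below uses boto3’s Converse API; the model ID “amazon.nova-lite-v1:0” is an assumption based on Bedrock’s naming conventions, so confirm it in the Bedrock console for your region.

```python
# Minimal sketch: invoking a Nova model on Amazon Bedrock via boto3's
# Converse API. The model ID is an assumption based on Bedrock naming
# conventions; confirm it in the Bedrock console for your region.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # assumed Nova Lite model ID
    messages=[
        {
            "role": "user",
            "content": [{"text": "Write a one-line product tagline for a smart kettle."}],
        }
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0.7},
)
print(response["output"]["message"]["content"][0]["text"])
```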

Veo 2 can create stunning videos

Veo 2, Google’s latest video generation model, represents a significant advancement in AI-driven video creation, according to experts who have used it. Building on the foundations of its predecessor, Veo, the model is capable of generating high-resolution videos across a wide range of styles and subjects, achieving results that have been rated superior in quality when compared to leading video generation models. A standout feature of Veo 2 is its improved understanding of real-world physics and human movement, enabling the creation of videos with enhanced realism and detail. The model incorporates cinematic expertise, allowing users to specify genres, camera angles, lenses, and effects. For instance, Veo 2 can craft wide-angle shots using an “18mm lens” or emphasize a subject with a “shallow depth of field” prompt. It supports resolutions up to 4K and can generate videos extending to several minutes in length. Veo 2 reduces common issues like “hallucinated” details, such as extra limbs or misplaced objects, ensuring outputs are more reliable. Available through Google Labs’ VideoFX tool, Veo 2 also features SynthID watermarking for transparency, addressing concerns around misinformation. With plans to expand its availability to platforms like YouTube Shorts, Veo 2 is poised to redefine creative workflows for enterprises and individual creators alike.

Sora is finally released

After teasing the release months ago, OpenAI finally unveiled Sora, its advanced AI model developed to create realistic videos from textual descriptions. Building upon earlier research in world simulation, Sora aims to bridge the gap between digital content generation and real-world interaction. The updated version, Sora Turbo, is now available as a standalone product at Sora.com, exclusively for ChatGPT Plus and Pro users. Sora Turbo introduces a range of new features, including the ability to generate videos up to 1080p resolution and 20 seconds long. It supports various aspect ratios, such as widescreen, vertical, and square formats. Users can either generate entirely new content from text prompts or extend, remix, and blend existing assets. A storyboard tool provides precise control over inputs for each frame, enabling more detailed and structured video creation. While Sora offers innovative capabilities, it comes with limitations, such as challenges in simulating realistic physics and handling complex, long-duration actions. To ensure responsible use, all videos include visible watermarks and metadata for transparency, and harmful content is strictly prohibited. Sora is currently included in Plus accounts, with Pro plans offering expanded access.

Nova Reel lets you control the camera

Amazon Nova Reel is a video generation model introduced by AWS as part of the Nova family of foundation models. Designed to transform text and image inputs into high-quality, short-form videos, Nova Reel enables users to create multimedia content with ease. Amazon says these capabilities are particularly beneficial for applications in advertising, marketing, and entertainment, where dynamic visual content is essential. A notable feature of Nova Reel is its camera motion controls, which can be adjusted using natural language inputs. This allows users to specify desired camera movements, such as “dolly forward,” to produce videos with professional-grade effects.
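
On Bedrock, video generation jobs run asynchronously, with the finished clip written to S3. The sketch below shows how kicking off a Nova Reel job with a camera-motion prompt might look; the model ID and the payload field names are assumptions modelled on AWS’s async-invoke pattern, so treat this as illustrative rather than definitive.

```python
# Hedged sketch: generating a short clip with Nova Reel on Bedrock.
# The model ID and payload field names are assumptions; check the
# Bedrock documentation before relying on this shape.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

job = bedrock.start_async_invoke(
    modelId="amazon.nova-reel-v1:0",  # assumed model ID
    modelInput={
        "taskType": "TEXT_VIDEO",
        "textToVideoParams": {
            # Natural-language camera direction, as described above.
            "text": "dolly forward through a neon-lit street market at dusk",
        },
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bucket/nova-reel-output/"}
    },
)
print("Job ARN:", job["invocationArn"])  # poll get_async_invoke with this ARN
```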

Deep Research is a unique initiative

Google’s Deep Research is a groundbreaking AI-powered tool integrated with the Gemini 2 model, aimed at revolutionizing how users gather and synthesize information from the web. Designed for researchers, professionals, and casual users alike, this tool automates the traditionally time-consuming process of finding, verifying, and summarizing information online. Users can delegate research tasks to the AI, which autonomously scans credible sources, extracts key insights, and compiles them into comprehensive, easy-to-read reports. These reports are enriched with links to original sources, enabling users to dive deeper if needed. A unique feature of Deep Research is its adaptability to user-specific needs—whether it’s gathering data on market trends, technical documentation, or general knowledge, the tool tailors its output to the task. Available through Gemini 2 Advanced, Deep Research supports multiple languages, including regional dialects, making it accessible to a global audience. By combining speed, accuracy, and context, Google Deep Research empowers users to focus on analysis and decision-making rather than the legwork of information gathering.

Project Mariner wants AI to browse the web

Project Mariner is Google’s ambitious project to enable AI-driven web browsing and interaction. Designed for the Gemini 2 model, Mariner allows users to perform advanced tasks online autonomously. This includes booking tickets, shopping, or conducting market research without constant human oversight. Mariner’s unique capability lies in its ability to interact with web interfaces as a human would, simulating clicks, text inputs, and navigation. A key aspect of the project is safety and control; all actions are logged, and users receive notifications for major decisions, ensuring transparency. For developers, Mariner offers APIs to integrate its browsing capabilities into their applications, creating new possibilities for automation and user experience enhancement. With applications ranging from e-commerce to education, Project Mariner, Google says, is a glimpse into a future where AI not only aids but actively participates in the digital world, performing complex tasks with precision and minimal human intervention.

Project Astra envisions AI assistants

Project Astra is Google’s initiative to develop an AI assistant that works seamlessly across its suite of services. Powered by Gemini 2, Astra acts as a highly contextual and proactive assistant, capable of providing real-time responses tailored to user needs. Unlike traditional assistants, Astra doesn’t just respond to commands; it anticipates user requirements by analyzing behavior and context. For instance, while drafting an email in Gmail, Astra might suggest relevant attachments or provide concise summaries of referenced documents from Google Drive. It can also manage complex tasks like coordinating schedules across time zones or creating detailed travel itineraries using Google Maps and Calendar. Astra’s integration with Google’s ecosystem ensures a cohesive user experience, streamlining productivity and enhancing daily workflows. Still in its experimental phase, Project Astra aims to redefine personal assistance in the digital age, offering users a proactive, intelligent helper that evolves with their preferences and habits.