Share ThursdAI - The top AI news from the past week
Share to email
Share to Facebook
Share to X
By From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week
5
77 ratings
The podcast currently has 71 episodes available.
Hey, Alex here. Super quick, as I’m still attending Dev Day, but I didn’t want to leave you hanging (if you're a paid subscriber!), I have decided to outsource my job and give the amazing podcasters of NoteBookLM the whole transcript of the opening keynote of OpenAI Dev Day.
You can see a blog of everything they just posted here
Here’s a summary of all what was announced:
* Developer-Centric Approach: OpenAI consistently emphasized the importance of developers in their mission to build beneficial AGI. The speaker stated, "OpenAI's mission is to build AGI that benefits all of humanity, and developers are critical to that mission... we cannot do this without you."
* Reasoning as a New Frontier: The introduction of the GPT-4 series, specifically the "O1" models, marks a significant step towards AI with advanced reasoning capabilities, going beyond the limitations of previous models like GPT-3.
* Multimodal Capabilities: OpenAI is expanding the potential of AI applications by introducing multimodal capabilities, particularly focusing on real-time speech-to-speech interaction through the new Realtime API.
* Customization and Fine-Tuning: Empowering developers to customize models is a key theme. OpenAI introduced Vision for fine-tuning with images and announced easier access to fine-tuning with model distillation tools.
* Accessibility and Scalability: OpenAI demonstrated a commitment to making AI more accessible and cost-effective for developers through initiatives like price reductions, prompt caching, and model distillation tools.
Important Ideas and Facts:
1. The O1 Models:
* Represent a shift towards AI models with enhanced reasoning capabilities, surpassing previous generations in problem-solving and logical thought processes.
* O1 Preview is positioned as the most powerful reasoning model, designed for complex problems requiring extended thought processes.
* O1 Mini offers a faster, cheaper, and smaller alternative, particularly suited for tasks like code debugging and agent-based applications.
* Both models demonstrate advanced capabilities in coding, math, and scientific reasoning.
* OpenAI highlighted the ability of O1 models to work with developers as "thought partners," understanding complex instructions and contributing to the development process.
Quote: "The shift to reasoning introduces a new shape of AI capability. The ability for our model to scale and correct the process is pretty mind-blowing. So we are resetting the clock, and we are introducing a new series of models under the name O1."
2. Realtime API:
* Enables developers to build real-time AI experiences directly into their applications using WebSockets.
* Launches with support for speech-to-speech interaction, leveraging the technology behind ChatGPT's advanced voice models.
* Offers natural and seamless integration of voice capabilities, allowing for dynamic and interactive user experiences.
* Showcased the potential to revolutionize human-computer interaction across various domains like driving, education, and accessibility.
Quote: "You know, a lot of you have been asking about building amazing speech-to-speech experiences right into your apps. Well now, you can."
3. Vision, Fine-Tuning, and Model Distillation:
* Vision introduces the ability to use images for fine-tuning, enabling developers to enhance model performance in image understanding tasks.
* Fine-tuning with Vision opens up opportunities in diverse fields such as product recommendations, medical imaging, and autonomous driving.
* OpenAI emphasized the accessibility of these features, stating that "fine-tuning with Vision is available to every single developer."
* Model distillation tools facilitate the creation of smaller, more efficient models by transferring knowledge from larger models like O1 and GPT-4.
* This approach addresses cost concerns and makes advanced AI capabilities more accessible for a wider range of applications and developers.
Quote: "With distillation, you take the outputs of a large model to supervise, to teach a smaller model. And so today, we are announcing our own model distillation tools."
4. Cost Reduction and Accessibility:
* OpenAI highlighted its commitment to lowering the cost of AI models, making them more accessible for diverse use cases.
* Announced a 90% decrease in cost per token since the release of GPT-3, emphasizing continuous efforts to improve affordability.
* Introduced prompt caching, automatically providing a 50% discount for input tokens the model has recently processed.
* These initiatives aim to remove financial barriers and encourage wider adoption of AI technologies across various industries.
Quote: "Every time we reduce the price, we see new types of applications, new types of use cases emerge. We're super far from the price equilibrium. In a way, models are still too expensive to be bought at massive scale."
Conclusion:
OpenAI DevDay conveyed a strong message of developer empowerment and a commitment to pushing the boundaries of AI capabilities. With new models like O1, the introduction of the Realtime API, and a dedicated focus on accessibility and customization, OpenAI is paving the way for a new wave of innovative and impactful AI applications developed by a global community.
Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way and then Meta announced MultiModal LlaMas and on device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!)
From Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, full of news and updates and will do a full recap here as well.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) Get ready, folks, because Dev Day is going to be epic!
TL;DR of all topics covered:
* Open Source LLMs
* Meta llama 3.2 Multimodal models (11B & 90B) (X, HF, try free)
* Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)
* Allen AI releases MOLMO - open SOTA multimodal AI models (X, Blog, HF, Try It)
* Big CO LLMs + APIs
* OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI
* Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (Blog)
* This weeks Buzz
* Our free course is LIVE - more than 3000 already started learning how to build advanced RAG++
* Sponsoring tonights AI Tinkerers in Seattle, if you're in Seattle, come through for my demo
* Voice & Audio
* Meta also launches voice mode (demo)
* Tools & Others
* Project ORION - holographic glasses are here! (link)
Meta gives us new LLaMas and AI hardware
LLama 3.2 Multimodal 11B and 90B
This was by far the biggest OpenSource release of this week (tho see below, may not be the "best"), as a rumored released finally came out, and Meta has given our Llama eyes! Coming with 2 versions (well 4 if you count the base models which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same, and placing a vision encoder that was trained and finetuned separately on top.
LLama 90B is among the best open-source mutlimodal models available
— Meta team at launch
These new vision adapters were trained on a massive 6 Billion images, including synthetic data generation by 405B for questions/captions, and finetuned with a subset of 600M high quality image pairs.
Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today beat it on several benchmarks)
With text-only inputs, the Llama 3.2 Vision models are functionally the same as the Llama 3.1 Text models; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities.
Seems like these models don't support multi image or video as well (unlike Pixtral for example) nor tool use with images.
Meta will also release these models on meta.ai and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps 🤯 which marks them as the leading AI services provider in the world now.
Llama 3.2 Lightweight Models (1B/3B)
The additional and maybe more exciting thing that we got form Meta was the introduction of the small/lightweight models of 1B and 3B parameters.
Trained on up to 9T tokens, and distilled / pruned from larger models, these are aimed for on-device inference (and by device here we mean from laptops to mobiles to soon... glasses? more on this later)
In fact, meta released an IOS demo, that runs these models, takes a group chat, summarizes and calls the calendar tool to schedule based on the conversation, and all this happens on device without the info leaving to a larger model.
They have also been able to prune down the LLama-guard safety model they released to under 500Mb and have had demos of it running on client side and hiding user input on the fly as the user types something bad!
Interestingly, here too, the models were not SOTA, even in small category, with tiny models like Qwen 2.5 3B beating these models on many benchmarks, but they are outlining a new distillation / pruning era for Meta as they aim for these models to run on device, eventually even glasses (and some said Smart Thermostats)
In fact they are so tiny, that the communtiy quantized them, released and I was able to download these models, all while the keynote was still going! Here I am running the Llama 3B during the developer keynote!
Speaking AI - not only from OpenAI
Zuck also showcased a voice based Llama that's coming to Meta AI (unlike OpenAI it's likely a pipeline of TTS/STT) but it worked really fast and Zuck was able to interrupt it.
And they also showed a crazy animated AI avatar of a creator, that was fully backed by Llama, while the human creator was on stage, Zuck chatted with his avatar and reaction times were really really impressive.
AI Hardware was glasses all along?
Look we've all seen the blunders of this year, the Humane AI Ping, the Rabbit R1 (which sits on my desk and I haven't recharged in two months) but maybe Meta is the answer here?
Zuck took a bold claim that glasses are actually the perfect form factor for AI, it sits on your face, sees what you see and hears what you hear, and can whisper in your ear without disrupting the connection between you and your conversation partner.
They haven't announced new Meta Raybans, but did update the lineup with a new set of transition lenses (to be able to wear those glasses inside and out) and a special edition clear case pair that looks very sleek + new AI features like memories to be able to ask the glasses "hey Meta where did I park" or be able to continue the conversation. I had to get me a pair of this limited edition ones!
Project ORION - first holographic glasses
And of course, the biggest announcement of the Meta Connect was the super secret decade old project of fully holographic AR glasses, which they called ORION.
Zuck introduced these as the most innovative and technologically dense set of glasses in the world. They always said the form factor will become just "glasses" and they actually did it ( a week after Snap spectacles ) tho those are not going to get released to any one any time soon, hell they only made a few thousand of these and they are extremely expensive.
With 70 deg FOV, cameras, speakers and a compute puck, these glasses pack a full day battery with under 100grams of weight, and have a custom silicon, custom displays with MicroLED projector and just... tons of more innovation in there.
They also come in 3 pieces, the glasses themselves, the compute wireless pack that will hold the LLaMas in your pocket and the EMG wristband that allows you to control these devices using muscle signals.
These won't ship as a product tho so don't expect to get them soon, but they are real, and will allow Meta to build the product that we will get on top of these by 2030
AI usecases
So what will these glasses be able to do? well, they showed off a live translation feature on stage that mostly worked, where you just talk and listen to another language in near real time, which was great. There are a bunch of mixed reality games, you'd be able to call people and see them in your glasses on a virtual screen and soon you'll show up as an avatar there as well.
The AI use-case they showed beyond just translation was MultiModality stuff, where they had a bunch of ingredients for a shake, and you could ask your AI assistant, which shake you can make with what it sees. Do you really need
I'm so excited about these to finally come to people I screamed in the audience 👀👓
OpenAI gives everyone* advanced voice mode
It's finally here, and if you're paying for chatGPT you know this, the long announced Advanced Voice Mode for chatGPT is now rolled out to all plus members.
The new updated since the beta are, 5 new voices (Maple, Spruce, Vale, Arbor and Sol), finally access to custom instructions and memory, so you can ask it to remember things and also to know who you are and your preferences (try saving your jailbreaks there)
Unfortunately, as predicted, by the time it rolled out to everyone, this feels way less exciting than it did 6 month ago, the model is way less emotional, refuses to sing (tho folks are making it anyway) and generally feels way less "wow" than what we saw. Less "HER" than we wanted for sure Seriously, they nerfed the singing! Why OpenAI, why?
Pro tip of mine that went viral : you can set your action button on the newer iphones to immediately start the voice conversation with 1 click.
*This new mode is not available in EU
This weeks Buzz - our new advanced RAG++ course is live
I had an awesome time with my colleagues Ayush and Bharat today, after they finally released a FREE advanced RAG course they've been working so hard on for the past few months! Definitely check out our conversation, but better yet, why don't you roll into the course? it's FREE and you'll get to learn about data ingestion, evaluation, query enhancement and more!
New Gemini 002 is 50% cheaper, 2x faster and better at MMLU-pro
It seems that every major lab (besides Anthropic) released a big thing this week to try and get under Meta's skin?
Google announced an update to their Gemini Pro/Flash models, called 002, which is a very significant update!
Not only are these models 50% cheaper now (Pro price went down by 50% on <128K context lengths), they are 2x faster on outputs with 3x lower latency on first tokens. It's really quite something to see
The new models have also improved scores, with the Flash models (the super cheap ones, remember) from September, now coming close to or beating the Pro scores from May 2024!
Definitely a worthy update from the team at Google!
Hot off the press, the folks at Google Labs also added a feature to the awesome NotebookLM that allows it to summarize over 50h of youtube videos in the crazy high quality Audio Overview feature!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
That's it for the week, we of course chatted about way way more during the show, so make sure to listen to the podcast this week, but otherwise, signing off for this week, as I travel back home for a weekend, before returning to SF for the OpenAI dev day next week!
Expect full Dev Day coverage live next tuesday and a recap on the newsletter.
Meanwhile, if you've already subscribed, please share this newsletter with 1 or two people who are interested in AI 🙇♂️ and see you next week.
Hey folks, Alex here, back with another ThursdAI recap – and let me tell you, this week's episode was a whirlwind of open-source goodness, mind-bending inference techniques, and a whole lotta talk about talking AIs! We dove deep into the world of LLMs, from Alibaba's massive Qwen 2.5 drop to the quirky, real-time reactions of Moshi.
We even got a sneak peek at Nous Research's ambitious new project, Forge, which promises to unlock some serious LLM potential. So grab your pumpkin spice latte (it's that time again isn't it? 🍁) settle in, and let's recap the AI awesomeness that went down on ThursdAI, September 19th!
ThursdAI is brought to you (as always) by Weights & Biases, we still have a few spots left in our Hackathon this weekend and our new advanced RAG course is now released and is FREE to sign up!
TL;DR of all topics + show notes and links
* Open Source LLMs
* Alibaba Qwen 2.5 models drop + Qwen 2.5 Math and Qwen 2.5 Code (X, HF, Blog, Try It)
* Qwen 2.5 Coder 1.5B is running on a 4 year old phone (Nisten)
* KyutAI open sources Moshi & Mimi (Moshiko & Moshika) - end to end voice chat model (X, HF, Paper)
* Microsoft releases GRIN-MoE - tiny (6.6B active) MoE with 79.4 MMLU (X, HF, GIthub)
* Nvidia - announces NVLM 1.0 - frontier class multimodal LLMS (no weights yet, X)
* Big CO LLMs + APIs
* OpenAI O1 results from LMsys do NOT disappoint - vibe checks also confirm, new KING llm in town (Thread)
* NousResearch announces Forge in waitlist - their MCTS enabled inference product (X)
* This weeks Buzz - everything Weights & Biases related this week
* Judgement Day (hackathon) is in 2 days! Still places to come hack with us Sign up
* Our new RAG Course is live - learn all about advanced RAG from WandB, Cohere and Weaviate (sign up for free)
* Vision & Video
* Youtube announces DreamScreen - generative AI image and video in youtube shorts ( Blog)
* CogVideoX-5B-I2V - leading open source img2video model (X, HF)
* Runway, DreamMachine & Kling all announce text-2-video over API (Runway, DreamMachine)
* Runway announces video 2 video model (X)
* Tools
* Snap announces their XR glasses - have hand tracking and AI features (X)
Open Source Explosion!
👑 Qwen 2.5: new king of OSS llm models with 12 model releases, including instruct, math and coder versions
This week's open-source highlight was undoubtedly the release of Alibaba's Qwen 2.5 models. We had Justin Lin from the Qwen team join us live to break down this monster drop, which includes a whopping seven different sizes, ranging from a nimble 0.5B parameter model all the way up to a colossal 72B beast! And as if that wasn't enough, they also dropped Qwen 2.5 Coder and Qwen 2.5 Math models, further specializing their LLM arsenal. As Justin mentioned, they heard the community's calls for 14B and 32B models loud and clear – and they delivered! "We do not have enough GPUs to train the models," Justin admitted, "but there are a lot of voices in the community...so we endeavor for it and bring them to you." Talk about listening to your users!
Trained on an astronomical 18 trillion tokens (that’s even more than Llama 3.1 at 15T!), Qwen 2.5 shows significant improvements across the board, especially in coding and math. They even open-sourced the previously closed-weight Qwen 2 VL 72B, giving us access to the best open-source vision language models out there. With a 128K context window, these models are ready to tackle some serious tasks. As Nisten exclaimed after putting the 32B model through its paces, "It's really practical…I was dumping in my docs and my code base and then like actually asking questions."
It's safe to say that Qwen 2.5 coder is now the best coding LLM that you can use, and just in time for our chat, a new update from ZeroEval confirms, Qwen 2.5 models are the absolute kings of OSS LLMS, beating Mistral large, 4o-mini, Gemini Flash and other huge models with just 72B parameters 👏
Moshi: The Chatty Cathy of AI
We've covered Moshi Voice back in July, and they have promised to open source the whole stack, and now finally they did! Including the LLM and the Mimi Audio Encoder!
This quirky little 7.6B parameter model is a speech-to-speech marvel, capable of understanding your voice and responding in kind. It's an end-to-end model, meaning it handles the entire speech-to-speech process internally, without relying on separate speech-to-text and text-to-speech models.
While it might not be a logic genius, Moshi's real-time reactions are undeniably uncanny. Wolfram Ravenwolf described the experience: "It's uncanny when you don't even realize you finished speaking and it already starts to answer." The speed comes from the integrated architecture and efficient codecs, boasting a theoretical response time of just 160 milliseconds!
Moshi uses (also open sourced) Mimi neural audio codec, and achieves 12.5 Hz representation with just 1.1 kbps bandwidth.
You can download it and run on your own machine or give it a try here just don't expect a masterful conversationalist hehe
Gradient-Informed MoE (GRIN-MoE): A Tiny Titan
Just before our live show, Microsoft dropped a paper on GrinMoE, a gradient-informed Mixture of Experts model. We were lucky enough to have the lead author, Liyuan Liu (aka Lucas), join us impromptu to discuss this exciting development. Despite having only 6.6B active parameters (16 x 3.8B experts), GrinMoE manages to achieve remarkable performance, even outperforming larger models like Phi-3 on certain benchmarks. It's a testament to the power of clever architecture and training techniques. Plus, it's open-sourced under the MIT license, making it a valuable resource for the community.
NVIDIA NVLM: A Teaser for Now
NVIDIA announced NVLM 1.0, their own set of multimodal LLMs, but alas, no weights were released. We’ll have to wait and see how they stack up against the competition once they finally let us get our hands on them. Interestingly, while claiming SOTA on some vision tasks, they haven't actually compared themselves to Qwen 2 VL, which we know is really really good at vision tasks 🤔
Nous Research Unveils Forge: Inference Time Compute Powerhouse (beating o1 at AIME Eval!)
Fresh off their NousCon event, Karan and Shannon from Nous Research joined us to discuss their latest project, Forge. Described by Shannon as "Jarvis on the front end," Forge is an inference engine designed to push the limits of what’s possible with existing LLMs. Their secret weapon? Inference-time compute. By implementing sophisticated techniques like Monte Carlo Tree Search (MCTS), Forge can outperform larger models on complex reasoning tasks beating OpenAI's o1-preview at the AIME Eval, competition math benchmark, even with smaller, locally runnable models like Hermes 70B. As Karan emphasized, “We’re actually just scoring with Hermes 3.1, which is available to everyone already...we can scale it up to outperform everything on math, just using a system like this.”
Forge isn't just about raw performance, though. It's built with usability and transparency in mind. Unlike OpenAI's 01, which obfuscates its chain of thought reasoning, Forge provides users with a clear visual representation of the model's thought process. "You will still have access in the sidebar to the full chain of thought," Shannon explained, adding, “There’s a little visualizer and it will show you the trajectory through the tree… you’ll be able to see exactly what the model was doing and why the node was selected.” Forge also boasts built-in memory, a graph database, and even code interpreter capabilities, initially supporting Python, making it a powerful platform for building complex LLM applications.
Forge is currently in a closed beta, but a waitlist is open for eager users. Karan and Shannon are taking a cautious approach to the rollout, as this is Nous Research’s first foray into hosting a product. For those lucky enough to gain access, Forge offers a tantalizing glimpse into the future of LLM interaction, promising greater transparency, improved reasoning, and more control over the model's behavior.
For ThursdAI readers early, here's a waitlist form to test it out!
Big Companies and APIs: The Reasoning Revolution
OpenAI’s 01: A New Era of LLM Reasoning
The big story in the Big Tech world is OpenAI's 01. Since we covered it live last week as it dropped, many of us have been playing with these new reasoning models, and collecting "vibes" from the community. These models represent a major leap in reasoning capabilities, and the results speak for themselves.
01 Preview claimed the top spot across the board on the LMSys Arena leaderboard, demonstrating significant improvements in complex tasks like competition math and coding. Even the smaller 01 Mini showed impressive performance, outshining larger models in certain technical areas. (and the jump in ELO score above the rest in MATH is just incredible to see!) and some folks made this video viral, of a PHD candidate reacting to 01 writing in 1 shot, code that took him a year to write, check it out, it’s priceless.
One key aspect of 01 is the concept of “inference-time compute”. As Noam Brown from OpenAI calls it, this represents a "new scaling paradigm", allowing the model to spend more time “thinking” during inference, leading to significantly improved performance on reasoning tasks. The implications of this are vast, opening up the possibility of LLMs tackling long-horizon problems in areas like drug discovery and physics.
However, the opacity surrounding 01’s chain of thought reasoning being hidden/obfuscated and the ban on users asking about it was a major point of contention at least within the ThursdAI chat. As Wolfram Ravenwolf put it, "The AI gives you an answer and you can't even ask how it got there. That is the wrong direction." as he was referring to the fact that not only is asking about the reasoning impossible, some folks were actually getting threatening emails and getting banned from using the product all together 😮
This Week's Buzz: Hackathons and RAG Courses!
We're almost ready to host our Weights & Biases Judgment Day Hackathon (LLMs as a judge, anyone?) with a few spots left, so if you're reading this and in SF, come hang out with us!
And the main thing I gave an update about is our Advanced RAG course, packed with insights from experts at Weights & Biases, Cohere, and Weaviate. Definitely check those out if you want to level up your LLM skills (and it's FREE in our courses academy!)
Vision & Video: The Rise of Generative Video
Generative video is having its moment, with a flurry of exciting announcements this week. First up, the open-source CogVideoX-5B-I2V, which brings accessible image-to-video capabilities to the masses. It's not perfect, but being able to generate video on your own hardware is a game-changer.
On the closed-source front, YouTube announced the integration of generative AI into YouTube Shorts with their DreamScreen feature, bringing AI-powered video generation to a massive audience. We also saw API releases from three leading video model providers: Runway, DreamMachine, and Kling, making it easier than ever to integrate generative video into applications. Runway even unveiled a video-to-video model, offering even more control over the creative process, and it's wild, check out what folks are doing with video-2-video!
One last thing here, Kling is adding a motion brush feature to help users guide their video generations, and it just looks so awesome I wanted to show you
Whew! That was one hell of a week, tho from the big companies perspective, it was a very slow week, getting a new OSS king, an end to end voice model and a new hint of inference platform from Nous, and having all those folks come to the show was awesome!
If you're reading all the way down to here, it seems that you like this content, why not share it with 1 or two friends? 👇 And as always, thank you for reading and subscribing! 🫶
P.S - I’m traveling for the next two weeks, and this week the live show was live recorded from San Francisco, thanks to my dear friends swyx & Alessio for hosting my again in their awesome Latent Space pod studio at Solaris SF!
March 14th, 2023 was the day ThursdAI was born, it was also the day OpenAI released GPT-4, and I jumped into a Twitter space and started chaotically reacting together with other folks about what a new release of a paradigm shifting model from OpenAI means, what are the details, the new capabilities. Today, it happened again!
Hey, it's Alex, I'm back from my mini vacation (pic after the signature) and boy am I glad I decided to not miss September 12th! The long rumored 🍓 thinking model from OpenAI, dropped as breaking news in the middle of ThursdAI live show, giving us plenty of time to react live!
But before this, we already had an amazing show with some great guests! Devendra Chaplot from Mistral came on and talked about their newly torrented (yeah they did that again) Pixtral VLM, their first multi modal! , and then I had the honor to host Steven Johnson and Raiza Martin from NotebookLM team at Google Labs which shipped something so uncannily good, that I legit said "holy fu*k" on X in a reaction!
So let's get into it (TL;DR and links will be at the end of this newsletter)
OpenAI o1, o1 preview and o1-mini, a series of new "reasoning" models
This is it folks, the strawberries have bloomed, and we finally get to taste them. OpenAI has released (without a waitlist, 100% rollout!) o1-preview and o1-mini models to chatGPT and API (tho only for tier-5 customers) 👏 and are working on releasing 01 as well.
These are models that think before they speak, and have been trained to imitate "system 2" thinking, and integrate chain-of-thought reasoning internally, using Reinforcement Learning and special thinking tokens, which allows them to actually review what they are about to say before they are saying it, achieving remarkable results on logic based questions.
Specifically you can see the jumps in the very very hard things like competition math and competition code, because those usually require a lot of reasoning, which is what these models were trained to do well.
New scaling paradigm
Noam Brown from OpenAI calls this a "new scaling paradigm" and Dr Jim Fan explains why, with this new way of "reasoning", the longer the model thinks - the better it does on reasoning tasks, they call this "test-time compute" or "inference-time compute" as opposed to compute that was used to train the model. This shifting of computation down to inference time is the essence of the paradigm shift, as in, pre-training can be very limiting computationally as the models scale in size of parameters, they can only go so big until you have to start building out a huge new supercluster of GPUs to host the next training run (Remember Elon's Colossus from last week?).
The interesting thing to consider here is, while current "thinking" times are ranging between a few seconds to a minute, imagine giving this model hours, days, weeks to think about new drug problems, physics problems 🤯.
Prompting o1
Interestingly, a new prompting paradigm has also been introduced. These models now have CoT (think "step by step") built-in, so you no longer have to include it in your prompts. By simply switching to o1-mini, most users will see better results right off the bat. OpenAI has worked with the Devin team to test drive these models, and these folks found that asking the new models to just give the final answer often works better and avoids redundancy in instructions.
The community of course will learn what works and doesn't in the next few hours, days, weeks, which is why we got 01-preview and not the actual (much better) o1.
Safety implications and future plans
According to Greg Brokman, this inference time compute also greatly helps with aligning the model to policies, giving it time to think about policies at length, and improving security and jailbreak preventions, not only logic.
The folks at OpenAI are so proud of all of the above that they have decided to restart the count and call this series o1, but they did mention that they are going to release GPT series models as well, adding to the confusing marketing around their models.
Open Source LLMs
Reflecting on Reflection 70B
Last week, Reflection 70B was supposed to launch live on the ThursdAI show, and while it didn't happen live, I did add it in post editing, and sent the newsletter, and packed my bag, and flew for my vacation. I got many DMs since then, and at some point couldn't resist checking and what I saw was complete chaos, and despite this, I tried to disconnect still until last night.
So here's what I could gather since last night. The claims of a llama 3.1 70B finetune that Matt Shumer and Sahil Chaudhary from Glaive beating Sonnet 3.5 are proven false, nobody was able to reproduce those evals they posted and boasted about, which is a damn shame.
Not only that, multiple trusted folks from our community, like Kyle Corbitt, Alex Atallah have reached out to Matt in to try to and get to the bottom of how such a thing would happen, and how claims like these could have been made in good faith. (or was there foul play)
The core idea of something like Reflection is actually very interesting, but alas, the inability to replicate, but also to stop engaging with he community openly (I've reached out to Matt and given him the opportunity to come to the show and address the topic, he did not reply), keep the model on hugging face where it's still trending, claiming to be the world's number 1 open source model, all these smell really bad, despite multiple efforts on out part to give the benefit of the doubt here.
As for my part in building the hype on this (last week's issues till claims that this model is top open source model), I addressed it in the beginning of the show, but then twitter spaces crashed, but unfortunately as much as I'd like to be able to personally check every thing I cover, I often have to rely on the reputation of my sources, which is easier with established big companies, and this time this approached failed me.
This weeks Buzzzzzz - One last week till our hackathon!
Look at this point, if you read this newsletter and don't know about our hackathon, then I really didn't do my job prompting it, but it's coming up, September 21-22 ! Join us, it's going to be a LOT of fun!
🖼️ Pixtral 12B from Mistral
Mistral AI burst onto the scene with Pixtral, their first multimodal model! Devendra Chaplot, research scientist at Mistral, joined ThursdAI to explain their unique approach, ditching fixed image resolutions and training a vision encoder from scratch.
"We designed this from the ground up to...get the most value per flop," Devendra explained. Pixtral handles multiple images interleaved with text within a 128k context window - a far cry from the single-image capabilities of most open-source multimodal models. And to make the community erupt in thunderous applause (cue the clap emojis!) they released the 12 billion parameter model under the ultra-permissive Apache 2.0 license. You can give Pixtral a whirl on Hyperbolic, HuggingFace, or directly through Mistral.
DeepSeek 2.5: When Intelligence Per Watt is King
Deepseek 2.5 launched amid the reflection news and did NOT get the deserved attention it.... deserves. It folded (no deprecated) Deepseek Coder into 2.5 and shows incredible metrics and a truly next-gen architecture. "It's like a higher order MOE", Nisten revealed, "which has this whole like pile of brain and it just like picks every time, from that." 🤯. DeepSeek 2.5 achieves maximum "intelligence per active parameter"
Google's turning text into AI podcast for auditory learners with Audio Overviews
Today I had the awesome pleasure of chatting with Steven Johnson and Raiza Martin from the NotebookLM team at Google Labs. NotebookLM is a research tool, that if you haven't used, you should definitely give it a spin, and this week they launched something I saw in preview and was looking forward to checking out and honestly was jaw-droppingly impressed today.
NotebookLM allows you to upload up to 50 "sources" which can be PDFs, web links that they will scrape for you, documents etc' (no multimodality so far) and will allow you to chat with them, create study guides, dive deeper and add notes as you study.
This week's update allows someone who doesn't like reading, to turn all those sources into a legit 5-10 minute podcast, and that sounds so realistic, that I was honestly blown away. I uploaded a documentation of fastHTML in there.. and well hear for yourself
The conversation with Steven and Raiza was really fun, podcast definitely give it a listen!
Not to mention that Google released (under waitlist) another podcast creating tool called illuminate, that will convert ArXiv papers into similar sounding very realistic 6-10 minute podcasts!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
There are many more updates from this week, there was a whole Apple keynote I missed, which had a new point and describe feature with AI on the new iPhones and Apple Intelligence, Google also released new DataGemma 27B, and more things in TL'DR which are posted here in raw format
See you next week 🫡 Thank you for being a subscriber, weeks like this are the reason we keep doing this! 🔥 Hope you enjoy these models, leave in comments what you think about them
TL;DR in raw format
* Open Source LLMs
* Reflect on Reflection 70B & Matt Shumer (X, Sahil)
* Mixtral releases Pixtral 12B - multimodal model (X, try it)
* Pixtral is really good at OCR says swyx
* Interview with Devendra Chaplot on ThursdAI
* Initial reports of Pixtral beating GPT-4 on WildVision arena from AllenAI
* JinaIA reader-lm-0.5b and reader-lm-1.5b (X)
* ZeroEval updates
* Deepseek 2.5 -
* Deepseek coder is now folded into DeepSeek v2.5
* 89 HumanEval (up from 84 from deepseek v2)
* 9 on MT-bench
* Google - DataGemma 27B (RIG/RAG) for improving results
* Retrieval-Interleaved Generation
* 🤖 DataGemma: AI models that connect LLMs to Google's Data Commons
* 📊 Data Commons: A vast repository of trustworthy public data
* 🔍 Tackling AI hallucination by grounding LLMs in real-world data
* 🔍 Two approaches: RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation)
* 🔍 Preliminary results show enhanced accuracy and reduced hallucinations
* 🔓 Making DataGemma open models to enable broader adoption
* 🌍 Empowering informed decisions and deeper understanding of the world
* 🔍 Ongoing research to refine the methodologies and scale the work
* 🔍 Integrating DataGemma into Gemma and Gemini AI models
* 🤝 Collaborating with researchers and developers through quickstart notebooks
* Big CO LLMs + APIs
* Apple event
* Apple Intelligence - launching soon
* Visual Intelligence with a dedicated button
* Google Illuminate - generate arXiv paper into multiple speaker podcasts (Website)
* 5-10 min podcasts
* multiple speakers
* any paper
* waitlist
* has samples
* sounds super cool
* Google NotebookLM is finally available - multi modal research tool + podcast (NotebookLM)
* Has RAG like abilities, can add sources from drive or direct web links
* Currently not multimodal
* Generation of multi speaker conversation about this topic to present it, sounds really really realistic
* Chat with Steven and Raiza
* OpenAI reveals new o1 models, and launches o1 preview and o1-mini in chat and API (X, Blog)
* Trained with RL to think before it speaks with special thinking tokens (that you pay for)
* new scaling paradigm
* This weeks Buzz
* Vision & Video
* Adobe announces Firefly video model (X)
* Voice & Audio
* Hume launches EVI 2 (X)
* Fish Speech 1.4 (X)
* Instant Voice Cloning
* Ultra low latenc
* ~1GB model weights
* LLaMA-Omni, a new model for speech interaction (X)
* Tools
* New Jina reader (X)
Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil form Glaive.ai just dropped a LLama 3.1 70B finetune that you can download that will outperform Claude Sonnet 3.5 while running locally on your machine?
Today was a VERY heavy Open Source focused show, we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates and a lot more.
In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster called Colossus (100K H100 GPUs) which was scaled in 122 days 😮 and Anthropic announced that they have 500K context window Claude that's only reserved if you're an enterprise customer, while OpenAI is floating an idea of a $2000/mo subscription for Orion, their next version of a 100x better chatGPT?!
TL;DR
* Open Source LLMs
* Matt Shumer / Glaive - Reflection-LLama 70B beats Claude 3.5 (X, HF)
* Allen AI - OLMoE - first "good" MoE 100% OpenSource (X, Blog, Paper, WandB)
* RWKV.cpp is deployed with Windows to 1.5 Billion devices
* MMMU pro - more robust multi disipline multimodal understanding bench (proj)
* 01AI - Yi-Coder 1.5B and 9B (X, Blog, HF)
* Big CO LLMs + APIs
* Replit launches Agent in beta - from coding to production (X, Try It)
* Ilya SSI announces 1B round from everyone (Post)
* Cohere updates Command-R and Command R+ on API (Blog)
* Claude Enterprise with 500K context window (Blog)
* Claude invisibly adds instructions (even via the API?) (X)
* Google got structured output finally (Docs)
* Amazon to include Claude in Alexa starting this October (Blog)
* X ai scaled Colossus to 100K H100 GPU goes online (X)
* DeepMind - AlphaProteo new paper (Blog, Paper, Video)
* This weeks Buzz
* Hackathon did we mention? We're going to have Eugene and Greg as Judges!
* AI Art & Diffusion & 3D
* ByteDance - LoopyAvatar - Audio Driven portait avatars (Page)
Open Source LLMs
Reflection Llama-3.1 70B - new 👑 open source LLM from Matt Shumer / GlaiveAI
This model is BANANAs folks, this is a LLama 70b finetune, that was trained with a new way that Matt came up with, that bakes CoT and Reflection into the model via Finetune, which results in model outputting its thinking as though you'd prompt it in a certain way.
This causes the model to say something, and then check itself, and then reflect on the check and then finally give you a much better answer. Now you may be thinking, we could do this before, RefleXion (arxiv.org/2303.11366) came out a year ago, so what's new?
What's new is, this is now happening inside the models head, you don't have to reprompt, you don't even have to know about these techniques! So what you see above, is just colored differently, but all of it, is output by the model without extra prompting by the user or extra tricks in system prompt. the model thinks, plans, does chain of thought, then reviews and reflects, and then gives an answer!
And the results are quite incredible for a 70B model 👇
Looking at these evals, this is a 70B model that beats GPT-4o, Claude 3.5 on Instruction Following (IFEval), MATH, GSM8K with 99.2% 😮 and gets very close to Claude on GPQA and HumanEval!
(Note that these comparisons are a bit of a apples to ... different types of apples. If you apply CoT and reflection to the Claude 3.5 model, they may in fact perform better on the above, as this won't be counted 0-shot anymore. But given that this new model is effectively spitting out those reflection tokens, I'm ok with this comparison)
This is just the 70B, next week the folks are planning to drop the 405B finetune with the technical report, so stay tuned for that!
Kudos on this work, go give Matt Shumer and Glaive AI a follow!
Allen AI OLMoE - tiny "good" MoE that's 100% open source, weights, code, logs
We've previously covered OLMO from Allen Institute, and back then it was obvious how much commitment they have to open source, and this week they continued on this path with the release of OLMoE, an Mixture of Experts 7B parameter model (1B active parameters), trained from scratch on 5T tokens, which was completely open sourced.
This model punches above its weights on the best performance/cost ratio chart for MoEs and definitely highest on the charts of releasing everything.
By everything here, we mean... everything, not only the final weights file; they released 255 checkpoints (every 5000 steps), the training code (Github) and even (and maybe the best part) the Weights & Biases logs!
It was a pleasure to host the leading author of the OLMoE paper, Niklas Muennighoff on the show today, so definitely give this segment a listen, he's a great guest and I learned a lot!
Big Companies LLMs + API
Anthropic has 500K context window Claude but only for Enterprise?
Well, this sucks (unless you work for Midjourney, Airtable or Deloitte). Apparently Anthropic has been sitting on Claude that can extend to half a million tokens in the context window, and decided to keep it to themselves and a few trial enterprises, and package it as an Enterprise offering.
This offering now includes, beyond just the context window, also a native Github integration, and a few key enterprise features like access logs, provisioning and SCIM and all kinds of "procurement and CISO required" stuff enterprises look for.
To be clear, this is a great move for Anthropic, and this isn't an API tier, this is for their front end offering, including the indredible artifacts tool, so that companies can buy their employees access to Claude.ai and have them be way more productive coding (hence the Github integration) or summarizing (very very) long documents, building mockups and one off apps etc'
Anthropic is also in the news this week, because Amazon announced that it'll use Claude as the backbone for the smart (or "remarkable" as they call it) Alexa brains coming up in October, which, again, incredible for Anthropic distribution, as there are maybe 100M Alexa users in the world or so.
Prompt injecting must stop!
And lastly, there have been mounting evidence, including our own Wolfram Ravenwolf that confirmed it, that Anthropic is prompt injecting additional context into your own prompts, in the UI but also via the API! This is awful practice and if anyone from there reads this newsletter, please stop or at least acknowledge. Claude apparently just... thinks that it's something my users said, when in fact, it's some middle layer of anthropic security decided to just inject some additional words in there!
XAI turns on the largest training GPU SuperCluster Colossus - 100K H100 GPUS
This is a huge deal for AI, specifically due to the time this took and the massive massive scale of this SuperCluster. SuperCluster means all these GPUs sit in one datacenter, drawing from the same power-grid and can effectively run single training jobs.
This took just 122 days for Elon and the XAI team to go from an empty warehouse in Memphis to booting up an incredible 100K H100, and they claim that they will double this capacity by adding 50K H200 in the next few months. As Elon mentioned when they released Grok2, it was trained on 15K, and it matched GPT4!
Per SemiAnalisys, this new Supercluster can train a GPT-4 level model in just 4 days 🤯
XAI was founded a year ago, and by end of this year, they plan for Grok to be the beast LLM in the world, and not just get to GPT-4ish levels, and with this + 6B investment they have taken in early this year, it seems like they are well on track, which makes some folks at OpenAI reportedly worried
This weeks buzz - we're in SF in less than two weeks, join our hackathon!
This time I'm very pleased to announce incredible judges for our hackathon, the spaces are limited, but there's still some spaces so please feel free to sign up and join us
I'm so honored to announce that we'll have Eugene Yan (@eugeneyan), Greg Kamradt (@GregKamradt) and Charles Frye (@charles_irl) on the Judges panel. 🤩 It'll be incredible to have these folks see what hackers come up with, and I'm excited as this comes closer!
Replit launches Agents beta - a fully integrated code → deployment agent
Replit is a great integrated editing environment, with database and production in 1 click and they've had their LLMs trained on a LOT of code helping folks code for a while.
Now they are launching agents, which seems very smart from them, given that development is much more than just coding. All the recent excitement we see about Cursor, is omitting the fact that those demos are only working for folks who already know how to set up the environment, and then there's the need to deploy to production, maintain.
Replit has that basically built in, and now their Agent can build a plan and help you build those apps, and "ship" them, while showing you what they are doing. This is massive, and I can't wait to play around with this!
The additional benefit of Replit is that they nailed the mobile app experience as well, so this now works from mobile, on the go!
In fact, as I was writing this, I got so excited that I paused for 30 minutes, payed the yearly subscription and decided to give building an app a try!
The fact that this can deploy and run the server and the frontend, detect errors, fix them, and then also provision a DB for me, provision Stripe, login buttons and everything else is quite insane.
Can't wait to see what I can spin up with this 🔥 (and show all of you!)
Loopy - Animated Avatars from ByteDance
A new animated avatar project from folks at ByteDance just dropped, and it’s WAY clearer than anything we’ve seen before, like EMO or anything else. I will just add this video here for you to enjoy and look at the earring movements, vocal cords, eyes, everything!
I of course wanted to know if I’ll ever be able to use this, and .. likely no, here’s the response I got from Jianwen one of the Authors today.
That's it for this week, we've talked about so much more in the pod, please please check it out.
As for me, while so many exciting things are happening, I'm going on a small 🏝️ vacation until next ThursdAI, which will happen on schedule, so planning to decompress and disconnect, but will still be checking in, so if you see things that are interesting, please tag me on X 🙏
P.S - I want to shout out a dear community member that's been doing just that, @PresidentLin has been tagging me in many AI related releases, often way before I would even notice them, so please give them a follow! 🫡
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Hey, for the least time during summer of 2024, welcome to yet another edition of ThursdAI, also happy skynet self-awareness day for those who keep track :)
This week, Cerebras broke the world record for fastest LLama 3.1 70B/8B inference (and came on the show to talk about it) Google updated 3 new Geminis, Anthropic artifacts for all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more!
As always, this weeks newsletter is brought to you by Weights & Biases, did I mention we're doing a hackathon in SF in September 21/22 and that we have an upcoming free RAG course w/ Cohere & Weaviate?
TL;DR
* Open Source LLMs
* Nous DisTrO - Distributed Training (X , Report)
* NousResearch/ hermes-function-calling-v1 open sourced - (X, HF)
* LinkedIN Liger-Kernel - OneLine to make Training 20% faster & 60% more memory Efficient (Github)
* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (X, Blog)
* Big CO LLMs + APIs
* Cerebras launches the fastest AI inference - 447t/s LLama 3.1 70B (X, Blog, Try It)
* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (X, Try it)
* Google adds Gems & Imagen to Gemini paid tier
* Anthropic artifacts available to all users + on mobile (Blog, Try it)
* Anthropic publishes their system prompts with model releases (release notes)
* OpenAI has project Strawberry coming this fall (via The information)
* This weeks Buzz
* WandB Hackathon hackathon hackathon (Register, Join)
* Also, we have a new RAG course w/ Cohere and Weaviate (RAG Course)
* Vision & Video
* Zhipu AI CogVideoX - 5B Video Model w/ Less 10GB of VRAM (X, HF, Try it)
* Qwen-2 VL 72B,7B,2B - new SOTA vision models from QWEN (X, Blog, HF)
* AI Art & Diffusion & 3D
* GameNgen - completely generated (not rendered) DOOM with SD1.4 (project)
* FAL new LORA trainer for FLUX - trains under 5 minutes (Trainer, Coupon for ThursdAI)
* Tools & Others
* SimpleBench from AI Explained - closely matches human experience (simple-bench.com)
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
Open Source
Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.
Nous Research DiStRO + Function Calling V1
Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DiStRO, which is a breakthrough in distributed training. You see, while LLM training requires a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center.
Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet work well within the same datacenter, but training across different GPU clouds has been unimaginable until now.
Enter DiStRo, a new decentralized training by the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from 74.4GB to just 86MB (857x)!
This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency and democratizing LLM training access! So don't sell your old GPUs just yet, someone may just come up with a folding@home but for training the largest open source LLM, and it may just be Nous!
Nous Research also released their function-calling-v1 dataset (HF) that was used to train Hermes-2, and we had InterstellarNinja who authored that dataset, join the show and chat about it. This is an incredible unlock for the open source community, as function calling become a de-facto standard now. Shout out to the Glaive team as well for their pioneering work that paved the way!
LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)
What if I told you, that whatever software you develop, you can add 1 line of code, and it'll run 20% faster, and require 60% less memory?
This is basically what Linkedin researches released this week with Liger Kernel, yes you read that right, Linkedin, as in the website you career related posts on!
"If you're doing any form of finetuning, using this is an instant win"Wing Lian - Axolotl
This absolutely bonkers improvement in training LLMs, now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the triton kernels, you can see a deep dive here, I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities / intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!
Huge shoutout to Byron and team at Linkedin for this unlock, check out their Github if you want to get involved!
Qwen-2 VL - SOTA image and video understanding + open weights mini VLM
You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequeny co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on thursdays around the time of the live show, I wonder why!)
But also because, they are committed to open source, and have released 2 models 7B and 2B with complete Apache 2 license!
First of all, their Qwen-2 VL 72B model, is now SOTA at many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said, this is a 72B param model, that beats GPT-4o on document understanding, on math, on general visual Q&A.
Additional Capabilities & Smaller models
They have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is video understanding. These models can now understand up to 20 minutes of video sequences, and it's not just "split the video into 10 frames and do image captioning"; no, these models understand video progression, and if I understand correctly how they do it, it's quite genius.
They embed the video's time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings.
Now, the 72B model is currently available via API, but we do get 2 new small models with Apache 2 license and they are NOT too shabby either!
The 7B (HF) and 2B (HF) Qwen-2 VL models are small enough to run completely on your machine, and the 2B parameter model scores better than GPT-4o mini on OCR-bench, for example!
I can't wait to finish writing and go play with these models!
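If you want to play along, here's a minimal image-Q&A sketch for the small one; the Qwen/Qwen2-VL-2B-Instruct Hugging Face id and the processor flow are my assumptions from the release, and you'll need a transformers build recent enough to ship the Qwen2-VL classes:

```python
# Minimal sketch: image Q&A with the 2B Qwen2-VL (assumed HF id below).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What text appears in this image?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("receipt.png")],
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```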
Big Companies & LLM APIs
The biggest news this week came from Cerebras Systems, a relatively unknown company that shattered the world record for LLM inference out of the blue (and came on the show to talk about how they're doing it).
Cerebras - fastest LLM inference on wafer scale chips
Cerebras has introduced the concept of wafer-scale chips to the world. If you imagine a regular microchip, it's maybe the size of a postage stamp; GPUs are bigger. Well, Cerebras makes chips the size of an iPad (72 square inches), the largest commercial chips in the world.
And now they've created an inference stack on top of those chips and shown that they have the fastest inference in the world. How fast? Well, they can serve Llama 3.1 8B at a whopping 1,822 t/s. No really, these are INSANE speeds; as I was writing this, I copied all the words I had so far, went to inference.cerebras.ai, asked it to summarize, pasted and hit send, and I immediately got a summary!
"The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip." - James Wang
They not only store the whole model (405B coming soon), they store it in full fp16 precision, so the models aren't quantized. Right now they're serving an 8K-token context window, and we had a conversation about their next step being giving more context to developers.
The whole conversation is well worth listening to, James and Ian were awesome to chat with, and while they do have a waitlist, as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed.
P.S - we also did an independent verification of these speeds, using Weave, and found Cerebras to be quite incredible for agentic purposes, you can read our report here and the weave dashboard here
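And because the endpoint is OpenAI-compatible, trying it is a few lines once you have a key. A sketch, with the base URL and model name being my best guesses from the launch (check their docs):

```python
# Sketch of hitting Cerebras' OpenAI-compatible endpoint (assumed values).
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed endpoint, per launch
    api_key="YOUR_CEREBRAS_KEY",
)
resp = client.chat.completions.create(
    model="llama3.1-8b",  # the 1,800+ t/s model from the episode
    messages=[{"role": "user",
               "content": "Summarize this week's AI news in one line."}],
)
print(resp.choices[0].message.content)
```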
Anthropic - unlocking just-in-time applications with artifacts for all
Well, if you aren't paying for Claude, maybe this will convince you. This week, Anthropic announced that Artifacts are available to all users, not only their paid customers.
Artifacts are a feature in Claude that is basically a side pane (and, from this week, a drawer in their mobile apps) that lets you see what Claude is building by rendering the web application almost on the fly. They have also trained Claude to work with that interface, so it knows about the different files etc.
Effectively, this turns Claude into a web developer that will build mini web applications (without backend) for you, on the fly, for any task you can think of.
Drop a design, and it'll build a mock of it, drop some data in a CSV and it'll build an interactive onetime dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill.
Artifacts are shareable and remixable, so you can build something and share it with friends. So here you go: an artifact I made by dropping my notes into Claude and asking for a Magic 8 Ball that spits out a random fact from today's edition of ThursdAI. I also provided Claude with an 8 Ball image, but that didn't work due to restrictions, so instead I just uploaded the image to Claude and asked it to recreate it with SVG! And voilà, a completely unnecessary app that works!
Google’s Gemini Keeps Climbing the Charts (But Will It Be Enough?)
Sensing a disturbance in the AI force (probably from that Cerebras bombshell), Google rolled out a series of Gemini updates, including a new experimental Gemini 1.5 Pro (0827) with sharper coding skills and logical reasoning. According to LMSys, it’s already nipping at the heels of ChatGPT 4o and is number 2!
Their Gemini 1.5 Flash model got a serious upgrade, vaulting to the #6 position on the arena. And to add to the model madness, they even released a Gemini Flash 8B parameter version for folks who need that sweet spot between speed and size.
Oh, and those long-awaited Gems are finally starting to roll out. But get ready to open your wallet – this feature (preloading Gemini with custom context and capabilities) is a paid-tier exclusive. But hey, at least Imagen-3 is cautiously returning to the image generation game!
AI Art & Diffusion
Doom Meets Stable Diffusion: AI Dreams in 20FPS Glory (GameNGen)
The future of video games is, uh, definitely going to be interesting. Just as everyone thought AI would be conquering Go or Chess, it seems we've stumbled into a different battlefield: first-person shooters. 🤯
This week, researchers at DeepMind blew everyone's minds with their GameNGen research. What did they do? They trained Stable Diffusion 1.4 on Doom, and I'm not just talking about static images - I'm talking about generating actual Doom gameplay in near real time. Think 20FPS Doom running on nothing but the magic of AI.
The craziest part to me is this quote "Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation"
FAL Drops the LORA Training Time Bomb (and I Get a New Meme!)
As you can see, I still haven't relaxed from making custom AI generations with Flux and customizing them by training LoRAs. Two weeks ago, this used to take 45 minutes; a week ago, 20 minutes; and now the wizards at FAL have created a new trainer that shrinks training times down to less than 5 minutes!
So, given that Polaris Dawn, the first SpaceX commercial spacewalk, is coming up, I trained a SpaceX astronaut LoRA and then combined my face with it, and voilà, here I am as a SpaceX astronaut!
BTW, because they are awesome, Jonathan and Simo (the magician behind this new trainer) came on the show, announced the new trainer, and also gave all listeners of ThursdAI a coupon to train a LoRA effectively for free - just use this link and start training! (BTW, I get nothing out of this, just trying to look out for my listeners!)
That's it for this week. Well, almost: just as we were wrapping up, magic.dev announced a new $320 million funding round, and that they have models capable of a 100M token context window plus a coding product to go with them, though nothing is released yet. Sam Altman tweeted that OpenAI now has over 200 million active users on ChatGPT and that OpenAI will collaborate with the AI Safety Institute.
Ok now officially that's it! See you next week, when it's going to be 🍁 already brrr
Hey there, Alex here with an end-of-summer edition of our show, which did not disappoint. Today is the official anniversary of Stable Diffusion 1.4, can you believe it?
It's the second week in a row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on last week's show), and spoiler alert, we may have something cooking for next week as well!
This edition of ThursdAI is brought to you by W&B Weave, our LLM observability toolkit, letting you evaluate LLMs for your own use-case easily
Also this week, we covered both ends of AI progress: a doomerist CEO saying "Fck Gen AI" vs. an 8-year-old coder, and I continued to geek out on putting myself into memes (I promised I'll stop... at some point), so buckle up, let's take a look at another crazy week:
TL;DR
* Open Source LLMs
* AI21 releases Jamba 1.5 Large / Mini hybrid Mamba MoE (X, Blog, HF)
* Microsoft Phi 3.5 - 3 new models including MoE (X, HF)
* BFCL 2 - Berkeley Function Calling Leaderboard V2 (X, Blog, Leaderboard)
* NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 12B (HF)
* Cohere paper proves - code improves intelligence (X, Paper)
* MOHAWK - transformer → Mamba distillation method (X, Paper, Blog)
* AI Art & Diffusion & 3D
* Ideogram launches v2 - new img diffusion king 👑 + API (X, Blog, Try it)
* Midjourney is now on web + free tier (try it finally)
* Flux keeps getting better, cheaper, faster + adoption from OSS (X, X, X)
* Procreate hates generative AI (X)
* Big CO LLMs + APIs
* Grok 2 full is finally available on X - performs well on real time queries (X)
* OpenAI adds GPT-4o Finetuning (blog)
* Google API updates - 1000 pages PDFs + LOTS of free tokens (X)
* This week's Buzz
* Weights & Biases Judgement Day SF Hackathon on September 21-22 (Sign up to hack)
* Video
* Hotshot - new video model - trained by 4 guys (try it, technical deep dive)
* Luma Dream Machine 1.5 (X, Try it)
* Tools & Others
* LM Studio 0.3.0 update - local RAG, structured outputs with any model & more (X)
* Vercel - v0 now has chat (X)
* Ark - a completely offline device - offline LLM + worlds maps (X)
* Ricky's daughter coding with Cursor video is a must watch (video)
The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling Heroes
We kick things off this week by focusing on what we love the most on ThursdAI, open-source models! We had a ton of incredible releases this week, starting off with something we were super lucky to have live, the official announcement of AI21's latest LLM: Jamba.
AI21 Officially Announces Jamba 1.5 Large/Mini – The Powerhouse Architecture Combines Transformer and Mamba
While we covered the Jamba release on the show back in April, Jamba 1.5 is an updated powerhouse. It's 2 models, Large and Mini, both MoE, and both still the hybrid Transformer + Mamba architecture that tries to get the best of both worlds.
Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an exclusive first look, giving us the full rundown on this developer-ready model with an awesome 256K context window, but it's not just the size – it’s about using that size effectively.
AI21 measured the effective context use of their model on the new RULER benchmark released by NVIDIA (an iteration of the needle-in-the-haystack test) and showed that their models fully utilize their context, as opposed to many other models.
“As you mentioned, we’re able to pack many, many tokens on a single GPU. Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, ExpertsInt8, a novel quantization technique specifically designed for MoE models.
Oh, and did we mention Jamba is multilingual (eight languages and counting), natively supports structured JSON, function calling, document digestion… basically everything developers dream of? They even chucked in citation generation: since the long context can contain full documents, your RAG app may not even need to chunk anything, and the citations can point to full documents!
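If you'd rather poke at it locally, here's a rough sketch for the Mini model; the ai21labs/AI21-Jamba-1.5-Mini Hugging Face id is my assumption from the release, and the Large model really wants multiple GPUs plus their ExpertsInt8 path:

```python
# Rough sketch: long-document Q&A with Jamba 1.5 Mini (assumed HF id).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/AI21-Jamba-1.5-Mini"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The 256K window means you can drop whole documents in, no chunking.
doc = open("full_report.txt").read()
prompt = f"{doc}\n\nQuestion: What are the key findings? Cite the document."
ids = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=300)[0]))
```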
Berkeley Function Calling Leaderboard V2: Updated + Live (link)
Ever wondered how to measure the real-world magic of those models boasting "I can call functions! I can do tool use! Look how cool I am!" 😎? Enter the Berkeley Function Calling Leaderboard (BFCL) 2, a battleground where models clash to prove their function calling prowess.
Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a "Live Dataset" - a dynamic, user-contributed treasure trove of real-world queries, rare function documentations, and specialized use-cases spanning multiple languages. Translation: NO more biased, contaminated datasets. BFCL 2 is as close to the real world as it gets.
So, who's sitting on the Function Calling throne this week? Our old friend Claude 3.5 Sonnet, with an impressive score of 73.61. But breathing down its neck is GPT-4-0613 (the OG function calling master) with 73.5. That's right, the model released over a year ago, the first one with function calling, in fact the first LLM with function calling as a concept IIRC!
Now, prepare for the REAL plot twist. The top-performing open-source model isn’t some big name, resource-heavy behemoth. It’s a tiny little underdog called Functionary Medium 3.1, a finetuned version of Llama 3.1 that blew everyone away. It even outscored both versions of Claude 3 Opus AND GPT 4 - leaving folks scrambling to figure out WHO created this masterpiece.
“I’ve never heard of this model. It's MIT licensed from an organization called MeetKai. Have you guys heard about Functionary Medium?” I asked, echoing the collective bafflement in the space. Yep, turns out there’s gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.
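If you've never touched function calling, here's roughly the kind of task BFCL scores, sketched with the OpenAI-style tools API (the get_weather schema here is a made-up example, not from the benchmark):

```python
# The model is handed function schemas and must emit a correct call.
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is it raining in Berkeley?"}],
    tools=tools,
)
# A well-behaved model returns a tool_call like get_weather(city="Berkeley");
# BFCL essentially grades whether that call is well-formed and correct.
print(resp.choices[0].message.tool_calls)
```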
Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license
3 new Phis dropped this week, including an MoE one and a new revamped vision one. They look very decent on benchmarks yet again, with the mini version (3.8B) seemingly beating Llama 3.1 8B on a few of them.
However, as before, the excitement is met with caution: Phi models tend to look great on benchmarks, but once folks actually talk to them, they're usually not as impressed.
Terry from BigCodeBench also saw a significant decrease in coding ability for Phi 3.5 vs 3.1
Of course, we're not complaining; the models were released with 128K context and an MIT license.
The thing I'm most excited about is the vision model updates, it has been updated with "multi-frame image understanding and reasoning" which is a big deal! This means understanding videos more natively across scenes.
This week's Buzz
Hey, if you're reading this while sitting in the Bay Area, and you don't have plans for exactly a month from now, why don't you come and hack with me? (Register Free)
Announcing the first W&B hackathon, Judgement Day, which is going to be focused on LLM-as-a-judge! Come hack on innovative LLM-as-a-judge ideas, UIs, evals and more, meet other like-minded hackers and AI engineers, and win great prizes!
🎨 AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhere
While there was little news from the big LLM labs this week, there is a LOT of AI art news, which is fitting for celebrating the 2-year Stable Diffusion 1.4 anniversary!
👑 Ideogram v2: Text Wizardry and API Access (But No Loras… Yet?)
With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness!
They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratios you'd like and also, brands can now provide color palettes to control the outputs!
Adding to this is a new API offering (8¢ per image for the main model, 5¢ for the new turbo model of v2!) and a new iOS app. They also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt.
They claim a significant improvement over Flux[pro] and DALL-E 3 in text, alignment and overall quality; interesting that MJ was not compared!
Meanwhile, Midjourney finally launched a website and a free tier, so no longer do you have to learn to use Discord to even try Midjourney.
Meanwhile Flux enjoys the fruits of Open Source
While Ideogram and MJ fight it out on the closed-source side, Black Forest Labs enjoys the fruits of having released their weights in the open.
Fal just released an update that makes LoRAs run 2.5x faster and 2.5x cheaper, CivitAI has LoRAs for pretty much every character and celebrity ported to FLUX already, different techniques like ControlNet Unions, IPAdapters and more are being trained as we speak, and tutorials upon tutorials are being released on how to customize these models, for free (shoutout to my friend Matt Wolfe for this one).
You can now train your own face on fal.ai, replicate.com and astria.ai, and thanks to Astria, I was able to find some old generations of my LoRAs from the 1.5 days (not quite 1.4, but still enough to show the difference between then and now), and whoa.
🤔 Is This AI Tool Necessary, Bro?
Let's end with a topic that stirred up a hornet's nest of opinions this week: Procreate, a beloved iPad design app, publicly declared their "fing hate" for generative AI.
Yeah, you read that right. Hate. The CEO, in a public statement, went FULL scorched earth, proclaiming that AI-powered features would never sully the pristine code of their precious app.
“Instead of trying to bridge the gap, he’s creating more walls", Wolfram commented, echoing the general “dude… what?” vibe in the space. “It feels marketeerial”, I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).
Here’s the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thing’s undeniable: this tech isn't going anywhere.
Meanwhile, 8yo coders lean fully into AI
As a contrast to this doomerist take, just watch this video of Ricky Robinette's eight-year-old daughter building a Harry Potter website in 45 minutes, using nothing but a chat interface in Cursor. No coding knowledge. No prior experience. Just prompts and the power of AI ✨.
THAT's where we're headed, folks. It might be terrifying. It might be inspiring. But it's DEFINITELY happening. Better to understand it, engage with it, and maybe try to nudge it in a positive direction than to bury your head in the sand and mutter "I bleeping hate this progress" like a cranky Luddite hermit. Just sayin' 🤷♀️.
AI Device to reboot civilization (if needed)
I was scrolling through my feed (as I do VERY often, to bring you this newsletter every week) and I saw this, and I super quickly decided to invite the author to the show to talk about it.
Adam Cohen Hillel has prototyped an AI hardware device, but this one isn't trying to record you or be your friend. No, this one comes with offline LLMs finetuned on health and bio information, survival tactics, and all of the world's maps, and it works completely offline!
This, to me, was a very exciting use for an LLM: a distilled version of all human knowledge, buried in a Faraday cage, with replaceable batteries, running on solar, that can help you survive in case something really bad happens (think a solar flare that takes out the electrical grid, or an EMP device). While improbable, I thought this was a great idea, and I had a nice chat with the creator. You should definitely give this one a listen, and if you want to buy one, he is going to sell them soon here.
This is it for this week. There have been a few updates from the big labs: OpenAI has opened fine-tuning for GPT-4o, and you can use your WandB API key in there to track those runs, which is cool; the Gemini API now accepts incredibly large PDF files (up to 1000 pages); and Grok 2 is finally on X (the full version, not the mini from last week).
See you next week (we will have another deep dive!)
Look, these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red-berry-flavored conspiracies would shake out), the big labs are shipping!
We've got space uncle Elon dropping an "almost-GPT-4" level Grok-2 that's uncensored, has access to real-time data on X, and can draw all kinds of images with Flux; OpenAI announced a new ChatGPT-4o version (not the one from last week that supported structured outputs, a different one!); and Anthropic dropped something that makes AI engineers salivate!
Oh, and for the second week in a row, ThursdAI live spaces were listened to by over 4K people, which is very humbling, and awesome because for example today, Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to)
TL;DR of all topics covered:
* Big CO LLMs + APIs
* xAI releases Grok-2 - frontier level Grok, uncensored + image gen with Flux (𝕏, Blog, Try It)
* OpenAI releases another ChatGPT-4o (and tops LMsys again) (X, Blog)
* Google showcases Gemini Live, Pixel Bugs w/ Gemini, Google Assistant upgrades ( Blog)
* Anthropic adds Prompt Caching in Beta - cutting costs by up to 90% (X, Blog)
* AI Art & Diffusion & 3D
* Flux now has support for LORAs, ControlNet, img2img (Fal, Replicate)
* Google Imagen-3 is out of secret preview and it looks very good (𝕏, Paper, Try It)
* This weeks Buzz
* Using Weights & Biases Weave to evaluate Claude Prompt Caching (X, Github, Weave Dash)
* Open Source LLMs
* NousResearch drops Hermes 3 - 405B, 70B, 8B Llama 3.1 finetunes (X, Blog, Paper)
* NVIDIA Llama-3.1-Minitron 4B (Blog, HF)
* AnswerAI - colbert-small-v1 (Blog, HF)
* Vision & Video
* Runway Gen-3 Turbo is now available (Try It)
Big Companies & LLM APIs
Grok 2: Real Time Information, Uncensored as Hell, and… Flux?!
The team at xAI definitely knows how to make a statement, dropping a knowledge bomb on us with the release of Grok 2. This isn't your uncle's dad joke model anymore - Grok 2 is a legitimate frontier model, folks.
As Matt Shumer excitedly put it:
“If this model is this good with less than a year of work, the trajectory they’re on, it seems like they will be far above this...very very soon” 🚀
Not only does Grok 2 have impressive scores on MMLU (beating the previous GPT-4o on their benchmarks… from MAY 2024), it even outperforms Llama 3 405B, proving that xAI isn't messing around.
But here's where things get really interesting. Not only does this model access real-time data through X, which is a MOAT so wide you could probably park a rocket in it, it's also VERY uncensored. Think generating political content that'd make your grandma clutch her pearls, or imagining Disney characters breaking bad in ways both hilarious and kinda disturbing, all thanks to Grok 2's integration with Black Forest Labs' Flux image generation model.
With an affordable price point ($8/month for x Premium including access to Grok 2 and their killer MidJourney competitor?!), it’ll be interesting to see how Grok’s "truth seeking" (as xAI calls it) model plays out. Buckle up, folks, this is going to be wild, especially since all the normies now have the power to create political memes, that look VERY realistic, within seconds.
Oh yeah… and there’s the upcoming Enterprise API as well… and Grok 2’s made its debut in the wild on the LMSys Arena, lurking incognito as "sus-column-r" and is now placed on TOP of Sonnet 3.5 and comes in as number 5 overall!
OpenAI's latest ChatGPT is back at #1, but it's all very confusing 😵💫
As the news about Grok-2 was settling in, OpenAI decided to, well… drop yet another GPT-4o update on us. While Google was hosting their event, no less. Seriously, OpenAI? I guess they like to one-up Google's new releases (they also kicked Gemini from the #1 position after only 1 week there).
So the anonymous-chatbot that's been on LMSys for the past week was also released in the ChatGPT interface, and it's now the best LLM in the world according to LMSys and other folks: #1 at math, #1 at complex prompts and coding, and #1 overall.
It is also available for us developers via API, but... they don't recommend using it? 🤔
The most interesting thing about this release is that they don't really know how to tell us why it's better; they just know that it is, qualitatively, and that it's not a new frontier-class model (i.e., not 🍓 or GPT-5).
Their release notes on this are something else 👇
Meanwhile it's been 3 months, and the promised Advanced Voice Mode is only in the hands of a few lucky testers so far.
Anthropic Releases Prompt Caching to Slash API Prices By up to 90%
Anthropic joined DeepSeek's game of "Let's Give Devs Affordable Intelligence," this week rolling out prompt caching with up to 90% cost reduction on cached tokens (yes, NINETY… 🤯). For those of you new to all this technical sorcery:
Prompt caching allows the inference provider to save users money by reusing repeated chunks of a long prompt from cache, reducing prices and time to first token. It's especially beneficial for long-context (>100K) use-cases like conversations with books, agents with a lot of memory, 1000 examples in a prompt, etc.
We covered caching before with Gemini (in Google IO) and last week with DeepSeek, but IMO this is a better implementation from a frontier lab that's easy to get started, manages the timeout for you (unlike Google) and is a no brainer implementation.
And, you'll definitely want to see the code to implement it all yourself, (plus Weave is free!🤩):
"In this week's buzz category… I used Weave, our LLM observability tooling to super quickly evaluate how much cheaper Cloud Caching from Anthropic really is, I did a video of it and I posted the code … If you're into this and want to see how to actually do this … how to evaluate, the code is there for you" - Alex
With the ridiculous 90% price drop for those cached calls, Haiku basically becomes FREE, and cached Claude costs about as much as Haiku ($0.30 per 1M tokens). For context, I took 5 transcripts of 2-hour podcast conversations, which amounted to ~110,000 tokens overall, and I was able to ask questions across all this text, and it cost me less than $1 (see in the above video).
Code Here + Weave evaluation Dashboard here
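And if you just want the shape of the API call, here's a minimal sketch of the caching beta, assuming the prompt-caching-2024-07-31 beta header and the block-level cache_control markup from Anthropic's launch docs:

```python
# Minimal sketch of Anthropic's prompt caching beta (assumed beta header).
import anthropic

client = anthropic.Anthropic()
book = open("podcast_transcripts.txt").read()  # ~110K tokens of context

resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=500,
    system=[{
        "type": "text",
        "text": book,
        # Marks this big block as cacheable; repeat calls with the same
        # prefix read it from cache at a fraction of normal input price.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user",
               "content": "What did the guests say about caching?"}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(resp.content[0].text)
```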
AI Art, Diffusion, and Personalized AI On the Fly
Speaking of mind blowing, Flux took over this week, thanks in no small part to Elon strategically leveraging their tech in Grok (and everyone reminding everyone else, that it's not Grok creating images, it's Flux!)
Now, remember, the REAL magic happens when code meets open source, “Flux now has support for LORAs, ControlNet, img2img…" meaning developers have turned those foundational tools into artistic wizardry. With as little as $5 bucks and a few pictures, “You can train the best image model on your own face. ”🤯 (Seriously folks, head up to Fal.ai, give it a whirl… it’s awesome)
Now if you combine the LORA tech with ControlNet tech, you can get VERY creepy very fast (I'm using my own face here but you get the idea), here's "me" as the distracted boyfriend meme, and the girlfriend, and the distraction 😂 (I'm sorry you had to see this, AI has gone too far! Shut it all down!)
If seeing those creepy faces on screen isn't for you (I totally get that) there’s also Google IMAGEN 3, freshly escaped from secret preview and just waiting for you to unleash those artistic prompts on it! Google, despite being… Google, somehow figured out that a little competition does a lab good and rolled out a model that’s seriously impressive.
Runway Video Gets a "Turbocharged" Upgrade🚀🚀🚀
Ever tried those jaw-dropping text-to-video generators but groaned as you watched those seconds of video render painfully slowly?😭 Well Runway, creators of Gen 3, answered our prayers with the distilled turbocharged version that churns out those visuals in a blink 🤯🤯🤯 .
What's truly cool is they unlocked it for FREE tier users (sign up and unleash those cinematic prompts right now!), letting everyday folks dip their toes in those previously-unfathomable waters. Even the skeptics at OpenBMB (Junyang knows what I'm talking about…) had to acknowledge that their efforts with MiniCPM V are impressive, especially the smooth way it captures video sequences better than models even twice its size 🤯.
Open Source: Hermes 3 and The Next Generation of Open AI 🚀
NousResearch Dropped Hermes 3: Your New Favorite AI (Yes Really)
In the ultimate “We Dropped This On ThursdAI Before Even HuggingFace”, the legendary team at NousResearch dropped the hottest news since Qwen decided to play math God: Hermes 3 is officially here! 🤯
“You’re about to get to use the FIRST big Finetune of LLama 3.1 405B… We don’t think there have been finetunes,” announced Emozilla who’s both co founder and resident master wizard of all things neural net, “And it's available to try for free thanks to Lambda, you can try it out right here ” (you’re all racing to their site as I type this, I KNOW it!).
Not ONLY does this beauty run ridiculously smooth on Lambda, but here’s the real TL;DR:
* Hermes 3 isn’t just 405B; there are 70B and 8B versions dropping simultaneously on Hugging Face, ready to crush benchmarks and melt your VRAM (in a GOOD way… okay maybe not so great for your power bill 😅).
* On benchmarks, they beat Llama 3.1 Instruct on a few evals and lose on some, which is quite decent, given that the Meta team did an amazing job with their instruct finetuning (and probably spent millions of dollars on it too)
* Hermes 3 is all about user alignment, which our open source champion Wolfram Ravenwolf summarized beautifully: “When you have a model, and you run it on your system, IT MUST BE LOYAL TO YOU.” 😈
Hermes 3 does just that with incredibly precise control via its godlike system prompt: "In Hermes 3 the system prompt is KING," confirmed Emoz. It's so powerful that the 405B version was practically suffering existential angst in its first conversation… I read that part out loud during the space, but here you go, this is their first conversation, and Emozilla goes into why they think this happened in our chat, which is very worth listening to.
This model was trained on a bunch of data sources that they will release in the future, and it includes tool use, plus a slew of tokens you can add to the system prompt that trigger abilities in the model: chain of thought, a scratchpad (think, then rethink), citing from sources for RAG purposes, and a BUNCH more.
The technical report is HERE and is worth diving into as is our full conversation with Emozilla on the pod.
Wrapping Things Up… But We’re Just Getting Started! 😈
I know, I KNOW, your brain is already overflowing but we barely SCRATCHED the surface…
We also dove into NVIDIA's research on new pruning and distillation techniques, TII Falcon's attempt at making those State Space models finally challenge the seemingly almighty Transformer architecture (it's getting closer... but has a way to go!), plus AnswerAI's deceptively tiny colbert-small-v1, achieving remarkable search accuracy despite its featherweight size, and a bunch more...
See you all next week for what’s bound to be yet another wild AI news bonanza… Get those download speeds prepped, we’re in for a wild ride. 🔥
Hold on tight, folks, because THIS week on ThursdAI felt like riding a roller coaster through the wild world of open-source AI - extreme highs, mind-bending twists, and a sprinkle of "wtf is happening?" conspiracy theories for good measure. 😂
Theme of this week is, Open Source keeps beating GPT-4, while we're inching towards intelligence too cheap to meter on the API fronts.
We even had a live demo so epic, folks at the Large Hadron Collider are taking notice! Plus, strawberry shenanigans abound (did Sam REALLY tease GPT-5?), and your favorite AI evangelist nearly got canceled on X! Buckle up; this is gonna be another long one! 🚀
Qwen2-Math Drops a KNOWLEDGE BOMB: Open Source Wins AGAIN!
When I say "open source AI is unstoppable", I MEAN IT. This week, the brilliant minds from Alibaba's Qwen team decided to show everyone how it's DONE. Say hello to Qwen2-Math-72B-Instruct - a specialized language model SO GOOD at math, it's achieving a ridiculous 84 points on the MATH benchmark. 🤯
For context, folks... that's beating GPT-4, Claude Sonnet 3.5, and Gemini 1.5 Pro. We're not talking incremental improvements here - this is a full-blown DOMINANCE of the field, and you can download and use it right now. 🔥
Get Qwen-2 Math from HuggingFace here
What made this announcement EXTRA special was that Junyang Lin, the Chief Evangelist Officer at Alibaba's Qwen team, joined ThursdAI moments after they released it, giving us a behind-the-scenes peek at the effort involved. Talk about being in the RIGHT place at the RIGHT time! 😂
They painstakingly crafted a massive, math-specific training dataset, incorporating techniques like Chain-of-Thought reasoning (where the model thinks step-by-step) to unlock this insane level of mathematical intelligence.
"We have constructed a lot of data with the form of ... Chain of Thought ... And we find that it's actually very effective. And for the post-training, we have done a lot with rejection sampling to create a lot of data sets, so the model can learn how to generate the correct answers" - Junyang Lin
Now I gotta give mad props to Qwen for going beyond just raw performance - they're open-sourcing this beast under an Apache 2.0 license, meaning you're FREE to use it, fine-tune it, adapt it to your wildest mathematical needs! 🎉
But hold on... the awesomeness doesn't stop there! Remember those smaller, resource-friendly LLMs everyone's obsessed with these days? Well, Qwen released 7B and even 1.5B versions of Qwen-2 Math, achieving jaw-dropping scores for their size (70 for the 1.5B?? That's unheard of!).🤯 Nisten nearly lost his mind when he heard that - and trust me, he's seen things. 😂
"This is insane! This is... what, Sonnet 3.5 gets what, 71? 72? This gets 70? And it's a 1.5B? Like I could run that on someone's watch. Real." - Nisten
With this level of efficiency, we're talking about AI-powered calculators, tutoring apps, research tools that run smoothly on everyday devices. The potential applications are endless!
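Want to kick the tires on the tiny one? A minimal sketch, assuming the Qwen/Qwen2-Math-1.5B-Instruct Hugging Face id:

```python
# Minimal sketch: step-by-step math with the 1.5B Qwen2-Math model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-Math-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

msgs = [{"role": "user",
         "content": "Solve step by step: if 3x + 7 = 22, what is x?"}]
ids = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(ids, max_new_tokens=256)
print(tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))
```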
MiniCPM-V 2.6: A Pocket-Sized GPT-4 Vision... Seriously! 🤯
If Qwen's Math marvel wasn't enough open-source goodness for ya, OpenBMB had to get in on the fun too! This time, they're bringing the 🔥 to vision with MiniCPM-V 2.6 - a ridiculous 8 billion parameter VLM (visual language model) that packs a serious punch, even outperforming GPT-4 Vision on OCR benchmarks!
OpenBMB drops a bomb on X here
I'll say this straight up: talking about vision models in a TEXT-based post is hard. You gotta SEE it to believe it. But folks... TRUST ME on this one. This model is mind-blowing, capable of analyzing single images, multi-image sequences, and EVEN VIDEOS with an accuracy that rivaled my wildest hopes for open-source.🤯
Check out their playground and prepare to be stunned
It even captured every single nuance in this viral toddler speed-running video I threw at it, with an accuracy I haven't seen in models THIS small:
"The video captures a young child's journey through an outdoor park setting. Initially, the child ... is seen sitting on a curved stone pathway besides a fountain, dressed in ... a green t-shirt and dark pants. As the video progresses, the child stands up and begins to walk ..."
Junyang said that they actually collabbed with the OpenBMB team and knows firsthand how much effort went into training this model:
"We actually have some collaborations with OpenBMB... it's very impressive that they are using, yeah, multi-images and video. And very impressive results. You can check the demo... the performance... We care a lot about MMMU [the benchmark], but... it is actually relying much on large language models." - Junyang Lin
Nisten and I have been talking for months about the relationship between these visual "brains" and the larger language model base powering their "thinking." While it seems smaller models are catching up fast, combining a top-notch visual processor like MiniCPM-V with a monster LLM like Quen72B or Llama 405B could unlock truly unreal capabilities.
This is why I'm excited - open source lets us mix and match like this! We can Frankenstein the best parts together and see what emerges... and it's usually something mind-blowing. 🤯
From the Large Hadron Collider to YOUR Phone: This Model Runs ANYWHERE 🚀
While Qwen2-Math is breaking records on one hand, Nisten's latest creation, Biggie-SmoLlm, is showcasing the opposite side of the spectrum. Trying to get the smallest/fastest coherent LLM possible, Nisten blew up on HuggingFace.
Biggie-SmoLlm (Hugging Face) is TINY, efficient, and with some incredible optimization work from the folks right here on the show, it's reaching an insane 330 tokens/second on regular M3 chips. 🤯 That's WAY faster than real-time conversation, folks! And thanks to Eric Hartford's (from Cognitive Computations) awesome new optimizer, GrokAdamW, it's surprisingly coherent for such a lil' fella.
The cherry on top? Someone messaged Nisten saying they're using Biggie-SmoLlm at the Large. Hadron. Collider. 😳 I'll let that sink in for a second.
It was incredible having ALL the key players behind Biggie-SmoLlm right there on stage: LDJ (whose Capybara dataset made it teaching-friendly), Junyang (whose Qwen work served as the base), and Eric (the optimizer mastermind himself). THIS, my friends, is what the ThursdAI community is ALL about! 🚀
Speaking of which, this week we got a new friend of the pod, Mark Saroufim, a long-time PyTorch core maintainer, joining the community.
This Week's Buzz (and Yes, It Involves Making AI Even Smarter) 🤓
NeurIPS Hacker Cup 2024 - Can You Solve Problems Humans Struggle With? 🤔
I've gotta hand it to my PyTorch friend, Mark Saroufim. He knows how to make AI interesting! He and his incredible crew (Weiwei from MSFT, some WandB brainiacs, and more) are bringing you NeurIPS Hacker Cup 2024 - a competition to push those coding agents to their ABSOLUTE limits. 🚀
This isn't your typical "LeetCode easy" challenge, folks... These are problems SO hard, years of competitive programming experience are required to even attempt them! Mark himself said,
“At this point, like, if a model does make a significant dent in this competition, uh, I think people would need to acknowledge that, like, LLMs can do a form of planning. ”
And don't worry, total beginners: Mark and Weights & Biases are hosting a series of FREE sessions to level you up. Get those brain cells prepped and ready for the challenge and then Join the NeurIPS Hacker Cup Discord
P.S. We're ALSO starting a killer AI Salon series in our SF office August 15th! You'll get a chance to chat with researchers like Shreya Shankar - she's a leading voice on evaluation. More details and free tickets right here! AI Salons Link
Big Co & APIs - Towards intelligence too cheap to meter
Open-source was crushing it this week... but that didn't stop Big AI from throwing a few curveballs. OpenAI is doubling down on structured data (AND cheaper models!), Google slashed Gemini prices again (as we trend towards intelligence too cheap to meter), and a certain strawberry mystery took over Twitter.
DeepSeek context caching lowers prices by 90% automatically
DeepSeek, those masters of ridiculously-good coding AI, casually dropped a bombshell - context caching for their API! 🤯
If you're like "wait, what does THAT mean?", listen up because this is game-changing for production-grade AI:
* Problem: LLMs get fed the ENTIRE conversation history EVERY. SINGLE. TIME. This wastes compute (and $$$) when info is repeated.
* Solution: DeepSeek now remembers what you've said, automatically pulling from a cache when the conversation goes down familiar paths.
* The Win: Up to 90% cheaper API calls. Yes, NINETY.😳 It costs 1.4 CENTS per million tokens for cached content. Let THAT sink in. 🤯
As Nisten (always bringing the technical breakdowns) explained:
"Everyone should be using LLMs this way!...The simplest way is to have a long conversation ... then you save it on disk... you don't have to wait again ... [it's] kind of free. DeepSeek... did this in a more dynamic way". - Nisten
Even Matt Shumer, who usually advocates for clever prompting over massive context, got legitimately hyped about the possibilities:
"For me, and how we use LLMs... instead of gathering a million examples... curate a hundred gold examples... you have something better than if you fine-tuned it, and cheaper, and faster..." - Matt Shumer
Think about this... instead of painstakingly fine-tuning, we can "guide" models with expertly crafted examples, letting them learn "on the fly" with minimal cost. Context as the NEW fine-tuning! 🤯
P.S. - Google also has caching on its Gemini API, but you have to opt in, while this happens automatically with the DeepSeek API!
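To show how hands-off this is, here's a sketch of a multi-turn loop against their OpenAI-compatible API (base URL and model name per their docs, as I understand them); note there's no cache-specific code at all, which is exactly the point:

```python
# DeepSeek's caching kicks in automatically on repeated prompt prefixes.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com",
                api_key="YOUR_DEEPSEEK_KEY")
history = [{"role": "system",
            "content": open("hundred_gold_examples.txt").read()}]

for q in ["Refactor this function...", "Now add tests for it..."]:
    history.append({"role": "user", "content": q})
    r = client.chat.completions.create(model="deepseek-chat",
                                       messages=history)
    history.append({"role": "assistant",
                    "content": r.choices[0].message.content})
    # Every call re-sends the same prefix, but the shared part is billed
    # at the ~90%-off cache-hit rate with zero extra code on our side.
```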
Google Goes "Price War Nuclear": Gemini Flash is Officially TOO CHEAP
Speaking of sneaky advancements from Google... they also dropped an update SO casually impactful, it almost got lost in the shuffle. Gemini Flash (their smallest, but still crazy-good model) is now... 7.5 cents per million input tokens and 30 cents per million output tokens (for prompts up to 128K context).
I REPEAT: 7.5 cents... with LONG context!? 🤯 Google, please chill, MY SANITY cannot handle this price free-fall any longer! 😂
Full Breakdown of Gemini’s Crazy New Prices on Google’s Blog
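To make that concrete, here's the back-of-envelope math on those prices:

```python
# What the new Gemini Flash prices mean in practice
# (7.5¢ / 1M input tokens, 30¢ / 1M output, for prompts up to 128K).
in_price, out_price = 0.075, 0.30          # $ per million tokens
tokens_in, tokens_out = 100_000, 2_000     # a ~100K-token doc, short summary
cost = tokens_in / 1e6 * in_price + tokens_out / 1e6 * out_price
print(f"${cost:.4f}")  # ≈ $0.0081 - under a cent per long-document summary
```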
While this USUALLY means a model's performance gets quietly nerfed in exchange for lower costs... in Gemini's case? Let's just say... even I, a staunch defender of open-source, am kinda SHOOK by how GOOD this thing is NOW!
After Google threw down this gauntlet, I actually used Gemini to draft my last ThursdAI newsletter (for the first time!). It nailed my tone and style better than any other model I've tried - and I've TRIED them ALL. 🤯 Even Nisten, who's super picky about his coding LLMs, gave it a rare nod of approval. Gemini's image understanding capabilities have improved significantly too! 🤯
Google also added improvements in how Gemini understands PDFs that are worth mentioning 👀
From JSON Headaches to REASONING Gains: What's Really New with GPT-4?
While Matt Shumer, my go-to expert on all things practical AI, might not be immediately impressed by OpenAI's new structured output features, they're still a huge win for many developers. Tired of LLM JSON going haywire? Well, GPT-4o can now adhere to your exact schemas, delivering 100% reliable structured data, no need for Instructor! 🙌
This solves a real problem, even if the prompting gurus (like Matt) have figured out their own workarounds. The key is:
* Determinism: This ain't your typical LLM chaos - they're guaranteeing consistency, essential for building reliable applications.
* Ease of use: No need for external libraries - it's built right into the API (see the sketch right after this list)!
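Here's roughly what that looks like in practice, a minimal sketch against the gpt-4o-2024-08-06 model that shipped with this feature (schema field names per OpenAI's launch docs; double-check the current API reference):

```python
# Strict json_schema mode: the reply is guaranteed to parse against the schema.
import json
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user",
               "content": "Extract: 'Alex hosted ThursdAI on Aug 8.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"host": {"type": "string"},
                               "date": {"type": "string"}},
                "required": ["host", "date"],
                "additionalProperties": False,
            },
        },
    },
)
print(json.loads(resp.choices[0].message.content))  # always valid JSON
```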
Plus... a sneaky price drop, folks! GPT-4o is now 50% cheaper for input tokens and 33% cheaper for output. As I said on the show:
"Again, quite insane... we're getting 50% cheaper just without fanfare. We're going towards 'intelligence too cheap to meter'... it's crazy".
And HERE'S the plot twist... multiple folks on stage (including the eager newcomer N8) noticed significant reasoning improvements in this new GPT-4 model. They tested it on tasks like lateral thinking puzzles and even anecdotally challenging tasks - and guess what? It consistently outperformed older versions. 🤯
"I have my own benchmark... of lateral thinking puzzles... the new GPT-4 [scored] roughly five to 10% higher... these are like really hard lateral thinking puzzles that require innovative reasoning ability". - N8
OpenAI isn't bragging about this upgrade explicitly, which makes me even MORE curious... 🤔
Mistral Joins the AGENT Hype Train (But Their Version is Different)
Everybody wants a piece of that AI "Agent" pie, and now Mistral (the scrappy, efficient French company) is stepping up. They announced a double whammy this week: fine-tuning is here AND "les agents" have arrived... but their agents are NOT quite what we're seeing elsewhere (think AutoGPT, CrewAI, all those looped assistants). 🤔
Mistral's Blog Post - Fine-tuning & Agents... Ooh La La!
Their fine-tuning service is pretty straightforward: upload your data and they'll host a bespoke fine-tuned Mistral Large V2 running through their API at no extra cost (very cool!).
Their agents aren't based on agentic loop-running like what we see from those recursive assistants. As I pointed out on ThursdAI:
"[Mistral] agents are not agentic... They're more similar to... GPTs for OpenAI or 'Projects' in Anthropic, where... you as a user add examples and preload context".
It's more about defining agents with examples and system prompts, essentially letting Mistral "pre-tune" their models for specific tasks. This lets you deploy those agents via the API or to their LeChat platform - pretty darn neat!
Build your OWN agent - Mistral's "Agent Builder" is slick!
While not as flashy as those recursive agents that build websites and write symphonies on their own, Mistral's take on the agent paradigm is strategic. It plays to their strengths:
* Developer-focused: It's about creating bespoke, task-specific tools - think API integrations, code reviewers, or content generators.
* Ease of deployment: No need for complex loop management, Mistral handles the hard parts for you!
Mistral even teased that they'll eventually be incorporating tool use... so these "pre-tuned" agents could quickly evolve into something very interesting. 😏
NVIDIA leak about downloading videos went viral (And the Internet... Didn't Like That!)
This week, I found myself unexpectedly at the center of an X drama explosion (fun times! 😅) when some leaked NVIDIA Slack messages showed them discussing which YouTube channels to scrape. My crime? I dared to ask how this is different from Google creating Street View, filming every street possible without asking for permission. My Honest Question that Sparked AI Outrage
The Internet, as it often does, had thoughts. The tweet blew up (like, a-million-views blew up). I was labeled an apologist, a shill, all kinds of charming things... 😂 It got so intense, I had to MUTE the whole thing for my sanity's sake. BUT it brings up serious issues:
* AI & Copyright: Where the Heck are the Lines? When does inspiration become infringement when a model's trained on massive datasets? There's no legal precedent, folks, which is scary .
* Ethics vs. Innovation: AI progress moves FAST... sometimes FASTER than our ability to grasp the implications. That's unsettling.
* Twitter Pile-Ons & Nuance (aka What NOT to do): Look, I GET being passionate. BUT when criticism turns into name-calling and mob mentality, it shuts down any chance of meaningful conversation. That's not helping ANYONE.
Strawberry Shenanigans: Theories, Memes, and a Little AI LARPing?🍓
And now, for the MAIN EVENT: STRAWBERRY! You might have heard whispers... seen those cryptic tweets... maybe witnessed that wild Twitter account firsthand! It all started with Sam Altman casually posting a pic of a strawberry garden with the caption "nice summer day". Then came the deluge - more pics of strawberries from OpenAI folks, even those cryptic, semi-official domain names LDJ uncovered... I even spotted a strawberry IN OUR audience for crying out loud! This thing spread like wildfire. 🔥
We spent a solid chunk of the episode piecing together the lore: Q*, the mystery model shrouded in secrecy for years, then that Bloomberg leak claiming it was code-named "Strawberry", and now this. It was peak AI conspiracy-theory land!
We still don't have hard confirmation on Q*... but that strawberry account, spitting out fruit puns and pinging ChatGPT like a maniac? Some on ThursdAI (Yam, mostly) believe that this may not have been a human at all, but an early, uncontrolled attempt to have an AI manage its own PR. 😳 I almost bought it - especially the way it reacted to some of my live comments - but now... the LARP explanation seems more likely
Many folks at OpenAI posted things with strawberries as well, was this a sign of something to come or were they just trying to bury the news that 3 executives departed the company this week under a mountain of 🍓?
Cursor & Composer: When Coding Becomes AI-Powered Magic ✨
I love a good tool... and this week, my dev heart was a-flutter over Cursor. Tried it yet? Seriously, you need to! It's VS Code, but SUPERCHARGED with AI that'll make you question why Copilot ever existed. 😂
You can edit code by CHAT, summarize entire files with one click, zap bugs instantly ... but they just dropped their ultimate weapon: Composer. It's essentially a coding AGENT that does multi-file edits. 🤯
Matt Shumer (my SaaS wizard friend who adopted Cursor early) had some jaw-dropping examples:
" [Composer] ... takes all the parts of Cursor you like and strings them together as an agent... it takes away a lot of the grunt work... you can say 'go add this feature'... it searches your files, figures out what to edit, then puts it together. ...I literally built a SaaS in 20 minutes!" - Matt Shumer
Matt also said that using Cursor is required at their company!
Even my stoic PyTorch friend, Mark, couldn't help but express some curiosity:
"It's cool they're doing things like multi-file editing... pretty curious to see more projects along those lines" - Mark Serafim
Yeah, it's still in the rough-around-the-edges stage (UX could use some polish). But THIS, folks, is the future of coding - less about hammering out syntax, more about describing INTENT and letting the AI handle the magic! 🤯 I can't wait to see what they do next.
Download at cursor.sh and let me know what you think
Conclusion: The Future Is FAST, Open, And Maybe a Bit TOO Spicy? 🌶️😂
Honestly, every single week leaves me awestruck by how fast this AI world is moving. 🤯 We went from "transformers? Huh?" to 70-point math models running on SMARTWATCHES and AI building ENTIRE web apps in less than two years. And I still haven't gotten GPT-4o's new voice mode yet!!
Open source keeps proving its power, even THOSE BIG companies are getting in on the action (look at those Google prices! 😍), and then you've got those captivating mysteries keeping us on our toes... like those damned strawberries! 🍓 What DOES OpenAI have up their sleeve??
As always, huge THANK YOU to the amazing guests who make this show what it is - this week, extra kudos to Junyang, Nisten, LDJ, Mark, Yam, and Eric, you guys ROCK. 🔥 And HUGE gratitude to each and every ONE of you readers/listeners (and NEW folks who stuck around after those Strawberry bait tweets! 😂) You make this ThursdAI community truly unstoppable. 💪
Keep on building, stay insanely curious, and I'll see you next Thursday - ready or not, that AI future is coming in hot! 🔥🚀
Starting Monday, Apple released the iOS 18.1 beta with Apple Intelligence, then Meta dropped SAM-2 (Segment Anything Model 2), then Google open sourced Gemma 2 2B, and now (just literally 2 hours ago, during the live show) released Gemini 1.5 Pro 0801 experimental, which takes #1 on the LMSys arena across multiple categories. To top it all off, we also got a new SOTA image diffusion model called FLUX.1 from ex-Stability folks and their new Black Forest Labs.
This week on the show, we had Joseph Nelson & Piotr Skalski from Roboflow talk in depth about Segment Anything, and as the absolute experts on this topic (Skalski is our returning vision expert), it was an incredible deep dive into the importance of dedicated vision models (not VLMs).
We also had Lukas Atkins & Fernando Neto from Arcee AI talk to us about their new DistillKit and explain model distillation in detail, and finally we had Cristiano Giardina, who is one of the lucky few with access to OpenAI's advanced voice mode; he and his new friend GPT-4o came on the show as well!
Honestly, how can one keep up with all this? By reading ThursdAI of course, that's how. But ⚠️ buckle up, this is going to be a BIG one (I think over 4.5K words, which will make it the longest newsletter I've penned; I'm sorry, maybe read this one at 2x? 😂)
[ Chapters ]
00:00 Introduction to the Hosts and Their Work
01:22 Special Guests Introduction: Piotr Skalski and Joseph Nelson
04:12 Segment Anything 2: Overview and Capabilities
15:33 Deep Dive: Applications and Technical Details of SAM2
19:47 Combining SAM2 with Other Models
36:16 Open Source AI: Importance and Future Directions
39:59 Introduction to Distillation and DistillKit
41:19 Introduction to DistilKit and Synthetic Data
41:41 Distillation Techniques and Benefits
44:10 Introducing Fernando and Distillation Basics
44:49 Deep Dive into Distillation Process
50:37 Open Source Contributions and Community Involvement
52:04 ThursdAI Show Introduction and This Week's Buzz
53:12 Weights & Biases New Course and San Francisco Meetup
55:17 OpenAI's Advanced Voice Mode and Cristiano's Experience
01:08:04 SearchGPT Release and Comparison with Perplexity
01:11:37 Apple Intelligence Release and On-Device AI Capabilities
01:22:30 Apple Intelligence and Local AI
01:22:44 Breaking News: Black Forest Labs Emerges
01:24:00 Exploring the New Flux Models
01:25:54 Open Source Diffusion Models
01:30:50 LLM Course and Free Resources
01:32:26 FastHTML and Python Development
01:33:26 Friend.com: Always-On Listening Device
01:41:16 Google Gemini 1.5 Pro Takes the Lead
01:48:45 GitHub Models: A New Era
01:50:01 Concluding Thoughts and Farewell
Show Notes & Links
* Open Source LLMs
* Meta gives SAM-2 - segment anything with one shot + video capability! (X, Blog, DEMO)
* Google open sources Gemma 2 2.6B (Blog, HF)
* MTEB Arena launching on HF - Embeddings head to head (HF)
* Arcee AI announces DistillKit - (X, Blog, Github)
* AI Art & Diffusion & 3D
* Black Forest Labs - FLUX new SOTA diffusion models (X, Blog, Try It)
* Midjourney 6.1 update - greater realism + potential Grok integration (X)
* Big CO LLMs + APIs
* Google updates Gemini 1.5 Pro with 0801 release and is #1 on LMsys arena (X)
* OpenAI started alpha GPT-4o voice mode (examples)
* OpenAI releases SearchGPT (Blog, Comparison w/ PPXL)
* Apple releases beta of iOS 18.1 with Apple Intelligence (X, hands on, Intents )
* Apple released a technical paper of apple intelligence
* This weeks Buzz
* AI Salons in SF + New Weave course for WandB featuring yours truly!
* Vision & Video
* Runway ML adds Gen-3 image to video and makes it 7x faster (X)
* Tools & Hardware
* Avi announces friend.com
* Jeremy Howard releases FastHTML (Site, Video)
* Applied LLM course from Hamel dropped all videos
Open Source
It feels like everyone and their grandma is open sourcing incredible AI this week! Seriously, get ready for segment-anything-you-want + real-time-video capability PLUS small AND powerful language models.
Meta Gives Us SAM-2: Segment ANYTHING Model in Images & Videos... With One Click!
Hold on to your hats, folks! Remember Segment Anything, Meta's already-awesome image segmentation model? They've just ONE-UPPED themselves. Say hello to SAM-2 - it's real-time, promptable (you can TELL it what to segment), and handles VIDEOS like a champ. As I said on the show: "I was completely blown away by segment anything 2".
But wait, what IS segmentation? Basically, pixel-perfect detection - outlining objects with incredible accuracy. My guests, the awesome Piotr Skalski and Joseph Nelson (computer vision pros from Roboflow), broke it down historically, from SAM 1 to SAM 2, and highlighted just how mind-blowing this upgrade is.
"So now, Segment Anything 2 comes out. Of course, it has all the previous capabilities of Segment Anything ... But the segment anything tool is awesome because it also can segment objects on the video". - Piotr Skalski
Think about Terminator vision from the "give me your clothes" bar scene: you see a scene, instantly "understand" every object separately, AND track it as it moves. SAM-2 gives us that, allowing you to click on a single frame, and BAM - perfect outlines that flow through the entire video! I played with their playground, and you NEED to try it - you can blur backgrounds, highlight specific objects... the possibilities are insane. Playground Link
In this video, Piotr annotated only the first few frames of the top video, and SAM understood the bottom two, shot from 2 different angles!
Okay, cool tech, BUT why is it actually USEFUL? Well, Joseph gave us incredible examples - from easier sports analysis and visual effects (goodbye manual rotoscoping) to advances in microscopic research and even galactic exploration! Basically, any task requiring precise object identification gets boosted to a whole new level.
"SAM does an incredible job at creating pixel perfect outlines of everything inside visual scenes. And with SAM2, it does it across videos super well, too ... That capability is still being developed for a lot of AI Models and capabilities. So having very rich ability to understand what a thing is, where that thing is, how big that thing is, allows models to understand spaces and reason about them" - Joseph Nelson
AND if you combine this power with other models (like Piotr is already doing!), you get zero-shot segmentation - literally type what you want to find, and the model will pinpoint it in your image/video. It's early days, but get ready for robotics applications, real-time video analysis, and who knows what else these clever hackers are dreaming up! 🤯
Check out Piotr's Zero Shot Florence + Sam2 Implementation
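To give you a feel for how that combo fits together, here's a rough sketch - the grounding step is a hypothetical placeholder standing in for Florence-2 (Piotr's repo linked above has the real thing):

```python
# Rough sketch of zero-shot, text-prompted segmentation: a grounding model
# turns a text query into boxes, then SAM-2 turns boxes into masks.
# `text_to_boxes` is a HYPOTHETICAL stand-in for Florence-2 -- see Piotr's
# implementation linked above for the real pipeline.
import numpy as np

def text_to_boxes(image: np.ndarray, query: str) -> np.ndarray:
    """Hypothetical grounding step: returns [N, 4] xyxy boxes matching `query`."""
    raise NotImplementedError("swap in Florence-2 or any open-vocab detector")

def segment_by_text(predictor, image: np.ndarray, query: str):
    """`predictor` is a SAM2ImagePredictor like the one in the sketch above."""
    boxes = text_to_boxes(image, query)          # e.g. "the dog on the left"
    predictor.set_image(image)
    masks, _, _ = predictor.predict(box=boxes)   # box-prompted SAM-2 masks
    return masks
```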
Best of all? Apache 2 license, baby! As Joseph said, "Open source is foundational to making the accessibility, the use cases, and the advancement of the field overall", and this is a prime example. Huge kudos to Meta for empowering us with this tech.
The whole conversation w/ Piotr & Joseph is very much worth listening to on the pod 🎙️
Google Throws Down The Gauntlet: Open Sourcing Gemma 2 2.6B
It was Meta vs. Google on Monday: not to be outdone, Google went on an open-sourcing spree of its own. This time, they gifted us Gemma 2 (a 2.6 billion parameter powerhouse), alongside a safety-focused suite called ShieldGemma AND a transparency tool called GemmaScope.
So what makes Gemma 2 special? First off, it's optimized for on-device use, meaning super-efficient local running. BUT there's a catch, folks... They claim it beats Mixtral AND Llama 2 70B on the LMsys Arena leaderboard, with an ELO score of 1126. Hold on, a 2 billion parameter model outperforming the big boys? 🤨 As LDJ (one of my regular co-hosts) said on the show:
"Yeah, I think my best theory here is... there's at least two or three variables at play ... In LMSys, people are much more likely to do single turn, and within LMSys, people will usually be biased more towards rating models with a more recent knowledge cutoff as higher".
Translation? It might be gaming the system a bit, but either way, Gemma 2 is an exciting release - super fast, small enough for on-device applications, and coming with safety tools right out of the gate! I think Xenova (our Hugging Face wizard) is already running this on WebGPU! You NEED to try it out.
Gemma 2 HF Link
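If you want to kick the tires locally, here's a minimal sketch using Hugging Face transformers - assuming you've accepted the Gemma license on the Hub and are on a recent transformers version that supports chat-style pipelines:

```python
# Minimal sketch: running Gemma 2 2.6B locally with transformers.
# Assumes the Gemma license is accepted on the Hub and `accelerate`
# is installed (for device_map="auto").
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",   # instruction-tuned variant
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain image segmentation in one sentence."}]
out = generator(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```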
And GemmaScope? That's some cool, cool stuff too. Think about peeking inside the "brain" of the model - you can actually SEE how Gemma 2 processes information. Remember Anthropic's mech-interp work? It's like that, giving us unprecedented transparency into how these systems actually "think". You gotta see it on Neuronpedia. Neuronpedia link
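For the curious: GemmaScope is built around sparse autoencoders (SAEs). Here's a conceptual PyTorch sketch of the SAE idea - this is NOT Google's actual code or weights, and the dimensions are purely illustrative:

```python
# Conceptual sketch of the sparse autoencoder (SAE) idea behind GemmaScope.
# An SAE learns to rewrite a model's hidden activations as a sparse
# combination of (hopefully interpretable) "features". Dims are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature codes
        self.decoder = nn.Linear(d_features, d_model)  # feature codes -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=2304, d_features=16384)
acts = torch.randn(1, 2304)          # stand-in for a real residual-stream activation
recon, features = sae(acts)
# Training would minimize reconstruction error plus an L1 sparsity penalty:
loss = (recon - acts).pow(2).mean() + 1e-3 * features.abs().sum()
print(loss.item(), (features > 0).float().mean().item())  # loss, fraction active
```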
It's Meta versus Google - round one, FIGHT! 🥊
Distilling Knowledge: Arcee AI Drops DistillKit!
Just when I thought the week was done throwing surprises, Arcee AI casually dropped DistillKit - an open source tool for building distilled language models. Now, this is some NEXT level stuff, folks. We talked with Lukas Atkins and Fernando (the brilliant minds behind DistillKit), and I finally learned what the heck "distillation" really means.
"TLDR - we teach a smaller model to think like a bigger model"
In a nutshell: teach a smaller model how to think like a larger one. Think GPT-4o and GPT-4o Mini, where the smaller model supposedly got the "essence" of the bigger version. Or imagine a tiny Llama that inherited the smarts of 405B - ridiculous! 🤯 As Fernando eloquently put it:
"So in the finetuning that we have been doing, just in terms of generating text instructions and so on, we were observing only the token that was generated from the teacher model. And now with the distillation, we are observing the whole distribution of the tokens that could be sampled."
Now I admit, even after Fernando's expert breakdown, my brain still kind of melted. 🫠 BUT, here's why this matters: distilled models are super efficient, saving on cost and resources. Imagine powerful AI that runs seamlessly on your phone! 🤯 Arcee is making this possible for everyone.
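If Fernando's explanation melted your brain too, here's the idea in code - a minimal sketch of distribution-level (logit) distillation, not DistillKit's actual implementation:

```python
# Minimal sketch of logit distillation: instead of training the student on
# just the teacher's sampled token, match the student's full next-token
# distribution to the teacher's. Not DistillKit's actual code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # scale by t^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# toy example: a batch of 4 token positions over a 32k-token vocabulary
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```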
Check Out DistilKit Here
Was it pure coincidence they released this the same week as the Llama 3.1 LICENSE CHANGE (Zuckerberg is clearly watching ThursdAI...), which makes distillation perfectly legal?
It's wild, exciting, AND I predict a massive surge in smaller, specialized AI tools that inherit the intelligence of the big boys.
This week's Buzz
Did I already tell you that someone came up to me and said, "Hey, you're from Weights & Biases - you're the guys who make the courses, right?" 😂 I said, well, yeah, we have a bunch of free courses on wandb.courses, but we also have world-leading ML experiment tracking software and an LLM observability toolkit, among other things. It was really funny that he thought we were just a courses company!
Well, this last week, my incredible colleague Agata, who's in charge of our courses, took the initiative and stitched together a course about Weave from a bunch of videos I had already recorded! It's awesome - please check it out if you're interested in learning about Weave 👏
P.S - we are also starting a series of AI events in our SF office called AI Salons. The first one features Shreya Shankar and focuses on evaluations, on August 15th - so if you're in SF, you're invited for free as a ThursdAI subscriber! Get free tickets
Big Co AI - LLMs & APIs
Not only was open source popping off, but those walled-garden mega corps wanted in on the action too! SearchGPT, anyone?
From Whispers to Reality: OpenAI Alpha Tests GPT-4o Voice (and IT'S WILD)
This was THE moment I waited for, folks - GPT-4o with ADVANCED VOICE is finally trickling out to alpha users. Did I get access? NO. 😩 But my new friend, Cristiano Giardina, DID - and you've probably seen his viral videos of this tech; they're blowing up MY feed, and even Sam Altman retweeted the one above! As I said on the show, this new voice "feels like a big next unlock for AI".
What sets this apart from the "regular" GPT-4 voice we have now? As Cristiano told us:
"the biggest difference is that the emotion , and the speech is very real and it follows instructions regarding emotion very well, like you can ask it to speak in a more animated way, you can ask it to be angry, sad, and it really does a good job of doing that."
We did a LIVE DEMO (it worked, thank God), and y'all... I got CHILLS. We heard counting with a breath, depressed Soviet narrators, even a "GET TO THE CHOPPA" Schwarzenegger moment that still makes me laugh 😂 It feels like a completely different level of interaction, something genuinely conversational and even emotional. Check out Cristiano's profile for more insane demos - you won't be disappointed. Follow Cristiano Here For Amazing Voice Mode Videos
Can't wait for access, if anyone from OpenAI is reading this, hook me up 🙏 I'll trade my SearchGPT access!
SearchGPT: OpenAI Throws Their Hat Into The Ring (again?)
Did OpenAI want to remind everyone they're STILL here amidst the Llama/Mistral frenzy? Maybe that's why they released SearchGPT - their newest "search engine that can hold a conversation" tool. Again, waitlisted, but unlike with voice mode... I got access. 😅
The good: Fast. Really fast. And impressively competent, considering it's still a demo. Handles complex queries well, and its "follow-up" ability blows even Perplexity out of the water (which is impressive).
The less-good: Still feels early, especially for multi-language and super local stuff. Honestly, feels more like a sneak peek of an upcoming ChatGPT integration than a standalone competitor to Google.
But either way, it's an interesting development - as you may have already learned from my full breakdown of SearchGPT vs. Perplexity
Apple Intelligence is here! (sort of)
And speaking of big companies, how could I not mention the Apple Intelligence release this week? Apple finally dropped iOS 18.1 with the long-awaited ON-DEVICE intelligence, powered by the Apple Foundational Model (AFM). Privacy nerds rejoice! 🎉
But how good is it? Mixed bag, I'd say. It's there, and definitely usable for summarization, rewriting tools, text suggestions... but Siri STILL isn't hooked up to it yet, though speech-to-text is way faster and she does look more beautiful. 🤔 Apple did release a ridiculously detailed paper explaining how they trained this model on Apple silicon... and as Nisten (ever the voice of honesty) said on the show,
"It looks like they've stacked a lot of the tricks that had been working ... overall, they're not actually really doing anything new ... the important thing here is how they apply it all as a system that has access to all your personal data."
Yeah, ouch, BUT still exciting, especially as we get closer to truly personal, on-device AI experiences. Right now, it's less about revolutionary advancements and more about how Apple can weave this into our lives seamlessly - they're focusing heavily on App Intents, meaning AI that can actually DO things for you (think scheduling appointments, drafting emails, finding that photo buried in your library). I'll keep testing it; the more I play around, the more I find. It suddenly started suggesting replies in Messages for me, for example, though I still haven't seen the filtered-notifications view that smartly lets only important messages through your focus mode.
So stay tuned, but it's likely not worth the beta iOS upgrade unless you're a dev or a very strong enthusiast.
Wait, MORE Breaking News?? The AI World Doesn't Sleep!
If this episode wasn't already enough... the very day of the live show, as we were chatting, I got bombarded with breaking news alerts from my ever-vigilant listeners.
1. Gemini 1.5 Pro 0801 - Now #1 on LMsys Arena! 🤯 Google apparently loves to ship big right AFTER I finish recording ThursdAI (this happened last week too!). Gemini's new version, released WHILE we were talking about older Gemini versions, claimed the top spot with an insane 1300 ELO score - crushing GPT-4 and taking home 1st place in Math, Instruction Following, and Hard Prompts! It's experimental, it's up on Google AI Studio... Go play with it! (and then tag me with your crazy findings!)
And you know what? Some of this blog was drafted by this new model. I sent the same prompt to Claude Sonnet 3.5 and Mistral Large v2 (I also tried Llama 3.1 405B, but couldn't find any service hosting the full context window), and this Gemini absolutely demolished all of them on tone and on imitating my style. It even took some of the links from my TL;DR and incorporated them into the draft on its own - I've never seen any other model do that! I hadn't used LLMs for this blog before beyond proof-reading because, well, they all kinda sucked. But damn, I dare you to try and find out where in this blog it was me and where it was Gemini.
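If you'd rather hit it from code than from AI Studio, here's a minimal sketch with the google-generativeai Python SDK - the experimental model id is the one AI Studio listed at the time, and those names change, so treat it as an assumption:

```python
# Minimal sketch: calling the experimental Gemini 1.5 Pro 0801 build via
# the google-generativeai SDK. The model id is what AI Studio showed at
# the time -- experimental names rotate, so treat it as an assumption.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # grab one from aistudio.google.com
model = genai.GenerativeModel("gemini-1.5-pro-exp-0801")
response = model.generate_content("Summarize this week's AI news in one paragraph.")
print(response.text)
```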
2. GitHub Does a Hugging Face: Introducing GitHub Models!
This dropped just as we wrapped - basically a built-in marketplace where you can try, test, and deploy various models right within GitHub! They've already got Llama, Mistral, and some Azure-hosted GPT-4o - very intriguing... Time will tell what Microsoft is cooking here, but you can bet I'll be investigating! 🕵️
AI Art & Diffusion
New Stability: Black Forest Labs and FLUX.1 Rise!
Talk about a comeback story: 14 ex-Stability AI pros, led by Robin Rombach, Andreas Blattmann & Patrick Esser (the OG creators of Stable Diffusion), are back with $31 million in funding from a16z to make diffusion dreams come true. Enter Black Forest Labs. Their first gift? FLUX.1 - a suite of text-to-image models so good, they're breaking records. I saw those demos and wow. It's good, like REALLY good. 🤯
Try it out here
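Prefer to run it yourself? Here's a minimal sketch with Hugging Face diffusers - assuming a recent diffusers release that ships FluxPipeline, access to the weights on the Hub, and a GPU with serious VRAM:

```python
# Minimal sketch: text-to-image with FLUX.1 [schnell] via Hugging Face
# diffusers. Assumes a recent diffusers release with FluxPipeline and a
# GPU with enough VRAM -- the model is large.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a black forest gateau on a rustic wooden table, studio lighting",
    num_inference_steps=4,   # schnell is distilled for few-step generation
    guidance_scale=0.0,      # schnell ignores classifier-free guidance
).images[0]
image.save("flux_sample.png")
```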
And the real bomb? They're working on open-source TEXT-TO-VIDEO! That's right, imagine generating those mind-blowing moving visuals... with code anyone can access. It's in their "Up Next" section, so watch that space - it's about to get REAL interesting.
Also... Midjourney 6.1 came out, and it looks GOOD
And you can see a comparison between the two new leading models in this thread by Noah Hein
Tools & Hardware: When AI Gets Real (And Maybe Too Real...)
You knew I had to close this madness out with some Hardware, because hardware means that we actually are interacting with these incredible models in a meaningful way.
Friend.com: When Your AI Is... Always Listening? 🤨
And then this happened... Avi Schiffmann (finally) announces friend.com, with an amazingly dystopian promo video from Sandwich Video - ~22 million views and counting, and not by accident! Link to Video.
It's basically an always-on, listening pendant. "A little like wearing a wire" as Nisten so eloquently put it. 🎧 Not for memory extension or productivity... for friendship. Target audience? Lonely people who want a device that captures and understands their entire lives, but in an almost comforting way (or maybe unsettling, depending on your viewpoint!). The debate about privacy is already RAGING... But as Nisten pointed out:
"Overall, it is a positive. ...The entire privacy talk and data ownership, I think that's a very important conversation to have".
I kinda get the vision. Combine THIS tech with GPT-4o Voice speed... you could actually have engaging conversations, 24/7! 🤯 I don't think it's as simple as "this is dystopian, end of story". Character AI is EXPLODING right now - remember those usage stats, over 20 million users and counting? The potential to help with loneliness is real...
The Developer Corner: Tools for Those Hacking This Future
Gotta love these shoutouts:
* FastHTML from Jeremy Howard: Not strictly AI, but if you hate JS and love Python, this one's for you - insanely FAST web dev using a mind-bending new syntax (see the minimal sketch right after this list). FastHTML website link
* Hamel Husain's Applied LLM Course - All Videos NOW Free!: Want to learn from some of the best minds in the field (including Jeremy Howard, evaluation QUEEN Shreya Shankar, Charles Frye, and tons of other great speakers)? This course covers it all - from finetuning to RAG building to optimizing your prompts. Applied LLMs course - free videos link
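Here's that FastHTML sketch I promised - a hello-world in the spirit of the project's README: pure Python, zero JavaScript:

```python
# A minimal FastHTML "hello world" sketch, based on the project's README.
# pip install python-fasthtml, then run this file.
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # Python functions return HTML components -- no templates, no JS
    return Titled("ThursdAI", P("Hello from FastHTML!"))

serve()  # starts a local uvicorn server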
AND ALSO... Nisten blew everyone's minds again at the end! Remember last week, when we thought it'd take time before anyone could run Llama 3.1 405B on just a CPU? Well, this crazy genius already cracked the code - seven tokens per second on a normal CPU! 🤯 If you're a researcher who hates using cloud GPUs (or wants to use ALL THOSE CORES in your Lambda machine, wink wink)... get ready.
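Nisten didn't share his exact setup on air, but if you want a starting point for CPU-only inference on a quantized GGUF model, here's a sketch with llama-cpp-python - the model path and thread count are purely illustrative:

```python
# CPU-only inference sketch with llama-cpp-python and a quantized GGUF model.
# This is NOT Nisten's actual setup (those details weren't shared) -- just a
# plausible starting point. Path and thread count are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-405b-q2_k.gguf",  # hypothetical quantized file
    n_ctx=4096,       # context window
    n_threads=32,     # use all those cores
)

out = llm("Q: Why run a 405B model on CPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```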
Look, I'm not going to sit here and pretend the weeks aren't getting crazier. It takes me longer and longer to prep for the show, it's harder and harder to keep it to two hours, and we had three breaking news stories just today!
So we're accelerating, and I'll likely be using a bit of support from AI - but only if it's good, and only if it's proof-read by me. So please let me know if you smell slop! I really wanna know!
ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.