AI Architect by MindStudio
Data Source Limit Expanded 10x, Firecrawl Launch Next Week

You can now upload up to 5M words as data sources in MindStudio, and soon you'll be able to scrape URLs with Firecrawl.

Giorgio Barilla
July 12, 2024

You're receiving this email because you registered for one of our workshops. You can unsubscribe at the bottom of each email at any time.

We hope you had a wonderful 4th of July and enjoyed your break. Recently, the AI world has been a bit quieter, but we have still released some great features at MindStudio to help you improve your AI workflows.

The industry still made progress, with Anthropic leading the way through low-key updates. This week, they introduced Claude 3 Haiku fine-tuning on AWS and Artifact publishing, which lets anyone create and launch small web apps without coding. A French AI lab, Kyutai, released Moshi, a real-time voice model similar to the promised GPT-4o voice mode (which is not coming anytime soon).

Resources for Pros

Try out the new MindStudio Trainer v2 - available to all users

Learn more about the Model Mixer, our newest addition

Build a Claude 3.5 Sonnet chatbot that can search the web

Which Model to Choose for Your Next AI Build

Learn to use RAG in MindStudio: when it is optimal and when it isn’t

What’s coming next

New voices for text to speech models

Better templating with copy-paste blocks and workflows

First-party integration with Firecrawl

More types of data sources and data retrieval techniques (e.g. GraphRAG)

Less reliance on Zapier for Agentive workflows

More interface upgrades to let you do more with AI

As a reminder, we’re now welcoming partners that want to build AIs for their clients. Sign up for extra support, training resources, and more here.

🗞️ Industry news

Anthropic launches public artifacts - aka hosted mini web apps

Copyright Anthropic

Anthropic has been on a rapid innovation streak recently, setting itself apart with a unique approach to announcements. Unlike OpenAI, which often shares early-stage developments, Anthropic waits until their innovations are ready before making them public.

Artifacts let Claude show you important content in a separate window, away from the main chat. They are like small spaces to run web apps and display long text outputs.

People have been building all sorts of interesting mini apps with Artifacts, and the feature prompted new startups to sell micro services around building apps with Claude 3.5 Sonnet - the best model for coding.

I built a snake game with the artifact feature to test out the publishing options. This took 1 prompt:

“Create a snake game, playable in artifacts. Please make sure the game includes the snake, which should get longer every time it eats a fruit, and a wide variety of fruits randomly spawning across the grid. Use emojis to represent the fruits. Use the emoji of a snake as the head of the snake, and green squares to make it bigger. The game should start after the player clicks a big button that says "START", and should not start automatically. The game should end when the snake hits the wall or eats itself. Please count the score and display at the end of the game, including a button to restart and play again.”

Here’s the game. Let me know your score!

You can now fine-tune Claude 3 Haiku on AWS

Copyright Anthropic

AWS customers can now fine-tune Claude 3 Haiku, Anthropic's fastest and most cost-effective model, using Amazon Bedrock. This allows businesses to customize the model's knowledge and capabilities so it handles specialized tasks more effectively.

Fine-tuning involves training the model on a set of high-quality prompt-completion pairs, creating a version of Claude 3 Haiku tailored to specific workflows. The fine-tuning API, currently in preview, enables testing and refining of custom models via the Amazon Bedrock console or API until they meet desired performance goals.

For example, SK Telecom improved customer support workflows and satisfaction by fine-tuning Claude for telecommunications tasks, resulting in better performance metrics and positive feedback. Thomson Reuters also anticipates enhanced AI solutions by optimizing Claude 3 Haiku with their industry expertise.

Claude 3 Haiku is one of the best models in terms of cost and quality, and we’re keeping an eye on the new fine-tuning API. If Anthropic makes it public outside of AWS, we might consider adding it to MindStudio. Would you be interested?

The media is ringing the alarm bell: AI consumes too much water

Generated by DALL-E 3

Generative AI systems require vastly more computational power than traditional online services, leading to increased electricity and water consumption.

The environmental impact is becoming harder to ignore. For example, Google's energy consumption has doubled since 2019, and Microsoft’s AI developments may compromise its sustainability goals.

Beyond energy, data centers also consume vast amounts of water, which is evaporated and lost to the atmosphere, affecting local water supplies.

Climate concerns around AI are more than valid, but it's worth noting that tech companies are working to limit their footprint, and few innovations start out without higher energy consumption.

The internet itself faced similar criticism from the New York Times in 2012, in an article titled “Power, Pollution and the Internet”, with lots of fearful talk about data centers. Now, we can’t imagine living without them, and quite a large share of less energy-intensive web applications run on green energy.

In other news, a French startup released a sort-of GPT-4o voice model with near real-time speech to speech. The problem?

It’s… bad.

Like really, really bad.

The voice model itself is interesting, and it is indeed real time. However, the underlying LLM is unbelievably basic, resembling the level of intelligence of GPT-2. It’s unable to hold any conversation or retain information, plus it repeats itself over and over in all new chats.

They have a long way to go, but it’s still incredible to see how far we’ve come in being able to communicate with machines without even typing. You can try Moshi for free here.

🔥 Product Updates

5M-word upload limit in MindStudio

You can now upload up to 5 million words for each AI-powered application in MindStudio. There’s still a limit of 50 MB per file, with up to 150 files.
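If you prepare data sources in bulk, a quick pre-flight check can save a failed upload. The snippet below is only a rough sketch: the `check_upload` helper and its input format are hypothetical (not part of MindStudio), and it assumes "up to 150" refers to the number of files.

```python
# Limits taken from the announcement above.
MAX_WORDS_PER_APP = 5_000_000
MAX_FILE_SIZE_MB = 50
MAX_FILES = 150  # assumption: "up to 150" is read as a file count


def check_upload(files):
    """Return a list of problems for a planned upload.

    `files` is a list of (name, size_mb, word_count) tuples.
    This is a hypothetical helper, not a MindStudio API.
    """
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size_mb, _ in files:
        if size_mb > MAX_FILE_SIZE_MB:
            problems.append(f"{name} is {size_mb} MB (limit {MAX_FILE_SIZE_MB} MB)")
    total_words = sum(words for _, _, words in files)
    if total_words > MAX_WORDS_PER_APP:
        problems.append(f"total words {total_words} > {MAX_WORDS_PER_APP}")
    return problems


planned = [("handbook.csv", 12, 900_000), ("transcripts.csv", 64, 2_000_000)]
for problem in check_upload(planned):
    print(problem)
```

An empty list means the planned upload fits within the limits as described here.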

Grounding your AI applications in reality is one of the most common use cases for MindStudio, and we wanted to ensure you didn’t feel constrained by the previous 500k-word limit. After all, some of you have much bigger knowledge bases or simply want to provide more details to your workflows.

There’s no extra cost to use the new limits. Every user on the Individual and Teams plans will see them starting today.

You can use our MindStudio Trainer v2 to create data sources to use with the new limit:

MindStudio Trainer v2 includes a Data Source Builder. You can now scrape your website or YouTube channel with one click and get the CSV output to upload in MindStudio.

You can use it by creating a new app - it's the first template in the list. Learn more here - the landing page includes FAQs, highlights, and a video overview.

What’s coming next:

MindStudio is releasing a new scraping feature powered by Firecrawl, a well-known web scraping tool for LLM training. Later, Firecrawl will become an option for data sources, allowing you to crawl all your website URLs at once or on a schedule;
We’re releasing new voices for the text to speech models next week;
MindStudio will soon have a new learn page with a more organized list of resources to learn more about the tool;
We’re working in the background to bring you major updates in the future. While this means smaller releases in the meantime, rest assured you will love the new updates*

*OK, just a SMALL spoiler. We’re looking into ways to improve our RAG platform and enable more options to feed data into the LLM, like GraphRAG. Additionally, we’re looking to decrease the reliance on tools like Zapier to build agentive workflows by building many integrations into MindStudio itself - no extra account required. These are longer-term projects that might take a few weeks to come to light, but they’re both going to redefine what’s possible with MindStudio.

💡 Tip of The Week

List of chunks for a data source in MindStudio Trainer v2

Ever wondered what a “chunk” is and how it relates to RAG? It’s the term we use for a specific portion of a document fetched using RAG (retrieval-augmented generation).

In MindStudio, that’s typically a Query Data Source block. This block will fetch data from a data source and return 1 to 5 chunks in a variable. Then, you can use that variable as context to ground your AI in reality or perform additional data cleaning operations.

A chunk in MindStudio is approximately 500 words, so the maximum output is around 2500 words. RAG will try to fetch the chunks that most closely resemble the query.
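To make the arithmetic concrete, here is a small sketch using the approximate figures above. The constant and function names are made up for illustration, and real chunk sizes vary.

```python
import math

WORDS_PER_CHUNK = 500      # approximate chunk size mentioned above
MAX_CHUNKS_PER_QUERY = 5   # a query returns 1 to 5 chunks


def estimate_chunks(word_count):
    """Rough number of chunks a data source of `word_count` words splits into."""
    return math.ceil(word_count / WORDS_PER_CHUNK)


def max_context_words(requested_chunks):
    """Approximate upper bound on words returned by a single query."""
    chunks = max(1, min(requested_chunks, MAX_CHUNKS_PER_QUERY))
    return chunks * WORDS_PER_CHUNK


print(max_context_words(5))        # 2500
print(estimate_chunks(5_000_000))  # 10000
```

In other words, a data source at the full 5 million word limit holds roughly 10,000 chunks, and a single query surfaces at most 5 of them.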

For example, if you have an encyclopedia as a data source and your query is “cat”, RAG will try to find the chunks of the encyclopedia most relevant to cats, typically those that mention them. It will fetch X chunks (up to 5) and save the result in a variable you define in the Query Data Source block.

Then, you can use that variable in a Generate Text block, or even in the chat terminator, to give the AI more context about the query (in this example, cats).
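If you are curious what "fetch the most relevant chunks" looks like in code, here is a toy sketch. It is not how MindStudio works internally: it swaps in plain word-count cosine similarity where a real RAG system would typically use embeddings, and `query_data_source` is a hypothetical function named after the block.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_data_source(chunks, query, k=3):
    """Return the k chunks most similar to the query (k capped at 5)."""
    k = max(1, min(k, 5, len(chunks)))
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(Counter(tokenize(c)), q), reverse=True)
    return ranked[:k]


encyclopedia = [
    "The cat is a small domesticated feline.",
    "Volcanoes erupt when magma rises.",
    "A cat purrs when content; cats sleep a lot.",
]
context = query_data_source(encyclopedia, "cat", k=2)
print(context)  # this list plays the role of the variable you pass to Generate Text
```

The returned list is what you would drop into the prompt as context.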

You can query multiple data sources in a workflow for different queries, but you should avoid querying data sources that only contain a couple of chunks.

Let’s assume your encyclopedia is split into 2 chunks of 500 words each, for a total of 1000. When you query it, you ask for 3 chunks. That’s simply not possible, given that the data source is only split into 2.

And that’s not all. If a file is so small it fits within 1 to 3 chunks, you’re probably better off adding it as context within the prompt rather than running a semantic search (RAG) to find the appropriate pieces. In most cases, RAG will simply retrieve the whole document anyway.
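As a rule of thumb, the two points above can be sketched as a small helper. The function name, the 3-chunk threshold, and the return shape are illustrative assumptions based on this tip, not MindStudio settings.

```python
import math

WORDS_PER_CHUNK = 500     # approximate chunk size
SMALL_SOURCE_CHUNKS = 3   # "fits within 1 to 3 chunks" threshold from the tip
MAX_CHUNKS_PER_QUERY = 5


def plan_query(word_count, requested_chunks):
    """Suggest a strategy and how many chunks a query can actually return."""
    available = math.ceil(word_count / WORDS_PER_CHUNK)
    # You can't get back more chunks than the data source contains.
    effective = min(requested_chunks, available, MAX_CHUNKS_PER_QUERY)
    # Tiny sources are better pasted straight into the prompt.
    strategy = "inline" if available <= SMALL_SOURCE_CHUNKS else "rag"
    return strategy, effective


print(plan_query(1_000, 3))   # ('inline', 2)
print(plan_query(50_000, 5))  # ('rag', 5)
```

The first call mirrors the encyclopedia example: asking for 3 chunks from a 2-chunk source can only ever return 2, so pasting the text into the prompt is simpler.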

If you want to learn more about RAG, you can watch our short explainer video on YouTube (10m) or watch a full webinar on the topic (50m).

🤝 Community Events

If you want to hang out with our team, we usually host a Discord event every Friday @ 3PM Eastern. Join our Discord channel to keep up to date with the hangouts - our entire team is active there.

You can register for upcoming events on our brand new events page here.

Our new webinar series is up on there as well, with the following on-demand webinars:

Plus, we have new weekly and bi-weekly events:

Thank you for being an invaluable member of our community. It’s always great to see many of you join multiple workshops 🔥

🌯 That’s a wrap!

Stay tuned to learn more about what’s next and get tips & tricks for your MindStudio build.

You saw it here first,

Giorgio Barilla
MindStudio Developer & Project Manager @ MindStudio