Internal note
As stated in the post "Strategic Roadmap for Innovating Language Learning with LingoStand: Prioritizing Multimodal Analysis and Scenario-Based Practice", I started working on multimodal awareness to build a proof of concept.
I did the heavy lifting needed to start testing, but I found some issues during the tests.
I think we should stop going down this route for now, before it becomes a source of waste.
Key Issues Identified
- Inconsistent Feedback:
  - The feedback provided by the Gemini model can be inconsistent, especially for longer content. I tried a ~1-minute audio file (which is not that long, really) of me reading a passage from Seth Godin's "This Is Marketing" and got sometimes good but sometimes really bad feedback (see the sketch after this list for the kind of test I ran).
  - I think this model works way better for short content.
  - The OpenAI GPT-4o API does not offer multimodal capabilities yet.
- We need to build a complex flow for this to be actually useful: correcting the various language issues that can come from media (intonation, grammar, pronunciation, vocabulary) requires a longer, more complicated flow to turn it into an actual learning experience.
  - Depending on the issues in the media file, you can get a long, categorized list of improvements, but turning that into a practical learning experience (e.g. turning it into practical learning bites, or helping the user correct everything before turning it into bites) is another matter and would, I think, require a lot of work.
- The pronunciation coach that we already have in the bites is more useful, because it focuses on repetition and feedback within the bites.
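
For context, a test like the one described above can be sketched roughly as follows. This is a minimal sketch assuming the google-generativeai Python SDK; the model name, file name, and prompt wording are illustrative placeholders, not necessarily what was actually used in the PoC.

```python
# Minimal sketch of an audio-feedback test against Gemini, assuming the
# google-generativeai SDK and an audio-capable model; names below are
# illustrative placeholders, not the exact PoC setup.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the ~1 minute read-aloud recording via the File API.
audio = genai.upload_file("reading_passage.m4a")

model = genai.GenerativeModel("gemini-1.5-pro")  # assumed audio-capable model

prompt = (
    "Listen to this recording of a learner reading a passage aloud. "
    "Give feedback on pronunciation, intonation, and pacing, grouped by category."
)

response = model.generate_content([prompt, audio])
print(response.text)
```

The inconsistency shows up when the same file is run through a call like this several times: the quality and focus of the feedback vary noticeably between runs.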
I think that real-time simulations can be more useful for learning, because they are by nature more bite-sized and can provide immediate, short, bite-sized feedback that can be applied during the simulation. So we should focus on something where the learning potential is clearer.
Maybe we need to wait for the multimodal APIs to get better and shelve this for now, turning instead to text-based experiences that are more robust.
An interesting idea could be that it gives a general overview of the user's level, but we would need to ensure that the user's input content is actually suitable for level assessment.
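
If we ever pick that level-overview idea up, the gating step could look something like the rough sketch below (same SDK as above; the prompt wording, model name, and the "INSUFFICIENT SAMPLE" convention are assumptions, not an existing implementation).

```python
# Rough sketch of gating level assessment on whether the user's input is
# actually assessable; SDK, model name, and prompt wording are assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model

def assess_level(sample_text: str) -> str:
    """Return a CEFR estimate, or flag the sample as too thin to assess."""
    prompt = (
        "First decide whether the following learner text is long and varied "
        "enough to estimate a CEFR level (A1-C2). If it is not, reply only "
        "with 'INSUFFICIENT SAMPLE'. Otherwise reply with the estimated level "
        "and a two-sentence justification.\n\n" + sample_text
    )
    return model.generate_content(prompt).text
```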