Visual language vs. mental models

Why did Google's VLM demo work so well? (Originally a LinkedIn post.)

We tell each other stories all day, and most of them only have words when we think to give them some.

The stories I'm talking about are the visual kind: the symbols and images that come from artists, governments, and corporations; together they make up our culture, our shared mental models, our lingua franca.

Google's vision language model (VLM) demo has me thinking about our personal and cultural relationships to images, symbols and shared mental models.

The VLM demo was brought to life by Mercari, a digital marketplace with over 20 million monthly users and over 3 billion listings to date. Because it is a marketplace, the success of its platform has depended on enabling reliable selling, searching, and purchasing, all services that require an excellent information architecture.

Not only was the data clean, it was also rich: the three-billion-listings-to-date kind of rich.

The Mercari platform provided Google's VLM team with tons of user-generated photos from tons of user-generated listings in tons of user-selected categories, all validated by tons of real user-to-user interaction. And it shows. The demo works really, really well.

Community-driven platforms like this are beautifully suited to AI-augmented services. Embedded in their data are our shared mental models and genuine human interaction; pair a vision language model with that data and it can extrapolate words and interpret meaning even when there is no canonical data, which is nothing short of incredible.
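
To make that concrete, here is a minimal sketch, not Google's or Mercari's actual pipeline: an off-the-shelf vision language model captioning a photo with no title, category, or other metadata attached. The model name is a real, publicly available captioning model; the image path is a placeholder invented for the example.

```python
# Illustrative only: an off-the-shelf captioning model describing a photo
# it has never seen, with no listing metadata to lean on.
from transformers import pipeline

# BLIP is a small, publicly available vision language model for captioning.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# "listing_photo.jpg" is a placeholder for a seller's snapshot.
result = captioner("listing_photo.jpg")

# Prints something like "a pair of white sneakers on a wooden floor":
# words extrapolated from the image alone.
print(result[0]["generated_text"])
```

That descriptive language comes out of models trained on exactly the kind of human-made, human-validated imagery that community platforms generate at scale.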

Community-driven platforms.

Human-centered foundations.