mlo-data-collab

community

AI & ML interests

None defined yet.

jasoncorkill posted an update 29 days ago
Do you remember https://thispersondoesnotexist.com/? It was one of the first cases where the future of generative media really hit us. Humans are incredibly good at recognizing and analyzing faces, so faces are a very good litmus test for any generative image model.

But none of the current benchmarks specifically measure a model's ability to generate humans. So we built our own. We measured each model's ability to generate a diverse set of human faces and, using over 20'000 human annotations, ranked all of the major models on face generation. Find the full ranking here:
https://app.rapidata.ai/mri/benchmarks/68af24ae74482280b62f7596

We have released the full underlying data publicly here on Hugging Face: Rapidata/Face_Generation_Benchmark
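
If you want to poke at the annotations yourself, here is a minimal sketch using the Hugging Face datasets library (assuming the dataset loads with its default configuration):

```python
from datasets import load_dataset

# Pull the face-generation benchmark annotations from the Hugging Face Hub
ds = load_dataset("Rapidata/Face_Generation_Benchmark")
print(ds)  # inspect the available splits and columns
```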
jasoncorkill posted an update 5 months ago
"Why did the bee get married?"

"Because he found his honey!"

This was the "funniest" of the 10'000 jokes we generated with LLMs: 68% of respondents rated it as "funny".

Original jokes are particularly hard for LLMs: humor is very nuanced, and a lot of context is needed to judge whether something is "funny", something that can only reliably be measured with humans.

LLMs are not equally good at generating jokes in every language. The generated English jokes turned out to be far funnier than the Japanese ones: on average, 46% of English-speaking voters found a generated joke funny. The same statistic for other languages:

Vietnamese: 44%
Portuguese: 40%
Arabic: 37%
Japanese: 28%

Within any given language there is not much variance in quality across models, but Claude Sonnet 4 slightly outperforms the others in Vietnamese, Arabic and Japanese, while Gemini 2.5 Flash leads in Portuguese and English.

We have released the 1 million (!) native-speaker ratings and the 10'000 jokes as a dataset for anyone to use:
Rapidata/multilingual-llm-jokes-4o-claude-gemini
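
As a rough illustration of how the per-language numbers above could be reproduced, here is a minimal sketch; the column names ("language", "funny_votes", "total_votes") are assumptions, so check the dataset card for the actual schema:

```python
from collections import defaultdict

from datasets import load_dataset

# Column names below are assumptions; see the dataset card for the real schema.
ds = load_dataset("Rapidata/multilingual-llm-jokes-4o-claude-gemini", split="train")

counts = defaultdict(lambda: [0, 0])  # language -> [funny votes, total votes]
for row in ds:
    counts[row["language"]][0] += row["funny_votes"]
    counts[row["language"]][1] += row["total_votes"]

for lang, (funny, total) in sorted(counts.items()):
    print(f"{lang}: {funny / total:.0%} of votes rated the jokes funny")
```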
jasoncorkill posted an update 6 months ago
Imagine having an Image Arena-equivalent score at every checkpoint during training. We released the first version of just that:
Crowd-Eval

Add one line of code to your training loop and you get a new, real-human loss curve in your W&B dashboard.
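
As a rough sketch of what that integration could look like (the request_human_eval call below is a hypothetical stand-in, not the actual crowd-eval API; see the repo linked below for the real thing):

```python
import random

import wandb

def train_step() -> float:
    return random.random()  # stand-in for your real training step

def request_human_eval(step: int) -> float:
    # Hypothetical stand-in: this is where crowd-eval would send the current
    # checkpoint's generations out to human raters and return a score.
    return random.random()

wandb.init(project="t2i-training", mode="offline")

for step in range(1000):
    loss = train_step()
    wandb.log({"train/loss": loss}, step=step)
    if step % 100 == 0:
        score = request_human_eval(step)  # the "one line" added to the loop
        wandb.log({"eval/human_preference": score}, step=step)
```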

Thousands of real humans from around the world rating your model in real time at the cost of a few dollars per checkpoint is a game changer.

Check it out here: https://github.com/RapidataAI/crowd-eval

First 5 people to put it in their loop get 100'000 human responses for free! (ping me)
jasoncorkill posted an update 7 months ago
Benchmark Update: @google Veo3 (Text-to-Video)

Two months ago, we benchmarked @google's Veo2 model. It fell short, struggling with style consistency and temporal coherence, trailing behind Runway, Pika, @tencent, and even @alibaba-pai.

That's changed.

We just wrapped up benchmarking Veo3, and the improvements are substantial. It outperformed every other model by a wide margin across all key metrics; not just better, it dominates in style, coherence, and prompt adherence. It's rare to see such a clear lead in today's hyper-competitive T2V landscape.

Dataset coming soon. Stay tuned.
jasoncorkill posted an update 7 months ago
🔥 Hidream I1 is online! 🔥

We just added Hidream I1 to our T2I leaderboard (https://www.rapidata.ai/leaderboard/image-models), benchmarked using 195k+ human responses from 38k+ annotators, all collected in under 24 hours.

It landed #3 overall, right behind:
- @openai 4o
- @black-forest-labs Flux 1 Pro
...and just ahead of @black-forest-labs Flux 1.1 Pro, @xai-org Aurora and @google Imagen3.

Want to dig into the data? Check out our dataset here:
Rapidata/Hidream_t2i_human_preference

What model should we benchmark next?
jasoncorkill posted an update 8 months ago
🚀 Building Better Evaluations: 32K Image Annotations Now Available

Today, we're releasing an expanded version: 32K images annotated with 3.7M responses from over 300K individuals, completed in under two weeks using the Rapidata Python API.

Rapidata/text-2-image-Rich-Human-Feedback-32k

A few months ago, we published one of our most-liked datasets, with 13K images based on the @data-is-better-together dataset, following Google's research on "Rich Human Feedback for Text-to-Image Generation" (https://arxiv.org/abs/2312.10240). It collected over 1.5M responses from 150K+ participants.

Rapidata/text-2-image-Rich-Human-Feedback

In the examples below, users highlighted words from prompts that were not correctly depicted in the generated images. Higher word scores indicate more frequent issues. If an image captured the prompt accurately, users could select [No_mistakes].
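
If you want to explore those word-level annotations yourself, here is a minimal sketch; the field names ("prompt", "word_scores") are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Field names here are assumptions; see the dataset card for the real schema.
ds = load_dataset("Rapidata/text-2-image-Rich-Human-Feedback-32k", split="train")

row = ds[0]
print(row["prompt"])       # the text-to-image prompt
print(row["word_scores"])  # per-word error frequencies highlighted by annotators
```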

We're continuing to work on large-scale human feedback and model evaluation. If you're working on related research and need large, high-quality annotations, feel free to get in touch: [email protected].
jasoncorkill posted an update 8 months ago
🚀 We tried something new!

We just published a dataset using a new (for us) preference modality: direct ranking based on aesthetic preference. We ranked a couple of thousand images from most to least preferred, all sampled from the Open Image Preferences v1 dataset by the amazing @data-is-better-together team.

📊 Check it out here:
Rapidata/2k-ranked-images-open-image-preferences-v1

We're really curious to hear your thoughts!
Is this kind of ranking interesting or useful to you? Let us know! 💬

If it is, please consider leaving a ❤️, and if we hit 30 ❤️s, we'll go ahead and rank the full 17k image dataset!
jasoncorkill posted an update 8 months ago
🔥 Yesterday was a fire day!
We dropped two brand-new datasets capturing human preferences for text-to-video and text-to-image generations, powered by our own crowdsourcing tool!

Whether you're working on model evaluation, alignment, or fine-tuning, this is for you.

1. Text-to-Video Dataset (Pika 2.2 model):
Rapidata/text-2-video-human-preferences-pika2.2

2. Text-to-Image Dataset (Reve-AI Halfmoon):
Rapidata/Reve-AI-Halfmoon_t2i_human_preference

Let's train AI on AI-generated content with humans in the loop.
Letโ€™s make generative models that actually get us.
jasoncorkill posted an update 8 months ago
🚀 Rapidata: Setting the Standard for Model Evaluation

Rapidata is proud to announce our first independent appearance in academic research, featured in the Lumina-Image 2.0 paper. This marks the beginning of our journey to become the standard for testing text-to-image and generative models. Our expertise in large-scale human annotations allows researchers to refine their models with accurate, real-world feedback.

As we continue to establish ourselves as a key player in model evaluation, we're here to support researchers with high-quality annotations at scale. Reach out to [email protected] to see how we can help.

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework (2503.21758)
jasoncorkill posted an update 9 months ago
🔥 It's out! We published the dataset for our evaluation of @OpenAI's new 4o image generation model.

Rapidata/OpenAI-4o_t2i_human_preference

Yesterday we published the first large evaluation of the new model, showing that it absolutely leaves the competition in the dust. We have now made the results and data available here! Please check it out and ❤️!
jasoncorkill posted an update 9 months ago
🚀 First Benchmark of @OpenAI's 4o Image Generation Model!

We've just completed the first-ever (to our knowledge) benchmarking of the new OpenAI 4o image generation model, and the results are impressive!

In our tests, OpenAI 4o image generation absolutely crushed leading competitors, including @black-forest-labs, @google, @xai-org, Ideogram, Recraft, and @deepseek-ai, in prompt alignment and coherence! It holds a gap of more than 20% over the nearest competitor in Bradley-Terry score, the biggest lead we have seen since the benchmark began!
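
For context on that number: a Bradley-Terry score is fit from pairwise human preferences, where each model gets a strength parameter and P(i beats j) = s_i / (s_i + s_j). A minimal sketch of the standard fitting iteration (illustrative only, not our exact pipeline):

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times model i was preferred over model j."""
    n = wins.shape[0]
    scores = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum(
                (wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                for j in range(n) if j != i
            )
            scores[i] = total_wins / denom
        scores /= scores.sum()  # normalize; only relative scores are identifiable
    return scores

# Toy example: three models with pairwise human preference counts
wins = np.array([[0, 70, 80],
                 [30, 0, 55],
                 [20, 45, 0]])
print(bradley_terry(wins))
```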

The benchmarks are based on 200k human responses collected through our API. However, the most challenging part wasn't the benchmarking itself, but generating and downloading the images:

- 5 hours to generate 1000 images (no API available yet)
- Just 10 minutes to set up and launch the benchmark
- Over 200,000 responses rapidly collected

While generating the images, we faced some hurdles that forced us to leave out certain parts of our prompt set. In particular, we observed that the OpenAI 4o model proactively refused to generate certain images:

🚫 Styles of living artists: completely blocked
🚫 Copyrighted characters (e.g., Darth Vader, Pokémon): initially generated but subsequently blocked

Overall, OpenAI 4o stands out significantly in alignment and coherence, especially excelling at certain unusual prompts that have historically caused issues, such as 'A chair on a cat.' See the images for more examples!
jasoncorkill posted an update 9 months ago
At Rapidata, we compared DeepL with LLMs like DeepSeek-R1, Llama, and Mixtral on translation quality, using feedback from over 51,000 native speakers. Despite its cost, DeepL's performance makes it a valuable investment, especially in critical applications where translation quality is paramount. Now we can say that Europe does more than impose regulations.

Our dataset, based on these comparisons, is now available on Hugging Face. This might be useful for anyone working on AI translation or language model evaluation.

Rapidata/Translation-deepseek-llama-mixtral-v-deepl
jasoncorkill posted an update 9 months ago
Benchmarking Google's Veo2: How Does It Compare?

The results did not meet expectations. Veo2 struggled with style consistency and temporal coherence, falling behind competitors like Runway, Pika, Tencent, and even Alibaba. While the model shows promise, its alignment and quality are not yet there.

Google recently launched Veo2, its latest text-to-video model, through select partners like fal.ai. As part of our ongoing evaluation of state-of-the-art generative video models, we rigorously benchmarked Veo2 against industry leaders.

We generated a large set of Veo2 videos, spending hundreds of dollars in the process, and systematically evaluated them using our Python-based API for human and automated labeling.

Check out the ranking here: https://www.rapidata.ai/leaderboard/video-models

Rapidata/text-2-video-human-preferences-veo2
jasoncorkill posted an update 10 months ago
Has OpenGVLab Lumina Outperformed OpenAI's Model?

We've just released the results from a large-scale human evaluation (400k annotations) of OpenGVLab's newest text-to-image model, Lumina. Surprisingly, Lumina outperforms OpenAI's DALL-E 3 in terms of alignment, although it ranks #6 in our overall human preference benchmark.

To support further development in text-to-image models, we're making our entire human-annotated dataset publicly available. If you're working on model improvements and need high-quality data, feel free to explore.

We welcome your feedback and look forward to any insights you might share!

Rapidata/OpenGVLab_Lumina_t2i_human_preference
jasoncorkill posted an update 10 months ago
The Sora Video Generation Aligned Words dataset contains a collection of word segments for text-to-video or other multimodal research. It is intended to help researchers and engineers explore fine-grained prompts, including those where certain words are not aligned with the video.

We hope this dataset will support your work in prompt understanding and advance progress in multimodal projects.

If you have specific questions, feel free to reach out.
Rapidata/sora-video-generation-aligned-words