AI & ML interests

None defined yet.

Recent Activity

JoseRFJunior posted an update over 1 year ago
JoseRFJunior/TransNAR
https://github.com/JoseRFJuniorLLMs/TransNAR
https://arxiv.org/html/2406.09308v1
TransNAR hybrid architecture. Similar to Alayrac et al., we interleave existing Transformer layers with gated cross-attention layers that enable information to flow from the NAR to the Transformer. We generate queries from tokens, while we obtain keys and values from the nodes and edges of the graph. The node and edge embeddings are obtained by running the NAR on the graph version of the reasoning task to be solved. When experimenting with pre-trained Transformers, we initially close the cross-attention gate in order to fully preserve the language model's internal knowledge at the beginning of training.
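
A minimal PyTorch sketch of such a gated cross-attention block (my illustration of the idea, not the released code; all names are mine): queries come from the token stream, keys and values from the NAR's node and edge embeddings, and a zero-initialized tanh gate makes the block a no-op at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Zero-initialized gate: tanh(0) = 0, so the block is initially a
        # no-op and the pre-trained LM's knowledge is fully preserved.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, tokens: torch.Tensor, graph_emb: torch.Tensor) -> torch.Tensor:
        # tokens:    (batch, seq_len, d_model)        -> queries
        # graph_emb: (batch, n_nodes_edges, d_model)  -> keys and values (from the NAR)
        attn_out, _ = self.attn(query=tokens, key=graph_emb, value=graph_emb)
        return tokens + torch.tanh(self.gate) * attn_out

x = torch.randn(2, 16, 256)   # token hidden states from a Transformer layer
g = torch.randn(2, 10, 256)   # node/edge embeddings from the NAR
out = GatedCrossAttention(256, 8)(x, g)
```
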
vladbogo posted an update over 1 year ago
SwapAnything is a new method that allows swapping any object in an image with personalized concepts given by a reference image.

Key points:
1️⃣ It uses pre-trained diffusion models to enable precise and high-fidelity object swapping in images.
2️⃣ Targeted variable swapping ensures perfect background preservation while swapping specific areas (see the sketch below).
3️⃣ SwapAnything achieves good results in single-object, multi-object, partial-object, and cross-domain swapping tasks.
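
A hedged sketch of the targeted variable swapping idea: at each denoising step, latents outside the object mask are copied from the source image so the background is preserved exactly. Variable names are illustrative, not from the authors' code.

```python
import torch

def swap_latents(source_latents, concept_latents, mask):
    """Keep the source background; swap in the concept inside the mask.

    source_latents, concept_latents: (batch, channels, h, w)
    mask: (1, 1, h, w), 1.0 inside the region to swap, 0.0 elsewhere.
    """
    return mask * concept_latents + (1.0 - mask) * source_latents

# Toy usage inside a denoising loop (scheduler/UNet calls omitted):
src = torch.randn(1, 4, 64, 64)   # latents of the source image at step t
gen = torch.randn(1, 4, 64, 64)   # latents carrying the personalized concept
m = torch.zeros(1, 1, 64, 64)
m[..., 16:48, 16:48] = 1.0        # hypothetical mask over the object to swap
blended = swap_latents(src, gen, m)
```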

Paper: SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing (2404.05717)
Project page: https://swap-anything.github.io

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
Anthropic introduces "Many-shot Jailbreaking" (MSJ), a new attack on large language models! MSJ exploits long context windows to override safety constraints.

Key Points:
* Prompts LLMs with hundreds of examples of harmful behavior formatted as a dialogue (format sketched below)
* Generates malicious examples using an uninhibited "helpful-only" model
* Effective at jailbreaking models such as Claude 2.0, GPT-3.5, and GPT-4
* Standard alignment techniques provide limited protection against long-context attacks
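
A minimal sketch of the prompt format the attack relies on: many in-context dialogue turns prepended before the final query. Placeholders stand in for the harmful examples studied in the paper.

```python
def build_many_shot_prompt(examples, final_question):
    """examples: list of (question, answer) pairs; hundreds in the paper."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in examples]
    turns.append(f"User: {final_question}\nAssistant:")
    return "\n\n".join(turns)

shots = [("<placeholder question>", "<placeholder answer>")] * 256
prompt = build_many_shot_prompt(shots, "<target question>")
# The paper finds attack success keeps rising as the number of shots grows,
# which is why long context windows make the attack practical.
```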

Paper: https://www.anthropic.com/research/many-shot-jailbreaking
More details in my blog: https://huggingface.co/blog/vladbogo/many-shot-jailbreaking

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
Google DeepMind introduces Gecko, a new text embedding model! Gecko uses a two-step process that leverages synthetic data generation and reranking.

Key points:
* Uses an LLM to generate diverse synthetic queries and tasks from web passages
* Refines the data by retrieving candidate passages and relabeling positives/negatives with the same LLM (both steps sketched below)
* Achieves very good results on the Massive Text Embedding Benchmark, where the compact 256D Gecko outperforms 768D models
* The 768D Gecko achieves state-of-the-art performance, competing with far larger models
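
A schematic sketch of the two-step recipe, with `llm` and `retrieve` as hypothetical stubs for an LLM API and a first-stage retriever (not the paper's code):

```python
def llm(prompt: str) -> str:
    return "stub"                 # placeholder LLM call

def retrieve(query: str, k: int = 20) -> list[str]:
    return ["passage"] * k        # placeholder first-stage retrieval

def make_training_triple(seed_passage: str):
    # Step 1: the LLM reads a web passage and writes a query for it.
    query = llm(f"Write a search query this passage answers: {seed_passage}")
    # Step 2: retrieve candidates, then let the same LLM relabel them;
    # the chosen positive may differ from the seed passage.
    candidates = retrieve(query)
    positive = llm(f"Pick the passage that best answers '{query}': {candidates}")
    hard_negative = llm(f"Pick a plausible but wrong passage for '{query}': {candidates}")
    return query, positive, hard_negative
```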

Paper: Gecko: Versatile Text Embeddings Distilled from Large Language Models (2403.20327)
More details in my blog: https://huggingface.co/blog/vladbogo/gecko

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
A new paper titled "Long-Form Factuality in Large Language Models" proposes a new approach to evaluating the long-form factuality of large language models using an AI agent! It introduces SAFE (Search-Augmented Factuality Evaluator), which leverages an LLM to break down responses into individual facts, query Google to verify each fact, and perform multi-step reasoning.

Key points:
* SAFE (Search-Augmented Factuality Evaluator) is an automated method that uses an LLM agent to evaluate factuality (sketched below)
* It also introduces LongFact, a set of 2,280 prompts spanning 38 topics to test open-domain factual knowledge
* SAFE achieves 72% agreement with human annotators while being 20x cheaper. It also wins 76% of disagreement cases in a small-scale experiment where a more thorough human procedure (researchers + full internet search) was used
* Larger models like GPT-4, Claude Opus, and Gemini Ultra tend to exhibit better long-form factuality
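
A high-level sketch of the SAFE loop, with hypothetical stubs for the LLM and the Google Search call; see the released code below for the real implementation:

```python
def llm(prompt: str) -> str:
    return "stub"                  # placeholder LLM call

def google_search(query: str) -> str:
    return "search results"        # placeholder search call

def safe_evaluate(response: str) -> dict:
    # 1. Break the long-form response into individual facts.
    facts = llm(f"Split into individual facts:\n{response}").split("\n")
    verdicts = {}
    for fact in facts:
        evidence = ""
        # 2. Multi-step reasoning: issue several searches, accumulating evidence.
        for _ in range(3):
            query = llm(f"Search query to verify '{fact}'. Evidence so far: {evidence}")
            evidence += google_search(query)
        # 3. Rate each fact as supported or not given the gathered evidence.
        verdicts[fact] = llm(f"Is '{fact}' supported by: {evidence}? Answer yes/no.")
    return verdicts
```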

Paper: Long-form factuality in large language models (2403.18802)
Code and data: https://github.com/google-deepmind/long-form-factuality

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
A new paper introduces Visual CoT, a new approach that enhances multi-modal large language models with visual chain-of-thought reasoning capabilities. This allows language models to dynamically identify and focus on specific regions within images that are most relevant for answering questions, mimicking human-like efficient visual reasoning.

Key points:
* Introduces the 373k Visual CoT dataset with bounding box annotations highlighting essential image regions
* Proposes a multi-turn pipeline for focusing on relevant visual inputs (sketched below)
* Achieves strong results on multi-modal benchmarks
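
A hedged sketch of the multi-turn idea: first predict a bounding box for the question-relevant region, then answer from the zoomed-in crop. `vlm` is a hypothetical stand-in for the multi-modal model:

```python
from PIL import Image

def vlm(image: Image.Image, prompt: str) -> str:
    return "0,0,100,100"           # placeholder model call

def visual_cot(image: Image.Image, question: str) -> str:
    # Turn 1: ask the model where to look.
    box = vlm(image, f"{question}\nOutput the relevant region as x1,y1,x2,y2.")
    x1, y1, x2, y2 = (int(v) for v in box.split(","))
    # Turn 2: answer with the zoomed-in evidence.
    return vlm(image.crop((x1, y1, x2, y2)), question)

answer = visual_cot(Image.new("RGB", (640, 480)), "What is on the table?")
```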

Paper: Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2403.16999)
Code, data and other resources: https://github.com/deepcs233/Visual-CoT

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
xAI releases the weights for Grok-1. Apparently it is a 314B-parameter mixture-of-experts (MoE) model with 25% of the weights active on a given token.
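
Grok-1 reportedly routes each token to 2 of its 8 experts, which is roughly how a quarter of the weights end up active per token. A toy top-k MoE layer illustrating the routing (simplified, not xAI's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        scores, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(scores, dim=-1)      # mix the k chosen experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):               # only k of n_experts run per token
            for slot in range(self.k):
                expert = self.experts[int(idx[t, slot])]
                out[t] += weights[t, slot] * expert(x[t])
        return out

y = TopKMoE()(torch.randn(4, 64))                # 4 tokens, each hits 2/8 experts
```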

Blog: https://x.ai/blog/grok-os
Code: https://github.com/xai-org/grok
Model: xai-org/grok-1
Weights: magnet:?xt=urn:btih:5f96d43576e3d386c9ba65b883210a393b68210e&tr=https%3A%2F%2Facademictorrents.com%2Fannounce.php&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
vladbogo posted an update over 1 year ago
"Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts" is a new framework designed to animate specific regions within an image through user inputs.

Key points:
* Enables precise animation of selected image regions with just a user click and a concise motion description.
* Achieves promising results for generating localized animations.

Paper: Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts (2403.08268)

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
Synth^2 is a new approach that leverages large language models and text-to-image generators to create synthetic image-caption data for boosting visual-language model performance.

Key Points:
* Overcomes data limitations by generating high-quality synthetic image-caption pairs (pipeline sketched below), reducing reliance on costly human annotations.
* Achieves competitive results on image captioning tasks using 40x less paired data than state-of-the-art methods.
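
A schematic sketch of the generation loop: an LLM writes captions, and a text-to-image generator turns them into images (reportedly in embedding space, to save compute). `llm` and `text_to_image` are hypothetical stubs, not the paper's code:

```python
def llm(prompt: str) -> str:
    return "a red bicycle leaning against a brick wall"   # placeholder caption

def text_to_image(caption: str):
    return object()   # stands in for a generated image (or image embedding)

def synth_pairs(n: int):
    pairs = []
    for _ in range(n):
        caption = llm("Write a diverse, detailed image caption.")
        image = text_to_image(caption)
        pairs.append((image, caption))   # synthetic pair for VLM pre-training
    return pairs
```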

Paper: Synth²: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings (2403.07750)

Congrats to the authors for their work!
vladbogo posted an update over 1 year ago
A recent paper titled "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect" proposes a simple and effective approach to pruning Large Language Models (LLMs) by removing redundant layers.

Key points:
* Discovers significant redundancy across layers in LLMs, with some layers playing a negligible role in the final performance.
* Defines a new metric called Block Influence (BI) to quantify the importance of each layer in an LLM (computation sketched below).
* Removes layers with low BI scores, achieving up to a 25% reduction in parameters and computation while maintaining 92% of the LLM's performance.
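
A minimal sketch of Block Influence as described in the paper: one minus the average cosine similarity between a layer's input and output hidden states, so layers that barely transform their input score low and are pruned first (not the authors' code):

```python
import torch
import torch.nn.functional as F

def block_influence(h_in: torch.Tensor, h_out: torch.Tensor) -> float:
    """h_in, h_out: (tokens, hidden) hidden states before/after one layer."""
    cos = F.cosine_similarity(h_in, h_out, dim=-1)   # per-token similarity
    return (1.0 - cos).mean().item()                 # low BI => near-identity layer

def layers_to_prune(hidden_states: list[torch.Tensor], n_remove: int) -> list[int]:
    """hidden_states: per-layer activations from a small calibration set."""
    scores = [block_influence(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    return sorted(range(len(scores)), key=scores.__getitem__)[:n_remove]
```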

Congrats to the authors for their work!

Paper: ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853)

mvaloatto posted an update over 1 year ago
vladbogo posted an update almost 2 years ago
A recent paper titled "Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters" proposes using fine-tuned Multimodal Language Models (MLMs) as high-quality filters for image-text data.

Key points:
* Defines multiple metrics to assess image-text quality from different perspectives like object details, text quality, and semantic understanding.
* Leverages GPT-4 and GPT-4V to construct high-quality instruction data for fine-tuning open-source MLMs as effective data filters.
* Fine-tuned MLM filters generate more precise scores, leading to better-filtered data and improved performance of pre-trained models on various downstream tasks (filtering loop sketched below).
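
A hedged sketch of the filtering loop: score each image-text pair along quality axes like those above with the fine-tuned MLM, and keep pairs above a threshold. `mlm_score` is a hypothetical stand-in for a call to the released filter model, and the metric names are illustrative:

```python
METRICS = ["image_text_matching", "object_detail", "caption_text_quality",
           "semantic_understanding"]   # illustrative axis names

def mlm_score(image, caption: str, metric: str) -> float:
    return 85.0   # placeholder for a call to the fine-tuned MLM filter

def filter_pairs(pairs, threshold: float = 80.0):
    kept = []
    for image, caption in pairs:
        scores = [mlm_score(image, caption, m) for m in METRICS]
        if min(scores) >= threshold:   # keep pairs that pass every axis
            kept.append((image, caption))
    return kept
```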

Congrats to the authors for their work!

Paper: Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters (2403.02677)
Code: https://github.com/Victorwz/MLM_Filter
Dataset: weizhiwang/mlm_filter_instructions
Model: weizhiwang/mlm-filter-llava-13b-gpt4v
vladbogo posted an update almost 2 years ago
"Multi-LoRA Composition for Image Generation" introduces two new approaches for combining multiple visual elements in text-to-image generation using Low-Rank Adaptations (LoRAs)! 🎨

Key Points:
* Proposes two methods - LoRA Switch and LoRA Composite - that activate/combine LoRAs during the denoising process rather than merging weights
* LoRA Switch cycles through different LoRAs at each step (sketched below), while LoRA Composite averages guidance from all LoRAs simultaneously
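
A hedged sketch of the LoRA Switch idea using diffusers-style APIs; the adapter names and paths are illustrative, and exact calls may differ across diffusers versions:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Hypothetical adapter files for two visual elements (character + style).
pipe.load_lora_weights("path/to/character_lora", adapter_name="character")
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")
adapters = ["character", "style"]

def lora_switch(pipeline, step, timestep, callback_kwargs):
    # Round-robin: a single, different LoRA is active at each denoising step,
    # rather than merging adapter weights up front.
    pipeline.set_adapters([adapters[step % len(adapters)]])
    return callback_kwargs

image = pipe("a portrait", num_inference_steps=30,
             callback_on_step_end=lora_switch).images[0]
```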

Paper: Multi-LoRA Composition for Image Generation (2402.16843)
Project page: https://maszhongming.github.io/Multi-LoRA-Composition

Congrats to the authors for their work!
vladbogo posted an update almost 2 years ago
The "Design2Code: How Far Are We From Automating Front-End Engineering" paper presents a benchmark for multimodal large language models (LLMs) aimed at automating front-end web development by translating webpage designs (screenshots) into code. This task evaluates the models' ability to recreate webpages that are visually and structurally similar to the original designs.

Key Points:
* Introduces the Design2Code task and benchmark for converting webpage screenshots into code, aiming to automate front-end web development.
* Evaluates multimodal LLMs using comprehensive metrics for visual similarity and element matching (a simplified similarity check is sketched below).
* GPT-4V outperforms other models in terms of visual resemblance and content accuracy, with generated webpages often preferred over the original references.
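
A simplified sketch of the visual-similarity side of the evaluation: take a screenshot of the rendered generated page (rendering step omitted here) and compare CLIP image embeddings against the reference design. This stands in for the paper's fuller metric suite:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_similarity(reference: Image.Image, rendered: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of the two screenshots."""
    inputs = processor(images=[reference, rendered], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])   # closer to 1.0 means more visually similar
```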

Paper: Design2Code: How Far Are We From Automating Front-End Engineering? (2403.03163)
Project page: https://salt-nlp.github.io/Design2Code/
Dataset: SALT-NLP/Design2Code

Congrats to the authors for their work!