Besides updates to our 14B and 70B models, we have a new LFM2-based 1.2B, a Llama 3.2-based 3B, and a Qwen 3-based 8B, all with class-leading Japanese language capabilities.
As usual, there are lots of details in the Model Cards for those interested.
---
app_port: 8080 # or any integer besides 7860 that's greater than 2 ** 10
startup_duration_timeout: 350m
---

I'll just add that I'm sure it's spam now; that space is attached to another one of my models as well (and is obviously not running either of them). Also, the user's other space links straight out to something shady: https://huggingface.co/spaces/elseodelasgalletas/detector-de-ia (I can't report it as I'm rate limited).
I mean, it's obviously not running my model (it's a brand-new JA/EN ablation), so I'm not sure why it'd be attached...
Also, I tested the new https://huggingface.co/DataPilot/ArrowPro-7B-KUJIRA model and it appears to be the real deal, with very impressive performance. It was trained by a 15-year-old (!), @Holy-fox. Note that using the sampler settings I detailed improved its score as well (otherwise it suffered from looping errors too).
I'll be aiming to beat that with the Llama 3 8B, and to beat Command R Plus with the 70B, in the coming days.
I'll just add a note on the sampler parameters I used for testing, which I found improved performance for virtually every model I tested: temperature 0.2, min_p 0.1, frequency_penalty 0.5. (A frequency/repetition penalty is required to minimize the looping errors that otherwise creep into most of these models.)
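For reference, here is a minimal sketch of what the two less familiar knobs do, assuming the standard min_p rule (keep only tokens whose probability is at least min_p times the top token's probability) and an OpenAI-style frequency penalty. The real samplers live inside the inference engine; this is purely illustrative:

```python
from collections import Counter

def min_p_filter(probs, min_p=0.1):
    # Standard min_p rule: drop any token whose probability is below
    # min_p * (probability of the most likely token), then renormalize.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def apply_frequency_penalty(logits, generated_ids, penalty=0.5):
    # OpenAI-style frequency penalty: subtract penalty * (number of times
    # the token has already been generated) from that token's logit.
    counts = Counter(generated_ids)
    return [logit - penalty * counts.get(tok, 0) for tok, logit in enumerate(logits)]
```

The frequency penalty is what breaks loops: each repetition of a token lowers its logit further, so a model stuck emitting the same phrase gets pushed off it, while min_p (unlike a fixed top_p) adapts the candidate pool to how peaked the distribution is.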
gpt-3.5-turbo-0125's JA performance, which is worth noting, and it is tuned *exclusively* with the old shisa-v1 dataset (so its chart position will be very short-lived).

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")

# A short multi-turn conversation to render with the model's chat template.
messages = [
    {'role': 'user', 'content': 'This is the first user input.'},
    {'role': 'assistant', 'content': 'This is the first assistant response.'},
    {'role': 'user', 'content': 'This is the second user input.'},
]

# Print the raw Jinja chat template stored with the tokenizer...
print()
print('Chat Template:')
print(tokenizer.chat_template)
print()
print('---')
print()
# ...and the conversation rendered through it (untokenized, for inspection).
print(tokenizer.apply_chat_template(messages, tokenize=False))