Hugging Face Model Score Curation at Endor Labs
Understand how models are factored and scored at Endor Labs, new exploration tab for HuggingFace models
Endor Labs recently extended its Open Source Software (OSS) discovery capability to include Open Source AI models on Hugging Face. Now you can evaluate models based on activity, popularity, security, and quality to help developers select safe and sustainable models.
In this blog we will break down how our scoring system works and how you can best leverage it to select safer models.
What is “Hugging Face”?
Hugging Face is an open-source platform that provides tools and models for natural language processing (NLP) and machine learning (ML). It offers a large repository of pre-trained models which can be used for tasks ranging from text classification and translation to question answering and more. Think of this repository of models as similar to the vast repositories available on GitHub, except in the artificial intelligence (AI) and machine learning context. On top of this model hub, Hugging Face also provides APIs and libraries that make it easier for developers to experiment with and deploy machine learning models.
Hugging Face API Playground
One important feature we rely on to build our scoring system is the set of APIs that Hugging Face has put in place for its models. Hugging Face currently offers a user-friendly interface called the Hub API Playground, which is a great way to learn the Hugging Face APIs.
In particular, the GET /api/models endpoint returns an object with several fields that are valuable when scoring models. An example of an API GET request for the model meta-llama/Meta-Llama-3-8B-Instruct is shown below, along with a few of the fields it returns.
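For illustration, here is a minimal Python sketch of that request using the requests library. This is not the Endor Labs backend code, and the exact fields returned vary by model:

```python
import requests

# Query the public Hub API for a single model's metadata.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
resp = requests.get(f"https://huggingface.co/api/models/{model_id}", timeout=30)
resp.raise_for_status()
info = resp.json()

# A few of the returned fields that are useful when scoring.
print(info["id"], info.get("downloads"), info.get("likes"))
print(info.get("tags", [])[:5])
```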
Building our backend using Hugging Face model metadata
We can retrieve Hugging Face metadata for a given model from the following URL:
https://huggingface.co/api/models/{modelName}
This returns a huggingface_hub.ModelInfo object with the fields detailed below; a short usage sketch follows the list.
- id (str) — ID of model.
- author (str, optional) — Author of the model.
- sha (str, optional) — Repo SHA at this particular revision.
- created_at (datetime, optional) — Date of creation of the repo on the Hub. Note that the lowest value is 2022-03-02T23:29:04.000Z, corresponding to the date when Hugging Face began storing creation dates.
- last_modified (datetime, optional) — Date of last commit to the repo.
- private (bool) — Is the repo private.
- disabled (bool, optional) — Is the repo disabled.
- downloads (int) — Number of downloads of the model over the last 30 days.
- downloads_all_time (int) — Cumulative number of downloads of the model since its creation.
- gated (Literal["auto", "manual", False], optional) — Is the repo gated. If so, whether there is manual or automatic approval.
- gguf (Dict, optional) — GGUF information of the model.
- inference (Literal["cold", "frozen", "warm"], optional) — Status of the model on the inference API. Warm models are available for immediate use. Cold models will be loaded on first inference call. Frozen models are not available in Inference API.
- likes (int) — Number of likes of the model.
- library_name (str, optional) — Library associated with the model.
- tags (List[str]) — List of tags of the model. Compared to card_data.tags, contains extra tags computed by the Hub (e.g. supported libraries, model’s arXiv).
- pipeline_tag (str, optional) — Pipeline tag associated with the model.
- mask_token (str, optional) — Mask token used by the model.
- widget_data (Any, optional) — Widget data associated with the model.
- model_index (Dict, optional) — Model index for evaluation.
- config (Dict, optional) — Model configuration.
- transformers_info (TransformersInfo, optional) — Transformers-specific info (auto class, processor, etc.) associated with the model.
- trending_score (int, optional) — Trending score of the model.
- card_data (ModelCardData, optional) — Model Card Metadata as a huggingface_hub.repocard_data.ModelCardData object.
- siblings (List[RepoSibling]) — List of huggingface_hub.hf_api.RepoSibling objects that constitute the model.
- spaces (List[str], optional) — List of spaces using the model.
- safetensors (SafeTensorsInfo, optional) — Model’s safetensors information.
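The same metadata can also be fetched with the official huggingface_hub client library. A minimal sketch follows; field values such as downloads may be absent for some models:

```python
from huggingface_hub import model_info

# Fetch the ModelInfo object for a public model.
info = model_info("BAAI/bge-small-en-v1.5")
print(info.id, info.downloads, info.likes, info.created_at)

# siblings lists every file in the repo -- handy later for spotting
# pickle-based weights versus safetensors.
print([s.rfilename for s in (info.siblings or [])][:10])
```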
This raw information is processed and converted into a series of “score factors” that capture properties of a specific Hugging Face model which contribute, positively or negatively, to the risks of using that model. We organize these score factors into four major categories that capture what we believe are key aspects of risk (a simplified scoring sketch follows the list):
- Security: Indicates the number of security-related issues a model may have. For example, the RepoSibling object, which holds all filenames located in the model repository, gives insight into insecure file formats such as pickle (more information on pickle formats can be found here) and safe file formats such as safetensors, which lower or raise the security score respectively.
- Activity: Indicates the level of development activity for a model observed on Hugging Face. This includes signals such as the number of pull requests or discussion posts in the model’s repo. Models with higher activity scores are more active and presumably better maintained than models with lower activity scores.
- Popularity: Indicates how widely a model is used on Hugging Face by tracking source code management system metrics (for example, the number of likes or downloads for the model) as well as counting how many Spaces use the model. A high popularity score indicates that the model is widely used.
- Code Quality: Indicates how well the model complies with best practices. For instance, the additional metadata that may come with the ModelCardData object contains license information, which can indicate quality. A model with a higher quality score has fewer code issues.
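To make the idea concrete, below is a deliberately simplified, hypothetical sketch of how a few of these raw fields could be turned into score factors and a category score. The factor names, weights, and thresholds are illustrative only and are not Endor Labs' actual scoring formula.

```python
# Hypothetical, simplified scoring sketch -- the factor names and weights
# are illustrative, not the production Endor Labs scoring model.
from huggingface_hub import model_info

def score_factors(repo_id: str) -> dict:
    info = model_info(repo_id)
    files = [s.rfilename for s in (info.siblings or [])]
    return {
        # Security signals: pickle-based weights (.bin/.pkl) vs. safetensors.
        "uses_safetensors": any(f.endswith(".safetensors") for f in files),
        "uses_pickle": any(f.endswith((".bin", ".pkl", ".pickle")) for f in files),
        # Popularity signals: raw usage metrics.
        "downloads_30d": info.downloads or 0,
        "likes": info.likes or 0,
        # Code quality signal: does the model card declare a license?
        "has_license": getattr(info.card_data, "license", None) is not None,
    }

def security_score(factors: dict) -> float:
    """Toy category score on a 0-10 scale derived from the security factors."""
    score = 5.0
    if factors["uses_safetensors"]:
        score += 2.5
    if factors["uses_pickle"]:
        score -= 2.5
    return max(0.0, min(10.0, score))

factors = score_factors("BAAI/bge-small-en-v1.5")
print(factors, security_score(factors))
```

In practice, many more factors feed each category, but the shape of the computation is the same: raw metadata in, normalized score factors out, category scores derived from them.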
Overall, there are several factors which holistically come together to create a comprehensive score card for Hugging Face models. As with our previous work with OSS packages, we will continuously evolve these by adding more score factors and taking advantage of additional information that becomes available through the HF APIs.
Using a Large Language Model (LLM) to extract score factors
As detailed in a previous blog post here, there are several security and operational risks that come with deploying machine learning models and that should also be factored into our model scoring system. However, these important factors are not always available in the metadata returned by the Hugging Face API calls. Instead, much of the valuable information comes from the model READMEs located inside the repositories. These README files are essentially what are known as “Model Cards” and contain valuable metadata about the model. Take, for instance, risks such as unproven models, missing evaluation results, or malicious example code. These are not factors that can be obtained through the metadata coming from Hugging Face’s API as shown above, but they can be easily identified in the model’s README document.
Before we dive deeper into this topic, we will briefly go over what an LLM is. LLMs are a subset of machine learning models specifically designed to understand, generate, and work with human language. Popular and well-known examples of LLMs are ChatGPT, Bard, etc. These LLMs can handle complex language tasks, such as answering questions, summarizing text, translating languages, or generating human-like text given an input from the user.
The problem is that README files on the Hugging Face Hub are still evolving, and no standardized formats or requirements for writing them exist as of now. As a result, some models may not have a README at all, while others have one that is missing the information we are looking for. Take the following two models with existing README documents for comparison. For BAAI/bge-small-en-v1.5, the document is quite extensive, revealing much about the training and evaluation process. This builds a user’s trust and understanding of the model, increasing the model’s overall score. On the other hand, the README for 2Noise/ChatTTS is much more bare-bones. The lack of information and insight into how the model was built decreases the model’s quality score as well as other factors that rely on the README.
With our own eyes, we can easily read a model README and judge whether it contains sufficient information. Things change quickly, however, when trying to use code to parse the unpredictable writing styles and formats found across the thousands upon thousands of model READMEs available on the Hugging Face Hub.
Consequently, we decided to take advantage of LLMs themselves to do the parsing for us! Instead of trying (and failing) to manually parse all these very diverse README files, we instruct an LLM to extract key pieces of information and use that information when computing our scores.
Experimentation:
To make sure we were not taking a shot in the dark, we validated the results parsed by an LLM. The initial experiment consisted of a Python program that retrieved the README content of 100 models chosen from Hugging Face. The program then passed the extracted content along with a predefined prompt to an LLM, and the response was recorded and manually verified to determine whether the LLM had answered accurately. For each model there were 7 different prompts, each corresponding to a specific score factor we wanted to curate. The results were satisfactory: the average rate of correct responses across the 7 prompts was 90.625%.
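A rough sketch of what such a validation loop could look like is shown below. The post does not specify which LLM or prompts were used, so the OpenAI client, the model name, and the prompt text here are placeholders for illustration only.

```python
# Rough sketch of the validation loop -- not the actual experiment code.
# The OpenAI client and prompt are stand-ins; any LLM provider would do.
from huggingface_hub import hf_hub_download
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODELS = ["BAAI/bge-small-en-v1.5", "2Noise/ChatTTS"]  # the experiment used 100 models
PROMPT = ("Does the following Hugging Face model card include evaluation "
          "results? Answer YES or NO.\n\n{readme}")

results = {}
for repo_id in MODELS:
    # Download the model card (README.md) from the Hub.
    readme_path = hf_hub_download(repo_id=repo_id, filename="README.md")
    with open(readme_path, encoding="utf-8") as f:
        readme = f.read()[:20000]  # naive truncation to keep the prompt bounded
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(readme=readme)}],
    )
    results[repo_id] = reply.choices[0].message.content

# Responses were then checked by hand against the actual README contents.
print(results)
```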
Limitations:
Undoubtedly, using an LLM to analyze and parse the READMEs comes with limitations. The first and most obvious is that the LLM does not respond with 100 percent accuracy.
Another limitation of using an LLM is that, just like human-written text, its responses can be unpredictable and difficult to parse programmatically. As a result, we worked through several prompt iterations until we ultimately arrived at a reliable, parseable response.
Mitigations:
The first issue we mitigated is that the LLM can return an unreliable response, which turned parsing into a guessing game. Even a requirement such as “respond with a simple YES or NO” could cause it to add varying filler words at times, and this led to incomplete scores being built in our database. As a result, we gave the LLM a very strict prompt requiring it to respond with a complete JSON object, which makes parsing much easier. With these efforts, parsing was no longer a problem: we simply access the fields of the unmarshalled JSON object to obtain our score factors, such as whether example code is present, whether a base model exists, and so on.
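Below is a simplified illustration of a JSON-only prompt and the corresponding parsing step; the JSON keys are hypothetical stand-ins for the actual score factors.

```python
# Simplified illustration of the JSON-only prompt and parsing step.
# The keys below are hypothetical stand-ins for the real score factors.
import json

STRICT_PROMPT = """You are analyzing a Hugging Face model card.
Respond with ONLY a JSON object, no prose, in exactly this shape:
{{"has_example_code": true|false, "has_base_model": true|false, "has_eval_results": true|false}}

Model card:
{readme}
"""
# The prompt would be sent as STRICT_PROMPT.format(readme=model_card_text).

def parse_llm_reply(reply: str) -> dict:
    """Parse the LLM reply, tolerating stray text around the JSON object."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(reply[start:end + 1])

# Example with a well-formed reply.
factors = parse_llm_reply(
    '{"has_example_code": true, "has_base_model": false, "has_eval_results": true}'
)
print(factors["has_example_code"], factors["has_base_model"])
```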
When it comes to incorrect responses, we address them with careful prompt design to minimize false positives. For example, when asking a question like “does this model’s README contain performance results?”, a well-designed prompt makes it less likely for the LLM to hallucinate fictional performance results. It is, however, more likely to produce false negatives (i.e. the README has performance results but the LLM misses them). Nonetheless, we carefully log all LLM interactions so we can review and evaluate the quality of the parsing, make necessary changes, and address errors.
Examples in our UI
Model discovery for many of the existing models on Hugging Face is now available in our UI.
Additionally, scores are visible for the categories and factors we have discussed above. Take advantage of these resources to start safely building and deploying ML models in your code today!
In this post, we described how to systematically mine metadata from Hugging Face to increase visibility into the vast number of models available. While our results are carefully vetted, there is always an underlying problem of trust: most of the model metadata available on Hugging Face is essentially self-reported. When a model claims that it was trained on specific datasets or has specific performance results, there is no easy way to verify this information, because at its core the model is just a large amount of binary data. Validating performance would require deploying the model and running tests, and there is no practical way to confirm that a model was trained on a given dataset just by looking at its weights.
Ultimately, there is a larger story around AI-SBOMs and building a trusted inventory of model capabilities, but that is a blog for another day.
AI model discovery is available now under DroidGPT in the free trial. Sign up now and give it a spin!