NeMo Inspector

NeMo Inspector is a tool designed to help you analyze Large Language Model (LLM) generations. It provides two main pages: the Inference page and the Analyze page.

About NeMo Inspector

The Inference page allows you to experiment with model prompts and responses in real-time, adjusting various parameters. The Analyze page lets you load previously generated outputs and apply filtering, sorting, labeling, and statistic calculations for in-depth exploration.

Getting Started

  1. Clone and Install the Tool:

    git clone git@github.com:NVIDIA/NeMo-Inspector.git
    cd NeMo-Inspector
    pip install .
    
  2. Launch the Tool:

    nemo_inspector
    

This will start a local server that you can access through your browser.

Inference Page

The Inference page allows you to generate responses using an LLM and analyze them immediately. It supports two generation modes.

The Inference page utilizes NeMo-Skills pipelines for inference.

Analyze Page

The Analyze page helps you work with pre-generated outputs. To use it, provide paths to the generation files using command-line arguments. For example:

nemo_inspector --inspector_params.model_prediction \
  generation1='/path/to/generation1/output-greedy.jsonl' \
  generation2='/path/to/generation2/output-rs*.jsonl'
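
Each generation file is expected to be in JSONL format, with one record per line. The exact schema depends on how the generations were produced; the snippet below is only a sketch with hypothetical field values, using the field names (generation, is_correct, error_message) referenced by the examples later in this document:

import json

# Hypothetical JSONL line; real records come from your generation pipeline
line = '{"generation": "The answer is 4.", "is_correct": true, "error_message": ""}'
sample = json.loads(line)
print(sample['generation'], sample['is_correct'])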

Once loaded, the Analyze page lets you filter, sort, compute statistics on, and modify the generations, as described in the sections below.

Filtering

The tool supports two filtering modes: Filter Files mode and Filter Questions mode. You can define custom filtering functions in Python and run them directly in the UI.

Filter Files Mode

Example of a custom filtering function:

def custom_filtering_function(error_message: str) -> bool:
    # Keep entries whose error message does not mention a timeout
    return 'timeout' not in error_message

# This line will be used for the filtering:
custom_filtering_function(data['generation1']['error_message'])

Note: The last line of the custom filtering function is used for filtering. All preceding lines are just for computation.

To apply multiple conditions to multiple generations, use the && separator. For instance:

data['generation1']['is_correct'] && not data['generation2']['is_correct']

Important: In Filter Files mode, a single condition cannot span multiple generations. Write one condition per generation and separate the conditions with &&.
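
The same pattern works with other fields. For instance, the following hypothetical condition keeps entries where the first generation is correct and the second generation's error message does not mention a timeout (assuming both fields exist in your records):

data['generation1']['is_correct'] && 'timeout' not in data['generation2']['error_message']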

Filter Questions Mode

In this mode, you write conditions without the && operator. For example:

data['generation1'][0]['is_correct'] and not data['generation2'][0]['is_correct']

This example keeps only the questions where the first generation is correct and the second generation is incorrect. Conditions can also compare fields directly:

data['generation1'][0]['is_correct'] != data['generation2'][0]['is_correct']

Note: These examples cannot be used in Filter Files mode.
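
Custom filtering functions can be used in this mode as well. Below is a minimal sketch, assuming that in Filter Questions mode data['generation1'] is a list of sample dictionaries for the current question (as the [0] indexing above suggests); the all_samples_correct helper is hypothetical:

def all_samples_correct(samples):
    # Hypothetical helper: True if every sampled response for this question is correct
    return all(sample['is_correct'] for sample in samples)

# This line will be used for the filtering:
all_samples_correct(data['generation1']) and not all_samples_correct(data['generation2'])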

Sorting

Sorting functions are similar to filtering functions, but there are key differences:

  1. Scope: Sorting functions operate on individual data entries (not dictionaries with multiple generations).
  2. Cross-Generations: Sorting cannot be applied across multiple generations at once. You must sort one generation at a time.

A correct sorting function might look like this:

def custom_sorting_function(generation: str):
    # Sort by the length of the generation text
    return len(generation)

# This line will be used for the sorting:
custom_sorting_function(data['generation'])

Statistics

NeMo Inspector supports two types of statistics:

  1. Custom Statistics: Applied to the samples of a single question (for each generation).

    Default custom statistics include:

    • correct_responses
    • wrong_responses
    • no_responses
  2. General Custom Statistics: Applied across all questions and all generations.

    Default general custom statistics include:

    • dataset size
    • overall number of samples
    • generations per sample

You can modify the existing functions or define your own Custom and General Custom Statistics functions.

Custom Statistics Example:

def unique_error_counter(datas):
    # `datas` is a list of dictionaries (one per generation file) for a single question
    unique_errors = set()
    for data in datas:
        unique_errors.add(data.get('error_message'))
    return len(unique_errors)

def number_of_runs(datas):
    return len(datas)

# Map function names to functions
{'unique_errors': unique_error_counter, 'number_of_runs': number_of_runs}

General Custom Statistics Example:

def overall_unique_error_counter(datas):
    # `datas` is a list of lists of dictionaries, 
    # where datas[question_index][file_index] is a JSON record
    unique_errors = set()
    for question_data in datas:
        for file_data in question_data:
            unique_errors.add(file_data.get('error_message'))
    return len(unique_errors)

# Map function names to functions
{'unique_errors': overall_unique_error_counter}

Note: The final line in both the Custom and General Custom Statistics code blocks should be a dictionary mapping function names to their corresponding functions.

Modifications

You can update each sample in the dataset programmatically. The last line of the code block should evaluate to the updated sample dictionary:

# For example, strip leading and trailing whitespace from the "generation" field
{**data, 'generation': data['generation'].strip()}
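
For a more involved modification, you can follow the same pattern as the filtering functions, with the last line evaluating to the updated dictionary. Below is a minimal sketch, assuming the sample has a "generation" field; the normalize_sample helper and the derived "generation_length" field are hypothetical:

def normalize_sample(sample):
    # Strip surrounding whitespace and record the length of the generated text
    generation = sample['generation'].strip()
    return {**sample, 'generation': generation, 'generation_length': len(generation)}

# This line produces the updated sample:
normalize_sample(data)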