Sample Data Upload

Sample data points used to verify that the upload and optimisations succeeded.

To verify that the model was uploaded and optimised correctly, we need a few sample data points to confirm that the model is not corrupted.

Expected Files

Upload 10 pairs of input/output data in .npy format, for a total of 20 files.

File Names

Name inputs as {index}_in.npy and outputs as {index}_out.npy, where {index} is a positive integer numbered sequentially from 1.

Tip

You can drag multiple files into the upload box, or select them all at once!

For example, the first two pairs are named:

1_in.npy
1_out.npy
2_in.npy
2_out.npy
...
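
Before uploading, it can help to check the file names locally. The following is a minimal sketch, not part of the upload tool itself: the helper name check_sample_names and its parameters are hypothetical, and it assumes 10 pairs with indices starting at 1, as in the listing above.

```python
import re

# Assumed pattern, taken from the naming rule: {index}_in.npy / {index}_out.npy
NAME_RE = re.compile(r"(\d+)_(in|out)\.npy")

def check_sample_names(filenames, expected_pairs=10, start=1):
    """Return a list of problems; an empty list means the names look valid."""
    problems = []
    seen = {"in": set(), "out": set()}
    for name in filenames:
        match = NAME_RE.fullmatch(name)
        if match is None:
            problems.append(f"unexpected file name: {name}")
            continue
        # Record which indices exist for inputs and outputs separately.
        seen[match.group(2)].add(int(match.group(1)))
    expected = set(range(start, start + expected_pairs))
    for kind in ("in", "out"):
        for index in sorted(expected - seen[kind]):
            problems.append(f"missing {index}_{kind}.npy")
        for index in sorted(seen[kind] - expected):
            problems.append(f"index out of range: {index}_{kind}.npy")
    return problems
```

Passing the result of os.listdir(...) on your sample folder should return an empty list when all 20 files are present and correctly named.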

🧪 Sample Data Generation

To help you generate sample data, we provide a utility function below. This takes a list of input texts, runs preprocessing and prediction functions, and saves the resulting .npy files:

import os
import numpy as np
from tqdm import tqdm

def generate_npy_folder(input_data, preprocess_fn, predict_fn, output_dir):
    """
    For each text in input_data, runs preprocess_fn(text) -> input_arr,
    then predict_fn(input_arr) -> output_arr.
    Saves to output_dir/{i}_in.npy and output_dir/{i}_out.npy (1-indexed).
    """
    os.makedirs(output_dir, exist_ok=True)
    for i, text in enumerate(tqdm(input_data)):
        input_arr = preprocess_fn(text)
        output_arr = predict_fn(input_arr)
        np.save(os.path.join(output_dir, f"{i+1}_in.npy"), input_arr)
        np.save(os.path.join(output_dir, f"{i+1}_out.npy"), output_arr)
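
Once the files are written, you may want to load them back and inspect their shapes before uploading. The sketch below is one way to do that; the helper summarise_npy_folder is hypothetical, and it assumes the 1-indexed {i}_in.npy / {i}_out.npy layout produced by the function above.

```python
import os
import numpy as np

def summarise_npy_folder(output_dir, start=1):
    """Load consecutive {i}_in.npy / {i}_out.npy pairs and return their shapes."""
    shapes = []
    i = start
    while True:
        in_path = os.path.join(output_dir, f"{i}_in.npy")
        out_path = os.path.join(output_dir, f"{i}_out.npy")
        # Stop at the first index for which either file of the pair is missing.
        if not (os.path.exists(in_path) and os.path.exists(out_path)):
            break
        shapes.append((np.load(in_path).shape, np.load(out_path).shape))
        i += 1
    return shapes
```

If generate_npy_folder was given 10 inputs, the returned list should contain 10 (input shape, output shape) tuples; a shorter list points to a missing or misnamed file.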

🧠 Example: Using BERT From Hugging Face

Below is a simple example using the BERT model from Hugging Face to generate .npy files:

from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_preprocess(text):
    # Returns input_ids as numpy array, shape: (1, seq_len)
    tokens = tokenizer(text, return_tensors="pt")
    return tokens["input_ids"].numpy()

def bert_predict(input_arr):
    # input_arr: numpy array, shape (1, seq_len)
    input_tensor = torch.from_numpy(input_arr)
    with torch.no_grad():
        outputs = model(input_ids=input_tensor)
    return outputs.last_hidden_state.numpy()

You can now pass your list of sample texts, e.g. ["hello world", "machine learning is fun"], into generate_npy_folder(...) with bert_preprocess and bert_predict. To produce the 10 required pairs, the list needs 10 entries.
