Sample Data Upload
A few data points to verify that everything is working correctly.
To confirm that the model was uploaded and optimised correctly, we need some sample data points to check that the model is not corrupted.
Expected Files
We need 10 pairs of input/output data in .npy format, for a total of 20 uploaded files.
File Names
- Name inputs as {index}_in.npy and outputs as {index}_out.npy, where {index} is a positive integer, numbered sequentially starting from 1.
You can drag or select multiple files at once into the upload box!
E.g.
1_in.npy
1_out.npy
2_in.npy
2_out.npy
...
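Before uploading, it can help to check a folder against the naming rules above. The following is a minimal sketch, not part of the platform: check_sample_folder and its expected_pairs parameter are hypothetical names, and it assumes indices run from 1 to 10 with no gaps.

```python
import os
import re
import tempfile

import numpy as np

# Matches names such as "3_in.npy" or "12_out.npy" (index is a positive integer).
NAME_PATTERN = re.compile(r"^([1-9]\d*)_(in|out)\.npy$")


def check_sample_folder(folder, expected_pairs=10):
    """Return a list of problems found in folder; an empty list means none were found."""
    problems = []
    seen = {"in": set(), "out": set()}
    for name in sorted(os.listdir(folder)):
        match = NAME_PATTERN.match(name)
        if match is None:
            problems.append(f"unexpected file name: {name}")
            continue
        seen[match.group(2)].add(int(match.group(1)))

    # Assumption: indices run 1..expected_pairs with no gaps.
    for index in range(1, expected_pairs + 1):
        for kind in ("in", "out"):
            if index not in seen[kind]:
                problems.append(f"missing file: {index}_{kind}.npy")
    for kind in ("in", "out"):
        for index in sorted(seen[kind]):
            if index > expected_pairs:
                problems.append(f"index out of range: {index}_{kind}.npy")
    return problems


# Small demonstration on a temporary folder holding 10 placeholder pairs.
with tempfile.TemporaryDirectory() as folder:
    for index in range(1, 11):
        np.save(os.path.join(folder, f"{index}_in.npy"), np.zeros((1, 4)))
        np.save(os.path.join(folder, f"{index}_out.npy"), np.zeros((1, 2)))
    print(check_sample_folder(folder))  # prints []
```

An empty list means every expected file name was found and nothing unexpected is present; the check does not inspect the array contents.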
🧪 Sample Data Generation
To help you generate sample data, we provide a utility function below. It takes a list of input texts, runs the preprocessing and prediction functions on each one, and saves the resulting .npy files:
import os

import numpy as np
from tqdm import tqdm


def generate_npy_folder(input_data, preprocess_fn, predict_fn, output_dir):
    """
    For each input_datum, runs preprocess_fn(text) -> input_arr,
    then predict_fn(input_arr) -> output_arr.
    Saves to output_dir/{i}_in.npy and output_dir/{i}_out.npy (1-indexed).
    """
    os.makedirs(output_dir, exist_ok=True)
    for i, text in enumerate(tqdm(input_data)):
        input_arr = preprocess_fn(text)
        output_arr = predict_fn(input_arr)
        np.save(os.path.join(output_dir, f"{i+1}_in.npy"), input_arr)
        np.save(os.path.join(output_dir, f"{i+1}_out.npy"), output_arr)
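To try the flow without downloading a model, the sketch below uses toy stand-ins for the two callbacks. toy_preprocess, toy_predict and save_pairs are hypothetical names introduced only for illustration; save_pairs repeats the save loop of generate_npy_folder (without the progress bar) so that the snippet runs on its own.

```python
import os
import tempfile

import numpy as np


def toy_preprocess(text):
    # Hypothetical stand-in for a tokenizer: one character code per column,
    # giving shape (1, len(text)).
    return np.array([[ord(ch) for ch in text]], dtype=np.int64)


def toy_predict(input_arr):
    # Hypothetical stand-in for a model: mean character code, shape (1, 1).
    return input_arr.mean(axis=1, keepdims=True).astype(np.float32)


def save_pairs(input_data, preprocess_fn, predict_fn, output_dir):
    # Same save loop as generate_npy_folder, minus the progress bar.
    os.makedirs(output_dir, exist_ok=True)
    for i, text in enumerate(input_data, start=1):
        input_arr = preprocess_fn(text)
        np.save(os.path.join(output_dir, f"{i}_in.npy"), input_arr)
        np.save(os.path.join(output_dir, f"{i}_out.npy"), predict_fn(input_arr))


# Ten inputs produce the ten required pairs (20 files).
texts = [f"sample sentence number {n}" for n in range(1, 11)]
with tempfile.TemporaryDirectory() as output_dir:
    save_pairs(texts, toy_preprocess, toy_predict, output_dir)
    saved = sorted(os.listdir(output_dir))
    first_in = np.load(os.path.join(output_dir, "1_in.npy"))
    first_out = np.load(os.path.join(output_dir, "1_out.npy"))

print(len(saved))  # prints 20
print(first_in.shape, first_out.shape)
```

Each input is saved separately, so inputs of different lengths are fine; the arrays do not need to share a shape across pairs.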
🧠 Example: Using BERT From Hugging Face
Below is a simple example using the BERT model from Hugging Face to generate .npy files:
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def bert_preprocess(text):
    # Returns input_ids as numpy array, shape: (1, seq_len)
    tokens = tokenizer(text, return_tensors="pt")
    return tokens["input_ids"].numpy()


def bert_predict(input_arr):
    # input_arr: numpy array, shape (1, seq_len)
    input_tensor = torch.from_numpy(input_arr)
    with torch.no_grad():
        outputs = model(input_ids=input_tensor)
    return outputs.last_hidden_state.numpy()
You can now pass your list of sample texts, e.g. ["hello world", "machine learning is fun"], into generate_npy_folder(...) together with bert_preprocess and bert_predict. Provide 10 texts so that the 10 required input/output pairs are generated.
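Once the files are generated, a local replay can catch obvious problems before uploading. The sketch below reloads each saved input, reruns the prediction function and compares the result with the saved output. replay_check is a hypothetical helper, and the tolerances are arbitrary assumptions; how the platform itself compares outputs is not specified here.

```python
import os
import tempfile

import numpy as np


def replay_check(folder, predict_fn, num_pairs=10, rtol=1e-4, atol=1e-5):
    """Return the indices whose recomputed output differs from the saved output."""
    mismatched = []
    for index in range(1, num_pairs + 1):
        input_arr = np.load(os.path.join(folder, f"{index}_in.npy"))
        expected = np.load(os.path.join(folder, f"{index}_out.npy"))
        actual = np.asarray(predict_fn(input_arr))
        # A shape mismatch counts as a failure; otherwise compare within tolerance.
        if actual.shape != expected.shape or not np.allclose(
            actual, expected, rtol=rtol, atol=atol
        ):
            mismatched.append(index)
    return mismatched


def toy_model(input_arr):
    # Toy stand-in for a real prediction function (assumption for the demo).
    return input_arr * 2.0


with tempfile.TemporaryDirectory() as folder:
    for index in range(1, 11):
        sample = np.full((1, 3), float(index))
        np.save(os.path.join(folder, f"{index}_in.npy"), sample)
        np.save(os.path.join(folder, f"{index}_out.npy"), toy_model(sample))
    print(replay_check(folder, toy_model))  # prints []
```

With the BERT example, the same call would take bert_predict as predict_fn and the output directory passed to generate_npy_folder as folder.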