Deploy a Speech-To-Text Model using Model Catalog
In this guide, you will learn how to use the Model Catalog to quickly deploy a Wav2Vec2 model on Hyperpod AI and run inference through a simple API.
💡 RECOMMENDED: Try our ready-made Colab notebook and run everything in your browser with zero hassle!
🎙️ What is Wav2Vec2?
Wav2Vec2 is a state-of-the-art speech recognition model developed by Facebook AI. It transforms raw audio waveforms into text using self-supervised learning. It's ideal for building voice assistants, transcription tools, and voice-controlled applications.
To use Wav2Vec2, we convert audio into normalized waveform arrays; the model then maps them directly to tokenized transcriptions.
While the pre-trained Wav2Vec2 model offers decent performance out of the box, it may not be highly accurate for all domains, accents, or specialized vocabularies. However, its architecture is well-suited for fine-tuning on custom datasets, making it a powerful and accessible option for creating speech-to-text models tailored to your specific use case. Hugging Face provides a detailed guide on how to fine-tune Wav2Vec2 using the Transformers library, allowing developers to quickly adapt the model to their own audio data.
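To make the waveform-to-text mapping concrete, here is a minimal sketch of running the same pre-trained checkpoint locally with the Transformers library. The silent placeholder waveform stands in for real audio; the hosted deployment in the steps below wraps this same model behind an API.

import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# `waveform`: a 1-D float array sampled at 16 kHz (one second of silence as a placeholder)
waveform = np.zeros(16000, dtype=np.float32)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits   # shape: (batch, time_steps, vocab_size)
predicted_ids = torch.argmax(logits, dim=-1)     # best token ID per time step
print(processor.batch_decode(predicted_ids)[0])  # human-readable transcription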
🚀 Step 1: Sign Up on Hyperpod AI
Head to app.hyperpodai.com
- Create a free account (includes 10 free hours)
- Click on "Quick Start"
🛠️ Step 2: Set Up Your Project
Fill in the project form as follows:
- Project Name: any name, for your reference
- Choose a Model: select the Wave2Vec Speech-To-Text model
- Click Create Project. This may take a few minutes as Hyperpod sets up and optimizes the deployment for you automatically.
🔌 Step 3: Set up your API
- After the deployment is done, it should look something like this:
- Go to Deployment.
- You can choose between Test mode and Production mode. For this guide, select Test mode, as it's the most cost-effective option. Read more about the difference between test and production mode.
⚠️ Warning: Be sure to manually turn it off by clicking the Off button when you're not using it, to avoid unnecessary charges.
- The deployment process may take around 10 minutes. Once it's ready, copy the Endpoint URL and Project ID shown on this page.
- Navigate to the API Keys section.
- Set a name (for your own reference) and choose a validity period, then click Create Key.
- After the key is generated, copy and store it somewhere safe; you won't be able to view it again from the platform.
⚠️ Warning: Never store your API key in public locations (like GitHub repositories). If exposed, malicious users could access your API and incur charges on your behalf.
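For example, one common pattern is to read the key from an environment variable instead of hard-coding it. The variable name below is just an illustration, not a platform requirement.

import os

# Hypothetical variable name; set it in your shell, not in your source code:
#   export HYPERPOD_API_KEY="..."
api_key = os.environ["HYPERPOD_API_KEY"]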
🎧 Step 4: Preprocess the Audio with Hugging Face
To generate model inputs, you'll first need to preprocess the audio:
from pydub import AudioSegment
import numpy as np
from transformers import Wav2Vec2Processor

# Deployment details from Step 3
endpoint_url = 'your-endpoint-url'
api_key = 'your-api-key'
project_id = 'your-project-id'

file_path = "your-audio-file"
target_sampling_rate = 16000  # Wav2Vec2 expects 16 kHz mono audio

# The processor normalizes the waveform and batches it for the model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

# Resample to 16 kHz and downmix to a single channel
audio = AudioSegment.from_file(file_path).set_frame_rate(target_sampling_rate).set_channels(1)
samples = np.array(audio.get_array_of_samples()).astype(np.float32)

# Keep only the first second of audio (16,000 samples) for this quick test
samples = samples[:target_sampling_rate]

data_values = processor(samples, sampling_rate=target_sampling_rate, return_tensors="np", padding=True)
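Before calling the API, it can help to confirm the processor produced what the endpoint expects: a single batched waveform. A quick optional check:

# Optional sanity check: one batch entry, one second of 16 kHz audio
print(data_values["input_values"].shape)  # expected: (1, 16000)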
📡 Step 5: Call the API
Now that you have the preprocessed input array, send it to the inference endpoint:
import httpx

# http2=True requires the optional extra: pip install 'httpx[http2]'
client = httpx.Client(http2=True, timeout=httpx.Timeout(30.0))

response = client.post(
    endpoint_url,
    headers={
        'content-type': 'application/json',
        'x-api-key': api_key,
    },
    json={
        'project_id': project_id,
        'grpc_data': {'array': data_values['input_values'].tolist()},
    },
)
print(response.text)
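The exact response format depends on the deployment; in this guide we assume the body is JSON describing the output tensor, matching the fields the helper in the final step expects. A minimal sketch for checking the status and parsing it:

# Fail loudly on non-2xx responses, then parse the JSON body
response.raise_for_status()
output = response.json()  # assumed fields: 'array', 'dtype', 'shape'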
🏁 Final Step: Post-processing
The Wav2Vec2 model used here is from Hugging Face's Transformers library. After running inference, the model outputs token IDs representing predicted characters or subwords. To convert these token IDs into human-readable text, you need to use the Wav2Vec2Processor, which comes with the model. This processor handles decoding, including any special handling like removing padding tokens or applying a language model if one is attached. According to the model's documentation on Hugging Face, decoding with this processor is the recommended approach to get accurate transcriptions.
def to_numpy_array(array, dtype, shape):
    # Rebuild the output tensor from its serialized form
    arr = np.array(array, dtype=dtype)
    return arr.reshape(shape)

# `output` is the parsed API response from the previous step,
# assumed to contain 'array', 'dtype', and 'shape' fields
logits = to_numpy_array(**output)[0]        # shape: (time_steps, vocab_size)
predicted_ids = np.argmax(logits, axis=-1)  # best token ID per time step

decoded_text = processor.decode(predicted_ids)
print(decoded_text)