The Speech to Text template uses speech recognition and natural language processing to generate text transcriptions from uploaded audio or video files, making it easier to extract information from audio-visual content.
This is a Node-RED flow that transcribes video or audio to text and then, using your custom prompt, draws a conclusion from the transcribed text. It consists of the following nodes:
- HTTP Input Node (`/convertSpeech`): This node listens for incoming HTTP POST requests at the `/convertSpeech` endpoint. It expects the request to include an audio file or a YouTube video URL, along with an OpenAI API key.
- Function Node (`check type`): This function node determines whether the user has provided an audio file or a YouTube video URL. If a URL is provided, it sets up the necessary parameters for downloading the audio from the video. If a file is provided, it prepares the payload for the OpenAI API request.
- YouTube-YTDL Node: If a YouTube video URL is provided, this node downloads the audio from the video.
- HTTP Request Node (to OpenAI): If an audio file is provided, this node sends a POST request to the OpenAI API (`https://api.openai.com/v1/audio/transcriptions`) with the audio file and the necessary headers, including the API key.
- Function Node (`response`): This function node processes the response from the OpenAI API. If the response status code is 200 (successful), it extracts the transcribed text from the response payload and assigns it to `msg.payload`. If there's an error, it constructs an error message and assigns it to `msg.payload`.
- HTTP Response Node: This node sends the final response back to the client, containing either the transcribed text or an error message.
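The branching performed by the `check type` and `response` nodes can be sketched in plain JavaScript, the language of Node-RED function nodes. This is an illustrative sketch, not the flow's actual source: the function names `checkType` and `handleResponse`, the body fields `url` and `file`, and the `route` labels are assumptions. The `text` and `error.message` fields follow the JSON shape the OpenAI API returns for transcriptions and errors.

```javascript
// Hypothetical stand-in for the `check type` function node.
// Assumes the request body carries either `url` (YouTube link) or `file` (uploaded audio).
function checkType(msg) {
  const body = msg.payload || {};
  if (typeof body.url === "string" && body.url.trim() !== "") {
    // URL branch: pass the link on to the YouTube download step.
    return { route: "youtube", msg: { ...msg, url: body.url.trim() } };
  }
  if (body.file) {
    // File branch: the audio goes straight to the transcription request.
    return { route: "openai", msg: { ...msg, audio: body.file } };
  }
  // Neither input was supplied: answer with a client error.
  return {
    route: "error",
    msg: { ...msg, statusCode: 400, payload: "Provide an audio file or a YouTube URL." },
  };
}

// Hypothetical stand-in for the `response` function node.
// Assumes the API response body has already been parsed into an object.
function handleResponse(msg) {
  if (msg.statusCode === 200) {
    // Success: keep only the transcribed text.
    return { ...msg, payload: msg.payload.text };
  }
  // Failure: build a readable error message, including the API's detail if present.
  const detail = msg.payload && msg.payload.error && msg.payload.error.message;
  const summary = `Transcription failed (status ${msg.statusCode})`;
  return { ...msg, payload: detail ? `${summary}: ${detail}` : summary };
}
```

In a real function node the same logic would run against the incoming `msg` and return it (or an array of messages, one per output) rather than a `route` label.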
The template's interface lets users upload audio files in common formats such as WAV, MP3, and FLAC, or provide a YouTube video URL.
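Outside the interface, the same flow can be reached by posting to the `/convertSpeech` endpoint directly. The sketch below prepares such a request for the YouTube URL case. The template does not document the request format, so the JSON field names `url` and `apiKey`, the helper name `buildConvertSpeechRequest`, and the base URL (Node-RED's default port, 1880) are assumptions to adapt to your deployment.

```javascript
// Builds the arguments for fetch() against the flow's endpoint.
// Assumed, not confirmed by the template: a JSON body with `url` and `apiKey` fields,
// and a Node-RED instance listening on its default port 1880.
function buildConvertSpeechRequest(videoUrl, apiKey, baseUrl = "http://localhost:1880") {
  if (!videoUrl || !apiKey) {
    throw new Error("Both a video URL and an OpenAI API key are required.");
  }
  return {
    endpoint: new URL("/convertSpeech", baseUrl).toString(),
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: videoUrl, apiKey }),
    },
  };
}
```

The returned pair can be passed to `fetch(endpoint, options)`. Uploading a file instead of a URL would typically use a `multipart/form-data` body rather than JSON.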
Transcription is handled by OpenAI's Whisper model, which is designed specifically for speech recognition and transcription. Whisper uses machine learning to convert spoken audio into text.
The template supports multiple languages and dialects. The available language options are updated regularly to broaden coverage and improve accuracy.
To use the Whisper model, users need an OpenAI API key. The template includes instructions for obtaining the key and using it securely with the transcription service.
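The OpenAI API expects the key as a bearer token in the `Authorization` header. The sketch below shows one way to build that header and to mask the key before it appears in any log. Both helper names, `buildAuthHeaders` and `maskKey`, are hypothetical and not part of the template.

```javascript
// Builds the authentication header for a call to the OpenAI API.
// `buildAuthHeaders` is a hypothetical helper, not part of the template.
function buildAuthHeaders(apiKey) {
  if (typeof apiKey !== "string" || apiKey.trim() === "") {
    throw new Error("An OpenAI API key is required.");
  }
  // The key is sent as a bearer token.
  return { Authorization: `Bearer ${apiKey.trim()}` };
}

// Masks all but the last four characters so the key is safe to print.
function maskKey(apiKey) {
  if (apiKey.length <= 4) {
    return "****";
  }
  return "*".repeat(apiKey.length - 4) + apiKey.slice(-4);
}
```

Keeping the key out of flow exports and logs matters because anyone holding it can make billable requests on the account.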
Because Whisper processes audio efficiently, transcriptions of audio files and videos are typically returned quickly, which keeps wait times short and the interface responsive.
Whisper is trained on large datasets, which helps it transcribe speech accurately across domains, accents, and noise conditions, so transcriptions closely follow the spoken content.
The template offers customization options that let users adjust the output to their requirements, including language settings, formatting options, and specific areas of focus.
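On the API side, options like these correspond to optional form fields of the OpenAI transcription endpoint: `language` (an ISO-639-1 code), `response_format`, and `prompt`. The sketch below assembles those fields. Which of them the template actually exposes is not stated here, so treat the mapping and the helper name `buildTranscriptionFields` as assumptions.

```javascript
// Output formats accepted by the transcription endpoint for the `whisper-1` model.
const RESPONSE_FORMATS = ["json", "text", "srt", "verbose_json", "vtt"];

// Hypothetical helper: collects the form fields for a transcription request.
// Optional settings are only included when the caller supplies them.
function buildTranscriptionFields({ language, responseFormat, prompt } = {}) {
  const fields = { model: "whisper-1" };
  if (language) {
    fields.language = language; // ISO-639-1 code, for example "en" or "de"
  }
  if (responseFormat) {
    if (!RESPONSE_FORMATS.includes(responseFormat)) {
      throw new Error(`Unsupported response format: ${responseFormat}`);
    }
    fields.response_format = responseFormat;
  }
  if (prompt) {
    fields.prompt = prompt; // free text that guides vocabulary and style
  }
  return fields;
}
```

For example, `buildTranscriptionFields({ language: "de", responseFormat: "srt" })` yields the fields for a German transcription returned as subtitles.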
The Speech-to-Text template gives users an efficient way to extract text from audio-visual content, whether for accessibility, content analysis, data mining, or simply capturing spoken words in written form. Built on the Whisper model, it turns audio and video into text that can be searched, analyzed, and reused.