Better support markdown for speech output #56

Open

chaoqunxie opened this issue Aug 31, 2024 · 6 comments

Labels: bug (Something isn't working), enhancement (New feature or request)

chaoqunxie commented Aug 31, 2024

Suggestion:

  1. For a URL: speak some descriptive content instead of the raw URL.

  2. Using a plugin link as an example: "can use Ingest Attachment Plugin".

For example:
should speak: "can use Ingest Attachment Plugin"
but should not speak: "can use Ingest Attachment Plugin https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html"
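In other words, the request is that a markdown link's title is spoken but its URL is not. A minimal sketch of that behavior (the `strip_markdown_links` helper name is illustrative, not part of the project):

```python
import re

def strip_markdown_links(text: str) -> str:
    """Replace [title](url) markdown links with just the title."""
    # [title](url) -> title
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)
    # Drop any remaining bare URLs entirely.
    text = re.sub(r'https?://\S+', '', text)
    return text.strip()

print(strip_markdown_links(
    "can use [Ingest Attachment Plugin]"
    "(https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html)"
))  # -> "can use Ingest Attachment Plugin"
```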

matatonic (Owner) commented

This wasn't very clear to me. Could you try writing a more detailed question in your own language? I can try to translate it on my side, or ask someone else for help translating.

chaoqunxie (Author) commented

[image attachment]

matatonic (Owner) commented

Ok, I think I understand. You would like markdown text to be filtered and pre-processed before speech, rather than just trying to say everything. That way URLs are not spoken, but the title of the link is. This is probably a good feature and I think it can be done without much trouble.

I am curious how the official OpenAI API speech model handles markdown text; if anyone has any details, I would like to know.

Thanks!

@matatonic matatonic changed the title will support markdown format? Better support markdown for speech output Sep 1, 2024
@matatonic matatonic added enhancement New feature or request help wanted Extra attention is needed labels Sep 1, 2024
@matatonic matatonic self-assigned this Sep 1, 2024
chaoqunxie (Author) commented


Response from GPT-4o:

OpenAI's speech functionality (like the text-to-speech feature in ChatGPT) is implemented using text parsing and natural language processing (NLP) technologies to enable more intelligent text reading. Here's an overview of the process and related technologies:

  1. Text Parsing:
    The system first uses a text parser to analyze the user's input. This parser can recognize and categorize different text formats, such as Markdown, HTML tags, URLs, headings, lists, code blocks, etc. The purpose of parsing is to differentiate between parts of the text that should be read aloud and parts that should be filtered out or converted into a more natural form.

  2. Natural Language Processing (NLP):
    NLP technology is used to understand the context and content of the text. For example, when encountering a hyperlink, the NLP model will try to extract the title or descriptive text of the link rather than reading out the URL string directly. This involves extracting relevant information from links or other formatted text and converting it into a form suitable for speech output.

  3. Preprocessing and Text Conversion:
    Before speech synthesis, the text undergoes preprocessing, which includes:

    • Removing content that is not suitable for speech (such as long URLs, code snippets, HTML tags, etc.).
    • Converting special characters, punctuation marks, and other elements into forms appropriate for speech.
    • Reorganizing the text content to make it more suitable for natural reading, which may involve segmenting, simplifying, or rewriting parts of the text.
  4. Speech Synthesis Engine:
    The preprocessed text is then passed to a speech synthesis engine, typically a deep learning-based model such as Tacotron or WaveNet (or their improved successors). These models are trained on large amounts of speech data and can generate high-quality, human-like speech output.

  5. Context Understanding and Speech Adjustment:
    By combining context understanding and speech feature adjustment, the system can not only produce accurate text-to-speech but also use appropriate intonation, pauses, and emphasis based on the context, making the generated speech more natural and expressive.

  6. User Customization:
    To enhance user experience, the system may also support customization options, such as allowing users to choose different voice styles, speech speeds, etc.

These steps work together to enable OpenAI's speech functionality to effectively handle complex text input and generate natural, understandable speech output. This relies on the combined efforts of deep learning, NLP, and speech synthesis technologies.
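Taken together, the parsing and preprocessing steps above (1–3) amount to a filtering pass before synthesis. A rough, illustrative sketch of such a pass — not any actual OpenAI implementation:

```python
import re

def preprocess_for_speech(markdown: str) -> str:
    """Filter markdown down to text worth reading aloud."""
    text = markdown
    # Drop fenced code blocks entirely -- code is rarely worth speaking.
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    # Keep link titles, discard the URLs.
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)
    text = re.sub(r'https?://\S+', '', text)
    # Strip emphasis markers but keep the emphasized words.
    text = re.sub(r'(\*\*|\*|__|_)(.+?)\1', r'\2', text)
    # Remove heading markers and HTML tags.
    text = re.sub(r'^#{1,6}\s*', '', text, flags=re.MULTILINE)
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse leftover whitespace for smoother phrasing.
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess_for_speech("# Title\nSee [Docs](https://x.com) for *info*."))
# -> "Title See Docs for info."
```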

thiswillbeyourgithub commented

I noticed that piper seems unable to read *italic*, but can read **bold** text fine. I have not yet found a way to fix that using preprocessing.

@matatonic matatonic added bug Something isn't working and removed help wanted Extra attention is needed labels Sep 14, 2024
jmorto11 commented

I'm using this with open-webui and loving it. I'm trying to get the audio, when I click Read Aloud, to not read the italics markdown. I thought adding this to pre_process_map.yaml in the config file would help, but it doesn't seem to change anything. For example, I added this to the yaml file:

- - \*(.*?)\*
  - ''
- - _(.*?)_
  - ''

Any thoughts? Either way, love the project!
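If the entries in pre_process_map.yaml are applied as regex → replacement pairs (an assumption about how openedai-speech consumes them; YAML quoting could also be a factor), note that an empty replacement deletes the italicized words along with the asterisks, whereas a capture-group backreference keeps the words. A quick Python demonstration of the regex behavior:

```python
import re

text = "piper cannot read *italic* text"

# An empty replacement deletes the emphasized words along with the markers:
print(re.sub(r'\*(.*?)\*', '', text))     # -> "piper cannot read  text"
# A backreference keeps the words and drops only the asterisks:
print(re.sub(r'\*(.*?)\*', r'\1', text))  # -> "piper cannot read italic text"
```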
