Better support markdown for speech output #56

Open

chaoqunxie opened this issue Aug 31, 2024 · 6 comments

Labels: bug (Something isn't working), enhancement (New feature or request)

chaoqunxie commented Aug 31, 2024

Suggestion:

  1. For a URL: speak some descriptive content instead of the raw URL.

  2. Using a plugin link as an example: "can use Ingest Attachment Plugin".

For example:
should speak: "can use Ingest Attachment Plugin"
but should not speak: "can use Ingest Attachment Plugin https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html"
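In other words, the request is that a markdown link's title is spoken but its URL is not. A minimal sketch of that behavior (the `strip_markdown_links` helper name is illustrative, not part of the project):

```python
import re

def strip_markdown_links(text: str) -> str:
    """Replace [title](url) markdown links with just the title."""
    # [title](url) -> title
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)
    # Drop any remaining bare URLs entirely.
    text = re.sub(r'https?://\S+', '', text)
    return text.strip()

print(strip_markdown_links(
    "can use [Ingest Attachment Plugin]"
    "(https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html)"
))  # -> "can use Ingest Attachment Plugin"
```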

matatonic (Owner) commented

This wasn't very clear to me. Could you try writing a more detailed question in your own language? I can try to translate it on my side, or ask someone else for help translating.

chaoqunxie (Author) commented

[image attachment]

matatonic (Owner) commented

Ok, I think I understand. You would like markdown text to be filtered and pre-processed before speech, rather than just trying to say everything. That way URLs are not spoken, but the title of the link is. This is probably a good feature and I think it can be done without much trouble.

I am curious how the official OpenAI API speech model handles markdown text; if anyone has any details, I would like to know.

Thanks!

@matatonic matatonic changed the title will support markdown format? Better support markdown for speech output Sep 1, 2024
@matatonic matatonic added enhancement New feature or request help wanted Extra attention is needed labels Sep 1, 2024
@matatonic matatonic self-assigned this Sep 1, 2024
chaoqunxie (Author) commented


Response from GPT-4o:

OpenAI's speech functionality (like the text-to-speech feature in ChatGPT) is implemented using text parsing and natural language processing (NLP) technologies to enable more intelligent text reading. Here's an overview of the process and related technologies:

  1. Text Parsing:
    The system first uses a text parser to analyze the user's input. This parser can recognize and categorize different text formats, such as Markdown, HTML tags, URLs, headings, lists, code blocks, etc. The purpose of parsing is to differentiate between parts of the text that should be read aloud and parts that should be filtered out or converted into a more natural form.

  2. Natural Language Processing (NLP):
    NLP technology is used to understand the context and content of the text. For example, when encountering a hyperlink, the NLP model will try to extract the title or descriptive text of the link rather than reading out the URL string directly. This involves extracting relevant information from links or other formatted text and converting it into a form suitable for speech output.

  3. Preprocessing and Text Conversion:
    Before speech synthesis, the text undergoes preprocessing, which includes:

    • Removing content that is not suitable for speech (such as long URLs, code snippets, HTML tags, etc.).
    • Converting special characters, punctuation marks, and other elements into forms appropriate for speech.
    • Reorganizing the text content to make it more suitable for natural reading, which may involve segmenting, simplifying, or rewriting parts of the text.
  4. Speech Synthesis Engine:
    The preprocessed text is then passed to a speech synthesis engine, typically a deep learning-based model such as Tacotron or WaveNet (or their improved successors). These models are trained on large amounts of speech data and can generate high-quality, human-like speech output.

  5. Context Understanding and Speech Adjustment:
    By combining context understanding and speech feature adjustment, the system can not only produce accurate text-to-speech but also use appropriate intonation, pauses, and emphasis based on the context, making the generated speech more natural and expressive.

  6. User Customization:
    To enhance user experience, the system may also support customization options, such as allowing users to choose different voice styles, speech speeds, etc.

These steps work together to enable OpenAI's speech functionality to effectively handle complex text input and generate natural, understandable speech output. This relies on the combined efforts of deep learning, NLP, and speech synthesis technologies.
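Taken together, the parsing and preprocessing steps above (1–3) amount to a filtering pass before synthesis. A rough, illustrative sketch of such a pass — not any actual OpenAI implementation:

```python
import re

def preprocess_for_speech(markdown: str) -> str:
    """Filter markdown down to text worth reading aloud."""
    text = markdown
    # Drop fenced code blocks entirely -- code is rarely worth speaking.
    text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    # Keep link titles, discard the URLs.
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)
    text = re.sub(r'https?://\S+', '', text)
    # Strip emphasis markers but keep the emphasized words.
    text = re.sub(r'(\*\*|\*|__|_)(.+?)\1', r'\2', text)
    # Remove heading markers and HTML tags.
    text = re.sub(r'^#{1,6}\s*', '', text, flags=re.MULTILINE)
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse leftover whitespace for smoother phrasing.
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess_for_speech("# Title\nSee [Docs](https://x.com) for *info*."))
# -> "Title See Docs for info."
```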

thiswillbeyourgithub commented

I noticed that piper seems unable to read *italic*, but can read **bold** text fine. I have not yet found a way to fix that using preprocessing.

@matatonic matatonic added bug Something isn't working and removed help wanted Extra attention is needed labels Sep 14, 2024
jmorto11 commented

I'm using this with open-webui and loving it. I'm trying to get the audio, when I click Read Aloud, to not read the italics markdown. I thought adding this to pre_process_map.yaml in the config file would help, but it doesn't seem to change anything. For example, I added this to the yaml file:

- - \*(.*?)\*
  - ''
- - _(.*?)_
  - ''

Any thoughts? Either way, love the project!
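If the entries in pre_process_map.yaml are applied as regex → replacement pairs (an assumption about how openedai-speech consumes them; YAML quoting could also be a factor), note that an empty replacement deletes the italicized words along with the asterisks, whereas a capture-group backreference keeps the words. A quick Python demonstration of the regex behavior:

```python
import re

text = "piper cannot read *italic* text"

# An empty replacement deletes the emphasized words along with the markers:
print(re.sub(r'\*(.*?)\*', '', text))     # -> "piper cannot read  text"
# A backreference keeps the words and drops only the asterisks:
print(re.sub(r'\*(.*?)\*', r'\1', text))  # -> "piper cannot read italic text"
```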
