Helper functions/classes

class LibCategory(value)

An enumeration of the voice categories available in the Voice Library.

class LibGender(value)

An enumeration of the voice genders available in the Voice Library.

class LibAge(value)

An enumeration of the voice age groups available in the Voice Library.

class LibAccent(value)

An enumeration of the voice accents available in the Voice Library.

class LibVoiceInfo(category: LibCategory | None = None, gender: LibGender | None = None, age: LibAge | None = None, accent: LibAccent | None = None, language: str | None = None)

Contains the information for a voice in the Voice Library.

to_query_params()

Converts filter attributes to a dictionary of query parameters, omitting None values.
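
A minimal sketch of building a filter and reading back its query parameters. The enum member names used here (PROFESSIONAL, FEMALE) are assumptions, since the members are not listed in this section:

    from elevenlabslib.helpers import LibCategory, LibGender, LibVoiceInfo

    # NOTE: the member names below are assumed for illustration.
    filters = LibVoiceInfo(
        category=LibCategory.PROFESSIONAL,
        gender=LibGender.FEMALE,
        language="en",
    )

    # age and accent are None, so they are omitted from the query parameters.
    print(filters.to_query_params())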

class LibSort(value)

An enumeration of the sort orders available for Voice Library queries.

class PlaybackOptions(runInBackground: bool = False, portaudioDeviceID: int | None = None, onPlaybackStart: ~typing.Callable[[], ~typing.Any] = <function PlaybackOptions.<lambda>>, onPlaybackEnd: ~typing.Callable[[], ~typing.Any] = <function PlaybackOptions.<lambda>>, audioPostProcessor: ~typing.Callable[[~numpy.ndarray, int], ~numpy.ndarray] = <function PlaybackOptions.<lambda>>)

This class holds the options for playback.

Parameters:
  • runInBackground (bool, optional) – Whether to play/stream audio in the background or wait for it to finish playing. Defaults to False.

  • portaudioDeviceID (int, optional) – The ID of the audio device to use for playback. Defaults to the default output device.

  • onPlaybackStart (Callable, optional) – Function to call once the playback begins.

  • onPlaybackEnd (Callable, optional) – Function to call once the playback ends.

  • audioPostProcessor (Callable, optional) – Function to apply post-processing to the audio. Must take a float32 ndarray (of arbitrary length) and an int (the sample rate) as input and return another float32 ndarray.
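
For example, a post-processor can apply simple gain control. This is a minimal sketch; the callbacks just log, and the device ID is left at the system default:

    import numpy
    from elevenlabslib.helpers import PlaybackOptions

    def halve_volume(audio: numpy.ndarray, sample_rate: int) -> numpy.ndarray:
        # Takes a float32 ndarray and the sample rate, returns a float32 ndarray.
        return (audio * 0.5).astype(numpy.float32)

    options = PlaybackOptions(
        runInBackground=True,
        onPlaybackStart=lambda: print("playback started"),
        onPlaybackEnd=lambda: print("playback ended"),
        audioPostProcessor=halve_volume,
    )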

class GenerationOptions(model_id: str | None = None, latencyOptimizationLevel: int = 0, stability: float | None = None, similarity_boost: float | None = None, style: float | None = None, use_speaker_boost: bool | None = None, model: Model | str | None = 'eleven_monolingual_v1', output_format: str = 'mp3_highest', forced_pronunciations: dict | None = None)

This class holds the options for TTS generation. If any option besides model_id and latencyOptimizationLevel is omitted, the stored value associated with the voice is used.

Parameters:
  • model (Model|str, optional) – The TTS model (or its ID) to use for the generation. Defaults to eleven_monolingual_v1 (monolingual English v1).

  • latencyOptimizationLevel (int, optional) – The level of latency optimization (0-4) to apply. Defaults to 0.

  • stability (float, optional) – A float between 0 and 1 representing the stability of the generated audio. If omitted, the current stability setting is used.

  • similarity_boost (float, optional) – A float between 0 and 1 representing the similarity boost of the generated audio. If omitted, the current similarity boost setting is used.

  • style (float, optional) – A float between 0 and 1 representing how much focus should be placed on the text vs the associated audio data for the voice’s style, with 0 being all text and 1 being all audio.

  • use_speaker_boost (bool, optional) – Boost the similarity of the synthesized speech and the voice at the cost of some generation speed.

  • output_format (str, optional) – Output format for the audio. mp3_highest and pcm_highest will automatically use the highest quality of that format you have available.

  • forced_pronunciations (dict, optional) – A dict specifying custom pronunciations for words. Each key is a word; its value must be a dict containing ‘alphabet’ and ‘pronunciation’ entries. The word is replaced directly in the prompt.

Note

The latencyOptimizationLevel ranges from 0 to 4; each level trades away a little more quality for speed.

Level 4 may also mispronounce numbers and dates.

Warning

The style and use_speaker_boost parameters are only available on v2 models, and will be ignored for v1 models.

Setting style to higher than 0 and enabling use_speaker_boost will both increase latency.

output_format is currently ignored when using speech to speech.

Warning

Using pcm_highest and mp3_highest will cache the resulting quality for the user object. You can use user.update_audio_quality() to force an update.
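
A sketch of a typical configuration. The model name is illustrative, and "ipa" is an assumed value for ‘alphabet’ (the accepted alphabets are not listed in this section):

    from elevenlabslib.helpers import GenerationOptions

    options = GenerationOptions(
        model="eleven_multilingual_v2",  # a v2 model, so style/use_speaker_boost apply
        latencyOptimizationLevel=0,
        stability=0.5,
        similarity_boost=0.75,
        style=0.2,
        use_speaker_boost=True,
        output_format="mp3_highest",
        forced_pronunciations={
            # Key is the word; 'alphabet' and 'pronunciation' are required.
            # "ipa" is an assumption - check the API docs for accepted alphabets.
            "tomato": {"alphabet": "ipa", "pronunciation": "təˈmeɪtoʊ"},
        },
    )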

class WebsocketOptions(try_trigger_generation: bool = False, chunk_length_schedule: ~typing.List[int] = <factory>, enable_ssml_parsing: bool = False, buffer_char_length: int = -1)

This class holds the options for the websocket endpoint.

Parameters:
  • chunk_length_schedule (list[int], optional) – Chunking schedule for generation. If you pass [50, 120, 500], the first audio chunk will be generated after receiving 50 characters, the second after 120 more (170 total), and the third onwards after every further 500. Defaults to [50], i.e. generation starts as soon as possible.

  • try_trigger_generation (bool, optional) – Whether to try to generate a chunk of audio once more than 50 characters have been received, regardless of the chunk_length_schedule. Defaults to False. This value is sent with every message, but can be overridden per message.

  • enable_ssml_parsing (bool, optional) – Whether to enable parsing of SSML tags, such as breaks or pronunciations. Increases latency. Defaults to False.

  • buffer_char_length (int, optional) – If the generation will be slower than realtime (using multilingual v2, for example) the library will buffer and wait to begin playback to ensure that there is no stuttering. Use this to override the amount of buffering. -1 means it will use the default (which is a safe, high value).
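
For example, to follow the schedule described above while enabling SSML parsing (values are illustrative):

    from elevenlabslib.helpers import WebsocketOptions

    ws_options = WebsocketOptions(
        chunk_length_schedule=[50, 120, 500],  # first chunk at 50 chars, next at 170, then every 500
        enable_ssml_parsing=True,  # parse SSML tags such as breaks; increases latency
        buffer_char_length=-1,  # keep the library's default buffering
    )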

class PromptingOptions(pre_prompt: str = '', post_prompt: str = '', open_quote_duration_multiplier: float | None = None, close_quote_duration_multiplier: float | None = None)

This class holds the options for pre/post-prompting the audio, to add emotion.

Parameters:
  • pre_prompt (str, optional) – Prompt which will be placed before the quoted text.

  • post_prompt (str, optional) – Prompt which will be placed after the quoted text.

  • open_quote_duration_multiplier (float, optional) – Multiplier indicating how much of the opening quote will be spoken (between 0 and 1). Defaults to 0.70 if a pre-prompt is present, to avoid bleedover.

  • close_quote_duration_multiplier (float, optional) – Multiplier for the duration of the closing quote (between 0 and 1). Defaults to 0.70 if a post-prompt is present, to avoid bleedover.
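
A sketch of steering delivery with surrounding prompts; the wording is purely illustrative:

    from elevenlabslib.helpers import PromptingOptions

    prompting = PromptingOptions(
        pre_prompt="He shouted angrily:",
        post_prompt="he said, furious.",
        # The quote duration multipliers default to 0.70 when the matching
        # prompt is present, trimming the spoken quotes to avoid bleedover.
    )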

class GenerationInfo(history_item_id: str | None = None, request_id: str | None = None, tts_latency_ms: str | None = None)

This contains the information returned regarding a (non-websocket) generation.

class Synthesizer(defaultPlaybackOptions: ~elevenlabslib.helpers.PlaybackOptions = PlaybackOptions(runInBackground=True, portaudioDeviceID=None, onPlaybackStart=<function PlaybackOptions.<lambda>>, onPlaybackEnd=<function PlaybackOptions.<lambda>>, audioPostProcessor=<function PlaybackOptions.<lambda>>), defaultGenerationOptions: ~elevenlabslib.helpers.GenerationOptions = GenerationOptions(latencyOptimizationLevel=3, stability=None, similarity_boost=None, style=None, use_speaker_boost=None, model='eleven_monolingual_v1', output_format='mp3_highest', forced_pronunciations=None), defaultPromptingOptions: ~elevenlabslib.helpers.PromptingOptions | None = None)

This is a helper class, which allows you to queue up multiple audio generations.

They will all be downloaded together, and will play back in the same order you put them in. I’ve found this gives the lowest possible latency.

start()

Begins processing the queued audio.

stop()

Stops playing back audio once the current one is finished.

abort()

Stops playing back audio immediately.

change_output_device(portAudioDeviceID: int)

Allows you to change the current output device.

change_default_settings(defaultGenerationOptions: GenerationOptions | None = None, defaultPlaybackOptions: PlaybackOptions | None = None, defaultPromptingOptions: PromptingOptions | None = None)

Allows you to change the default settings.

add_to_queue(voice: Voice, prompt: str, generationOptions: GenerationOptions = None, playbackOptions: PlaybackOptions = None, promptingOptions: PromptingOptions = None) None

Adds an item to the synthesizer queue.

Parameters:
  • voice (Voice) – The voice that will speak the prompt.

  • prompt (str) – The prompt to be spoken.

  • generationOptions (GenerationOptions, optional) – Overrides the generation options for this generation.

  • playbackOptions (PlaybackOptions, optional) – Overrides the playback options for this generation.

  • promptingOptions (PromptingOptions, optional) – Overrides the prompting options for this generation.
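
A usage sketch, assuming a Voice object has already been obtained from your account (the voice lookup API is outside this section):

    from elevenlabslib.helpers import Synthesizer

    voice = ...  # a Voice object from your account (lookup not shown here)

    synth = Synthesizer()
    synth.start()  # begin processing the queue
    synth.add_to_queue(voice, "First sentence to speak.")
    synth.add_to_queue(voice, "Second sentence, played right after.")
    # ...once everything queued has finished playing:
    synth.stop()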

class ReusableInputStreamer(voice: Voice, defaultPlaybackOptions: PlaybackOptions = PlaybackOptions(runInBackground=True, portaudioDeviceID=None, onPlaybackStart=<function PlaybackOptions.<lambda>>, onPlaybackEnd=<function PlaybackOptions.<lambda>>, audioPostProcessor=<function PlaybackOptions.<lambda>>), generationOptions: GenerationOptions = GenerationOptions(latencyOptimizationLevel=3, stability=None, similarity_boost=None, style=None, use_speaker_boost=None, model='eleven_monolingual_v1', output_format='mp3_highest', forced_pronunciations=None), websocketOptions: WebsocketOptions = WebsocketOptions(try_trigger_generation=False, chunk_length_schedule=[125], enable_ssml_parsing=False, buffer_char_length=-1))

This is basically a reusable wrapper around a websocket connection.

stop()

Stops playing back audio once the current one is finished.

abort()

Stops playing back audio immediately.

change_settings(generationOptions: GenerationOptions | None = None, defaultPlaybackOptions: PlaybackOptions | None = None, websocketOptions: WebsocketOptions | None = None)

Allows you to change the settings and then re-establishes the socket.

queue_audio(prompt: Iterator[str] | AsyncIterator, playbackOptions: PlaybackOptions | None = None) tuple[Future[OutputStream], Future[Queue]]

Queues up an audio to be generated and played back.

Parameters:
  • prompt – The iterator to use for the generation.

  • playbackOptions – Overrides the playbackOptions for this generation.

Returns:

A tuple consisting of two futures, the one for the playback stream and the one for the transcript queue.

Return type:

tuple
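
A sketch of streaming text from an iterator (for example, tokens from an LLM), again assuming a Voice object obtained elsewhere:

    from elevenlabslib.helpers import ReusableInputStreamer

    voice = ...  # a Voice object from your account
    streamer = ReusableInputStreamer(voice)

    def text_chunks():
        yield "Hello "
        yield "from a "
        yield "streamed generation."

    stream_future, transcript_future = streamer.queue_audio(text_chunks())
    stream = stream_future.result()  # the playback OutputStream, once available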

class ReusableInputStreamerNoPlayback(voice: Voice, generationOptions: GenerationOptions = GenerationOptions(latencyOptimizationLevel=3, stability=None, similarity_boost=None, style=None, use_speaker_boost=None, model='eleven_monolingual_v1', output_format='mp3_highest', forced_pronunciations=None), websocketOptions: WebsocketOptions = WebsocketOptions(try_trigger_generation=False, chunk_length_schedule=[125], enable_ssml_parsing=False, buffer_char_length=-1))

This is basically a reusable wrapper around a websocket connection.

stop()

Stops the websocket.

abort()

Stops the websocket.

change_settings(generationOptions: GenerationOptions | None = None, defaultPlaybackOptions: PlaybackOptions | None = None, websocketOptions: WebsocketOptions | None = None)

Allows you to change the settings and then re-establishes the socket.

queue_audio(prompt: Iterator[str] | AsyncIterator) tuple[Future[Queue], Future[Queue]]

Queues up an audio to be generated. Instead of being played back, the raw audio is delivered through the returned queue.

Parameters:

prompt – The iterator to use for the generation.

Returns:

A tuple consisting of two futures, one for the numpy audio queue and one for the transcript queue.

Return type:

tuple
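
A sketch of collecting the raw audio instead of playing it back. The assumption that the audio queue is terminated by a None sentinel is mine, not from this section:

    from elevenlabslib.helpers import ReusableInputStreamerNoPlayback

    voice = ...  # a Voice object from your account
    streamer = ReusableInputStreamerNoPlayback(voice)

    audio_future, transcript_future = streamer.queue_audio(iter(["Some text to synthesize."]))
    audio_queue = audio_future.result()

    chunks = []
    while True:
        chunk = audio_queue.get()
        if chunk is None:  # assumed end-of-stream sentinel
            break
        chunks.append(chunk)  # numpy audio data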

run_ai_speech_classifier(audioBytes: bytes)

Runs ElevenLabs’ AI speech classifier on the provided audio data.

Parameters:
  • audioBytes (bytes) – The bytes of the audio file (mp3, wav; most formats should work) you want to analyze.

Returns:

Dict containing all the information returned by the tool (usually just the probability of it being AI generated)
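
For example (the exact keys of the returned dict are not specified here):

    from elevenlabslib.helpers import run_ai_speech_classifier

    with open("sample.mp3", "rb") as f:
        result = run_ai_speech_classifier(f.read())
    print(result)  # e.g. the probability that the audio is AI-generated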

play_audio_v2(audioData: bytes | ~numpy.ndarray, playbackOptions: ~elevenlabslib.helpers.PlaybackOptions = PlaybackOptions(runInBackground=False, portaudioDeviceID=None, onPlaybackStart=<function PlaybackOptions.<lambda>>, onPlaybackEnd=<function PlaybackOptions.<lambda>>, audioPostProcessor=<function PlaybackOptions.<lambda>>), audioFormat: str | ~elevenlabslib.helpers.GenerationOptions = 'mp3_44100_128') OutputStream

Plays the given audio and calls the given functions.

Parameters:
  • audioData (bytes|numpy.ndarray) – The audio data to play, either in bytes or as a numpy array (float32!)

  • playbackOptions (PlaybackOptions, optional) – The playback options.

  • audioFormat (str|GenerationOptions, optional) – The format of audioData, using the same format strings as GenerationOptions. If the data is not mp3 (or a numpy array), the format string must include the sample rate (like pcm_44100). Defaults to mp3_44100_128.

Returns:

The OutputStream the audio is being played on.
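
A sketch of playing raw PCM bytes; note that for non-mp3 data the format string must carry the sample rate:

    from elevenlabslib.helpers import PlaybackOptions, play_audio_v2

    pcm_bytes = ...  # e.g. bytes from a generation with a pcm_44100 output_format
    stream = play_audio_v2(
        pcm_bytes,
        PlaybackOptions(runInBackground=False),  # block until playback finishes
        audioFormat="pcm_44100",  # sample rate is required for PCM input
    )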

ulaw_to_wav(ulawData: bytes, samplerate: int) bytes

This function converts ULAW audio to a WAV.

Parameters:
  • ulawData (bytes) – The ULAW audio data.

  • samplerate (int) – The sample rate of the audio.

Returns:

The bytes of the wav file.

pcm_to_wav(pcmData: bytes, samplerate: int) bytes

This function converts PCM audio to a WAV.

Parameters:
  • pcmData (bytes) – The PCM audio data.

  • samplerate (int) – The sample rate of the audio.

Returns:

The bytes of the wav file.
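
For example, wrapping raw PCM bytes in a WAV container before writing them to disk (ulaw_to_wav works the same way for ULAW data):

    from elevenlabslib.helpers import pcm_to_wav

    pcm_bytes = ...  # raw PCM audio, e.g. from a pcm_44100 generation
    wav_bytes = pcm_to_wav(pcm_bytes, 44100)

    with open("output.wav", "wb") as f:
        f.write(wav_bytes)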

save_audio_v2(audioData: bytes | ndarray, saveLocation: BinaryIO | str, outputFormat: str, inputFormat: str | GenerationOptions = 'mp3_44100_128') None

This function saves the audio data to the specified location OR file-like object. soundfile is used for the conversion, so any format soundfile supports can be used.

Parameters:
  • audioData (bytes|numpy.ndarray) – The audio data.

  • saveLocation (str|BinaryIO) – The path (or file-like object) where the data will be saved.

  • outputFormat (str) – The format in which the audio will be saved (mp3/wav/ogg/etc).

  • inputFormat (str|GenerationOptions, optional) – The format of audioData, using the same format strings as GenerationOptions. If the data is not mp3, the format string must include the sample rate (like pcm_44100). Defaults to mp3_44100_128.
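
A sketch converting mp3 generation output to ogg; any output format soundfile supports should work:

    from elevenlabslib.helpers import save_audio_v2

    mp3_bytes = ...  # bytes from a generation using an mp3 output_format
    save_audio_v2(mp3_bytes, "output.ogg", outputFormat="ogg", inputFormat="mp3_44100_128")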

sts_long_audio(source_audio: bytes | BinaryIO, voice: Voice, generation_options: GenerationOptions = GenerationOptions(latencyOptimizationLevel=0, stability=None, similarity_boost=None, style=None, use_speaker_boost=None, model='eleven_multilingual_sts_v2', output_format='mp3_highest', forced_pronunciations=None), speech_threshold: float = 0.5) bytes

Allows you to process a long audio file with speech to speech automatically, using Silero-VAD to split it up naturally.

Parameters:
  • source_audio (bytes|BinaryIO) – The source audio.

  • voice (Voice) – The voice to use for STS.

  • generation_options (GenerationOptions) – The generation options to use. The model specified must support STS.

  • speech_threshold (float) – The likelihood threshold above which a segment is considered speech (0.5/50% works for most audio files).

Returns:

The bytes of the final audio, all concatenated, in mp3 format.

Return type:

bytes
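
A usage sketch, assuming a Voice object and the default STS-capable model from the signature:

    from elevenlabslib.helpers import sts_long_audio

    voice = ...  # the target Voice for speech to speech

    with open("long_recording.mp3", "rb") as f:
        converted = sts_long_audio(f.read(), voice)

    with open("converted.mp3", "wb") as f:
        f.write(converted)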