Task Summary: all CMP tasks
MBrola is a freely available voice synthesiser. A version of the MBrola engine has been ported for use in Max/MSP.
To cover:
This also has ramifications concerning:
Make sure you have a functional understanding of the MBrola software. If necessary, load a number of phoneme databases (MBrolix/Preferences/Add Voices) and test each with a phoneme file appropriate for the voice. Feel free to experiment.
; Mbrola demo file for en1.
; Each line gives a phoneme, its duration in ms, and optional pitch points
; (position as a % of the duration, pitch in Hz); "#" flushes the synthesis buffer.
_ 50
e 40 0 102
m 50
b 50
r 30
@U 80 5 119 35 126 70 140
l 50
@ 50 50 173
w 100 75 133
#
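If you want to manipulate files like this programmatically (for instance, to generate phoneme files for MBrola from a script), a minimal Python sketch of a .pho parser might look like the following. The format (phoneme, duration in ms, optional position/pitch pairs) follows the MBROLA documentation quoted below; the file name demo_en1.pho is only an assumption for illustration.

# Minimal .pho parser: returns (phoneme, duration_ms, [(position_pct, pitch_hz), ...]).
# A sketch only; durations and pitch values are treated as integers, as in the demo above.
def parse_pho(path):
    segments = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";") or line == "#":
                continue  # skip blanks, comments and the "#" flush command
            tokens = line.split()
            phoneme, duration = tokens[0], int(tokens[1])
            points = [(int(tokens[i]), int(tokens[i + 1]))
                      for i in range(2, len(tokens), 2)]
            segments.append((phoneme, duration, points))
    return segments

for phoneme, duration, points in parse_pho("demo_en1.pho"):  # hypothetical file name
    print(phoneme, duration, points)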
The Task
Finally
You might also be interested in:
http://tcts.fpms.ac.be/synthesis/mbrola/mbruse.html#PHONETIC

The input file bonjour.pho supplied as an example with FR1 simply contains:

_ 51 25 114
b 62
o~ 127 48 170
Z 110 53 116
u 211
R 150 50 91
_ 91

This shows the format of the input data required by MBROLA. Each line contains a phoneme name, a duration (in ms), and a series (possibly none) of pitch pattern points, each composed of two integer numbers: the position of the pitch pattern point within the phoneme (in % of its total duration), and the pitch value (in Hz) at this position. Hence, the first line of bonjour.pho:

_ 51 25 114

tells the synthesizer to produce a silence of 51 ms, and to put a pitch pattern point of 114 Hz at 25% of 51 ms. Pitch pattern points define a piecewise linear pitch curve. Notice that the pitch pattern they define is continuous, since the program automatically drops pitch information when synthesizing unvoiced phones.

The data on each line is separated by blank characters or tabs. Comments can optionally be introduced in command files, starting with a semicolon (;). A comment beginning with "T=ratio" or "F=ratio" changes the time or frequency ratio respectively.

Notice, finally, that the synthesizer outputs chunks of synthetic speech determined as sections of the piecewise linear pitch curve. Phones inside a section of this curve are synthesized in one go. The last one of each chunk, however, cannot be properly synthesized while the next phone is not known (since the program uses diphones as base speech units). When using mbrola with pipes, this may be a problem. Imagine, for instance, that mbrola is used to create a pipe-based speaking clock on an HP:

speaking_clock | mbrola - -.au | splayer

which tells the time, say, every 30 seconds. The last phone of each time announcement will only be synthesized when the next announcement starts. To bypass this problem, mbrola accepts a special command phone, which flushes the synthesis buffer: "#"
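The piecewise linear pitch curve described above is straightforward to compute once a file is parsed. The sketch below (reusing the hypothetical parse_pho from earlier) converts each phoneme's percentage positions into absolute times, so that the first line of bonjour.pho yields an anchor of 114 Hz at 12.75 ms:

# Turn per-phoneme pitch pattern points into absolute-time anchors (ms, Hz)
# of the piecewise linear pitch curve described in the documentation above.
def pitch_anchors(segments):
    anchors, t = [], 0
    for phoneme, duration, points in segments:
        for position_pct, pitch_hz in points:
            anchors.append((t + duration * position_pct / 100.0, pitch_hz))
        t += duration
    return anchors

# For bonjour.pho, "_ 51 25 114" gives an anchor at 51 * 0.25 = 12.75 ms, 114 Hz.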
The MBROLA synthesiser is controlled by means of command files. These files contain the following information:

* phonetic transcription, using phonemes and allophones
* durations of phonemes and allophones
* pitch points, i.e. turning points in a stylised pitch contour
* optional: comment

Information about the 53 phoneme and allophone symbols for this Dutch database is available in a separate table (http://www-uilots.let.uu.nl/~Hugo.Quene/onderwijs/lotsummer98/nl2.txt). This is an example of a command file for the Dutch utterance Hallo!:

; Utterance: "Hallo!"
_ 100 100 120
h 96
A 48
l 76 5 100 75 120
o 224 25 85
_ 100 40 70

The first line is comment. Comment lines start with a semicolon (;). The other lines contain the following information:

* 1st column: phoneme or allophone segment. The "underscore" represents silence; each utterance begins and ends with a silence symbol.
* 2nd column: duration of the segment. The utterance starts with a silence of 100 ms duration. The initial [h] has a duration of 96 ms, [A] lasts 48 ms, etc.
* remainder: zero or more pitch points. Each pitch point is indicated by two numbers. The first number of the pair indicates the position, in time, expressed as a percentage of the segmental duration. The second number of the pair indicates the pitch in Hz.

More about diphones

Diphones are short fragments of speech, recorded and processed. When you are synthesising an utterance, the appropriate diphones are taken from the database, concatenated, given the requested duration and intonation (using a PSOLA-like procedure), and converted to sound. Hence, end users of the synthesizer have no control over the degree of assimilation in the output speech. We have to wait and hear to what extent the original speaker of the database has applied coarticulation or assimilation in the speech fragments. We can, however, request certain phonemes in the phonetic transcription, thus forcing the synthesizer to 'complete' assimilation.

1. A bad example
Copy the command file zoutzuur.pho to your diphone directory, and synthesise this file with the command dsyn zoutzuur. What is wrong with it? Inspect the command file, with a text editor or with the Unix command cat zoutzuur.pho.

2. You can do better
Open the command file zoutzuur.pho in a text editor. [*] Adjust the durations of the critical VCCV segments. Save the file, and synthesise it again. Do both realisations sound perfect to your ears? If not, go back to the point marked [*] above. What is the ratio between C and V2 durations, in both realisations? What is the sum of their durations?

3. Now on your own
Using the now perfect file zoutzuur.pho as an example (e.g. for the pitch contour), your task is now to make synthetic versions of all utterances which were measured in session 2. Make versions both with (complete) assimilation, and without assimilation. (Use the phoneme table to determine the appropriate transcription symbols.) Do this for corresponding 'viable' and 'unviable' cases. Compare the synthetic versions with the natural ones. You can even specify the 'natural' segment durations in the command files. How does that affect the perceived segmentation?

Last updated on June 24, 1998, by Hugo Quené.
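For step 2 of the exercise above (adjusting the durations of the critical VCCV segments), the edits can also be automated rather than made by hand. The sketch below scales the durations of chosen phonemes and writes a new command file, reusing the hypothetical parse_pho from earlier; the phoneme names and the 1.5 ratio are placeholders, not values taken from the exercise:

# Scale the durations of selected phonemes by a constant ratio and write
# the result back out in .pho format. Pitch points are positions in percent,
# so they need no adjustment when a duration changes.
def scale_durations(segments, target_phonemes, ratio):
    return [(p, round(d * ratio) if p in target_phonemes else d, pts)
            for p, d, pts in segments]

def write_pho(segments, path):
    with open(path, "w") as f:
        for p, d, pts in segments:
            fields = [p, str(d)] + ["%d %d" % (pos, hz) for pos, hz in pts]
            f.write(" ".join(fields) + "\n")

segments = parse_pho("zoutzuur.pho")
write_pho(scale_durations(segments, {"t", "s"}, 1.5), "zoutzuur2.pho")  # placeholder phonemes and ratio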
http://www.phon.ucl.ac.uk/courses/spsci/spc/
2.2. NU-MBROLA synthesis

The NU-MBROLA synthesis engine performs synthesis with the standard MBROLA algorithm [12] but it is no longer limited to diphone-based synthesis. As opposed to the MBROLA databases and synthesis engine, which embody acoustic and phonetic information, both the NU-MBROLA databases and the NU-MBROLA engine are purely acoustic objects. Units could thus be words, syllables, phonemes, or any other type of speech segment; NU-MBROLA need not know about this. Examples of use are presented in section 3 on the unit selector.

For synthesis, the user provides the list of speech segments to be concatenated and produced with some target prosody. Speech segments are defined by their location in the original speech corpus, with their starting and ending points in milliseconds. Target prosody must then be defined as in the MBROLA .pho file format: duration (in milliseconds) followed by optional pitch pattern point definitions (each point being defined by its position in percent of the target duration and target pitch value in Hz). An example of NU-MBROLA input format is:

snd01.wav 253 289 45 10 119 21 112
snd01.wav 123 189 56
snb09.wav 5078 5096 100 99 120

This will produce 201 ms of synthetic speech, using 3 units taken from 2 separate files, and applying a pitch movement that goes from 112 Hz (at 10% of the 45 ms of the first segment) to 120 Hz (at 99% of the 100 ms of the last segment). Intonation is linearly interpolated from pitch pattern points on a log scale.

During the synthesis stage, speech segment descriptions (with references to the original speech files) are translated into NU-MBROLA segment descriptions and are extracted from the NU-MBROLA database. Synthesis is performed by the standard MBROLA algorithm imposing the target prosody features. Duration modification is uniform (i.e., the duration scaling factor is constant) throughout each speech segment.

Some smoothing can advantageously be performed to reduce the spectral differences at segment boundaries. Consecutive segments (regarding their location in the original speech corpus) naturally exhibit some spectral differences, but these differences correspond to the natural evolution of speech and need to be preserved. Therefore, smoothing is only performed at stable and voiced boundaries of non-consecutive segments, by distributing linearly the difference of the boundary frames over the right and left stationary frames in a fade-in/fade-out operation. Since the NU-MBROLA corpus is composed of constant-length frames with constant phase envelopes (for low-order harmonics), sample-by-sample subtraction of boundary frames provides the difference frame. Linear distribution of this difference frame corresponds to distributing the spectral difference linearly.
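The paper says intonation is interpolated linearly on a log scale. As a sketch of what that means, the function below evaluates the pitch at a given time from absolute-time anchors (as produced by the hypothetical pitch_anchors above), interpolating log frequency rather than frequency, and holding the end values outside the curve:

import math

# Pitch (Hz) at time t_ms, interpolating between the surrounding anchors
# linearly in log frequency; anchors are assumed to be at strictly
# increasing times, as in a valid .pho pitch curve.
def pitch_at(t_ms, anchors):
    if t_ms <= anchors[0][0]:
        return anchors[0][1]
    for (t0, f0), (t1, f1) in zip(anchors, anchors[1:]):
        if t0 <= t_ms <= t1:
            a = (t_ms - t0) / (t1 - t0)
            return math.exp((1 - a) * math.log(f0) + a * math.log(f1))
    return anchors[-1][1]

# In the example above, the anchors fall at 4.5 ms (119 Hz), 9.45 ms (112 Hz)
# and 200 ms (120 Hz), since 45 + 56 + 0.99 * 100 = 200 ms of the 201 ms total.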
The projects and tasks are designed to help you through the various courses and materials that you'll have to deal with, and also to provide an active and practical element to what could otherwise become a rather dry and technical exercise. Tasks are small exercises; you may be asked to complete one or two per week. Projects are larger and carry a higher percentage of the mark. We will undertake several projects and tasks over the course. The final project is usually an individual choice project, and will be worth significantly more than the others as a percentage of your portfolio. We will usually try to set aside a time to perform the projects in a public setting.