
Organization Name: Pharo Consortium

Contributor: Neerja Doshi

Mentors: Domenico Cipriani, Nahuel Palumbo

Project Repository: PAM-GSoC-25-Project

Project Duration: May 9, 2025 – September 1, 2025

My Deliverables 😎

TTS conversion code snippet and its Transcript (Logs)

Transcript Logs

Plug these code snippets into the Playground to hear PAM speak! ❤️


        | dsp |
        
        "Initialise the DSP class"
        dsp := PAMDsp new.
        
        "audio for an array of numbers"
        dsp sayNumbers: #(5 6 7 2).
        
        "audio for all numbers from 0 to 9"
        dsp sayNumbers0to9.
        
        "audio for sentence"
        dsp sayText: 'Dogs'.
        
        "audio for sentence with prosody and child preset"
        dsp sayTextWithProsody: 'Dogs' asChild: true.
        
        "audio for sentence with prosody and adult preset"
        dsp sayTextWithProsody: 'golf day!' asChild: false.
        
        "Transcript for logs"
        Transcript open.
        
💡 Head over to PAM-Core >> PAMDspExamples >> class-side → click the notepad icon next to
sayNumbers0to9, sayText:, or sayTextWithProsody:asChild: to hear them speak! (A quick and easy way to test TTS 🎙️)

Project Abstract 😎

PAM (Pharo Automated Mouth) is a rule-based Text-to-Speech (TTS) system developed for the Pharo environment. Unlike modern deep-learning TTS systems that require extensive datasets and computational resources, PAM employs a lightweight rule-based approach inspired by the classic SAM (Software Automatic Mouth) system.

The system implements text-to-phoneme conversion using 402 English → Grapheme rules and 47 Grapheme → IPA phoneme rules, producing accurate IPA phoneme sequences with stress markers.

For synthesis, PAM adopts a concatenative sample-based approach: instead of generating audio purely algorithmically, it plays back pre-recorded audio samples of phonemes. These are sequenced and stitched together through a TpSampler-driven DSP pipeline (integrated with Phausto), enabling intelligible speech output.

On top of this, PAM introduces prosody control, allowing each phoneme playback to be modulated by parameters such as pitch (adult/child voice presets), amplitude (stress-based loudness), and duration (temporal stretching). This hybrid design (rule-based phoneme generation + sample-based concatenative synthesis + prosodic control) makes PAM lightweight and customizable while still producing natural-sounding speech.

Core Architecture

PAM consists of four main components:

Reciter Engine: Class Reciter → Converts English text to graphemes (SAM phonemes) using pattern-matching rules. The rules are adapted from SAM.

IPA Converter: Class PhonemeToIPAConvertor → Transforms graphemes into International Phonetic Alphabet (IPA) phonemes.

Audio Synthesis: Class PAMDsp → Leverages Phausto's DSP capabilities for speech generation using pre-recorded audio samples of phonemes.

Prosody Generation: Class Parser → Gives PAM prosody by adjusting frequency, amplitude, and length dynamics, and by generating age-like voice variations.

Major Accomplishments ❤️

1. Complete Rule Engine Implementation (June–July 2025)

Duration: ~6 weeks

Achievement: Successfully implemented the rules for all 26 letters with comprehensive test coverage.


      "Gasoline test"
      
      | phonemeOutput inputText |
      inputText := 'Gasoline'.
      
      phonemeOutput := Reciter textToPhonemes: inputText.
      
      self assert: phonemeOutput equals: #('G' 'EY4S' 'OW' 'L' 'IH' 'N' 'EH').
          

Test Results: The tests for all letter rules pass, and English words are converted to phonemes by the rules, for example Gasoline → ('G' 'EY4S' 'OW' 'L' 'IH' 'N' 'EH').

2. IPA Phoneme Conversion System (July–August 2025)

Duration: ~3 weeks

Achievement: Developed a recursive SAM Phoneme (Grapheme) to IPA Phoneme converter for audio file mapping.

Pseudocode for recursive compound phoneme splitting


      splitCompoundPhoneme: 'RIY'
          → splits to: #('R' 'IY')
          → converts to: #('r' 'i_colon')
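The splitting step can be tried in a Playground without loading PAM. The following is a minimal sketch of a recursive, longest-prefix-first split, assuming a tiny hypothetical mapping table (`ipaTable`) with only the two entries needed for 'RIY'. The names `ipaTable` and `split` are illustrative and are not PAM's actual API, and stress digits such as the 4 in 'EY4S' are not handled here.

```smalltalk
| ipaTable split |
"Hypothetical mapping with only the entries needed here; PAM's real table has 47 rules."
ipaTable := Dictionary new.
ipaTable at: 'R' put: 'r'.
ipaTable at: 'IY' put: 'i_colon'.

split := nil.
split := [ :code |
    code isEmpty
        ifTrue: [ #() ]
        ifFalse: [
            | length |
            "Try the longest known prefix first, then progressively shorter ones."
            length := (code size to: 1 by: -1)
                detect: [ :n | ipaTable includesKey: (code first: n) ]
                ifNone: [ self error: 'No IPA mapping for ' , code ].
            "Convert the matched prefix, then recurse on the rest of the code."
            { ipaTable at: (code first: length) } , (split value: (code allButFirst: length)) ] ].

split value: 'RIY'. "#('r' 'i_colon')"
```

For 'RIY' the sketch first tries 'RIY' and 'RI', finds neither in the table, matches 'R', and then matches 'IY' in the remainder.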
          

3. Audio Synthesis Integration (August 2025)

Duration: ~2 weeks

Achievement: Complete integration with Phausto DSP system.

PAMDsp Class Features:


      "Complete TTS with prosody"
      dsp := PAMDsp new.
      dsp sayTextWithProsody: 'hello' asChild: false.
      
      "Number sequence generation"
      dsp sayNumbers: #(1 3 5 7).
          

Audio Parameters Implemented:


Prosody Formula:


          durationForStress: stress  
              "Map stress in range [4..6] to a duration in range [0.1..0.4] seconds.
          
                  ----- INDEX ----
                   4 = short/fast (0.1s), 
                   6 = long/slow (0.4s)."
              ^ 0.1 + (((stress - 4) / 2.0) * 0.3)
          
          
          amplitudeForStress: stress  
              "Map stress in range [4..6] to an amplitude in range [0.5..1.0].
          
                  ----- INDEX ----
                   4 = soft (0.5), 
                   6 = strong (1.0)."
              ^ 0.5 + (((stress - 4) / 2.0) * 0.5)
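To see what the two mappings produce, here is a small Playground sketch that copies the formulas into blocks (the block names are illustrative, so PAM does not need to be loaded) and prints the values for stress levels 4, 5, and 6. Both formulas are linear interpolations, so stress 5 lands halfway: roughly 0.25 s duration and 0.75 amplitude.

```smalltalk
| durationForStress amplitudeForStress |
"Block versions of the two formulas above, for experimentation only."
durationForStress := [ :stress | 0.1 + (((stress - 4) / 2.0) * 0.3) ].
amplitudeForStress := [ :stress | 0.5 + (((stress - 4) / 2.0) * 0.5) ].

"Print one line per stress level to the Transcript."
#(4 5 6) do: [ :stress |
    Transcript
        show: 'stress ' , stress printString;
        show: ' duration ' , (durationForStress value: stress) printString;
        show: ' amplitude ' , (amplitudeForStress value: stress) printString;
        cr ].
```

Open the Transcript before evaluating the snippet to see the three lines.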
            

4. Package Management & Distribution

Achievement: Created production-ready Pharo package with Metacello integration.

One-line installation command


      Metacello new
          baseline: 'PAM';
          repository: 'github://neerja-1984/PAM-GSoC-25-Project:master/src';
          onConflict: [ :ex | ex useIncoming ];
          onUpgrade: [ :ex | ex useIncoming ];
          load.
          

📂 Download the archive below, extract it, and place the extracted folders in your Documents folder

⬇️ Download Audio Samples

The folder structure should look like this:

          Documents/
          ├── phonemes
          └── numbers
            
Refer to the following if in doubt

Technical Challenges Overcome 😎

1. Rule Ordering Bug Resolution

      self addRule: '(BREAK)' replacement: 'BREY5K'.
      self addRule: '(B)' replacement: 'B'.

Problem: Multiple rules matched the same input, leading to incorrect phoneme selection for the word "BREAK".

Solution: First collect every rule that matches, then choose the matching rule with the longest pattern.

Impact: Ensured accurate English text-to-phoneme conversion.
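The longest-match idea can be illustrated in a Playground with plain collections. This is a minimal sketch, assuming a hypothetical `rules` dictionary that maps a pattern to its replacement and a simple "starts with" test. PAM's real rules also carry prefix and suffix context, which is left out here.

```smalltalk
| rules word candidates best |
"Hypothetical rule table: pattern -> replacement. PAM stores its rules differently."
rules := Dictionary new.
rules at: 'BREAK' put: 'BREY5K'.
rules at: 'B' put: 'B'.

word := 'BREAK'.

"Step 1: collect every pattern that matches at the start of the word."
candidates := rules keys select: [ :pattern | word beginsWith: pattern ].

"Step 2: among the matches, keep the one with the longest pattern."
best := candidates detectMax: [ :pattern | pattern size ].

rules at: best. "'BREY5K'"
```

Both 'B' and 'BREAK' match, but selecting by pattern length means the order in which the rules were added no longer affects the result.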

2. DSP Stereo Bug


            sampler := TpSampler new.
            sampler label: aLabel.
            
            samplePlayer := sampler pathToFolder: folderPath.
            
            "problem inducing line"
            dspInstance := samplePlayer stereo asDsp.
            

Problem: Our audio files were a mix of mono and stereo. Hence, some audio files couldn't be heard properly.

Solution: Wrote helper scripts around FFmpeg to convert the mono audio files to stereo.

Impact: Reliable playback for all phonemes, standardized pipeline.

3. Phoneme File Indexing Bug


          myString := Reciter textToIPAPhonemes: 'book'.
          "myString = #('b' 'ʊ' 'k')"

          "list -> our audio files (presumably sorted by OS)"
          "for each character of myString -> find indexOf character from the list"
          "can be an issue: Files sorted by OS are platform-dependent and differ from how Phausto sorts them"
          
          [myString do: [ :i |  
              dsp setValue: ( list indexOf: i ) parameter: 'PAMSamplerIndex'.
              dsp trig: 'PAMSamplerGate'.
              0.2 seconds wait
          ]] fork.
            

Image 1 shows how Phausto sorts the files; Image 2 shows how the Windows file system sorts them.

[Image 1: file order in Phausto] [Image 2: file order in the Windows file system]

Problem: Phausto expects files to be in a sorted order, as it uses the index to map the phoneme to the audio file as seen in the above code snippet.

Solution: Added a renaming script that renames every file to number_phonemeName.wav.

Impact: Files are now sorted the way Phausto sorts them. They are named 001_a_colon.wav, 002_aɪ.wav, 003_aʊ.wav, and so on.
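The naming scheme itself is easy to reproduce. The sketch below is not the renaming script used in the project. It only shows, for a hypothetical three-name subset, how a zero-padded position prefix yields names of the form number_phonemeName.wav. The padding matters because, compared as plain text, '10_' would otherwise sort before '2_'.

```smalltalk
| phonemeNames fileNames |
"Hypothetical subset of phoneme names, already in the order the sampler should see."
phonemeNames := #('a_colon' 'aɪ' 'aʊ').

"Prefix each name with its 1-based position, zero-padded to three digits."
fileNames := phonemeNames withIndexCollect: [ :name :index |
    (index printPaddedWith: $0 to: 3) , '_' , name , '.wav' ].

fileNames. "#('001_a_colon.wav' '002_aɪ.wav' '003_aʊ.wav')"

"The position in the list is now the sampler index for that phoneme."
phonemeNames indexOf: 'aʊ'. "3"
```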

4. Cross-Platform Path Handling


  folderPath := FileLocator documents / aFolderName.
    

Problem: Windows and macOS use different path separators, which caused issues locating the numbers and phonemes folders stored in Documents (e.g., C:\Users\<your-name>\Documents\numbers).

Solution: Implemented FileLocator-based path resolution to generate OS-independent paths.

Impact: Seamless cross-platform deployment; generalized way to configure paths for any OS.
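A quick way to check that the sample folders are where PAM will look for them is to resolve the same FileLocator paths in a Playground. This is a small sketch using only standard Pharo FileSystem calls, with the folder names phonemes and numbers taken from the installation instructions.

```smalltalk
"Resolve each sample folder under Documents and report whether it exists."
#('phonemes' 'numbers') do: [ :folderName |
    | folder |
    folder := FileLocator documents / folderName.
    Transcript
        show: folder fullName;
        show: ' exists: ';
        show: folder exists printString;
        cr ].
```

The printed path uses the separators of whichever operating system the image runs on, so no separator needs to be written by hand.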

5. Audio Metadata Issue

Problem: Phausto (via libsndfile) cannot properly read audio files containing metadata tags.

Solution: Stripped metadata from all phoneme audio files using online-audio-converter.com. Confirmed playback success after conversion.

Timeline Chart of PAM Development ❤️

📚
🔬

Community Bonding & SAM Deep Dive

Initial mentorship meetings with Domenico & Nahuel to establish project foundation. Deep research into SAM's 450 phonetic rules and understanding reciter logic. Set up development environment with Pharo & Iceberg, created Git repository for project 😊
⚙️
🧠

Pharo Foundation & Constants Implementation

Successfully converted SAM constants and character flags to PAM and wrote passing test suites for them. Understood class-side, instance-side, and class-instance variables. Learned advanced Pharo debugging techniques.
🎯

Rules Engine & Pattern Matching Logic

Extended the rules to all letters of the alphabet. Developed an object-oriented design pattern for Letter Rules. Implemented a complex rule parsing system with prefix-pattern-suffix matching. Created comprehensive letter-specific rule dictionaries, and achieved working textToPhoneme conversion for basic words like "HELLO" and "COLLEGE".
🐛
🔧

Major Bug Fixes & Logger Implementation

Fixed critical sentence parsing issues and wrong phoneme generation caused by rule ordering problems. Implemented a SpringBoot-style logger with the format [CLASS NAME -> METHOD NAME -> LOG MESSAGE], resolved space handling in textToPhoneme, and achieved green test status for all alphabet rules with proper pattern priority matching.
🎡
🔊

First Audio Success with Phausto Integration

Successfully integrated TpSampler and DSP to play back the number audio samples (0-9), resolved Windows/Mac path separator issues, and created the first working audio output. Added pragma annotations and fork-based asynchronous audio processing.
🌟
🎀

Phoneme-to-IPA Conversion & Speaking Words

Developed a phoneme-to-IPA converter using a recursive greedy approach (longest matching IPA phoneme first). Created a comprehensive mapping dictionary, implemented compound phoneme splitting logic, and achieved the first successful word pronunciation ("DOG"), although very robotic. Established a sorted audio file indexing system for accurate phoneme playback.
🎭
🎛️

Prosody & Voice Synthesis Mastery

Implemented advanced prosody controls with stress-based amplitude and duration calculations. Added age-based voice presets (adult vs child), developed comprehensive audio parameter tuning (pitch, volume, note stretching), and converted mono audio samples to stereo for proper playback compatibility.
📦
⚡

Metacello Baseline & Project Optimization

Created comprehensive Metacello baseline for seamless project installation. Resolved dependency conflicts, implemented proper package management, and achieved successful deployment in fresh Pharo images. All tests green with automated dependency resolution and conflict handling.
🎉
🏆

Final Integration & GSoC Success

✨101st Commit Achievement ✨
Completed the full text-to-speech pipeline with prosodic control! The final PAMDsp implementation supports speech generation with age-based voice characteristics and stress-sensitive pronunciation! 😎

Code Quality & Testing ❤️

Comprehensive Test Suite

Clean Architecture

Performance Metrics

Metric | Value
Package Memory Consumption (package size) | < 1 GB
Rule Database (English letters → Grapheme) | 402 rules across 26 letters
Rule Database (SAM Phoneme (Grapheme) → IPA phoneme) | 47 rules
Audio Sample Count | 44 IPA phonemes + 10 numbers
Package Load Time | < 1 minute

Contribution Summary

Future Development Opportunities 😎

Enhanced Prosody: Implement formant synthesis for more natural speech patterns.

Voice Customization: Additional gender voice presets.

Real-time Processing: Streaming audio generation for long texts.

Acknowledgements ❤️

Special thanks to my mentors Domenico Cipriani and Nahuel Palumbo for all their guidance, from Pharo architecture, DSP integration, and software engineering best practices to the debugging sessions we had together. PAM wouldn't have reached this stage without them. Their expertise in both linguistic processing and audio synthesis was instrumental in achieving the project goals.

Noteworthy links

1. The project builds upon the foundational work of the original SAM system and leverages the modern capabilities of Phausto for audio generation.

2. Understanding IPA phonemes: WalkOnCross Github. Dataset of 44 phonemes taken from here: Github

3. Remove Metadata from audio files : online-audio-converter

Useful Links ❤️

1. PAM-GSoC-25-Project repository

2. GSoC'25 Weekly Updates Readme

3. My Proposal for GSoC'25 -> PAM: Pharo's TTS model

4. My tutorial to learn all about TTS models (from early rule-based systems to current deep-learning-based models)

5. My tutorial to learn Pharo as a beginner

Blogs ❤️

Let's Connect