Monday, February 17, 2025

After creating 230 million GPT-2 tokens, this UNILAG student has built an AI text-to-speech model with a Nigerian accent

In November 2024, when I asked Saheed Azeez how difficult it was to create Naijaweb — a dataset of 230 million GPT-2 tokens based on Nairaland — he brushed it off as something simple. “It’s just web scraping,” he said.

However, in my latest conversation with him, his new passion project seems to have pushed him further. He calls it YarnGPT, a text-to-speech AI model that can read text aloud in a Nigerian accent.

In a world where AI can generate lifelike voices in seconds, a text-to-speech model with a Nigerian accent might not seem groundbreaking at first. But when you consider two things, it becomes a big deal.

First, Azeez is a Nigerian university student with limited resources. Second, developing a model that accurately captures the nuances of a Nigerian accent is technically challenging.


From tokenising audio to the many mathematical concepts Azeez referenced while explaining the process, it was clear that this wasn’t a simple task. Even Azeez, in his usual fashion, didn’t downplay the effort involved.

“It was quite tasking, especially gathering the data needed to make this happen.”

How YarnGPT was created

Screenshot of YarnGPT

Inspired by the success of Naijaweb, Azeez was eager to build something new. “The amount of conversations and interest people had in Naijaweb was a great motivation. Imagine getting featured on Techpoint Africa; it motivated me to do this.”

He was also motivated by failure. Before starting YarnGPT, he had applied for a job at a Nigerian AI company but didn’t perform as well in the interview as he had expected.

YarnGPT became the project that would help him improve his skills and increase his chances of securing such roles in the future.


Building an AI model that sounds Nigerian required gathering vast amounts of recorded Nigerian speech.

“I used some movies that were available online. I extracted their audio and subtitles.”

Nollywood produces over 2,500 movies a year, and with many filmmakers uploading their work to YouTube, it seemed like Azeez had plenty of data to work with. But in reality, he had almost none.

“The problem with building in Nigeria is data. Replicating what has been built overseas isn’t that hard, but data always gets in the way.”

While there were thousands of movies to choose from, the audio wasn't up to the standard he wanted, and the subtitles were inaccurate. To compensate, Azeez turned to Hugging Face, an open-source platform for machine learning and data science. He combined the audio from Nigerian movies with high-quality datasets from Hugging Face to train his model.

The next step was training the AI model, but without access to his own GPU, he had to rely on cloud computing services like Google Colab. This cost him $50 (₦80,000) — a significant amount for a university student. Unfortunately, it was a waste.

“The model I built wasn’t working well, and the $50 cloud credit was burnt just like that. It was painful for me.”

Determined to find another way, he discovered Oute AI, a platform that had developed a text-to-speech model in an autoregressive manner.

“The way the model works is, you give it a piece of text, and it predicts one word at a time. It takes that word, adds it back to the text, then predicts the next one — kind of like how ChatGPT completes sentences. That’s what makes it autoregressive.”

While I found the autoregressive framework difficult to understand, Azeez pointed out that it simply gave him better results.

Maths, tokenisation, and the hard part of YarnGPT

Oute AI provided a structure, but Azeez still had to build his own model. He took a language model called SmolLM2-360M from Hugging Face and added speech functionality to it, a process that involved major algorithmic changes.
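One common way open TTS projects graft speech onto a text language model is to extend the model's vocabulary with discrete audio tokens, so a single next-token objective covers both words and sound. The sketch below shows only that bookkeeping idea; the specific numbers (SmolLM2's 49,152-entry text vocabulary, a 1,024-entry audio codebook) and helper names are illustrative assumptions, not Azeez's exact recipe.

```python
# Merge text and audio token spaces: text ids stay 0..V-1, and audio
# codebook entry k becomes id V + k, so one model predicts both kinds
# of token from a single softmax. (Illustrative sketch only.)
TEXT_VOCAB_SIZE = 49152     # assumed text vocabulary size
AUDIO_CODEBOOK_SIZE = 1024  # assumed audio codebook size

def audio_id(codebook_index):
    """Map an audio codebook entry into the shared token space."""
    assert 0 <= codebook_index < AUDIO_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + codebook_index

def is_audio_token(token_id):
    """Anything at or above the text vocabulary is an audio token."""
    return token_id >= TEXT_VOCAB_SIZE

print(audio_id(0), audio_id(1023))  # → 49152 50175
print(is_audio_token(12))           # → False
```

In practice this also means growing the model's embedding matrix to give the new tokens trainable vectors, which is part of the "major algorithmic changes" the article alludes to.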

After this, the final-year Mechanical Engineering student at the University of Lagos had to spend another $50 to train the model. The training took three days.

Interestingly, as he pointed out when he created Naijaweb, AI models need their training data tokenised. Large language models (LLMs) understand numbers, not words, so tokenisation converts words into numerical representations.

“If we were to tokenise the word CALCULATED, for example, we could split it into four tokens: CAL-CU-LA-TED. A number is assigned to each token.”
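Azeez's example can be mimicked with a toy tokeniser. The vocabulary and token ids below are made up purely for illustration; real models learn subword vocabularies (for example, via byte-pair encoding) from data.

```python
# Toy subword vocabulary; the token ids are arbitrary for illustration.
vocab = {"CAL": 101, "CU": 102, "LA": 103, "TED": 104}

def tokenise(word, vocab):
    """Greedy longest-match tokenisation: repeatedly take the longest
    vocabulary entry that prefixes the remaining text."""
    tokens = []
    while word:
        for size in range(len(word), 0, -1):
            piece = word[:size]
            if piece in vocab:
                tokens.append(piece)
                word = word[size:]
                break
        else:
            raise ValueError(f"No vocabulary entry matches {word!r}")
    return tokens, [vocab[t] for t in tokens]

print(tokenise("CALCULATED", vocab))
# → (['CAL', 'CU', 'LA', 'TED'], [101, 102, 103, 104])
```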

Tokenising audio, however, is different.

“Tokenizing audio is basically breaking down continuous sound waves into smaller, manageable pieces that a model can understand and process. Unlike text, which has clear breaks between words, audio is continuous—there are no natural pauses in a raw waveform.

“So, the model needs to convert the sound into a sequence of discrete values, kind of like turning a long speech into tiny puzzle pieces. These smaller audio tokens can then be used to train the AI, and later, the model can reassemble them to generate speech that sounds natural.”
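The "tiny puzzle pieces" idea can be illustrated with a crude quantiser: chop a continuous waveform into frames and map each frame to one of a small number of discrete bins. Real systems use learned neural audio codecs rather than the uniform binning assumed here, but the continuous-to-discrete step is the same in spirit.

```python
import math

def audio_to_tokens(waveform, frame_size=4, n_levels=8):
    """Crude illustration of audio tokenisation: split a continuous
    waveform (sample values in [-1, 1]) into frames, then map each
    frame's average amplitude to one of n_levels discrete bins.
    Real systems use learned neural codecs, not uniform quantisation."""
    tokens = []
    for start in range(0, len(waveform), frame_size):
        frame = waveform[start:start + frame_size]
        mean = sum(frame) / len(frame)
        # Map [-1, 1] onto integer bins 0..n_levels-1.
        level = min(n_levels - 1, int((mean + 1) / 2 * n_levels))
        tokens.append(level)
    return tokens

# A 16-sample "waveform": one cycle of a sine wave.
wave = [math.sin(2 * math.pi * i / 16) for i in range(16)]
print(audio_to_tokens(wave))  # → [6, 7, 1, 0]
```

Each integer in the output is a discrete token a model can learn to predict; a decoder later reverses the mapping to reassemble audible speech.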

This entire process was made possible by a wave tokenizer. Using resources from Hugging Face, Oute AI, and other Nigerian repositories, Azeez was able to create YarnGPT.

Publicising YarnGPT

Saheed Azeez, who built Naijaweb, a dataset of 230 million GPT-2 tokens based on Nairaland

Azeez might be a nerd, but he isn't afraid to put himself in front of a camera to showcase his work. In a two-minute video, he explained YarnGPT and caught the attention of 138,000 people on X (formerly Twitter), including Timi Ajiboye, Co-founder of Helicarrier (formerly BuyCoins).

Creating YarnGPT was difficult, but making the video was another hurdle.

“I called my friend and logistics manager, Aremu, and told him I wanted to make a video. We reached out to another friend who had a camera he wasn’t even using, and then we went to yet another friend’s house to record.

“We rearranged the whole house and used their TV as the background. His mum wasn’t too pleased when she returned.”

The results were worth it. The video got thousands of views across social media, and people began testing YarnGPT. The model could not only pronounce English in a Nigerian accent but could also read Nigerian languages—Hausa, Igbo, and Yoruba.

It has various applications. Content creators can use it for voice-overs in Nigerian accents, Google Maps could provide directions in Nigerian languages, and it could even enhance accessibility for non-English speakers.

Nigeria and the AI race

While innovators like Azeez and American-born Ijemma Onwuzulike (creator of Igbo Speech) are developing exciting AI models, Nigeria remains far behind in the AI race. The industry has evolved beyond a hobbyist's playground into a battleground for global superpowers, with the U.S. announcing Stargate, a $500 billion AI infrastructure initiative.

Meanwhile, AI breakthroughs like DeepSeek have shaken up Wall Street, causing giants like Nvidia to lose billions in market value due to new competition.

Even Azeez acknowledges Nigeria’s position.

“Honestly, we’re way off. We’re not even in the race. The big AI models today — like OpenAI’s or the ones from China — are trained on massive datasets with huge computational resources, things we don’t have here.”

But he remains optimistic.

“I think there’s a way forward. Instead of trying to build from scratch, we can focus on localising AI for our own needs. We can take what’s already been built and adapt it for Nigerian languages and accents. That’s how we can start catching up.”

Nigeria’s Minister of Communications and Digital Economy, Bosun Tijani, has been vocal about positioning the country as a key player in AI development. Perhaps, with talents like Azeez, there is hope.
