Overview
In my years of studying Latin I have always been fascinated with unattributed texts, which represent roughly 10-30% of all the Latin we have today. Authorship is crucial for understanding historical context, for literary analysis, and for protecting intellectual property. Yet something so fundamental to our understanding of a text is often missing.
Author Attribution is the process of identifying the author of a given text. In the past, researchers needed deep knowledge of each author to meticulously analyze every text, and classical texts are often fragmented and filled with copying errors, which makes attribution even harder. Humans long struggled with this; in the modern world, however, researchers have developed quantitative methods such as N-gram analysis and Type-Token Ratio. These methods are far more effective because they can quantify stylistic differences and reduce individual researchers' bias.
The success of these techniques raises the question: why do we need a human for this? Pondering this question one day, I realized: we don't! Once a problem is reduced to math and statistics, machines can work at a scale and consistency humans can't match, and the quantitative methods above have turned Author Attribution into exactly that kind of problem. I decided to try building an AI model of my own.
What I Did
The first step in my journey was making three key decisions about my roadmap.
The first major decision was what kind of model to use. In classification there are many different machine learning algorithms to choose from, but they broadly fall into two categories: neural networks and clustering algorithms. I decided to train two models, one of each. For the clustering model I chose the K-means algorithm. K-means is an unsupervised machine learning algorithm that clusters data points into a predefined number of clusters (k). It works by assigning data points to the nearest centroid (the center of a cluster) and then recalculating the centroids based on the new cluster assignments. The two main benefits of K-means are its relatively simple implementation and its scalability. For the neural network, I decided to create a transformer-based model that I would fine-tune with a custom dataset. Tuned models take a large language model such as ChatGPT or Gemini and retrain it for a specific task, letting you leverage the general training of these models and repurpose it for a specific use case. For a base model, I chose to tune Gemini 2.5 Flash.
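To make the K-means idea concrete, here is a rough sketch using scikit-learn; the data and parameters are invented for illustration and this is not my actual training code.

```python
# Rough sketch of K-means with scikit-learn (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

# Pretend each row is one vectorized 50-word snippet (random numbers stand in for real features).
X = np.random.rand(200, 300)

kmeans = KMeans(n_clusters=7, random_state=42)  # one cluster per author
labels = kmeans.fit_predict(X)                  # assign each snippet to its nearest centroid
centroids = kmeans.cluster_centers_             # centroid positions, updated until assignments stabilize
```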
The second major decision was how large to make each data sample and which authors to use. I decided to break the text into small chunks of 50 words each in order to better handle the text fragments the model would most likely be used for (see the sketch below). I also decided, for the first run, to use text from only seven authors: Caesar, Cicero, Horace, Livy, Ovid, Tacitus, and Vergil. One more key item was lemmatization. Lemmatization reduces words to their base form (e.g., "running" becomes "run," "children" becomes "child"). This is a crucial step for clustering because it improves accuracy by ensuring that different inflections of a word are treated as the same word.
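A simplified sketch of the 50-word chunking step; the helper function and sample text are made up for illustration, and my real pipeline has more cleanup around it.

```python
# Simplified illustration of chunking a text into 50-word pieces.
def chunk_text(text: str, chunk_size: int = 50) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

sample = "Gallia est omnis divisa in partes tres ..."  # imagine an entire cleaned text here
chunks = chunk_text(sample)
```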
I had to format the data like this:

The third decision was which tools to use. The first question was the platform to build my AI models on: AWS or Google Cloud. I decided to use Google Cloud since it already had Gemini tuning integrated into its infrastructure, making the creation of my own tuned model easier.
The clustering model required two extra tools: a vectorizer and a lemmatizer. Vectorizers transform qualitative data like text or images into numerical values, allowing computers and AI models to work with them. I chose the TF-IDF vectorizer. TF-IDF is a statistic that measures a word's importance in a document within a collection. It increases with a word's frequency in a document but is reduced by its overall frequency in the corpus, thus downplaying common words. The TF-IDF vectorizer generates a matrix where rows represent documents and columns represent unique words, with each cell containing the word's TF-IDF score. Higher scores signify words specific to a document and less common overall, making them useful for clustering and classification. The other tool I needed was a lemmatizer, and I used the Latin lemmatizer from the Classical Language Toolkit (CLTK).
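Here is a rough sketch of the vectorization step with scikit-learn's TfidfVectorizer; the snippets are hypothetical lemmatized fragments, standing in for output from the CLTK lemmatizer.

```python
# Sketch of turning lemmatized snippets into a TF-IDF matrix with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

snippets = [
    "gallia sum omnis divido in pars tres",    # hypothetical lemmatized Caesar snippet
    "arma vir cano troia qui primus ab ora",   # hypothetical lemmatized Vergil snippet
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(snippets)  # rows = snippets, columns = unique lemmas, cells = TF-IDF scores
```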
Implementation
The first step in creating any model is assembling a dataset. I downloaded over 150 full Latin texts from The Latin Library, totaling over 5 million tokens. Using Python's regex module, I then removed all brackets and numbers and stripped out extra blank lines.
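A simplified sketch of that cleanup step with the re module; the exact patterns I used differ, but the idea is the same.

```python
# Simplified cleanup: strip bracketed notes, numbers, and runs of blank lines.
import re

def clean(text: str) -> str:
    text = re.sub(r"\[[^\]]*\]", "", text)     # remove bracketed editorial notes
    text = re.sub(r"\d+", "", text)            # remove numbers (chapter/line markers)
    text = re.sub(r"\n\s*\n+", "\n\n", text)   # collapse extra blank lines
    return text.strip()
```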

I then cut up the data and reformatted it to look like this:
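Roughly, each 50-word chunk became a prompt/response pair with the author as the target. The snippet below is only an illustration of the shape of one example; the exact schema the tuning service expects may differ.

```python
# Illustrative shape of a single tuning example (field names are an approximation).
example = {
    "contents": [
        {"role": "user", "parts": [{"text": "gallia est omnis divisa in partes tres ..."}]},
        {"role": "model", "parts": [{"text": "Caesar"}]},
    ]
}
```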

With the data in place, it was time to train the transformer model. As mentioned above, I decided to tune a Gemini model, specifically 2.5 Flash. I used a 90/10 training/validation split: 90% of the dataset was used to train the model, and the remaining 10% was used to test it on unseen data. The training results showed significant improvement, with the loss dropping from ~2.7 to below 0.0003 across the training epochs.
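For reference, a 90/10 split can be done in one line with scikit-learn; the rows below are placeholders, not my dataset.

```python
# Illustrative 90/10 train/validation split.
from sklearn.model_selection import train_test_split

rows = [("gallia est omnis divisa in partes tres", "Caesar"),
        ("arma virumque cano troiae qui primus ab oris", "Vergil")] * 50  # placeholder data

train_rows, val_rows = train_test_split(rows, test_size=0.1, random_state=42)
```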

Training the clustering model was much more complicated. I built the model inside a Jupyter notebook, where I had to manually build and train it and store the centroid locations. I ran into deployment issues with the Docker image I built. A Docker image packages up all the files an app needs so it can be deployed as a container. I spent several weeks debugging this: changing loading configurations, double- and triple-checking the code, swapping import paths, pre-loading models and data! Then I found the error: CLTK kept launching an interactive prompt to download a dependent model, and an interactive download doesn't work in a containerized deployment. I assume this is because the library is mostly used by researchers on their own machines, so no one had run into the problem before. The library documentation indicated there are commands to bypass this, but none of them worked. I ended up filing a bug report with the open source community.
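As a side note, here is one way a fitted clustering model and its centroids can be persisted so the container loads them at startup instead of retraining; joblib is used for illustration, and my actual deployment setup may differ.

```python
# Sketch: save the fitted K-means model (centroids included) and reload it at container startup.
import joblib
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=7, random_state=42).fit(np.random.rand(200, 300))  # stand-in for the real model

joblib.dump(kmeans, "kmeans_model.joblib")   # persist alongside the fitted vectorizer

# Inside the Cloud Run container, at startup:
kmeans = joblib.load("kmeans_model.joblib")
```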
Finally, with the model endpoints running on Cloud Run in Google Cloud, I threw together this basic website.
Accuracy and Performance
In order to test my supervised fine-tuned (SFT) model against the base Gemini model, I built a side-by-side evaluation script that ran 1,000 50-token text snippets neither model had seen before. My SFT model correctly identified the author of 98% of the evaluation snippets, while the base Gemini model got only 85% correct.
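The evaluation boils down to a simple accuracy loop; the sketch below shows the shape of it, with the model calls left as placeholders rather than real API code.

```python
# Rough shape of the side-by-side accuracy check (predict_fn stands in for a real model call).
def accuracy(predict_fn, eval_set):
    correct = sum(1 for snippet, author in eval_set if predict_fn(snippet) == author)
    return correct / len(eval_set)

# eval_set: the 1,000 held-out (snippet, author) pairs
# sft_acc  = accuracy(call_tuned_gemini, eval_set)
# base_acc = accuracy(call_base_gemini, eval_set)
```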
I suspect the base model's poor performance was driven by two areas of confusion. The first was the histories of Livy, Caesar, and Tacitus: looking through the data, the base model struggled to tell apart their accounts of the same events, which led to many inaccuracies. The other area of confusion was between Cicero and Quintilian. In his Institutio Oratoria, Quintilian quotes Cicero numerous times, tricking the base model.

The clustering model did not perform well. It had an accuracy of 47.7%. After looking through the data, I noticed two key areas of mistakes:

1. Poetry. The model was correct on only 6% of all the Vergil, Ovid, and Horace snippets, most of them being falsely attributed to Tacitus. This actually makes sense, since Tacitus' writing style was known for being remarkably similar to that of poetry.
2. History. The model correctly attributed only 7% of the Livy snippets. As with the transformer-based model, most of the misattributions went to Caesar and Tacitus, who achieved much higher accuracies of 67% and 77% respectively. On top of covering the same subject matter, Livy's writing style is much more varied than Caesar's or Tacitus's, which causes his dispersed text snippets to be pulled into their larger and more concentrated clusters.
Key Learnings
1. Anyone can build an AI. Thanks to all of these ready-to-go tools and modules, anyone with limited coding experience can build an AI. Open source APIs and platforms such as Google Cloud let people create their own models, in some cases with almost zero coding. It's not quite there yet, but soon anyone will be able to build their own AI. And if that's true, anyone can change the world.
2. Debugging is Hard. Another lesson I learned is how painful debugging can be. More than half of my summer was spent fixing code that should have been working long ago. One change that probably would have made my process easier is better organization and documentation: I found myself having to go back through old branches of code, older models, and notebooks, and it got pretty hard to remember what I did and why.
Next Steps
While I have made great progress so far, there is still plenty of work left to do.
The first major improvement I will implement is making the dataset more diverse across authors and time periods. When the dataset is not diverse enough, authors with more data can actually “take control” of clusters and crowd out smaller authors.
I also want to build an even larger dataset. While The Latin Library has a lot of data for a free library, it pales in comparison to collections such as the Library of Latin Texts, which has over four times as many texts as The Latin Library. I am currently in the process of trying to get access to the full library. (If you can help, please let me know!)
One more improvement I will make is implementing a confidence score for both models. That way, people using my models will be able to judge how much to trust each prediction.
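For the clustering model, one idea I am considering is turning distances to each author's centroid into probability-like scores, for example with a softmax; this is a sketch of the idea, not shipped code.

```python
# Idea sketch: convert centroid distances into a confidence score per author.
import numpy as np

def cluster_confidence(snippet_vector: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    distances = np.linalg.norm(centroids - snippet_vector, axis=1)
    scores = np.exp(-distances)
    return scores / scores.sum()   # the closest centroid gets the highest score
```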
Finally, the real goal is to feed in unknown texts. While the transformer-based model outperformed the clustering model, I don't think it will be much help with unattributed texts, as it only knows the authors it was trained on and has no way of quantifying how similar or different authors' writing styles are. The clustering model, however, builds a high-dimensional map of each text snippet, letting us see how close or far apart (similar or different) a snippet is from those of known authors. While the clustering model is clearly not ready for this purpose, I will continue tuning and training it.
Real World Applications
Author Attribution is not simply a problem facing classics students. Knowing who created a piece of content is more important than ever, especially in the age of AI-generated content and misinformation. Misinformation can have severe consequences, from eroding trust in institutions to influencing people with false narratives, and in some cases even inciting violence. An Author Attribution and Verification AI capable of identifying the true author of a speech or a paper would help combat misinformation.
About Me
I am currently a 9th grader at Cardigan Mountain School and a third-year Latin student. In my free time I enjoy skiing and playing tennis. You can find my source code on GitHub.