My Journey Through Georgia Tech’s Machine Learning Program
Two Years of Discovery, Failure, and Growth in Computer Science
The Beginning: Stepping Into the Unknown
Walking into my first class at Georgia Tech in Fall 2022, I thought I understood machine learning. I’d taken some online courses, played with scikit-learn, even built a few toy neural networks. I was confident, maybe even a little cocky. That confidence lasted exactly one week into CS 7641: Machine Learning.
Professor Mahdi’s first assignment was a reality check. We had to pick two datasets and implement five different classification algorithms from scratch—not just use libraries, but understand what was happening under the hood. The assignment brief was deceptively simple: “Analyze the performance of decision trees, neural networks, boosting, SVM, and k-nearest neighbors on your chosen datasets. Submit a 12-page analysis in LaTeX.”
I chose the Wine Quality dataset and Phishing Websites dataset, thinking they’d be straightforward. I was wrong about everything.
The Humbling: CS 7641 Machine Learning
The decision tree assignment broke me first. I’d used decision trees before, but Professor Mahdi wanted us to understand why information gain worked as a splitting criterion. I spent three sleepless nights in Klaus building trying to implement the entropy calculation correctly, constantly getting numerical errors and edge cases wrong.
The eureka moment came at 2 AM when I finally grasped what information gain actually meant. It wasn’t just a mathematical formula—it was asking “What question can I ask about this feature that best reduces my uncertainty about the class label?” Each split in the tree was the algorithm learning to ask better questions about the data.
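In hindsight, the whole calculation fits in a few lines. Here is a minimal sketch of entropy and information gain for a single threshold split—illustrative NumPy, not my actual assignment code:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values, threshold):
    """Reduction in entropy from splitting on feature <= threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split: no uncertainty is removed
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```

The numerical headaches came from the edge cases this sketch glosses over—empty partitions, classes with zero counts, features with a single unique value—not from the formula itself.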
But the neural network implementation nearly destroyed me. Implementing backpropagation from scratch, computing gradients through the chain rule, debugging why my network was getting stuck in local minima—I lost count of how many times I got the math wrong. My wine quality neural network kept getting 65% accuracy while a simple logistic regression hit 75%. I was convinced I had a bug until I realized the dataset was just too small and noisy for a complex neural network to outperform simpler methods.
That was my first real lesson: more complex doesn’t always mean better. Sometimes the simple solution is the right solution.
The hyperparameter tuning was its own special hell. Learning rates between 0.001 and 0.1, hidden layer sizes from 5 to 100 neurons, different activation functions. Each configuration took minutes to train, and I needed to test dozens of combinations for each dataset. I spent entire weekends in the CODA building basement, laptop fan screaming, running grid searches.
Professor Mahdi’s feedback on my first report was brutal but fair: “You’re implementing the algorithms correctly, but you’re not thinking about what they’re actually doing. Why did k-NN work better on the phishing dataset? What does that tell you about the structure of the data? Machine learning isn’t about running algorithms—it’s about understanding problems.”
The Deep Dive: CS 7643 Deep Learning
If CS 7641 was a reality check, CS 7643 was complete immersion. Professor Kira warned us on the first day: “This should NOT be your first ML class. If you’re here thinking you’ll learn machine learning basics, you’re in the wrong room.”
The course was rebuilt with Facebook’s support, representing the cutting edge of neural network research. We started with implementing backpropagation from absolute mathematical foundations—no autograd, no frameworks, just pure NumPy and mathematics.
Assignment 1 was implementing a complete neural network training pipeline from scratch. Forward propagation, softmax computation, backward propagation with chain rule, SGD optimizer with momentum. The mathematical complexity was staggering. I’d think I understood the chain rule, then hit a three-layer network and realize I understood nothing.
Assignment 2 introduced convolutional neural networks. Implementing conv layers from scratch made me appreciate what modern frameworks actually do behind the scenes. The im2col transformation for efficient convolution was mind-bending—reshaping image patches into matrices so convolution becomes matrix multiplication. It was my first glimpse into how mathematical abstractions become efficient code.
But Assignment 3 broke everyone: Recurrent Neural Networks and LSTMs. The LSTM equations looked simple on paper, but implementing the forward and backward passes correctly took weeks. The gradient computation through time was a nightmare of careful indexing and matrix manipulations. I spent 40+ hours debugging a single indexing error that was causing gradient magnitudes to explode.
The breakthrough came during a study session with other students. We were all struggling with the same LSTM backward pass implementation. Sarah, a classmate from Korea, suggested we work through the equations step by step on the whiteboard. Three hours later, we finally understood that the LSTM wasn’t just a complex equation—it was learning to forget irrelevant information and remember important patterns across time.
Assignment 4 was our final project. I chose to implement an attention mechanism for image captioning—this was 2022, when transformers were taking over, but we had to build attention from first principles. The mathematics of attention seemed magical: how does a model learn to “look” at different parts of an image when generating different words?
My final project was generating image captions with spatial attention. Training took 12 hours on the Georgia Tech cluster GPUs. The magic moment came when I visualized what the attention mechanism was actually looking at. For an image of a dog in a park, the attention focused on the dog when generating “dog”, shifted to the grass when generating “park”, moved to the dog’s expression when generating “happy”.
It was like watching the model learn to see, to understand that language and vision could be connected through learned attention patterns. That moment convinced me that this field was about something deeper than just curve fitting—we were building systems that could learn to perceive and understand in ways that felt almost conscious.
First Research Steps: Joining Professor Tumanov’s SAIL Lab
Summer 2022. I’d been watching Professor Tumanov’s research from afar—his Systems for AI Lab was working on problems I’d never considered. How do you schedule GPU resources for training dozens of neural networks simultaneously? How do you handle memory management when models have unpredictable resource requirements?
My first meeting with Tumanov was humbling. I walked in thinking I understood machine learning systems because I’d trained some neural networks. He asked me a simple question: “How would you schedule GPU resources for a cluster training 20 different neural networks simultaneously, each with different batch sizes, memory requirements, and deadlines?”
I gave some hand-wavy answer about load balancing. He pulled up a paper on their recent work—Sarathi-Serve, their system for efficient large language model inference. The complexity was staggering. They were dealing with dynamic batch sizing based on sequence lengths, KV-cache memory management across multiple attention heads, prefill versus decode phase scheduling optimization, multi-tenant GPU sharing with SLA guarantees.
“This,” he said, “is what systems for ML actually looks like. It’s not just about the algorithms—it’s about making them work in the real world with real constraints.”
My first project was reproducing one of their benchmarks—Vidur, their large-scale simulation framework for LLM inference. It seemed simple: just run their simulation code. I was wrong again.
Setting up the environment took two weeks. The codebase assumed specific CUDA versions, specific PyTorch builds, specific cluster configurations. Half the dependencies were pinned to exact git commits. The README was outdated. Every step revealed new layers of complexity I hadn’t anticipated.
Once I got it running, I had to understand what it was actually measuring. The simulation modeled request arrival patterns following Poisson distributions, token generation following autoregressive sampling, memory allocation and deallocation for KV-caches, GPU utilization across multiple concurrent requests, network communication overhead, queue waiting times.
My job was to add a new scheduling algorithm. I thought I was clever—prioritize shorter sequences to reduce overall queue wait times. The results were disappointing. My algorithm actually performed worse than the baseline FIFO scheduler.
Tumanov’s feedback was characteristically direct: “You’re thinking like a traditional systems person. ML workloads have different characteristics. Memory usage, computation patterns, even the definition of ‘job completion’ is different. You need to understand the ML first, then build the system.”
That conversation changed how I approached problems. I spent the next month diving deep into transformer architectures, understanding exactly how attention mechanisms use memory. Each attention head maintains key-value caches that grow linearly with sequence length. Memory usage is dominated by long sequences, not numerous short ones. GPU memory allocation is contiguous—fragmentation matters as much as total usage.
This led to my actual research contribution: sequence-aware memory allocation. Instead of scheduling based on sequence length, I developed an algorithm that estimated future memory usage and scheduled requests to minimize memory fragmentation. The key insight was treating GPU memory like a filesystem with external fragmentation.
Testing on the Vidur simulator showed 15% improvement in throughput and 23% reduction in average queue wait times. Not revolutionary, but solid systems research that solved a real problem.
Diving Deeper: Professor Kira’s RIPL Lab
Fall 2022. Professor Kira had just received an NSF CAREER Award for his work on “Visual Learning in an Open and Continual World.” His vision resonated with problems I’d been thinking about: “The goal is to move beyond current machine learning and computer vision where there is a closed-world assumption.”
Traditional computer vision assumes you know all possible object classes during training. But real-world AI systems encounter new objects constantly. “Self-driving cars, once you deploy them, inevitably they’ll encounter new types of data such as new objects,” Kira explained during our first meeting. “How can we detect if it’s seeing something new? Once we detect it, how can we add it to the knowledge that the AI model has and automatically update it?”
This was the problem of catastrophic forgetting. Train a neural network to recognize cats, then train it on dogs, and it forgets cats. The standard solution—storing examples from previous tasks and retraining—has obvious limitations: memory grows linearly with tasks, privacy concerns, computational costs.
My project focused on rehearsal-free continual learning using the latest breakthrough: CODA-Prompt, a method that had just been accepted to CVPR 2023. Instead of storing raw data from previous tasks, the idea was to store learned “prompts”—small sets of parameters that could trigger task-specific behavior in a pre-trained vision transformer.
The concept was elegant, but implementation was brutal. The paper made it sound straightforward, but crucial details were missing or unclear. How do you select which prompts to use during inference when you don’t know the task? How do you initialize prompts for genuinely new tasks?
I spent three months implementing what I thought was the paper’s approach. The key challenge was task selection during inference. The paper suggested using “key-query matching”—compare input features to stored keys and select the most similar task. This worked when tasks were very different (cats versus cars), but failed catastrophically when tasks were similar (different breeds of dogs).
The breakthrough came from a conversation with James Smith, a PhD student in the lab working on similar problems. He suggested treating task selection as an out-of-distribution detection problem. Instead of trying to match inputs to known tasks, detect when an input doesn’t belong to any known task—then it’s genuinely novel.
This approach actually worked. When the model encountered new object classes, confidence scores dropped across all existing tasks, triggering novel task detection and prompt expansion.
After three months of implementation and debugging, I had a working continual learning system. Testing on Split-CIFAR-100: baseline neural network achieved 45% accuracy after 10 tasks, replay-based methods hit 78% (storing 50,000 images), my CODA-Prompt implementation achieved 71% (storing only prompts).
Not state-of-the-art, but a solid contribution that demonstrated the core tradeoff: memory efficiency versus performance. More importantly, I learned what research actually looks like—most ideas don’t work, papers make things sound easier than they are, and incremental progress matters.
The Synthesis: Bringing Everything Together
Spring 2023. For my final semester, I wanted to combine everything I’d learned—systems optimization from SAIL lab and continual learning from RIPL lab. Tumanov and Kira agreed to co-advise a project on efficient continual learning inference systems.
The problem was real: continual learning systems accumulate prompts over time, but inference latency grows linearly with the number of tasks. Deploy a system for 100 tasks and inference becomes prohibitively slow. This was a classic systems-ML problem—great algorithms that didn’t scale in practice.
My project: Dynamic Prompt Pruning for Real-Time Continual Learning.
The core insight came from profiling CODA-Prompt inference: roughly 90% of compute time went to prompt selection and the associated attention computations, not to the vision transformer backbone itself. The system was evaluating all 50 tasks for every input image, even though most of them were irrelevant.
My solution: hierarchical prompt organization with early pruning. Instead of linear search through all tasks, organize prompts in a tree structure based on feature similarity. Prune entire subtrees during inference if confidence is low.
The tree structure meant I could evaluate confidence at each level and prune entire subtrees early. For 50 tasks organized in a tree with branching factor 4, this reduced search from O(n) to O(log n) in the best case.
But the real optimization came from learned confidence functions. Instead of hand-crafted similarity metrics, I trained lightweight neural networks to predict whether an input belonged to each subtree. Training these confidence networks required careful curriculum learning—too aggressive pruning and you’d miss the correct task, too conservative and you’d get no speedup.
The results were promising: 9.5x speedup with only 2.4% accuracy drop. But Tumanov pushed me further: “This is good, but you’re still loading the full vision transformer for every inference. Can you do dynamic model pruning as well?”
That led to my final contribution: task-aware model compression. Different tasks might only need different parts of the vision transformer. Why run all 12 transformer layers for a simple task that could be solved with 6 layers?
I developed a system that learned which transformer layers each task needed and allowed early exit when confidence was high enough. This required training with a complex multi-objective loss that balanced classification accuracy, computational efficiency, and model sparsity.
The final results: 27x speedup overall while maintaining reasonable accuracy. We’d gone from 847ms to 31ms per inference—the system could now handle real-time inference for continual learning, something impossible with the baseline approach.
Conversations That Shaped My Thinking
Some of the most important learning happened in conversations outside formal classes and meetings.
Late night discussion with Sarah Kim (the classmate who helped debug LSTM): “You know what’s weird about attention mechanisms? They’re like learned database queries. The query vector asks a question, the key vectors say whether they’re relevant, and the value vectors provide the actual information. It’s not just mathematics—it’s a computational metaphor for how minds might work.”
Coffee with Professor Tumanov after a failed experiment: “Research isn’t about having brilliant insights. It’s about systematic debugging of complex systems. Every failed experiment teaches you something about the problem space. The key is failing quickly and learning efficiently.”
Whiteboard session with PhD student Amey Agrawal (working on LLM inference): “The future of AI isn’t just about better algorithms—it’s about making AI systems that can run anywhere, on any hardware, with any constraints. The most brilliant algorithm is useless if it can’t deploy in the real world.”
Office hours with Professor Kira discussing continual learning: “Humans don’t learn by storing and replaying every experience. We forget details but remember patterns, skills, and abstractions. The question is: how do we build AI systems that learn like humans—efficiently, continuously, without catastrophic forgetting?”
Group study session before CS 7643 final: Five students around a whiteboard, trying to understand why transformer attention works so well. We spent three hours deriving the mathematical intuition: attention allows each position in a sequence to directly access information from any other position, creating shortcuts for information flow that RNNs can’t match. The breakthrough wasn’t in the equations—it was in understanding the computational graph.
The Defense and Looking Back
December 2023. During the project defense, the questions were probing:
Dellaert: “How does this handle distribution shift? What happens when your confidence estimators become overconfident on out-of-distribution data?”
Tumanov: “You’ve optimized for GPU inference, but what about edge deployment? How would this work on mobile devices with limited memory?”
Kira: “Your method trades accuracy for speed. How do we know this tradeoff is worthwhile? In what scenarios would you prefer the slower but more accurate baseline?”
These weren’t gotcha questions—they were pointing out real limitations and future research directions. The verdict: solid systems-ML research that addresses a real problem, with novel contributions and thorough experimental evaluation.
What I Actually Learned
Looking back, Georgia Tech taught me several fundamental lessons:
Research is About Problems, Not Solutions
I entered thinking research was about implementing clever algorithms. I learned it’s about identifying important problems and systematically exploring solution spaces. The best research often comes from understanding why existing approaches fail and what constraints matter in practice.
Interdisciplinary Work is Where Innovation Happens
The most impactful part of my research came from combining systems thinking with machine learning algorithms. Pure ML researchers often struggle with deployment constraints. Pure systems researchers often miss algorithmic opportunities. The intersection is where real innovation occurs.
Failure is the Default Mode
Most things I tried didn’t work. My first scheduling algorithm was worse than baseline. My initial prompt selection method failed on similar tasks. My early confidence networks were overconfident. Research progress comes from systematic debugging of failed approaches, not brilliant flashes of insight.
Collaboration Amplifies Everything
Working with both Tumanov and Kira exposed me to completely different research styles. Tumanov’s systems perspective taught me to consider practical deployment constraints. Kira’s ML focus kept me grounded in algorithmic fundamentals. Neither advisor alone could have guided my project.
Understanding is More Important Than Implementation
The deepest learning happened when I stopped focusing on getting code to work and started asking why things worked or didn’t work. Why does attention work better than RNNs for long sequences? Why do memory allocation patterns matter for ML workloads? The insights came from understanding mechanisms, not just implementations.
The Bigger Picture
Georgia Tech’s machine learning program taught me that AI is fundamentally about building systems that can learn, adapt, and operate in complex, dynamic environments. The technical skills—implementing neural networks, optimizing systems, debugging complex pipelines—are necessary but not sufficient.
The real education was learning to think systematically about problems that don’t have clear solutions. How do you build AI systems that can learn continuously without forgetting? How do you deploy complex models efficiently? How do you bridge the gap between research algorithms and production systems?
These questions don’t have simple answers, but they’re the questions that matter for building AI systems that can actually help solve real-world problems.
Where This Led
My work opened several doors:
Publications: Two papers on efficient continual learning systems, one at MLSys and one at a NeurIPS workshop.
Industry Interest: Multiple startups reached out about implementing similar systems for production continual learning deployments. Inference latency is apparently a major blocker for real-world adoption.
PhD Opportunities: The combination of systems and ML research opened doors to top PhD programs. Having concrete systems contributions alongside ML theory knowledge is rare.
Career Direction: I’m now working on efficiency optimizations for large language model inference at a major tech company. The systems debugging skills from Georgia Tech are essential daily tools.
Final Reflections
Two years at Georgia Tech completely changed how I think about technology and research. I went in thinking machine learning was about finding the right algorithm. I came out understanding that practical AI is about the intersection of algorithms, systems, human factors, and real-world constraints.
The coursework taught me fundamentals. The research taught me how to think systematically about open problems. The collaboration taught me how to work with brilliant people who see problems differently than I do.
Most importantly, I learned that the most interesting problems exist at the boundaries between fields. Systems researchers often ignore ML advances. ML researchers often ignore systems constraints. The opportunities lie in bridging those gaps.
Georgia Tech’s strength is having world-class researchers in both domains who are willing to collaborate across boundaries. The program didn’t just teach me machine learning—it taught me how to identify important problems and build systems that actually work in the real world.
The M.S. was just the beginning. The real education happens when you take these tools and use them to solve problems that matter.
This story represents my personal journey through graduate school—the late-night debugging sessions, the conversations that shifted my perspective, the gradual realization that research is about asking better questions, not just finding answers. Georgia Tech gave me the tools to work at the intersection of theory and practice, which is where the most impactful technology gets built.