Developed a GPU-accelerated data processing pipeline that categorizes 1.7 billion Reddit posts into topics, implementing BERTopic with BERT embeddings, UMAP dimensionality reduction, and HDBSCAN clustering. Analyzed approximately 12 billion Reddit posts to assign each post a topic and track how those topics changed over time. Optimized the large-scale NLP pipeline, filtering the dataset down to 43 million categorized posts for efficient analysis. Created and published an interactive website (redditopics.xyz) for exploring and downloading the processed dataset. I'll spend the summer continuing the project and will update the results later.
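In miniature, the pipeline's embed-then-cluster idea looks like the sketch below. It is purely illustrative: toy bag-of-words vectors and a greedy similarity threshold stand in for the actual BERT embeddings, UMAP reduction, and HDBSCAN clustering, and all names are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    # Toy stand-in for a BERT embedding: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(posts, threshold=0.4):
    # Greedy single-pass clustering (a crude stand-in for HDBSCAN):
    # assign each post to the first cluster whose seed is similar
    # enough, otherwise start a new cluster.
    clusters = []  # list of (seed_embedding, [member posts])
    for post in posts:
        vec = embed(post)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(post)
                break
        else:
            clusters.append((vec, [post]))
    return [members for _, members in clusters]

posts = [
    "cats are great pets",
    "my cats love pets and naps",
    "the stock market fell today",
    "market prices fell again",
]
groups = cluster(posts)
```

The real pipeline replaces each stage with its GPU-accelerated counterpart, but the topology is the same: embed every post once, then group posts by similarity and label each group as a topic.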
Developed an AI-powered personal learning assistant that helps users revise more efficiently by answering questions about the material they upload. Generated questions are tuned to the user's previous answers to reach the right level of difficulty. Skills: Retrieval-Augmented Generation, Python, Flask, React, AWS.
Analyzed cinema trends over decades using large-scale datasets, focusing on movie length, ratings, budgets, and plot complexity. Applied Latent Dirichlet Allocation to track changes in movie topics and originality, and presented the results as an interactive data story. Skills: Python, pandas, numpy.
Developed STEMERALD, a STEM course assistant built on the Gemma-2b language model, applying Supervised Fine-Tuning and Direct Preference Optimization. Achieved 70% accuracy across several multiple-choice question benchmark datasets, and reduced the model's memory footprint to 2GB, enabling deployment on consumer-grade hardware.
Applied reinforcement learning with large language models to Wordle and similar few-step environments, focusing on behavioral cloning. Developed and tested a novel reward-weighted behavioral cloning method, demonstrating that it generalizes traditional behavioral cloning and handles datasets of varying quality.
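The core of a reward-weighted objective can be sketched as follows. This is a hypothetical illustration, not the report's exact formulation: the function names and the exponential weighting with a temperature are assumptions. It shows the key property that with equal rewards the weighted loss reduces to plain behavioral cloning (a uniform average of negative log-likelihoods).

```python
import math

def rwbc_weights(rewards, temperature=1.0):
    # Exponentially weight trajectories by reward and normalize so the
    # weights sum to 1. With equal rewards this is uniform weighting,
    # i.e. ordinary behavioral cloning. (Assumed form, for illustration.)
    exps = [math.exp(r / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def rwbc_loss(log_likelihoods, rewards, temperature=1.0):
    # Reward-weighted negative log-likelihood over a batch of trajectories.
    weights = rwbc_weights(rewards, temperature)
    return -sum(w * ll for w, ll in zip(weights, log_likelihoods))

# Equal rewards: identical to plain BC (uniform average of -log-likelihood).
uniform = rwbc_loss([-1.0, -2.0, -3.0], [0.0, 0.0, 0.0])

# Unequal rewards: higher-reward trajectories get more weight.
weights = rwbc_weights([0.0, 1.0])
```

Because low-reward trajectories are down-weighted rather than discarded, an objective of this shape can make use of mixed-quality datasets, which is the motivation the report describes.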
Developed a book recommender system using Graph Convolutional Network architectures, evaluated on a dataset of 6 million book reviews. Given a user's purchase history, the trained model recommends the books that user is most likely to enjoy.
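A single graph-convolution step, the building block of such architectures, can be sketched in pure Python. This toy example is an assumption-laden illustration (mean aggregation with self-loops; no learned weight matrix or nonlinearity): each node's new representation is the average of its own and its neighbors' features, which is how user and book nodes come to share information.

```python
def gcn_layer(adj, features):
    # One graph-convolution step: each node's new feature vector is the
    # degree-normalized average of its neighbors' (and its own) features.
    # A full GCN layer would also multiply by a learned weight matrix
    # and apply a nonlinearity; both are omitted in this sketch.
    n = len(adj)
    out = []
    for i in range(n):
        neighbors = [j for j in range(n) if adj[i][j] or j == i]  # self-loop
        agg = [0.0] * len(features[0])
        for j in neighbors:
            for k, v in enumerate(features[j]):
                agg[k] += v / len(neighbors)
        out.append(agg)
    return out

# Tiny graph: nodes 0 and 1 (say, a user and a book they reviewed) are
# connected; node 2 is isolated and keeps its own features.
adj = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
features = [[1.0], [3.0], [5.0]]
smoothed = gcn_layer(adj, features)
```

Stacking several such layers lets information propagate multiple hops through the user-book graph, so a user's representation ends up reflecting books reviewed by similar users.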