Alex Peczon

Hey, I'm Alex

Data software engineer. I like turning messy APIs, logs, reviews, surveys, and spreadsheets into pipelines people can actually use.

Favorite language: Python

Where I moved data around for real

Work that sits somewhere between data engineering, applied ML, and product automation.

Future Tilt

Software Engineer
Jul 2025 to Present
San Francisco
  • Collaborated with a Data Consultant to build ETL pipelines with Airbyte, BigQuery, and AWS Lambda, capturing customer behavior and demographic data across 50+ ecommerce clients and 20M+ daily queries.
  • Developed an AI email template platform with React, FastAPI, Docker, Google OAuth, and Klaviyo template logic, transforming Google Sheets campaign calendars into editable emails with Google Drive asset sync.
  • Built customer segmentation and churn monitoring workflows with SQL and Python across Shopify and Klaviyo data to support more targeted marketing strategies.
React FastAPI Airbyte BigQuery AWS Lambda Klaviyo API Google OAuth Docker

Superlinked (Series A)

SIE Demo Software Developer (Contract)
Mar 2026 to May 2026
San Francisco Bay Area · Hybrid
  • Developed a paid launch demo for Superlinked's SIE engine during early access, working with Valentin Marek and Eric Taylor to showcase explainable wine recommendations.
  • Built a RAG wine recommender that used Vivino data, small inference models, OCR, and text embeddings to surface similar wines with clearer reasoning.
  • Shipped a React UI and containerized Python monorepo in Docker, keeping OCR and embedding modules cleanly separated for documentation users.
Superlinked SIE React Python Docker RAG Small Models

USF MAGIC Lab

NLP Research Assistant
Mar 2025 to May 2026
San Francisco
  • Co-authored AeVAA, a modular explainable ABSA framework that models entity sentiment through clause extraction, coreference resolution, relation extraction, graph construction, and survey-derived sentiment formulas.
  • Designed a human annotation study with 36 participants and 3,900+ sentiment judgments across action, association, ownership, and temporal aggregation relationships, deriving interpretable formulas that explained roughly 50% of sentiment variance.
  • Evaluated AeVAA on SemEval 2014, achieving 78.58% restaurant accuracy and 68.52% laptop accuracy while producing explanatory traces that identified sentiment attribution and modifier extraction errors.
  • Built concurrent ETL pipelines across 20,000+ news articles into DuckDB relationship datasets for entity-level sentiment evaluation and media bias research.
Go Python NLP PyTorch ModernBERT NetworkX DuckDB SemEval

Alaris Security (Pre seed)

Junior Fullstack Engineer
Aug 2025 to Nov 2025
San Francisco
  • Designed orchestration workflows with Prefect and Airflow to normalize CrowdStrike, Elastic, and Microsoft Defender telemetry for an agent driven cybersecurity analysis platform that reduced investigation time from 10 hours to approximately 1 hour.
  • Built customer facing compliance tooling with Next.js, React, and React PDF, generating NIS2 reports that linked agent generated findings from millions of raw SOC2 logs back to source telemetry.
  • Established reusable tRPC service patterns across a Next.js monorepo, consolidating 50+ frontend database calls into typed backend procedures.
Next.js React PDF tRPC Prefect Airflow CrowdStrike Microsoft Defender
WeWork build room
Office view
Top view

Future Tilt

Software Engineering Intern
Jun 2025 to Aug 2025
San Francisco
  • Built a Lambda campaign orchestration service that syncs Google Sheets planning calendars with Klaviyo campaigns and Trello production tasks, cutting setup time by 50%.
  • Worked across Google Sheets, Klaviyo, Trello, and AWS Lambda to turn campaign planning data into production ready automation.
AWS Lambda Google Sheets Klaviyo API Trello API Automation

Candle Stories

Production Assistant
Apr 2025 to Aug 2025
San Francisco
  • Supported documentary shoots, equipment handling, and on set logistics. Less data pipeline, more real world pipeline.
Production Logistics

USF Strategic Enrollment Management

Data Analyst / Web Intern
Jul 2024 to Jul 2025
San Francisco
  • Transferred to USF in 2024 from UC Merced because I wanted to learn in San Francisco, be closer to working engineers and startup builders, and make a name for myself in a real tech ecosystem.
  • Applied to 100+ roles before starting at USF and converted that search into a Data Analyst / Web Intern role supporting enrollment marketing, admissions analytics, and prospective-student web operations.
  • Conducted year-over-year exploratory analysis on 500,000+ SLATE prospect records, identifying COVID-era enrollment declines, geographic acceptance and declination trends, and regional recruitment event patterns linked to stronger student conversion.
  • Built a Go microservice that canonicalized inconsistently named recruitment events from raw CSV exports, surfacing attendance gaps and event performance metrics used by admissions leadership to evaluate which events to host or cut.
  • Automated recurring prospective-student website updates with Python and Jinja2 workflows over a legacy SLATE-hosted HTML and React system, eliminating manual copy-paste updates across enrollment marketing pages.
Go SQL Pandas SLATE Jinja2 React Admissions Analytics

iD Tech Camps (Stanford)

Machine Learning Instructor
Jun 2024 to Aug 2024
Stanford, California
  • Taught project based Python and machine learning lessons to high school students at Stanford, covering neural networks, NumPy, Pandas, and Keras.
Python PyTorch Keras NumPy Pandas

UC Merced to SATAL

Data Analyst Intern
Aug 2023 to May 2024
Merced
  • Worked as a Data Analyst Intern at SATAL UC Merced, using survey data and student feedback to improve enrollment support, course experiences, and faculty decision making.
  • Built a survey response normalization pipeline using Pandas and OpenAI assisted categorization to bucket thousands of open ended Qualtrics responses into structured themes, transforming unstructured student feedback into analyzable datasets for faculty action.
  • Analyzed survey responses tied to student grade outcomes across lab sections and lectures. Split fieldwork across a team of 6 to run weekly focus groups and large scale surveys, then compiled the findings into weekly reports for 5+ faculty members, contributing to measurable improvements in student outcomes and faculty relationships.
  • Presented research on methodology at the Fresno State Exemplary Practices in Higher Education Conference.
Pandas Qualtrics OpenAI Survey Analysis Research Methods

Acme Builders Incorporated

Construction Worker → Accounting Assistant
May 2021 to Dec 2023
Oakland · On site · Part time
  • Built internal data systems in Python with NumPy and Pandas to clean, organize, and standardize records across departments.
  • Updated, organized, and archived company documents to support payroll cycles, budgeting, and reliable business data management.
  • Used OCR workflows to reduce manual document sorting and make scanned account records easier to organize.
Python Pandas NumPy OCR Business Data Accounting Construction

Projects

These are mostly passion projects that I made with friends.


nextsteamgame.com

A semantic Steam recommender for 80,000+ games built around the idea that games should match by what they are, not only by player overlap. The pipeline filters up to 2,000 reviews per game, classifies useful review pools with ModernBERT, extracts identity tags and focus vectors, canonicalizes noisy generated tags, then precomputes candidate relationships so users can rerank recommendations by soundtrack, setting, mechanics, narrative, pacing, and vibe.

Long term PostgreSQL ChromaDB Qdrant ModernBERT FastAPI Docker
Superlinked
Series A

Superlinked Wine Recommender

A wine recommender developed with the Superlinked team during early access to their SIE engine. It uses document processing, vector embeddings, and small model inference to explain why a result appears, whether the match came from fizz, cherry notes, body, acidity, or other wine attributes.

Long term Superlinked SIE Vector Search OCR Small Models Chroma PostgreSQL
2nd Place

Maldemic Simulator

We built Maldemic to help close the gap between researchers and the public. Disease models can feel locked behind papers and equations, so we turned SIR dynamics and Markov chain mobility into a 3D globe people can watch, question, and reason about. Python computes the stochastic population transitions, then Godot makes the spread visible for public education.

Long term Python NumPy SciPy Godot Markov Chains SIR
Hackathon

Next Chapter

A hackathon project built to make retirement questions feel less foggy. Users can ask things like "Can I retire in the Philippines?" or "How much should I start saving?" and the system answers with retrieved context and visible data instead of pretending a prompt is a financial plan.

RAG LLMs FinTech Personal Finance AI for Good
Open Source

Antidote Intelligence

An open source ML security project that treats training data as the place where model risk often starts. The system uses a multi agent analysis pipeline to inspect dataset content, generate hypotheses, and surface examples worth investigating before bad data becomes expensive behavior.

Long term Python OpenAI ML Security Data Quality Agent Pipeline
In Progress

Dreamville

A gamified Canvas LMS tracker that pulls assignments into a game loop, then scores urgency from completion patterns and difficulty signals. The useful part is turning school workflow data into a next action system students can act on without another dashboard yelling at them.

Long term Godot Go Canvas API Regression Workflow Data
Hackathon

Hyper Rosen

A Godot experiment, built at a hackathon, in systems that can keep expanding. Swirled Perlin noise places planets, wave function collapse handles city placement, and procedural rules create enemies and asteroids, making the project feel like a small galaxy generated from reusable data rules.

Godot Hackathon Procedural Generation Perlin Noise Wave Function Collapse
GDC
GDC Jam

Cake Walk

A fast game jam pitch: make a tiny character readable, charming, and playable in a single day. We built and demoed Cake Walk at GDC Festival of Gaming with Keriya Son on 3D, Angie Peczon on art, Eric Taylor on shaders, and Ilce Perez on music.

Godot Game Jam 3D Shaders Team Project
First Project, 2022

Old Man Climbs

A small vertical climber built over a weekend for a UC Merced game jam in 2022. It is here less as a technical flex and more as the first shipped artifact: a reminder that finishing a small loop teaches more than endlessly planning a bigger one.

Godot Game Jam 2022
Obsidian

Quick Autocorrect

A small community plugin for reducing friction while writing in Obsidian. It catches repeated misspellings, applies quick corrections, and keeps a personal dictionary for words Obsidian should stop fighting you on: a tiny version of the same pattern I like, cleaning a messy text stream into something easier to use.

Long term TypeScript Obsidian Plugin Text UX

NutriFinder

A small dietary search project with a practical pitch: pull in messy menu and nutrition information, normalize it enough to filter, and give people a cleaner way to decide what they can eat.

React Flask Python Search Filters

Spiral Visualizer

A compact teaching visualization for spiral growth using queued directions. The pitch is simple: when a system changes step by step, showing the state often teaches faster than another paragraph of explanation.

Python Matplotlib Queues

Vlog

Hey hey! Didn't think many people would see this haha. These are basically leftover thoughts from projects I made: recommendation systems, explainable AI, data poisoning, and the parts that did not fit cleanly on a resume.

Comic strip reference for editorial layout
Since I'm feeling newspaper-y here I put a Garfield strip (it's in the public domain).
Case Study May 2026

How I Built a Semantic Recommendation Engine for 80,000 Steam Games

NextSteamGame is a Steam recommendation project built around a simple complaint: most game recommenders know that two games are related, but they rarely explain why. Player-overlap signals are useful, but they flatten intent. Someone may like Persona 5 for the jazz fusion soundtrack and modern Tokyo setting; another person may like it for social simulation and dungeon crawling. Those are different reasons, and a good recommendation system should let users separate them.

Games as vectors

I think games can be represented as weighted profiles: not just genre, but the parts that actually make the game feel like itself.

Music · Setting · Systems · Narrative · Vibe

Persona 5 Royal: jazz fusion, modern urban fantasy, social simulation, dungeon crawling, stylish UI.

Micro-tags normal genres miss
confidant system, persona fusion system, day-night cycle, stylish art direction, modern Tokyo setting, oppression and rebellion, social link, dungeon exploration, jazz fusion soundtrack, character driven narrative

The problem

Most recommendation systems lean on the pattern "players who liked X also liked Y." That works well for popular games, but it struggles with niche tastes and gives weak explanations. I wanted a system that could represent a game as a shape: soundtrack, setting, systems, narrative, vibe, and the small micro-tags that genre labels leave behind.

The pipeline

  • Collect Steam metadata, appids, genres, tags, descriptions, release data, and storefront artwork.
  • Pull up to 2,000 reviews per game, then remove spam and low-signal reviews with regex filters, word diversity scoring, quality heuristics, and descriptive phrase detection.
  • Classify useful reviews with ModernBERT into pools for gameplay, art, soundtrack, systems depth, narrative, and general description.
  • Generate semantic identity data: focus vectors, mechanics, narrative, vibe, structure loop, signature tags, niche anchors, music tags, and micro-tags.
  • Canonicalize noisy generated tags with heuristics, fuzzy matching, embedding similarity, and vector search so tags like fast action, quick action, and high-speed combat can be grouped without losing useful distinctions.
  • Precompute candidate relationships offline, then let the live FastAPI and React app apply user-controlled reranking at runtime.
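The review filtering step in the list above can be sketched roughly like this. It is a minimal stand-in, not the production code: the regex patterns, the `MIN_WORDS` and `MIN_DIVERSITY` thresholds, and the function names are all assumptions made up for illustration, and the real pipeline also uses quality heuristics and descriptive phrase detection that are not shown here.

```python
import re

# Hypothetical spam patterns and thresholds; the real values are not given in this post.
SPAM_PATTERNS = [
    re.compile(r"(.)\1{5,}"),        # one character repeated 6+ times, e.g. "loooooool"
    re.compile(r"https?://", re.I),  # reviews that contain links
]
MIN_WORDS = 8        # drop very short reviews
MIN_DIVERSITY = 0.5  # unique words / total words

WORD_RE = re.compile(r"[a-z']+")


def word_diversity(text: str) -> float:
    """Share of distinct words in the review (0.0 for empty text)."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def is_useful_review(text: str) -> bool:
    """Keep a review only if it passes the spam, length, and diversity checks."""
    if any(pattern.search(text) for pattern in SPAM_PATTERNS):
        return False
    if len(WORD_RE.findall(text.lower())) < MIN_WORDS:
        return False
    return word_diversity(text) >= MIN_DIVERSITY


def filter_reviews(reviews: list[str], limit: int = 2000) -> list[str]:
    """Look at no more than `limit` reviews per game and keep the useful ones."""
    return [review for review in reviews[:limit] if is_useful_review(review)]


sample_reviews = [
    "good good good good good good good good good good",
    "The combat is fast, the soundtrack goes hard, and the boss fights feel like rhythm puzzles",
    "buy cheap keys at http://example.com now now now now now",
    "loooooooool best game",
]
kept = filter_reviews(sample_reviews)
```

With these made-up thresholds, only the descriptive second review survives. Filtering this early keeps low-signal text out of the classifier and the embedding steps that follow.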

Why the architecture is cool

The key design choice is splitting expensive semantic work from cheap interactive reranking. Computing every similarity at runtime would be wasteful, so candidate relationships are built offline. When a user searches, the app retrieves candidates, applies the user's weights, and reranks recommendations based on the profile dimensions they care about.
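Here is a minimal sketch of the runtime half of that split, assuming the offline step has already stored a per-dimension similarity score for each candidate. The `Candidate` class, the `rerank` function, the game titles, and the scores are hypothetical; the live app's actual data model is not described in this post.

```python
from dataclasses import dataclass

# Profile dimensions named in this post.
DIMENSIONS = ("music", "setting", "systems", "narrative", "vibe")


@dataclass
class Candidate:
    title: str
    # Per-dimension similarity to the searched game, assumed precomputed offline (0..1).
    similarity: dict[str, float]


def rerank(candidates: list[Candidate], weights: dict[str, float], top_k: int = 3) -> list[Candidate]:
    """Order candidates by a weighted average of their precomputed similarities."""
    total = sum(weights.get(dim, 0.0) for dim in DIMENSIONS)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    def score(candidate: Candidate) -> float:
        weighted = sum(
            weights.get(dim, 0.0) * candidate.similarity.get(dim, 0.0)
            for dim in DIMENSIONS
        )
        return weighted / total

    return sorted(candidates, key=score, reverse=True)[:top_k]


# Made-up candidates: one matches on soundtrack, the other on systems.
candidates = [
    Candidate("Game A", {"music": 0.9, "setting": 0.2, "systems": 0.3, "narrative": 0.4, "vibe": 0.5}),
    Candidate("Game B", {"music": 0.2, "setting": 0.3, "systems": 0.9, "narrative": 0.5, "vibe": 0.4}),
]
by_music = rerank(candidates, {"music": 1.0})
by_systems = rerank(candidates, {"systems": 1.0})
```

The same two candidates come back in a different order depending on which dimension the user cares about. The runtime work is a handful of multiplications per candidate, which is why the expensive part can stay offline.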

From review to recommendation

A raw review like "the combat is fast, the soundtrack goes hard, and the boss fights feel like rhythm puzzles" becomes structured signals: fast combat, high-energy soundtrack, boss-focused structure, rhythm-like timing, and mechanical precision. Those signals can then be weighted independently by the user.

What it demonstrates

  • 80,000+ Steam games indexed
  • Up to 2,000 reviews analyzed per game
  • Semantic vectors, identity tags, and canonicalized genre/tag relationships
  • 30,000+ users and discovery across 8,000+ unique games
  • A retrieval design cheap enough to run on constrained cloud infrastructure

What I learned

Review text is noisy, so filtering before embeddings matters. LLM-generated tags are useful, but raw generated tags need canonicalization. Most importantly, recommendations feel better when users can inspect and control the reason behind a match instead of accepting a mystery list.

Recommendation Systems Vector Search ModernBERT FastAPI Semantic Retrieval Steam
Friends + Games GDC Game Jam

Cake Walk: A One-Day Game Jam at GDC

Cake Walk at GDC
Cake Walk started as a tiny joke and became a playable floor demo by the end of the day.

Cake Walk was a one-day game jam project I made at GDC with friends. The whole thing was intentionally small: make a little cake cross the street, make it readable, make it charming, and ship something people could actually try.

The shape of the day

Everyone had a lane. We split up character work, art, shaders, music, and gameplay, then kept cutting scope until the core loop was visible. That is the best part of game jams: you cannot hide behind architecture for too long. Either the thing plays or it does not.

Cake Walk group photo
The real artifact was less the game and more the tiny production pipeline we built under pressure.

Why it matters

I keep these projects on the site because they show a different kind of engineering. Hackathons are messy, but they force prioritization, communication, and taste. You learn how much polish can come from a few good decisions when the team is moving fast.

GDC Game Jam Godot Team Project Friends
Friends + Games Hackathon Notes

Hyper Rosen: A Tiny Galaxy From a Hackathon Weekend

Hyper Rosen hackathon photo
Hyper Rosen was one of those weekend builds where the idea was bigger than the time limit, which is kind of the whole point.
The same silent gameplay capture from the project card, dropped into the newsletter so the build feels alive instead of only described.

Hyper Rosen was a hackathon game I made with friends. The pitch was simple and probably too ambitious: build a procedural space game where planets, cities, enemies, and asteroid fields come from generation rules instead of hand placement.

What we tried

The fun part was treating the game like a small systems experiment. Swirled noise placed planets, procedural rules filled out the galaxy, and wave-function-collapse-style logic helped with city layout. It was not polished in the normal product sense, but it had that good hackathon feeling where every hour made the world a little more alive.

Why I still like it

I like projects like this because they make constraints obvious. You learn what actually matters when the deadline is close: readable movement, a loop people can understand, and enough visual feedback that the system feels real even if half of it is held together by deadline energy.

One day

Long term, I still want to make a full Mario Galaxy-style procedural game from this idea: tiny planets, playful gravity, generated worlds, and a sense that the level is wrapping around you. That is probably a post-college version of the project, though. The kind you build when breakfast is no longer mostly oats and coffee.

Hackathon Godot Procedural Generation Friends Game Dev
Update October 2025

Turns Out We Weren't Crazy About Data Poisoning

In December 2024, I built Antidote Intelligence around a simple belief: training data is infrastructure, and poisoned examples can become model behavior if nobody inspects the dataset early enough.

Anthropic, the UK AI Security Institute, and The Alan Turing Institute later published a large-scale poisoning study that makes that concern feel a lot less speculative. Their result: in their experimental setup, as few as 250 malicious documents were enough to introduce a backdoor across models from 600M to 13B parameters.

250 docs: enough to backdoor tested models in Anthropic's denial-of-service setup

Why this reinforced the project

A lot of people think poisoning only matters if an attacker controls a meaningful percentage of the training set. The Anthropic result challenges that. Their finding suggests the absolute number of poisoned documents can matter more than the percentage of the corpus, at least for the narrow backdoor they tested.
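A quick bit of arithmetic shows why the percentage framing is misleading. The 250 figure is the one quoted above; the corpus sizes below are hypothetical round numbers chosen only to show the trend, not figures from the study.

```python
POISONED_DOCS = 250  # the count quoted above


def poisoned_share(corpus_size: int, poisoned: int = POISONED_DOCS) -> float:
    """Fraction of the corpus that the poisoned documents make up."""
    if corpus_size < poisoned:
        raise ValueError("corpus must be at least as large as the poisoned set")
    return poisoned / corpus_size


# Hypothetical corpus sizes, in documents.
for size in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{size:>14,} docs: {poisoned_share(size):.7%} poisoned")
```

The count stays fixed while the share shrinks toward zero, so a defense that only watches for a meaningful percentage of bad data would miss it.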

What Antidote was trying to do

Antidote was not trying to solve all model security. It was a dataset inspection tool: look at examples before they become model behavior, generate hypotheses about suspicious content, and make data quality visible enough for a human to investigate.

The larger lesson

This is the same theme as my recommender and ABSA work: AI systems can be powerful without becoming completely opaque. If model behavior depends on messy upstream data, then inspection, provenance, and explainable intermediate artifacts are not extras. They are part of the system.

Read Anthropic's research post.

Data Poisoning ML Security Dataset Inspection AI Safety Antidote Intelligence
Research Note October 2025

Building Explainable ABSA Without Hiding the Reasoning

AeVAA is a research project about a question I keep coming back to: machine learning and AI are powerful, but can we build systems where the important reasoning stays inspectable?

Aspect-based sentiment analysis usually tries to predict whether a sentence is positive, negative, or neutral toward a target. That is useful, but it often hides the path from text to judgment. AeVAA takes a different route: split the problem into modules, keep intermediate artifacts, and use survey-derived formulas to explain how sentiment moves between entities.

Σ(x)_k = σ(s_k, i_k, r_k)
Sentiment as a function of local score, interaction, and relation context.

The core idea

Instead of asking one model for one answer, AeVAA builds a trace. It extracts clauses, resolves entities, identifies relationships, constructs a graph, and then calculates valence-aware sentiment over that graph. The model can still use black-box components, but the system around them exposes what each component contributed.

Why this matters

Document-level sentiment can miss the point. In a sentence like "the person was bad, but the child was good," the total sentiment is not enough. The meaningful question is who the sentiment is aimed at and why it changed. That becomes even more important for media bias, long-form narrative, and texts where framing matters.
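The example sentence can be made concrete with a tiny sketch. This is not AeVAA's pipeline or its survey-derived formulas: the clause annotations and scores are written by hand, and the function names are made up. It only shows how a document-level average hides what a per-entity view keeps.

```python
from collections import defaultdict

# Hand-written clause annotations with made-up scores. In a real system these
# would come from clause extraction, entity resolution, and a sentiment model.
clauses = [
    {"text": "the person was bad", "target": "person", "score": -1.0},
    {"text": "the child was good", "target": "child", "score": 1.0},
]


def document_sentiment(clauses: list[dict]) -> float:
    """Single number for the whole text: the mean clause score."""
    return sum(c["score"] for c in clauses) / len(clauses)


def entity_sentiment(clauses: list[dict]) -> dict[str, dict]:
    """Mean score per target entity, plus the clauses that produced it."""
    grouped = defaultdict(list)
    for clause in clauses:
        grouped[clause["target"]].append(clause)
    return {
        entity: {
            "score": sum(c["score"] for c in items) / len(items),
            "evidence": [c["text"] for c in items],
        }
        for entity, items in grouped.items()
    }


overall = document_sentiment(clauses)
per_entity = entity_sentiment(clauses)
```

Here `overall` comes out neutral, while `per_entity` keeps a negative score for the person and a positive one for the child, each with the clause that explains it.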

What we built

  • A modular pipeline for constituency clause extraction, entity/coreference resolution, relation and modifier extraction, graph construction, and sentiment aggregation.
  • A human annotation study with 36 participants and 3,900+ sentiment judgments across action, association, ownership, and temporal aggregation cases.
  • Survey-fitted formulas for action, target, association, ownership, and aggregate sentiment dynamics.
  • Explanatory traces that show where errors came from instead of only reporting a final label.

Results

The fitted formulas explained roughly half of the variance in pilot sentiment judgments. On SemEval 2014, AeVAA reached 78.58% restaurant accuracy and 68.52% laptop accuracy. It did not beat state-of-the-art DeBERTa systems, but that was not the point of the prototype. The point was to show that a modular, inspectable ABSA system can produce plausible results and make debugging easier.

The bigger theme

I like projects that score well without becoming total black boxes. The goal is not to reject ML; it is to use ML where it helps, then design the surrounding system so people can inspect the evidence, the intermediate state, and the reason a result appeared.

Explainable AI ABSA NLP Human Annotation SemEval Research
Earlier Post Feb 2025

Maldemic: A Pandemic Model You Can Watch

Maldemic is a stochastic disease spread simulator that turns SIR equations and city mobility into a live 3D globe. I like projects where the math becomes something you can inspect with your eyes.

Data flow

Python computes population movement with a Markov matrix, updates local susceptible/infected/recovered states, then passes the evolving state into a Godot visualization.

Technical shape

  • Markov chain mobility between cities
  • SIR disease dynamics for local spread
  • Population cleanup to keep totals consistent
  • 3D globe rendering for real time visual feedback
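The first two items in that list can be sketched as one simulation step. This is a simplified, deterministic version in plain Python, assuming a row-stochastic mobility matrix and made-up `beta` and `gamma` rates. The actual project uses stochastic transitions, which is where the cleanup step comes in; this expected-value version conserves totals by construction.

```python
def move_population(state, mobility):
    """Redistribute each city's [S, I, R] counts using a Markov mobility matrix.

    mobility[i][j] is the fraction of city i that ends the step in city j,
    so every row is assumed to sum to 1.
    """
    n = len(state)
    moved = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(3):
                moved[j][k] += state[i][k] * mobility[i][j]
    return moved


def sir_step(city, beta, gamma):
    """One discrete SIR update for a single city."""
    s, i, r = city
    total = s + i + r
    if total == 0:
        return [0.0, 0.0, 0.0]
    new_infections = min(s, beta * s * i / total)
    new_recoveries = min(i, gamma * i)
    return [s - new_infections, i + new_infections - new_recoveries, r + new_recoveries]


def step(state, mobility, beta=0.3, gamma=0.1):
    """Move people between cities, then apply local disease dynamics."""
    return [sir_step(city, beta, gamma) for city in move_population(state, mobility)]


# Two hypothetical cities; only the first starts with infections.
cities = [[990.0, 10.0, 0.0], [1000.0, 0.0, 0.0]]
mobility = [[0.9, 0.1], [0.2, 0.8]]
after_one_step = step(cities, mobility)
```

After one step the second city has a small infected population even though it started with none, because mobility carried cases across before the local SIR update ran.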

The project won 2nd place at BLOOM Hackathon and received grant support for neural network forecasting work.

NumPy SciPy Godot 3D Markov Chains SIR Model Simulation
Latest Post December 2024

Data Poisoning Detection at Continue DX

Continue DX presentation header
Continue DX demo: inspecting training data before it becomes model behavior.

I demoed Antidote Intelligence at Continue DX, showing a content aware data poisoning detection system for ML training datasets. The basic pitch: before we argue about the model, let's look harder at the data we fed it.

Why this matters

Training data quality is one of those problems that hides until it becomes expensive. Bad examples, poisoned content, or subtle distribution weirdness can leak into model behavior long before anyone notices.

What I built

  • Multi agent review pipeline for suspicious dataset content
  • Hypothesis generation and validation around poisoned examples
  • Content aware checks instead of only metadata based filtering
  • Reports aimed at making data issues inspectable, not magical

I am interested in this space because it treats data quality as infrastructure. The model gets the attention, but the dataset is where a lot of the story starts.

ML Security Data Quality Training Data Agent Pipeline
Journal Entry August 2024

USF, Sentiment, and Moving Into the City

View from my USF dorm
The view from my dorm at USF. This was the point where school started feeling connected to the city instead of separate from it.

This one is more of a journal entry than a project breakdown.

I transferred to USF in 2024 after UC Merced because I wanted to be in San Francisco. Merced felt too far away from the people and companies I wanted to learn from. I wanted to make a name for myself, be around real builders, and learn from people actually working in tech.

Before I even got to USF, I applied to more than 100 jobs. That search eventually turned into a Data Analyst / Web Intern role, where I worked on enrollment analytics, SLATE data, admissions event cleanup, and prospective-student web updates. It was not glamorous, but it taught me something important: useful software usually starts as messy data, weird processes, and people who need better tools.

Where sentiment came in

Later, at the MAGIC Lab, I worked on AeVAA, an explainable sentiment project. The technical version is about aspect-based sentiment analysis, graphs, coreference, relation extraction, and survey-derived formulas. The personal version is simpler: I was trying to understand how a system could make a judgment without hiding the reasons.

That thread shows up in a lot of my projects. Recommenders should explain why a game matches. Sentiment systems should explain who a sentence is about and why the score moved. Data poisoning tools should show what suspicious examples look like before they become model behavior.

USF campus
USF became the place where those ideas started turning into real projects instead of just things I was reading about.

The pattern

I like AI systems, but I do not like when the answer is the only artifact. The projects I keep coming back to are the ones where the intermediate state matters: vectors, tags, traces, formulas, records, examples, and the evidence behind a prediction.

USF was where that started to become a theme instead of a coincidence.

USF Journal Sentiment Analysis Research San Francisco
Journal Entry August 2023

UC Merced Game Dev Club and My First Data Internship

UC Merced Game Dev Club event
UC Merced Game Dev Club, back when I was trying to get more students to actually start making games.

In 2023, I was the secretary of the UC Merced Game Dev Club. A lot of the work was not glamorous: planning, messaging people, getting rooms, keeping events moving, and making sure students felt like they could show up even if they had never shipped anything before.

We hosted a successful showcase where people brought in games they had been building, talked through what worked, and got to see other students care about the same weird problems: controls, art, music, level design, scope, and how to make a tiny idea feel playable.

Game jams and mixers

We also hosted a game jam that produced some genuinely cool student projects. The best part was watching people form teams quickly and make something real under a deadline. I also helped host a mixer for students who wanted to get started with game development but did not know who to work with yet.

That same year I started as a Data Analyst Intern at SATAL UC Merced. I was using survey data and student feedback to improve enrollment support, course experiences, and faculty decision making. Looking back, both roles were about the same thing: turning scattered student energy into something organized enough that people could act on it.

SATAL Fresno State presentation poster
We ended up presenting this SATAL work at Fresno State, showing how student feedback could become faculty-facing evidence instead of disappearing into end-of-term forms.

That presentation mattered to me because it made the internship feel real. We were not just cleaning survey data for a class assignment; we were turning student perspectives into something instructors could discuss, revise around, and bring back into their courses.

UC Merced Game Dev Club Game Jam SATAL Journal