Ugur Tuna

Architecting Commercial-Grade Data Solutions.

Senior Data Scientist with a track record of transforming complex, high-stakes data assets—from raw genomics to clinical text—into validated, research-ready, and commercially viable products.

34k+ Participants Analysed
445k+ Images Processed
>99.5% Target Accuracy
View Technical Project Deep Dive

Project Deep Dive: Gut Reaction IBD Data Hub

My Role Lead Data Scientist (Strategy & Curation)
Project Date 23 September 2025
Status Strategic Blueprint Delivered
Core Technologies Python, NLP (spaCy), LLMs, SQL, HPC

Executive Summary

This project involved a forensic analysis of the Gut Reaction Health Data Research Hub, one of the UK's most significant multi-modal datasets for Inflammatory Bowel Disease (IBD). The initial task was to inventory the asset, which comprises linked genomic, clinical, and imaging data for over 34,000 participants.

My investigation revealed that the asset was locked in a partially curated, unusable state due to an outdated, unverified de-identification process. The previous "95% complete" status was misleading and unauditable. I formulated a new, end-to-end blueprint to rebuild the curation pipeline using a state-of-the-art, AI-assisted workflow incorporating NLP and LLMs to achieve certifiable de-identification accuracy of over 99.5%. The outcome is a strategic roadmap to transform this stalled academic asset into a commercially viable, research-grade data product.

Forensic Analysis: Critical Failures of Legacy Pipeline

The previous rule-based (regex) pipeline was insufficient for complex clinical data, primarily because it could not understand context. This created critical data governance risks. My analysis identified several terminal flaws:

  • No Validation Framework: The "95% accuracy" was an assumption, not a measured metric, rendering the entire process unauditable.
  • Inability to Handle Contextual PII: The system could find "Name: John" but would miss PII in narrative sentences like "...patient discussed his case with his brother, John".
  • Unaddressed Image Metadata PII: No process existed to scrub identifiers from the filenames or metadata of the 445,000 histopathology images.
  • No Auditable Logs: A lack of logging made it impossible to trace which files were processed, which failed, or what redactions were made.
  • Brittle & Outdated Codebase: The legacy Python/R scripts lacked modularity and could not be scaled or maintained reliably.

Proposed Architecture: AI-Assisted Curation Pipeline

To address these legacy failures, I designed a modern, robust, and auditable pipeline that moves beyond simple pattern matching to an intelligent, multi-stage process built for scalability and certifiable accuracy.

Ingestion & Harmonisation
Raw data (VCF, PDF, WSI) is ingested into a data lake and standardised to the OMOP Common Data Model.
NLP Pre-processing
Unstructured text is processed with spaCy for Named Entity Recognition (NER) to tag PII in context.
AI-Powered Redaction
An LLM (e.g., Gemini) reviews the NER output to find and redact residual contextual PII, providing near-human verification.
Validation & QA
Output is benchmarked against a 'Gold Standard' dataset. A Human-in-the-Loop (HITL) step reviews low-confidence flags to achieve >99.5% accuracy.
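The Validation & QA stage can be sketched as a benchmark of predicted redaction spans against gold-standard annotations. The function and data below are illustrative assumptions, not the project's actual code; the key design point is real: recall (missed PII) is the governance-critical metric that must clear the >99.5% bar.

```python
def validate(predicted: set, gold: set) -> dict:
    """Compare predicted PII spans with gold-standard spans.

    Spans are (document_id, start, end) tuples.
    """
    tp = len(predicted & gold)   # correctly redacted spans
    fn = len(gold - predicted)   # missed PII — the critical governance risk
    fp = len(predicted - gold)   # over-redaction (hurts data utility)
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return {"recall": recall, "precision": precision,
            "meets_target": recall > 0.995}

# Toy example: the pipeline misses one of three gold-standard spans.
gold = {("doc1", 10, 14), ("doc1", 52, 60), ("doc2", 3, 9)}
pred = {("doc1", 10, 14), ("doc1", 52, 60)}
print(validate(pred, gold))  # recall ≈ 0.667, so meets_target is False
```

Any batch failing the target, or any span below a confidence threshold, is routed to the HITL review queue rather than released.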

Code & Artifacts Browser

This is a representation of the project structure, with links to the relevant GitHub repositories, technical documents, and output files.

Professional Experience

Data Scientist @ NIHR BioResource, Cambridge University

March 2023 - Present

  • Spearheaded the forensic analysis of the "Gut Reaction" IBD data hub, uncovering critical gaps in the legacy de-identification pipeline and preventing use of a high-risk dataset.
  • Engineered secure data pipelines for 10,000+ participants' genomic data, packaging VCFs and manifests to accelerate external research partners' analysis and reproducibility.
  • Delivered a recall-by-genotype system by scripting secure SQL queries against clinical databases, cutting cohort selection from weeks to hours and boosting volunteer identification by 35%.
  • Developed a scalable HLA imputation pipeline by optimising HPC resources, reducing failed runs by 35% and improving data turnaround time by 30%.
  • Mentored junior data scientists in Python and SQL, leading to a 25% reduction in production defects and faster team-wide delivery.

Forecasting Analyst @ FCIT Solution Ltd

2021 - 2023

  • Delivered robust demand forecasts for a catalogue of 600,000 products, reducing inventory stockouts by 18% and excess inventory by 12%.
  • Developed and A/B tested pricing optimisation models for sales and rentals, delivering a verifiable 5% uplift in sales and improved gross margin.
  • Built customer churn prediction models from service histories, enabling targeted outreach that reduced customer attrition by 7%.

Machine Learning Specialist @ XCEPTOR

2020 - 2021

  • Architected an invoice digitisation solution using computer vision and NLP, achieving 96% field extraction accuracy and increasing straight-through processing by 35 percentage points.
  • Integrated the automated pipeline with core ERP systems, reducing payment delays by 22% and streamlining month-end reconciliation.

Marketing Data Analyst (Freelance) @ Patreon

2019 - 2020

  • Orchestrated data-driven campaigns for creators, attracting over 150,000 new followers and increasing creator revenue by 30% through targeted segmentation and channel optimisation.

Technical Toolkit

Languages & Core Libraries

  • Python (Pandas, NumPy, Scikit-learn)
  • SQL (PostgreSQL, BigQuery)
  • R
  • Bash Scripting

Platforms & Infrastructure

  • Cloud (AWS, GCP)
  • High-Performance Computing (HPC)
  • Docker
  • CI/CD (GitHub Actions)

ML & Data Science

  • NLP (spaCy, Transformers, LLMs)
  • Computer Vision (OCR)
  • Predictive Modelling & Forecasting
  • Genomic Data Analysis (PLINK, VCFtools)

Methodologies & Governance

  • Agile / Scrum
  • Data Curation & Validation
  • GDPR & Data Privacy
  • Commercialisation Strategy

About Me

I am a senior data scientist passionate about bridging the gap between raw data potential and tangible commercial or research value. My expertise lies in taking on ambiguous, complex data challenges, conducting deep forensic analysis, and architecting robust, scalable solutions that stand up to rigorous validation.

Beyond the technical, I believe in the importance of clear communication and mentorship. I enjoy translating complex technical concepts for diverse stakeholders and helping junior team members develop their skills, which I've found leads to more resilient teams and higher-quality outcomes.

What's Next?

Get In Touch

I am actively seeking senior roles where I can lead high-impact data projects. If you have a complex challenge that requires both technical depth and strategic vision, I would be delighted to connect.

[email protected]