Hi,

Web Data Engineer | Web Scraping & Data Extraction

I build Python systems that automatically extract data from websites, PDFs, and APIs, clean it, and deliver it in the format you need (CSV, Excel, or Google Sheets).
I focus on scraping dynamic sites, handling JavaScript-rendered content, pagination, and anti-bot challenges.
My solutions reduce manual effort by 70–80%, delivering consistent, reliable data so teams can focus on insights instead of data wrangling.

Let’s connect and streamline your data needs!

About Me

Michael


I’m a Web Data Engineer focused on building reliable extraction systems that automatically pull data from websites, PDFs, and APIs, then clean and structure it into usable formats such as CSV, Excel, or Google Sheets.

Most of my work involves Python and Playwright to solve scraping challenges on dynamic websites by handling JavaScript-rendered content, pagination, session states, and access restrictions. Where possible, I use direct API integration to reduce overhead, and PDF parsing and OCR tools to process large volumes of complex PDF data.
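The core pattern behind most of these systems is a pagination-and-extraction loop. A minimal sketch, with a stubbed `fetch_page` callable standing in for the real Playwright or API call:

```python
def scrape_all_pages(fetch_page, max_pages=100):
    """Walk paginated results until the source reports no next page."""
    records, page = [], 1
    while page <= max_pages:
        items, has_next = fetch_page(page)
        records.extend(items)
        if not has_next:
            break
        page += 1
    return records

# Stubbed fetcher for illustration: two pages, then no "next" link
def fake_fetch(page):
    data = {1: (["listing-a", "listing-b"], True), 2: (["listing-c"], False)}
    return data[page]
```

In a real run, `fetch_page` would render the page, wait for the listing selector, and return the parsed rows plus whether a next-page control exists.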

I design systems that reduce manual data handling by 70–80% while improving consistency and reliability, enabling immediate business impact.

My approach is shaped by my background in mathematics and years of teaching, which trained me to break down problems and choose the most efficient solution, whether through API access or parsing-based extraction.

The result is simple: cleaned, structured, reliable data delivered in the format you need.

If you’re working with data that’s difficult to collect or organise, I can help simplify it.

My Skills

My Technical Skills

Soft Skills

Data Projects

My Recent Projects

Resilient Multi-Page Property Listings Scraper for NigeriaPropertyCentre

Project Summary: Designed and implemented a data extraction solution for NigeriaPropertyCentre.com using Playwright and Python. The system handles dynamic content rendering, pagination, and configurable search filters (listing type, city, bedroom count) to extract structured property data including pricing, location, and key features such as agent info, bedrooms, bathrooms, parking spaces, and floor area in square metres. Built with a focus on reliability, the scraper incorporates retry logic for intermittent page failures, logging, dynamic user-agent rotation, and efficient error handling to ensure consistent data collection across multiple pages. Post-processing includes data cleaning, normalisation, and anonymisation of sensitive agent information, with output structured in CSV format. The result is an efficient, reliable, and repeatable extraction pipeline for NigeriaPropertyCentre.

Python

Playwright

Pandas

View Project
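The retry-with-rotation pattern mentioned above can be sketched roughly like this; the user-agent pool is illustrative and the `fetch` callable stands in for the real page request:

```python
import random
import time

# Illustrative pool; a real run would draw from a larger, up-to-date list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
]

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Retry intermittent page failures, rotating the User-Agent each attempt."""
    for attempt in range(retries):
        try:
            return fetch(url, headers={"User-Agent": random.choice(USER_AGENTS)})
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (attempt + 1))  # back off between attempts
```

The production version layers logging and Playwright-specific error types on top, but the control flow is the same.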
Automated Daily Price Intelligence Tracker – Samsung Galaxy A06 (Jumia Nigeria)

Project Summary: A fully automated price monitoring system that tracks Galaxy A06 listings on Jumia Nigeria. Runs daily via GitHub Actions, using Playwright with stealth enhancements for real-time extraction and pandas for historical tracking with price change detection. Automatically classifies each price movement (new, increased, decreased, no change) while preserving complete historical records. Anti-bot resilience comes from playwright-stealth integration, auto-generated user agents to mask identity, human-like browsing behavior to evade detection, and retry logic with exponential backoff. It also includes robust error handling, structured logging, and a controlled execution window of 4 weeks and 3 days. This solution delivers continuous competitive intelligence on pricing dynamics, ensures reliable data collection, and provides a scalable foundation for advanced analytics such as dashboards or predictive modeling.

Python

Playwright

Playwright-Stealth

Pandas

GitHub Actions

View Project
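The price-movement classification step reduces to a small comparison against the previous day's record; a minimal sketch:

```python
def classify_move(prev_price, curr_price):
    """Label a day's price movement for the historical tracking record."""
    if prev_price is None:
        return "new"          # first time this listing appears
    if curr_price > prev_price:
        return "increased"
    if curr_price < prev_price:
        return "decreased"
    return "no change"
```

In the pipeline, this label is appended alongside the date and price so the full history stays intact.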
Movoto Property Listings Data Pipeline

Project Summary: A Python-based web scraping pipeline using Playwright to extract structured real estate listing data from Movoto, with a configurable city parameter (currently set to Phoenix, AZ). The scraper programmatically navigates JavaScript-rendered listing pages, iterates through pagination, collects property URLs, and extracts key details from individual listings including address, price, bedrooms, bathrooms, square footage, property type, and year built. Implements dynamic Chrome user-agent generation, controlled navigation timing, and DOM-state synchronization via explicit selector waits, with defensive error handling under dynamic content shifts ensuring high reliability. Output is normalised for CSV export using Pandas, ideal for market research, price monitoring, and real estate analysis.

Python

Playwright

Dynamic User-Agent Generator

Pandas

View Project
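Dynamic Chrome user-agent generation can be as simple as varying the major version in a template string; the version range below is illustrative:

```python
import random

def chrome_user_agent():
    """Generate a plausible desktop Chrome User-Agent string."""
    major = random.randint(118, 126)  # illustrative range of recent major versions
    return ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            f"Chrome/{major}.0.0.0 Safari/537.36")
```

Each browser context gets a fresh string, so repeated runs don't present an identical fingerprint.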
Automated E-Commerce Price & Availability Tracker (Jumia TVs)

Project Summary: Automated the daily extraction of e-commerce product data (price, stock, discount, reviews, ratings, URL) from Jumia, eliminating manual tracking. The scraper ran automatically for 5 days, generating a historical dataset that was cleaned and validated for reliable trend analysis and reporting. The Python script handles scraping, data cleaning, validation, and storage of the data in a structured CSV file ready for analysis.

Python

BeautifulSoup

Requests

lxml

Pandas

View Project
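A typical cleaning step here is turning display prices like "₦ 145,000" into numbers before validation; a minimal sketch:

```python
import re

def clean_price(raw):
    """Strip currency symbols and separators from a display price.

    Returns an integer, or None when no digits are present.
    """
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else None
```

Rows where `clean_price` returns None are flagged during validation rather than silently written to the CSV.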
Thomann UK ST-Style Guitars Dynamic Web Scraper

Project Summary: Built a Python Playwright scraper that simulates real browser behavior to extract ST-style guitar data from Thomann UK, a dynamic e-commerce website. Implemented automated cookie consent handling, pagination control, and reliable extraction from JavaScript-rendered content using safe locator strategies, DOM manipulation, and controlled waits to capture product titles, prices, reviews, ratings, and availability. Output can easily be extended to CSV, JSON, or a database.

Python

Playwright

View Project
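The "safe locator" idea boils down to wrapping each extraction in a fallback so one missing element never aborts the run; a minimal sketch:

```python
def safe_get(getter, default=None):
    """Run an extraction callable, returning a default instead of raising."""
    try:
        return getter()
    except Exception:
        return default
```

In the Playwright version, each field is pulled as, e.g., `title = safe_get(lambda: card.locator("h3").inner_text(), default="N/A")`, so a listing with a missing rating still yields a row.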
LuxuryEstate Property Data Scraper

Project Summary: This Python web scraper extracts luxury property listings from LuxuryEstate.com, focusing on New York City, United States. It navigates pagination, collects unique listing URLs, and parses key details such as title, price, room counts, construction year, amenities, and views using BeautifulSoup for HTML parsing and httpx for robust HTTP requests. Randomized user agents and adaptive sleep delays help reduce detection. The data collected is compiled into a Pandas DataFrame and exported as a CSV file for easy analysis or integration. Ideal for market research or data aggregation tasks and easily configurable for other cities/countries through simple variable updates.

Python

BeautifulSoup

httpx

lxml

Pandas

View Project
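Collecting unique listing URLs across pages is an order-preserving deduplication step; a minimal sketch:

```python
def unique_listing_urls(urls):
    """Deduplicate collected listing URLs while preserving discovery order."""
    seen, out = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out
```

Preserving order matters when the same listing appears on several result pages but you want it parsed exactly once, in the order first seen.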
IMDb Top 250 TV Shows – Asynchronous Web Scraping Pipeline

Project Summary: Engineered an asynchronous web scraping pipeline with Playwright, asyncio, and Pandas to extract web data from IMDb's Top 250 TV series. The pipeline automated navigation of JavaScript-rendered pages, normalized show URLs with regex, and applied resilient error handling for failed requests—achieving reliable data extraction. This initiative produced a clean CSV dataset ready for analytics, visualization, or integration into media recommendation systems.

Python

Playwright

Asyncio

Pandas

View Project
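The URL-normalisation step can be sketched with a single regex; the assumption here is that IMDb title paths follow the `/title/ttNNNNNNN` pattern, with tracking parameters appended:

```python
import re

def normalize_title_url(url):
    """Strip query strings and referral suffixes, keeping the canonical title path."""
    m = re.search(r"(/title/tt\d+)", url)
    return f"https://www.imdb.com{m.group(1)}/" if m else None
```

Normalising before storage means the same show scraped on different days always keys to the same URL.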
Automated 3-Day Crypto Monitoring from Yahoo Finance with Python

Project Summary: This project automates the extraction of real-time cryptocurrency data, including coin name, price, change, change percentage, and volume, from Yahoo Finance every 6 hours over a 3-day period. The goal is to track market behavior and spot short-term trends. The script uses Python libraries to scrape, store and visualize the data in a structured CSV file for ongoing analysis.

Python

BeautifulSoup

Requests

Pandas

Seaborn

Matplotlib

View Project
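The 6-hourly cadence over 3 days works out to 13 collection points; a sketch of the schedule computation:

```python
from datetime import datetime, timedelta

def scheduled_runs(start, every_hours=6, days=3):
    """List the timestamps at which the scraper should fire."""
    runs, t, end = [], start, start + timedelta(days=days)
    while t <= end:
        runs.append(t)
        t += timedelta(hours=every_hours)
    return runs
```

The actual project used a scheduler to trigger the script at these points; the list here just makes the window explicit.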
WhiskyExchange API Data Pipeline

Project Summary: A Python scraper that extracts whisky product data (name, brand, and price) from The Whisky Exchange search API. The scraper uses requests.Session() for persistent connections, handles headers and cookies to mimic browser requests, iterates through paginated results, and stores the structured output in a CSV file using pandas. This project demonstrates skills in web data extraction, JSON parsing, session management, and data structuring for analysis.

Python

API Integration

Requests

Pandas

Session & Cookies Management

JSON Parsing

View Project
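The JSON-parsing stage flattens each API page into tabular rows; a minimal sketch in which the `products`, `name`, `brand`, and `price` field names are assumptions for illustration, not the exact response schema:

```python
import json

def parse_products(payload):
    """Flatten an API search response into rows ready for a DataFrame."""
    data = json.loads(payload)
    return [
        {"name": p.get("name"), "brand": p.get("brand"), "price": p.get("price")}
        for p in data.get("products", [])  # assumed top-level key
    ]
```

Using `.get()` throughout keeps a single malformed product entry from crashing the whole page.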
Customer Value Segmentation & Revenue Behaviour Analysis for E-Commerce Growth

Project Summary: Analyzed 42,992 e-commerce transactions using RFM segmentation to classify customers into top-value, potential, and low-engagement tiers, and to uncover revenue trends across age groups, countries, and product categories. This project delivers a comprehensive view of customer behaviour and revenue dynamics to support strategic growth initiatives. The analysis was conducted using Python to ensure scalable, data-driven insights for decision-making.

Python

Pandas

Matplotlib

Seaborn

Excel

Dashboard

View Full Analysis
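The tiering logic behind RFM segmentation can be sketched with threshold scoring; the cutoffs below are illustrative, not the ones fitted to the 42,992-transaction dataset:

```python
def rfm_tier(recency_days, frequency, monetary):
    """Score a customer on Recency, Frequency, and Monetary value.

    Cutoffs are illustrative; real ones come from quantiles of the data.
    """
    score = (int(recency_days <= 30)   # bought recently
             + int(frequency >= 5)     # buys often
             + int(monetary >= 500))   # spends a lot
    if score == 3:
        return "top-value"
    if score == 2:
        return "potential"
    return "low-engagement"
```

In the actual analysis each dimension was scored on data-driven quantiles rather than fixed thresholds, but the tier mapping is the same idea.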
Customer Churn Prediction using Machine Learning

Project Summary: Developed a Random Forest model to predict customer churn using demographic and behavioral data. Evaluated performance with precision, recall, F1-score, and optimized thresholds to improve churn detection, delivering actionable insights to support proactive business decisions.

Python

NumPy

Pandas

scikit-learn

Random Forest

Accuracy Score

Precision

Recall

F1-score

Precision-Recall Curve

View Model & Code
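Threshold optimisation for churn detection amounts to sweeping candidate cutoffs and keeping the one with the best F1; a dependency-free sketch of that sweep:

```python
def f1_at(y_true, y_prob, thr):
    """F1 score when predicting churn for probabilities >= thr."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= thr and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= thr and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < thr and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, y_prob):
    """Pick the probability cutoff that maximises F1 on held-out data."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: f1_at(y_true, y_prob, t))
```

The project used scikit-learn's precision-recall utilities for this; the sketch shows the underlying computation.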
Natural Gas Storage Contract Valuation & Price Curve Modeling

Project Summary: This project forecasts natural gas prices for any date using historical monthly data and interpolated daily values. It identifies seasonal trends, visualizes monthly patterns, and applies an ARIMA model to predict prices from October 2024 to September 2025, supporting storage contract pricing and trading desk decisions.

Python

Pandas

NumPy

ARIMA

scikit-learn

Seaborn

Matplotlib

View Time-Series Model & Insight
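The monthly-to-daily step is an interpolation between adjacent monthly observations; a simplified linear sketch (the project itself used pandas interpolation):

```python
def interpolate_daily(month_start, month_end, n_days):
    """Linearly interpolate daily prices between two monthly observations."""
    step = (month_end - month_start) / n_days
    return [month_start + step * d for d in range(n_days + 1)]
```

These interpolated daily curves feed both the seasonal visualisations and the ARIMA fit.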
Predictive Credit Risk Assessment – JPMorgan Chase Simulation

Project Summary: Developed a logistic regression model to predict personal loan default probabilities and estimate expected losses (10% recovery), providing actionable insights for credit risk management and capital allocation.

Python

NumPy

Pandas

scikit-learn

View Model & Code
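The expected-loss calculation follows the standard EL = EAD × PD × LGD decomposition, with loss-given-default implied by the 10% recovery assumption:

```python
def expected_loss(exposure, default_prob, recovery_rate=0.10):
    """Expected loss: exposure at default x probability of default x (1 - recovery)."""
    return exposure * default_prob * (1 - recovery_rate)
```

Here `default_prob` is the per-loan probability produced by the logistic regression model.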

Explore skills I bring to the table

My Experience

My Work Experience

Extensive Experience in the Field

A Diverse Portfolio of Projects

Numerous Satisfied Customers

Available for freelance & contract work — Let's talk

My Contact

Contact Me At