
Data Extraction Engineer

PST.AG

PST.AG is a premier provider of bespoke data and software solutions on a global scale. We collaborate with legal authorities worldwide to develop and distribute customs and global trade data, aiding a variety of industries. Our mission is to empower businesses to adapt and excel in a dynamic global landscape.

We're a growing medium-sized enterprise with a global team and an almost fully remote work culture. PST.AG was founded in the heart of Germany; today our diverse team spans more than 20 nations across four continents, and we're poised for further growth.

Committed to open-source technology, we actively contribute to and support the open-source community, believing it fosters innovation and collaboration.

Key Responsibilities

Specification-Driven Extraction Engineering

• Design and maintain declarative extraction specifications—using Pydantic models, JSON schemas, or domain-specific languages—that describe exactly which fields to capture, their types, and their validation rules (a minimal sketch follows this list).

• Implement pipelines that translate these specifications into executable extraction plans, leveraging both classical (Scrapy, Playwright) and AI-augmented (LLM-based semantic parsing) backends.

• Build reusable specification libraries for recurring data types (product prices, tariff codes, regulatory texts) to accelerate onboarding of new sources.
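
As a rough illustration, a declarative specification in this style might look like the following Pydantic sketch (the TariffLine model and its fields are invented for illustration, not an actual PST.AG schema; Pydantic v2 is assumed):

    # A minimal sketch of a declarative extraction specification (Pydantic v2).
    from decimal import Decimal
    from pydantic import BaseModel, Field, field_validator

    class TariffLine(BaseModel):
        """Declares which fields to capture, their types, and validation rules."""
        hs_code: str = Field(description="Harmonized System code, e.g. '0101.21'")
        duty_rate: Decimal = Field(ge=0, description="Ad valorem duty rate in percent")
        country: str = Field(min_length=2, max_length=2, description="ISO 3166-1 alpha-2 code")

        @field_validator("hs_code")
        @classmethod
        def hs_code_is_numeric(cls, v: str) -> str:
            if not v.replace(".", "").isdigit():
                raise ValueError("HS code must contain only digits and dots")
            return v

A pipeline can then derive an executable plan from the model itself: TariffLine.model_json_schema() yields a JSON Schema that either a classical Scrapy pipeline or an LLM-based semantic parser can be driven from.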

Autonomous & Self-Healing Systems

• Deploy self-healing spiders that automatically detect website layout changes and repair themselves using Model Context Protocol (MCP) servers (e.g., Scrapy MCP Server, Playwright MCP).

• Integrate semantic extraction (Scrapy-LLM, custom LLM pipelines) to eliminate selector brittleness—spiders rely on field descriptions, not fragile XPaths (a fallback sketch follows this list).

• Orchestrate complex, multi-step browsing workflows with agentic frameworks (BMAD/TEA, AutoGPT-like agents) that reason about page state, adapt to anti-bot measures, and correct their own behaviour in real time.
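
The selector-free idea above can be sketched as a fallback pattern: try a cheap classical selector first, and when the layout has changed, hand the page to an LLM driven by field descriptions. Here llm_extract is a hypothetical stand-in for Scrapy-LLM or a custom pipeline, and the selectors and field names are illustrative:

    # Sketch only: description-driven extraction with an LLM fallback.
    FIELD_DESCRIPTIONS = {
        "price": "the current product price as a plain number, no currency symbol",
        "sku": "the seller's stock-keeping unit identifier",
    }

    def llm_extract(html: str, descriptions: dict[str, str]) -> dict:
        """Hypothetical helper: ask an LLM to return the described fields as JSON."""
        raise NotImplementedError("wire up Scrapy-LLM or a custom LLM backend here")

    def extract(response) -> dict:
        # Cheap, classical path first (Scrapy response API).
        price = response.css("span.price::text").get()
        if price is not None:
            sku = response.css("[data-sku]::attr(data-sku)").get()
            return {"price": price, "sku": sku}
        # Layout changed: fall back to semantics, not fragile XPaths.
        return llm_extract(response.text, FIELD_DESCRIPTIONS)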

Platform Thinking & Reusability

• Move beyond one-off scrapers: build a component-based extraction platform where selectors, login handlers, and pagination logic are shared, versioned, and tested.

• Implement monitoring, alerting, and automatic rollback for failed extraction runs.

• Champion ethical crawling by design—rate limiting, robots.txt compliance, and GDPR/CCPA requirements are built into the specification layer, not retrofitted (as sketched below).
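
"Built into the specification layer" can be as simple as making politeness part of every source spec and mechanically translating it into crawler settings. In this sketch the SourceSpec model is invented for illustration; the Scrapy setting names are real:

    # Sketch: compliance defaults live in the spec, not in individual spiders.
    from pydantic import BaseModel

    class SourceSpec(BaseModel):
        start_url: str
        download_delay: float = 1.0         # seconds between requests (rate limiting)
        obey_robots_txt: bool = True        # non-negotiable default
        stores_personal_data: bool = False  # True triggers a GDPR/CCPA review gate

    def scrapy_settings(spec: SourceSpec) -> dict:
        """Translate the spec into Scrapy settings so compliance cannot be skipped."""
        return {
            "DOWNLOAD_DELAY": spec.download_delay,
            "ROBOTSTXT_OBEY": spec.obey_robots_txt,
            "AUTOTHROTTLE_ENABLED": True,
        }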

Collaboration & Continuous Innovation

• Partner with data scientists and domain experts to refine extraction specifications for complex, unstructured domains (e.g., legal texts, tariff classifications).

• Evaluate and pilot emerging tools to push automation coverage beyond 90%.

• Document and evangelise specification-driven best practices across the engineering organisation.

Key Skills

  • Specification-driven extraction
  • Documentation skills
  • Problem solving
  • Data transformation
  • Troubleshooting

What's great in the job?

    • Above-market salary
    • 52 days of paid time off per year
    • A collegial, modern, and supportive working atmosphere at a market leader
    • A young, dedicated, and multicultural team spread across the globe
    • Flexible working hours and remote working opportunities
    • Flat hierarchies
    • A wide range of training opportunities

Your experience

  • Bachelor’s degree in Computer Science
  • 3+ years of experience in web scraping or data extraction

Must-haves

  • Experience with specification-driven extraction
  • Hands-on use of Scrapy-LLM, Scrapy MCP Server, or similar systems that decouple field definitions from page structure
  • Familiarity with frameworks that give LLMs browser control (Playwright + MCP, BMAD/TEA) to handle complex, non-deterministic crawling tasks
  • Classical scraping fundamentals: HTTP, the DOM, XPath, and CSS selectors
  • Data validation & storage: the ability to define validation rules within specifications and land clean data in SQL/NoSQL databases or data lakes (see the sketch after this list)
  • Basic API integration and authentication flows
  • Experience with Python
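
For the validation-and-storage point, one possible shape, reusing the TariffLine spec from the earlier sketch (sqlite3 stands in for any SQL target; the table layout is illustrative):

    # Sketch: the spec is the gatekeeper between raw scrapes and the database.
    import sqlite3
    from pydantic import ValidationError

    def land(rows: list[dict], db_path: str = "extract.db") -> int:
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS tariff (hs_code TEXT, duty_rate TEXT, country TEXT)"
        )
        landed = 0
        for raw in rows:
            try:
                row = TariffLine.model_validate(raw)  # reject anything off-spec
            except ValidationError:
                continue  # in production: route to a dead-letter queue instead
            con.execute(
                "INSERT INTO tariff VALUES (?, ?, ?)",
                (row.hs_code, str(row.duty_rate), row.country),
            )
            landed += 1
        con.commit()
        con.close()
        return landed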

Nice-to-haves 

  • Contributions to open-source scraping or AI-automation projects
  • Familiarity with data privacy engineering (GDPR, CCPA) baked into specification design.
  • Light DevOps skills: Docker and CI/CD pipelines for testing extraction specifications

Trainings – Improve your skills on a daily basis

Perks – Enjoy the stability of a full-time, permanent position

WFH – A fully remote opportunity

Inclusive Team – Always feel at home