At Littlebird, we're building systems that understand the world through text. The quality of our data is the foundation of our intelligence. We're looking for a pragmatic, algorithm-focused engineer to take on the critical challenge of cleaning and refining our raw text data at scale.
This isn't a standard data-plumbing role. You will design and build the core systems that transform noisy, semi-structured text into clean, coherent documents. This involves tackling complex problems like:
- Content De-duplication: Architecting systems to identify and merge near-duplicate or overlapping text content using techniques like shingling, MinHash, or other similarity algorithms.
- Signal & Noise Separation: Developing robust methods to strip non-essential content (e.g., UI boilerplate, ads, navigation) from raw inputs, using a combination of heuristics, pattern matching, and lightweight models.
- Text Normalization: Creating and optimizing high-performance pipelines that clean and structure text for downstream consumption by our core product and ML models.
The ideal candidate is a strong backend engineer working in Python who enjoys reasoning from first principles and has a deep appreciation for the craft of writing efficient, well-tested code. You should be comfortable designing algorithms, managing data pipelines with caching (Redis) and asynchronous processing, and making pragmatic trade-offs to solve ambiguous, real-world data problems.
If you are passionate about the foundational challenge of creating pristine data from messy inputs, let's chat.