SUMMARY: Scale’s Remote Labor Index measures whether AI agents can complete real paid freelance projects end to end using briefs and files from platforms like Upwork. The benchmark spans 240 projects across 23 domains, including software development, design, architecture, data analysis, game development, and video work, with the human labor originally worth more than $140,000. The described automation rate of about 16% means current frontier models can fully handle a limited share of remote work tasks, but most projects still require humans.

REMOTE LABOR INDEX IS NOW EXPONENTIAL!! CLAUDE FABLE 5 SCORES OVER 16%, CLAUDE OPUS 4.8 DOUBLES ITS PERCENTAGE OVER OPUS 4.6. AI CAN NOW DO 16% OF REMOTE WORK. SERIOUS RISK FOR JOB DISPLACEMENT COMING SOON‼️‼️💨💨🚀🚀🔥🔥

https://labs.scale.com/leaderboard/rli

The Remote Labor Index, or RLI, is basically trying to answer a much more practical question than most AI benchmarks: can an AI agent actually do real paid remote work from start to finish?

Instead of testing models on isolated coding problems, math questions, or multiple-choice exams, RLI uses real freelance projects from platforms like Upwork. These are actual end-to-end jobs where a human freelancer was paid to produce a final deliverable. The benchmark gives the AI the project brief, files, and materials, then checks whether the AI can produce something that would be acceptable compared to the human freelancer’s work.

The main score is called Automation Rate. That means the percentage of projects where the AI’s output is judged good enough that a reasonable client would accept it instead of the human-made version. So a score of 16% means the AI successfully completed around 16 out of 100 real freelance-style projects at an acceptable level.
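
To make the arithmetic concrete, here is a minimal sketch of how a score like this could be computed. The function name `automation_rate` and the list of pass/fail verdicts are made up for illustration; this is not Scale's actual evaluation code, just the percentage definition described above.

```python
def automation_rate(verdicts):
    """Return the percentage of projects judged acceptable.

    `verdicts` is a hypothetical list of booleans, one per project:
    True if the AI deliverable was judged acceptable, False otherwise.
    """
    if not verdicts:
        raise ValueError("need at least one judged project")
    return 100.0 * sum(verdicts) / len(verdicts)


# Illustrative data only: 16 accepted deliverables out of 100 projects.
example_verdicts = [True] * 16 + [False] * 84
example_rate = automation_rate(example_verdicts)
print(f"Automation Rate: {example_rate:.1f}%")
```

Note that every project counts equally here, so a small logo job and a big software build would move the score by the same amount. That equal weighting is an assumption of this sketch.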

That’s why this benchmark feels more important than a lot of the usual AI benchmarks. It’s not asking “is the model smart?” in some abstract way. It’s asking: how much real economic work can this thing actually automate?

The dataset is also pretty broad. Scale says RLI is based on 240 real paid freelance projects across 23 domains, including things like software development, design, architecture, data analysis, game development, and video/media work. The original human work represented over $140k of paid freelance labor, so the benchmark is grounded in real economic value rather than artificial test questions.

The interesting part is that the scores were tiny at first, like low single digits. But if newer frontier agents are now hitting the mid-teens, that’s a pretty big jump. It still means most real projects are not automated yet, but the rate of improvement is what makes it worth watching.
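
As a rough back-of-the-envelope check on the figures quoted above, this sketch turns the percentage into project counts. It assumes exactly 240 projects, a score of exactly 16%, and a total of exactly $140,000 (the stated figure is "over $140k", so the per-project average is only a lower bound). The variable names are just for illustration.

```python
# Figures quoted in the post; treated as exact here for simplicity.
TOTAL_PROJECTS = 240
TOTAL_VALUE_USD = 140_000  # lower bound, the post says "over $140k"
ASSUMED_RATE = 0.16        # "about 16%", assumed exact for this sketch

# Roughly how many of the 240 projects a 16% score corresponds to.
automated = round(TOTAL_PROJECTS * ASSUMED_RATE)
not_automated = TOTAL_PROJECTS - automated

# Lower bound on the average amount paid per project.
avg_value = TOTAL_VALUE_USD / TOTAL_PROJECTS

print(f"Projects handled by the AI: about {automated}")
print(f"Projects still needing a human: about {not_automated}")
print(f"Average project value: at least ${avg_value:,.2f}")
```

This says nothing about which projects were automated or how much they were worth, so it can't be turned into a dollar figure for automated work without the per-project results.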

Basically, RLI is one of the better benchmarks for tracking AI’s progress toward replacing actual remote work.