Solving CAPTCHAs With Machine Learning to Enable Dark Web Research

A joint academic research project from the United States has developed a method to foil CAPTCHA tests, reportedly outperforming similar state-of-the-art machine learning solutions by using Generative Adversarial Networks (GANs) to decode the visually complex challenges.

Testing the new system against the best current frameworks, the researchers found that their method achieves more than 94.4% success on a carefully curated real-world benchmark dataset, and has proved capable of ‘eliminating human involvement’ when navigating a highly CAPTCHA-protected emerging Dark Net Marketplace, automatically resolving CAPTCHA challenges in a maximum of three attempts.

[Figure: Architecture for DW-GAN. Source: https://arxiv.org/pdf/2201.02799.pdf]

[Figure: Workflow for DW-GAN. Source: https://arxiv.org/pdf/2201.02799.pdf]

The authors contend that their approach represents a breakthrough for cybersecurity researchers, who traditionally have had to bear the costs of supplying humans-in-the-loop to manually solve CAPTCHAs, usually via crowdsourcing platforms such as Amazon Mechanical Turk (AMT).

If the system can prove adaptable and resilient, it may further pave the way for more automated oversight systems, and for the indexing and web-scraping of TOR networks. This could enable scalable and high-volume analyses, as well as the development of new cybersecurity approaches and techniques, which have been hamstrung, to date, by CAPTCHA firewalls.

The paper is titled Counteracting Dark Web Text-Based CAPTCHA with Generative Adversarial Learning for Proactive Cyber Threat Intelligence, and comes from researchers at the University of Arizona, the University of South Florida, and the University of Georgia.

Implications

Since the system, called Dark Web-GAN (DW-GAN, available at GitHub), apparently performs so much better than its predecessors, there is the possibility that it will be used as a general method to overcome the (usually less difficult) CAPTCHA material on the standard web, either in this specific implementation, or based on the general principles that the new paper outlines. Due to limited storage at GitHub, however, it is currently necessary to contact the lead author Ning Zhang in order to obtain the data associated with the framework.

Because DW-GAN has a ‘positive’ mission for breaking CAPTCHAs (much as TOR itself originally had a positive mission for protecting military communications and, later, journalists), and because CAPTCHAs are both a legitimate defense (frequently and controversially used by ubiquitous CDN giant CloudFlare) and a favorite tool of illegitimate dark web marketplaces, the approach is arguably a ‘leveling’ technology.

‘[While] this study is mainly focused on dark-web CAPTCHA as a more challenging problem, the proposed method in this study is expected to be applicable to other types of CAPTCHA without loss of generality.’

Presumably DW-GAN, or a similar system, would need to become widely and evidently diffused in order to prompt dark web markets to seek less machine-resolvable solutions, or at least to evolve their CAPTCHA configurations periodically, a ‘cold war’ scenario.

As the paper observes, the dark web is the primary font of hacker intelligence relating to cyber attacks, which are estimated to cost the global economy $10 trillion by 2025. Yet onion networks remain a relatively safe environment for illicit dark net communities, which can repel boarders by various methods, including session timeouts, cookies, and user authentication.

However, the authors observe, none of these obstacles is as great as the tranche of CAPTCHAs that punctuate the browsing experience in a ‘sensitive’ community:

‘While most of these measures can be effectively circumvented through implementing automated counter measures in a crawler program, CAPTCHA is the most hampering anti-crawling measure in the dark web that cannot be easily circumvented due to high cognitive capabilities that are often not possessed by automation tools’

Text-based CAPTCHAs are not the only available option; there are variants, familiar to many of us, that challenge the user to interpret video, audio, and especially images. Nonetheless, as the authors observe, text-based CAPTCHA is currently the challenge of choice for dark web markets, and a natural starting-place to make TOR networks more susceptible to machine analysis.

Though a prior approach from Northwest University in China used Generative Adversarial Networks to derive feature patterns from CAPTCHA platforms, the authors of the new paper note that this method relies on interpretation of a rasterized image, rather than a deeper examination of letters recognized in the challenge; and that DW-GAN’s effectiveness is not impacted by the variable length of nonsense words (and of numbers) that are typically found in dark web CAPTCHAs.

DW-GAN uses a four-stage pipeline: first the image is captured, and then fed to a background denoising module which uses a GAN that has been trained on annotated CAPTCHA samples, and is therefore able to distinguish letters from the perturbed background that they are resting on. The extracted letters are then further filtered out from any remaining noise after the GAN-based extraction.
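The article does not say which filter handles the residual noise left after the GAN-based extraction. As a rough illustration only, the sketch below swaps in a plain 3x3 median filter, a common way to remove isolated specks from a binarized image. The function name and the list-of-lists image format are assumptions made for this example, not part of DW-GAN.

```python
from statistics import median_low

def median_filter(img, radius=1):
    """Apply a median filter to a 2D list of 0/1 pixels.

    Stand-in for the unspecified noise filter: isolated specks vanish
    because most of their neighbourhood is background.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = median_low(window)
    return out

# A single stray pixel on an empty 5x5 background is removed.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
cleaned = median_filter(speck)
```

A filter this simple also erodes the corners of genuine strokes, which is one reason a real system might prefer something more selective.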

Next, segmentation is performed on the extracted text, which is then broken down into what appear to be constituent characters, using contour detection algorithms. Finally, the ‘guessed’ character segments are subject to character recognition via a Convolutional Neural Network (CNN).
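The segmentation step can be pictured with a small stand-in. Production code would normally call a library contour routine; the sketch below instead uses a connected-component flood fill in plain Python, which yields the same kind of per-blob bounding boxes on a clean binary image. The function name and the box format (x0, y0, x1, y1) are assumptions for illustration.

```python
from collections import deque

def find_char_boxes(binary):
    """Return bounding boxes (x0, y0, x1, y1) of 8-connected blobs,
    sorted left to right, for a 2D list of 0/1 pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)

sample = [
    [0, 1, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
]
boxes = find_char_boxes(sample)
```

Each box would then be cropped and passed to the character classifier. Note that a glyph made of separate parts (the dot of an "i", for instance) comes back as two boxes under this simple scheme.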

Sometimes characters can overlap, a hyper-kerning that’s specifically designed to fool machine systems. DW-GAN therefore uses interval-based segmentation to enhance and isolate borders, effectively separating characters. Since the words are usually nonsense, there is no semantic context to aid in this process.
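One way to read "interval-based segmentation" is that a blob too wide to be a single character gets cut into equal-width intervals. The sketch below works under that assumption. The expected character width is passed in as a parameter, and how DW-GAN actually estimates it is not stated in the article.

```python
def split_wide_boxes(boxes, char_width):
    """Split boxes wider than one character into equal-width intervals.

    boxes: list of (x0, y0, x1, y1) with inclusive pixel coordinates.
    char_width: assumed typical width of a single character, in pixels.
    """
    result = []
    for x0, y0, x1, y1 in boxes:
        width = x1 - x0 + 1
        # Estimate how many characters are fused together in this box.
        count = max(1, round(width / char_width))
        step = width / count
        for i in range(count):
            left = x0 + round(i * step)
            right = x0 + round((i + 1) * step) - 1
            result.append((left, y0, right, y1))
    return result

# One normal box and one box roughly three characters wide.
merged = [(0, 0, 9, 19), (12, 0, 41, 19)]
pieces = split_wide_boxes(merged, char_width=10)
```

Because the words are nonsense, the cut positions here come from geometry alone, with no dictionary available to correct a bad split.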

DW-GAN was tested against CAPTCHA images from three diverse dark web datasets, as well as a popular CAPTCHA synthesizer. The dark markets from which the images originated comprised two carding shops, Rescator-1 and Rescator-2, and a novel set from a then-emerging market called Yellow Brick (which was reported to have later disappeared in the wake of the takedown of DarkMarket).

According to the authors, the data used in testing was recommended by Cyber Threat Intelligence (CTI) experts based on their wide diffusion across dark net markets.

Testing each dataset involved the development of a TOR-facing spider tasked with collecting 500 CAPTCHA images, which were subsequently labeled and curated by CTI advisors.

Three experiments were devised. The first evaluated the general CAPTCHA-defeating performance of DW-GAN against standard SOTA methods. The rival methods were an image-level CNN with preprocessing (grayscale conversion, normalization, and Gaussian smoothing), a joint academic effort from Iran and the UK; a character-level CNN with interval-based segmentation; and an image-level CNN from the University of Oxford in the UK.
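For readers unfamiliar with the preprocessing steps named for the first rival method, here is a minimal plain-Python sketch of grayscale conversion, normalization, and Gaussian smoothing. The luma weights, the 3x3 kernel, and the edge handling are common defaults chosen for this example; the article does not give the settings the rival method actually used.

```python
# 3x3 Gaussian approximation; the weights sum to 16.
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))

def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to luma values (BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]

def normalize(gray):
    """Scale 0..255 intensities into the range 0..1."""
    return [[v / 255.0 for v in row] for row in gray]

def gaussian_smooth(img):
    """Blur with the 3x3 kernel, replicating pixels at the image edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in (-1, 0, 1):
                for kx in (-1, 0, 1):
                    yy = min(max(y + ky, 0), h - 1)
                    xx = min(max(x + kx, 0), w - 1)
                    acc += KERNEL[ky + 1][kx + 1] * img[yy][xx]
            out[y][x] = acc / 16.0
    return out

def preprocess(rgb):
    """Chain the three steps in the order the article lists them."""
    return gaussian_smooth(normalize(to_grayscale(rgb)))

# A single white pixel in the middle of a black 3x3 image.
image = [[(0, 0, 0)] * 3 for _ in range(3)]
image[1][1] = (255, 255, 255)
smoothed = preprocess(image)
```

The blur spreads the bright pixel over its neighbours, which is the intended effect: fine background speckle is softened before the CNN sees the image.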

The researchers found that DW-GAN was able to improve on prior results across the board (see table above).

The second experiment was an ablation study, where various components of the active framework are removed or disabled in order to discount the possibility that external or secondary factors are influencing the results.

Here too, the authors found that disabling key sections of the architecture reduced the performance of DW-GAN in nearly all cases (see table above).

The third offline experiment compared the efficacy of DW-GAN against a benchmark image-based method and two character-level methods, in order to determine the extent to which DW-GAN’s character evaluation influenced its usefulness in cases where a nonsense CAPTCHA word was of arbitrary (rather than predefined) length. In these cases, the CAPTCHA length varied between 4 and 7 characters.

For this experiment, the authors used a dataset of 50,000 CAPTCHA images, with 5,000 reserved for testing in a typical 90/10 split.
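The split arithmetic works out if the 50,000 images are read as the whole dataset: 10% (5,000) is held out for testing and 45,000 remain for training. The sketch below shows such a split over image indices. The shuffling, the seed, and the function name are assumptions for illustration, not details from the paper.

```python
import random

def split_dataset(n_items, test_fraction=0.10, seed=0):
    """Shuffle item indices and hold out a fraction for testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = round(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_dataset(50_000)
print(len(train_idx), len(test_idx))  # 45000 5000
```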

Finally, DW-GAN was deployed against the (then live) Yellow Brick dark net market. For this test, a Tor web browser was developed which integrated DW-GAN into its browsing capabilities, automatically parsing CAPTCHA challenges.

In this scenario, a CAPTCHA was presented to the automated crawler for every 15 HTTP requests, on average. The crawler was able to index 1,831 illegal items for sale in Yellow Brick, among them 1,223 drug-related products (such as opioids and cocaine), 44 hacking packages, and nine forged document scans. In total the system was able to identify 286 cybersecurity-related items, including 102 purloined credit cards and 131 stolen account logins.

The authors state that DW-GAN was in all cases able to crack a CAPTCHA in three or fewer attempts, and that 76 minutes of processing time were necessary to account for CAPTCHAs guarding all 1,831 products. No humans were needed to intervene, and no endpoint failure cases occurred.
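The "three or fewer attempts" behaviour suggests a bounded retry loop inside the crawler. The sketch below shows one plausible shape for it. Every name here (get_challenge, solve, submit) is a hypothetical callable invented for this example and is not taken from the DW-GAN code.

```python
def solve_with_retries(get_challenge, solve, submit, max_attempts=3):
    """Try to pass a CAPTCHA gate within a fixed number of attempts.

    get_challenge: returns a fresh CAPTCHA image (hypothetical).
    solve: maps an image to a guessed string (hypothetical solver hook).
    submit: returns True if the site accepted the guess (hypothetical).
    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        image = get_challenge()
        guess = solve(image)
        if submit(guess):
            return attempt
    return None

# Toy stand-ins: the fake solver is wrong twice, then right.
guesses = iter(["x7k2", "x7kz", "x7k9"])
used = solve_with_retries(
    get_challenge=lambda: "fake-image",
    solve=lambda image: next(guesses),
    submit=lambda guess: guess == "x7k9",
)
```

Returning None rather than raising lets the caller decide what to do on failure, such as falling back to a human solver, although the article reports that no such case arose during the Yellow Brick crawl.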

The authors note the emergence of challenges that offer a greater level of sophistication than text CAPTCHAs, including some that seem modeled on Turing tests, and observe that DW-GAN could be enhanced to accommodate these new trends as they become popular.

This sounds great. I just wish I knew what half of it meant. But, I’m excited.

...the researchers found that their method achieves more than 94.4% success on a carefully curated real-world benchmark dataset

["carefully curated"] == ["cherry picked"]

Neil Diamond: Cherry Cherry.

Why are you excited. Is this a good thing?

I'm not sure what it means. But I think it means the Dark Web will be less dark, more easily monitored.

Lots of bad guys hide on the Dark Web. But also lots of political dissidents, which should include an increasing number of conservatives, the way things are going.

So I'm not sure if this is a good thing.

Good. Is that a letter O or a zero? A one or a small l? Why do they even have those as choices?

6 posted on 01/16/2022 8:51:56 PM PST by Dr. Sivana ("There are only men and women."-- George Gilder, Sexual Suicide, 1973)

G**gle’s reCAPTCHA is a Denial-of-Service mechanism they use to harass netizens who would prefer to surf the web anonymously into using other methods that would allow G**gle to track their browsing habits and collect data on them.

Like everything else G**gle does, it’s part of their effort to create an all-seeing, all-knowing, all-controlling one-world government.

Like everything else G**gle does, it’s evil posing as something beneficial.

All I know is that I absolutely HATE Captcha & reCAPTCHA; especially on sites where you are already registered and have a password.

I quit going to some of my usual sites that started using it, including ones I had spent lots of money at.

As far as I’m concerned it is just another evil Google trick.

They always use the ‘protecting you from pedos and terrorists’ angle. But there is nothing magical about this technique that will only allow it to be used on bad guys.

“This sounds great. I just wish I knew what half of it meant. But, I’m excited.”

Why?

11 posted on 01/17/2022 4:26:10 AM PST by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

12 posted on 01/17/2022 4:27:06 AM PST by Chickensoup ( Leftists totalitarian fascists are eradicating conservatives)

” I’m not sure what it means. But I think it means the Dark Web will be less dark, more easily monitored.

Lots of bad guys hide on the Dark Web. But also lots of political dissidents, which should include an increasing number of conservatives, the way things are going.”

What can be used on the darkweb can also be used on the indexed web... So that means anywhere they like, it is not limited to just the “Evil darkweb”.

13 posted on 01/17/2022 4:29:47 AM PST by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

Q&A “What color is an Orange” kept bots out of our site for a year and a half before a human finally loaded the automated answer in his bot. All we had to do was make a new Q&A.

14 posted on 01/17/2022 4:42:18 AM PST by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

What if they used words that were considered politically incorrect? From day to day?

15 posted on 01/17/2022 5:33:37 AM PST by daniel1212 ( Turn to the Lord Jesus as a damned+destitute sinner, trust Him to save + be baptized + follow Him!)

“They always use the ‘protecting you from pedos and terrorists’ angle.”

Absolutely, it will be used across the board for everything as they like.

16 posted on 01/17/2022 6:28:28 AM PST by Openurmind (The ultimate test of a moral society is the kind of world it leaves to its children. ~ D. Bonhoeffer)

The crooks found the easiest way to beat CAPTCHA images is simply to pay low wage workers to decode them.

EcoTech article on CAPTCHA jobs

19 posted on 01/17/2022 8:15:28 AM PST by Albion Wilde (If science can’t be questioned, it’s not science anymore, it’s propaganda. --Aaron Rodgers)

I get these computer voice telephone calls.

They do pretty good till I ask them how many moons does Jupiter have.

. . . I’ll connect you to a manager . . .

20 posted on 01/17/2022 8:38:34 AM PST by Scrambler Bob (My /s is more true than your /science (or you might mean /seance))

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.