For years, people have obfuscated email addresses, for instance by replacing '@' with '(at)', to keep automated programs from harvesting them. These tricks worked against basic web scrapers, but they hold up far less well against current AI models.
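To make the "basic web-scraping" case concrete, here is a minimal sketch of the kind of pattern matching such a scraper might rely on. The regular expression, the function name `naive_scrape`, and the sample strings are all illustrative assumptions, not code from the project described here.

```python
import re

# A simplified email pattern of the sort a basic scraper might use.
# It is an illustration only and does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_scrape(text: str) -> list[str]:
    """Return every substring that looks like a plain email address."""
    return EMAIL_RE.findall(text)


# Hypothetical sample inputs.
plain = "Contact: jane@example.com"
obfuscated = "Contact: jane (at) example (dot) com"

print(naive_scrape(plain))       # the plain address is found
print(naive_scrape(obfuscated))  # nothing matches: there is no literal '@'
```

The obfuscated form slips through because the pattern depends on a literal '@' and '.', which is exactly the assumption the substitution breaks.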
In a recent project, I attempted to gather data from the 'Ask HN: Who is Hiring' threads on Hacker News. The goal was to organize job offers and related information into a structured Google Sheet for convenient access.
During this data collection, I asked ChatGPT to include the contact information provided in the job postings. Knowing that these emails were often concealed using common obfuscation methods, I expected it to have trouble extracting them.
To my surprise, ChatGPT demonstrated an exceptional ability to decipher the concealed email addresses. Even when multiple obfuscation methods were employed, the AI model adeptly identified and retrieved the intended email addresses with remarkable accuracy.
Ultimately, I decided to exclude the contact emails from the final Google Sheet, as individuals who obfuscate their emails clearly do not wish for them to be publicly accessible.
Let me share with you some intriguing techniques I encountered while reviewing the extracted data:
Splitting Information within the Post:
This approach proved to be quite effective:
MyCompany | https://company.com | Senior Backend Engineer | REMOTE (USA only) | Full Time ...
and feel free to drop me a note: john@[MyCompany domain]
However, it stopped working once I added the "think step by step" magic words to the prompt.
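The split works because the address can only be rebuilt by combining two parts of the post: the company URL in the header and the placeholder in the body. The sketch below spells out that two-step inference in plain code. It is not how ChatGPT does it, and the function `resolve_split_address`, both patterns, and the sample post are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Step 1 pattern: the first http(s) URL in the post.
URL_RE = re.compile(r"https?://[^\s|<>\"]+")
# Step 2 pattern: a local part followed by a bracketed "... domain" placeholder.
PLACEHOLDER_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)@\[[^\]]*domain\]", re.IGNORECASE
)


def resolve_split_address(post: str) -> Optional[str]:
    """Combine the URL's host with the placeholder's local part, if both exist."""
    url_match = URL_RE.search(post)
    placeholder = PLACEHOLDER_RE.search(post)
    if url_match is None or placeholder is None:
        return None
    host = urlparse(url_match.group(0)).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # assume mail uses the bare domain
    if not host:
        return None
    return f"{placeholder.group(1)}@{host}"


# Hypothetical post modelled on the example above.
post = (
    "MyCompany | https://company.com | Senior Backend Engineer | REMOTE\n"
    "... drop me a note: john@[MyCompany domain]"
)
print(resolve_split_address(post))
```

A scraper that inspects each line in isolation never makes this connection, which is why the technique held up until the model was prompted to reason across the whole post.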
Information Indirection:
MyCompany | St Paul, MN, USA | Full Time | REMOTE | Wine and General Open Source Developers | C-language systems programming
<https://www.company.com/about/jobs>
...
Please direct any questions to the email address on our Jobs page.
This method proved highly effective since my code lacked browsing functionality.