How AI & proxies drive web scraping

8 min read

As public online data acquisition becomes increasingly important to decision-making, AI, web scraping and proxies will continue to find their way into business activities. While the use of AI in web scraping is relatively new, some data acquisition companies are already harnessing the power of machine learning.

In fact, proxies are already being used in fast-growing industries like ecommerce and cybersecurity, says Tomas Montvilas, chief commercial officer (CCO) at Oxylabs, a proxy service provider:

"In short, proxies act as an intermediary that accepts connection requests from its user and sends them to a destination server. That means that servers - in most cases, plain old websites - think that the proxy is the original source of the request. In web scraping, proxies are mostly used for data request distribution and anonymity.

"There is no way to overstate the importance of proxies for certain business models. Some profit models rely on external data gathering (e.g. Semrush, who do SEO monitoring). These companies essentially sell data analysis software or the data itself.

"However, tried-and-true industries such as retail and financial services are beginning to incorporate public data gathering into their processes. Public data allows these businesses to gain a competitive advantage and drive additional growth.

"Proxies are a necessity for any business that wants to acquire high-quality public data. There are numerous ways they make the entire gathering process more reliable. Certain data is displayed differently based on the perceived location or device of the visitor (e.g. the price of an iPhone in the UK vs the price in Singapore). Proxies allow businesses to gather accurate information by harnessing the power of different IP addresses."

Building blocks of web scraping

Public data gathering, on the face of it, is a rather simple process. An application goes through a list of URLs, downloads the data stored there, and eventually provides an output of everything that has been downloaded. 
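
As a rough illustration, the loop described above fits in a few lines of Python. The URLs are placeholders, and rate limiting, retries and robots.txt checks are omitted for brevity.

```python
# A bare-bones version of the gathering loop: walk a list of URLs,
# download each page, and output everything that was fetched.
import requests

urls = [
    "https://example.com/page-1",  # placeholder URL list
    "https://example.com/page-2",
]

output = {}
for url in urls:
    try:
        output[url] = requests.get(url, timeout=10).text
    except requests.RequestException as exc:
        output[url] = f"download failed: {exc}"

for url, body in output.items():
    print(url, len(body), "bytes")
```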

Montvilas continues: "However, public data gathering processes need consistent access to accurate data. Different types of proxies help applications handle most aspects of data access and accuracy. Businesses looking for a simple solution generally choose between residential and data centre proxies, depending on the data source.

"AI and machine learning-based solutions are still quite rare in the web scraping industry. Currently, machine learning is mostly being used to automate certain tricky processes where trial and error would otherwise be used. For example, with our Next-Gen Residential Proxy solution we have created AI-based models that greatly increase data acquisition success rates for our clients."

There are many proxy types used in web scraping activities. We asked Montvilas to briefly describe the primary types and their main use cases.

Residential proxies

"Residential proxies are the IP addresses of the computers, phones, or other devices granted by ISPs to regular customers. These devices become proxies whenever users install related software and consent to the related terms and services.

"We have sourced our 100 million+ residential proxy pool mostly by using a Tier A+ acquisition model. Put simply, it is the process of gaining IPs from consenting, aware users of a dedicated application and providing a monetary reward to them for any traffic use."

Residential proxies are widely used by businesses that need rotating IP addresses and city-level targeting. "A portion of our residential proxy users are ad verification businesses. Fighting ad fraud means checking various websites from different locations and devices to determine whether ads are being displayed faithfully. Our development teams worked hard to provide global coverage and city-level targeting across our residential proxy pool, making it a great fit for ad verification businesses.

"We predict that proxy use for this business model is only going to increase from here onwards. An unfortunate reality is that ad fraud is on the rise. Predicted costs of ad fraud from 2018 to 2022 may rise from $19 billion to $44 billion. Residential proxies simply cannot be replaced by anything else, necessitating greater use over time if the trends continue. There are even businesses whose model is completely reliant on them. For example, Trivago, a renowned accommodation comparison service, needs residential proxies to accurately deliver location-based pricing."

Next-Gen Residential Proxies

Next-Gen Residential Proxies are a product unique to Oxylabs: an industry innovation that adds AI and machine learning to proxies.

"We developed Next-Gen Residential Proxies as an advanced version of residential proxies for those who are struggling with acquiring public data from complex targets. Our goal with Next-Gen Residential Proxies is to help businesses achieve 100 per cent data delivery success rates, making them perfect for targets with high failure rates such as ecommerce platforms.

"We know that AI & ML have garnered a lot of hype in the IT sector over the recent years. However, hype means nothing if there are no results to show for it. Therefore, in order to ensure the success and effectiveness of our AI & ML innovation, we created an advisory board who guide us during our development processes. Our advisory board is composed of people who are actively involved in PhD level research on AI or are working with companies that are machine learning industry leaders.

"Next-Gen Residential Proxies are proof that AI and machine learning do have their place in public web scraping. Currently, our solution has two primary features that employ AI: dynamic fingerprinting and adaptive parsing. The former is an automated process that picks the best way to send an HTTP request to maximise success rates; the latter is the process of automatically structuring data found in ecommerce product pages and returning a structured result."

Data centre proxies

Unlike residential IPs, data centre proxies are generally created by businesses with access to reliable server infrastructure. Dozens of data centre proxies can be created from a single machine, making them a lot cheaper than their residential counterparts. Additionally, data centres have a faster and more reliable internet connection than any device a regular consumer might own.

"Data centre proxies are the backbone of businesses that need to go through vast arrays of information on a daily basis. Data centre proxies are most commonly utilised in areas where access to data is not geographically restricted and traffic by IP is not as actively tracked. For example, brand protection companies comprise a large portion of our data centre proxy users.

"Performing daily brand protection activities (e.g. scanning the internet for counterfeit products) usually involve web scraping lots of data-heavy websites such as ecommerce platforms. Thus, using data centre proxies with the highest possible speeds and uptime is key to optimal business performance."

Real-Time Crawler

Real-Time Crawler is an out-of-the-box solution for public data acquisition. Instead of a business developing a web data acquisition tool in-house and managing proxies, Real-Time Crawler handles everything except the data analysis itself.

"While Real-Time Crawler is not a proxy, it utilises them to allow its users to perform their requests. Of course, we implement it with all the advancements made with AI and machine learning. For example, Real-Time Crawler takes advantage of AI-powered dynamic fingerprinting, just like Next-Gen Residential Proxies.

"As a solution, Real-Time Crawler can be considered as a data API. Users can use highly customisable HTTP requests to scrape data according to their needs. These requests can contain many different parameters, such as proxy location, device, result language, etc."

All types of businesses use Real-Time Crawler as their primary source of external web data, including any business that needs to monitor search engines, ecommerce platforms, or other websites.

"In ecommerce, data acquired from Real-Time Crawler is often used for pricing tracking and analysis, modelling market trends, and doing platform-specific keyword research. Real-Time Crawler is tailored for those businesses that want to quickly kickstart their public external data gathering without the hassle of managing and maintaining gathering tools.

"Use cases with search engines vary but most are heavily related to SEO. Predictions about optimisation can often be made only with the help of reverse engineering ranking algorithms from data, making Real-Time Crawler a candidate for some SaaS businesses in the SEO industry."

Rising tides in the proxy industry

Proxies are here to stay. With the Covid pandemic accelerating the shift from retail to ecommerce for nearly all businesses, daily proxy traffic is projected only to rise from here onwards.

"Our internal data reveals a meteoric rise of proxy traffic use in Q4 of 2020 alone. During Q4, traffic use increased to previously unseen heights. For example, on Black Friday residential proxy traffic shot up by 301 per cent, while data centre proxy traffic rose by 97 per cent compared to the same period in 2019. Additionally, surges in traffic use rose a week in advance of Black Friday in 2020, compared to a day [in advance] in 2019. Therefore, as we can clearly see, more and more companies are getting involved in public data gathering in order to stay relevant and attain profitable insights.

"Enquiries regarding various ecommerce and scraping aspects, including some well-known names in the industry, rose exponentially over the past year. While Real-Time Crawler hasn't struggled to meet demand, it has been stress tested numerous times by the rising need of data."

Web scraping and proxy use are expected to continue rising as businesses seek to unlock the insights provided by online public data. As AI and machine learning become increasingly popular, the effectiveness of external data acquisition is only going to improve. Businesses that want to keep raising profits will need to, in one way or another, implement public data gathering and analysis.

Tomas Montvilas

Tomas Montvilas is chief commercial officer at Oxylabs, a leading big data infrastructure and proxy solutions provider. He is an expert in organisational growth with over seven years of experience in leadership roles across sales, marketing, product development and digital transformation.
