!!!REPO DEPRECATED!!!
Use the headless-task-server backend and its PHP helper, headless-task-server-php, instead.
Why? The main reasons are:
- Hero (formerly SecretAgent) provides a more stable, patched version of Chrome (not Chromium).
- Hero (formerly SecretAgent) has many built-in techniques that make a headless browser undetectable to bot detectors.
PLAYWRIGHT task server
A Node.js server that runs Playwright to process tasks (mainly crawling).
Concept:
- Express exposes a RESTful API and receives a request (authorized or not) containing the task script
- The task is added to a queue (it runs as soon as a worker is free)
- A separate browser context (incognito environment) is started
- The task script runs in something like an isolated context
- The result of the task script is returned to the Express callback, which answers your request
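The queueing step can be pictured with a small in-memory queue that runs at most a fixed number of tasks at once. This is only a sketch under assumptions: TaskQueue and its methods are hypothetical names, and the real server may schedule its workers differently.

```javascript
// Minimal in-memory task queue: at most `workers` tasks run at once.
// TaskQueue is a hypothetical name, not the server's actual class.
class TaskQueue {
  constructor(workers) {
    this.free = workers; // number of idle workers
    this.pending = [];   // tasks waiting for a worker
  }

  // Enqueue an async function; the returned promise settles with its result.
  add(task) {
    return new Promise((resolve, reject) => {
      this.pending.push({ task, resolve, reject });
      this.next();
    });
  }

  // Start the next pending task if a worker is idle.
  next() {
    if (this.free === 0 || this.pending.length === 0) return;
    const { task, resolve, reject } = this.pending.shift();
    this.free -= 1;
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        this.free += 1; // release the worker
        this.next();    // and pick up the next waiting task
      });
  }
}
```

With a single worker, tasks run strictly one after another, and each caller still gets its own result back through the promise returned by add().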
Example of request:
POST to "http://server_address:port/task"
Content-Type: application/x-www-form-urlencoded
If AUTH_KEY in config.json is not null, add the header
Authorization: HERE_AUTH_KEY
The form must contain a field named 'script'.
Example of request
!!!WARNING!!!
The script field must be a string.
fetch("http://server_address:port/task", {
"method": "POST",
"headers": {
"content-type": "application/x-www-form-urlencoded",
"authorization": "HERE_AUTH_KEY"
},
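// Note: fetch() does not serialize a plain object; encode these fields as a form-urlencoded string before sending.
"body": {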
"options": {
"proxy": {
"server": "PROTOCOL://ADDRESS:PORT",
"bypass": "",
"username": "USERNAME",
"password": "PASSWORD"
}
},
"script": "HERE_IS_SCRIPT"
}
});
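Because the request is form-urlencoded and script must be a string, the body has to be encoded before it is handed to fetch. The sketch below covers only the script field and the optional Authorization header. buildTaskRequest is a hypothetical helper, and how the nested options object should be encoded is not stated here, so it is left out.

```javascript
// Build the fetch() arguments for POST /task.
// buildTaskRequest is a hypothetical helper, not part of the task server.
function buildTaskRequest(baseUrl, script, authKey = null) {
  const headers = { 'content-type': 'application/x-www-form-urlencoded' };
  // Send Authorization only when the server has AUTH_KEY configured.
  if (authKey !== null) headers.authorization = authKey;
  // URLSearchParams percent-encodes the script so it arrives as one form field.
  const body = new URLSearchParams({ script }).toString();
  return { url: `${baseUrl}/task`, init: { method: 'POST', headers, body } };
}

const req = buildTaskRequest(
  'http://server_address:port',
  'resolve({ ok: true });',
  'HERE_AUTH_KEY'
);
// To send it: fetch(req.url, req.init).then((res) => res.text()).then(console.log);
```

Keeping the request construction separate from the network call makes it easy to inspect the encoded body before anything is sent.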
Example of script (playwright docs)
//Create a page inside the context
const page = await context.newPage();
//Prepare the keys for data storage
let data = {
hosts: [],
res: [],
ip: null
};
//Listener that intercepts all requests, logs their hostnames, and blocks everything except HTML documents.
await page.route('**', route => {
//Uses modules.URL (Node.js URL)
data.hosts.push(modules.URL.parse(route.request().url()).hostname);
if (route.request().resourceType() !== 'document')
{
route.abort('aborted');
}
else {
data.res.push(route.request().resourceType());
route.continue();
}
});
//Open the 2ip main page and wait for it to load
await page.goto('https://2ip.ru/');
//Extract the IP from the HTML (innerText() returns a promise, so it must be awaited)
data.ip = await (await page.$('div.ip')).innerText();
//End script execution and return data
//reject can be called instead if the script fails
resolve(data);
The data variable is created locally and passed through resolve. Everything in it is returned in the response.
Any other variables or constants created inside the script are left out of the response.
The task server also supports modules, a custom set of libraries that is available inside the script context.
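One way to picture how a script string gets context, modules, resolve and reject is to wrap it in an async function. This is a sketch under assumptions, not the server's actual code: runTaskScript is a hypothetical name, and the real server may isolate scripts differently.

```javascript
// AsyncFunction is not a global, so take it from an async function's prototype.
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

// Hypothetical wrapper: run a script string with the names the README describes.
function runTaskScript(script, context, modules) {
  return new Promise((resolve, reject) => {
    // The script body can use await and sees only these four names plus globals.
    const task = new AsyncFunction('context', 'modules', 'resolve', 'reject', script);
    // An uncaught error inside the script rejects the outer promise.
    task(context, modules, resolve, reject).catch(reject);
  });
}

// Usage with a stand-in context and a stand-in modules object.
runTaskScript('resolve({ sum: modules.base + 1 });', {}, { base: 41 })
  .then(console.log); // prints { sum: 42 }
```

Only the value passed to resolve leaves the script, which matches the rule that other local variables do not appear in the response.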
config.json
Proxy
In the config, the proxy property can be null, an object, or per-context (default: per-context); follow these docs.
Example of proxy object
{
"server": "hostname:port",
"bypass": "",
"username": "usernameForProxy",
"password": "passwordForProxy"
}
Proxy per-context configuration docs
To set a GLOBAL proxy, use ENV.
If the proxy does not require authorization, the username and password fields can be omitted or set to null.
Env
PW_TASK_KEY - Key for Authorization
PW_TASK_PORT - Running port
PW_TASK_PROXY - Proxy hostname:port
PW_TASK_USERNAME - Proxy username
PW_TASK_PASSWORD - Proxy password
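As an illustration of how these variables could map onto the proxy object shown earlier, here is a small sketch. proxyFromEnv is a hypothetical helper, and the way the server actually reads its environment is an assumption.

```javascript
// Hypothetical helper: build a proxy object from the PW_TASK_* variables.
function proxyFromEnv(env = process.env) {
  // No PW_TASK_PROXY means no global proxy.
  if (!env.PW_TASK_PROXY) return null;
  const proxy = { server: env.PW_TASK_PROXY };
  // Credentials are optional; add them only when both are set.
  if (env.PW_TASK_USERNAME && env.PW_TASK_PASSWORD) {
    proxy.username = env.PW_TASK_USERNAME;
    proxy.password = env.PW_TASK_PASSWORD;
  }
  return proxy;
}

// Port and auth key read the same way (no defaults are assumed here).
const port = process.env.PW_TASK_PORT;
const authKey = process.env.PW_TASK_KEY || null;
```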
Additional
PHP library for generating simple task scripts (it covers the minimum requirements).
todo
- Wrap Node in a Docker container with xvfb
- Submit issues with ideas