Home
Osmosis is a utility for easily extracting data from HTML or XML documents.
These are all of the "commands" that are available for chaining in an Osmosis instance.
- click
- config
- contains
- data
- debug
- delay
- do
- doc
- dom
- done
- failure
- filter
- find
- follow
- get/post
- login
- match
- paginate
- parse
- set
- submit
- then
Click on nodes found by selector
Discard any nodes whose contents do not match string
Set HTTP options and configure Osmosis
Calls `callback` with the current data object
Empty the data object
Add or replace each `key` in the data object with a new `val`
Call `callback` when any debug messages are received
Delay starting next promise for `seconds` (float or int)
Call each Osmosis instance with the current context. This will always continue, even if an instance fails.
Reset the current context to the Document
Create a DOM object from the current context.
The `callback` will be called with 3 arguments (`window`, `data`, and `next`). The `next([context], [data])` function must be called at least once
Calls `callback` when parsing has completely finished
Call `callback` when any error messages are received
Discard any nodes that match `selector`
Discard any nodes that do not match `selector`
Find elements based on `selector` anywhere within the current document
Follow URLs found via `selector`. If `selector` isn't provided, `follow` will search the current element text or common URL attributes (href, src, etc).
.follow()
.follow('@href')
.follow('a')
.follow('a@href')
.follow('span.outlink')
.follow('input.cloneURL@value')
.follow('link[type="application/rss+xml"]@href')
Make an HTTP request
url - A string containing a URL, which can be relative to the current context.
data (optional) - An object containing GET query parameters or POST request data.
opts (optional) - An object containing HTTP request options.
Note: Query parameter values will be URL-encoded by needle, so make sure that your parameter values are not already URL-encoded.
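To illustrate the note above, here is a small sketch of what happens when a value is encoded twice. It uses only the standard `encodeURIComponent` function, not Osmosis or needle, and the `raw` value is made up for the example.

```javascript
// Hypothetical query value, chosen only for illustration.
const raw = 'coffee & tea';

// Encoding once gives the form a server expects to decode.
const once = encodeURIComponent(raw);

// Encoding an already-encoded value escapes the '%' signs again,
// so a single decode on the server side yields the encoded text, not the raw text.
const twice = encodeURIComponent(once);

console.log(once);  // coffee%20%26%20tea
console.log(twice); // coffee%2520%2526%2520tea
```

Passing the raw value and letting the HTTP layer encode it once avoids the second form.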
Call `callback` when any log messages are received
Submit a login form.
user - A string containing a username, email address, etc.
pass - A password string
success (optional) - A selector string determining if the login attempt succeeded
fail (optional) - A selector string determining if the login attempt failed
`login` finds the first form containing `input[type="password"]` and uses that input as the password field. It will use the preceding `<input>` element as the user field.
Discard any nodes whose contents do not match `RegExp`
Paginate the previous request `limit` times based on `selector`.
selector (String) - A selector string for either:
- an element with the next page URL in its inner text or in an attribute that commonly contains a URL (href, src, etc.)
- an element whose `name` and `value` attributes will respectively be added or replaced in the next page query.
selector (Object) - An object where each `key` is a query parameter name and each `value` is either a selector string or an increment amount (+1, -1, etc.).
limit (Number) - Total number of "next page" requests to make.
limit (String) - A selector string for an element containing the total number of requests to make.
.paginate('a.nextPage') // go to `a.nextPage` `@href`
.paginate('link[rel="next"]@href') // go to `link` `@href`
.paginate('input[name="page"]') // update `page` parameter of the next query

// adds 20 to the `startIndex` query parameter
// sets `page` query parameter to `a.nextPage` content
// stops after 15 requests are made
.paginate({ startIndex: +20, page: 'a.nextPage' }, 15)
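As a rough illustration of the object form, the sketch below applies numeric increments to the query parameters of a URL using the standard `URL` class. `applyIncrements` and the example URL are hypothetical names made up here. This is not how Osmosis implements `paginate`, and selector-string values are simply skipped.

```javascript
// Hypothetical helper: returns the next page URL after applying numeric
// increments to query parameters. Not part of Osmosis.
function applyIncrements(currentUrl, increments) {
  const url = new URL(currentUrl);
  for (const [name, amount] of Object.entries(increments)) {
    // Selector strings are out of scope for this sketch.
    if (typeof amount !== 'number') continue;
    // A missing parameter is treated as 0 (an assumption of this sketch).
    const current = Number(url.searchParams.get(name)) || 0;
    url.searchParams.set(name, String(current + amount));
  }
  return url.toString();
}

// Note that `+20` in JavaScript is simply the number 20.
const nextUrl = applyIncrements(
  'http://example.com/search?q=osmosis&startIndex=40',
  { startIndex: +20 }
);
console.log(nextUrl); // http://example.com/search?q=osmosis&startIndex=60
```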
Pause, resume or stop an osmosis instance.
Parse an HTML or XML string
string - A string or buffer containing the HTML/XML data
Set `name` to the value of `selector`
Set each `key` to the value of each `val` selector.
.set('title') // set 'title' to current element text
.set('title', 'a.title') // set 'title' to text of 'a.title'
.set({
    title: 'a.title',
    description: 'p.description',
    url: 'a.permalink @href',
    images: ['img @src'],
    comments: [
        osmosis
        .follow('a.comments')
        .find('div.comment')
        .set({
            'author': '.author',
            'content': 'p.content',
            'date': '.date'
        })
    ]
});
Submit a form
selector - A selector for the `<form>` element or `submit` button.
data (optional) - An object where each `key` and `value` represents a form input name and value
Calls `callback` with the context of the current element.
The `context` argument is the current context at that point in the command chain. If the previous command was `get`, `post`, `follow`, or `parse` then the context will be a Document. If the previous command was `find` then the current context will be one of the Elements that was found.
The `data` argument contains values set via `osmosis.set`. This object can be modified in any way.
The `next` argument is a function that will call the next command. It takes two arguments: context and data.
The `done` argument is a function to call when `then` will no longer call `next`. This is only required if `then` calls `next` asynchronously any number of times.
Note: If the callback accepts `done` as an argument, it must always call `done`, even if `next` was never called.
The callback will have these functions bound to its `this` value:
- this.request(method, url, [data], callback([err], context), [opts])
- this.log(msg)
- this.debug(msg)
- this.error(msg)
Example 1: find every `ul > li` and pass it to the next command
osmosis
...
.then(function(context, data, next) {
    var items = context.find('ul > li');
    items.forEach(function(item) {
        next(item, data);
    })
})
Example 2: set data.url to the current page URL
osmosis
...
.then(function(context, data, next) {
    data.url = context.doc().request.url;
    next(context, data);
})
Example 3: only continue if lastname != undefined
osmosis
...
.then(function(context, data, next) {
    if (data.lastname != undefined)
        next(context, data)
})
Example 4: using the done function
osmosis
...
.then(function(context, data, next, done) {
    if (db.connected == false) {
        this.error('database disconnected');
        done();
        return;
    }
    data.someArray.forEach(function(obj, index) {
        db.save(obj, function() {
            next(context, data);
            if (index == data.someArray.length-1)
                done();
        })
    })
})
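Example 4 calls `done` when the callback for the last array index runs, which assumes the save callbacks finish in array order and that `someArray` is not empty. If that cannot be guaranteed, one option is to count completed callbacks instead of comparing indexes. The sketch below is plain JavaScript and not part of Osmosis; `afterAll` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns a function that invokes `done` exactly once,
// after it has itself been called `count` times. With a count of 0, `done`
// runs immediately, so an empty array does not leave `done` uncalled.
function afterAll(count, done) {
  let remaining = count;
  if (remaining === 0) {
    done();
    return function () {};
  }
  return function () {
    remaining -= 1;
    if (remaining === 0) done();
  };
}

// Usage sketch: each call stands in for one save callback completing.
let finished = 0;
const saved = afterAll(3, function () { finished += 1; });
saved();
saved();
console.log(finished); // 0
saved();
console.log(finished); // 1
```

Inside `then`, `afterAll(data.someArray.length, done)` could be created before the loop, with the returned function called after `next(context, data)` in each save callback.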