Class: HTMLReader

Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the inner text of the tag. All other tags are removed, and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
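
The rules above can be illustrated with a small, self-contained sketch. This is not how HTMLReader is implemented (the class delegates to the string-strip-html library); it is a simplified regex-based approximation. The function name `extractSignificantText` and the "inner text followed by URL" output format for links are assumptions made for illustration only.

```typescript
// Simplified illustration of the documented rules; NOT the reader's real implementation.

// Tags whose contents are dropped entirely (taken from the description above).
const DROPPED_WITH_CONTENTS = ["head", "script", "style", "xml"];

// Hypothetical helper name, introduced for this sketch only.
function extractSignificantText(html: string): string {
  let text = html;

  // 1. Remove head/script/style/xml elements together with everything inside them.
  for (const tag of DROPPED_WITH_CONTENTS) {
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    text = text.replace(element, " ");
  }

  // 2. Keep the link target next to the link text (this output format is an assumption).
  text = text.replace(
    /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a\s*>/gi,
    (_match, href: string, inner: string) => `${inner} ${href}`,
  );

  // 3. Strip every remaining tag but keep its inner text.
  text = text.replace(/<[^>]+>/g, " ");

  // 4. Collapse whitespace; entities such as &amp; are deliberately left untouched.
  return text.replace(/\s+/g, " ").trim();
}

const sampleHtml =
  '<html><head><title>T</title></head><body><p>Fish &amp; chips</p>' +
  '<script>alert(1)</script><a href="https://example.com">menu</a></body></html>';

console.log(extractSignificantText(sampleHtml));
```

A regex approach like this breaks on malformed or nested markup, which is one reason to rely on a dedicated library for real documents.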

Implements

FileReader

Constructors

new HTMLReader()

new HTMLReader(): HTMLReader

Returns

HTMLReader

Methods

getOptions()

getOptions(): object

Returns the configuration options that this reader passes to the string-strip-html library.

Returns

object

An object of options for the underlying library

skipHtmlDecoding

skipHtmlDecoding: boolean = true

stripTogetherWithTheirContents

stripTogetherWithTheirContents: string[]

See

https://codsen.com/os/string-strip-html/examples

Source

packages/core/src/readers/HTMLReader.ts:48
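
As a rough sketch of the shape of the returned object, the snippet below models only the two documented fields. The type name `StripOptionsSketch` and the function `sketchGetOptions` are hypothetical, and the tag list is inferred from the class description rather than copied from the source; the underlying library accepts further options not shown here.

```typescript
// Hypothetical type mirroring only the two fields documented above.
interface StripOptionsSketch {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
}

// Stand-in for getOptions(); the tag list is an assumption based on the class description.
function sketchGetOptions(): StripOptionsSketch {
  return {
    skipHtmlDecoding: true, // leave entities such as &amp; as written
    stripTogetherWithTheirContents: ["head", "script", "style", "xml"],
  };
}

// Layering an override on top of the defaults, e.g. to also drop <nav> blocks.
const defaults = sketchGetOptions();
const customOptions: StripOptionsSketch = {
  ...defaults,
  stripTogetherWithTheirContents: [...defaults.stripTogetherWithTheirContents, "nav"],
};

console.log(customOptions);
```

Spreading the defaults first keeps any field that is not explicitly overridden, so a caller only has to state what differs.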

loadData()

loadData(file, fs): Promise<Document<Metadata>[]>

Public method for this reader. Required by the BaseReader interface.

Parameters

• file: string

Path/name of the file to be loaded.

• fs: GenericFileSystem = defaultFS

File system wrapper used to read the file content.

Returns

Promise<Document<Metadata>[]>

A Promise that eventually yields zero or one Document parsed from the HTML content of the specified file.

Implementation of

FileReader.loadData

Source

packages/core/src/readers/HTMLReader.ts:21
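
Because the file system is passed in as a parameter, the reader can be exercised without touching the disk. The sketch below shows that idea with stand-ins only: `ReadableFS`, `DocSketch`, `memoryFS`, and `loadDataSketch` are all hypothetical names, the real GenericFileSystem interface may require more than `readFile`, and the tag removal here is a crude placeholder for the library call.

```typescript
// Hypothetical, minimal slice of a file-system wrapper: only readFile is assumed.
interface ReadableFS {
  readFile(path: string): Promise<string>;
}

// Hypothetical stand-in for Document: the extracted text plus the originating path.
interface DocSketch {
  id: string;
  text: string;
}

// In-memory file system, useful for tests that should not read from disk.
function memoryFS(files: Record<string, string>): ReadableFS {
  return {
    async readFile(path: string): Promise<string> {
      const content = files[path];
      if (content === undefined) throw new Error(`No such file: ${path}`);
      return content;
    },
  };
}

// Stand-in for loadData(file, fs): read through the injected fs, strip tags, wrap in an array.
async function loadDataSketch(file: string, fs: ReadableFS): Promise<DocSketch[]> {
  const html = await fs.readFile(file);
  // Crude tag removal, standing in for the real parsing step.
  const text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  return [{ id: file, text }];
}

const fakeFS = memoryFS({ "page.html": "<h1>Hello</h1><p>world</p>" });
loadDataSketch("page.html", fakeFS).then((docs) => console.log(docs));
```

Injecting the file system this way keeps the reading step swappable, which is presumably why the method takes `fs` as an argument with a default.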

parseContent()

parseContent(html, options): Promise<string>

Wrapper for string-strip-html usage.

Parameters

• html: string

Raw HTML content to be parsed.

• options: any = {}

An object of options for the underlying library

Returns

Promise<string>

The HTML content, stripped of unwanted tags and attributes

See

getOptions

Source

packages/core/src/readers/HTMLReader.ts:38
