Overview
Mainstay's generative AI features can provide institution-specific responses based on content we scrape from your websites and documents. This tool lets you indicate which pages we should pull information from, and lets you test the generative AI by asking it questions.
How You Can Use Scraped Content
You can use scraped content during AI-assisted Live Chat or when creating or updating Understandings by using the Firefly / AI button (Knowledge Base Responses). However, scraping alone does not automatically generate responses, update existing Understandings, create new Understandings, or respond directly to Contacts. The bot will only use scraped information if it has been manually reviewed and added to an Understanding.
If you want the bot to generate responses based on scraped content automatically, you’ll need to enable Flash Responses. This feature allows the bot to pull from your scraped knowledge sources without requiring manual approval. Flash Responses is an experimental feature in beta testing. To enable this for your institution, please reach out to your Partner Success Manager to learn more and activate it.
Ask the AI
After selecting and scraping knowledge sources (see below), you can test your coverage by asking the AI a question. This tool uses the same AI settings/prompt as KB Response Generation and AI-Assisted Live Chat.
When the suggested response was crafted using your scraped Knowledge Sources, those will be indicated for reference:
Knowledge Sources
Adding Sources
To add a new knowledge source, click + New Source:
There are five types available:
- Specific Webpages: one or more URLs. Mainstay will scrape the content from these individual pages.
- Domain/Site Section: a full URL. Mainstay will scrape the content from this page and any other pages on the site whose URLs begin with it.
  - For example, if you input "https://example.com/a", we will scrape that page, as well as "https://example.com/a/b" and "https://example.com/a/b/c".
  - However, we would not scrape "https://example.com/x", "https://something.example.com/a", or even "https://example.com" by itself.
- PDF: a URL to a PDF document hosted online.
- Google Document: a URL to a Google Doc that is publicly accessible, or shared with an @mainstay.com email address.
- File Upload: a PDF, TXT, DOCX, CSV, MD, or HTML file from your computer.
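The "Domain/Site Section" matching described above behaves like a URL prefix check. Here is a minimal sketch in Python; the function name and normalization details are illustrative assumptions, not Mainstay's actual implementation:

```python
def in_site_section(section_url: str, candidate_url: str) -> bool:
    """Illustrative sketch of the prefix rule: a page is scraped only if
    its URL is the Domain/Site Section URL itself, or starts with it."""
    # Normalize trailing slashes so "https://example.com/a" and
    # "https://example.com/a/" behave the same way.
    section = section_url.rstrip("/")
    candidate = candidate_url.rstrip("/")
    return candidate == section or candidate.startswith(section + "/")

# The section page itself and pages beneath it match:
print(in_site_section("https://example.com/a", "https://example.com/a"))       # True
print(in_site_section("https://example.com/a", "https://example.com/a/b/c"))   # True
# Sibling paths, subdomains, and the bare domain do not:
print(in_site_section("https://example.com/a", "https://example.com/x"))            # False
print(in_site_section("https://example.com/a", "https://something.example.com/a"))  # False
print(in_site_section("https://example.com/a", "https://example.com"))              # False
```

This mirrors the examples above: only URLs nested under the section you entered are considered part of it.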
A new knowledge source will be scraped immediately. You can also trigger a re-scrape of all existing sources by clicking Scrape All.
Optionally select one or more Audiences to restrict which learners this source will be used for. For example, if your institution is using Flash Responses, when a learner asks a question that doesn't match to anything in the Knowledge Base, we'll use your Knowledge Sources to generate a response; if a Knowledge Source is tagged to an Audience that the learner isn't in, that source will be skipped.
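The Audience restriction described above amounts to a simple filter over your sources. A hedged sketch, assuming each source carries a (possibly empty) set of tagged Audience names; the data shapes here are hypothetical:

```python
def usable_sources(sources, learner_audiences):
    """Keep sources a given learner is allowed to see.

    Illustrative only: a source with no Audience tags applies to everyone;
    a tagged source is used only if the learner belongs to at least one
    of its Audiences.
    """
    usable = []
    for source in sources:
        tags = source.get("audiences", set())
        if not tags or tags & set(learner_audiences):
            usable.append(source)
    return usable

sources = [
    {"title": "Financial Aid FAQ", "audiences": set()},
    {"title": "Graduate Housing", "audiences": {"Graduate Students"}},
]
# An undergraduate would see only the untagged source:
print([s["title"] for s in usable_sources(sources, ["Undergraduates"])])
```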
Scraping Queue
If you add or rescrape multiple sources at once, they will go into a Queue:
The system will process these one at a time, oldest to newest:
When a source is done scraping, its Status will change to either "Success" or "Fail", its Last Scraped will change to a timestamp indicating when the scrape completed, and the header text may change to the webpage's title.
The page will automatically update as these statuses change, meaning you will see a pattern of statuses like this gradually moving upwards:
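The queue behavior above is plain first-in, first-out processing. A minimal sketch (the statuses mirror the ones shown on the page; the scrape function is a hypothetical stand-in):

```python
from collections import deque

def process_queue(queued_sources, scrape):
    """Process sources oldest-to-newest, recording a final status for each.

    `scrape` stands in for the real scraper: it returns True on success
    and False on failure.
    """
    queue = deque(queued_sources)  # oldest entries at the front
    statuses = {}
    while queue:
        source = queue.popleft()   # take the oldest queued item
        statuses[source] = "Started"
        statuses[source] = "Success" if scrape(source) else "Fail"
    return statuses

# Example: one page succeeds, one fails.
result = process_queue(
    ["https://example.com/a", "https://example.com/broken"],
    scrape=lambda url: "broken" not in url,
)
print(result)
```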
Filtering & Bulk Editing Sources
- You can search for sources by URL and/or title using the Search input.
- You can filter sources by status or failure message. (See "Scraping Errors" below for more details.)
- You can filter sources by the date of the last scrape attempt.
- You can filter sources by Audience. Only sources that have the selected Audience as one of their tagged Audiences will appear in the list.
When a search or filter is applied, you can Rescrape or Delete just the matching sources.
Individual Sources
Below these options is the list of all knowledge sources you have selected. Each includes:
- Title: the <title> element from the webpage or the name of the PDF file.
- URL or File: the full link or original file name of the webpage, Google Doc, hosted PDF, or file
- Note: if you selected "Domain/Site Section" above, it will become multiple knowledge sources, each representing a specific page we've scraped.
- Last Scraped: the date and time that Mainstay last scraped the webpage or document
- Status: Queued | Started | Success | Fail
- Audiences: the names of any Audiences that this source is restricted to
From the ... menu on each knowledge source, you can take the following actions:
- Edit: Update the Title and add an optional Description.
- Scrape (for online sources): Trigger a re-scrape of this knowledge source. This is helpful if that page or document was recently updated.
- Reupload (for files): Display a modal to upload a new version of the file. This allows you to replace the content of a Knowledge Source without losing all the places it's referenced.
- Delete: Remove this knowledge source and all content we've scraped from it.
Note: Only admin users are able to add, edit, and delete knowledge sources.
Exporting Sources
To export a CSV of all Knowledge Sources, click Download Sources in the top-right corner. The spreadsheet includes the following columns:
- ID: a unique identifier for this Knowledge Source. You can access any individual Knowledge Source at https://app.mainstay.com/knowledge-sources?id={id}
- Title: the title of the webpage or document; this can be edited
- Description: the description of the source; this can be edited
- URL: for websites, the full URL of the Knowledge Source
- File Name: for uploaded files, the full name of the original file
- Last Scraped: a timestamp of when the website was last scraped or the file was last uploaded
- Status: Success, Failed, Queued, Started, or Unknown
- Error: if the Status is Failed, a description of why it failed
- Audiences: if the Knowledge Source is tagged to one or more Audiences, these will be listed on separate lines in the following format:
  - {{ audience.ID [Name] }}
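If you want to work with the exported spreadsheet programmatically, it can be read with any CSV library. A sketch in Python; the exact header strings are assumptions based on the column list above, so check them against your actual export:

```python
import csv
import io

# A tiny stand-in for a downloaded export; real files come from
# the Download Sources button in the app.
export = io.StringIO(
    "ID,Title,Description,URL,File Name,Last Scraped,Status,Error,Audiences\n"
    "abc123,Admissions FAQ,,https://example.com/faq,,2024-01-01T12:00:00Z,Failed,403. Forbidden.,\n"
    "def456,Course Catalog,,https://example.com/catalog,,2024-01-01T12:05:00Z,Success,,\n"
)

# List every source that failed its last scrape, with the reason.
failed = [
    (row["Title"], row["Error"])
    for row in csv.DictReader(export)
    if row["Status"] == "Failed"
]
print(failed)
```

A report like this can help you triage failures in bulk rather than scanning the list page by page.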
Scraping Errors
Here are the possible Fail statuses you may encounter while scraping sources:
403. Forbidden.
The webpage is not accessible to our scraper. This may be because it's password-protected or intentionally blocking bots. If you control this website, investigate whether you have any bot restrictions in place.
403. This Google doc is not shared.
The Google doc must be publicly accessible in order for the scraper to view its contents. Update the sharing settings so "Anyone with the link can view".
404. Content not found.
The URL provided does not resolve to an accessible webpage. Check that the URL is valid by entering it into your browser directly. If you selected "Domain/Site Section", we get a list of top pages from Bing and attempt to scrape those, so if you're seeing this, it means Bing has indexed pages that no longer exist.
404. This file is currently not accessible.
The PDF URL provided does not resolve to an accessible file. Check that the URL is valid by entering it into your browser directly. An alternative solution is to copy the text content into a Google doc and use that instead.
408. Attempting to load page timed out.
The webpage took too long to load, so the scraper was not able to parse its contents. Check that the URL loads when entering it into your browser directly.
422. Foreign language webpage detected.
The webpage has an explicit lang attribute set to something other than en. If the webpage contents are actually in English, ask your site admins to update the lang setting.
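If you want to confirm a page's lang attribute yourself before adding it as a source, you can inspect its <html> tag. A sketch using Python's standard-library HTML parser; the parsing here is deliberately simple and only illustrative of the check:

```python
from html.parser import HTMLParser

class LangSniffer(HTMLParser):
    """Capture the lang attribute from the page's <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # Record the lang attribute from the first <html> tag we see.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")

def page_lang(html_text):
    sniffer = LangSniffer()
    sniffer.feed(html_text)
    return sniffer.lang

# A page declaring a non-English lang attribute would trigger the
# 422 error above, even if its visible text is English:
print(page_lang('<html lang="fr"><body>Hello</body></html>'))  # fr
```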
500. Scraper API ran into error.
500. Scraper API terminated connection.
500. Could not set up a secure connection.
These indicate an issue with the system we use for scraping webpages, often a "too many requests"-type problem. This is usually temporary, so if you try again later, the page should scrape successfully.