Overview
Mainstay's generative AI features can provide institution-specific responses based on content we scrape from your websites and documents. This tool lets you indicate which pages we should pull information from, and lets you test the generative AI by asking it questions.
How You Can Use Scraped Content
You can use scraped content during AI-assisted Live Chat or when creating or updating Understandings by using the Firefly / AI button (Knowledge Base Responses). However, scraping alone does not automatically generate responses, update existing Understandings, create new Understandings, or respond directly to Contacts. The bot will only use scraped information if it has been manually reviewed and added to an Understanding.
If you want the bot to generate responses based on scraped content automatically, you’ll need to enable Flash Responses. This feature allows the bot to pull from your scraped knowledge sources without requiring manual approval. Flash Responses is an experimental feature in beta testing. To enable this for your institution, please reach out to your Partner Success Manager to learn more and activate it.
Ask the AI
After selecting and scraping knowledge sources (see below), you can test your coverage by asking the AI a question. This tool uses the same AI settings/prompt as KB Response Generation and AI-Assisted Live Chat.
When the suggested response was crafted using your scraped Knowledge Sources, those will be indicated for reference:
Knowledge Sources
Adding Sources
To add a new knowledge source, click + New Source:
There are five types available:
- Specific Webpages: one or more URLs. Mainstay will scrape the content from these individual pages.
- Domain/Site Section: a full URL. Mainstay will scrape the content from this page and any other pages on the site whose URLs begin with it.
  - For example, if you input "https://example.com/a", we will scrape that page, as well as "https://example.com/a/b" and "https://example.com/a/b/c".
  - However, we would not scrape "https://example.com/x", "https://something.example.com/a", or even "https://example.com" by itself.
- PDF: a URL to a PDF document hosted online.
- Google Document: a URL to a Google Doc that is publicly accessible, or shared with an @mainstay.com email address.
- File Upload: a PDF, TXT, DOCX, CSV, MD, or HTML file from your computer.
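The "Domain/Site Section" matching described above behaves like a URL prefix check. Here is a minimal sketch in Python; the function name and normalization details are illustrative assumptions, not Mainstay's actual implementation:

```python
def in_site_section(section_url: str, candidate_url: str) -> bool:
    """Illustrative sketch of the prefix rule: a page is scraped only if
    its URL is the Domain/Site Section URL itself, or starts with it."""
    # Normalize trailing slashes so "https://example.com/a" and
    # "https://example.com/a/" behave the same way.
    section = section_url.rstrip("/")
    candidate = candidate_url.rstrip("/")
    return candidate == section or candidate.startswith(section + "/")

# The section page itself and pages beneath it match:
print(in_site_section("https://example.com/a", "https://example.com/a"))       # True
print(in_site_section("https://example.com/a", "https://example.com/a/b/c"))   # True
# Sibling paths, subdomains, and the bare domain do not:
print(in_site_section("https://example.com/a", "https://example.com/x"))            # False
print(in_site_section("https://example.com/a", "https://something.example.com/a"))  # False
print(in_site_section("https://example.com/a", "https://example.com"))              # False
```

This mirrors the examples above: only URLs nested under the section you entered are considered part of it.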
A new knowledge source will be scraped immediately. You can also trigger a re-scrape of all existing sources by clicking Scrape All.
Optionally select one or more Audiences to restrict which learners this source will be used for. For example, if your institution is using Flash Responses, when a learner asks a question that doesn't match to anything in the Knowledge Base, we'll use your Knowledge Sources to generate a response; if a Knowledge Source is tagged to an Audience that the learner isn't in, that source will be skipped.
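The Audience restriction described above amounts to a simple filter over your sources. A hedged sketch, assuming each source carries a (possibly empty) set of tagged Audience names; the data shapes here are hypothetical:

```python
def usable_sources(sources, learner_audiences):
    """Keep sources a given learner is allowed to see.

    Illustrative only: a source with no Audience tags applies to everyone;
    a tagged source is used only if the learner belongs to at least one
    of its Audiences.
    """
    usable = []
    for source in sources:
        tags = source.get("audiences", set())
        if not tags or tags & set(learner_audiences):
            usable.append(source)
    return usable

sources = [
    {"title": "Financial Aid FAQ", "audiences": set()},
    {"title": "Graduate Housing", "audiences": {"Graduate Students"}},
]
# An undergraduate would see only the untagged source:
print([s["title"] for s in usable_sources(sources, ["Undergraduates"])])
```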
Scraping Queue
If you add or rescrape multiple sources at once, they will go into a Queue:
The system will process these one at a time, oldest to newest:
When a source is done scraping, its Status will change to either "Success" or "Fail", its Last Scraped will change to a timestamp indicating when the scrape completed, and the header text may change to the webpage's title.
The page will automatically update as these statuses change, meaning you will see a pattern of statuses like this gradually moving upwards:
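The queue behavior above is plain first-in, first-out processing. A minimal sketch (the statuses mirror the ones shown on the page; the scrape function is a hypothetical stand-in):

```python
from collections import deque

def process_queue(queued_sources, scrape):
    """Process sources oldest-to-newest, recording a final status for each.

    `scrape` stands in for the real scraper: it returns True on success
    and False on failure.
    """
    queue = deque(queued_sources)  # oldest entries at the front
    statuses = {}
    while queue:
        source = queue.popleft()   # take the oldest queued item
        statuses[source] = "Started"
        statuses[source] = "Success" if scrape(source) else "Fail"
    return statuses

# Example: one page succeeds, one fails.
result = process_queue(
    ["https://example.com/a", "https://example.com/broken"],
    scrape=lambda url: "broken" not in url,
)
print(result)
```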
Filtering & Bulk Editing Sources
- You can search for sources by URL and/or title using the Search input.
- You can filter sources by status or failure message. (See "Scraping Errors" below for more details.)
- You can filter sources by the date of the last scrape attempt.
- You can filter sources by Audience. Only sources that have the selected Audience as one of their tagged Audiences will appear in the list.
When a search or filter is applied, you can Rescrape or Delete just the matching sources.
Individual Sources
Below these options is the list of all knowledge sources you have selected. Each includes:
- Title: the <title> element from the webpage or the name of the PDF file.
- URL or File: the full link or original file name of the webpage, Google Doc, hosted PDF, or file
- Note: if you selected "Domain/Site Section" above, it will become multiple knowledge sources, each representing a specific page we've scraped.
- Last Scraped: the date and time that Mainstay last scraped the webpage or document
- Status: Queued | Started | Success | Fail
- Audiences: the names of any Audiences that this source is restricted to
From the ... menu on each knowledge source, you can take the following actions:
- Edit: Update the Title and add an optional Description.
- Scrape (for online sources): Trigger a re-scrape of this knowledge source. This is helpful if that page or document was recently updated.
- Reupload (for files): Display a modal to upload a new version of the file. This allows you to replace the content of a Knowledge Source without losing all the places it's referenced.
- Delete: Remove this knowledge source and all content we've scraped from it.
Note: Only admin users are able to add, edit, and delete knowledge sources.
Exporting Sources
To export a CSV of all Knowledge Sources, click Download Sources in the top-right corner. The spreadsheet includes the following columns:
- ID: a unique identifier for this Knowledge Source. You can access any individual Knowledge Source at https://app.mainstay.com/knowledge-sources?id={id}
- Title: the title of the webpage or document; this can be edited
- Description: the description of the source; this can be edited
- URL: for websites, the full URL of the Knowledge Source
- File Name: for uploaded files, the full name of the original file
- Last Scraped: a timestamp of when the website was last scraped or the file was last uploaded
- Status: Success, Failed, Queued, Started, or Unknown
- Error: if the Status is Failed, a description of why it failed
- Audiences: if the Knowledge Source is tagged to one or more Audiences, these will be listed on separate lines in the following format:
  - {{ audience.ID [Name] }}
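If you want to work with the exported spreadsheet programmatically, it can be read with any CSV library. A sketch in Python; the exact header strings are assumptions based on the column list above, so check them against your actual export:

```python
import csv
import io

# A tiny stand-in for a downloaded export; real files come from
# the Download Sources button in the app.
export = io.StringIO(
    "ID,Title,Description,URL,File Name,Last Scraped,Status,Error,Audiences\n"
    "abc123,Admissions FAQ,,https://example.com/faq,,2024-01-01T12:00:00Z,Failed,403. Forbidden.,\n"
    "def456,Course Catalog,,https://example.com/catalog,,2024-01-01T12:05:00Z,Success,,\n"
)

# List every source that failed its last scrape, with the reason.
failed = [
    (row["Title"], row["Error"])
    for row in csv.DictReader(export)
    if row["Status"] == "Failed"
]
print(failed)
```

A report like this can help you triage failures in bulk rather than scanning the list page by page.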
Scraping Errors
Here are the possible Fail statuses you may encounter while scraping sources:
403. Forbidden.
The webpage is not accessible to our scraper. This may be because it's password-protected or intentionally blocking bots. If you control this website, investigate whether you have any bot restrictions in place.
403. This Google doc is not shared.
The Google doc must be publicly accessible in order for the scraper to view its contents. Update the sharing settings so "Anyone with the link can view".
404. Content not found.
The URL provided does not resolve to an accessible webpage. Check that the URL is valid by entering it into your browser directly. If you selected "Domain/Site Section", we get a list of top pages from Bing and attempt to scrape those, so if you're seeing this, it means Bing has indexed pages that no longer exist.
404. This file is currently not accessible.
The PDF URL provided does not resolve to an accessible file. Check that the URL is valid by entering it into your browser directly. An alternative solution is to copy the text content into a Google doc and use that instead.
408. Attempting to load page timed out.
The webpage took too long to load, so the scraper was not able to parse its contents. Check that the URL loads when entering it into your browser directly.
422. Foreign language webpage detected.
The webpage has an explicit lang attribute set to something other than en. If the webpage contents are actually in English, ask your site admins to update the lang setting.
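If you want to confirm a page's lang attribute yourself before adding it as a source, you can inspect its <html> tag. A sketch using Python's standard-library HTML parser; the parsing here is deliberately simple and only illustrative of the check:

```python
from html.parser import HTMLParser

class LangSniffer(HTMLParser):
    """Capture the lang attribute from the page's <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # Record the lang attribute from the first <html> tag we see.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")

def page_lang(html_text):
    sniffer = LangSniffer()
    sniffer.feed(html_text)
    return sniffer.lang

# A page declaring a non-English lang attribute would trigger the
# 422 error above, even if its visible text is English:
print(page_lang('<html lang="fr"><body>Hello</body></html>'))  # fr
```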
500. Scraper API ran into error.
500. Scraper API terminated connection.
500. Could not set up a secure connection.
These indicate an issue with the system we use for scraping webpages, often a "too many requests"-type problem. This is usually temporary, so if you try again later, the page should scrape successfully.