Migrating Your Content Management System (CMS) Assets With MongoDB and Node.js


Content platforms evolve as business strategies shift. At MongoDB, we embraced external publishing platforms such as Dev.to, Medium, and The Polyglot Developer to better engage developer communities, which required us to redistribute content while maintaining our existing CMS data in MongoDB.

To support our multi-platform publishing strategy, we created a system to publish content between our MongoDB CMS and external platforms. As a result, we needed to migrate the content we had in our CMS to its new home. The migration process included exporting the written content stored in MongoDB and downloading a copy of the media assets that were stored on third-party servers.

In this tutorial, we'll explore the export process to get the job done with as little friction as possible.

The requirements

If you plan to reproduce this project or follow along, you'll want to make sure you have everything available and properly configured. In particular, we want the following:

  • A MongoDB database, either locally hosted or on MongoDB Atlas
  • Node.js v22+

I am using Node.js v22.14.0, but earlier and later versions might work as well. You may have to refer to the API documentation to be sure. For MongoDB, our CMS was hosted on MongoDB Atlas, but most variants should work fine for the purpose of the script. Just make sure the MongoDB instance is properly configured to allow for connections from your script.
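
To confirm which version you have installed, you can check from your terminal:

node --version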

It is important to note that our CMS was highly customized, so it's unlikely that your setup will be the same. However, the concepts used here should put you on the right track.

Creating a Node.js project with the proper dependencies

To kick things off, we're going to need a project created with a few environment variables set. Pick a working directory with your terminal and execute the following:

npm init -y  
npm install mongodb axios dotenv --save  

The above commands will initialize a new Node.js project and download our three project dependencies. The MongoDB driver will allow us to query our database for our content, Axios will allow us to download our media assets through HTTP requests, and Dotenv will allow us to easily make use of environment variables within our project.
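
For reference, after these commands complete, the dependencies section of your package.json should look roughly like the following (your exact version numbers will differ):

"dependencies": {
    "axios": "^1.7.0",
    "dotenv": "^16.4.0",
    "mongodb": "^6.8.0"
}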

Next, execute the following commands:

touch main.js  
touch .env  

The touch command just creates files. If you’re on Windows, go ahead and create those files however it makes sense.

This is just a small script so all our code can go in the main.js file. Before we get there, open the .env file and add the following:

MONGODB_ATLAS_URI=  
MONGODB_DATABASE=  
MONGODB_COLLECTION=  
  
EXPORT_URLS=./export_list.txt  
OUTPUT_DIR=./output/  

For safety, if you plan to track this project in Git, make sure you add the .env file to your .gitignore file.
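
For example, a minimal .gitignore for this project might look like the following:

node_modules/
.env
output/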

Your CMS project probably won't be the same as mine, and that is fine. This configuration is just to give you an idea of how everything works together. Feel free to follow along with the MongoDB side of things, or experiment with your own connection information. From a data perspective, everything we do will be read-only.

In my situation, I had a list of URLs that needed to be migrated out of MongoDB. This was not the entire content portfolio. The export_list.txt file in this example contains a list of URL slugs, one slug per line. For example, it might look something like this:

/products/mongodb/tensorflow-mongodb-charts/  
/products/mongodb/trader-joes-llamaindex-vector-search/  

The output directory is where we plan to save our content, which is Markdown in our system, along with any media assets.

Let's add some boilerplate code to our main.js file before we get into the specifics:

const { MongoClient } = require("mongodb");  
const fs = require("node:fs/promises");  
const { createWriteStream } = require("fs");  
const axios = require("axios");  
  
require("dotenv").config();  
  
const MONGODB_URI = process.env.MONGODB_ATLAS_URI;  
const MONGODB_DATABASE = process.env.MONGODB_DATABASE;  
const MONGODB_COLLECTION = process.env.MONGODB_COLLECTION;  
const EXPORT_URLS = process.env.EXPORT_URLS;  
const OUTPUT_DIR = process.env.OUTPUT_DIR;  
  
const mongoClient = new MongoClient(MONGODB_URI);  
let database, collection;

(async () => {
    try {
        await mongoClient.connect();
        // db() and collection() are synchronous, so no await is needed
        database = mongoClient.db(MONGODB_DATABASE);
        collection = database.collection(MONGODB_COLLECTION);
        // Code here soon ...
    } catch (e) {
        console.error("ERROR: ", e.message);
    } finally {
        // Close the connection even if an error occurred
        await mongoClient.close();
    }
})();

In the above code, we've extracted our environment variables and established a connection to our database. What comes next will be two parts: downloading the Markdown, and downloading the media.
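
If you'd like to verify your connection details before going any further, a quick ping makes for a simple sanity check. This is optional and can temporarily stand in for the placeholder comment:

// Optional sanity check: ask the server to acknowledge us
await database.command({ ping: 1 });
console.log("Successfully connected to MongoDB!");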

Downloading the content from MongoDB

As previously mentioned, the CMS data we had stored in MongoDB included the raw Markdown of our tutorial content, but it also had various metadata fields such as author, tags, and URL slug.

In the main.js file, add the following function:

async function loadUrlFileData(removeTrailingSlash = false) {
    let data = await fs.readFile(EXPORT_URLS, { encoding: "utf8" });
    // Split on newlines and drop any blank lines from the file
    let slugs = data.split("\n").map(slug => slug.trim()).filter(slug => slug != "");
    if (removeTrailingSlash) {
        slugs = slugs.map(slug => slug.replace(/\/$/, ""));
    }
    return slugs;
}

Remember, in this example, the URLs we want to download were specified in a file. Your use case might differ, but it shouldn't be difficult to pivot.

With the file scenario, the loadUrlFileData function will take each line of the file and add it to an array. In this example, we can specify if we want to remove the trailing slash from any of our URL slugs.
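
For example, using the two slugs from the export_list.txt shown earlier, a call would produce the following:

let slugs = await loadUrlFileData(true);
// [
//     "/products/mongodb/tensorflow-mongodb-charts",
//     "/products/mongodb/trader-joes-llamaindex-vector-search"
// ]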

With a file loader in place, we can revisit our main asynchronous function:

(async () => {
    try {
        await mongoClient.connect();
        database = mongoClient.db(MONGODB_DATABASE);
        collection = database.collection(MONGODB_COLLECTION);
        let slugs = await loadUrlFileData(true);
        let cursor = collection.aggregate([
            {
                // Match only the documents whose slug appears in our list
                "$match": {
                    "calculated_slug": {
                        "$in": slugs
                    }
                }
            },
            {
                // Keep only the fields we care about, renaming where needed
                "$project": {
                    title: "$name",
                    description: 1,
                    slug: "$calculated_slug",
                    authors: 1,
                    tags: 1,
                    content: 1
                }
            },
        ]);
        let matchedContent = await cursor.toArray();
    } catch (e) {
        console.error("ERROR: ", e.message);
    } finally {
        await mongoClient.close();
    }
})();

Notice the revisions we've made in the above code.

First, we load our URL slugs into an array. Then, we make use of an aggregation pipeline in MongoDB to find content matches based on each of the slugs. The calculated_slug, name, description, authors, tags, and content are all fields within my data, so there's no need to feel like you missed a step. For all the matches in the first pipeline stage, we specify the fields that are valuable to us in the $project stage.

Depending on the scope of your results, you can use the toArray() function or iterate over the cursor. My data contained fewer than 200 content entries, so toArray() was fine.
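
To make the pipeline output concrete, each element of matchedContent will be shaped something like the following (field values invented for illustration):

{
    "_id": "...",
    "title": "TensorFlow and MongoDB Charts",
    "description": "A short summary of the tutorial ...",
    "slug": "/products/mongodb/tensorflow-mongodb-charts",
    "authors": [ "..." ],
    "tags": [ "..." ],
    "content": "# The raw Markdown lives here ..."
}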

At this point, we've queried for the data we want, but we haven't actually done anything with the results. We need to save them to files on our filesystem.

Let's create two more functions in our main.js file:

function saveMarkdownToFile(slug, markdown) {  
    let parts = slug.split("/").filter(part => part != "");  
    let entryName = parts[parts.length - 1];  
    return fs.mkdir(`${OUTPUT_DIR}${entryName}/assets`, { recursive: true })  
        .then(() => fs.writeFile(`${OUTPUT_DIR}${entryName}/${entryName}.md`, markdown));  
}  
  
function saveMetadataToFile(slug, data) {  
    let parts = slug.split("/").filter(part => part != "");  
    let entryName = parts[parts.length - 1];  
    return fs.mkdir(`${OUTPUT_DIR}${entryName}/assets`, { recursive: true })  
        .then(() => fs.writeFile(`${OUTPUT_DIR}${entryName}/meta.json`, data));  
}  

The two functions above, more or less, do the same thing. You provide the slug, which will be used for file naming purposes, and the data. In this case, the data is either going to be the raw Markdown or the metadata for the entry. We want to save both, but as two separate files.
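
Since both functions derive the entry name the same way, you could optionally factor that logic into a small helper. This refactor isn't required, just a tidier option:

// Optional helper: derive the entry name from the last non-empty slug segment
function slugToEntryName(slug) {
    let parts = slug.split("/").filter(part => part != "");
    return parts[parts.length - 1];
}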

Looking back at our main logic-driving asynchronous function, we can add the following:

// Other code above ...  
let matchedContent = await cursor.toArray();  
for (const match of matchedContent) {  
    await saveMarkdownToFile(match.slug, match.content);  
    delete match.content;  
    await saveMetadataToFile(match.slug, JSON.stringify(match, undefined, 4));  
}  

The code above loops through each CMS entry matched by our aggregation pipeline. For every entry, we save the Markdown and the metadata. We make sure to remove the content from the metadata prior to saving it because there's no need to duplicate our massive amounts of Markdown in that file as well.

At this point, we've exported our Markdown entries from MongoDB. The problem is that none of our media assets were exported. We want to export them in case they are ever removed from the original source.

Saving the media assets hosted on a third-party platform

In our old CMS, the media assets, such as images, were not stored separately within our database. The assets themselves existed on S3 and similar services, and the references to those assets were stored directly within the Markdown. This means we had to do some basic web scraping.

Within the main.js file, add the following function:

function extractAssetUrls(text) {
    // Non-greedy match so multiple URLs on one line aren't merged into a single match
    let assetUrlRegex = /(https?:\/\/.*?\.(?:png|jpg|jpeg|gif|webp|pdf))/gi;
    let assetUrls = text.match(assetUrlRegex) || [];
    assetUrls = assetUrls.map(asset => {
        // Some assets were proxied, with the real location in a `url` query parameter
        let urlObject = new URL(asset);
        let urlParams = new URLSearchParams(urlObject.search);
        if (urlParams.has("url")) {
            asset = urlParams.get("url");
        }
        return asset;
    });
    return assetUrls;
}

The above function takes text—in this case, our entire Markdown content—and looks for a predefined list of asset file extensions with a regular expression. Some of those assets had funky URL problems, so for every asset match, we further parse it with a URL object in JavaScript.

The assets discovered in the extractAssetUrls function get added to an array and returned. To be clear, only the URLs are added to this array. We still need to download them.
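
For example, given a line of Markdown with a proxied image (sample data for illustration), the function returns just the underlying URL:

let markdown = "![chart](https://images.example.com/resize?url=https://s3.example.com/chart.png&w=800)";
let assetUrls = extractAssetUrls(markdown);
// [ "https://s3.example.com/chart.png" ]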

Include the following function for downloading:

function downloadAsset(assetUrl, slug) {
    let urlParts = assetUrl.split("/");
    let slugParts = slug.split("/").filter(part => part != "");
    let fileName = urlParts[urlParts.length - 1];
    let filePath = `${OUTPUT_DIR}${slugParts[slugParts.length - 1]}/assets/`;
    return axios({
        url: assetUrl,
        method: "GET",
        responseType: "stream"
    }).then(response =>
        new Promise((resolve, reject) => {
            response.data
                .pipe(createWriteStream(filePath + fileName))
                .on("error", reject)
                .once("close", () => resolve(filePath + fileName));
        })
    ).catch(error => {
        if (error.response && error.response.status >= 400 && error.response.status <= 500) {
            console.warn(`Skipping asset due to HTTP error (status: ${error.response.status}): ${assetUrl}`);
        } else {
            // Surface network failures and anything else unexpected
            console.warn(`Skipping asset due to unexpected error: ${assetUrl}`);
        }
    });
}

The above downloadAsset function will take a URL and a slug, the slug being used just for naming purposes when saving. The asset is downloaded with an HTTP request and saved directly to the file system.

To make these functions work, let’s revisit our main asynchronous function:

// Previous code up here ...  
let matchedContent = await cursor.toArray();  
for (const match of matchedContent) {  
    await saveMarkdownToFile(match.slug, match.content);  
    let assetUrls = extractAssetUrls(match.content);  
    await Promise.all(assetUrls.map(asset => downloadAsset(asset, match.slug + "/")));  
    delete match.content;  
    await saveMetadataToFile(match.slug, JSON.stringify(match, undefined, 4));  
}  

Notice that before we save our metadata to a file, we are extracting the asset URLs from the Markdown content and then downloading them.
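
Because downloadAsset handles its own errors, the Promise.all won't reject just because one asset failed. If some entries reference a large number of assets, you may want to throttle the downloads rather than firing them all at once. Here is a minimal batching sketch, with a hypothetical batch size of five:

// Optional helper: download assets in small batches instead of all at once
async function downloadAssetsInBatches(assetUrls, slug, batchSize = 5) {
    for (let i = 0; i < assetUrls.length; i += batchSize) {
        let batch = assetUrls.slice(i, i + batchSize);
        await Promise.all(batch.map(asset => downloadAsset(asset, slug)));
    }
}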

For clarity, the Markdown and each asset will live in a separate directory within our output path. The separate directory in our scenario is the slug name.
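
Using one of the example slugs from earlier, the on-disk layout would look like this (asset file names will match whatever the Markdown referenced):

output/
    tensorflow-mongodb-charts/
        tensorflow-mongodb-charts.md
        meta.json
        assets/
            chart.png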

Seeing the migration in action

If you made it to this point and you either want to try to run the code or see all the code together, no problem. The main.js file should look like the following:

const { MongoClient } = require("mongodb");  
const fs = require("node:fs/promises");  
const { createWriteStream } = require("fs");  
const axios = require("axios");  
  
require("dotenv").config();  
  
const MONGODB_URI = process.env.MONGODB_ATLAS_URI;  
const MONGODB_DATABASE = process.env.MONGODB_DATABASE;  
const MONGODB_COLLECTION = process.env.MONGODB_COLLECTION;  
const EXPORT_URLS = process.env.EXPORT_URLS;  
const OUTPUT_DIR = process.env.OUTPUT_DIR;  
  
const mongoClient = new MongoClient(MONGODB_URI);  
let database, collection;
  
(async () => {
    try {
        await mongoClient.connect();
        // db() and collection() are synchronous, so no await is needed
        database = mongoClient.db(MONGODB_DATABASE);
        collection = database.collection(MONGODB_COLLECTION);
        let slugs = await loadUrlFileData(true);
        let cursor = collection.aggregate([
            {
                // Match only the documents whose slug appears in our list
                "$match": {
                    "calculated_slug": {
                        "$in": slugs
                    }
                }
            },
            {
                // Keep only the fields we care about, renaming where needed
                "$project": {
                    title: "$name",
                    description: 1,
                    slug: "$calculated_slug",
                    authors: 1,
                    tags: 1,
                    content: 1
                }
            },
        ]);
        let matchedContent = await cursor.toArray();
        for (const match of matchedContent) {
            await saveMarkdownToFile(match.slug, match.content);
            let assetUrls = extractAssetUrls(match.content);
            await Promise.all(assetUrls.map(asset => downloadAsset(asset, match.slug + "/")));
            delete match.content;
            await saveMetadataToFile(match.slug, JSON.stringify(match, undefined, 4));
        }
    } catch (e) {
        console.error("ERROR: ", e.message);
    } finally {
        // Close the connection even if an error occurred
        await mongoClient.close();
    }
})();
  
// Load the URLs to be used in the query from a file  
// Optionally remove the trailing slash from the URLs if necessary  
async function loadUrlFileData(removeTrailingSlash = false) {
    let data = await fs.readFile(EXPORT_URLS, { encoding: "utf8" });
    // Split on newlines and drop any blank lines from the file
    let slugs = data.split("\n").map(slug => slug.trim()).filter(slug => slug != "");
    if (removeTrailingSlash) {
        slugs = slugs.map(slug => slug.replace(/\/$/, ""));
    }
    return slugs;
}
  
// Save the Markdown content at the $OUTPUT/$SLUG/$SLUG.md path  
function saveMarkdownToFile(slug, markdown) {  
    let parts = slug.split("/").filter(part => part != "");  
    let entryName = parts[parts.length - 1];  
    return fs.mkdir(`${OUTPUT_DIR}${entryName}/assets`, { recursive: true })  
        .then(() => fs.writeFile(`${OUTPUT_DIR}${entryName}/${entryName}.md`, markdown));  
}  
  
// Save the metadata at the $OUTPUT/$SLUG/meta.json path  
function saveMetadataToFile(slug, data) {  
    let parts = slug.split("/").filter(part => part != "");  
    let entryName = parts[parts.length - 1];  
    return fs.mkdir(`${OUTPUT_DIR}${entryName}/assets`, { recursive: true })  
        .then(() => fs.writeFile(`${OUTPUT_DIR}${entryName}/meta.json`, data));  
}  
  
// Extract any regex match for an asset in the text and return it
// Note: some assets were oddly parameterized using URL query parameters
function extractAssetUrls(text) {
    // Non-greedy match so multiple URLs on one line aren't merged into a single match
    let assetUrlRegex = /(https?:\/\/.*?\.(?:png|jpg|jpeg|gif|webp|pdf))/gi;
    let assetUrls = text.match(assetUrlRegex) || [];
    assetUrls = assetUrls.map(asset => {
        // Some assets were proxied, with the real location in a `url` query parameter
        let urlObject = new URL(asset);
        let urlParams = new URLSearchParams(urlObject.search);
        if (urlParams.has("url")) {
            asset = urlParams.get("url");
        }
        return asset;
    });
    return assetUrls;
}
  
// Download any asset file and store it in the $OUTPUT/$SLUG/assets directory  
function downloadAsset(assetUrl, slug) {
    let urlParts = assetUrl.split("/");
    let slugParts = slug.split("/").filter(part => part != "");
    let fileName = urlParts[urlParts.length - 1];
    let filePath = `${OUTPUT_DIR}${slugParts[slugParts.length - 1]}/assets/`;
    return axios({
        url: assetUrl,
        method: "GET",
        responseType: "stream"
    }).then(response =>
        new Promise((resolve, reject) => {
            response.data
                .pipe(createWriteStream(filePath + fileName))
                .on("error", reject)
                .once("close", () => resolve(filePath + fileName));
        })
    ).catch(error => {
        if (error.response && error.response.status >= 400 && error.response.status <= 500) {
            console.warn(`Skipping asset due to HTTP error (status: ${error.response.status}): ${assetUrl}`);
        } else {
            // Surface network failures and anything else unexpected
            console.warn(`Skipping asset due to unexpected error: ${assetUrl}`);
        }
    });
}

Remember, your CMS probably doesn't look like the CMS we were using at MongoDB. Make changes where you need to, or feel free to pull bits and pieces from the code to cover your use case.

With the code in place, you can run the project with the following:

node main.js  

Depending on how many URLs you are exporting and how many media assets they reference, the script should only take a minute or so to run. Images that cannot be accessed or are not found will be skipped and reported.

Conclusion

You just saw how to pull data out of MongoDB and save it to a file. There are countless ways of doing this, but this code worked for our needs at MongoDB. To be clear, our need was to stop publishing on our internal platform and start publishing on external platforms instead. This is why we needed to download our Markdown data and each asset associated with it.

