November 21, 2022

GridFS - Large Document Storage in MongoDB

 The maximum BSON document size is 16 megabytes.

To store a file larger than 16 MB, we need to use GridFS.

GridFS is a MongoDB file system abstraction used for storing and retrieving large files, such as image, audio, or video files, that exceed the BSON document size limit of 16 MB.
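As a rough illustration of the store/retrieve round trip, here is a minimal Python sketch. It assumes the PyMongo driver's `gridfs.GridFS` interface, specifically `put()` and `get()`. The helper `round_trip` and the in-memory `FakeGridFS` class are hypothetical and exist only so the example runs without a MongoDB server; the connection string and database name in the comment are likewise assumptions.

```python
import io
import itertools


def round_trip(fs, data: bytes, filename: str) -> bytes:
    """Store `data` through a GridFS-like object, then read it back."""
    file_id = fs.put(data, filename=filename)
    return fs.get(file_id).read()


class FakeGridFS:
    """Hypothetical in-memory stand-in that mimics put()/get() of gridfs.GridFS."""

    def __init__(self):
        self._files = {}
        self._ids = itertools.count(1)

    def put(self, data, **kwargs):
        file_id = next(self._ids)
        self._files[file_id] = bytes(data)
        return file_id

    def get(self, file_id):
        # The real get() returns a file-like object; BytesIO plays that role here.
        return io.BytesIO(self._files[file_id])


# With a real deployment (assumed URI and database name), `fs` could be built as:
#   import gridfs
#   from pymongo import MongoClient
#   fs = gridfs.GridFS(MongoClient("mongodb://localhost:27017")["mydb"])
payload = b"x" * (17 * 1024 * 1024)  # 17 MiB, above the 16 MB document limit
restored = round_trip(FakeGridFS(), payload, "video.mov")
print(restored == payload)  # True
```

Passing the GridFS object in as a parameter keeps the helper independent of how the connection is created.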

Instead of storing a file in a single document, GridFS divides the file into parts.

GridFS takes a file and splits it into sections called chunks. By default, each chunk size is 255 KB (this is a configurable parameter).
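To make the chunking concrete, the following sketch splits a byte string into 255 KB pieces in plain Python. It only illustrates the arithmetic; the function name `split_into_chunks` is hypothetical, and a real driver performs this step internally.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # 261,120 bytes, the default chunk size


def split_into_chunks(data: bytes, chunk_size: int = DEFAULT_CHUNK_SIZE) -> list:
    """Return consecutive slices of `data`, each at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


data = bytes(1_000_000)  # 1,000,000 zero bytes standing in for a file
chunks = split_into_chunks(data)
print(len(chunks))       # 4
print(len(chunks[-1]))   # 216640 (the last chunk holds only the remainder)
```

Only the final chunk is shorter than the chunk size; every other chunk is exactly 261,120 bytes.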

GridFS creates two collections, the chunks collection and the files collection, and places them in a common bucket by prefixing each with the bucket name (the default name is fs): fs.chunks and fs.files. The bucket is created on the first write operation if it does not already exist.

The chunks are stored as documents in the chunks collection (fs.chunks), while the file's metadata is saved in the files collection (fs.files).
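The relationship between the two collections can be sketched with plain Python dictionaries. The field names (`length`, `chunkSize`, `uploadDate`, `filename`, `files_id`, `n`, `data`) follow the GridFS document layout; the functions `build_gridfs_documents` and `reassemble` are hypothetical, and the UUID-based id is a stand-in for the ObjectId a driver would normally generate.

```python
import datetime
import uuid


def build_gridfs_documents(data: bytes, filename: str, chunk_size: int = 255 * 1024):
    """Return (files_doc, chunk_docs) shaped like fs.files / fs.chunks entries."""
    file_id = uuid.uuid4().hex  # stand-in; drivers normally use an ObjectId
    files_doc = {
        "_id": file_id,
        "length": len(data),          # total file size in bytes
        "chunkSize": chunk_size,      # size of every chunk except possibly the last
        "uploadDate": datetime.datetime.now(datetime.timezone.utc),
        "filename": filename,
    }
    # Each chunk points back to its file via files_id and records its position n.
    # The chunk's own _id is omitted here; the database would assign one.
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(chunk_docs) -> bytes:
    """Join chunk payloads in sequence order, as a reader would."""
    return b"".join(c["data"] for c in sorted(chunk_docs, key=lambda c: c["n"]))


original = b"abc" * 200_000  # 600,000 bytes
files_doc, chunk_docs = build_gridfs_documents(original, "scan.bin")
print(files_doc["length"], len(chunk_docs))  # 600000 3
```

Because every chunk carries `files_id` and `n`, a file can be rebuilt from the chunks collection alone, in any retrieval order.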

 



Use Cases:

  • Content management systems
  • Healthcare - A patient health record repository that holds general information about the patient (name, address, insurance provider, etc.) along with the various types of medical records (office visits, blood tests and labs, medical procedures, etc.)
  • Movie/Audio streaming - Reads the movie/audio directory and serves all of the related .MOV/.wav files

Configuration and Migration Steps:

 1. Create indexes on the GridFS collections of the source cluster.

 2. Identify documents in GridFS collections which are greater than 16 MB

 3. Use MongoPush to copy all indexes from step 2
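Steps 1 and 2 might be sketched in Python as follows. This is a sketch under several assumptions: `db` is a PyMongo-style database object (supporting `db[name].create_index(...)` and `db[name].find(...)`), the bucket is the default `fs`, the two index definitions are the ones GridFS drivers conventionally create, and step 2 is read as "files whose `length` exceeds 16 MB". All function names are hypothetical. The MongoPush step is left out because its invocation is not described here.

```python
SIXTEEN_MB = 16 * 1024 * 1024  # 16,777,216 bytes


def gridfs_index_specs(bucket: str = "fs") -> dict:
    """Index definitions conventionally used on the two GridFS collections."""
    return {
        f"{bucket}.files": {"keys": [("filename", 1), ("uploadDate", 1)], "unique": False},
        f"{bucket}.chunks": {"keys": [("files_id", 1), ("n", 1)], "unique": True},
    }


def large_file_filter(threshold: int = SIXTEEN_MB) -> dict:
    """Query filter matching files-collection entries larger than `threshold` bytes."""
    return {"length": {"$gt": threshold}}


def apply_indexes(db, bucket: str = "fs") -> None:
    """Step 1: create the indexes on an assumed PyMongo-style database object."""
    for name, spec in gridfs_index_specs(bucket).items():
        db[name].create_index(spec["keys"], unique=spec["unique"])


def find_large_files(db, bucket: str = "fs", threshold: int = SIXTEEN_MB) -> list:
    """Step 2: list files-collection documents whose length exceeds the threshold."""
    return list(db[f"{bucket}.files"].find(large_file_filter(threshold)))


print(large_file_filter())  # {'length': {'$gt': 16777216}}
```

Keeping the index definitions and the filter as plain data makes them easy to review before running anything against the source cluster.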

 

Limitations/drawbacks:

Do not use GridFS if you need to update the content of the entire file atomically.

Performance is slower than reading from a file system or serving the file from a server.

 https://www.mongodb.com/docs/manual/core/gridfs/

 
