The MSMARCO dataset does not provide paragraph segmentations. We segment each document by keeping the labeled paragraphs intact and treating the remaining text around them as separate segments. For example, given the document text "11122222334444566" and the labeled paragraphs "22222" and "4444", the resulting segmentation is ["111", "22222", "33", "4444", "566"]. For the candidate paragraphs, retrieval is performed only over the labeled paragraphs, which are specified by the "candidate_chunk_ids" attribute of each document object.
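The segmentation rule can be illustrated with a minimal sketch. This is not the actual preprocessing code: the function name `segment_document` is hypothetical, and the sketch assumes that the labeled paragraphs appear verbatim in the document, in document order, and without overlapping. It also assumes that "candidate_chunk_ids" holds the positions of the labeled paragraphs within the segment list, which the description above does not state explicitly.

```python
from typing import List, Tuple


def segment_document(text: str, labeled: List[str]) -> Tuple[List[str], List[int]]:
    """Split `text` into segments, keeping each labeled paragraph whole.

    Hypothetical helper. Assumes the labeled paragraphs occur verbatim in
    `text`, in document order, and do not overlap.
    Returns the segments and the indices of the labeled segments.
    """
    segments: List[str] = []
    candidate_ids: List[int] = []  # assumed meaning of "candidate_chunk_ids"
    cursor = 0
    for paragraph in labeled:
        start = text.find(paragraph, cursor)
        if start == -1:
            raise ValueError(f"labeled paragraph not found: {paragraph!r}")
        # Any unlabeled text before this paragraph becomes its own segment.
        if start > cursor:
            segments.append(text[cursor:start])
        candidate_ids.append(len(segments))
        segments.append(paragraph)
        cursor = start + len(paragraph)
    # Any text after the last labeled paragraph is the final segment.
    if cursor < len(text):
        segments.append(text[cursor:])
    return segments, candidate_ids


segments, candidate_chunk_ids = segment_document(
    "11122222334444566", ["22222", "4444"]
)
print(segments)
print(candidate_chunk_ids)
```

Under these assumptions, the example reproduces the segmentation given above, and retrieval would only consider the segments whose indices are listed in `candidate_chunk_ids`.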