I am following the instructions found here to obtain the dataset.
For each compressed file, I use the given commands to extract the data. However, they all appear to contain the same data: 7zip keeps asking me whether I want to overwrite the exact same files for every compressed file I try to extract.
Are you sure you are running the command on different files? If so, I think the problem may be that these files are parts of a single split (multi-volume) zip archive. Can you show the command you are using to unzip, and which version of 7z you have?
I am using the commands found in the link I posted to the dataset instructions.
It appears, according to what I'm seeing on Reddit, that the instructions on Hugging Face are incorrect.
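If there is a series of files file.zip, file.z01, file.z02 ... you only need to point 7zip at the first file (.zip), and it will automatically reassemble the archive from the other files too. Running the extract command on each part separately just unpacks the same archive again, which is why 7zip keeps asking to overwrite the same files.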
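In case it helps, here is a rough Python sketch of that idea: it picks out only the .zip entry points in a folder and runs `7z x` once per archive, leaving the .z01, .z02 ... parts for 7zip to find on its own. The function names and the `extracted` output folder are made up for illustration, and it assumes the `7z` executable is on your PATH and that the parts sit next to their .zip file.

```python
import subprocess
from pathlib import Path


def first_volumes(paths):
    """Return only the .zip entry points, skipping .z01, .z02, ... parts."""
    return sorted(p for p in map(Path, paths) if p.suffix.lower() == ".zip")


def extract_all(directory, out_dir="extracted"):
    """Run `7z x` once per split archive, pointing at the .zip file only."""
    for first in first_volumes(Path(directory).iterdir()):
        # Hypothetical layout: one output sub-folder per archive.
        target = Path(out_dir) / first.stem
        # 7z expects the output path attached directly to -o (no space).
        subprocess.run(["7z", "x", str(first), f"-o{target}"], check=True)
```

Calling `extract_all("path/to/downloads")` should then extract each archive once instead of once per part.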