All zip files seem to contain the same files

I am following the instructions found here to obtain the dataset.

For each compressed file, I use the given commands to extract the data. However, they all appear to contain the same data: 7zip keeps asking whether I want to overwrite the exact same files for every compressed file I try to extract.

If I select Skip All, then the result of the extraction shows zero files extracted.

What is going on here?

Are you sure you are running the command on different files? If so, I think the problem may be that these files are parts of a sharded zip archive. Can you show the command you are using to unzip, and tell us which version of 7z you have?

I am using the commands found in the link I posted to the dataset instructions.

It appears, according to what I'm seeing on Reddit, that the instructions on Hugging Face are incorrect.

If there is a series of files such as file.zip, file.z01, file.z02, ..., you only need to point 7zip at the first file (.zip), and it will automatically reconstitute the archive from the other files too.

https://www.reddit.com/r/pcmasterrace/comments/6o2xu5/what_do_i_do_with_z01_zip_split_archive/
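In case it helps anyone scripting this, here is a minimal Python sketch of the rule above: group the parts of each split archive and build one `7z x` command per set, pointed only at the `.zip` file. The helper names (`group_split_archives`, `extraction_commands`) and the example file names are made up for illustration, and it assumes the parts follow the `name.zip`, `name.z01`, `name.z02` naming pattern described above.

```python
import re
from collections import defaultdict

# Matches the numbered parts of a split zip archive: name.z01, name.z02, ...
# (assumed naming pattern; it does not match the .zip file itself)
PART_RE = re.compile(r"^(?P<stem>.+)\.z(?P<num>\d{2,})$", re.IGNORECASE)


def group_split_archives(filenames):
    """Map each 'stem.zip' file to the sorted list of its .zNN parts."""
    parts = defaultdict(list)   # stem -> [(part number, file name), ...]
    entries = {}                # stem -> the .zip file name
    for name in filenames:
        match = PART_RE.match(name)
        if match:
            parts[match.group("stem")].append((int(match.group("num")), name))
        elif name.lower().endswith(".zip"):
            entries[name[:-4]] = name
    return {
        entry: [part for _, part in sorted(parts.get(stem, []))]
        for stem, entry in entries.items()
    }


def extraction_commands(filenames):
    """Build one 7z command per archive set, pointed only at the .zip file."""
    return [["7z", "x", entry] for entry in sorted(group_split_archives(filenames))]


if __name__ == "__main__":
    # Hypothetical directory listing, for illustration only.
    files = ["data.z02", "data.zip", "data.z01", "other.zip"]
    for cmd in extraction_commands(files):
        print(" ".join(cmd))
```

Each command could then be passed to `subprocess.run(cmd, check=True)`, assuming `7z` is on your PATH and all parts sit in the same directory. The `.z01`, `.z02`, ... files are never passed to 7zip directly, which is what avoids the repeated overwrite prompts.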

Yes, indeed. I corrected the Hugging Face instructions for IPD.
