Failed to download and load the300w_lp dataset through the current Google Drive URL #5525

Open
Inokinoki opened this issue Jul 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@Inokinoki
Collaborator

/!\ PLEASE INCLUDE THE FULL STACKTRACE AND CODE SNIPPET

Short description

Dataset the300w_lp cannot be loaded due to Google Drive changes.

Environment information

  • Operating System: macOS

  • Python version: 3.11.9

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets==4.9.5

  • tensorflow/tf-nightly version: tensorflow==2.15.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Yes

Reproduction instructions

import tensorflow_datasets as tfds
tfds.load("the300w_lp", with_info=True)

If you share a colab, make sure to update the permissions to share it.

Link to logs
If applicable, https://gist.github.com/Inokinoki/36ee1c47cf4ee2b0bef4754900189335

Expected behavior
Load the dataset correctly.

Additional context
I investigated the issue. It seems that Google Drive redirects to a warning page for files it has not scanned for viruses:

curl -L "https://drive.google.com/uc?export=download&id=0B7OEHD3T4eCkVGs0TkhUWFN6N1k"         
<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="Cnthv5s43ZEpklfe8-kwQA">.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel="icon" href="//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=0B7OEHD3T4eCkVGs0TkhUWFN6N1k">300W-LP.zip</a> (2.6G)</span> is too large for Google to scan for viruses. 
Would you still like to download this file?</p><form id="download-form" action="https://drive.usercontent.google.com/download" method="get"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/><input type="hidden" name="id" value="0B7OEHD3T4eCkVGs0TkhUWFN6N1k"><input type="hidden" name="export" value="download"><input type="hidden" name="confirm" value="t"><input type="hidden" name="uuid" value="4fcfdc71-ca23-4264-8c6a-1322c7b1c73e"></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>%

Requesting the form's target URL (https://drive.usercontent.google.com/download) with the same id and export parameters plus confirm=t resolves the issue.
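As a rough illustration of that workaround, here is a minimal sketch that extracts the download form from the warning page and rebuilds the confirmed URL. It uses only the Python standard library. The names `_DownloadFormParser` and `confirmed_download_url` are hypothetical and are not part of tensorflow-datasets. It also assumes the warning page keeps the `download-form` id and hidden inputs shown in the output above, which Google may change at any time.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class _DownloadFormParser(HTMLParser):
    """Collects the action URL and hidden inputs of the form with id="download-form"."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.params = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "download-form":
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("type") == "hidden":
            # Hidden inputs carry id, export, confirm and uuid on the warning page.
            self.params[attrs.get("name")] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def confirmed_download_url(warning_html):
    """Return the "Download anyway" URL, or None if no download form is found."""
    parser = _DownloadFormParser()
    parser.feed(warning_html)
    if parser.action is None:
        return None
    return parser.action + "?" + urlencode(parser.params)
```

A caller would fetch the original `uc?export=download&id=...` URL, check whether the response is HTML rather than the archive, and if so pass the body to `confirmed_download_url` and retry with the returned URL. Whether the fix belongs in the dataset's URL or in the download manager is a separate decision for the maintainers.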

@Inokinoki
Collaborator Author

It seems that some other datasets have a similar issue as well.

e.g., gov_report: https://drive.google.com/uc?export=download&id=1ik8uUVeIU-ky63vlnvxtfN2ZN-TUeov2
