
add rm_dirs_from_zip_paths fn #25

Open
wants to merge 1 commit into master

Conversation

@mpadge (Collaborator) commented Jan 30, 2022

@dhersz Feeds are sometimes nested within directories prior to zipping (and I don't think the specification actually explicitly forbids this). When that happens, this line also lists the directory as one of the "files":

files_in_gtfs <- zip::zip_list(path)$filename

That entry doesn't end in ".txt", and so triggers the warning on the subsequent line. These feeds are nevertheless processed okay, and there is nothing wrong with the code. This PR simply helps to identify such cases and avoid the warning. It's perhaps not the best way to identify them, so I'm happy to iterate, but it allows feeds to be nested at arbitrary depth within the zip archive; let me know if you think of a better way.

I don't have any live examples, but I have it on several older versions of the Madrid feed. It seems to have since been updated to the expected flat format, but it definitely used to come inside a "google-gtfs" directory.
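To illustrate the kind of listing I mean (a hypothetical example, not copied from an actual feed), such an archive would give something like:

files_in_gtfs
#> [1] "google-gtfs/"           "google-gtfs/agency.txt" "google-gtfs/stops.txt"

The first entry is the directory itself, and it's the one that doesn't end in ".txt".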

f <- strsplit(files_in_gtfs, "^(?=/)(?!//)|(?<!^)(?<!^/)/", perl = TRUE)
lens <- vapply(f, length, integer(1))
if (length(unique(lens)) > 1L) {
    files_in_gtfs <- files_in_gtfs[which(lens == median(lens))]
}
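Walking through the hypothetical listing above (a sketch only, not output from a real feed), the bare directory entry splits into a single component while each table splits into two, so the filter drops the directory:

lens
#> [1] 1 2 2
files_in_gtfs[which(lens == median(lens))]
#> [1] "google-gtfs/agency.txt" "google-gtfs/stops.txt"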
@dhersz (Collaborator)

Hi @mpadge, thanks for the PR. It looks fine to me, although I'm a bit worried about licensing, since the regex is copied from {fs}. I don't have much experience with licensing, so I'm not sure how best to proceed here. Should we just explicitly attribute this line of code to the {fs} authors?

Two other things caught my attention:

  • Perhaps the subsetting condition should be changed to which(lens == max(lens))? You mentioned that this solution can handle feeds nested at arbitrary depth, but it seems to me that it would fail if a feed were nested inside several directories, which could change the value of the median length. Using max() here makes sure we keep the "flattest" files.
  • I've also come across some weird "GTFS" files that were actually a zip containing several directories, each one holding a GTFS feed (think of a gtfs.zip that contained two directories, dir1 and dir2, each with its own set of GTFS tables). In such a case we could end up with tables of the same name inside the final GTFS object if, for example, both directories contained an agency table. I know this case is an exception, but perhaps we should handle it in this function? Something along the lines of: if two or more of the split paths have the same length, and this length is not the maximum (i.e. not the "flattest" files), we throw an error saying that two or more directories have been found inside the zip. A rough sketch of both ideas follows this list.
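For concreteness, here's a rough, untested sketch of how both suggestions might fit together, reusing the files_in_gtfs vector and the {fs}-derived regex from the PR (the error message wording is just a placeholder):

f <- strsplit(files_in_gtfs, "^(?=/)(?!//)|(?<!^)(?<!^/)/", perl = TRUE)
lens <- vapply(f, length, integer(1))

if (length(unique(lens)) > 1L) {
  # two or more entries sharing a length below the maximum suggests the
  # archive holds more than one directory, and so possibly more than one feed
  non_max_lens <- lens[lens < max(lens)]
  if (any(duplicated(non_max_lens))) {
    stop("two or more directories found inside the zip archive")
  }
  # otherwise keep only the deepest ("flattest") entries, i.e. the actual tables
  files_in_gtfs <- files_in_gtfs[which(lens == max(lens))]
}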

Happy to hear your thoughts, and thanks again for submitting this PR. If you agree with the changes I propose but don't have the time to make them, please let me know and I'll accept the PR and make them myself.

@dhersz (Collaborator) commented Dec 14, 2022

PS: I literally wrote this review on the date you sent the PR but didn't realize I had not "finished" the review... 🤦

Sorry about this.

(there was a huge "Pending" label right next to this comment, but I thought it meant there were pending changes, not pending submission lol)
