Add Wikipedia article citations to solr to find well-referenced books #9451
Labels
Fellowship Opportunity (Lead: @cdrini)
Module: Solr
Priority: 2
Type: Feature Request
Problem
Cool find: "Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" https://arxiv.org/abs/2007.07022
The data is freely available here: https://zenodo.org/records/3940692. It would be cool to do an intersection with this to see how many of these books are digitized, or to include it in our solr so we can sort by wikipedia citations and show "Wikipedia articles that reference this book"!
This would be a great way to find impactful books to add to our collection as well.
The fellow's task would effectively be converting the dataset to something like:
| Open Library Edition | Wikipedia Page |
|----------------------|----------------|
The big task is resolving the identifiers to an Open Library edition, if one exists.
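To make the identifier resolution concrete, here's a minimal sketch of one piece of it: normalizing citation ISBNs to a single form so they can be matched against Open Library edition records. The helper names are hypothetical, not existing Open Library code; it assumes ISBN-13 as the canonical join key since editions may carry either form.

```python
# Hypothetical helpers for normalizing ISBNs before matching against
# Open Library editions. Wikipedia citations may carry ISBN-10 or
# ISBN-13, hyphenated or not; normalizing everything to a bare ISBN-13
# gives a single key to join on.

def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a bare ISBN-10 to its ISBN-13 equivalent."""
    core = "978" + isbn10[:9]
    # ISBN-13 check digit: digits weighted alternately 1 and 3.
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def normalize_isbn(raw: str):
    """Strip hyphens/spaces and upconvert ISBN-10 to ISBN-13; None if invalid."""
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():
        return isbn10_to_isbn13(s)
    return None

print(normalize_isbn("0-306-40615-2"))  # → 9780306406157
```

The same normalize-then-join idea would apply to the other identifier types (DOI, OCLC, LCCN), each with its own cleanup rules.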
And then we can use that during full/live indexing to have that in solr! Then we can render it on the books page, and have it power things like sorting 🙂
I couldn't help myself; I played around with the data a bit 😄 It's super well structured and easy to use! (Also, TIL about parquet files!): https://colab.research.google.com/drive/1JDY2RQHNJdfyoDb9D4LjowyOHS41woMW#scrollTo=ReImWwrqZEUm
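For a sense of what working with a shard looks like, here's a hypothetical sketch of filtering one down to citations that carry a book identifier. The column names (`page_title`, `ID_list`) and value format are assumptions about the shard schema, not verified against the real files; in practice the frame would come from `pd.read_parquet("citations_0.parquet")` rather than the tiny stand-in used here.

```python
# Hypothetical sketch: keep only the citations in a shard that carry an
# identifier (ISBN, DOI, OCLC, ...) we could match against Open Library.
# Column names and the ID_list format are assumed, not verified.
import pandas as pd

def citations_with_identifiers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose ID_list mentions an identifier type we can use."""
    has_id = df["ID_list"].fillna("").str.contains(r"ISBN|DOI|OCLC", regex=True)
    return df[has_id]

# Tiny stand-in for a parquet shard, just to show the shape of the operation.
shard = pd.DataFrame({
    "page_title": ["Alan Turing", "Banach space", "Tea"],
    "ID_list": ["{ISBN=0-306-40615-2}", None, "{DOI=10.1000/xyz}"],
})
subset = citations_with_identifiers(shard)
print(len(subset))  # → 2
```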
This is just one of the 200 "shards" of the dataset, and 11,158 of its citations have an identifier we can use to intersect with Open Library. If that rate scales, that'll be ~2.23 million things we can check against Open Library. The best strategy would likely be to:

1. Load the Open Library editions dump into a sqlite table, then extract, normalize, and index the identifiers we care about.
2. Load the ~2.23 million-row parquet subset into another table, and normalize/index the same identifiers.
3. Do a JOIN between the two tables and dump the results into a new sqlite file.

We can then use that file to load the data into solr.
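The join step above can be sketched with Python's stdlib sqlite3. Table and column names here are illustrative assumptions: "editions" stands in for the loaded Open Library dump, "citations" for the parquet subset, both keyed on a normalized ISBN-13.

```python
# Minimal sketch of the sqlite join strategy. Schema names are
# illustrative assumptions, not the actual dump format.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE editions  (ol_edition TEXT, isbn13 TEXT);
    CREATE TABLE citations (wiki_page  TEXT, isbn13 TEXT);
    CREATE INDEX idx_editions_isbn  ON editions(isbn13);
    CREATE INDEX idx_citations_isbn ON citations(isbn13);
""")
con.executemany("INSERT INTO editions VALUES (?, ?)", [
    ("OL7353617M", "9780306406157"),
    ("OL1234567M", "9780140328721"),
])
con.executemany("INSERT INTO citations VALUES (?, ?)", [
    ("Alan Turing", "9780306406157"),
    ("Tea",         "9999999999999"),  # no matching edition
])

# The JOIN yields the (Open Library edition, Wikipedia page) pairs
# that would be dumped to a new sqlite file and loaded into solr.
rows = con.execute("""
    SELECT e.ol_edition, c.wiki_page
    FROM citations c JOIN editions e USING (isbn13)
""").fetchall()
print(rows)  # → [('OL7353617M', 'Alan Turing')]
```

Indexing both identifier columns before the JOIN keeps the lookup fast even at the ~2.23 million-row scale.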
Proposal & Constraints
What is the proposed solution / implementation?
Is there a precedent of this approach succeeding elsewhere?
Which suggestions or requirements should be considered for how the feature needs to appear or be implemented?
Leads
Related files
Stakeholders
@RayBB
Instructions for Contributors