
Add Wikipedia article citations to solr to find well-referenced books #9451

Open

cdrini opened this issue Jun 19, 2024 · 1 comment
Labels
Fellowship Opportunity Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@cdrini
Collaborator

cdrini commented Jun 19, 2024

Problem

Cool find: "Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" https://arxiv.org/abs/2007.07022

A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles or Web contents. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and further equip an extra 261K citations with DOIs from Crossref.

The data is freely available here! https://zenodo.org/records/3940692 This would be a cool thing to do an intersection with to see how many of these books are digitized. Or perhaps to include in our solr so we can sort by wikipedia citations/show "Wikipedia articles that reference this book"!

This would be a great way to find impactful books to add to our collection as well.

The fellow task would effectively be converting the dataset to something like:

| Open Library Edition | Wikipedia Page |
|----------------------|----------------|

The big task is resolving each citation's identifiers to an Open Library edition, if one exists.
We can then use that mapping during full/live indexing to get the data into solr, render it on the books page, and have it power things like sorting 🙂
I couldn't help myself; I played around with the data a bit 😄 It's super well structured and easy to use! (Also, TIL about parquet files!): https://colab.research.google.com/drive/1JDY2RQHNJdfyoDb9D4LjowyOHS41woMW#scrollTo=ReImWwrqZEUm
This is just one of the 200 "shards" of the dataset, and 11,158 of its rows have an identifier we can use to intersect with Open Library. If that rate scales, that's ~2.23 million things we can check against Open Library. The best strategy would likely be:

1. Load the Open Library editions dump into a sqlite table, then extract/index/normalize the identifiers we care about.
2. Load the 2.23 million-row parquet subset into another table, and index/normalize the same identifiers.
3. JOIN the two tables and dump the results into a new sqlite file.

We can then use that file to load the data into solr.
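The JOIN step could be sketched like this with Python's stdlib sqlite3. The table layouts and sample rows here are made up for illustration; the real column names from the editions dump and the parquet shards will differ, and both sides would first be normalized to a common key (e.g. ISBN-13):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ol_editions (edition_key TEXT, isbn13 TEXT);
    CREATE TABLE wiki_citations (page_title TEXT, isbn13 TEXT);
    -- Index the join key on both sides so the JOIN isn't a full scan.
    CREATE INDEX idx_ol ON ol_editions(isbn13);
    CREATE INDEX idx_wiki ON wiki_citations(isbn13);
""")

# Hypothetical sample rows; in practice these come from the OL editions
# dump and the Zenodo parquet shards, with identifiers pre-normalized.
con.executemany("INSERT INTO ol_editions VALUES (?, ?)", [
    ("OL7353617M", "9780140328721"),
    ("OL24364628M", "9780306406157"),
])
con.executemany("INSERT INTO wiki_citations VALUES (?, ?)", [
    ("Fantastic Mr Fox", "9780140328721"),
    ("Some unmatched article", "9999999999999"),
])

# The edition ↔ article pairs we'd dump into the output sqlite file.
matches = con.execute("""
    SELECT e.edition_key, w.page_title
    FROM ol_editions e
    JOIN wiki_citations w USING (isbn13)
""").fetchall()
print(matches)  # [('OL7353617M', 'Fantastic Mr Fox')]
```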


Proposal & Constraints

What is the proposed solution / implementation?

Is there a precedent of this approach succeeding elsewhere?

Which suggestions or requirements should be considered for how the feature needs to appear or be implemented?

Leads

Related files

Stakeholders

@RayBB


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
@cdrini cdrini added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Fellowship Opportunity labels Jun 19, 2024
@hornc
Collaborator

hornc commented Jun 20, 2024

@cdrini what is the output you are looking for?

I have tools that match lists like this and output TSVs of matched identifiers by normalized ISBN. (No SQL required! ;P)

Normalized ISBN, Article title, OL edition id

Would be readily achievable.
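For reference, "normalized ISBN" here usually means stripping separators and converting ISBN-10s to ISBN-13 so both sides of the match use one key. A minimal sketch of that conversion (my own helper, not hornc's tool):

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens/spaces allowed) to a bare ISBN-13.

    Assumes the input is a valid ISBN-10; drops its check digit and
    recomputes the ISBN-13 check digit over the 978-prefixed core.
    """
    # Keep only the alphanumeric characters, take the 9 data digits.
    core = "978" + "".join(ch for ch in isbn10 if ch.isalnum())[:9]
    # ISBN-13 check digit: alternating 1/3 weights over the 12 digits.
    total = sum((1 if i % 2 == 0 else 3) * int(d) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
```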

The ISBN link is already pretty strong. Correctly templated ISBNs on Wikipedia link to Open Library in two clicks already.

Hardcoding a snapshot of the books used in Wikipedia might get stale quickly if it is baked into OL records or Solr.

I used the last dataset (from around 2018, referenced in the paper you found as:

Halfaker, A., Mansurov, B., Redi, M., & Taraborelli, D. (2018). Citations with identifiers in Wikipedia. Figshare. DOI: https://doi.org/10.6084/m9.figshare.1299540)

to generate the first ia wishlist, and the same data was used for the "turn all the links blue" project, so a slightly older version of this dataset is already incorporated into a lot of the records. We could perhaps do an update pass; I should read over the paper to understand what the update is. I believe the last dataset included a lot of non-en-wiki data, which was helpful and interesting.

edit: I've read the paper now. It took me quite a while to understand why they were doing any kind of learning-model classification on the templates. I think they are just trying to classify the now-deprecated generic citation templates, which are now supposed to be cite book, cite journal, etc. I'd be interested to see the proportion of those classes in the overall dataset. From the results in your colab @cdrini, 5% of the book citations use the generic citation template and 92% use cite book. I'd hope the number of generic citation templates overall is relatively low.

I also note that they do not appear to include any of the partial templates like {{doi|, {{isbn|, {{oclc|, {{pmid|, {{lccn| and many others, which point to strong identifiers without seemingly counting as 'citations' for the purposes of this research. (I quickly checked the referenced tool https://github.com/dissemin/wikiciteparser and it does appear to ignore all of those id templates.) This makes me believe there are many more strong, linkable identifiers in the raw data than they have extracted.
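A quick, hypothetical sketch of what scanning raw wikitext for those bare id templates could look like (the regex and template list are my assumptions, not the paper's tooling, and real wikitext has edge cases this ignores):

```python
import re

# Match bare identifier templates like {{ISBN|...}} or {{doi|...}},
# capturing the template name and the first parameter value.
ID_TEMPLATE = re.compile(
    r"\{\{\s*(isbn|doi|oclc|pmid|lccn)\s*\|\s*([^}|]+)",
    re.IGNORECASE,
)

wikitext = "See {{ISBN|978-0-306-40615-7}} and {{doi|10.1000/182}} for details."
found = ID_TEMPLATE.findall(wikitext)
print(found)  # [('ISBN', '978-0-306-40615-7'), ('doi', '10.1000/182')]
```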
