The Wayback Machine - https://web.archive.org/web/20260201005109/https://github.com/internetarchive/openlibrary/issues/6417

Fix all occurrences of html entities in author names, work/edition titles/subtitles #6417

@cdrini

Description

There are many authors, like https://openlibrary.org/authors/OL8115327A/Robert_Wi_347_niewski?v=3, whose names contain HTML entities (e.g. &#347;) instead of the correct Unicode character (e.g. "ś"). These should be fixed.
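For illustration, the fix for a single string could look like the sketch below. It assumes Python's standard-library `html.unescape` is an acceptable decoder for this cleanup; the helper name `fix_entities` is hypothetical and is not taken from the Open Library codebase.

```python
import html


def fix_entities(text: str) -> str:
    """Decode HTML character references (e.g. &#347;) into Unicode characters."""
    return html.unescape(text)


# The numeric reference &#347; is code point 347, i.e. "ś".
print(fix_entities("Robert Wi&#347;niewski"))  # prints: Robert Wiśniewski
```

One thing to check before a bulk run: `html.unescape` follows the HTML5 parsing rules, so it also decodes some legacy named references that lack a trailing semicolon. Reviewing a sample of the changed values first is probably worthwhile.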

I'm implementing this myself; I want to create a small schema for running simple "map" jobs like this one.

https://colab.research.google.com/drive/17futuO_fn2XNzL6jno8Nig2ixsSbtnkW#scrollTo=JoKRKozpctjz

Related to #6406

  • Author names
  • Work titles/subtitles
  • Edition titles/subtitles
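A rough sketch of how one "map" job could cover the three record kinds in the checklist above. Everything here is an assumption made for illustration: the record shape (a dict with `type: {"key": ...}`), the field names, the `FIELDS_BY_TYPE` table, the function names, and the sample work key are hypothetical and do not describe the schema actually used for this issue.

```python
import html
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical table: which text fields to clean for each record type.
FIELDS_BY_TYPE = {
    "/type/author": ["name"],
    "/type/work": ["title", "subtitle"],
    "/type/edition": ["title", "subtitle"],
}


def unescape_record(record: dict) -> Optional[dict]:
    """Return a fixed copy of the record, or None if nothing changed."""
    type_key = record.get("type", {}).get("key")
    updated = dict(record)
    changed = False
    for field in FIELDS_BY_TYPE.get(type_key, []):
        value = record.get(field)
        if isinstance(value, str):
            fixed = html.unescape(value)
            if fixed != value:
                updated[field] = fixed
                changed = True
    return updated if changed else None


def run_map_job(
    records: Iterable[dict], mapper: Callable[[dict], Optional[dict]]
) -> Iterator[dict]:
    """Apply the mapper to each record and yield only the changed ones."""
    for record in records:
        result = mapper(record)
        if result is not None:
            yield result


# Illustrative sample data, not real dump contents.
sample = [
    {"key": "/authors/OL8115327A", "type": {"key": "/type/author"},
     "name": "Robert Wi&#347;niewski"},
    {"key": "/works/OL1W", "type": {"key": "/type/work"},
     "title": "Clean title"},
]

for changed in run_map_job(sample, unescape_record):
    print(changed["key"], changed["name"])
```

Returning `None` for untouched records keeps the job output limited to the records that would actually need a save, which makes a dry run easy to review.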

Stakeholders

@tfmorris @jimman2003

Metadata

Assignees

No one assigned

    Labels

    • 1-off tasks
    • Affects: Data - Issues that affect book/author metadata or user/account data. [managed]
    • Data Cleanup
    • Lead: @cdrini - Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed]
    • Priority: 3 - Issues that we can consider at our leisure. [managed]
    • Theme: Bots - Issues relating to Bots & data cleanup
