Adding support for other TextRank flavours, including PositionRank and Biased TextRank #78
Comments
Wow, thank you kindly @louisguitton for the fantastic overview and comparisons of the relative features and needs!!
In general, yes, adding more flavours would be super helpful for PyTextRank -- and @shyamcody had also suggested Biased TextRank recently :) https://twitter.com/shyambhumukher1/status/1325260405472600064 FWIW, when I checked the web site for Rada's lab (the link to code in the paper), there was a 404 error. I should give her a heads-up about that, although they also have an impl on GitHub: https://github.com/ashkankzme/biased_textrank/tree/master/data
Well said! We could stand some refactoring here, and it can likely be managed without disrupting/deprecating too much of the existing use cases. Perhaps a
Yes, that would likely be the simplest to introduce. I'd been hoping to introduce personalised pagerank for a while, plus it's potentially a step toward some of the simpler entity linking approaches too.
Some other factors: First, I'm not sure how much PyTextRank will need to change based on adapting to

Also, please bear with me, I'll try to articulate something that I've struggled to say for a while ... Clearly the language embedding models, transformers, etc. -- the entire "Sesame Street" lineup since late 2017 -- have had enormous impact on natural language work. I like to leverage that, although I also recognize how PTR is part of the counterpoint argument for "not enormous models" and potentially leveraging domain knowledge. So it seems there's a potential trade-off there, in terms of the 2020 paper? In other words, I'd really like to pursue integration of domain knowledge.

On the one hand, there's a lemma graph used inherently in TextRank, and I feel it's important to distinguish that from a data graph that might be used for entity linking and other KG integration. IMO, Biased TextRank is steering away from that, and potentially mixing the two graphs in a way that won't be as useful long-term. For example, in my spaCy tutorials I've shown some examples of using

In an earlier TextRank impl that ShareThis.com used in production beginning in 2009, I'd shown how to enrich the lemma graph within TextRank to optimize results, i.e., linking in hypernyms and hyponyms from an external resource to add more edges into the lemma graph prior to running the centrality metrics. In other words, if you were parsing many documents about sports news, you could potentially use a KG representing sports news topics, their connections, their synonyms, hypernyms, hyponyms, etc., to enrich TextRank. A very nice side benefit is that entity linking almost becomes a side-effect of the optimization! In the ways that I've built this previously, it requires a compute-intensive random walk across the external resource (KG, thesaurus, etc.), but the results were dramatic.
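To make the enrichment idea concrete, here's a minimal sketch of linking hypernyms into a lemma graph before running the centrality metric. The `HYPERNYMS` dict is hypothetical stand-in data; a real pipeline would query WordNet or a domain KG instead.

```python
# Sketch: enrich a TextRank-style lemma graph with hypernym edges
# from an external resource, then run the centrality metric.
# HYPERNYMS is hypothetical toy data standing in for a KG/thesaurus lookup.
import networkx as nx

HYPERNYMS = {
    "striker": "footballer",
    "goalkeeper": "footballer",
    "footballer": "athlete",
}

# Two otherwise-disconnected phrase fragments.
G = nx.Graph()
G.add_edges_from([("striker", "scored"), ("goalkeeper", "saved")])

# Enrichment pass: add hypernym edges so related lemmas share neighbours.
for lemma in list(G):          # snapshot: iterate only the original lemmas
    hyper = HYPERNYMS.get(lemma)
    if hyper:
        G.add_edge(lemma, hyper)

ranks = nx.pagerank(G)
# "footballer" now bridges the two fragments and picks up centrality.
```

The bridging node gained from the external resource is exactly where entity linking starts to fall out as a side-effect: the enriched lemmas and the KG concept now sit in one connected component.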
FWIW, I did not pull that code over into PTR, because the integrations with WordNet were originally in Java, although @dvsrepo fixed that with the spaCy extension :) Working toward entity linking support in PTR was one of my main reasons for starting

Also, this may be useful in other pipeline integrations, such as

That said, use of KG with PTR pipelines could also help with restart probabilities and potentially provide a more generalized approach to embedding. For example, I really like the work in https://arxiv.org/abs/1709.02759 to describe a more generalized graph embedding formally, one that preserves semantics. I think that would be more in keeping with the "small, fast models as counterpoint to enormous transformers" opinionated aspect of PTR and so many of its use cases. Of course, I may be biased! Much to consider here, though I hope some of these details help.

In any case, I'd like to be agnostic toward any one approach, and provide multiple options in PTR to support a wider range of applications.
Same here, I need to make some time to check out the spaCy v3 announcement, docs, and what's new ...
I'm 200% on team domain knowledge. For my use cases, that trumps the rest. It feels like, for practitioners like me, Occam's razor is still winning.
Yes, thanks so much for this write-up. It helps me put some of my thoughts in the context of a more general approach towards KGs, as opposed to being influenced solely by my own use case.
Great. I've seen DerwenAI/kglab#33. So, to rephrase in practical terms: on the PTR side I see 3 things raised from the present issue in the
And I see 1 thing raised on the
So, I think starting with a PR for PositionRank would be a great first contribution for me. I'd be quite proud of that, also keeping in mind that personalised pagerank can build towards adding the "simple Biased TextRank" flavour, and goes in the same "big picture" direction as integrating kglab with pytextrank.
That looks great @louisguitton ! Adding a PR for PositionRank would add so much to PTR. I'll work on the

And FWIW, I may have a use case for applying personalised pagerank in

Happy New Year, and I wish you all the best in 2021!


Papers
COLING'2020 will be happening next week (Dec 8-13, 2020, online). And one of the papers accepted there caught my attention.
It's called Biased TextRank: Unsupervised Graph-Based Content Extraction (2020), written under the supervision of Rada Mihalcea.
That paper led me to read another paper: A Position-Biased PageRank Algorithm for Keyphrase Extraction (2017), which actually seems to work really well for news data, which is my use case.
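As I understand the PositionRank paper, each candidate word gets a restart weight from the sum of the inverses of its positions in the text, normalised, and that weight drives a personalised PageRank over the co-occurrence graph. Here's a rough sketch of that idea on a toy token list (the tokens and the window-of-2 co-occurrence graph are my own illustrative assumptions, not the paper's evaluation setup):

```python
# Sketch of PositionRank-style position weighting on a toy token list.
import networkx as nx

tokens = ["keyphrase", "extraction", "works", "keyphrase", "ranking"]

# Each word is weighted by the sum of the inverses of its (1-indexed)
# positions, then the weights are normalised to sum to 1.
weights = {}
for pos, tok in enumerate(tokens, start=1):
    weights[tok] = weights.get(tok, 0.0) + 1.0 / pos
total = sum(weights.values())
weights = {tok: w / total for tok, w in weights.items()}

# Co-occurrence graph over adjacent tokens (window of 2),
# then personalised PageRank using the position weights.
G = nx.Graph()
for i in range(len(tokens) - 1):
    G.add_edge(tokens[i], tokens[i + 1])
ranks = nx.pagerank(G, personalization=weights)
```

Words appearing early (and often) get a larger share of the restart probability, which is why this flavour favours keyphrases near the top of a news article.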
Summary of different TextRank flavours
- Sentences if we're looking for sentences.
- Then normalised.
- Example weighting: 1 if the keyword is in focus, 0 else.
- Example of “task focus” for an election-night news report = “Joe Biden” => will give you the relevant one-sided summary.
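The 1-if-in-focus, 0-otherwise weighting above can be sketched directly as a personalization dict. The toy graph and node names below are hypothetical, just to show the mechanics of biasing toward the election-night focus:

```python
# Minimal sketch of the "simple Biased TextRank" weighting:
# restart probability 1 for nodes matching the task focus, 0 otherwise.
# Toy graph and node names are hypothetical.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Joe Biden", "election"),
    ("election", "turnout"),
    ("turnout", "weather"),
    ("weather", "sports"),
])

focus = {"Joe Biden"}
bias = {node: (1.0 if node in focus else 0.0) for node in G}

ranks = nx.pagerank(G, personalization=bias)
# Nodes close to the focus now outrank distant ones, giving the
# one-sided view of the document that the summary needs.
```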
Discussion
Before moving into implementation details, I'd love to get your high level thoughts on a couple points.
Is adding more TextRank flavours to pytextrank of value?

The rows of the above table appear to me as "components", and adding more TextRank flavours might require a small refactoring of the current class. Is that too ambitious, given that pytextrank is used in a lot of places?
PositionRank and the one I called "simple Biased TextRank" rely on personalised pagerank, which is supported in networkx:
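For reference, networkx exposes this via the `personalization` keyword of `nx.pagerank`, a dict mapping nodes to restart weights. A quick sketch on a built-in sample graph (the choice of node 0 as the bias target is arbitrary):

```python
# networkx supports personalised pagerank out of the box:
# nx.pagerank(G, personalization={node: weight, ...})
import networkx as nx

G = nx.karate_club_graph()

uniform = nx.pagerank(G)
biased = nx.pagerank(
    G,
    personalization={n: (1.0 if n == 0 else 0.0) for n in G},
)
# Concentrating the restart probability on node 0 raises its rank
# relative to the uniform-teleport baseline.
```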