For a non-commercial private school project I'm creating a piece of software that searches for lyrics based on which song is currently playing on Spotify. I have to do this in C# (requirement), but I can use other languages if I so desire.

I've found a few sites that I can fetch the lyrics from. I have already succeeded in fetching the entire HTML, but after that I'm not sure what to do. I asked my teacher, who told me to use XML (which I also found complicated :p), so I've read quite a bit about it and searched for examples, but I haven't found anything that seems applicable to my case.

Time for some code.

Let's say I wanted to fetch the lyrics from musixmatch.com:

(Human-readable altered) HTML:

<span data-reactid="199">
    <p class="mxm-lyrics__content" data-reactid="200">First line of the lyrics!
        These words will never be ignored
        I don't want a battle
    </p>
    <!-- react-empty: 201 -->
    <div data-reactid="202">
        <div class="inline_video_ad_container_container" data-reactid="203">
            <div id="inline_video_ad_container" data-reactid="204">
                <div class="" style="line-height:0;" data-reactid="205">
                    <div id="div_gpt_ad_outofpage_musixmatch_desktop_lyrics" data-reactid="206">
                        <script type="text/javascript">
                            //Really nice google ad JS which I have removed;
                        </script>
                    </div>
                </div>
            </div>
        </div>
        <p class="mxm-lyrics__content" data-reactid="207">But I got a war
            More fancy lyrics
            And lines
            That I want to fetch
            And display
            Tralala
            lala
            Trouble!
        </p>
    </div>
</span>

Note that the first three lines of the lyrics are in the top <p>, with the rest in the bottom <p>. Also note that the two <p> tags have the same class. The full HTML source can be found here: view-source:https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here%E2%80%99s-a-War The snippet starts at around line 97.

So in this specific example there are the lyrics, and there is quite a bit of code that I don't need. So far I've tried fetching the html code with the following C#:

string source = "https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here’s-a-War";

    // The HtmlWeb class is a utility class to get the HTML over HTTP
    HtmlWeb htmlWeb = new HtmlWeb();

    // Creates an HtmlDocument object from an URL
    HtmlAgilityPack.HtmlDocument document = htmlWeb.Load(source);

    // Targets a specific node
    HtmlNode someNode = document.GetElementbyId("mxm - lyrics__content");

    if (someNode != null)
    {
        Console.WriteLine(someNode);
    } else
    {
        Console.WriteLine("Nope");
    }

    foreach (var node in document.DocumentNode.SelectNodes("//span/div[@id='site']/p[@class='mxm-lyrics__content']"))
    {
        // here is your text: node.InnerText    "//div[@class='sideInfoPlayer']/span[@class='wrap']"
        Console.WriteLine(node.InnerText);
    }

    Console.ReadKey();

Fetching the entire HTML works, but extracting the lyrics from it doesn't. Since the lyrics elements on this page have no id attribute, I can't just use GetElementbyId. Can somebody point me in the right direction? I want to support multiple sites, so I'll have to do this a few times for different sites.

  • Maybe it makes sense to use their API? It's free for 2K requests per day: developer.musixmatch.com/mmplans. (JFYI) Commented Nov 30, 2016 at 10:46
  • mxm-lyrics__content is the class of the element, not its Id, which is why GetElementbyId doesn't find it. You could use the technique in this question to get it by class. Commented Nov 30, 2016 at 10:49
  • @Artiom Well, it's indeed free, but I believe it doesn't include full lyrics, given the fancy cross at 'Full Lyrics Display'? Commented Nov 30, 2016 at 10:51
  • @stuartd I'll have a read. Haven't found that one yet :-) Commented Nov 30, 2016 at 10:51
  • @MagicLegend I've missed that. Commented Nov 30, 2016 at 10:52

1 Answer

One possible solution:

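// Requires: using System; using System.Linq; using HtmlAgilityPack;
// `source` is the page URL from the question.
var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode;

// Option 1: LINQ over every <p> whose class attribute contains the lyrics class.
var findclasses = documentNode.Descendants("p")
    .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);

// Option 2 (use instead of option 1): the same filter as an XPath query.
// Note that SelectNodes returns null when nothing matches.
// var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");

// Join the text of all matching paragraphs, one paragraph per line.
var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));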
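
To try the XPath variant without hitting the site, here is a minimal sketch that parses an inline HTML string instead of a live page. The LyricsExtractor class, its Extract method, and the trimmed-down markup are made up for illustration; LoadHtml, SelectNodes, and HtmlEntity.DeEntitize are HtmlAgilityPack calls. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class LyricsExtractor
{
    // Hypothetical helper (not from the original answer): returns the text of
    // every <p> whose class attribute contains "mxm-lyrics__content".
    public static string Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = document.DocumentNode
            .SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");
        if (nodes == null)
        {
            return string.Empty;
        }

        // DeEntitize turns entities such as &amp; back into plain characters.
        return string.Join(
            Environment.NewLine,
            nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
    }
}

public static class Program
{
    public static void Main()
    {
        // Trimmed-down stand-in for the page markup shown in the question.
        const string html =
            "<span><p class=\"mxm-lyrics__content\">First line of the lyrics!</p>" +
            "<div><div id=\"inline_video_ad_container\"></div>" +
            "<p class=\"mxm-lyrics__content\">But I got a war</p></div></span>";

        Console.WriteLine(LyricsExtractor.Extract(html));
    }
}
```

For multiple sites, the XPath string is probably the only part that has to change per site, so it could be passed in as a parameter instead of being hard-coded.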

9 Comments

Nice solution. I thought about Regex first, but this is far better.
Thank you! Works like a charm. Do you have some documentation on the magic you perform with the first findclasses var (what is a notation like that even called)? How do you build something like that?
@Sebi Regex is generally considered a poor fit for parsing HTML. Check this answer: stackoverflow.com/a/1732454/797249. It's epic.
@MagicLegend Search for LINQ ;)
@MagicLegend Using var is opinion-based, but in many cases it makes your code cleaner. Say you instantiate a Dictionary: Dictionary<object, List<MyHolyOwnClass>> dic = new Dictionary<object, List<MyHolyOwnClass>>(); versus var dic = new Dictionary<object, List<MyHolyOwnClass>>();
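
For anyone else wondering about the notation: the `d => ...` part is a lambda expression, and chaining `.Where(...)` and `.Select(...)` is LINQ method syntax. A small sketch without any HTML library, where the Element class and the TagsWithClass method are hypothetical stand-ins rather than HtmlAgilityPack types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqDemo
{
    // Hypothetical stand-in for an HTML element: a tag name and an optional class.
    public sealed class Element
    {
        public string Tag { get; set; }
        public string CssClass { get; set; } // null when the element has no class
    }

    public static List<string> TagsWithClass(IEnumerable<Element> elements, string cssClass)
    {
        // "e => ..." is a lambda expression: an inline function that takes e.
        // "?." is the null-conditional operator: the expression evaluates to
        // null instead of throwing when CssClass is null, hence the "== true".
        return elements
            .Where(e => e.CssClass?.Contains(cssClass) == true)
            .Select(e => e.Tag)
            .ToList();
    }

    public static void Main()
    {
        var elements = new List<Element>
        {
            new Element { Tag = "p", CssClass = "mxm-lyrics__content" },
            new Element { Tag = "div", CssClass = null },
            new Element { Tag = "span", CssClass = "mxm-lyrics__content extra" },
        };

        // Prints "p, span".
        Console.WriteLine(string.Join(", ", TagsWithClass(elements, "mxm-lyrics__content")));
    }
}
```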