EmilyMiller
Good men did nothing
- Joined
- Aug 13, 2022
- Posts
- 11,595
My panties are damp
It's very easy to work out the structure of the site from the HTML. Every story [in the new layout] has a small section of HTML with the following id: tabpanel-info
Nested within this will be the following elements:
HTML:
<div class="aT_H" title="Rating"><i class="icon icon-star aT_ck"></i><span class="aT_cl">4.38</span></div>
<div class="aT_H" title="Views"><i class="icon icon-diagram aT_ck"></i><span class="aT_cl">1.4k</span></div>
<div class="aT_H" title="Favorites"><i class="icon icon-heart aT_ck aT_rn"></i><span class="aT_cl">7</span></div>
<div class="aT_H" title="Comments"><i class="icon icon-comment aT_ck aT_ro"></i><a title="Link to comments" class="aT_cl aT_mG" href="/s/${story-identifier}/comments">15</a></div>
There are plenty of HTML parsing libraries that can extract these values.
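As a rough illustration of that extraction, here is a minimal sketch using only Python's standard-library html.parser. It keys on the title attributes shown in the snippet rather than on the class names, on my assumption that classes like aT_H are generated and may change. StatsParser and extract_stats are names I made up for the example, and the values are kept as raw strings (so "1.4k" stays "1.4k").

```python
from html.parser import HTMLParser

# Titles of the four stat blocks shown in the snippet above.
STAT_TITLES = {"Rating", "Views", "Favorites", "Comments"}


class StatsParser(HTMLParser):
    """Collect the text found inside each <div title="..."> stat block."""

    def __init__(self):
        super().__init__()
        self.stats = {}
        self._current = None  # title of the stat div we are inside, if any

    def handle_starttag(self, tag, attrs):
        title = dict(attrs).get("title")
        if tag == "div" and title in STAT_TITLES:
            self._current = title

    def handle_endtag(self, tag):
        if tag == "div":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.stats[self._current] = text


def extract_stats(html):
    """Return a dict such as {"Rating": "4.38", ...} from a story page."""
    parser = StatsParser()
    parser.feed(html)
    parser.close()
    return parser.stats


SAMPLE = (
    '<div class="aT_H" title="Rating"><i class="icon icon-star aT_ck"></i>'
    '<span class="aT_cl">4.38</span></div> '
    '<div class="aT_H" title="Comments"><i class="icon icon-comment aT_ck aT_ro"></i>'
    '<a title="Link to comments" class="aT_cl aT_mG" href="/s/x/comments">15</a></div>'
)
print(extract_stats(SAMPLE))  # {'Rating': '4.38', 'Comments': '15'}
```

A third-party parser with CSS selectors would make this shorter; the standard library is used here only so the sketch runs without installing anything.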
Simplistically, what you'd do is:
1. index the new page
2. extract links to all stories listed there
3. visit each story page, extract the data above and write it to a database.
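The three steps above could be sketched roughly like this, again with only the Python standard library. Several things here are my assumptions, not facts about the site: that story pages live at /s/<identifier> (guessed from the comments link in the snippet), the table layout, and the pause between requests. The names story_links, crawl, and the extract callable (any function that turns a story page's HTML into a dict of the four values) are hypothetical.

```python
import sqlite3
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def story_links(index_html, base_url):
    """Step 2: return de-duplicated absolute story URLs from the index page."""
    parser = LinkParser()
    parser.feed(index_html)
    parser.close()
    seen, links = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        parts = urlparse(url).path.strip("/").split("/")
        # Assumption: a story page looks like /s/<identifier> and nothing deeper.
        if len(parts) == 2 and parts[0] == "s" and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def fetch(url):
    """Download a page and return it as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(index_url, extract, conn, fetch=fetch, delay=2.0):
    """Steps 1 to 3: fetch the index, visit each story, store its stats."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stats ("
        "url TEXT PRIMARY KEY, rating TEXT, views TEXT, "
        "favorites TEXT, comments TEXT)"
    )
    links = story_links(fetch(index_url), index_url)
    for url in links:
        s = extract(fetch(url))
        conn.execute(
            "INSERT OR REPLACE INTO stats VALUES (?, ?, ?, ?, ?)",
            (url, s.get("Rating"), s.get("Views"),
             s.get("Favorites"), s.get("Comments")),
        )
        conn.commit()
        time.sleep(delay)  # pause between requests to keep the load light
    return len(links)
```

You would call it with something like crawl(index_url, extract, sqlite3.connect("stats.db")), where index_url is whatever the listing page's address turns out to be. Passing fetch in as a parameter makes it easy to swap in a cached or fake fetcher while testing.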
It's a couple of hours' work for a competent software engineer. The hardest part is not falling foul of any heuristic detection that flags spammy requests coming from a single IP or IP address range.
Oh. Shit. I put my geek hat on for a moment and look what happens.
Em